* chore(plugin-server): remove healthcheck topic references
Rather than doing an end-to-end produce/consume from this topic, we
instead rely on the instrumentation of KafkaJS to determine whether the
consumer is ready.
Note that this code is not being used since the change to just return an
HTTP 200 from the liveness endpoint:
https://github.com/PostHog/posthog/pull/11234
This is just a cleanup of dead code.
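A minimal sketch of the instrumentation-based readiness check (the event names match KafkaJS's `consumer.events` constants, but `ConsumerHealth` and its wiring are illustrative, not the actual plugin-server code; a plain `EventEmitter` stands in for the consumer so the idea is runnable):

```typescript
import { EventEmitter } from 'events'

// Track consumer readiness from KafkaJS instrumentation events instead of
// doing an end-to-end produce/consume against a healthcheck topic.
class ConsumerHealth {
    private ready = false

    constructor(consumer: EventEmitter) {
        // KafkaJS emits these as consumer.events.GROUP_JOIN / CRASH / DISCONNECT.
        consumer.on('consumer.group_join', () => (this.ready = true))
        consumer.on('consumer.crash', () => (this.ready = false))
        consumer.on('consumer.disconnect', () => (this.ready = false))
    }

    isHealthy(): boolean {
        return this.ready
    }
}
```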
* Remove Kafka healthcheck tests
* Include information on plugin/source in timeout messages
* Call __asyncGuard correctly
Previously `__asyncGuard` could be called without context. We now pass
`await`/`Promise.then` expressions rather than arbitrary promise objects
as arguments.
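A hypothetical sketch of a context-carrying guard (the name `asyncGuard`, its signature, and the timeout default are assumptions, not the actual plugin-server code): including the plugin/source in the timeout message makes timeouts attributable.

```typescript
// Race a promise against a timeout, naming the plugin/source in the error
// so "timed out" messages say *what* timed out.
async function asyncGuard<T>(promise: Promise<T>, context: string, timeoutMs = 1000): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined
    const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(
            () => reject(new Error(`Timed out after ${timeoutMs}ms: ${context}`)),
            timeoutMs
        )
    })
    try {
        return await Promise.race([promise, timeout])
    } finally {
        if (timer) clearTimeout(timer)
    }
}
```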
* Update transforms testing code
* Clean up `updateEventNamesAndProperties`
1. Add parallelization
2. Add error handling
3. Fix up weird handling of statsd timing
* Use EXCLUDED over interpolated value in team-manager.ts
* Convert teamManager to work via bulk inserts
* Refactor team-manager tests
* chore(plugin-server): Add test for $snapshot event never fetching person data
Follow-up to previous lazy person loading PR
* Verify person data loaded once
* fix(plugin-server): set right person_id if person created in a race
Previously (even after lazy-loading persons), `person_id` could be set to
`undefined` if, for a new user, two events (A and B) were processed in parallel:
1. A runs buffer step, preloading person as undefined
2. B runs event pipeline, creating person
3. A runs person update, getting that person exists in pg
4. A sets person_id column as `undefined` since person was loaded in
step (1)
We now reset the personContainer instead if we suspect person might have
been created in a race.
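The reset logic can be sketched roughly like this (the `LazyPersonContainer` name comes from later commits in this log, but the shape, the stand-in `fetchPerson` callback, and the `reset()` semantics here are assumptions):

```typescript
type Person = { id: number }

// A lazily-loaded, memoizing person container. If a concurrent writer may have
// created the person after our first (empty) load, reset() discards the cached
// result so the next get() re-queries instead of returning stale `null`.
class LazyPersonContainer {
    private loaded = false
    private person: Person | null = null

    constructor(private fetchPerson: () => Person | null) {}

    get(): Person | null {
        if (!this.loaded) {
            this.person = this.fetchPerson()
            this.loaded = true
        }
        return this.person
    }

    reset(): LazyPersonContainer {
        return new LazyPersonContainer(this.fetchPerson)
    }
}
```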
* Fix test
* fix issues with fetchPerson() and add tests
- fetchPerson() returned extra columns that were not needed
* Add LazyPersonContainer class
* Load person data lazily through the event pipeline
* Make webhooks and action matching lazy
* Update runAsyncHandlersStep
* Return own person properties in process-event.ts
* Remove snapshots that caused pain
* Handle serialization of LazyPersonContainer
* Merge: handle the case where only the LHS exists
`.get()` is served from cache in that case, letting us avoid a query.
* Serialize result args as well
* Make personContainer functional
* Resolve feedback
* Include kafka topic for setup
* Sample runEventPipeline/runBufferEventPipeline comparatively less frequently
Sampling is weighted by duration: we still want the long transactions,
but not the short ones.
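Duration-weighted sampling could look roughly like this (the 1 s threshold and the linear weighting are illustrative assumptions, not the actual configuration):

```typescript
// Decide whether to keep a transaction's trace, weighted by its duration:
// slow transactions are always kept, short ones proportionally less often.
function keepTransaction(durationMs: number, random: () => number = Math.random): boolean {
    if (durationMs >= 1000) return true // always keep slow transactions
    const probability = durationMs / 1000 // sample short ones proportionally
    return random() < probability
}
```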
* Trace enqueue plugin jobs
* Trace node-fetch
* Trace worker creation
* Various fixes
* Line up query tags properly
* Make fetch mocking work
* Resolve typing-related issues
* Experimental tracing support for plugin server
* Add tag to postgresTransaction
* Track event pipeline steps as separate spans
* Track kafka queueMessage?
* Tracing for processEvent, onEvent, onSnapshot
* plugin.runTask
* Move sentry code
* Make tracing rate configurable
* feat(plugin-server): Use Snappy compression codec for kafka production
This helps avoid 'message too large' type errors (see
https://github.com/PostHog/posthog/pull/10968) by compressing in-flight
messages.
I would have preferred to use zstd, but the libraries did not compile
cleanly on my machine.
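Wiring Snappy into KafkaJS is a one-time codec registration; this fragment uses the real `CompressionTypes`/`CompressionCodecs` exports, with `kafkajs-snappy` as an assumed codec package:

```typescript
import { CompressionTypes, CompressionCodecs } from 'kafkajs'

// kafkajs-snappy is an assumed dependency providing the codec implementation.
const SnappyCodec = require('kafkajs-snappy')

// Register the codec once at startup; producers can then opt in per send():
CompressionCodecs[CompressionTypes.Snappy] = SnappyCodec

// await producer.send({ topic, messages, compression: CompressionTypes.Snappy })
```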
* Update tests
* chore(plugin-server): include extra information on kafka producer errors
We're failing to send batches of messages to kafka on a semi-regular
basis due to message sizes. It's unclear why this is the case as we try
to limit each message batch size.
This PR adds information on these failed batches to sentry error
messages.
Example error: https://sentry.io/organizations/posthog2/issues/3291755686/?project=6423401&query=is%3Aunresolved+level%3Aerror
* refactor(plugin-server): Remove Buffer.from from kafka messages
This allows us to be much more accurate in estimating message sizes,
hopefully eliminating a class of errors.
* estimateMessageSize
* Track histogram with message sizes
* Flush immediately for too large messages
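With plain strings instead of Buffers, estimating sizes is straightforward byte math. This sketch assumes the estimate is key + value + header byte lengths, an approximation of the payload size rather than the exact Kafka wire format:

```typescript
interface Message {
    key?: string
    value: string
    headers?: Record<string, string>
}

// Estimate a message's payload size in bytes: key + value + headers.
// Buffer.byteLength counts UTF-8 bytes, not string length.
function estimateMessageSize(message: Message): number {
    let size = Buffer.byteLength(message.value)
    if (message.key) size += Buffer.byteLength(message.key)
    for (const [name, value] of Object.entries(message.headers ?? {})) {
        size += Buffer.byteLength(name) + Buffer.byteLength(value)
    }
    return size
}
```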
* perf: stop fetching the `error` column in a hot query
This is currently one of our slowest queries. The query itself is fine,
but the amount of data returned is causing issues.
The `error` column constitutes >80% of the data stored in the table; by
removing it from the query we should see significant speedups.
pganalyze page: https://app.pganalyze.com/databases/-675880137/queries/5301003665?t=7d
* chore(dev): use network mode host for docker-compose services
This removes the need to add kafka to /etc/hosts.
As far as I can tell this should be fine for people's local dev, except
that they will need to reset and re-migrate ClickHouse tables, as they
will be trying to pull from `kafka` instead of `localhost`.
* remove ports from redis
* Update a few more references
* fix(autocapture): ensure `$elements` passed to `onEvent`
Before calling `onEvent`, the plugin server, among other things, deletes
the `$elements` associated with `$autocapture` from `event.properties`.
This means that, for instance, the S3 plugin doesn't include this data
in its dump.
We could also include other data like `elements_chain`, which we also
store in ClickHouse, but I've gone for just including `elements` for
now, as `elements_chain` is derived from `elements` anyhow.
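The fix could be sketched like this (the types and the `prepareForOnEvent` name are illustrative, not the plugin server's actual conversion code): lift `$elements` out of properties before the cleanup would drop it, and expose it as a top-level field.

```typescript
type RawEvent = { event: string; properties: Record<string, any> }
type PostHogEvent = RawEvent & { elements?: any[] }

// Move $elements out of properties and onto the event itself, so export
// plugins (e.g. S3) still see autocapture element data after cleanup.
function prepareForOnEvent(event: RawEvent): PostHogEvent {
    const { $elements, ...properties } = event.properties
    const prepared: PostHogEvent = { ...event, properties }
    if ($elements !== undefined) {
        prepared.elements = $elements
    }
    return prepared
}
```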
* revert .env changes, I'll do that separately
* run prettier
* update to scaffold 1.3.0
* fix lint
* chore: update scaffold to 1.3.1
* update scaffold
* chore(plugin-server): Consume from buffer topic
* Refactor `posthog` extension for buffering
* Properly form `bufferEvent` and don't throw error
* Add E2E test
* Test buffer more end-to-end and properly
* Put buffer-enabled test in a separate file
* Update each-batch.test.ts
* Test that the event goes through the buffer topic
* Fix formatting
* Refactor out `spyOnKafka()`
* Ensure reliability batching-wise
* Send heartbeats every so often
* Make test less flaky
* Commit offsets if necessary before sleep too
* Update tests
* Use seek-based mechanism (with KafkaJS 2.0.2)
* Add comment to clarify seeking
* Update each-batch.test.ts
* Make minor improvements
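The seek-based mechanism can be sketched as follows (the message shape with a `processAt` timestamp and the consumer interface are assumptions; KafkaJS 2.x does expose `consumer.seek({ topic, partition, offset })`): when a buffered message is not yet due, seek back to its offset and stop, so the batch is redelivered later.

```typescript
interface BufferMessage {
    offset: string
    processAt: number // epoch ms at which this event becomes due
}

interface SeekableConsumer {
    seek(args: { topic: string; partition: number; offset: string }): void
}

// Process due messages in order; on the first not-yet-due message, rewind the
// partition to that offset and bail, so it is picked up again on a later poll.
function processBatch(
    consumer: SeekableConsumer,
    topic: string,
    partition: number,
    messages: BufferMessage[],
    now: number,
    handle: (m: BufferMessage) => void
): void {
    for (const message of messages) {
        if (message.processAt > now) {
            consumer.seek({ topic, partition, offset: message.offset })
            return
        }
        handle(message)
    }
}
```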
* Remove onAction
* Avoid fetching actions that don't deal with REST - 99% reduction
* Plural hooks
* Avoid hook fetching where not needed
* Remove dead code
* Update lazy VM test
* Rename a function
* Update README
* Explicit reload actions in tests
* Only reload actions which are relevant for plugin server
* Remove excessive logging
* Reload actions when hooks are updated
* update action matching tests
* Remove commented code
* Solve naming issues
* perf(plugin-server): load less data for each team
Our `team` model is pretty fat, and we were fetching columns from the app's
model that the plugin server never uses. Reducing the number of columns
will make lookups faster.
* perf(plugin-server): cache team data for longer
We now cache team data for 2 minutes instead of 30 seconds. The
trade-off is that the `anonymize_ips` setting will take longer to
propagate, but we already check that in the capture endpoint.
* WIP: Move person creation earlier
* WIP: move person updating, handle person property changing
* WIP: leverage person information
* Update `updatePersonDeprecated` signature
* Avoid (and test avoiding) unneeded lookups to decide whether creating a person is needed
Note there were two tricky interactions within handleIdentify, which
were again solved by indirect message passing.
* Solve TODO
* Normalize event before updatePersonIfTouchedByPlugins
* Avoid another lookup for person in updatePersonProperties
* Avoid lookup for newPerson in handleIdentifyOrAlias
* Add kludge comments
* Fix runBufferEventPipeline
* Rename upsertPersonsStep => processPersonsStep
* Update emitToBufferStep tests
* Update some event pipeline step tests
* Update prepareEventStep tests
* Test processPersonStep
* Add tests for updatePersonIfTouchedByPlugins step
* Update runner tests
* verify person version in event-pipeline-integration test
* Update process-event test suite
* Argument ordering for person state tests
* Update runner test snapshots
* Cast to UTC
* Fixup person-state tests
* Don't refetch persons needlessly on $identify
* Add missing version assertion
* Cast everything to UTC
* Remove version assertion
* Undo radical change to event pipeline - will re-add it later!
* Resolve comments