* Revert "chore(CI): revert "chore: syphar is deprecated. fangle python actions…"
This reverts commit 1f2ff4ac32.
* not on depot?
* was missing a syphar removal :/
* was missing a syphar removal :/
* Update query snapshots
* back to depot
---------
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Problem
We're not able to play back all of the recordings captured by the blob
ingester. The offset high-water-mark processing is new, and if it isn't
working correctly it could lead to us skipping data we should not skip.
Changes
Allow us to set an environment variable that switches to a no-op
high-water-mark processor (a rough sketch of the toggle pattern is below).
Sneaks in the removal of the no-longer-used
SESSION_RECORDING_BLOB_PROCESSING_TEAMS env var.
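A minimal sketch of the kind of toggle being described, in Python for
illustration only; the class names and the env var name below are
hypothetical, not the actual blob ingester code.

```python
import os


class OffsetHighWaterMarker:
    """Tracks the highest processed offset per session so already-handled data can be skipped."""

    def __init__(self) -> None:
        self.offsets: dict[str, int] = {}

    def update(self, session_id: str, offset: int) -> None:
        self.offsets[session_id] = max(self.offsets.get(session_id, -1), offset)

    def should_skip(self, session_id: str, offset: int) -> bool:
        return offset <= self.offsets.get(session_id, -1)


class NoopOffsetHighWaterMarker(OffsetHighWaterMarker):
    """Never skips anything, so every captured message gets played back."""

    def update(self, session_id: str, offset: int) -> None:
        pass

    def should_skip(self, session_id: str, offset: int) -> bool:
        return False


def build_high_water_marker() -> OffsetHighWaterMarker:
    # Hypothetical flag name; the real env var in the ingester may differ.
    if os.environ.get("SESSION_RECORDING_DISABLE_OFFSET_HIGH_WATER_MARK"):
        return NoopOffsetHighWaterMarker()
    return OffsetHighWaterMarker()
```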
* chore: print session messages as we go, save memory
Wanted to make sure we're not getting close to any memory limits while
generating, so stream to Kafka as soon as possible.
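A sketch of the produce-as-you-go idea, assuming the kafka-python client;
the generator, topic, and event shape here are made up for illustration.
The point is that each message is sent as soon as it is generated instead
of being accumulated in memory first.

```python
import json
import uuid
from datetime import datetime, timezone

from kafka import KafkaProducer  # assumption: kafka-python is available


def generate_recording_events(session_count: int):
    """Yield one minimal fake recording event at a time so memory use stays flat."""
    for _ in range(session_count):
        yield {
            "session_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "snapshot_data": {"type": 3, "data": {}},  # placeholder payload
        }


producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Send each event as soon as it is generated instead of building a big list first.
for event in generate_recording_events(session_count=1000):
    producer.send("session_recording_events", json.dumps(event).encode("utf-8"))
producer.flush()
```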
* wip
* wip
* chore(recordings): fix no team found on load test
For some reason it's not picking up the team from the token. I assume
there's some env var difference because it seems to work fine locally.
* wip
* wip
* fix ci
* increase session count
* no need for test data
* reduce session count
* chore(recordings): add command to generate session recording events
The intention of this is to be able to generate somewhat realistic
session recording events for testing the ingestion pipeline's
performance, so that we don't need to rely on pushing to production.
This command simply outputs to stdout a single JSON-encoded line per
event, which can then e.g. be piped into kafkacat or
kafka-console-producer, or written to a file to then be used by vegeta to
perform load testing on the capture endpoint.
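A minimal sketch of that output shape, for illustration only: the event
fields, script name, broker, and topic below are placeholders rather than
the command's actual payload.

```python
import json
import sys
import uuid
from datetime import datetime, timezone


def emit_session(session_id: str, event_count: int) -> None:
    """Write one JSON-encoded event per line to stdout so the output can be piped onwards."""
    for _ in range(event_count):
        event = {
            "event": "$snapshot",
            "properties": {
                "$session_id": session_id,
                # Placeholder for an rrweb-style snapshot payload.
                "$snapshot_data": {"type": 3, "data": {}},
            },
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        sys.stdout.write(json.dumps(event) + "\n")


if __name__ == "__main__":
    for _ in range(10):
        emit_session(str(uuid.uuid4()), event_count=100)
```

Piping it onwards would then look something like
`python generate_recordings.py | kafkacat -b localhost:9092 -t session_recording_events -P`
(script, broker, and topic names are placeholders).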
* Python 3.10
Performance gains go brrr
* Add missing SAML deps
* Add missing dep to dockerfile
* Update mypy to 0.981 for 3.10.7 compatibility
Needed this bug to be fixed: https://github.com/python/mypy/issues/13627
This also incidentally fixed the mypy bug in csv_exporter.py
* bump to 3.10.10
* chore(person-merges): add person merges ClickHouse table
To speed up queries we want to add a person merges table that we can
join on in order to "correct" event person_ids, which may be needed when
person merges have happened.
The specific difference between this implementation and the distinct_id
lookup is that the cardinality of the table should be :fingerscrossed:
much lower, since the majority of events already have the correct
person_id associated with them.
The distinct_id joining is particularly problematic in that the join key
on the right-hand side of a join needs to be loaded into memory as a
hash map so that ClickHouse can perform a hash join, and:
1. distinct_ids are arbitrary-length strings
2. every event needed to be joined via a distinct_id. With the
   person_id join we can use a left outer join, making the right-hand
   side much smaller if we simply omit person_ids we think are the
   "canonical" id (a rough sketch of the table and join shape is below).
* add some todos
* empty __init__
* wip
* Add Kafka and Materialized views
* add some comments
* exclude created_at field
* duplicate materialized view schema from kafka table
* fix default now
* whitespace
* remove EMPTY AS
* Use EMPTY AS
* Update schema snapshots
* update name extract
* fix table name escaping
* update snapshots
* fix escaping
* create kafka tables late
* Add posthog clickhouse migrations to github actions change list
* Add tombstone flag
* add missing comma
* update snapshots
* partition by oldest_event
* make kafka table names consistent with others
* Update posthog/models/person_overrides/sql.py
Co-authored-by: Xavier Vello <xavier@posthog.com>
* Align naming with existing variables and add a test
* add missing override changes
* Update test
* Update snapshots
* delete kafka and mv tables at end of tests
* fix date formatting issue
* assert results is list
* Update snapshots
* Update snapshots
* Add comment re. using EMPTY AS SELECT
* Add comment about where version will come from
* Update posthog/clickhouse/test/test_person_overrides.py
Co-authored-by: James Greenhill <fuziontech@gmail.com>
* Update snapshots
Co-authored-by: Xavier Vello <xavier@posthog.com>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: James Greenhill <fuziontech@gmail.com>
* feat(person-on-events): add option to delay all events
This change implements the option outlined in
https://github.com/PostHog/product-internal/pull/405
Here I do not attempt any large structural changes to the code; I'll
leave that for later, although it does mean the code has a few loose
couplings between pipeline steps that probably should be strongly
coupled. I've tried to comment these to make the couplings clear.
I've also added a workflow to run the functional tests against both
configurations, which we can remove once we're happy with the new
implementation.
Things of note:
1. We can't enable this for all users yet, not without the live events
view and not without verifying that the buffer size is sufficiently
large. We can however enable this for the test team and verify that
it functions as expected.
2. I have not handled the case mentioned in the above PR regarding
guarding against processing the delayed events before all events in
the delay window have been processed.
wip
test(person-on-events): add currently failing test for person on events
This test doesn't work with the previous behaviour of the
person-on-events implementation, but should pass with the new delay all
events behaviour.
* add test for KafkaJSError behaviour
* add comment re delay
* add test for create_alias
* chore: increase exports timeout
It seems to fail in CI, but only for the tests with delayed events
enabled. I'm not sure why, but I'm guessing it's because the events are
further delayed by the new implementation.
* chore: fix test
* add test for ordering of person properties
* use ubuntu-latest-8-cores runner
* add tests for plugin processEvent
* chore: ensure plugin processEvent isn't run multiple times
* expand on person properties ordering test
* wip
* wip
* add additional test
* change fullyProcessEvent to onlyUpdatePersonIdAssociations
* update test
* add test to ensure person properties do not propagate backwards in time
* simplify person property tests
* weaken guarantee in test
* chore: make sure we don't update properties on the first parse
We should only be updating person_id and associated distinct_ids on the
first parse.
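A small sketch of that rule, using hypothetical Python stand-ins for the
plugin-server code; the flag mirrors the onlyUpdatePersonIdAssociations
rename mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    person_id: str
    distinct_ids: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)


def handle_event(
    person: Person,
    distinct_id: str,
    set_props: dict,
    only_update_person_id_associations: bool,
) -> Person:
    # First parse: only person_id / distinct_id associations are touched.
    person.distinct_ids.add(distinct_id)
    if only_update_person_id_associations:
        return person
    # Second, fully processed pass: person property updates are applied too.
    person.properties.update(set_props)
    return person
```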
* add tests for dropping events
* increase export timeout
* increase historical exports timeout
* increase default waitForExpect interval to 1 second