Add Kafka rack-aware injection on docker-server
Currently we only inject this in plugin-server (aka reading from Kafka).
Let's add it to docker-server as well (aka capture/writing to Kafka).
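A minimal sketch of what the producer-side injection could look like, assuming a confluent-kafka (librdkafka) producer and a `KAFKA_CLIENT_RACK`-style env var populated from the node's availability zone; the names here are illustrative, not the actual settings:

```python
import os

from confluent_kafka import Producer

def build_capture_producer(bootstrap_servers: str) -> Producer:
    config = {"bootstrap.servers": bootstrap_servers}
    client_rack = os.getenv("KAFKA_CLIENT_RACK")  # e.g. "us-east-1a"; hypothetical var name
    if client_rack:
        # librdkafka's client.rack identifies the client's rack/AZ to the brokers
        config["client.rack"] = client_rack
    return Producer(config)
```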
Marius nerd-sniped us into trying this out. Recording of Cypress runs.
Co-authored-by: Paul D'Ambra <paul@posthog.com>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
* Swapped to use KAFKA_HOSTS everywhere
* Fixed up the type of Kafka config options and set up separate Kafka hosts for the blob consumer
* allow session recordings to have their own Kafka security protocol
* remove slash commands from this pr
* syntax must be obeyed
* Update UI snapshots for `chromium` (2)
* Update UI snapshots for `chromium` (2)
* fix
* Update query snapshots
* no empty strings in Kafka hosts (see the parsing sketch after this commit's notes)
* fix snapshot
* fix test
---------
Co-authored-by: Paul D'Ambra <paul@posthog.com>
Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>
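For reference, a minimal sketch of the KAFKA_HOSTS parsing described above, assuming a comma-separated env var; the helper name and the session-recordings fallback are illustrative, not the actual settings code:

```python
import os

def kafka_hosts_from_env(var: str = "KAFKA_HOSTS", default: str = "kafka:9092") -> list[str]:
    raw = os.getenv(var, default)
    # "a,,b" or a trailing comma would otherwise yield empty strings,
    # which the Kafka clients reject, so filter them out.
    return [host.strip() for host in raw.split(",") if host.strip()]

# The blob/session-recordings consumer can point at its own cluster and only
# fall back to the shared hosts when its dedicated variable is unset.
SESSION_RECORDING_KAFKA_HOSTS = (
    kafka_hosts_from_env("SESSION_RECORDING_KAFKA_HOSTS", default="") or kafka_hosts_from_env()
)
```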
* ci: fix waiting for temporal to be up in backend tests
I think it got merged before because we weren't running the backend
tests on these script changes, so I've also added them to the list of
paths to watch for changes.
* increase timeout to 180 seconds
* ci: wait for temporal to be up before running backend tests
If we don't wait, then there are some tests that fail because temporal
isn't up yet. These tests ideally shouldn't be using the real temporal
server, but rather the test server that is spun up when e.g. using
`temporalio.testing.WorkflowEnvironment`, although for the sake of
getting tests to not be flaky, this is a good enough solution for now.
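A minimal sketch of the kind of wait loop this needs, assuming the `temporalio` Python SDK and the default local address; the real CI script may well do this differently:

```python
import asyncio
import time

from temporalio.client import Client

async def wait_for_temporal(address: str = "localhost:7233", timeout: float = 180.0) -> None:
    deadline = time.monotonic() + timeout
    while True:
        try:
            await Client.connect(address)
            return
        except Exception:  # connection refused while the server is still starting
            if time.monotonic() > deadline:
                raise TimeoutError(f"Temporal not reachable at {address} after {timeout:.0f}s")
            await asyncio.sleep(2)

asyncio.run(wait_for_temporal())
```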
If you only change the plugin server, you spend a long time waiting for e2e CI to run.
It doesn't use the plugin server (I don't think).
So, don't run it...
* feat: Add Temporal to the dev and hobby stacks
* disable elastic for hobby because of resources
* checkpoint
* update requirements
* worker is up, but without the sandbox
* ensure temporal does not depend on elastic
* Addressed feedback
* pip-compile dev
* mypy fixes
* add a bit of colorful logging
* add django temporal worker to the mix
* checkpoint for dev-full docker
* Working on docker-full, but checkpointing for now
* add migration bits for full
* chore: use pnpm to manage dependencies
* Fix CI errors
* Don't report Docker image size for external PRs
* Fix pnpm-lock.yaml formatting
* Fix module versions
* Ignore pnpm-lock.yaml
* Upgrade Cypress action for pnpm support
* Set up node and pnpm before Cypress
* Fix typescript issues
* Include patches directory in Dockerfile
* Fix Jest tests in CI
* Update lockfile
* Update lockfile
* Clean up Dockerfile
* Update pnpm-lock.yaml to reflect current package.json files
* remove yarn-error.log from .gitignore
* formatting
* update data exploration readme
* type jest.config.ts
* fix @react-hook issues for jest
* fix react-syntax-highlighter issues for jest
* fix jest issues from query-selector-shadow-dom
* fix transform ignore patterns and undo previous fixes
* add missing storybook peer dependencies
* fix nullish coalescing operator for storybook
* reorder storybook plugins
* update editor-update-tsd warning to new npm script
* use legacy ssl for chromatic / node 18 compatibility
* use pnpm for visual regression testing workflow
* use node 16 for chromatic
* add @babel/plugin-proposal-nullish-coalescing-operator as direct dependency
* try fix for plugin-server
* cleanup
* fix comment and warning
* update more comments
* update playwright dockerfile
* update plugin source types
* conditional image size reporting
* revert react-native instructions
* less restrictive pnpm versions
* use ref component name in line with style guide
Co-authored-by: Jacob Gillespie <jacobwgillespie@gmail.com>
* feat: Have a local docker-compose stack for developing 100% in docker
* lol found a docker-compose bug where you can't have volumes created in root dir
* scale -> deploy.replicas
* don't forget to add image for asyncmigrationscheck
* env vars to each component
* Rename local to dev-full
* fix(test-account-filter): use 'event' type for filter
* fix(test-account-filter): sanitize filter for old bad data (see the sketch after this commit's notes)
* nope
* fix type
* fix test
Co-authored-by: Michael Matloka <dev@twixes.com>
Co-authored-by: Marius Andra <marius.andra@gmail.com>
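A hedged sketch of the sanitization idea, defaulting a missing or invalid filter `type` to `'event'`; the function and field names are assumptions, not the exact implementation:

```python
def sanitize_test_account_filters(filters: list[dict]) -> list[dict]:
    sanitized = []
    for f in filters:
        # Old rows can contain entries that aren't even dicts or lack a key.
        if not isinstance(f, dict) or "key" not in f:
            continue
        # Default the filter type to "event" when it's missing or falsy.
        sanitized.append({**f, "type": f.get("type") or "event"})
    return sanitized
```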
* feat: remove version from docker compose to support new spec
* feat: simplify the docker-compose setup so we do less version coordination
* update hobby bin
* bump docker-compose version for hobby for extends compat
* move ci to ubuntu-latest
* Revert "move ci to ubuntu-latest"
This reverts commit a0462adfec.
* use docker compose for github ci
* correct comments on base
* feat: hobby ci script to e2e test hobby deploy on DO
* checkpoint
* script in a decent spot
* fixes
* more fixes for script
* github action to run python
* Update env var for DO token
* hobby test name
* run on pull requests
* action fixes
* actions fixes
* actions fixes
* actions fixes
* support release testing as well
* actions fixes
* retry tweaks. GHA are a pain
* exit 0 for success and 1 for failure
* handle signals better
* fixes
* retry deletion
* Don't import packages that don't exist
* kwargs the args
* break out of retries
* done
* don't run on master pushes
* Use staging for Let's Encrypt
* feedback
* feedback timeout context
* feedbacks
* fix issue where the response could be referenced before initialization (see the sketch below)
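A rough sketch of the retry and signal-handling shape the DO test script ended up with; the names and the teardown hook are hypothetical stand-ins, not the actual script:

```python
import signal
import sys
import time

def retry(fn, attempts: int = 5, delay: float = 10.0):
    response = None  # initialize up front so it can't be referenced before assignment
    for attempt in range(attempts):
        try:
            response = fn()
            break  # success, so break out of the retries
        except Exception as exc:
            print(f"attempt {attempt + 1}/{attempts} failed: {exc}")
            time.sleep(delay)
    return response

def handle_signal(signum, frame):
    # Clean up the droplet before exiting so cancelled runs don't leak resources
    # (the actual teardown call is omitted here).
    print(f"received signal {signum}, cleaning up")
    sys.exit(1)  # exit 0 for success and 1 for failure

signal.signal(signal.SIGTERM, handle_signal)
signal.signal(signal.SIGINT, handle_signal)
```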
* fix: Update description for hobby upgrade script
In the original warning's language (written by yours truly), I don't mention which version number or which dates are relevant to the issue.
[x] Added a check to see if the volumes exist (`docker volume ls | grep -Pzoq 'root_clickhouse-data\n(.|\n)*root_postgres-data\n'`)
[x] Added the dates and version number so users can better understand whether the warning is relevant to them
[x] Minor improvements to the language and instructions
* feat: track warning states for better user prompts
Use the state from the pre-update volume presence check to show a post-install warning about data loss, in addition to the pre-upgrade warning.
* Improve references to docker container names
Previously the instructions erroneously included my particular instance's containing folder. Changed the grep check and added 'wildcards' to the instructions to clarify that the name might differ depending on the environment.
We are running yarn install and starting the frontend dev server for
hobby. This isn't necessary or desired as the image is prebuilt.
Note that we also need to update /compose/start in the upgrade.
It could probably do with a refactor, but I'm not going to do that now.
* feat(async-migrations): add auto complete trivial migrations option
This change is to ensure that the `run_async_migrations --check` command
option is a light operation such that we can safely use this e.g. in an
init container to gate the starting of a K8s Pod.
Prior to this change, the command was also handling the
auto-completion of migrations that it deemed not to be required, via the
`is_required` migration function. Aside from this heaviness issue, it's
good to avoid having side effects in a check command.
One part of the code that I wasn't sure about is the version checking,
so any comments here would be much appreciated.
Note that `./bin/migrate` is the command we call from both ECS migration
task and the Helm chart migration job.
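A simplified sketch of the split described above, assuming a Django management command; the helper functions are hypothetical placeholders for the real async-migration utilities:

```python
import sys

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Run async migrations, or just check whether any are pending"

    def add_arguments(self, parser):
        parser.add_argument("--check", action="store_true")

    def handle(self, *args, **options):
        if options["check"]:
            # Light and side-effect-free: only report whether anything is pending,
            # so an init container can gate Pod startup on the exit code.
            sys.exit(1 if get_pending_async_migrations() else 0)  # hypothetical helper
        # The heavier path (what ./bin/migrate calls) still auto-completes
        # migrations whose is_required() returns False, then runs the rest.
        auto_complete_noop_migrations()  # hypothetical helper
        run_pending_async_migrations()   # hypothetical helper
```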
* update comment re versions
* wip
* wip
* wip
* update hobby
* rename to noop migrations
* chore(gunicorn): increase thread count for workers
We currently run with 2 worker processes, each with 4 threads. It seems
we occasionally see spikes in the number of pending requests within a
worker in the 20-25 region. I suspect this is due to 4 slow requests
blocking the thread pool.
I'd suggest that the majority of work is going to be IO-bound, so it
hopefully won't peg the CPU. If it does, then it should end up
triggering the HPA and hopefully resolve itself :fingerscrossed:
There is, however, gzip running on events, which could be intensive
(I suggest we offload this to a downstream at some point). If lots of
large requests come in, this could be an issue. Some profiling here
wouldn't go amiss.
Note these are the same settings for both the web and event ingestion
workloads. At some point we may want to split them.
I've added a Dashboard for gunicorn worker stats
[here](https://github.com/PostHog/charts-clickhouse/pull/559) which we
can monitor to see the effects.
Aside: it would be wise to be able to specify these settings from the
chart itself, so that we do not have to consider chart/posthog version
combos, and to allow tweaking according to the deployment.
* reduce to 8 threads (see the settings sketch below)
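For reference, a hedged sketch of the relevant settings in gunicorn config-file form; the deployment may pass them as CLI flags instead, the worker count is an assumption, and the thread value follows the final "reduce to 8 threads" commit:

```python
# gunicorn.conf.py-style sketch
workers = 2               # worker processes (assumed unchanged)
threads = 8               # threads per worker, up from 4
worker_class = "gthread"  # threaded workers suit mostly IO-bound request handling
```

With threaded workers, the in-flight capacity per pod is roughly workers × threads, i.e. 16 requests here.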
* feat(data-management): add custom events list
* remove dead code
* fix test
* assert what matters
* this seems flaky, even locally, though the interface shows the right data locally... testing a timeout
* new script
* fix test
* remove frontend changes (PR incoming)
* describe meaning behind symbols
It looks like the plugin-server isn't shutting down cleanly, from
looking at the logs. They abruptly stop.
We have a trap to kill the yarn command on EXIT; however, yarn v1
doesn't propagate SIGTERM to subprocesses, so node never receives it.
Separately, it looks like the shutdown ends up being called multiple
times, which results in a forced shutdown. I'm not entirely sure what is
going on here, but I'll leave that to another PR.
Prior to this, the containing script was receiving the TERM signal, e.g.
from Kubernetes eviction. As a result, it was terminating the root
process without waiting on gunicorn.
We solve this by not spawning a new process and instead having
gunicorn replace the current process.
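Conceptually (the actual change is in the container start script, most likely via the shell's `exec` builtin), the idea is the same as `os.exec*` in Python: replace the current process so gunicorn itself receives SIGTERM from Kubernetes. The wsgi module path below is an assumption:

```python
import os

# Replaces the current process image; no child process is spawned, so signals
# sent to this PID now go straight to gunicorn.
os.execvp("gunicorn", ["gunicorn", "posthog.wsgi", "--bind", "0.0.0.0:8000"])
```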