* feat(async-migrations): add auto complete trivial migrations option
This change ensures that the `run_async_migrations --check` command is a
light operation, so that we can safely use it e.g. in an init container
to gate the starting of a K8s Pod.
Before this change, the command was also handling the auto-completion of
migrations that it deemed not to be required, via the `is_required`
migration function. Aside from the weight of that work, it's good to
avoid having side effects in a check command.
One part of the code that I wasn't sure about is the version checking,
so any comments here would be much appreciated.
Note that `./bin/migrate` is the command we call from both ECS migration
task and the Helm chart migration job.
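A rough sketch of the intended split: the `--check` path only reads state and exits, while auto-completion of noop migrations stays on the normal run path. The helpers here are illustrative stand-ins, not PostHog's actual internals:

```python
import sys

from django.core.management.base import BaseCommand


def get_pending_migrations():
    """Illustrative stand-in: a cheap read of async-migration state."""
    return []


def run_migration(migration):
    """Illustrative stand-in: apply (or auto-complete) one migration."""


class Command(BaseCommand):
    help = "Run async migrations"

    def add_arguments(self, parser):
        parser.add_argument("--check", action="store_true")

    def handle(self, *args, **options):
        pending = get_pending_migrations()
        if options["check"]:
            # Pure read with no auto-completion side effects, so this is
            # cheap and safe to run from a K8s init container.
            sys.exit(1 if pending else 0)
        for migration in pending:
            run_migration(migration)
```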
* update comment re versions
* wip
* wip
* wip
* update hobby
* rename to noop migrations
* chore(gunicorn): increase thread count for workers
We currently run with 2 worker processes, each with 4 threads.
Occasionally we see spikes in the number of pending requests within a
worker in the 20-25 range. I suspect this is due to 4 slow requests
blocking the thread pool.
I'd suggest that the majority of work is going to be IO bound, so it
hopefully won't peg the CPU. If it does, it should end up triggering the
HPA and hopefully resolve itself :fingerscrossed:
There is, however, gzip running on events, which could be intensive (I
suggest we offload this to a downstream at some point). If lots of large
requests come in, this could be an issue. Some profiling here wouldn't
go amiss.
Note these are the same settings between web and event ingestion
workload. At some point we may want to split.
I've added a Dashboard for gunicorn worker stats
[here](https://github.com/PostHog/charts-clickhouse/pull/559) which we
can monitor to see the effects.
Aside: it would be wise to be able to specify these settings from the
chart itself, so that we do not have to consider chart/posthog version
combos, and can allow tweaking according to the deployment.
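For reference, these knobs live in the gunicorn config file; a minimal sketch assuming threaded workers (the counts reflect the numbers discussed here, not necessarily the final settings):

```python
# gunicorn.config.py (sketch)
workers = 2            # worker processes
threads = 8            # threads per worker; slow IO-bound requests each
                       # hold a thread, so a pool of 4 saturates quickly
worker_class = "gthread"  # use the threaded worker
```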
* reduce to 8 threads
* feat(data-management): add custom events list
* remove dead code
* fix test
* assert what matters
* this seems flaky, even locally, though the interface shows the right data locally... testing a timeout
* new script
* fix test
* remove frontend changes (PR incoming)
* describe meaning behind symbols
It looks like the plugin-server isn't shutting down cleanly, judging
from the logs: they abruptly stop.
We have a trap to kill the yarn command on EXIT; however, yarn v1
doesn't propagate SIGTERM to subprocesses, hence node never receives it.
Separately, it looks like the shutdown ends up being called multiple
times, which results in a force shutdown. I'm not entirely sure what is
going on here, but I'll leave that to another PR.
Prior to this, the containing script was receiving the TERM signal, e.g.
from Kubernetes eviction, and as a result was terminating the root
process without waiting on gunicorn.
We solve this by avoiding spawning a new process and rather have
gunicorn replace the current process.
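The change itself lives in the shell entrypoint, but the pattern is exec-style process replacement; a minimal Python rendering of the same idea (the gunicorn arguments are illustrative):

```python
import os

# Replace the current process image instead of spawning a child: the PID
# stays the same, so SIGTERM from e.g. Kubernetes eviction reaches
# gunicorn directly rather than a wrapper that fails to forward it.
os.execvp("gunicorn", ["gunicorn", "posthog.wsgi", "--bind", "0.0.0.0:8000"])
```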
* chore: Allow instrumentation of gunicorn with statsd (#11372)
* chore: Allow instrumentation of gunicorn with statsd
In order to ensure that gunicorn is performing optimally, it helps to
monitor it with statsd.
This change allows us to include the flags needed to send UDP packets to
a statsd instance.
Docs: https://docs.gunicorn.org/en/stable/instrumentation.html
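Gunicorn exposes this via its `statsd_host`/`statsd_prefix` settings; a sketch of wiring them from the environment, assuming a `STATSD_HOST` variable alongside the `STATSD_PORT` mentioned below:

```python
# gunicorn config sketch: only enable statsd when a host is configured.
import os

if os.getenv("STATSD_HOST"):
    # Metrics are emitted as UDP statsd packets to host:port.
    statsd_host = f"{os.environ['STATSD_HOST']}:{os.environ.get('STATSD_PORT', '8125')}"
    statsd_prefix = "posthog"  # illustrative prefix
```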
* Update bin/docker-server
Co-authored-by: Harry Waye <harry@posthog.com>
* Update bin/docker-server
Co-authored-by: Harry Waye <harry@posthog.com>
Co-authored-by: Harry Waye <harry@posthog.com>
* Include the STATSD_PORT correctly
Co-authored-by: Harry Waye <harry@posthog.com>
* Add persistent volumes to docker-compose-hobby
Per the discussion in https://github.com/PostHog/posthog/issues/10792, implemented the "Kessel Fix" in less than a parsec.
* Add warning text to user prompts to avoid data loss
Following discussion with PH team, we wanted to give users the information needed to properly manage the data in their installation and avoid potential data loss.
* feat: test a11y with Cypress
* axe test more pages
* archive a11y violations on success too
* remove date from file path
* don't warn if no accessibility files to upload... they're not on all test jobs
* chore(web): add django-prometheus exposed on /_metrics
This exposes a number of metrics, see
97d5748664/documentation/exports.md
for details. It includes histograms of timings by viewname, both before
and after middleware.
I'm not particularly interested in these right now, but rather would
like to expose Kafka Producer metrics as per
https://github.com/PostHog/posthog/pull/10997
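A sketch of the wiring, using django-prometheus's standard middleware pair and export view (the `/_metrics` path comes from this change; the rest is illustrative):

```python
# settings.py (sketch): the Before/After pair brackets the rest of the
# stack, which is what yields timings by viewname before and after
# middleware.
MIDDLEWARE = [
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    # ... existing middleware ...
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]
```

```python
# urls.py (sketch): expose the metrics on /_metrics.
from django.urls import path
from django_prometheus.exports import ExportToDjangoView

urlpatterns = [
    path("_metrics", ExportToDjangoView, name="prometheus-metrics"),
]
```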
* Refactor to use gunicorn server hooks
* also add expose to dockerfile
* wip
* feat: add libmaxminddb0 as a dependency; the C library will speed things up significantly
* pin libmaxminddb to 1.5 for what's available from APK
* get geolite2 db during build
* add settings for geoip2 django contrib library
* black formatting
* consistently use share directory
* isort fixes
* remove GeoLite2-City.mmdb from git and add script to ./bin/start to download it if file does not exist
* remove GeoLite2-City.mmdb from git
* add doc for share directory explaining why it exists
* relative path for curl in build
* shared vs share consistency
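A sketch of how the pieces fit together with Django's geoip2 contrib wrapper; with libmaxminddb present, lookups go through the C extension (the share-directory path is an assumption based on the commits above):

```python
from django.contrib.gis.geoip2 import GeoIP2

# Either set GEOIP_PATH in settings.py, or pass the directory explicitly;
# "./share" is assumed to hold GeoLite2-City.mmdb in this layout.
geoip = GeoIP2(path="./share")
print(geoip.city("8.8.8.8")["country_code"])
```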
* Update snapshots
* brotli decompress
* ...everywhere
Co-authored-by: Neil Kakkar <neilkakkar@gmail.com>
Co-authored-by: neilkakkar <neilkakkar@users.noreply.github.com>
* chore(dev): use network mode host for docker-compose services
This removes the need to add kafka to /etc/hosts.
As far as I can tell this should be fine for people's local dev, except
that they will be required to reset and re-migrate ClickHouse tables, as
they will be trying to pull from `kafka` instead of `localhost`.
* remove ports from redis
* Update a few more references
We were, for instance, calling trap at a point where it wouldn't get
called, and giving special status to some processes to run in the
foreground.
Instead we (see the sketch below):
1. wait for any process exit
2. use its exit code for the calling process
3. kill background processes on EXIT
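A minimal Python sketch of that supervision pattern (the child commands are placeholders for the real background services):

```python
import os
import signal
import subprocess
import sys

# Placeholder children; the real script backgrounds its long-running services.
procs = [subprocess.Popen(cmd) for cmd in (["sleep", "30"], ["sleep", "60"])]

# 1. wait for any process exit
pid, status = os.wait()

# 3. kill the remaining background processes (the shell version does this
#    via `trap ... EXIT`)
for p in procs:
    if p.pid != pid:
        p.send_signal(signal.SIGTERM)
        p.wait()

# 2. use the first exiter's code for the calling process
sys.exit(os.waitstatus_to_exitcode(status))
```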
* chore(dockerfile): make docker build multistage
The built image is >4GB uncompressed at the moment; I'm pretty sure
there is a lot of cruft.
The plan is to split out the django, frontend, and plugin-server builds
and hopefully get some gains by not including build deps.
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* wip
* fix dockerfile lints
* cache from branch
* add load: true
* Update production.Dockerfile
Co-authored-by: James Greenhill <fuziontech@gmail.com>
* Update production.Dockerfile
* update to use compressed size from remote repo
* tag with branch and sha
* add ref on pull_request events
* install python
* be a bit more lax with python version
* fix image size calc
* hardcode lower case image name
* use @
* only add sha on master branch, add master tag on master
* chore: use docker image in e2e tests
This is to try to add some guarantees to the docker image without having
to manually test it, so we can be a bit more aggressive with
improvements without, e.g., having to push to the playground or run
locally.
* wip
* add to summary
* wip
* chore: put cypress tests after docker build
I couldn't figure out a way to get workflow_run to run without merging
in, so I'm just putting it after the build.
* wip
* wip
* wip
* remove quotes
* remove separate cypress install
* wip
* wip
* wip
* add gunicorn.config.py
* ci: run docker image build on master as well
This way we get the caching from the master build.
* wip
* wip
Co-authored-by: James Greenhill <fuziontech@gmail.com>
* update cypress
* really click something that's actually there
* obey cypress and use done
* run cypress 9 in CI
* no need for before each when only one test
* no need to set window size to the default
* get tests passing file by file
* delay checking for a graph in a test
* be more specific cypress
* use cy command
* select text like a human
* silly cypress
* try and avoid cypress deciding that a visible field is not valid
* select delete button correctly
* find save button differently
* try and avoid not always typing the first character
* better trends selections
* use cy command to navigate
* continue trying to get tests to pass in CI
* another try at setting feature flag names in CI
* can CI find undo button without a wait?
* better assertion for cypress
* up to v10
* fix splitting specs with v10 path
* show cypress how to wait for the test to finish
* remove redundant file
* change return to satisfy new cypress
* move import
* fix dayjs
* fix timeouts (we're not strictly speaking running in nodejs)
* export unexported type
* consolidate on a single FormInstance
* no need to rename
* fuse
* forminstance 2
* locationChanged
* BuiltLogic
* remove Type.ts exception
* fix duh
* lay off the bin/check-typescript/strict script
* don't think this is ever used or useful
* no real need to hide the output
* make typescript:check do what the name says
* we're already strict
Co-authored-by: Michael Matloka <dev@twixes.com>
* chore(hobby deployments): various fixes
* default to not checking versions for the current hobby release
Co-authored-by: James Greenhill <fuziontech@gmail.com>
* feat(object_storage): add unused object storage with health checks
* only prompt debug users if object storage not available at preflight
* safe plugin server health check for unused object storage
* explicit object storage settings
* explicit object storage settings
* explicit object storage settings
* downgrade pip tools
* without spaces?
* like this?
* without updating pip?
* remove object_storage from dev volumes
* named volume on hobby
* lazily init object storage
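A sketch of what lazy init plus a safe health check can look like with boto3 (the env var names and bucket default are assumptions, loosely modelled on the endpoint setting mentioned below):

```python
import os

import boto3
from botocore.exceptions import BotoCoreError, ClientError

_client = None


def storage_client():
    """Lazily construct the S3 client so merely importing this is free."""
    global _client
    if _client is None:
        _client = boto3.client(
            "s3",
            endpoint_url=os.getenv("OBJECT_STORAGE_ENDPOINT", "http://localhost:19000"),
            aws_access_key_id=os.getenv("OBJECT_STORAGE_ACCESS_KEY_ID"),
            aws_secret_access_key=os.getenv("OBJECT_STORAGE_SECRET_ACCESS_KEY"),
        )
    return _client


def is_storage_available() -> bool:
    """Never raise: an unavailable (and unused) store shouldn't fail preflight."""
    try:
        storage_client().head_bucket(Bucket=os.getenv("OBJECT_STORAGE_BUCKET", "posthog"))
        return True
    except (BotoCoreError, ClientError):
        return False
```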
* simplify conditional check
* reproduced error locally
* reproduced error locally
* object_storage_endpoint not host and port
* log more when checking kafka and clickhouse
* don't filter docker output
* add kafka to hosts before starting stack?
* silly cloud tests (not my brain)
* More secure secret generation
Use a random source designed for secrets/crypto and apply a stronger hash function as MD5 is broken. This shouldn't have too much of an impact in this context, but better safe than sorry.
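In Python terms, the same idea looks like this (a sketch; the 32-byte length is an assumption):

```python
import hashlib
import secrets

# Randomness from a CSPRNG rather than $RANDOM/date, hashed with SHA-512
# instead of MD5 if a hash step is kept at all.
secret_key = hashlib.sha512(secrets.token_bytes(32)).hexdigest()

# Strictly, hashing adds nothing once the input is already random;
# secrets.token_hex(32) on its own would do.
```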
* Tune `head` params
Co-authored-by: Michael Matloka <dev@twixes.com>
* remove tests that have been off for a year
* remove component tests that are covered by main cypress tests
* remove a bunch of component based test setup and upgrade cypress
* get tests running but not all passing on Cypress 9
* don't upgrade yet
* don't upgrade yet
* feat(sharding): add command to sync tables onto new nodes
clickhouse-operator only syncs some tables onto new nodes. This new
command ensures that, when adding new shards, they are automatically
synced up on redeploying.
Note that there might be timing concerns here, as resharding on Altinity
Cloud does not redeploy automatically. In practice, however, this means
that new nodes just won't ingest any data until another deploy.
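A sketch of the command's likely shape as a Django management command; the `sync_execute` import and table list are assumptions, not the actual code:

```python
from django.core.management.base import BaseCommand

from posthog.client import sync_execute  # assumed import path for the query helper

TABLES_TO_SYNC = ["sharded_events"]  # illustrative


class Command(BaseCommand):
    help = "Create missing tables on newly added ClickHouse nodes"

    def handle(self, *args, **options):
        for table in TABLES_TO_SYNC:
            # SHOW CREATE TABLE gives the canonical schema; replaying it
            # with IF NOT EXISTS is a no-op on nodes that already have it.
            ((create_sql,),) = sync_execute(f"SHOW CREATE TABLE {table}")
            sync_execute(create_sql.replace("CREATE TABLE", "CREATE TABLE IF NOT EXISTS", 1))
```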
* Add test to the new command
* Improve non-replicated test