85bb582cee
This _should_ give us better performance and reliability, but it's hard to tell without a lot of testing. Will monitor closely on rollout. Note that this will require a delete on the old consumer members as they are using round eager robin partition strategy, whereas this is using the cooperative sticky partition strategy. librdkafka doesn't support mixing the two, unlike the Java Kafka Client. --------- Co-authored-by: Tiina Turban <tiina303@gmail.com> |
||
---|---|---|
.. | ||
.devcontainer | ||
.vscode | ||
bin | ||
functional_tests | ||
src | ||
tests | ||
.babelrc | ||
.editorconfig | ||
.eslintrc.js | ||
.gitattributes | ||
.gitignore | ||
.prettierignore | ||
jest.config.functional.js | ||
jest.config.js | ||
jest.setup.fetch-mock.js | ||
package.json | ||
pnpm-lock.yaml | ||
README.md | ||
tsconfig.eslint.json | ||
tsconfig.json |
PostHog Plugin Server
This service takes care of processing events with plugins and more.
Get started
Let's get you developing the plugin server in no time:
-
Have virtual environment from the main PostHog repo active.
-
Install dependencies and prepare for takeoff by running command
pnpm i
. -
Start a development instance of PostHog - instructions here. After all, this is the PostHog Plugin Server, and it works in conjuction with the main server.
-
Make sure that the plugin server is configured correctly (see Configuration). The following settings need to be the same for the plugin server and the main server:
DATABASE_URL
,REDIS_URL
,KAFKA_HOSTS
,CLICKHOUSE_HOST
,CLICKHOUSE_DATABASE
,CLICKHOUSE_USER
, andCLICKHOUSE_PASSWORD
. Their default values should work just fine in local development though. -
Start the plugin server in autoreload mode with
pnpm start:dev
, or in compiled mode withpnpm build && pnpm start:dist
, and develop away! -
Prepare for running tests with
pnpm setup:test
, which will run the necessary migrations. Run the tests themselves withpnpm test:{1,2}
. -
Prepare for running functional tests. See notes below.
## Functional tests
Functional tests are provided located in functional_tests
. They provide tests
for high level functionality of the plugin-server, i.e. functionality that any
client of the plugin-server should be able to use. It attempts to assume nothing
of the implementation details of the plugin-server.
At the time of writing it assumes:
- that events are pushed into Kafka topics.
- that side-effects of the plugin-server are updates to ClickHouse table data.
- that the plugin-server reads certain data from Postgres tables e.g.
posthog_team
,posthog_pluginsource
etc. These would ideally be wrapped in some implementation detail agnostic API.
It specifically doesn't assume details of the running plugin-server process e.g. runtime stack.
See bin/ci_functional_tests.sh
for how these tests are run in CI. For local
testing:
- run docker
docker compose -f docker-compose.dev.yml up
(in posthog folder) - setup the test DBs
pnpm setup:test
- start the plugin-server with
CLICKHOUSE_DATABASE='default' DATABASE_URL=postgres://posthog:posthog@localhost:5432/test_posthog pnpm start:dev
- run the tests with
CLICKHOUSE_DATABASE='default' DATABASE_URL=postgres://posthog:posthog@localhost:5432/test_posthog pnpm functional_tests --watch
CLI flags
There are also a few alternative utility options on how to boot plugin-server. Each one does a single thing. They are listed in the table below, in order of precedence.
Name | Description | CLI flags |
---|---|---|
Help | Show plugin server configuration options | -h , --help |
Version | Only show currently running plugin server version | -v , --version |
Healthcheck | Check plugin server health and exit with 0 or 1 | --healthcheck |
Migrate | Migrate Graphile Worker | --migrate |
Alternative modes
By default, plugin-server is responsible for and executes all of the following:
- Ingestion (calling plugins and writing event and person data to ClickHouse and Postgres, buffering events)
- Scheduled tasks (runEveryX type plugin tasks)
- Processing plugin jobs
- Async plugin tasks (onEvent, onSnapshot plugin tasks)
Ingestion can be split into its own process at higher scales. To do so, you need to run two different instances of plugin-server, with the following environment variables set:
Env Var | Description |
---|---|
PLUGIN_SERVER_MODE=ingestion |
This plugin server instance only runs ingestion (1) |
PLUGIN_SERVER_MODE=async |
This plugin server processes all async tasks (2-4). Note that async plugin tasks are triggered based on ClickHouse events topic |
If PLUGIN_SERVER_MODE
is not set the plugin server will execute all of its tasks (1-4).
Configuration
There's a multitude of settings you can use to control the plugin server. Use them as environment variables.
Name | Description | Default value |
---|---|---|
DATABASE_URL | Postgres database URL | 'postgres://localhost:5432/posthog' |
REDIS_URL | Redis store URL | 'redis://localhost' |
BASE_DIR | base path for resolving local plugins | '.' |
WORKER_CONCURRENCY | number of concurrent worker threads | 0 – all cores |
TASKS_PER_WORKER | number of parallel tasks per worker thread | 10 |
REDIS_POOL_MIN_SIZE | minimum number of Redis connections to use per thread | 1 |
REDIS_POOL_MAX_SIZE | maximum number of Redis connections to use per thread | 3 |
SCHEDULE_LOCK_TTL | how many seconds to hold the lock for the schedule | 60 |
PLUGINS_RELOAD_PUBSUB_CHANNEL | Redis channel for reload events | 'reload-plugins' |
CLICKHOUSE_HOST | ClickHouse host | 'localhost' |
CLICKHOUSE_OFFLINE_CLUSTER_HOST | ClickHouse host to use for offline workloads. Falls back to CLICKHOUSE_HOST | null |
CLICKHOUSE_DATABASE | ClickHouse database | 'default' |
CLICKHOUSE_USER | ClickHouse username | 'default' |
CLICKHOUSE_PASSWORD | ClickHouse password | null |
CLICKHOUSE_CA | ClickHouse CA certs | null |
CLICKHOUSE_SECURE | whether to secure ClickHouse connection | false |
KAFKA_HOSTS | comma-delimited Kafka hosts | null |
KAFKA_CONSUMPTION_TOPIC | Kafka incoming events topic | 'events_plugin_ingestion' |
KAFKA_CLIENT_CERT_B64 | Kafka certificate in Base64 | null |
KAFKA_CLIENT_CERT_KEY_B64 | Kafka certificate key in Base64 | null |
KAFKA_TRUSTED_CERT_B64 | Kafka trusted CA in Base64 | null |
KAFKA_PRODUCER_MAX_QUEUE_SIZE | Kafka producer batch max size before flushing | 20 |
KAFKA_FLUSH_FREQUENCY_MS | Kafka producer batch max duration before flushing | 500 |
KAFKA_MAX_MESSAGE_BATCH_SIZE | Kafka producer batch max size in bytes before flushing | 900000 |
LOG_LEVEL | minimum log level | 'info' |
SENTRY_DSN | Sentry ingestion URL | null |
STATSD_HOST | StatsD host - integration disabled if this is not provided | null |
STATSD_PORT | StatsD port | 8125 |
STATSD_PREFIX | StatsD prefix | 'plugin-server.' |
DISABLE_MMDB | whether to disable MMDB IP location capabilities | false |
INTERNAL_MMDB_SERVER_PORT | port of the internal server used for IP location (0 means random) | 0 |
DISTINCT_ID_LRU_SIZE | size of persons distinct ID LRU cache | 10000 |
PISCINA_USE_ATOMICS | corresponds to the piscina useAtomics config option (https://github.com/piscinajs/piscina#constructor-new-piscinaoptions) | true |
PISCINA_ATOMICS_TIMEOUT | (advanced) corresponds to the length of time (in ms) a piscina worker should block for when looking for tasks - instances with high volumes (100+ events/sec) might benefit from setting this to a lower value | 5000 |
HEALTHCHECK_MAX_STALE_SECONDS | 'maximum number of seconds the plugin server can go without ingesting events before the healthcheck fails' | 7200 |
MAX_PENDING_PROMISES_PER_WORKER | (advanced) maximum number of promises that a worker can have running at once in the background. currently only targets the exportEvents buffer. | 100 |
KAFKA_PARTITIONS_CONSUMED_CONCURRENTLY | (advanced) how many kafka partitions the plugin server should consume from concurrently | 1 |
RECORDING_PARTITIONS_CONSUMED_CONCURRENTLY | (advanced) how many kafka partitions the recordings consumer should consume from concurrently | 1 |
PLUGIN_SERVER_MODE | (advanced) see alternative modes section | null |
Releasing a new version
Just bump up version
in package.json
on the main branch and the new version will be published automatically,
with a matching PR in the main PostHog repo created.
It's advised to use bump patch/minor/major
label on PRs - that way the above will be done automatically when the PR is merged.
Courtesy of GitHub Actions.
Walkthrough
The story begins with pluginServer.ts -> startPluginServer
, which is the main thread of the plugin server.
This main thread spawns WORKER_CONCURRENCY
worker threads, managed using Piscina. Each worker thread runs TASKS_PER_WORKER
tasks (concurrentTasksPerWorker).
Main thread
Let's talk about the main thread first. This has:
-
pubSub
– Redis powered pub-sub mechanism for reloading plugins whenever a message is published by the main PostHog app. -
hub
– Handler of connections to required DBs and queues (ClickHouse, Kafka, Postgres, Redis), holds loaded plugins. Created viahub.ts -> createHub
. Every thread has its own instance. -
piscina
– Manager of tasks delegated to threads.makePiscina
creates the manager, whilecreateWorker
creates the worker threads. -
pluginScheduleControl
– Controller of scheduled jobs. Responsible for adding Piscina tasks for scheduled jobs, when the time comes. The schedule information makes it into the controller when plugin VMs are created.Scheduled tasks are controlled with Redlock (redis-based distributed lock), and run on only one plugin server instance in the entire cluster.
-
jobQueueConsumer
– The internal job queue consumer. This enables retries, scheduling jobs in the future (once) (Note: this is the difference betweenpluginScheduleControl
and this internaljobQueue
). WhilepluginScheduleControl
is triggered viarunEveryMinute
,runEveryHour
tasks, thejobQueueConsumer
deals withmeta.jobs.doX(event).runAt(new Date())
.Jobs are enqueued by
job-queue-manager.ts
, which is backed by Postgres-based Graphile-worker (graphile-queue.ts
). -
queue
– Event ingestion queue. This is a Celery (backed by Redis) or Kafka queue, depending on the setup (EE/Cloud is Kafka due to high volume). These are consumed by thequeue
above, and sent off to the Piscina workers (src/main/ingestion-queues/queue.ts -> ingestEvent
). Since all of the actual ingestion happens inside worker threads, you'll find the specific ingestion code there (src/worker/ingestion/ingest-event.ts
). There the data is saved into Postgres (and ClickHouse via Kafka on EE/Cloud).It's also a good idea to see the producer side of this ingestion queue, which comes from
posthog/posthog/api/capture.py
. The plugin server gets theprocess_event_with_plugins
Celery task from there, in the Postgres pipeline. The ClickHouse via Kafka pipeline gets the data by way of Kafka topicevents_plugin_ingestion
. -
mmdbServer
– TCP server, which works as an interface between the GeoIP MMDB data reader located in main thread memory and plugins ran in worker threads of the same plugin server instance. This way the GeoIP reader is only loaded in one thread and can be used in all. Additionally this mechanism ensures thatmmdbServer
is ready before ingestion is started (database downloaded from http-mmdb and read), and keeps the database up to date in the background.
Worker threads
This begins with worker.ts
and createWorker()
.
hub
is the same setup as in the main thread.
New functions called here are:
-
setupPlugins
– Loads plugins and prepares them for lazy VM initialization. -
createTaskRunner
– Creates a Piscina task runner that allows to operate on plugin VMs.
Note: An
organization_id
is tied to a company and its installed plugins, ateam_id
is tied to a project and its plugin configs (enabled/disabled+extra config).