feat(run-ops): ClickHouse multi-source replication fan-in + admin ops #4119

Draft
d-cs wants to merge 4 commits into runops/pr06-write-path from runops/pr07-replication

Conversation

d-cs commented Jul 2, 2026 (Collaborator)

What

Extends the ClickHouse runs-replication service to fan in from multiple Postgres sources (the control-plane DB and the run-ops DB) instead of a single source, plus the admin operations to run and observe it.

  • Multi-source fan-in (services/runsReplicationService.server.ts, new runsReplicationInstance.server.ts, runsReplicationGlobal.server.ts): factors the replication service into per-source instances and a coordinator so a single ClickHouse target is fed from more than one Postgres source.
  • Admin ops (routes/admin.api.v1.runs-replication.status.ts, admin.api.v1.runs-replication.backfill.ts, v3/services/adminWorker.server.ts): adds a status endpoint reporting per-source replication state and updates the backfill entrypoint for the multi-source shape.
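The per-source instance plus coordinator shape described above can be pictured roughly as follows. This is a minimal in-memory sketch, not the PR's implementation: `ReplicationInstance`, `ReplicationCoordinator`, `RunChange`, the `version` field, and the "highest version wins" merge rule are all hypothetical names and assumptions chosen for illustration, and the sink here is a `Map` standing in for the single ClickHouse runs stream.

```typescript
// Hypothetical shapes for illustration; none of these names come from the PR's code.
type SourceId = string;

interface RunChange {
  runId: string;
  version: number; // assumed to increase with every update to a run
  status: string;
}

interface SourceStatus {
  sourceId: SourceId;
  running: boolean;
  eventsReceived: number;
}

// One instance per Postgres source. It only forwards its own changes to the sink.
class ReplicationInstance {
  private running = false;
  private eventsReceived = 0;

  constructor(
    readonly sourceId: SourceId,
    private readonly sink: (change: RunChange) => void
  ) {}

  start(): void {
    this.running = true;
  }

  stop(): void {
    this.running = false;
  }

  // Stand-in for a decoded replication event arriving from this source.
  handleChange(change: RunChange): void {
    if (!this.running) return;
    this.eventsReceived += 1;
    this.sink(change);
  }

  status(): SourceStatus {
    return {
      sourceId: this.sourceId,
      running: this.running,
      eventsReceived: this.eventsReceived,
    };
  }
}

// The coordinator owns every instance and merges their output into one stream.
class ReplicationCoordinator {
  private readonly instances = new Map<SourceId, ReplicationInstance>();
  private readonly latest = new Map<string, RunChange>();

  addSource(sourceId: SourceId): ReplicationInstance {
    if (this.instances.has(sourceId)) {
      throw new Error(`duplicate source id: ${sourceId}`);
    }
    const instance = new ReplicationInstance(sourceId, (change) => this.merge(change));
    this.instances.set(sourceId, instance);
    return instance;
  }

  startAll(): void {
    for (const instance of this.instances.values()) instance.start();
  }

  // Assumed dedup rule: keep the row with the highest version per run id.
  private merge(change: RunChange): void {
    const current = this.latest.get(change.runId);
    if (!current || change.version > current.version) {
      this.latest.set(change.runId, change);
    }
  }

  // Per-source state, roughly what a status endpoint could report.
  status(): SourceStatus[] {
    return [...this.instances.values()].map((instance) => instance.status());
  }

  rows(): RunChange[] {
    return [...this.latest.values()];
  }
}
```

With a single registered source the coordinator degenerates to the pre-split behavior, which is one way to keep single-source operation intact when the split is not enabled.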

Why

PR7 of the run-ops split stack, and the final piece: once run state can live in a separate run-ops DB (earlier PRs), the analytics replication into ClickHouse has to consume both sources so runs remain queryable regardless of residency. Behavior-changing for the replication service internals; the ClickHouse-facing output is unchanged (still one runs stream), and single-source operation is preserved when the split is not enabled.

Tests

New vitest coverage: runsReplicationInstance.test.ts (per-source instance behavior) and runsReplicationService.part8/part9 suites exercising the multi-source coordinator. Testcontainers-backed (ClickHouse + Postgres); no mocks.

Notes

Draft, stacked on #4118 (runops/pr06-write-path). Review that first; this diff is against it.

Server-change / changeset note to be added at stack-assembly time.

🤖 Generated with Claude Code

changeset-bot (bot) commented Jul 2, 2026

⚠️ No Changeset found

Latest commit: 2bba3b8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai (bot) commented Jul 2, 2026 (Contributor)

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2485cbc2-c0a6-4814-bc28-b6e50bbf03a8

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


devin-ai-integration (bot) left a comment (Contributor)

Devin Review found 2 potential issues.

Comment thread apps/webapp/app/services/runsReplicationService.server.ts Outdated
Comment thread apps/webapp/app/services/runsReplicationInstance.server.ts

coderabbitai (bot) left a comment (Contributor)

Actionable comments posted: 3

🧹 Nitpick comments (4)
apps/webapp/test/runsReplicationInstance.test.ts (1)

221-425: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Heavy duplication of org/project/env/PG17-container setup boilerplate across both integration blocks.

Both replicationContainerTest blocks (lines 227-336 and 347-422) repeat nearly identical org/project/runtimeEnvironment creation, createPostgresContainer/REPLICA IDENTITY FULL setup, and try/finally teardown. Extracting a shared test fixture helper (e.g., in test/utils/) for "seed org/project/env/run" and "spin up dual PG sources with REPLICA IDENTITY FULL" would reduce this to a few lines per test and cut duplication with the similar helpers in runsReplicationService.part8.test.ts.

apps/webapp/test/runsReplicationService.part8.test.ts (2)

96-144: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

seedFkRows/seedRun fixture-seeding logic is duplicated three times in this file (and again across sibling test files).

Each of the three tests re-implements essentially the same organization/project/runtimeEnvironment/taskRun creation logic with minor parameter differences. This is the same pattern duplicated in runsReplicationInstance.test.ts and runsReplicationService.part9.test.ts (seedRun). Extracting a shared "seed FK chain" helper into a test util module would reduce this to one implementation reused across all three files here and the sibling test files.

Also applies to: 275-323, 443-495


152-167: 🩺 Stability & Availability | 🔵 Trivial | 🏗️ Heavy lift

Fixed sleeps (setTimeout(500/2000/3000)) instead of polling risk CI flakiness.

All three tests here wait a fixed duration for replication to flush ("Wait for BOTH streams to flush into ClickHouse"), unlike runsReplicationService.part9.test.ts's waitForLagHistogram, which polls with a deadline. Under container/CPU contention a fixed 3s sleep can be too short and cause intermittent failures. Consider adopting the same poll-until-condition pattern used in part9 for these ClickHouse assertions.

Also applies to: 325-336, 497-507
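A poll-until-condition helper of the kind suggested here might look like the sketch below. It is not the `waitForLagHistogram` helper from part9 (whose signature is not shown in this thread); `waitUntil`, `WaitOptions`, and the default timeout and interval are hypothetical names and values picked for illustration.

```typescript
interface WaitOptions {
  timeoutMs?: number; // overall deadline
  intervalMs?: number; // delay between attempts
}

// Re-evaluates `condition` until it returns true or the deadline passes.
// A throwing condition is treated as "not ready yet" and retried.
async function waitUntil(
  condition: () => boolean | Promise<boolean>,
  { timeoutMs = 10_000, intervalMs = 100 }: WaitOptions = {}
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;

  while (true) {
    try {
      if (await condition()) return;
    } catch (error) {
      lastError = error;
    }

    if (Date.now() >= deadline) {
      const detail = lastError === undefined ? "" : ` (last error: ${String(lastError)})`;
      throw new Error(`condition not met within ${timeoutMs}ms${detail}`);
    }

    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

In a test, a fixed `setTimeout(3000)` before the ClickHouse assertion would become a call such as `await waitUntil(async () => (await countRuns()) === 2)`, where `countRuns` is a hypothetical query helper. The test then finishes as soon as the rows land and only fails after the full deadline.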

apps/webapp/test/runsReplicationService.part9.test.ts (1)

11-51: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

makeMetricReaders is explicitly copied from part4.test.ts.

The comment at lines 11-12 acknowledges this is copy-pasted from another test file. Since this helper (and the polling helper waitForLagHistogram) is now needed in at least two part-test files, consider moving it into test/utils/tracing.ts alongside createInMemoryMetrics to avoid drift between the two copies.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cbfa6793-3aeb-419f-b19b-0c52a1d7cf8e

📥 Commits

Reviewing files that changed from the base of the PR and between 515b897 and 5de29cb.

📒 Files selected for processing (9)
  • apps/webapp/app/routes/admin.api.v1.runs-replication.backfill.ts
  • apps/webapp/app/routes/admin.api.v1.runs-replication.status.ts
  • apps/webapp/app/services/runsReplicationGlobal.server.ts
  • apps/webapp/app/services/runsReplicationInstance.server.ts
  • apps/webapp/app/services/runsReplicationService.server.ts
  • apps/webapp/app/v3/services/adminWorker.server.ts
  • apps/webapp/test/runsReplicationInstance.test.ts
  • apps/webapp/test/runsReplicationService.part8.test.ts
  • apps/webapp/test/runsReplicationService.part9.test.ts

Comment thread apps/webapp/app/services/runsReplicationInstance.server.ts
Comment thread apps/webapp/test/runsReplicationInstance.test.ts
Comment thread apps/webapp/test/runsReplicationService.part8.test.ts Outdated
d-cs force-pushed the runops/pr06-write-path branch from 515b897 to cb97148 on July 2, 2026 at 18:02
d-cs force-pushed the runops/pr07-replication branch from 5de29cb to 5c2d010 on July 2, 2026 at 18:02
pkg-pr-new (bot) commented Jul 2, 2026

@trigger.dev/build

npm i https://pkg.pr.new/@trigger.dev/build@11dc0b7

trigger.dev

npm i https://pkg.pr.new/trigger.dev@11dc0b7

@trigger.dev/core

npm i https://pkg.pr.new/@trigger.dev/core@11dc0b7

@trigger.dev/python

npm i https://pkg.pr.new/@trigger.dev/python@11dc0b7

@trigger.dev/react-hooks

npm i https://pkg.pr.new/@trigger.dev/react-hooks@11dc0b7

@trigger.dev/redis-worker

npm i https://pkg.pr.new/@trigger.dev/redis-worker@11dc0b7

@trigger.dev/rsc

npm i https://pkg.pr.new/@trigger.dev/rsc@11dc0b7

@trigger.dev/schema-to-json

npm i https://pkg.pr.new/@trigger.dev/schema-to-json@11dc0b7

@trigger.dev/sdk

npm i https://pkg.pr.new/@trigger.dev/sdk@11dc0b7

commit: 11dc0b7

d-cs force-pushed the runops/pr06-write-path branch from c59d9c5 to d5d7fa1 on July 2, 2026 at 19:25
d-cs force-pushed the runops/pr07-replication branch from 8fb6e8a to 4e08dc7 on July 2, 2026 at 19:25
d-cs force-pushed the runops/pr06-write-path branch from a1ff262 to a8068e9 on July 2, 2026 at 20:23
d-cs force-pushed the runops/pr07-replication branch from 4e08dc7 to 11dc0b7 on July 2, 2026 at 20:23
d-cs force-pushed the runops/pr06-write-path branch from a8068e9 to 0ef3a6b on July 2, 2026 at 20:38
d-cs force-pushed the runops/pr07-replication branch from 11dc0b7 to a9bc9e6 on July 2, 2026 at 20:38
d-cs force-pushed the runops/pr06-write-path branch from 0db90f0 to d5415e8 on July 2, 2026 at 21:44
d-cs force-pushed the runops/pr07-replication branch from a9bc9e6 to 277ecea on July 2, 2026 at 21:44
d-cs and others added 4 commits July 3, 2026 01:16
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…comments/test names

Remove the internal plan-enumeration labels from runs-replication
comments and test names, keeping the behavioral descriptions intact.
Comment/label hygiene only; no product logic or test behavior changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, fix test status type

- Register the implicit single source with id "legacy" so its leader-lock key
  matches the id the admin status route probes; otherwise leadership always
  reads false in the non-split config.
- Guard the shutdown-path client.stop() fan-out against re-firing per incoming
  transaction and add a catch so rejections don't surface as unhandled.
- Use the TaskRunStatus type alias (not the const value) for status annotations
  in the dual-source dedup tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
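
The shutdown guard described in the second bullet of that commit can be illustrated with a small sketch. This is not the code from the commit: `ShutdownFanOut`, `StoppableClient`, and `onError` are hypothetical names, and the only thing assumed about the real clients is that they expose an async `stop()`.

```typescript
// Hypothetical client shape: anything with an async stop().
interface StoppableClient {
  stop(): Promise<void>;
}

class ShutdownFanOut {
  // Set on the first request; later requests reuse it instead of re-firing.
  private stopPromise: Promise<void> | undefined;

  constructor(
    private readonly clients: StoppableClient[],
    private readonly onError: (error: unknown) => void = () => {}
  ) {}

  // Safe to call from a hot path such as a per-transaction handler.
  requestStop(): Promise<void> {
    if (this.stopPromise === undefined) {
      const stops = this.clients.map((client) =>
        // Promise.resolve().then(...) also captures synchronous throws from stop().
        Promise.resolve()
          .then(() => client.stop())
          .catch((error: unknown) => this.onError(error))
      );
      this.stopPromise = Promise.all(stops).then(() => undefined);
    }
    return this.stopPromise;
  }
}
```

Because every `stop()` rejection is routed through `onError`, the returned promise resolves even when a client fails to stop, so nothing surfaces as an unhandled rejection.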
d-cs force-pushed the runops/pr07-replication branch from 277ecea to 2bba3b8 on July 3, 2026 at 00:17