feat(run-ops): webapp write path — trigger/batch minting, idempotency routing, run lifecycle#4118
feat(run-ops): webapp write path — trigger/batch minting, idempotency routing, run lifecycle#4118d-cs wants to merge 11 commits into
Conversation
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR migrates numerous webapp services from direct Prisma queries to a RunStore/controlPlaneResolver abstraction, enabling split reads/writes between legacy and new ("run-ops") Postgres databases. It introduces residency-aware idempotency dedup client resolution, waitpoint read-through resolution, and mint-kind-based friendly ID generation (ksuid vs cuid) for run, batch, and failed-task creation that preserves parent/child residency lineage. Trigger, batch trigger, alert, session serialization, run listing, and cancellation services are updated to resolve environments and locked workers via controlPlaneResolver instead of eager Prisma includes, and to read/write batch and run records through RunStore methods rather than direct Prisma calls. Extensive new integration tests validate store routing behavior across heterogeneous Postgres containers (legacy vs new), residency inheritance, and passthrough single-database scenarios. Changes
Related PRs: None identified. Suggested labels: run-engine, database, testing Suggested reviewers: matt-aitken, ericallam 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
413a945 to
99643f8
Compare
515b897 to
cb97148
Compare
@trigger.dev/build
trigger.dev
@trigger.dev/core
@trigger.dev/python
@trigger.dev/react-hooks
@trigger.dev/redis-worker
@trigger.dev/rsc
@trigger.dev/schema-to-json
@trigger.dev/sdk
commit: |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts (1)
245-294: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
blockRunWithWaitpointstill writes viathis.prisma, not the resolveddedupClient.
dedupClient(computed above at line 155-174) is derived usingparentRunFriendlyId: request.body.options?.parentRunId— the exact sameparentRunIdused here to block the parent run's waitpoint (line 246, 279).dedupClientis precisely the client that owns this parent run's residency, yet the transaction at line 290 still passestx: this.prisma, the (possibly wrong) fallback client.If split mode is enabled and the parent run resides on the "new" store while
this.prismatargets the legacy store (or vice versa), this write would target the wrong database, failing to find the parent run row or silently writing state to a store that doesn't own it — contradicting the PR's core objective of routing writes to the store that owns the target run.🐛 Proposed fix
await this.engine.blockRunWithWaitpoint({ runId: RunId.fromFriendlyId(parentRunId), waitpoints: associatedWaitpoint!.id, spanIdToComplete: spanId, batch: request.options?.batchId ? { id: request.options.batchId, index: request.options.batchIndex ?? 0, } : undefined, projectId: request.environment.projectId, organizationId: request.environment.organizationId, - tx: this.prisma, + tx: dedupClient, });
🧹 Nitpick comments (8)
apps/webapp/app/runEngine/concerns/resolveWaitpointThroughReadThrough.server.ts (1)
44-49: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueConsider forwarding
logger/onLegacyReplicaReadfor parity with other read-through consumers.
ReadThroughDepssupportsloggerandonLegacyReplicaRead(saturation-signal hook), butResolveWaitpointDeps/this wrapper drop both, so legacy-replica reads for waitpoints won't emit the saturation signal that other read-through call sites presumably rely on for monitoring split-read health.apps/webapp/app/runEngine/services/batchTrigger.server.ts (2)
92-99: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDiscards the minted
idand re-derives it viaBatchId.fromFriendlyId.
mintBatchFriendlyIdreturns{ id, friendlyId }, but onlyfriendlyIdis kept here;idis recomputed later viaBatchId.fromFriendlyId(batchId)(lines 175, 266).createBatch.server.tsuses the returnediddirectly instead of re-deriving it. Functionally likely equivalent ifBatchId.fromFriendlyIdis a lossless decode, but it's redundant work and an inconsistency between two services from the same PR doing the same job.♻️ Proposed consistency fix
- const { friendlyId } = await mintBatchFriendlyId({ + const { id, friendlyId } = await mintBatchFriendlyId({ environment: { organizationId: environment.organizationId, id: environment.id, orgFeatureFlags: environment.organization.featureFlags, }, parentRunFriendlyId: body.parentRunId, });Then thread
idthrough to#createAndProcessBatchTaskRunand use it directly instead ofBatchId.fromFriendlyId(batchId).Please confirm
BatchId.fromFriendlyIdreliably reconstructs the sameidfor both ksuid- and cuid-shaped friendly ids before treating this purely as a style nit.Also applies to: 169-184, 265-275
359-374: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick winMissing batch is silently ignored — no log emitted.
When
findBatchTaskRunByIdreturns nothing, the function returns silently, unlike theenvironmentmiss two lines below which logs an error. Given store-routing bugs could make a batch invisible from the wrong store, this failure mode deserves the same observability.🔍 Proposed fix
const batch = await this._engine.runStore.findBatchTaskRunById(options.batchId); if (!batch) { + logger.error("[RunEngineBatchTrigger][processBatchTaskRun] Batch not found", { + options, + }); return; }apps/webapp/test/engine/streamBatchItems.test.ts (1)
655-662: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRepeated
PostgresRunStorewiring across 7 test cases.The same 4-line block assigning
engine.runStore = new PostgresRunStore({ prisma: racingPrisma, readOnlyPrisma: racingPrisma })appears 7 times. A small helper (e.g.attachRacingRunStore(engine, racingPrisma)) would reduce duplication and centralize any future changes to how the racing store is wired.Also applies to: 787-794, 919-926, 1052-1059, 1272-1279, 1411-1418, 1600-1604
apps/webapp/app/v3/services/createCheckpoint.server.ts (1)
149-154: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winBatch lookup correctly routed through RunStore.
Both
WAIT_FOR_BATCHlookups now callrunStore.findBatchTaskRunByFriendlyId(friendlyId, environmentId), matching the upstream contract that probes NEW then LEGACY sub-stores scoped by(friendlyId, environmentId). This correctly fixes the pre-existing gap where a raw single-DB Prisma query would miss a NEW-resident (ksuid) batch.The identical 5-line lookup + comment block is duplicated at both call sites (149-154 and 364-369). Extracting a small private helper would remove the duplication and centralize any future changes to the routing logic.
♻️ Proposed extraction
+ // Routed by friendlyId so a ksuid (NEW-resident) batch is found on the owning DB; + // env-scoped to the dependent attempt's run (a batch shares its dependent's env). + private async findWaitForBatchRun(batchFriendlyId: string, environmentId: string) { + return this.runStore.findBatchTaskRunByFriendlyId(batchFriendlyId, environmentId); + }Then replace both call sites with:
- // Routed by friendlyId so a ksuid (NEW-resident) batch is found on the owning DB; - // env-scoped to the dependent attempt's run (a batch shares its dependent's env). - const batchRun = await this.runStore.findBatchTaskRunByFriendlyId( - reason.batchFriendlyId, - attempt.taskRun.runtimeEnvironmentId - ); + const batchRun = await this.findWaitForBatchRun( + reason.batchFriendlyId, + attempt.taskRun.runtimeEnvironmentId + );Also applies to: 364-369
apps/webapp/app/v3/services/executeTasksWaitingForDeploy.ts (1)
74-111: 🩺 Stability & Availability | 🔵 TrivialSolid defense-in-depth split; consider a quarantine path for stuck NEW-resident runs.
The NEW/legacy split correctly prevents a control-plane
updateMany/enqueue from touching runs it can't actually own. One residual risk: if a NEW-resident run keeps getting selected byfindRuns(e.g. from a real misconfiguration), it will never transition out ofWAITING_FOR_DEPLOY, so this job will re-log the same error and potentially keep rescheduling itself (via therunsWaitingForDeploy.length > maxCountreschedule) on every poll, indefinitely.Consider adding a metric/alert-worthy signal or a way to skip re-selecting known-stuck NEW-resident runs.
apps/webapp/test/engine/triggerTask.test.ts (1)
2393-2402: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winRedundant dynamic
import()—generateKsuidIdis already statically imported.
generateKsuidIdis imported statically at the top of the file (line 33) and used directly a few lines later (line 2432:generateKsuidId()). The dynamicawait import("@trigger.dev/core/v3/isomorphic")here is unnecessary and inconsistent with the static-import usage elsewhere in the same file.♻️ Proposed fix
const parentFriendlyId = RunId.toFriendlyId( - // 27-char ksuid → classifies NEW - (await import("`@trigger.dev/core/v3/isomorphic`")).generateKsuidId() + // 27-char ksuid → classifies NEW + generateKsuidId() );As per coding guidelines: "Prefer static imports over dynamic
import(), and only use dynamic imports for unresolved circular dependencies, genuine code-splitting needs, or conditional runtime loading."Source: Coding guidelines
apps/webapp/test/idempotencyDedupResidency.test.ts (1)
45-103: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winDuplicate seeding helpers across split-seam test files.
seedOrgProjectEnv/seedRunhere are re-implemented nearly verbatim inapps/webapp/test/idempotencyKeyConcernLegacyAuthority.test.tsandapps/webapp/test/resetIdempotencyKeyLegacyAuthority.test.ts. Consider extracting a sharedheteroPostgresTestfixture-seeding helper module as this residency test suite keeps growing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d490b21c-676d-49ba-83a2-67771105b181
📒 Files selected for processing (50)
apps/webapp/app/runEngine/concerns/idempotencyKeys.server.tsapps/webapp/app/runEngine/concerns/idempotencyResidency.server.test.tsapps/webapp/app/runEngine/concerns/idempotencyResidency.server.tsapps/webapp/app/runEngine/concerns/resolveWaitpointThroughReadThrough.server.tsapps/webapp/app/runEngine/services/batchTrigger.server.tsapps/webapp/app/runEngine/services/createBatch.server.tsapps/webapp/app/runEngine/services/streamBatchItems.server.tsapps/webapp/app/runEngine/services/triggerFailedTask.server.tsapps/webapp/app/runEngine/services/triggerTask.server.test.tsapps/webapp/app/runEngine/services/triggerTask.server.tsapps/webapp/app/services/archiveBranch.server.tsapps/webapp/app/services/dashboardAgent.server.tsapps/webapp/app/services/deleteProject.server.tsapps/webapp/app/services/realtime/runReader.server.tsapps/webapp/app/services/realtime/sessions.server.tsapps/webapp/app/services/runsRepository/clickhouseRunsRepository.server.tsapps/webapp/app/services/runsRepository/runsRepository.server.tsapps/webapp/app/v3/services/alerts/deliverAlert.server.tsapps/webapp/app/v3/services/alerts/performTaskRunAlerts.server.tsapps/webapp/app/v3/services/batchTriggerV3.server.tsapps/webapp/app/v3/services/bulk/BulkActionV2.batchReadThrough.server.test.tsapps/webapp/app/v3/services/bulk/BulkActionV2.batchReadThrough.server.tsapps/webapp/app/v3/services/bulk/BulkActionV2.server.tsapps/webapp/app/v3/services/cancelDevSessionRuns.server.tsapps/webapp/app/v3/services/createCheckpoint.server.tsapps/webapp/app/v3/services/createTaskRunAttempt.server.tsapps/webapp/app/v3/services/enqueueDelayedRun.server.tsapps/webapp/app/v3/services/executeTasksWaitingForDeploy.tsapps/webapp/app/v3/services/expireEnqueuedRun.server.tsapps/webapp/app/v3/services/finalizeTaskRun.server.tsapps/webapp/app/v3/services/resumeBatchRun.server.tsapps/webapp/test/batchTriggerV3ResidencyInheritance.test.tsapps/webapp/test/batchTriggerV3StoreRouting.test.tsapps/webapp/test/bulkActionV2ReadRouting.test.tsapps/webapp/test/cancelDevSessionRunsStoreRouting.test.tsapps/webapp/test/engine/streamBatchItems.test.tsapps/webapp/test/engine/triggerFailedTask.test.tsapps/webapp/test/engine/triggerTask.test.tsapps/webapp/test/idempotencyDedupResidency.test.tsapps/webapp/test/idempotencyKeyConcernLegacyAuthority.test.tsapps/webapp/test/performTaskRunAlertsStoreRouting.test.tsapps/webapp/test/realtime/runReaderReadThrough.test.tsapps/webapp/test/realtime/streamRegistrationRouting.test.tsapps/webapp/test/resetIdempotencyKeyLegacyAuthority.test.tsapps/webapp/test/resolveWaitpointThroughReadThrough.readthrough.test.tsapps/webapp/test/runEngineBatchTriggerStoreRouting.test.tsapps/webapp/test/runsRepository.readthrough.test.tsapps/webapp/test/runsRepositoryCpres.test.tsapps/webapp/test/sessions.readthrough.test.tsapps/webapp/test/streamLoader.controlPlane.test.ts
|
Addressed the outside-diff note on |
26871d5 to
cdc4eb9
Compare
c59d9c5 to
d5d7fa1
Compare
cdc4eb9 to
e0b35d5
Compare
a8068e9 to
0ef3a6b
Compare
… routing, run lifecycle Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e in TriggerFailedTaskService TriggerFailedTaskService read the parent run via the ambient module-singleton store while the engine wrote the run through its own store, so a ksuid parent's row was not found and parentTaskRunId came back null. Add an optional injected runStore (defaults to the shared singleton, preserving production behaviour) and resolve the parent through it at both call sites, mirroring triggerTask.server.ts. Align the three affected webapp tests to read through the same store the engine wrote to: triggerFailedTask.test.ts passes engine.runStore; performTaskRunAlerts routing passes a passthrough store over the seeded container; triggerTask.test.ts stubs the run-ops db handles and pins split mode off so the idempotency dedup uses the container client. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…id-shape only Migration is deferred, so child/batch residency is a pure id-shape check. Remove the isKnownMigrated (and mint-only isSplitEnabled) deps from the mint sites (triggerTask, triggerFailedTask, batchTriggerV3) and call the now- synchronous resolveInheritedMintKind(parentFriendlyId) with no deps arg. Read paths: drop the isKnownMigrated re-probe-avoidance from the ClickHouse runs hydrate (probe all missing on legacy), the runsRepository readThrough options type, resolveWaitpointThroughReadThrough deps, and the BulkActionV2 batch seam adapter — keeping the genuine cross-seam fallback that reads NEW first for unclassifiable/legacy-candidate ids. Delete the injected-marker test cases; the remaining residency tests assert pure id-shape inheritance. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s and test names Review hygiene only: remove the NEW-1 label, Test X: name prefixes, and [TEST-NEWSEED] comment label. No product logic or test behavior changed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…o safe run-ops clients The read-through concern defaulted both newClient and legacyReplica to $replica (control-plane), so a bare caller that omits `deps` — the waitpoints wait route — never queried the dedicated run-ops replica. A co-located, NEW-resident waitpoint minted by streams.input().wait() lives on the run-ops-new DB, so the read missed, returned null, and the route 404'd (re-serialized to 500). Match the deps the complete/callback routes pass: default newClient to runOpsNewReplica, legacyReplica to $replica, and splitEnabled to runOpsSplitReadEnabled — mirroring readThroughRun's own self-defaulting. This immunizes any bare caller (present or future) against the control-plane pin, without touching the wait route. The wait/complete/callback call sites live on a higher branch and are unchanged; complete/callback keep their explicit deps (now redundant but harmless). Adds a heteroRunOps regression case driving the concern with no `deps` via the `defaults` DI seam: proves the old $replica default misses a NEW-resident waitpoint (null) while the safe run-ops default finds it. No mocks; the fallback is exercised against real PG14/PG17 containers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rvice prisma to the resolved store - Block the idempotent parent run's waitpoint via the residency-resolved dedup client instead of the fallback prisma, so the write lands on the store that owns the parent run. - Pass the caller-provided _prisma into WithRunEngine so a custom store isn't silently overridden by the module singleton. - Throw when a run-backed alert's environment can't be resolved instead of marking it SENT, so a transient replica miss doesn't permanently suppress the alert. - Pin splitEnabled:false in the waitpoint passthrough test so it exercises single-DB behaviour rather than relying on ksuid residency. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The write-path split added static `runOpsLegacyPrisma`/`runOpsNewPrisma` imports to idempotencyKeys.server.ts, which this test loads. vitest validates every named import against the `~/db.server` mock, so the mock now errored on the missing run-ops singletons. Add the four run-ops exports (empty stubs, same boundary pattern as the batchTriggerV3 residency test) and pin isSplitEnabled() to false so the dedup routing deterministically returns the injected fake prisma regardless of the ambient RUN_OPS_SPLIT_ENABLED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…setup Worker/engine/marqs/pubsub/socket singletons each construct an ioredis client at import time (singleton() + no lazyConnect), so any test importing the service graph opened real Redis connections on import. In CI there is no Redis, so these accumulate infinite-retry clients across a shard and take the suite down (locally they pass only because dev Redis is up). Globally mock the eager-Redis modules to no-op stubs in test/setup.ts: commonWorker, batchTriggerWorker, legacyRunEngineWorker, alertsWorker, the RunEngine and MarQS singletons, devPubSub and the socket.io server. Only these singletons are mocked — never the run store (~/v3/runStore.server, ~/db.server), which store-routing/residency tests need real against testcontainer Postgres. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0db90f0 to
d5415e8
Compare
…yConnect + stub runtime Redis singletons The setup-file mocks of the six eager worker/engine singletons were not enough: CI shards still flooded ECONNREFUSED/maxRetries. Two further classes of env-Redis usage survived them, reproduced locally by running the failing shards with REDIS_PORT pointed at a dead port: 1. Import-time construction: ~15 more singletons (platform cache, billing-limit reconcile queue, alerts rate limiter, DevPresence, auto-increment counter, s2 token cache, v1 streams cache, ...) build ioredis clients at module import, and ioredis dials on construction. A global ioredis mock now forces lazyConnect: true so clients only dial on first command — testcontainer-backed tests are unaffected (their first command connects as before). 2. Runtime commands inside code under test: tracePubSub.publish() (eventRepository writes), alertsRateLimiter.check() (deliverAlert) and the task metadata cache each issue commands against env-configured Redis mid-test; every command burns ~20 reconnect cycles before its error surfaces, which times the tests out. These three modules are now stubbed (metadata cache pinned to its Noop implementation, which is what CI's unset env resolves to anyway). Verified: webapp shards 2/5/6/8 (the ones failing on the pr06+ stack) run green with Redis pointed at a dead port, and shards 2/8 stay green against live Redis (store-routing suites still exercise the real run store). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…in CI CI runners have no .env, no REDIS_HOST/REDIS_PORT, and no Postgres at localhost:5432, which surfaced two failure layers that local runs mask (the dev stack answers on both): - suites transitively importing triggerTaskV1.server failed to collect because autoIncrementCounter.server.ts throws at import when REDIS_HOST/REDIS_PORT are unset (shards 2/5/6). Default the pair in test/setup.ts — the global ioredis lazyConnect mock means nothing dials. - TriggerFailedTaskService.call() resolved its event repository via getEventRepository → global prisma (feature-flag read + Prisma event repo), so in CI the swallowed connect error returned null friendlyIds (shard 8). Allow injecting the repository/store pair and bind the test to an EventRepository over the testcontainer DB. - once the cancelDevSessionRuns suite could collect, findLatestSession's hardwired global $replica was the next masked layer; give it an injectable client (defaulting to $replica) and pass the service's _replica through. Verified by replaying the exact CI env locally (.env hidden, workflow env vars, dead localhost DB, GITHUB_ACTIONS set): all four failing suites and full shards 2/5/6/8 reproduce the CI failures before and pass after. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Routes the webapp write path through the run-ops split seam: trigger/batch minting, idempotency-key resolution, and the run-lifecycle services now determine residency and dispatch writes to the correct store.
runEngine/services/triggerTask.server.ts,batchTrigger.server.ts,createBatch.server.ts,streamBatchItems.server.ts,v3/services/batchTriggerV3.server.ts): mint ids with the run-ops-aware minting and route creation/streaming through the store; batch children inherit the parent's residency.runEngine/concerns/idempotencyKeys.server.ts+ newidempotencyResidency.server.ts): idempotency-key lookup/dedup is residency-aware so a keyed retrigger resolves against the store that owns the original run.createCheckpoint,createTaskRunAttempt,enqueueDelayedRun,expireEnqueuedRun,finalizeTaskRun,resumeBatchRun,cancelDevSessionRuns,executeTasksWaitingForDeploy,triggerFailedTask): resolve their target run through the store rather than a fixed client.runsRepository+clickhouseRunsRepository,BulkActionV2+ batch read-through, realtimesessions/runReader, alertsdeliverAlert/performTaskRunAlerts): route through the read-through resolver.9535ae63d— resolves the parent run through an injectable run store inTriggerFailedTaskService.bf8f7c881— drops the "known-migrated" concept from write-path and read repos; residency is id-shape only.515b897ea— self-defaultsresolveWaitpointThroughReadThroughto the safe run-ops clients.Why
PR6 of the run-ops split stack. This is the write-path counterpart to the read foundation in the previous PRs: with it in place, both reads and writes route through the seam. Additive when the split is disabled (id-shape resolution collapses to the control-plane client); behavior-changing on the minting, idempotency, and lifecycle paths when enabled.
Tests
Large new/expanded vitest suite under
apps/webapp/test/and colocated service tests: trigger-task and batch-trigger store routing, residency inheritance, idempotency dedup residency + legacy-authority, bulk-action read routing, cancel-dev-session routing, alerts store routing, runs-repository read-through, realtime session/run-reader read-through and stream-registration routing, and the waitpoint read-through default. Testcontainers-backed; no mocks.Notes
Draft, stacked on #4117 (
runops/pr05-webapp-foundation). Review that first; this diff is against it.Server-change / changeset note to be added at stack-assembly time.
🤖 Generated with Claude Code