fix(webapp): harden the realtime session routes #3890

Open: ericallam wants to merge 4 commits into main from fix/session-route-hardening

Conversation

@ericallam

Member

Summary

Reliability and authorization fixes for realtime chat sessions:

  • Session-stream waitpoint delivery is scoped to the environment, so two environments using the same session externalId can no longer complete each other's waitpoints.
  • The session snapshot-url routes now enforce per-session authorization, and appending to a session's out channel requires secret-key auth, so a session-scoped token can't read another session's snapshot or forge assistant output.
  • Appends that carry an X-Part-Id header are deduplicated on retry, so a retried send can't duplicate a message.
  • Session creation rejects expired sessions (instead of triggering a run that can never receive input), externalId is immutable after creation, and the sessions list endpoint returns friendly run_* ids to match the single-session routes.

Rollout

The waitpoint cache key gains an environment prefix. To keep waitpoints registered by the previous deploy working across the boundary, the drain reads both the new and the previous key for this release; the legacy read can be removed a release later once no pre-deploy waitpoints remain.
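
The dual-read drain described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the in-memory `sets` map stands in for the Redis sets, the key prefixes and the `registerWaitpoint` / `drainWaitpoints` helpers are hypothetical, and only the `buildKey(environmentId, addressingKey, io)` and `buildLegacyKey(addressingKey, io)` shapes are taken from the review discussion.

```typescript
type Io = "in" | "out";

// In-memory stand-in for the Redis sets that hold waitpoint ids (assumption).
const sets = new Map<string, Set<string>>();

// Hypothetical key formats; the real prefixes are not shown in this PR.
function buildKey(environmentId: string, addressingKey: string, io: Io): string {
  return `ssw-env:${environmentId}:${addressingKey}:${io}`;
}

function buildLegacyKey(addressingKey: string, io: Io): string {
  return `ssw:${addressingKey}:${io}`;
}

// Hypothetical helper: add a waitpoint id to the set stored at `key`.
function registerWaitpoint(key: string, waitpointId: string): void {
  const set = sets.get(key) ?? new Set<string>();
  set.add(waitpointId);
  sets.set(key, set);
}

// Read and delete both the environment-scoped key and the legacy key,
// returning each waitpoint id at most once.
function drainWaitpoints(environmentId: string, addressingKey: string, io: Io): string[] {
  const keys = [buildKey(environmentId, addressingKey, io), buildLegacyKey(addressingKey, io)];
  const drained = new Set<string>();
  for (const key of keys) {
    for (const id of Array.from(sets.get(key) ?? [])) {
      drained.add(id);
    }
    sets.delete(key);
  }
  return Array.from(drained);
}
```

Once no pre-deploy waitpoints remain, dropping the legacy entry from `keys` would leave only the environment-scoped read.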

Scope session stream waitpoint delivery to the environment so two
environments using the same session externalId can never complete each
other's waitpoints. Add the missing authorization checks to the
session snapshot-url routes and restrict out-channel appends to secret
key auth, so a session-scoped token cannot read other sessions'
snapshots or forge assistant output. Appends that carry an X-Part-Id
header are now deduplicated on retry, session creation rejects expired
sessions, externalId is immutable after creation, and the sessions
list endpoint returns friendly run ids.
Drain both the environment-scoped session-stream waitpoint key and the
previous unscoped key, so a waitpoint registered by the prior deploy still
wakes its run across the deploy boundary. The legacy read can be dropped a
release later once no pre-deploy waitpoints remain.
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Contributor

Review Change Stack

Walkthrough

This PR hardens realtime session handling across multiple endpoints. The core change introduces environment-scoped session-stream cache keying and per-part append claim/release idempotency. Session list uses a batched friendly-run serializer; session creation rejects already-expired requests; PATCH rejects changes to externalId. Snapshot GET/PUT and out-channel appends enforce explicit authorization, out-channel appends require PRIVATE auth and support X-Part-Id idempotency, and all waitpoint operations are environment-scoped.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description provides a clear summary of changes and rollout considerations, but does not follow the repository's required template structure with checklist, testing, and changelog sections. | Add the required template sections including checklist confirmation, testing steps, and structured changelog entry as specified in CONTRIBUTING.md guidelines. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 78.57% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: hardening realtime session routes with security and reliability improvements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@ericallam ericallam marked this pull request as ready for review June 10, 2026 14:55
coderabbitai[bot]

This comment was marked as resolved.

devin-ai-integration[bot]

This comment was marked as resolved.

Scope the batched session run-id lookup to the caller environment and
project so a stale currentRunId pointer cannot resolve a run in another
tenant. Escape the user-supplied segments of the append idempotency key
so a colon in an externalId or X-Part-Id cannot collide and falsely
suppress an append. Keep the waitpoint drain running on an idempotent
retry: a duplicate append is skipped but still drains, so a retry whose
first attempt died before waking the waitpoint can still recover it.
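
A rough sketch of the key-escaping idea from this commit, under stated assumptions: the `session-part` prefix, the segment order, and both function names are hypothetical, and percent-encoding via `encodeURIComponent` is one possible escaping scheme rather than necessarily the one the PR uses. The naive builder is included only to show the collision being avoided.

```typescript
// Naive key: a ":" inside a user-supplied segment is indistinguishable
// from the separator, so different inputs can produce the same key.
function buildNaivePartKey(externalId: string, partId: string): string {
  return `${externalId}:${partId}`;
}

// Escaped key: percent-encode each user-supplied segment so that ":" can
// only ever appear as the separator. encodeURIComponent also encodes "%",
// so distinct inputs map to distinct keys.
// Note: encodeURIComponent throws a URIError on lone surrogates, which a
// caller would need to handle.
function buildEscapedPartKey(
  environmentId: string,
  externalId: string,
  io: "in" | "out",
  partId: string
): string {
  return [
    "session-part", // hypothetical prefix
    environmentId,
    encodeURIComponent(externalId),
    io,
    encodeURIComponent(partId),
  ].join(":");
}

// The naive builder collides for these two different (externalId, partId) pairs:
const naiveA = buildNaivePartKey("a:b", "c");
const naiveB = buildNaivePartKey("a", "b:c");

// The escaped builder keeps them apart:
const escapedA = buildEscapedPartKey("env_1", "a:b", "out", "c");
const escapedB = buildEscapedPartKey("env_1", "a", "out", "b:c");
```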
@changeset-bot

changeset-bot Bot commented Jun 10, 2026


⚠️ No Changeset found

Latest commit: 4593ef8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

coderabbitai[bot]

This comment was marked as resolved.

Claim the part id with SET NX before appending instead of a read then
write, so two concurrent or retried POSTs with the same X-Part-Id can
never both write a record. The claim is released when the append fails
so a genuine retry still proceeds.
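
The claim-then-append flow in this commit might look roughly like the sketch below. It is synchronous and uses an in-memory map in place of Redis so it can run on its own; `setIfAbsent` models the semantics of `SET key value NX`, and `appendOnce` and the key names are hypothetical. A real implementation would be asynchronous and would also set a TTL on the claim.

```typescript
// In-memory stand-in for Redis string keys (assumption).
const claims = new Map<string, string>();

// Models SET key value NX: succeeds only when the key is absent.
function setIfAbsent(key: string, value: string): boolean {
  if (claims.has(key)) return false;
  claims.set(key, value);
  return true;
}

type AppendResult = "appended" | "duplicate";

// Claim the part id first, then append. A second caller with the same key
// loses the claim and skips the write. If the append throws, the claim is
// released so a genuine retry can still proceed.
function appendOnce(partKey: string, append: () => void): AppendResult {
  if (!setIfAbsent(partKey, "1")) {
    return "duplicate";
  }
  try {
    append();
    return "appended";
  } catch (error) {
    claims.delete(partKey);
    throw error;
  }
}
```

Because the claim is a single atomic step, there is no window between a read and a write in which two requests with the same X-Part-Id could both decide to append.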

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
apps/webapp/app/services/sessionStreamWaitpointCache.server.ts (2)

186-227: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the append claim owner-aware before releasing it.

claimSessionStreamPart() stores a constant "1", and releaseSessionStreamPart() later deletes by key only. If the first request runs past the 5-minute TTL, a retry can re-claim the same partId, and the original request's failure path will then delete the newer claim. That reopens the key and lets a third append win, which defeats the idempotency this PR is trying to add. Store a unique claim token per winner and release with a compare-and-delete script so only the current owner can clear its own claim.
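
One way to picture the owner-aware claim suggested here, as a sketch only: the in-memory map with explicit timestamps models `SET key token NX PX ttl` plus a compare-and-delete release, and `claimPart` / `releasePart` are hypothetical names, not the functions in this PR. The Lua string is the commonly used Redis compare-and-delete script; how it would be invoked depends on the Redis client in use.

```typescript
type Claim = { token: string; expiresAt: number };

// In-memory stand-in for Redis keys with expiry (assumption).
const claims = new Map<string, Claim>();

// The 5-minute TTL mentioned in the review comment.
const CLAIM_TTL_MS = 5 * 60 * 1000;

// Commonly used Redis script: delete the key only if it still holds our token.
const COMPARE_AND_DELETE_LUA = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`;

// Models SET key token NX PX ttl: succeeds when the key is absent or expired.
function claimPart(key: string, token: string, nowMs: number): boolean {
  const existing = claims.get(key);
  if (existing && existing.expiresAt > nowMs) return false;
  claims.set(key, { token, expiresAt: nowMs + CLAIM_TTL_MS });
  return true;
}

// Models the compare-and-delete script: only the current owner can release.
function releasePart(key: string, token: string): boolean {
  const existing = claims.get(key);
  if (!existing || existing.token !== token) return false;
  claims.delete(key);
  return true;
}
```

In the scenario from the comment, a slow first request holding token "A" outlives the TTL, a retry claims the key with token "B", and the first request's failure path then calls `releasePart(key, "A")`, which returns false and leaves the newer claim in place.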


120-126: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear the legacy waitpoint key on remove during the rollout window too.

drainSessionStreamWaitpoints() still reads and deletes buildLegacyKey(addressingKey, io), but removeSessionStreamWaitpoint() only touches the new env-scoped key. A pre-deploy waitpoint that gets completed by the .wait() race-check cleanup will stay stranded in the legacy set until the next append re-drains it, which can trigger a second completeWaitpoint attempt. Mirror the legacy cleanup here until the fallback read is removed.

Possible fix
 export async function removeSessionStreamWaitpoint(
   environmentId: string,
   addressingKey: string,
   io: "out" | "in",
   waitpointId: string
 ): Promise<void> {
   if (!redis) return;

   try {
     const key = buildKey(environmentId, addressingKey, io);
-    await redis.srem(key, waitpointId);
+    const legacyKey = buildLegacyKey(addressingKey, io);
+    await redis.multi().srem(key, waitpointId).srem(legacyKey, waitpointId).exec();
   } catch (error) {

Also applies to: 239-249


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: aff7664e-3c4d-467c-8f1b-8de14055d624

📥 Commits

Reviewing files that changed from the base of the PR and between 22680b7 and 4593ef8.

📒 Files selected for processing (2)
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (15)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (5, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (1, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (6, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (4, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (10, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (7, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (3, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (8, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (2, 10)
  • GitHub Check: webapp / 🧪 Unit Tests: Webapp (9, 10)
  • GitHub Check: e2e-webapp / 🧪 E2E Tests: Webapp
  • GitHub Check: typecheck / typecheck
  • GitHub Check: 🛡️ E2E Auth Tests (full)
  • GitHub Check: Analyze (actions)
  • GitHub Check: Analyze (javascript-typescript)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

Import from @trigger.dev/sdk when writing Trigger.dev tasks. Never use @trigger.dev/sdk/v3 or deprecated client.defineJob

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

**/*.{ts,tsx,js,jsx}: Prefer static imports over dynamic imports. Only use dynamic import() when circular dependencies cannot be resolved, code splitting is needed for performance, or the module must be loaded conditionally at runtime
Import subpaths only from packages/core (@trigger.dev/core), never import from the root

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: Access environment variables through the env export of env.server.ts instead of directly accessing process.env
Use subpath exports from @trigger.dev/core package instead of importing from the root @trigger.dev/core path

Use named constants for sentinel/placeholder values (e.g. const UNSET_VALUE = '__unset__') instead of raw string literals scattered across comparisons

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
apps/webapp/**/*.server.ts

📄 CodeRabbit inference engine (apps/webapp/CLAUDE.md)

apps/webapp/**/*.server.ts: Never use request.signal for detecting client disconnects. Use getRequestAbortSignal() from app/services/httpAsyncStorage.server.ts instead, which is wired directly to Express res.on('close') and fires reliably
Access environment variables via env export from app/env.server.ts. Never use process.env directly
Always use findFirst instead of findUnique in Prisma queries. findUnique has an implicit DataLoader that batches concurrent calls and has active bugs even in Prisma 6.x (uppercase UUIDs returning null, composite key SQL correctness issues, 5-10x worse performance). findFirst is never batched and avoids this entire class of issues

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
**/*.{js,ts,tsx,jsx,css,json,md}

📄 CodeRabbit inference engine (AGENTS.md)

Use Prettier for code formatting and run pnpm run format before committing

Files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
🧠 Learnings (9)
📚 Learning: 2026-03-22T13:26:12.060Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3244
File: apps/webapp/app/components/code/TextEditor.tsx:81-86
Timestamp: 2026-03-22T13:26:12.060Z
Learning: In the triggerdotdev/trigger.dev codebase, do not flag `navigator.clipboard.writeText(...)` calls for `missing-await`/`unhandled-promise` issues. These clipboard writes are intentionally invoked without `await` and without `catch` handlers across the project; keep that behavior consistent when reviewing TypeScript/TSX files (e.g., usages like in `apps/webapp/app/components/code/TextEditor.tsx`).

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
📚 Learning: 2026-03-22T19:24:14.403Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 3187
File: apps/webapp/app/v3/services/alerts/deliverErrorGroupAlert.server.ts:200-204
Timestamp: 2026-03-22T19:24:14.403Z
Learning: In the triggerdotdev/trigger.dev codebase, webhook URLs are not expected to contain embedded credentials/secrets (e.g., fields like `ProjectAlertWebhookProperties` should only hold credential-free webhook endpoints). During code review, if you see logging or inclusion of raw webhook URLs in error messages, do not automatically treat it as a credential-leak/secrets-in-logs issue by default—first verify the URL does not contain embedded credentials (for example, no username/password in the URL, no obvious secret/token query params or fragments). If the URL is credential-free per this project’s conventions, allow the logging.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma error P1001 ("Can't reach database server") in TypeScript, don’t assume a single error shape. Prisma can surface P1001 via two different error classes/fields: `PrismaClientKnownRequestError` exposes it as `err.code === "P1001"` (common during mid-query connection drops), while `PrismaClientInitializationError` exposes it as `err.errorCode === "P1001"` (common on client startup failure). Therefore, predicates should use `err.code === "P1001" || err.errorCode === "P1001"`. Do not flag `err.code === "P1001"` as “unreachable/never matches,” as it is expected in production.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
📚 Learning: 2026-05-18T08:21:27.694Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3632
File: apps/webapp/sentry.server.ts:4-21
Timestamp: 2026-05-18T08:21:27.694Z
Learning: When handling Prisma errors for P1001 ("Can't reach database server"), do not assume it only appears under a single property name. Prisma may surface P1001 via either `PrismaClientKnownRequestError` (`err.code === "P1001"`, e.g., mid-query connection drops) or `PrismaClientInitializationError` (`err.errorCode === "P1001"`, e.g., client startup connection failure). To reliably detect the condition, check `err.code === "P1001" || err.errorCode === "P1001"`, and avoid review rules that would incorrectly flag `err.code === "P1001"` as unreachable/never-matching.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
📚 Learning: 2026-03-26T09:02:07.973Z
Learnt from: myftija
Repo: triggerdotdev/trigger.dev PR: 3274
File: apps/webapp/app/services/runsReplicationService.server.ts:922-924
Timestamp: 2026-03-26T09:02:07.973Z
Learning: When parsing Trigger.dev task run annotations in server-side services, keep `TaskRun.annotations` strictly conforming to the `RunAnnotations` schema from `trigger.dev/core/v3`. If the code already uses `RunAnnotations.safeParse` (e.g., in a `#parseAnnotations` helper), treat that as intentional/necessary for atomic, schema-accurate annotation handling. Do not recommend relaxing the annotation payload schema or using a permissive “passthrough” parse path, since the annotations are expected to be written atomically in one operation and should not contain partial/legacy payloads that would require a looser parser.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
📚 Learning: 2026-05-05T09:38:02.512Z
Learnt from: d-cs
Repo: triggerdotdev/trigger.dev PR: 3523
File: apps/webapp/app/routes/api.v3.batches.ts:178-181
Timestamp: 2026-05-05T09:38:02.512Z
Learning: When reviewing code that catches `ServiceValidationError` in `*.server.ts` files, do not blindly forward `error.status` to HTTP responses, because SVEs may be thrown with non-default statuses (e.g., 400/500) and forwarding them can cause client-visible behavioral regressions (e.g., surfacing 500s to clients). Prefer a safe default response status of `error.status ?? 422`, but only after confirming via the reachable call graph that the caught `ServiceValidationError` instances are expected to carry those non-default statuses; otherwise, normalize to `422` to avoid unexpected client-visible 5xx behavior.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
📚 Learning: 2026-05-12T21:04:05.815Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 3542
File: apps/webapp/app/components/sessions/v1/SessionStatus.tsx:1-3
Timestamp: 2026-05-12T21:04:05.815Z
Learning: In this Remix + TypeScript codebase, do not flag a server/client boundary violation when a file imports only types from a module matching `*.server`.

Specifically, it’s safe to import types using `import type { Foo } from "*.server"` or `import { type Foo } from "*.server"` because TypeScript erases type-only imports at compile time and they emit no JavaScript, so they won’t cross the Remix server/client bundle boundary.

Only raise the boundary concern for value imports (e.g., `import { Foo }` without `type`, or `import Foo`), since those produce JavaScript output.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
📚 Learning: 2026-06-04T18:16:35.386Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 3836
File: apps/supervisor/src/backpressure/backpressureMonitor.ts:3-5
Timestamp: 2026-06-04T18:16:35.386Z
Learning: When reviewing TypeScript in this repo, apply the rule “prefer type aliases over interfaces” only to data/object shapes and union/intersection type modeling. If an interface is being used as a behavioral contract for collaborators to implement (e.g., method-shape interfaces that define required behavior, such as `BackpressureLogger` / `BackpressureSignalSource` in `apps/supervisor/src/backpressure/backpressureMonitor.ts`), keep it as an `interface` and do not flag it as a type-alias-vs-interface violation.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
📚 Learning: 2026-06-09T17:58:04.699Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 3879
File: apps/webapp/app/models/vercelIntegration.server.ts:619-630
Timestamp: 2026-06-09T17:58:04.699Z
Learning: In this codebase, outbound raw `fetch` calls should typically rely on Node/undici’s default request timeout (about ~300s) rather than adding a per-call `AbortController` + `setTimeout` wrapper inside individual functions (e.g. in files like `apps/webapp/app/models/vercelIntegration.server.ts`). During code review, do not flag the absence of a per-call timeout on a single `fetch` as an issue; if per-call timeouts are needed, they should be implemented via a codebase-wide convention (e.g., a shared fetch wrapper or documented pattern) rather than ad-hoc per-function changes.

Applied to files:

  • apps/webapp/app/services/sessionStreamWaitpointCache.server.ts
  • apps/webapp/app/routes/realtime.v1.sessions.$session.$io.append.ts
