Tags: triggerdotdev/trigger.dev
fix: suffix retried instance create names to dodge stale registrations

A failed create can leave its instance name registered gateway/fcrun-side until async cleanup runs, so a same-name retry can 409 against our own residue (observed: tap-EBUSY 500 at 18:29Z followed by 409 name_conflict on the retry 2.7s later, costing the full redrive anyway).

Give retry attempts a deterministic -rN suffix; attempt 1 keeps the unsuffixed name so the non-retry path is unchanged. The suffixed name flows into both the instance name and TRIGGER_RUNNER_ID from the same variable. Every downstream flow (suspend scheduling, snapshot dispatch, cancel guards, run-engine fields) treats it as one opaque self-reported token, and restored VMs already carry deterministic name suffixes.

Temporary measure (TRI-10293): the proper fix is gateway-side cleanup of failed-create registrations.
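The naming rule described in this commit can be sketched as below. This is a minimal illustration, not the actual supervisor code: `instanceNameForAttempt` and `buildCreateRequest` are hypothetical names, and it assumes that N in `-rN` is the attempt number (so the second attempt gets `-r2`), which the commit message does not spell out.

```typescript
// Hypothetical helper: derives the instance name for a given create attempt.
// Assumption: N in "-rN" is the 1-based attempt number.
function instanceNameForAttempt(baseName: string, attempt: number): string {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError(`attempt must be a positive integer, got ${attempt}`);
  }
  // Attempt 1 keeps the unsuffixed name so the non-retry path is unchanged.
  return attempt === 1 ? baseName : `${baseName}-r${attempt}`;
}

// Hypothetical request builder: the instance name and TRIGGER_RUNNER_ID are
// taken from the same variable, so they can never drift apart.
function buildCreateRequest(baseName: string, attempt: number) {
  const name = instanceNameForAttempt(baseName, attempt);
  return { name, env: { TRIGGER_RUNNER_ID: name } };
}
```

Because the suffix is a pure function of the attempt number, a retry never reuses a name that a failed earlier attempt may still hold.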
fix: retry transient instance create failures instead of abandoning the run

ComputeWorkloadManager.create swallows gateway errors by design, so a cold start that fails placement (e.g. a netns slot with a busy tap, a full node disk) silently abandons the dequeued run until the run engine's PENDING_EXECUTING timeout redrives it minutes later. These failures are transient per placement (redriven runs virtually always succeed), so retry the create up to 3 times with short backoff before giving up.

Gateway 5xx and network-level fetch failures are retried; 4xx responses (won't heal) and timeouts (the instance may still be provisioning) are not.
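A rough sketch of the retry policy described above, under stated assumptions: the `CreateFailure` and `CreateResult` types, `isRetryable`, `createWithRetry`, and the backoff values are all hypothetical, and "up to 3 times" is read here as three attempts in total. The real `ComputeWorkloadManager.create` is not shown in the commit message.

```typescript
// Hypothetical failure model for a gateway create call.
type CreateFailure =
  | { kind: "http"; status: number }
  | { kind: "network" }
  | { kind: "timeout" };

type CreateResult = { ok: true } | { ok: false; failure: CreateFailure };

// Gateway 5xx and network-level failures are retried; 4xx and timeouts are not.
function isRetryable(failure: CreateFailure): boolean {
  switch (failure.kind) {
    case "http":
      return failure.status >= 500; // a 4xx won't heal on retry
    case "network":
      return true;
    case "timeout":
      return false; // the instance may still be provisioning
  }
}

// Assumption: 3 attempts in total, with an illustrative linear backoff.
async function createWithRetry(
  create: (attempt: number) => Promise<CreateResult>,
  maxAttempts = 3,
  backoffMs: (failedAttempt: number) => number = (n) => 250 * n,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<CreateResult> {
  let result = await create(1);
  for (let attempt = 2; attempt <= maxAttempts; attempt++) {
    if (result.ok || !isRetryable(result.failure)) return result;
    await sleep(backoffMs(attempt - 1));
    result = await create(attempt);
  }
  return result; // success, or the last failure once attempts are exhausted
}
```

Passing the attempt number into `create` lets the caller vary the request per attempt, for example to pick a different instance name on retries.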
fix: cancel pending delayed snapshots when the run completes or disconnects

The compute suspend flow delays snapshots by snapshotDelayMs to avoid wasted work on short-lived waitpoints, with the intent that a run which continues before the delay expires cancels the pending snapshot. But the only cancel() call site is the /continue workload action, which runners only invoke when restoring from an already-taken snapshot, so a pending snapshot is never actually cancelled (zero snapshot.canceled events in prod).

When a run resumes and completes within the delay window, the stale snapshot fires anyway and fcrun pauses the VM for ~6-13s while its controller is mid warm-start long-poll. The frozen guest can't fire its abort timer or send a FIN, so firestarter keeps the connection claimable past the client deadline and dispatches runs into it, each one a ~300s stall (TRI-10293).

Cancel the pending snapshot when the attempt completes and when the run socket disconnects. Genuine waitpoint suspensions keep the runner socket connected and the attempt incomplete, so neither hook cancels a snapshot that is still wanted. Cancellation is guarded by runnerId so a stale duplicate runner for a reassigned run can't cancel the new runner's pending snapshot.
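The runnerId guard on cancellation can be illustrated with a small timer registry. This is only a sketch: `DelayedSnapshots`, its method names, and the `takeSnapshot` callback are hypothetical and do not reflect the real compute suspend code.

```typescript
type Timer = ReturnType<typeof setTimeout>;

// Hypothetical registry of snapshots that fire after snapshotDelayMs.
class DelayedSnapshots {
  private readonly pending = new Map<string, { runnerId: string; timer: Timer }>();
  private readonly snapshotDelayMs: number;
  private readonly takeSnapshot: (runId: string, runnerId: string) => void;

  constructor(
    snapshotDelayMs: number,
    takeSnapshot: (runId: string, runnerId: string) => void
  ) {
    this.snapshotDelayMs = snapshotDelayMs;
    this.takeSnapshot = takeSnapshot;
  }

  // Schedules a snapshot, replacing any earlier pending one for the same run.
  schedule(runId: string, runnerId: string): void {
    this.clear(runId);
    const timer = setTimeout(() => {
      this.pending.delete(runId);
      this.takeSnapshot(runId, runnerId);
    }, this.snapshotDelayMs);
    this.pending.set(runId, { runnerId, timer });
  }

  // Intended call sites: attempt completion and run socket disconnect.
  // Returns false when nothing is pending or the caller is not the runner
  // that scheduled the snapshot (a stale duplicate runner).
  cancel(runId: string, runnerId: string): boolean {
    const entry = this.pending.get(runId);
    if (!entry || entry.runnerId !== runnerId) return false;
    this.clear(runId);
    return true;
  }

  private clear(runId: string): void {
    const entry = this.pending.get(runId);
    if (entry) {
      clearTimeout(entry.timer);
      this.pending.delete(runId);
    }
  }
}
```

Keying the map by run and storing the runnerId alongside the timer is one simple way to make a cancel from the wrong runner a no-op.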
feat(supervisor): 60s dequeue latency bucket to bracket the retry-exhausted error envelope
chore: add changeset for dequeue latency histogram
feat(supervisor): publish client-side dequeue API latency as a Prometheus histogram (#3887)

The supervisor's dequeue round-trip time (`POST /engine/v1/worker-actions/dequeue`) was measured but only flowed into wide events and OTel span attributes. There was no Prometheus series, so latency percentiles and error rates weren't queryable.

This adds `queue_consumer_pool_dequeue_duration_seconds` (histogram, label `outcome=success|empty|error`) to the existing consumer-pool metrics, scraped automatically by the existing ServiceMonitors on queue-raider/schedule-raider/supervisor.

- Records every dequeue call, including failed ones, which previously emitted no timing at all
- The pool's shared `ConsumerPoolMetrics` instance is injected into each consumer (mirrors the `BackpressureMetrics` → `BackpressureMonitor` wiring)
- Buckets extend to 30s because `wrapZodFetch` retries internally (5 attempts, ≥7.5s backoff before a retryable error surfaces)
- Existing `dequeueResponseMs` wide-event/span behavior unchanged
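To illustrate how every dequeue call, including a failed one, can end up with a timing sample and an `outcome` label, here is a minimal sketch. `DurationObserver` and `timedDequeue` are hypothetical names; the interface is only assumed to resemble a labelled histogram's `observe(labels, value)` call, and the real `ConsumerPoolMetrics` wiring is not shown in the commit message.

```typescript
type DequeueOutcome = "success" | "empty" | "error";

// Hypothetical stand-in for a labelled duration histogram.
interface DurationObserver {
  observe(labels: { outcome: DequeueOutcome }, seconds: number): void;
}

// Wraps a dequeue call so that exactly one sample is recorded per call.
async function timedDequeue<T>(
  dequeue: () => Promise<T[]>,
  metrics: DurationObserver,
  now: () => number = () => Date.now()
): Promise<T[]> {
  const start = now();
  // Default to "error" so a thrown exception is still labelled correctly.
  let outcome: DequeueOutcome = "error";
  try {
    const messages = await dequeue();
    outcome = messages.length > 0 ? "success" : "empty";
    return messages;
  } finally {
    // Runs on both the success and the failure path.
    metrics.observe({ outcome }, (now() - start) / 1000);
  }
}
```

Recording in `finally` is what guarantees that failed calls emit a sample; the error itself still propagates to the caller unchanged.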
chore: release v4.5.0-rc.5 (#3808)

## Summary

1 new feature, 8 improvements, 1 bug fix.

## Highlights

- Add optional `shouldPauseScaling` to the supervisor consumer pool scaling options to freeze scale-up while it returns true (scale-down stays allowed). ([#3836](#3836))

## Improvements

- The MCP server no longer tells the AI agent to wait for a run to complete after every `trigger_task` call. Waiting is now opt-in: the agent only waits when you ask it to (for example "trigger and then wait for it to finish"). This avoids burning tokens polling runs you didn't need to block on and keeps responses clearer. ([#3838](#3838))
- Update the bundled OpenTelemetry packages to their latest releases (`@opentelemetry/sdk-node` 0.218.0, `@opentelemetry/core` 2.7.1, `@opentelemetry/host-metrics` 0.38.3). ([#3810](#3810))
- `envvars.upload` now accepts an optional `isSecret` flag, letting you create the imported variables as secret (redacted) environment variables. When omitted, variables default to non-secret. ([#3809](#3809))
- Offload large trigger payloads to object storage before sending the trigger API request. The SDK uploads packets at or above the existing 128KB limit and sends an `application/store` pointer instead of embedding large JSON in the request body. `TriggerTaskRequestBody` now validates that `application/store` payloads are non-empty storage paths. ([#3785](#3785))
- Make mollifier buffer and drainer internals configurable. `MollifierBuffer` now accepts `ackGraceTtlSeconds`, `maxRetriesPerRequest`, `reconnectStepMs`, and `reconnectMaxMs` options, and `MollifierDrainer` accepts `maxBackoffMs` and `backoffFloorMs`. All default to their previous hardcoded values, so existing behaviour is unchanged. ([#3822](#3822))
- `MollifierDrainer` accepts a `drainBatchSize` option (default 1) that controls how many entries are popped per env per tick; in-flight handlers remain capped by the global `concurrency`. `MollifierBuffer` also gains `getDrainingCount()` / `listStaleDraining()`, backed by a new `mollifier:draining` ZSET maintained atomically with pop/ack/fail/requeue (observability-only). ([#3797](#3797))
- Adds AI SDK 7 support. The `ai` peer range now includes v7, and the `chat.agent` / chat surfaces work against v7's ESM-only build. On v7, install `@ai-sdk/otel` alongside `ai` and the SDK registers it for you so `experimental_telemetry` spans keep flowing into your run traces (v7 stopped emitting them from `ai` core). v5 and v6 keep working unchanged. ([#3833](#3833))
- `useTriggerChatTransport` now recovers when restored session state points at a session that no longer exists in the current environment ([#3816](#3816))

## Bug fixes

- Fix `@trigger.dev/core` build: cast the underlying log record exporter when calling `forceFlush` so it typechecks against the updated OpenTelemetry `LogRecordExporter` type (which no longer declares `forceFlush`). ([#3829](#3829))

<details>
<summary>Raw changeset output</summary>

⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ `main` is currently in **pre mode** so this branch has prereleases rather than normal releases. If you want to exit prereleases, run `changeset pre exit` on `main`. ⚠️ ⚠️ ⚠️ ⚠️ ⚠️ ⚠️

# Releases

## @trigger.dev/build@4.5.0-rc.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

## trigger.dev@4.5.0-rc.5

### Patch Changes

- The MCP server no longer tells the AI agent to wait for a run to complete after every `trigger_task` call. Waiting is now opt-in: the agent only waits when you ask it to (for example "trigger and then wait for it to finish"). This avoids burning tokens polling runs you didn't need to block on and keeps responses clearer. ([#3838](#3838))
- Update the bundled OpenTelemetry packages to their latest releases (`@opentelemetry/sdk-node` 0.218.0, `@opentelemetry/core` 2.7.1, `@opentelemetry/host-metrics` 0.38.3). ([#3810](#3810))
- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`
  - `@trigger.dev/build@4.5.0-rc.5`
  - `@trigger.dev/schema-to-json@4.5.0-rc.5`

## @trigger.dev/core@4.5.0-rc.5

### Patch Changes

- Add optional `shouldPauseScaling` to the supervisor consumer pool scaling options to freeze scale-up while it returns true (scale-down stays allowed). ([#3836](#3836))
- Fix `@trigger.dev/core` build: cast the underlying log record exporter when calling `forceFlush` so it typechecks against the updated OpenTelemetry `LogRecordExporter` type (which no longer declares `forceFlush`). ([#3829](#3829))
- `envvars.upload` now accepts an optional `isSecret` flag, letting you create the imported variables as secret (redacted) environment variables. When omitted, variables default to non-secret. ([#3809](#3809))

  ```ts
  await envvars.upload("proj_1234", "prod", {
    variables: { STRIPE_SECRET_KEY: "sk_live_..." },
    isSecret: true,
  });
  ```

- Offload large trigger payloads to object storage before sending the trigger API request. The SDK uploads packets at or above the existing 128KB limit and sends an `application/store` pointer instead of embedding large JSON in the request body. `TriggerTaskRequestBody` now validates that `application/store` payloads are non-empty storage paths. ([#3785](#3785))

  Payload uploads use the same resolved `ApiClient` as the trigger call (including `requestOptions.clientConfig`), not only the global `apiClientManager.client`, so custom `baseURL`, access token, and preview branch apply to both presign and trigger.

- Update the bundled OpenTelemetry packages to their latest releases (`@opentelemetry/sdk-node` 0.218.0, `@opentelemetry/core` 2.7.1, `@opentelemetry/host-metrics` 0.38.3). ([#3810](#3810))

## @trigger.dev/plugins@4.5.0-rc.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

## @trigger.dev/python@4.5.0-rc.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/sdk@4.5.0-rc.5`
  - `@trigger.dev/core@4.5.0-rc.5`
  - `@trigger.dev/build@4.5.0-rc.5`

## @trigger.dev/react-hooks@4.5.0-rc.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

## @trigger.dev/redis-worker@4.5.0-rc.5

### Patch Changes

- Make mollifier buffer and drainer internals configurable. `MollifierBuffer` now accepts `ackGraceTtlSeconds`, `maxRetriesPerRequest`, `reconnectStepMs`, and `reconnectMaxMs` options, and `MollifierDrainer` accepts `maxBackoffMs` and `backoffFloorMs`. All default to their previous hardcoded values, so existing behaviour is unchanged. ([#3822](#3822))
- `MollifierDrainer` accepts a `drainBatchSize` option (default 1) that controls how many entries are popped per env per tick; in-flight handlers remain capped by the global `concurrency`. `MollifierBuffer` also gains `getDrainingCount()` / `listStaleDraining()`, backed by a new `mollifier:draining` ZSET maintained atomically with pop/ack/fail/requeue (observability-only). ([#3797](#3797))
- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

## @trigger.dev/rsc@4.5.0-rc.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

## @trigger.dev/schema-to-json@4.5.0-rc.5

### Patch Changes

- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

## @trigger.dev/sdk@4.5.0-rc.5

### Patch Changes

- Adds AI SDK 7 support. The `ai` peer range now includes v7, and the `chat.agent` / chat surfaces work against v7's ESM-only build. On v7, install `@ai-sdk/otel` alongside `ai` and the SDK registers it for you so `experimental_telemetry` spans keep flowing into your run traces (v7 stopped emitting them from `ai` core). v5 and v6 keep working unchanged. ([#3833](#3833))
- `useTriggerChatTransport` now recovers when restored session state points at a session that no longer exists in the current environment ([#3816](#3816))
- Offload large trigger payloads to object storage before sending the trigger API request. The SDK uploads packets at or above the existing 128KB limit and sends an `application/store` pointer instead of embedding large JSON in the request body. `TriggerTaskRequestBody` now validates that `application/store` payloads are non-empty storage paths. ([#3785](#3785))

  Payload uploads use the same resolved `ApiClient` as the trigger call (including `requestOptions.clientConfig`), not only the global `apiClientManager.client`, so custom `baseURL`, access token, and preview branch apply to both presign and trigger.

- Update the bundled OpenTelemetry packages to their latest releases (`@opentelemetry/sdk-node` 0.218.0, `@opentelemetry/core` 2.7.1, `@opentelemetry/host-metrics` 0.38.3). ([#3810](#3810))
- Updated dependencies:
  - `@trigger.dev/core@4.5.0-rc.5`

</details>

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
fix(supervisor): fail open on engaged verdict with no fresh timestamp

When maxVerdictAgeMs is set, an engaged verdict must carry a fresh ts; a missing or stale ts can't be trusted (a dead producer could otherwise pin the brake), so treat it as not-engaged.
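The fail-open rule can be expressed as a small pure function. This is a sketch under assumptions: the `Verdict` shape and the `isBrakeEngaged` name are hypothetical, and `ts` is assumed to be an epoch timestamp in milliseconds.

```typescript
// Hypothetical verdict shape; ts is assumed to be epoch milliseconds.
type Verdict = { engaged: boolean; ts?: number };

function isBrakeEngaged(
  verdict: Verdict | undefined,
  maxVerdictAgeMs: number | undefined,
  nowMs: number
): boolean {
  if (verdict === undefined || !verdict.engaged) return false;
  // No freshness requirement configured: trust the verdict as-is.
  if (maxVerdictAgeMs === undefined) return true;
  // Missing ts can't be trusted: fail open so a dead producer can't pin the brake.
  if (verdict.ts === undefined) return false;
  // Stale ts is treated the same way as a missing one.
  return nowMs - verdict.ts <= maxVerdictAgeMs;
}
```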
fix(supervisor): also strip DOCKER_REGISTRY_PASSWORD from debug env log

Pre-existing secret that wasn't excluded from envWithoutSecrets; add it to the strip-list alongside the backpressure redis password.
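A strip-list of this kind might look like the sketch below. Only `DOCKER_REGISTRY_PASSWORD` comes from the commit message; `stripSecrets` is a hypothetical function name and `BACKPRESSURE_REDIS_PASSWORD` is a placeholder, since the real variable name for the backpressure redis password is not given.

```typescript
// Keys removed before the environment is written to a debug log.
const SECRET_ENV_KEYS = new Set<string>([
  "DOCKER_REGISTRY_PASSWORD",
  "BACKPRESSURE_REDIS_PASSWORD", // hypothetical name, not taken from the source
]);

// Returns a copy of env without the secret keys; the input is left untouched.
function stripSecrets(
  env: Record<string, string | undefined>
): Record<string, string | undefined> {
  const safe: Record<string, string | undefined> = {};
  for (const [key, value] of Object.entries(env)) {
    if (!SECRET_ENV_KEYS.has(key)) safe[key] = value;
  }
  return safe;
}
```

A denylist like this only protects the keys it names, which is exactly how a pre-existing secret can slip through until it is added.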