Netflix TechBlog - Medium

Dynamic Repartitioning for Time Series Workloads

Netflix Technology Blog — Wed, 03 Jun 2026 02:05:05 GMT

By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan

Introduction

Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal event data with millisecond latency. We use Apache Cassandra 4.x as the underlying storage for these main reasons:

Throughput, latency, and cost: Cassandra can handle millions of low‑latency reads and writes in a cost-effective manner.
Operational maturity: Our data platform team has deep operational expertise running large Cassandra clusters in production.

However, using Cassandra at this scale introduces trade‑offs for TimeSeries workloads. A key challenge is wide partitions, as TimeSeries dataset partitions can grow quite large with events accumulating over time.

This problem is further compounded by the fact that TimeSeries servers routinely deal with a very high read throughput:

Reads/second for different datasets

This post walks through our journey to reduce the impact of wide partitions in our TimeSeries datasets, the solutions we built, and the lessons we learned.

Note: Although this post walks through re-partitioning in Cassandra, the same techniques can be applied more broadly to other data stores.

Impact of Wide Partitions

For most of our datasets, we observe an average read latency on the order of single-digit milliseconds:

Ideal Latency for Reads (ms)

However, in some datasets, as partitions grow too wide, we observe high read latencies on the order of seconds, especially at the tail:

High Tail Latency for Reads (seconds)

This can result in timeouts:

Read timeouts / second

In extreme cases, if most of the reads target wide partitions, we can see Garbage Collection pauses, high CPU utilization and thread queueing.

High CPU utilization and thread-queueing in Cassandra clusters

Scaling up the underlying Cassandra cluster is always an option, but we need smarter alternatives than just throwing more money at the problem.

TimeSeries Partitioning Strategy

The TimeSeries Abstraction was designed to solve the problem of wide partitions by dividing the data into discrete time chunks. For more in-depth information, refer to our previous blog.

To summarize, here is an illustration of how TimeSeries partitioning strategy helps us break up wide partitions into manageable chunks.

Time Series partitioning breaking up a dataset into Time slices, time buckets and event buckets

This strategy further allows us to efficiently query and drop data based on time, without having to deal with tombstones.
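As a rough illustration of the three levels, the sketch below derives a partition key for a single event. The function name, the bucket arithmetic, the table naming, and the hash-based event bucket are assumptions made for illustration; this post does not specify how the service computes these values.

```python
import hashlib
from datetime import datetime, timezone


def partition_key(ts_id, event_id, event_time, slice_start,
                  seconds_per_bucket, event_buckets):
    """Return (time_slice, time_series_id, time_bucket, event_bucket).

    Hypothetical layout, not the service's actual implementation.
    """
    # Time slice: assumed to be one table per slice, named after the slice start date.
    time_slice = "data_" + slice_start.strftime("%Y%m%d")
    # Time bucket: index of the fixed-width window within the slice.
    offset_seconds = int((event_time - slice_start).total_seconds())
    time_bucket = offset_seconds // seconds_per_bucket
    # Event bucket: a stable hash of the event id spreads one window
    # across several partitions.
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    event_bucket = int.from_bytes(digest[:8], "big") % event_buckets
    return (time_slice, ts_id, time_bucket, event_bucket)
```

Because each Time Slice maps to its own table in this sketch, dropping old data is a table drop rather than a row-level delete, which is why tombstones are avoided.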

Picking the Partitioning Strategy

When a namespace (a.k.a. dataset) is created, users must specify their anticipated workload characteristics. This specification is then fed into our provisioning pipeline. The pipeline processes these inputs, runs Monte Carlo simulations, and produces an optimal infrastructure and partition configuration.

Provisioning picks optimal infra and configuration based on user inputs

You can learn more about our methodology of capacity planning in this insightful AWS re:Invent talk given by one of our stunning colleagues.

The Problem with the Current Approach

Although this method of provisioning is effective in many situations, it proves insufficient for TimeSeries workloads under these conditions:

Workload is unknown or inaccurately estimated: Early on in a project, users can lack a reliable picture of production traffic or simply misestimate key parameters.
Workload evolves over time: Traffic patterns, client behavior, and product requirements change. A “good” partitioning strategy on day one can become inefficient months later.
Data outliers exist: Not all TimeSeries IDs behave the same. A small percentage of IDs can receive a vastly higher volume of events than the rest.

Fortunately, our design with discrete Time Slices gives us a natural escape hatch for the first two scenarios; each new Time Slice can use a different partitioning strategy.

Each Time Slice can have a unique partition strategy

However, manually adjusting these configurations in a fleet that has thousands of TimeSeries datasets is not sustainable. We need automation.

Solution 1: Time Slice Re-Partitioning

Cassandra exposes useful introspection APIs for understanding data usage and access patterns. For example, nodetool tablehistograms provides percentile distributions of partition sizes in a table. Using these tools, we can detect both over- and under-partitioning.

Below is an example of over‑partitioning, where the TimeSeries provisioning pipeline selected very small time_bucket intervals based on user provided inputs:

Provisioning selected 60s time buckets based on user inputs

causing partitions to have less than 10 KB of data, leading to high read amplification and thread queueing:

Histogram of the given Cassandra table showing partition size percentiles

To tune partition strategies efficiently, we added a background worker that monitors the partition histograms of the Time Slices attached to a given application and exposes them via a Cassandra virtual table:

Histograms exposed through a Cassandra Virtual table

It then computes an adjustment factor when it detects partition sizes that do not meet a configured density. This configured density is often set between 2 MiB and 10 MiB, depending on the workload.

DynamicTimeSliceConfigWorker: 
namespace: my_dataset_1
Observed: TimeSlices have p99 partitions below configured target of 10MB. 
Proposed: time_bucket interval: 60s -> 604800s
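The adjustment can be sketched as a simple ratio, assuming partition size grows roughly linearly with the time_bucket interval. The function name and the menu of allowed intervals are hypothetical; the worker's real heuristics are not described in this post.

```python
# Hypothetical menu of time_bucket widths in seconds (minute, hour, day, week).
ALLOWED_INTERVALS_S = [60, 3600, 86400, 604800]


def propose_interval(current_s, observed_p99_bytes, target_bytes):
    """Scale the time_bucket interval so the p99 partition size approaches the target."""
    if observed_p99_bytes <= 0:
        return current_s  # no data observed, keep the current setting
    # Adjustment factor: how far the observed density is from the target.
    factor = target_bytes / observed_p99_bytes
    ideal_s = current_s * factor
    # Choose the largest allowed interval that does not overshoot the ideal width;
    # fall back to the smallest interval if every option overshoots.
    candidates = [i for i in ALLOWED_INTERVALS_S if i <= ideal_s]
    return max(candidates) if candidates else min(ALLOWED_INTERVALS_S)
```

With 60-second buckets, a p99 of about 1 KB, and a 10 MiB target, this sketch lands on 604800 seconds. The same function shrinks the interval when partitions are wider than the target.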

The worker can then update future Time Slices with the new partition strategy:

Partitioning adjusted for future Time Slice(s)

This strategy has yielded real results in reducing our read latencies, as well as reducing the number of timeouts caused by thread queueing.

Reduction in tail latency and thread queueing for

However, this strategy only works when most of the data in a table is skewed enough to warrant re-partitioning the entire table. It does not help when only a subset of IDs within the table are wide.

We have a couple of options here:

Do Nothing: This is sometimes the right approach if there is no observed impact to the application’s top-level metrics.
Partial Returns: We implemented a ‘Partial Return’ feature, which aborts an in-flight request once it breaches a configured latency SLO and returns whatever data it has collected up to that point. This is a great option for clients who care more about latency than about fetching all the data.

Tail latency drops around the SLO cutoff as Partial Returns are enabled

Block IDs: This is an extreme step but worth mentioning, because bad data occasionally seeps into the system, e.g., test or spam IDs that can make the system unstable.

dgwts.config..block.Ids: ", , "
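Of the options above, Partial Returns is the easiest to picture in code. The following is a minimal sketch, assuming the read is paginated and the deadline is checked between pages; the function name and the injectable clock are hypothetical.

```python
import time

_EXHAUSTED = object()  # sentinel marking the end of the page stream


def read_with_partial_return(pages, slo_seconds, clock=time.monotonic):
    """Collect pages until the source is exhausted or the latency SLO is breached.

    Returns (records, is_partial).
    """
    deadline = clock() + slo_seconds
    records = []
    page_iter = iter(pages)
    while True:
        if clock() >= deadline:
            # SLO breached: stop fetching and return what we have so far.
            # Conservative: flagged partial even if no pages remained.
            return records, True
        page = next(page_iter, _EXHAUSTED)
        if page is _EXHAUSTED:
            return records, False
        records.extend(page)
```

The caller receives a flag alongside the records, so it can tell a complete result from a truncated one.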

Ultimately, we encounter scenarios where valid and important TimeSeries IDs accumulate a high enough volume of events, with callers needing to process all the related data. Simply tolerating elevated latencies or timeouts when querying these IDs is not a desirable outcome.

This is where dynamic partitioning comes into play.

Solution 2: Dynamic Partitioning per ID

Dynamic partitioning is an asynchronous pipeline that auto-detects and splits wide partitions on a TimeSeries ID level rather than at the table level.

It has three main stages:

Detection: Detects wide partitions for a given TimeSeries ID during the read path.
Planning & Splitting: Plans and executes splits of those partitions into optimal sizes asynchronously.
Serving Reads: Re-routes the read queries transparently to read data from the split partitions when ready.

This is how it works at a high level; we will dive into details after:

Dynamic Wide Partition Split Async Pipeline

Here are the different stages of the pipeline:

Detection

Every TimeSeries read operation tracks how many bytes are read for a given partition. If the bytes read exceed a configured threshold, the server emits a detection event to Kafka:

{
  "time_slice": "data_20260328", // the Cassandra table this event was detected in
  "time_series_id": "profileId:123", // the ID detected as wide
  "time_bucket": 7, // the existing time_bucket partition
  "event_bucket": 2, // the existing event_bucket partition
  "immutable": true, // TimeSeries servers can compute if this partition is no longer receiving writes
  "version": "0" // reserved for future use e.g. invalidate if partition is no longer immutable
}

Our decision to detect wide partitions on reads, as opposed to writes, is based on our observation that the majority of the data in the wild doesn’t need this treatment. The slight downside is that some reads on these large partitions may suffer sub-optimal performance for a very short duration (typically seconds) until this process catches up.
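A minimal sketch of read-path detection follows. The class name and the publish callback are hypothetical stand-ins; a real deployment would hand the serialized event to a Kafka producer, which is not shown here. The event fields mirror the example above.

```python
import json


class WidePartitionDetector:
    """Emits a detection event when a single partition read exceeds a byte threshold."""

    def __init__(self, threshold_bytes, publish):
        self.threshold_bytes = threshold_bytes
        self.publish = publish  # hypothetical callback, e.g. a wrapper around a Kafka producer

    def on_partition_read(self, time_slice, time_series_id, time_bucket,
                          event_bucket, bytes_read, immutable):
        if bytes_read <= self.threshold_bytes:
            return False  # common case: partition is not wide, nothing to emit
        event = {
            "time_slice": time_slice,
            "time_series_id": time_series_id,
            "time_bucket": time_bucket,
            "event_bucket": event_bucket,
            "immutable": immutable,
            "version": "0",
        }
        self.publish(json.dumps(event, sort_keys=True))
        return True
```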

Immutability

Although splitting mutable partitions is possible, it is inherently more complex. As a first step towards solving this problem, we chose to reduce the surface area of this change by focusing on immutable partitions, while still meaningfully reducing caller timeouts.

Planning

Detection may be triggered by a partial read, so the planner must still read the entire partition once to compute an accurate split plan. Checkpointing is crucial here: if a planning read fails before processing the entire partition, the process resumes from the last saved checkpoint.

Checkpointing

The wide_row metadata table serves as the backbone for state transitions and checkpointing of partition splits. It also stores information that is used later by TimeSeries servers to properly route Read queries.

wide_row metadata for storing split states and checkpoints
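The resumable planning scan can be sketched as follows, assuming a paged reader that returns a continuation token and a dictionary standing in for the wide_row metadata table. The function name, the checkpoint fields, and the reader signature are all hypothetical.

```python
import math


def plan_split(read_page, checkpoints, key, target_bytes):
    """Scan a partition page by page, checkpointing progress after every page.

    read_page(token) returns (rows, next_token); next_token is None at the end.
    Returns the number of target partitions once the full scan has completed.
    """
    default = {"token": None, "bytes": 0, "done": False}
    state = dict(checkpoints.get(key, default))
    while not state["done"]:
        rows, next_token = read_page(state["token"])  # may raise, e.g. on timeout
        state["bytes"] += sum(len(row) for row in rows)
        state["token"] = next_token
        state["done"] = next_token is None
        checkpoints[key] = dict(state)  # persist so a retry resumes from here
    return max(1, math.ceil(state["bytes"] / target_bytes))
```

If a page read raises midway, the next call with the same key picks up from the stored token instead of rescanning the partition from the start.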

Splitting

The Planner delegates the splitting of data to an appropriate split-strategy. For example, if EventBucketPartitionSplitStrategy is selected, we split the partition by assigning more event buckets to the same time bucket. If the partition is ultra-wide, we cap the number of event buckets we split into, in order to control the resultant read amplification. Spreading into multiple partitions in such cases is still beneficial in order to spread the read workload to multiple Cassandra replicas.

Split by assigning more event buckets for a given time bucket

Further, since the Splitter has the full view of the partition, it can ensure total sort order across all the split buckets.
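One way to picture an event-bucket split that preserves total order is to sort the events once and assign contiguous ranges to consecutive buckets. This is a sketch under that assumption; the function name and the byte-based range assignment are illustrative, not the actual EventBucketPartitionSplitStrategy.

```python
import math


def split_into_event_buckets(events, target_bytes, max_buckets, start_event_bucket):
    """Split (sort_key, payload) events into contiguous, ordered event buckets.

    The bucket count is derived from the target size and capped at max_buckets
    to bound read amplification for ultra-wide partitions.
    """
    ordered = sorted(events, key=lambda e: e[0])
    total = sum(len(payload) for _, payload in ordered)
    wanted = max(1, math.ceil(total / target_bytes))
    n = min(wanted, max_buckets)
    buckets = {start_event_bucket + i: [] for i in range(n)}
    seen = 0
    for sort_key, payload in ordered:
        # Bucket index grows monotonically with bytes seen, so reading the
        # buckets in numeric order yields the events in total sort order.
        idx = min(n - 1, seen * n // total) if total else 0
        buckets[start_event_bucket + idx].append((sort_key, payload))
        seen += len(payload)
    return buckets
```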

Validating Splits

The Planner stores a pre-split checksum of a given partition during the planning phase, while the Splitter computes and stores the post-split checksum. The split status is marked as completed only if the two checksums match.

Ensure checksums match pre- and post-split before marking a split as COMPLETED
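The checksum gate can be sketched as below, assuming rows are (key, payload) byte pairs and using SHA-256 over the sorted, length-prefixed rows. The hash choice, the function names, and the FAILED status are assumptions; the post only names the COMPLETED status.

```python
import hashlib


def partition_checksum(rows):
    """Digest over all (key, payload) rows, independent of how they are bucketed."""
    digest = hashlib.sha256()
    for key, payload in sorted(rows):
        for part in (key, payload):
            # Length-prefix each field so adjacent fields cannot be confused.
            digest.update(len(part).to_bytes(8, "big"))
            digest.update(part)
    return digest.hexdigest()


def finalize_split(pre_split_checksum, split_buckets):
    """Mark the split COMPLETED only if the post-split data matches the original."""
    post = partition_checksum(row for bucket in split_buckets for row in bucket)
    return "COMPLETED" if post == pre_split_checksum else "FAILED"
```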

Tracking Splits

The pre- and post-split partition sizes across different datasets are tracked to see how effectively the partition splits are being planned and executed:

Track pre- and post-split partition sizes to ensure we are splitting optimally

Serving Reads

The TimeSeries servers load the partition-keys of completed splits periodically into in-memory Bloom filters. Every read operation checks the Bloom filter to see whether a query can be diverted to the split partitions.

Here is what the Read path looks like:

Read path for diverting reads to existing or split partitions

The size of the Bloom filters is monitored to ensure we have enough memory per server. Due to the compactness of partition keys, and ratio of wide partitions in a given dataset, the filters fit comfortably in each server instance.

Bloom filter approximate element count per namespace and time slice

Checking the Bloom filter on every read request to see whether a given partition key is wide typically takes single-digit microseconds or less, making this diversion practically invisible to callers.

Latency for checking Bloom filters is too small for callers to notice the diversion
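For readers unfamiliar with the data structure, here is a small self-contained Bloom filter using the standard sizing formulas and double hashing. It is a teaching sketch, not the implementation the TimeSeries servers use, and the key format in the usage example is hypothetical.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        n = max(1, expected_items)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.m = max(8, math.ceil(-n * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))
```

A miss is definitive, so the common case (a partition that was never split) skips the metadata lookup entirely. A hit only means the key may have been split, which is why the metadata lookup follows.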

For the cases that do end up with a Bloom filter hit, the TimeSeries servers look up the wide_row metadata to see how to read a specific wide partition:

{
  "pre_split_data": {
    "time_slice": "data_20260328",
    "time_series_id": "6313825", → What to read
    "time_bucket": 0,
    "event_bucket": 2
    …
  },
  "post_split_data": {
    "time_slice": "wide_data_20260328_0", → Where to read it from
    "event_bucket_partition_strategy": { → Strategy to delegate to for reading
      "target_event_buckets": 2,
      "start_event_bucket": 32 → How should the strategy read it
    }
  }
  …
}

This metadata read is backed by a read-through cache, making it quite performant:

Metadata fetch latency is too low to affect read operations

Finally, the reads for the split partitions are delegated to our existing PartitionReader, which reads N smaller partitions in parallel, rather than 1 large partition, improving overall performance and stability!

Read much smaller partitions in parallel and merge results
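The parallel read and merge step can be sketched with a thread pool and a k-way merge. The read_bucket callable and the (sort_key, payload) row shape are hypothetical; the parameter names follow the start_event_bucket and target_event_buckets fields from the metadata above.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def read_split_partitions(read_bucket, start_event_bucket, target_event_buckets):
    """Read N split partitions in parallel and merge them into one sorted result.

    read_bucket(event_bucket) must return rows sorted by their first element.
    """
    buckets = range(start_event_bucket, start_event_bucket + target_event_buckets)
    with ThreadPoolExecutor(max_workers=target_event_buckets) as pool:
        # map() returns results in bucket order regardless of completion order.
        per_bucket = list(pool.map(read_bucket, buckets))
    # k-way merge of already-sorted inputs keeps the overall sort order.
    return list(heapq.merge(*per_bucket, key=lambda row: row[0]))
```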

Fallbacks

The existing wide partition from the original time slice is never deleted. This helps us in creating safe fallbacks in many different scenarios of partial failures and eventual consistency. The slightly larger storage space we use as a result is worth the operational safety we gain.

Building Additional Confidence

Serving incorrect reads would be disastrous. To establish trust beyond checksums, we leveraged additional mechanisms such as:

Using our existing Data Bridge pipelines to verify splits offline:

Spark job to ensure that the split data is an exact match to the original data

Implementing a phased rollout strategy to safely advance through stages as our confidence in the system grew:

Advance through Read modes once previous mode passes checks

A critical part of this phased rollout was the Comparison phase, which compared bytes served by old read path and the new read path while in shadow mode:

A chart of bytes match vs bytes differ in a given shadow period

Results

As a result of these dynamic splits, we see a huge improvement in the average read latency of most wide partitions, bringing it down from seconds:

Existing average latency for reading wide partitions

to low double-digit milliseconds!

Average latency for reading dynamically split partitions

Tail latencies of reading wide partitions dropped from several seconds:

Existing tail latency for reading wide partitions

to around 200 ms or better:

Tail latency for reading dynamically split partitions

resulting in a drop in read timeouts:

Overall, this has resulted in a more stable Cassandra cluster with lower CPU utilization and little to no thread queueing:

Low CPU utilization and no thread-queueing

Further, for extremely wide rows, where a dataset would previously face constant timeouts and unavailability blips, the service was able to paginate and query 500MB+ partitions while remaining available:

grpc … com.netflix.dgw.ts.TimeSeriesService/SearchEventRecords -d
'{"namespace": "...",
    "search_query": {...},
    "time_interval": {
      "start": "2026-05-11T23:42:51.484398Z",
      "end": "2026-05-12T00:13:50.694205Z"
    },
    "pageSize" : 1000
  }'
# Response:
{
  "next_page_token" : ….,
  "records": [
    {
      …
    }
  ],
  "response_context": [{
    "namespace": "...",
    …
    # Trades elevated latency for being available
    "time_taken": "41.072410142s"
    }
  ]
}

Conclusion

There is more work planned around this feature, like splitting mutable wide partitions, or re-processing previously failed splits, but this has been a successful start in improving service performance and reducing our support burden.

Further, we would like to highlight some key lessons that we learned at different points in this journey.

Reducing Surface Area: As a first step, explore simpler solutions that can still deliver meaningful impact. Also, reducing the surface area of a complex change and deploying incrementally pays off operationally.
Building Confidence: Invest time and resources to build confidence in new features, especially when justified by the feature complexity, deployment blast radius, and/or potential impact.

Acknowledgements: Special thanks to our stunning colleagues who further contributed to this feature’s success: Tom DeVoe, Chris Lohfink, Sumanth Pasupuleti and Joey Lynch.

Dynamic Repartitioning for Time Series Workloads was originally published in Netflix TechBlog on Medium.

High-Throughput Graph Abstraction at Netflix: Part I

Netflix Technology Blog — Fri, 29 May 2026 18:49:13 GMT

By Oleksii Tkachuk, Kartik Sathyanarayanan, Rajiv Shringi

Introduction

Netflix has a diverse range of graph use cases, each serving specific business needs with unique functionality and performance requirements. These use cases fall into two broad categories:

OLAP: These use cases typically involve open-ended and algorithmic exploration of large graph datasets. They often utilize industry-standard models and languages such as RDF with SPARQL, Property Graphs with Gremlin or openCypher, and even SQL. The primary focus in these situations is in-depth analysis, rather than achieving high throughput and low latency.
OLTP: These use cases require extremely high throughput — up to millions of operations per second — while delivering traversal results within milliseconds. Achieving such a level of performance often requires making trade-offs, which can include accepting eventual consistency or restricting query complexity. For example, the service can demand a specified starting point for traversals and enforce a maximum traversal depth. Such use cases are often directly tied to streaming or user experiences and demand high global availability.

Netflix’s Graph Abstraction was designed specifically for this second category of use cases. As of this writing, the abstraction is handling close to 10 million operations per second across 650 TB of graph datasets with low latency and cost efficiency.

This post is the first in a multi-part series that explores the Graph Abstraction architecture in depth. We’ll cover how the abstraction indexes data for real-time and historical views, manages strongly typed graphs, performs efficient traversals, and integrates with the Netflix Big Data ecosystem.

Usage at Netflix

From a business standpoint, the primary driver for developing the Graph Abstraction was internal demand for supporting several key use cases:

Real-Time Distributed Graph (RDG): A graph capturing dynamic relationships across entities and interactions throughout the Netflix ecosystem. You can learn more about the initial RDG implementation in this insightful blog post. This functionality has since been integrated into the Graph Abstraction.
Social Graph: A graph of social connections within Netflix Gaming, designed to boost user engagement.
Service Topology: A graph of all internal Netflix services, used for real-time and historical analysis to improve root cause analysis during incidents.

Let’s examine the overall architecture of the Graph Abstraction and how it integrates with the Netflix Online Datastore ecosystem.

Architecture

Instead of building the persistence and caching layers from scratch, we chose to build taller on top of existing Netflix data abstractions.

The Key-Value (KV) Abstraction stores the latest view of nodes and edges, serving as the real-time index for all queries. Optionally, users can plug in the TimeSeries (TS) Abstraction if they are interested in a historical view of how the graph evolves over time. Additionally, we use EVCache to achieve low-millisecond latencies and are actively experimenting with more specialized caching layers to further improve performance. Finally, the Graph Abstraction integrates with the Data Gateway Control Plane to manage graph schemas and automate the provisioning, deletion, and configuration of datasets in both KV and TS.

Property Graph Model

The Abstraction uses the Property Graph model to store its data. The graph consists of nodes and edges of various types, each with associated properties. These properties are strongly typed to enable efficient filtering and ensure consistent data exports. For semantic reasons, edges can be either unidirectional or bidirectional.

Namespaces

The Abstraction separates data into isolated units called “namespaces.” Each namespace is associated with a physical storage layer, as configured in the Data Gateway Control Plane, and can be deployed on either dedicated or shared hardware. The optimal, most cost-effective hardware configuration is determined by our provisioning automation, based on user-provided requirements such as throughput, latency, dataset size, and workload criticality. For more details on this topic, see this talk given by our stunning colleague Joey Lynch at AWS re:Invent.

Graph Schema

Each namespace is further associated with an explicit graph schema configured in the Control Plane. The graph schema defines node and edge types, allowed properties, permitted relationships, and directions.

The Graph schema is implemented as a collection of edge mappings that describe the nature of the relationship between given node types.

{
  "edgeConfig": {
    "edgeMappings": [
      {
        "edgeMappingKey": {
          "fromNodeType": "account",
          "edgeType": "owns",
          "toNodeType": "profile"
        },
        "directionType": "UNIDIRECTIONAL"
      },
      {
        "edgeMappingKey": {
          "fromNodeType": "profile",
          "edgeType": "linked_to",
          "toNodeType": "device"
        },
        "directionType": "BIDIRECTIONAL"
      }
    ]
  }
}

Edge mappings are further extended with a property schema that specifies the allowed property names and their types:

{
   "edgeMappingKey":{
      "fromNodeType":"profile",
      "edgeType":"linked_to",
      "toNodeType":"device"
   },
   "propertySchema":{
      "propertyMappings":[
         { "propertyKey":"registration_time", "propertyValueType":"TIMESTAMP" },
         { "propertyKey":"status", "propertyValueType":"STRING" }
      ]
   }
}

The Abstraction servers load this schema on startup and build an in-memory metadata graph of possible relationships, enabling several key optimizations:

Data Quality: The Abstraction rejects non-conforming nodes, edges, and properties during writes, ensuring high data quality and consistent exports.
Query Planning: The Abstraction uses the schema to quickly construct the possible traversal paths the service should take to answer a given user query.
Deduplication of Traversed Edges: For bidirectional traversals on edges between the same node type, the schema helps avoid redundant processing by deduplicating traversed paths.
Eliminating Traversal paths: For a given user query, the Abstraction removes traversal paths associated with impossible relationships, as well as those where filters or property types are incompatible.
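The data-quality check from the list above can be sketched by indexing the edge mappings and rejecting writes that do not match. The function names are hypothetical, the input follows the JSON shape shown earlier, only property names are checked (not value types), and reverse lookups for bidirectional mappings are left out for brevity.

```python
def build_metadata_graph(schema):
    """Index edge mappings by (fromNodeType, edgeType, toNodeType)."""
    index = {}
    for mapping in schema["edgeConfig"]["edgeMappings"]:
        key = mapping["edgeMappingKey"]
        property_mappings = mapping.get("propertySchema", {}).get("propertyMappings", [])
        index[(key["fromNodeType"], key["edgeType"], key["toNodeType"])] = {
            "direction": mapping["directionType"],
            "properties": {p["propertyKey"]: p["propertyValueType"] for p in property_mappings},
        }
    return index


def validate_edge(index, from_type, edge_type, to_type, properties):
    """Return True only if the relationship and every property name are allowed."""
    mapping = index.get((from_type, edge_type, to_type))
    if mapping is None:
        return False  # impossible relationship for this schema
    return all(name in mapping["properties"] for name in properties)
```

The same index can serve query planning: a traversal step whose (from, edge, to) triple is missing from the index can be pruned before any storage call is made.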

Further, the Abstraction servers periodically poll the schema from the Data Gateway Control Plane in order to keep it updated with user changes. Looking ahead, we plan to leverage the graph schema for additional improvements, such as:

Minimizing Query Fanout: By using edge cardinality within edge mappings, we aim to select the most efficient traversal paths and minimize query fanout.
Improved Developer Experience: The schema will support generating a type-safe data access layer and enhance the Gremlin-like API with schema awareness.

Next, let’s look at how this data is organized in a real-time index within the KV Abstraction.

Real-Time Index: Key-Value Storage

Before we discuss how the data is organized into graph indexes, let’s discuss how KV organizes data within namespaces and provides idempotency guarantees:

Data partitioning: A namespace is associated with a table in the underlying storage layer. Within the table, data is partitioned into records by unique IDs, with each record holding multiple sorted items as key-value pairs. This structure effectively makes each namespace a map of sorted maps, providing flexibility for diverse access patterns.
Idempotency: Writes to a given ID and key are idempotent, enabling request hedging and safe retries. The idempotency token contains a timestamp, which KV uses to enforce Last-Write-Wins (LWW) semantics at the storage layer.
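The two properties above can be modeled in a few lines: a map of sorted maps, where each write carries a timestamp and stale or repeated writes are ignored. This in-memory class is a hypothetical illustration, not the KV Abstraction's API; in particular, ties on equal timestamps are resolved in favor of the first write, which is a simplification.

```python
from collections import defaultdict


class LwwStore:
    """In-memory model of a namespace: record id -> sorted map of key -> value."""

    def __init__(self):
        self._records = defaultdict(dict)  # id -> {key: (write_ts, value)}

    def put(self, record_id, key, value, write_ts):
        current = self._records[record_id].get(key)
        if current is not None and current[0] >= write_ts:
            return False  # stale write or retried duplicate: state is unchanged
        self._records[record_id][key] = (write_ts, value)
        return True

    def items(self, record_id):
        record = self._records.get(record_id, {})
        return [(key, record[key][1]) for key in sorted(record)]
```

Replaying a write with the same timestamp leaves the record untouched, which is what makes hedged requests and retries safe in this model.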

We use KV as the underlying storage for all real-time graph indices on nodes and edges. For more on Netflix’s Key-Value Abstraction, see this excellent post published by our KeyValue team.

Node Storage

The two-tiered partitioning strategy works well for node storage. Each node type is isolated within its own KV namespace, which stores all the properties for nodes of that type.

This storage format enables several efficient access patterns for nodes:

Efficient reads: A given node and all its properties are fetched in a single partition lookup, achieving single-digit millisecond latency.
Property selection pushdown: Target property keys are pushed down to the KV layer, reducing the amount of data fetched and further decreasing latencies and network overhead.
Property filtering pushdown: Property keys and values can be efficiently filtered at the KV layer.
Efficient exports: This model supports highly parallelized node exports by node type.

Edge Storage

Links and Property Index

Edges utilize two distinct types of indexes: one exclusively for the edge connections (links), and one for edge properties.

The Edge links are arranged as an adjacency list mapping source nodes to their connected neighbors.

The Edge Property index stores information about properties of every edge.

Separating edge links from their properties brings several benefits, but also introduces a key trade-off:

Benefits:

Efficient property upserts: Allows individual properties to be upserted over time without needing to read the entire property set for an edge.
Wide row prevention: Decoupling edge links from their properties prevents large partitions in databases like Cassandra, enabling efficient storage and low-latency reads — even for edges with millions of connections.

Trade-off:

Non-atomic writes: Storing edges across multiple namespaces means that writes across these namespaces are not atomic. We’ll discuss how this is addressed in the Consistency Enforcement section.

Forward and Reverse Indexes

Additionally, edge indexes are separated into forward and reverse indexes to support traversals in either direction. The illustration below shows an example of the reverse index counterpart for the links namespace shown above.

To ensure consistent record identifiers when updating edge properties in either direction, the Abstraction lexicographically sorts and concatenates the source and destination node IDs to create a direction-agnostic identifier for property storage. This ensures that properties can be accessed or mutated in a single database call regardless of the direction specified in the request.
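A minimal sketch of such an identifier is shown below. The function name and the separator are assumptions; the separator must be a character that cannot appear in node IDs.

```python
def edge_property_id(node_a, node_b, separator="|"):
    """Build the same identifier regardless of which endpoint is given first."""
    first, second = sorted((node_a, node_b))
    return first + separator + second
```

For example, edge_property_id("profile:9", "device:1") and edge_property_id("device:1", "profile:9") both produce "device:1|profile:9", so a property read or write needs only one lookup whichever direction the request names.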

This storage format enables several efficient access patterns:

Point Reads: Given an edge id, all properties can be fetched in a single partition lookup on the properties index.
Range Reads: Given a source node, a range read on a partition in the links index can efficiently return all edges. Depending on the desired direction, the Abstraction can target the forward or reverse index.
Property Filtering: Properties are fetched only for the links that match the record or page limit criteria, minimizing the data exchanged over the network.
Sort Orders: By default, edge links are sorted lexicographically by their target node. To support fetching the latest connections, the Abstraction retrieves target edge links in memory, sorts them by their last-write time, and returns the results. In order to ensure optimal performance without exerting too much memory pressure, we aim to limit the number of edges per source node within the system.

Next, let’s explore the caching strategies used by the Abstraction.

Caching Strategies in Graph Abstraction

Although the Graph Abstraction already provides efficient reads and writes to durable storage, caching remains critical for the stability and performance of any graph datastore for two key reasons:

Write amplification: A single write on the fronting service can result in multiple writes to the backing durable storage due to the use of multiple indexes. Whenever possible, it’s best to avoid unnecessary writes — for example, by not writing an edge link that already exists.
Read amplification: A single traversal request on the fronting service may translate into thousands of fetch operations on the backend, especially for highly interconnected graphs.

To address these challenges, the Graph Abstraction employs two distinct caching strategies.

Write-aside Caching of Edge Links

An edge link contains no additional information beyond the link itself and its last-write timestamp. To reduce write amplification on durable storage, we cache edge links for short durations, helping to avoid writing a link that already exists. This mechanism is balanced with configurable TTL windows, cache invalidation on deletes, and lease acquisitions with exponential backoff. These strategies provide the necessary consistency guarantees while still allowing the last-write timestamp to be refreshed according to the predefined staleness.
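The core of this idea is a short-lived "recently written" marker per edge link. The sketch below uses a local dictionary with a TTL as a stand-in for the cache; the class and method names are hypothetical, and the lease acquisition and exponential backoff mentioned above are deliberately omitted.

```python
import time


class EdgeLinkWriteCache:
    """Suppresses repeat writes of the same edge link within a TTL window."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._expiry = {}  # link -> time at which the marker expires

    def should_write(self, link):
        now = self.clock()
        expires_at = self._expiry.get(link)
        if expires_at is not None and expires_at > now:
            return False  # link was written recently, skip the durable write
        # First write, or marker expired: write through and refresh the marker,
        # which also lets the last-write timestamp be refreshed once per TTL.
        self._expiry[link] = now + self.ttl_seconds
        return True

    def invalidate(self, link):
        # Called on deletes so a later re-create is not suppressed.
        self._expiry.pop(link, None)
```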

Read-aside Caching of Properties

To reduce read amplification on the durable store, the Graph Abstraction leverages KV’s integration with EVCache. Multiple KV namespaces can share the same caching clusters for cost efficiency. The Abstraction first fetches data from durable storage, while subsequent reads are served from the cache. Caching is applied at both the record and item levels, benefiting all graph objects.

Graph Abstraction employs two invalidation strategies, selected based on write throughput and consistency requirements:

Invalidation on write: Both record and item caches are invalidated with every write, ensuring consistency across regions. This strategy is ideal for graphs that change infrequently and cannot tolerate data staleness, but comes with the tradeoff of pushing a higher throughput on the cache.
TTL-driven invalidation: Cache entries are invalidated only when their TTL expires. This approach works best for frequently modified objects that can tolerate some staleness.

Work In Progress: Write-Through Caching

We are also developing a write-through caching strategy designed to store most of the data required by the Abstraction during traversals. This caching mechanism can organize indexes by different sort orders (e.g., sorting data by last-write timestamp), at the cost of increased memory consumption. Stay tuned for more details on this approach.

Next, let’s examine the consistency guarantees in Graph Abstraction and how they are enforced for both reads and writes.

Consistency Enforcement

Enforcing data consistency in Graph Abstraction poses several challenges. The connected nature of the data, low-latency API requirements, and the need to handle intermittent failures have led to design choices that enforce strict eventual consistency across multiple regions.

Entropy Repair

Each write in the Abstraction persists data for both inward and outward indices in parallel to support high throughput. Further, each write spans multiple KV namespaces. To prevent inconsistencies or lasting entropy when any of these operations fails, the Abstraction uses a robust Kafka-based retry mechanism:

Node Deletions

Deleting nodes in a highly connected graph is more complex than simply removing a KV record, as each node may have thousands of connected edges that must be handled to maintain graph integrity. Further, synchronously deleting all such connections would introduce unacceptable latency for the Abstraction callers.

The Abstraction employs an asynchronous deletion strategy to manage this issue. The consequence of this approach, however, is that the observed mutated state is only eventually consistent. Further, to ensure correctness of asynchronous deletes during concurrent updates, the Last-Write-Wins (LWW) conflict resolution mechanism is essential.

Global Replication

The consistency guarantees of Graph Abstraction are shaped by its multi-region availability. As illustrated in the diagram below, both the caching layer and durable storage replicate data asynchronously across regions, resulting in an eventually consistent system.

Now that we’ve covered storing the real-time graph index, let’s see how it enables graph traversals.

Graph Traversals

The Abstraction provides a custom gRPC traversal API, inspired by Gremlin, which enables exploration of the distributed graph by letting users chain traversals, apply filter criteria, sort results, limit results, and more.

Let’s explore a hypothetical scenario where the Abstraction is used to recommend shows to users on a shared device, by considering the duration of the most recent viewing session for each show across all profiles and accounts associated with that device:

TraversalRequest.newBuilder()
  .setNamespace("")
  .setTraversalQuery(
     TraversalQuery.newBuilder()
       // Given id of the 'device' node type.
       .setStartNode(node("device", "my-device-id"))
       .setTraversal(
          Traversal.newBuilder()
            // fetch the first 5 connections
            .setEdgeLimit(5)
            .setDirectionTraversal(
               DirectionTraversal.newBuilder()
                  // traverse in the IN direction
                  .setDirection(IN)
                  // minimize data exchange: only interested in certain properties
                  .addNodePropertiesSelections(propSelection("account", "created_at"))
                  .addNodePropertiesSelections(propSelection("profile", "last_active"))
                  .setDirectionFilter(
                     DirectionFilter.newBuilder()
                       // only interested in certain connected types
                       .setTypeMatchingStrategy(EXCLUDE_NON_TARGETED)
                       .addAllNodeFilters(typeFilters("account", "profile"))))
            // chain traversals to the intermediate result
            .addNextTraversals(
               Traversal.newBuilder()
                 .setOrder(LATEST)
                 // limit to 200 connections for the 2nd hop
                 .setEdgeLimit(200)
                 .setDirectionTraversal(
                    DirectionTraversal.newBuilder()
                      // now traverse in the OUT direction
                      .setDirection(OUT)
                      .addEdgePropertiesSelections(propSelection("watched", "view_time"))
                      .addEdgePropertiesSelections(propSelection("has_plan", "active"))
                      .setDirectionFilter(
                         DirectionFilter.newBuilder()
                           .setTypeMatchingStrategy(EXCLUDE_NON_TARGETED)
                           .addAllNodeFilters(typeFilters("title", "plan")))))))
  .build();

And let’s visualize the intended result set produced by the request above:

We’ll explore the design and implementation of traversal planning and execution, along with different traversal types, in Part II of this blog series.

Now let’s look at the performance metrics of Graph Abstraction based on current production use cases.

Real World Performance

Across all applications at Netflix, Graph Abstraction ensures high availability while processing up to 10 million operations per second across all writes, individual edge / node reads and traversals at peak hours:

Edge and node persistence achieve single-digit millisecond latencies (p99 shown in red, p90 shown in orange, and p50 shown in green):

Traversal performance depends on the number of hops, the edge fanout at each stage, and associated filters and sort orders. We parallelize work as much as possible to reduce latencies. Typically 1-hop traversals are executed with single-digit millisecond latency:

1-hop traversal latencies

We also support a Count API that performs counting traversals at a very high rate with similar latencies, which we will cover in Part II of this series:

Currently, the RDG is powered by 2-hop traversals with a higher degree of fan-out. While these operations can reach upwards of 100 ms in latency, the 90th percentile (p90) latency remains under 50 ms.

2-hop traversal latencies

We track the average and max edge fanout at different depths to give us insights into the traversal performance for different graph datasets.

Median edge fan-out

Max edge fan-out

Asynchronous operations such as node deletions can be slightly latent, but typically perform with sub-second latency:

At the moment, we are storing close to 650 TB of data globally across all our graph datasets.

Conclusion

As Netflix scales further into new verticals such as live content, games, and ads, Graph Abstraction will remain crucial for uncovering and leveraging rich connections — while continuing to support a high throughput and availability at low latencies.

Stay tuned for Part II of this blog series, where we’ll explore the implementation of graph traversals, counting and constraint mechanisms.

In Part III, we’ll take a closer look at the temporal index implementation and its integration with the Time Series Abstraction.

Acknowledgments

Special thanks to our stunning colleagues who contributed to Graph Abstraction’s success: Kaidan Fullerton, Joey Lynch, Sudhesh Suresh, Vinay Chella, Sumanth Pasupuleti, Vidhya Arvind, Raj Ummadisetty, Jordan West, Chris Lohfink, Joe Lee, Jingxi Huang, Jessica Walton, Prudhviraj Karumanchi, Akashdeep Goel, Sriram Rangarajan, Chris Van Vlack, Christopher Gray, Luis Medina, Ajit Koti, Mohidul Abedin.

High-Throughput Graph Abstraction at Netflix: Part I was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map

Netflix Technology Blog — Fri, 29 May 2026 14:01:02 GMT

By Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez & Nathan Fisher
How we built a living map of our distributed infrastructure to help engineers understand dependencies, troubleshoot faster, and keep Netflix running smoothly for our members around the world.

The Puzzle with a Thousand Pieces

Picture this: It’s 3am, and an engineer gets paged. One of our critical services is showing elevated error rates. Members trying to watch their favorite films and series are seeing degraded experiences. The clock is ticking.

A single service at the center of a web of dependencies — services, data stores, and call chains branching in every direction. Without a unified map, engineers have to reason about this structure from memory and scattered signals.

In a system with thousands of microservices supporting our entertainment experience for members worldwide, answering these questions quickly can mean the difference between a minor blip and a major incident.

We kept hearing variations of this story from engineers across Netflix. The tooling gap was clear: we had plenty of signals, but no unified way to understand how everything connected.

The Three Questions Every Engineer Asks

When troubleshooting distributed systems, engineers fundamentally need to understand relationships:

Which services depend on each other? Not just theoretical dependencies from configuration files or architecture diagrams, but actual runtime connections based on real traffic.

What’s the blast radius? When something breaks or needs to go down for maintenance, what else will be affected? Which teams need to be notified?

Where’s the source? Is my problem caused by an upstream issue, or am I the root cause that’s cascading to others?

Traditional observability tools show fragments of this picture. Metrics show symptoms and performance characteristics. Logs show individual service behavior. Traces show single request flows through the system. But none of them show the complete map of how everything connects — the steady-state topology of dependencies that forms the backbone of our distributed architecture.

For an engineer at 3am, having to mentally stitch together information from multiple tools is slow, error-prone, and stressful. We needed something better: a unified view of service dependencies — a map showing how everything connects — with easy navigation to the detailed signals when you need to dig deeper.

Why This Matters More Than Ever

Netflix runs on thousands of microservices working together to deliver entertainment to our members. When you press play on your favorite series, that single action triggers a cascade of service-to-service calls — authentication, recommendations tailored to your tastes, video encoding selection, playback optimization, and more.

This architecture gives us tremendous flexibility and allows hundreds of engineering teams to innovate independently. But it also creates fundamental observability challenges.

And these challenges were growing. New initiatives like our Live programming and Ads-supported plans require even more sophisticated monitoring and faster troubleshooting. Live events can’t wait for lengthy incident investigations. The scale and real-time nature of these systems demanded better tooling.

We analyzed thousands of support requests from our engineers over a four-year period. The patterns were consistent:

“What are my upstream and downstream dependencies?”
“Is this failure in my service, or is something I depend on broken?”
“Which services will be impacted if I take this down for maintenance?”
“Why is this service showing as ‘Unknown’ in my metrics?”
“What changed in my call path recently that could explain this behavior?”

Engineers were asking dependency questions constantly. We needed to provide answers — quickly, accurately, and in real-time.

Building on What We Learned

We didn’t start from scratch. Over the years, we explored various approaches to solving this problem — from evaluating external graph databases and vendor platforms to building internal prototypes with different storage technologies and data models.

Each iteration taught us something valuable:

Real-time matters: Dependency maps that are hours old are useless in dynamic environments where services deploy multiple times per day. We needed near real-time updates.

Scale changes everything: Solutions that work at modest scale hit fundamental walls at Netflix scale. Storage systems that handle thousands of nodes struggle with our service count and traffic volume.

Integration is key: Any solution needs seamless integration with our existing observability ecosystem. Engineers shouldn’t have to learn entirely new tools or leave their existing workflows.

Data quality is critical: Incomplete or incorrect dependency information is worse than no information — it leads to wrong conclusions during incidents.

Multiple perspectives needed: We learned that no single source of dependency information tells the complete story. Network connectivity data lacks application context. Application metrics only cover instrumented services. We needed to combine multiple sources.

These lessons shaped every decision we made in building Service Topology.

What We Needed: A Living Map

We set out to build something specific: a living map of our infrastructure — one that updates in real-time as services deploy, as traffic patterns shift, as new dependencies form and old ones disappear.

The requirements were clear:

Real-time updates, not stale snapshots: In an environment where services deploy continuously, yesterday’s topology map is archaeology, not observability.

Fast queries at scale: When an engineer is troubleshooting at 3am, they can’t wait minutes for a query to return. We needed sub-second response times for traversing the call graph.

Multiple layers: Network-level connectivity doesn’t tell the whole story. We needed to see both the network layer (what’s actually talking to what) and the application layer (which APIs and endpoints are being called).

Rich context, not just connections: Knowing Service A talks to Service B isn’t enough. We needed to overlay health status, availability tiers, business domains, ownership information, and other metadata to make the information actionable.

Visual and programmatic access: Engineers needed a UI for exploration and troubleshooting. But automated systems — resilience frameworks, blast radius calculators, incident response automation — needed programmatic API access.

Our Approach: Three Sources of Truth

Three data sources produce three independent topology graphs — network, application, and request — each stored separately and queryable on their own or merged into a single unified view.

Here’s the key insight we arrived at: no single source tells the complete story.

We built Service Topology by using three complementary sources to build separate dependency graphs — one from each perspective — that can be combined into a unified view or explored independently:

Each source creates its own graph that is physically separate — the network layer in one graph database partition, the IPC layer in another partition, and the tracing layer using columnar storage optimized for analytical queries. This physical separation allows each layer to evolve independently and be queried in parallel. When users request a unified view, we execute traversal queries across all layers simultaneously and merge results, achieving sub-second response times even when combining all three layers.

Each source creates its own graph of service relationships:

1. eBPF Network Flows (Network Layer)

We capture network flow records at the kernel level using eBPF technology — information about which services are connecting to which other services over the network. This gives us ground truth about actual network-level communication.

The value: Comprehensive coverage. Every service shows up here because we’re capturing actual network traffic, regardless of whether applications are instrumented. This layer provides topology at both cluster-level (which deployment clusters are communicating) and app-level (which applications are communicating).

The limitation: Network-level information lacks application context. We know Service A connected to Service B’s IP address using a specific protocol, but not which specific API endpoint or path was called (e.g., /api/v1/users vs /api/v1/orders).

2. IPC Metrics (Application Layer)

We collect Inter-Process Communication metrics from our instrumented services. These are the metrics applications emit when they make calls to other services via gRPC, GraphQL, REST, or other protocols.

The value: Rich application context. We can see which specific endpoints were called, error rates, latency distributions, protocol details, and request/response characteristics. This layer provides app-level topology — since IPC metrics are emitted by applications, the natural granularity is application-to-application connections with endpoint details.

The limitation: Only works for instrumented services. If a service doesn’t emit IPC metrics, we won’t see its application-level calls this way.

3. End-to-End Tracing (Request Layer)

We integrate distributed tracing information that follows individual requests as they flow through our system. We aggregate traces to build a unified topology graph, but also allow engineers to overlay individual traces on the topology to see specific request flows.

The value: Shows actual request paths. Not just “Service A can call Service B,” but “Service A did call Service B as part of serving this specific member request.” This captures runtime behavior, including conditional logic and feature flags. Engineers can both see the aggregated pattern and drill into individual traces. We aggregate traces to build topology at both cluster-level and app-level, allowing engineers to view request patterns at the granularity most useful for their investigation.

The limitation: Sampling. We can’t trace every request without impacting performance, so we sample. This is excellent for understanding common flows, but may miss rarely-used code paths in the aggregated view.

Bringing It Together: Multi-Layer Architecture

Here’s what makes this powerful: we build three separate graphs — one from each source — that create different perspectives on service relationships:

Network graph from eBPF flows: Every connection, regardless of instrumentation
Application graph from IPC metrics: Rich endpoint and protocol details
Request graph from tracing: Actual runtime behavior and call paths

Engineers can:

View each graph independently to focus on a specific perspective (pure network connectivity, application-level calls, or traced request flows)
Combine them into a unified graph by querying multiple partitions in parallel and merging results — our system returns the union of nodes and edges from all requested layers while preserving each layer’s distinct properties

The unified view is especially powerful because:

Network flows ensure completeness — we don’t miss anything
IPC metrics provide application details — we understand the “how” and “what”
Tracing shows actual behavior — we see real request patterns

Each source compensates for the limitations of the others. The result is a comprehensive, accurate, and contextualized view of service dependencies that can be explored from multiple angles.
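The merge step described above (query the layers in parallel, then return the union while keeping each layer's properties distinct) could look roughly like the sketch below. The TopologyMerge class, the "source->target" edge key, and the plain string property maps are illustrative assumptions. Nodes are left implicit here, since they can be derived from edge endpoints.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Hypothetical sketch: merge per-layer edge sets into one unified view.
final class TopologyMerge {
    // Each layer query returns edges keyed by "source->target" with that layer's properties.
    // The result maps edge -> layer name -> properties, so layers never overwrite each other.
    static Map<String, Map<String, Map<String, String>>> mergeLayers(
            Map<String, Supplier<Map<String, Map<String, String>>>> layerQueries) {
        // Start every layer query before waiting on any of them.
        Map<String, CompletableFuture<Map<String, Map<String, String>>>> pending =
                new LinkedHashMap<>();
        layerQueries.forEach((layer, query) ->
                pending.put(layer, CompletableFuture.supplyAsync(query)));

        // Union the edges; an edge seen by several layers keeps one property map per layer.
        Map<String, Map<String, Map<String, String>>> merged = new TreeMap<>();
        pending.forEach((layer, future) ->
                future.join().forEach((edge, properties) ->
                        merged.computeIfAbsent(edge, k -> new TreeMap<>())
                                .put(layer, properties)));
        return merged;
    }
}
```

Nesting the properties under the layer name is one simple way to preserve provenance: a caller can tell that an edge was seen on the network layer but not in IPC metrics, which may itself be a useful signal.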

From Flows to Graph: How We Built It

Here’s the high-level architecture (we’ll dive deeper into engineering challenges in our next post):

Flow logs travel from multi-region Kafka through three aggregation stages — initial batching, intermediary resolution, and final enrichment — before being persisted to the graph database and served via API.

Multi-Region Ingestion: We consume flow logs from Kafka across multiple AWS regions where Netflix operates. This runs continuously, processing millions of flow records as they arrive.

Distributed Processing: We use Apache Pekko Streams (a fork of Akka) to process these flows in a distributed, fault-tolerant pipeline. The system automatically partitions work across our Auto Scaling Groups to handle the volume and provides natural backpressure handling.

Three-Stage Distributed Aggregation: We aggregate network flows through a three-stage pipeline that solves a fundamental challenge: network flow logs only show individual network hops through intermediaries (App A → Load Balancer → App B, or App A → NAT Gateway → App B), not the true application-level connections we need (App A → App B).

Stage 2 resolves network intermediaries: raw flow logs show two separate hops (App A → Load Balancer → App B), but the resolved graph stores the direct application-to-application relationship (App A → App B).

Stage 1 performs initial aggregation from Kafka. Stage 2 applies resolution logic — identifying network intermediaries (load balancers, NAT gateways, API gateways, proxies) and combining their incoming and outgoing flows to reconstruct direct application-to-application paths. Stage 3 performs final aggregation with health status integration before graph persistence. This graduated approach also prevents hot spots by distributing load across multiple points even when specific applications or network intermediaries see 100x more traffic than others.
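The Stage 2 resolution idea can be sketched as a small graph rewrite: starting from each real application, follow flows through any chain of intermediaries until another application is reached, and record the direct edge. The IntermediaryResolver class and the string-pair flow representation are hypothetical. The sketch also makes a loud simplifying assumption: every caller of an intermediary is linked to every destination behind it, whereas a production pipeline would need to correlate the specific incoming and outgoing flows.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: collapse hops through intermediaries into app-to-app edges.
final class IntermediaryResolver {
    // Each flow is a {source, destination} pair; the result holds "source->destination" edges.
    static Set<String> resolve(List<String[]> flows, Set<String> intermediaries) {
        Map<String, Set<String>> outgoing = new HashMap<>();
        for (String[] flow : flows) {
            outgoing.computeIfAbsent(flow[0], k -> new TreeSet<>()).add(flow[1]);
        }

        Set<String> resolved = new TreeSet<>();
        for (String source : outgoing.keySet()) {
            if (intermediaries.contains(source)) {
                continue; // only real applications may start an edge
            }
            Deque<String> toVisit = new ArrayDeque<>(outgoing.get(source));
            Set<String> seen = new HashSet<>();
            while (!toVisit.isEmpty()) {
                String hop = toVisit.pop();
                if (!seen.add(hop)) {
                    continue; // already handled, also guards against cycles
                }
                if (intermediaries.contains(hop)) {
                    // Look through the intermediary to whatever sits behind it.
                    toVisit.addAll(outgoing.getOrDefault(hop, Collections.<String>emptySet()));
                } else {
                    resolved.add(source + "->" + hop);
                }
            }
        }
        return resolved;
    }
}
```

Given the flows App A → Load Balancer and Load Balancer → App B, this returns the single edge A->B, and it also handles chains such as App A → NAT Gateway → Load Balancer → App B.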

Graph Storage: We persist the topology in Netflix’s graph database, an abstraction layer built on top of our distributed key-value storage infrastructure. This graph database is specifically designed for high-throughput graph operations at our scale, with fast multi-hop traversal capabilities. Each of our three data sources (network flows, IPC metrics, tracing) creates a separate graph that can be queried independently or merged.

gRPC API: We expose the topology through a gRPC service that supports multi-hop traversal, filtering by availability tier and business domain, pagination for large result sets, and sub-second query response times.

The technical details of building this at Netflix scale — handling Kafka lag, managing memory and garbage collection, optimizing distributed processing, debugging reactive streams — deserve their own discussion. We learned a lot, and we’ll share those lessons in our next post.

What Engineers Can Do Now

Today, the service topology map is helping engineers across Netflix:

Visualize Dependencies: See upstream and downstream dependencies for any service, with the ability to filter by availability tier (Tier 0, Tier 1, etc.) and business domain. Choose between the unified view (combining all sources) or individual graph views (network-only, IPC-only, or trace-only) depending on what you’re investigating.

Jump to Detailed Signals: From any service in the topology, quickly navigate to logs, traces, and detailed metrics in their respective tools. No more hunting for the right service name or time window — the topology provides the context and the starting point.

Understand Blast Radius: Before taking a service down for maintenance or making significant changes, see exactly what will be impacted. Identify which teams to notify and what to monitor.

Overlay Health Status: See not just the topology, but which services in the call path are experiencing issues. This is integrated with health status tracking, so you can quickly identify if a problem you’re seeing is actually originating somewhere else.

Query Programmatically: Use our gRPC API to integrate topology information into automated systems. For example, our Platform Modernization Engineering team uses this to verify that critical Live services have proper availability tier classifications throughout their dependency chains.

Investigate Faster: During incidents, quickly identify if a failure is local or if it’s propagating from somewhere else in the call graph. Follow the failure pattern to find the root cause.

Plan Changes Confidently: Understand the impact of proposed architectural changes or service migrations before implementing them.

Time Travel Through Topology: Query what the topology looked like at specific points in the past. Understand what changed in dependencies around the time an issue started, or see how your service’s dependency footprint has evolved over time. This time-travel capability is powered by time-window aggregation — instead of storing every time slice separately, we use layer-specific aggregators that accumulate topology data across windows, allowing us to reconstruct historical views efficiently without exploding storage costs.
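To give a feel for what a blast radius query involves, here is a minimal sketch that walks the call graph in reverse from a service to find everything that transitively depends on it. The BlastRadius class and the {caller, callee} pair representation are hypothetical, and the sketch ignores the tier and domain filters that the real API supports.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: breadth-first search over reversed call edges.
final class BlastRadius {
    // Each call is a {caller, callee} pair. Returns all direct and transitive callers of service.
    static Set<String> impactedBy(String service, List<String[]> calls) {
        Map<String, Set<String>> callersOf = new HashMap<>();
        for (String[] call : calls) {
            callersOf.computeIfAbsent(call[1], k -> new HashSet<>()).add(call[0]);
        }

        Set<String> impacted = new TreeSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(service);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String caller : callersOf.getOrDefault(current, Collections.<String>emptySet())) {
                // Skip the starting service itself and anything already recorded.
                if (!caller.equals(service) && impacted.add(caller)) {
                    queue.add(caller);
                }
            }
        }
        return impacted;
    }
}
```

For a graph where web calls api, and both api and batch call db, the services impacted by taking db down are api, batch, and web.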

The Living Map: Always Current

What makes this truly useful is that it’s a living map. It’s not a static diagram drawn in a design document that goes out of date the moment it’s published. It’s continuously updated based on actual traffic:

When a new service starts calling an API, it appears in the topology with near real-time freshness
When a service stops making calls to a dependency, that edge fades from the graph
When services deploy and their behavior changes, the topology reflects it
When incidents impact service health, the status overlay updates in real-time

This means engineers can trust what they see. The map reflects reality, not someone’s idea of what the architecture should be.

The Journey Continues

We’re not done. We continue to evolve the system with new capabilities:

Change Event Overlay: We’re working to surface deployment events, configuration changes, and other mutations alongside the topology graph. Correlation becomes easier when you can see both the dependencies and what changed when.

Richer Context: As we expand coverage and integrate more signals, we continue to enrich the topology with additional endpoint-level details, protocol information, and network path context.

And looking further ahead, we’re excited about something bigger: Automated root cause analysis. Imagine an intelligent agent that continuously crawls the topology graph, correlates failures across dependencies, understands historical patterns, and surfaces likely root causes automatically. Service topology provides the knowledge graph foundation that makes this kind of intelligent automation possible.

Why This Matters for Our Members

This might seem like infrastructure — plumbing that our members never see directly. But it matters immensely to their experience.

When engineers can quickly understand dependencies and identify issues, incidents get resolved faster. When we can model blast radius before making changes, we avoid disruptions. When automated systems can query dependency information programmatically, we can build smarter, more resilient systems.

All of this translates to what matters most: our members getting to watch their favorite films and series, seamlessly, whenever they want. Whether it’s a weekend binge of a beloved show, a live sports event, or discovering something new through our recommendations tailored to their tastes — we want it to just work.

What’s Next in This Series

This is the first in a series of posts about building Service Topology at Netflix.

In our next post, we’ll pull back the curtain on the engineering challenges we faced at scale: How do you handle Kafka consumer lag when ingesting millions of flow logs per second? What happens when distributed processing meets garbage collection pauses? How do you debug reactive streams that stall under load? How do you manage hot nodes in a distributed system? We’ll share the real problems we hit in production and the solutions we developed.

In future posts, we’ll explore the lessons we learned that apply to any distributed system at scale, and where we’re heading next with time travel capabilities and automated root cause analysis.

Acknowledgements

This post was written by Parth Jain.

Service Topology was built by Parth Jain, Rakesh Sukumar, Yingwu Zhao, Renzo Sanchez-Silva, and Nathan Fisher.

Special thanks to the many engineers across Netflix who made this possible — the Observability team who built the broader system, the graph database platform team who provided the storage foundation, and the Platform Modernization Engineering, Live, and Ads teams who provided invaluable feedback and use cases throughout development.

From Silos to Service Topology: Why Netflix Built a Real-Time Service Map was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Scaling ArchUnit with Nebula ArchRules

Netflix Technology Blog — Fri, 08 May 2026 15:55:59 GMT

By John Burns and Emily Yuan

Introduction

At Netflix, we operate using a polyrepo strategy with tens of thousands of Java repositories. This means that we need to have ways of sharing common build logic across these repositories. On the JVM Ecosystem team within Java Platform, we build tooling such as the Nebula suite of Gradle plugins to provide standard ways to build projects, keep dependencies up-to-date, and publish artifacts reliably across the Java ecosystem. Our mission also entails providing build-time feedback to the developer when they deviate from the paved road, or when their code base contains technical debt.

Case Study

After a Netflix incident relating to a library releasing a backwards-incompatible change, our team was asked to provide some tooling and practices to improve the Java library lifecycle management. This was not a simple case of a library making a reckless breaking change. The code removed had been deprecated for years. Library authors often struggle to know when it is safe to remove deprecated code, or refactor code that is not meant to be used by downstream applications. Fleet-wide migrations, such as upgrading major Spring Boot versions, also involve deprecated code removal. To help with this, we established a suite of API lifecycle annotations:

@Deprecated from the Java standard library
@Public A custom annotation to use on APIs meant to be used downstream
@Experimental A custom annotation for new APIs which may not yet be stable
All other APIs are assumed to be “internal”
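As a rough illustration, custom annotations like these could be declared and classified as shown below. The annotation declarations, the ApiLifecycle helper, the precedence order, and the ExampleLibrary class are all hypothetical and are not Netflix's actual definitions. RUNTIME retention is chosen here only so that the reflection-based demo works.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.AnnotatedElement;
import java.lang.reflect.Method;

// Hypothetical marker for APIs meant to be used by downstream projects.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD, ElementType.CONSTRUCTOR, ElementType.FIELD})
@interface Public {}

// Hypothetical marker for new APIs which may not yet be stable.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD, ElementType.CONSTRUCTOR, ElementType.FIELD})
@interface Experimental {}

final class ApiLifecycle {
    enum Stage { DEPRECATED, EXPERIMENTAL, PUBLIC, INTERNAL }

    // Assumed precedence: deprecated, then experimental, then public.
    // Anything without an annotation is treated as internal.
    static Stage stageOf(AnnotatedElement element) {
        if (element.isAnnotationPresent(Deprecated.class)) return Stage.DEPRECATED;
        if (element.isAnnotationPresent(Experimental.class)) return Stage.EXPERIMENTAL;
        if (element.isAnnotationPresent(Public.class)) return Stage.PUBLIC;
        return Stage.INTERNAL;
    }

    // Convenience lookup by method name, so callers avoid checked reflection exceptions.
    static Stage stageOfMethod(Class<?> owner, String methodName) {
        for (Method method : owner.getDeclaredMethods()) {
            if (method.getName().equals(methodName)) return stageOf(method);
        }
        throw new IllegalArgumentException("No method named " + methodName);
    }
}

// A tiny example library using each lifecycle stage.
final class ExampleLibrary {
    @Public static void stable() {}
    @Experimental static void preview() {}
    @Deprecated static void legacy() {}
    static void helper() {}
}
```

With this in place, ApiLifecycle.stageOfMethod(ExampleLibrary.class, "helper") reports INTERNAL, which is the default the list above describes for unannotated APIs.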

Library authors can annotate their APIs with these annotations. However, how will they know which downstream projects are using their API incorrectly, based on these?

As we sought to improve the paved road for JVM-based libraries at Netflix, we needed a good way of identifying this kind of technical debt, not only for the benefit of the Java Platform-provided libraries, but any team delivering shared libraries to the organization. For this, we looked at ArchUnit.

ArchUnit is a popular OSS library (3.5k stars, 84 contributors) used to enforce “architectural” code rules as part of a JUnit suite. It is used internally by Gradle and Spring, and is provided as part of the Spring Modulith platform. The rules engine, which is built directly on top of ASM, can be used for a wide variety of use cases. It is powerful enough to be a general-purpose static analysis tool with the following distinctive features:

1. Works across JVM languages, because it analyzes bytecode with ASM rather than parsing an AST.

2. Exposes a builder-pattern API that makes it easy to write rules.

3. Also has a lower-level API ideal for writing more complex custom rules.

The limitation of ArchUnit is that it is designed to be used as part of a JUnit suite in a single repository. The Nebula ArchRules plugins give organizations the ability to share and apply rules across any number of repositories. Rules can be sourced from OSS libraries or private internal libraries. This makes the plugin generally useful for any JVM+Gradle engineering organization.

Why ArchUnit?

Before we go into how ArchRules works, it is good to understand why we would want to use ArchUnit in this way instead of other static analysis tools.

AST vs Bytecode

Some tools, such as PMD, process rules against an AST (abstract syntax tree). An AST is a structured representation of source code, so rules written for this kind of tool are syntax dependent. Rules that need to support multiple JVM languages, such as Kotlin or Scala, often need to be rewritten for each language. AST-based rules can also miss code that should be flagged when it is hidden behind syntactic sugar the rule author did not anticipate. ArchUnit uses ASM to analyze actual compiled bytecode, which means it doesn’t matter how that code was produced. What is analyzed is the actual code that will be run.

Rule Authorship

Tools like PMD and SpotBugs are not optimized for custom rule authorship. Most usage of these tools involves running the built-in rules or adding pre-made third-party plugins. Take a look at what a custom rule for PMD might look like:

 //AllocationExpression/ClassOrInterfaceType[
   @Image='DateTime' and (
       (count(..//Name[@Image='DateTimeZone.UTC'])<=0)
       and
       (count(..//Name[@Image='DateTimeZone.forID'])<=0)
    ) or (
       (
           (count(..//Name[@Image='DateTimeZone.UTC'])>0)
             or
           (count(..//Name[@Image='DateTimeZone.forID'])>0)
       ) and (../Arguments/ArgumentList and count(../Arguments/ArgumentList/Expression) = 1)
   )
 ]

This rule ensures that DateTimes are not instantiated without an explicit zone. It is a raw string meant to be used within PMD’s XPath parser, and there is no IDE guidance on crafting it. To test it, a whole separate PMD process needs to be wired up to interpret the rule and evaluate it against a source file. Let’s see how a similar rule would look with ArchUnit:

ArchRuleDefinition.priority(Priority.MEDIUM)
    .noClasses()
    .should()
    .callConstructorWhere(
        // constructor does not have a zone argument
        target(doesNot(have(rawParameterTypes(DateTimeZone.class))))
            // constructor is for DateTime
            .and(targetOwner(assignableTo(DateTime.class)))
    )

This is type-safe Java code with a fluent API. It is also simple to unit test, as ArchUnit has a method to pass a rule object and class references to evaluate the rule against those classes.

Class Relations

Because ArchUnit processes the entire classpath with ASM, it retains a graph of the class data, allowing rules to easily traverse class relationships and call sites. This gives rules much more context about the code they are evaluating.

Rules Libraries

The first step was to build the ability to write ArchUnit rules which can be shared and published. In order to do this, we have the ArchRules Library Plugin. This plugin adds an additional source set to your Gradle project called archRules. In this source set, you can create a class which implements the ArchRulesService interface. This interface has a single abstract method which returns a Map. The keys of this map are the names of your rules, and the values are the ArchRule objects you define using the standard ArchUnit API. Here is an example:

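import com.tngtech.archunit.lang.ArchRule;
import com.tngtech.archunit.lang.Priority;
import com.tngtech.archunit.lang.syntax.ArchRuleDefinition;
import java.util.HashMap;
import java.util.Map;
// ArchRulesService is provided by the ArchRules library (import omitted here)

public class GuavaRules implements ArchRulesService {
    // Flags any dependency on Guava's Optional
    static final ArchRule OPTIONAL = ArchRuleDefinition.priority(Priority.MEDIUM)
        .noClasses()
        .should()
        .dependOnClassesThat()
        .haveFullyQualifiedName("com.google.common.base.Optional")
        .because("Java Optional is preferred over Guava Optional");

    @Override
    public Map<String, ArchRule> getRules() {
        // Keys are rule names; values are the rule definitions
        Map<String, ArchRule> rules = new HashMap<>();
        rules.put("guava optional", OPTIONAL);
        return rules;
    }
}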

This code and its dependencies will not be bundled with your main code. It is bundled into a separate Jar with the arch-rules classifier. When publishing, your library will publish this jar as a separate variant with the usage attribute set to arch-rules. This means that in order for downstream projects to use these rules, they must use Gradle Module Metadata for dependency resolution. There are two flavors of rule libraries: standalone and bundled.

Standalone Rule Libraries

A standalone rule library contains no main code, only archRules. These are useful for defining rules for code you don’t own, such as Core Java APIs or OSS libraries. They are also useful for generic rules that can apply to any code, such as “don’t use code marked as @Deprecated”. We maintain a collection of OSS standalone rule libraries which anyone is free to use, and which serve as examples of the types of rules you may want to write yourself. However, the real power of ArchRules is in “bundled rule libraries”.

Bundled Rule Libraries

A bundled rule library is a library with both main and archRules sources. The main source set will contain useful library code, whatever it may be. The archRules source set will contain rules specific to the usage of that library, for example rules scoped to that library’s package or referencing that library’s specific API. Whenever possible, we recommend writing rules in this bundled way, because the ArchRules Runner Plugin will be able to automatically detect these rules and run them in only the source sets that use this library as a dependency. An example of this can be seen in our Nebula Test library.

In any case, the library plugin will automatically generate a service loader registration entry for your ArchRulesService so that the runner can discover your rules.

Running Rules

The ArchRules Runner Plugin allows rules to be evaluated against your code. Standalone rule libraries can be evaluated against all source sets by adding them to the archRules configuration in your build. For example:

dependencies {
    archRules("your:rules:1.0.0")
}

As mentioned before, bundled rules will be evaluated automatically. To do this, the runner plugin creates a separate configuration for each of your source sets. In each of these configurations, the archRules classpath is combined with the runtimeClasspath with the arch-rules variant selected. This configuration is the classpath used when the ServiceLoader discovers implementations of ArchRulesService. In the following example, we have a Project which uses a test helper library as a testImplementation dependency, and also adds a standalone rules library to the archRules configuration. The test runtime classpath will only contain the implementation jar for the helper library, but the arch rules runtime will contain the archrules jar for the bundled rules and standalone rules. This all happens automatically.

Gradle configurations used by ArchRules

Once the rules classpath is determined, the runner plugin will create a Gradle work action to evaluate rules against that specific source set. This action runs with classpath isolation using the *archRuleRuntime configuration. Within this action, a ServiceLoader is used to discover rule definitions. The action ends by writing a binary serialization of rule violations to a file for reporting.

In a project running rules, you also have the ability to customize rule configurations using the archRules extension. For example, you can override a rule’s priority level:

archRules {
    ruleClass("com.netflix.nebula.archrules.deprecation") {
        priority("HIGH")
    }
}

Other customizations include disabling running rules on certain source sets and configuring the failure threshold (i.e., high priority failures will cause the build to fail).

Reporting

The ArchRules runner plugin has two built-in reports: JSON and console. The JSON report collects the output from all source sets within a project and creates a single JSON file with all of the data. The console report also collects the output from all source sets within a project, but prints an easy-to-read report to the console, for example:

Console Report output

Note that failure details feature a detailed plain English description, along with a pointer to the exact line of code in violation.

For custom reporting, you can either use the JSON file, or create your own task that reads the binary files. Take a look at the source code for the ArchRules runner plugin’s report tasks for an example of how to do this.

Case Study Solution

Going back to our original problem, using ArchRules, we were able to deliver a platform for library authors to track the usage of their APIs. They write ArchRules to detect usage of the annotations, scoped to their library’s package, such as:

ArchRuleDefinition.priority(Priority.MEDIUM)
    .noClasses().that(resideOutsideOfPackage(packageName + ".."))
    .should()
    .dependOnClassesThat(resideInAPackage(packageName + "..").and(are(deprecated())))
    .orShould().accessTargetWhere(targetOwner(resideInAPackage(packageName + ".."))
        .and(target(is(deprecated())).or(targetOwner(is(deprecated())))))
    .allowEmptyShould(true)
    .because("Deprecated APIs are subject to removal");

NB: the deprecated() predicate comes from nebula-archrules.

Our internal Nebula standard Gradle wrapper and plugin suite automatically enables the ArchRules runner on every project, and provides a custom reporter which sends the report data to our Internal Developer Portal on every main-branch CI build. This way, library authors can easily see a report of all downstream consumers using their experimental, deprecated, or non-public APIs, giving them confidence to make “breaking” changes, knowing that it will not actually break downstream consumers. If their changes are currently blocked by downstream usage, they can easily see exactly which projects are reporting those usages.

OSS Rule Libraries

While the most powerful way to use ArchRules is for you to write your own rules, we have built some OSS rule libraries that anyone is free to use, or reference as examples.

Nullability

These rules enforce proper nullability annotation in Java, for example, that every public class is marked with JSpecify’s @NullMarked. They are smart enough to exclude Kotlin code, as Kotlin has built-in nullability.

Gradle Plugin Best Practices

Writing Gradle plugins can be hard, especially since there are many APIs and patterns that should not be used anymore. These rules help enforce current best practices when writing Gradle plugins.

Joda / Guava Rules

These rule libraries discourage the use of Joda Time and Guava classes (respectively) as these have been superseded by java.time and standard library enhancements.

Security Rules

These rules help mitigate CVEs by detecting usage of known vulnerable APIs. Ideally, we keep dependencies up to date to mitigate CVEs. But sometimes that is not immediately feasible, and in those cases, a compile time check to ensure the specific vulnerable API is not used is often good enough.

Conclusion

We are now running 358 (and counting) rules across over 5,000 repositories, detecting nearly 1 million issues. About 1,000 of these issues are for “High” priority rules. Being able to run these rules on this scale allows us to quickly gain insight into our large fleet of microservices, and identify the areas carrying the most critical technical debt. This makes it easier to focus and prioritize our efforts.

Going forward, we will be exploring how to tie auto-remediation solutions into the ArchRules findings. ArchUnit currently provides very specific and detailed information about failures in reports, which makes a very strong input signal to an auto-remediation tool. We will explore deterministic solutions such as OpenRewrite and non-deterministic solutions such as LLMs. Pairing the easy rule authorship and deterministic results of ArchUnit with an auto-remediation tool that can correctly interpret the results to solve the issue at hand will be a very powerful combination.

We also will investigate how to get ArchRule failure information surfaced in the IDE as inspections.

If you have questions or feedback about Nebula ArchRules, reach out to us by posting in the #nebula channel on the Gradle Community Slack.

Scaling ArchUnit with Nebula ArchRules was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph

Netflix Technology Blog — Mon, 04 May 2026 16:01:02 GMT

Saish Sali, Nipun Kumar, Sura Elamurugu

Introduction

As Netflix has grown, machine learning continues to support our ability to deliver value to members and drive excellence across multiple areas of our business. When Netflix began investing in machine learning over a decade ago, it was primarily focused on a single domain: personalization. Scala was the industry standard, our ML teams were relatively small, and optimizing member engagement was our primary use case. Fast forward to today, and machine learning has become the backbone of Netflix’s business transformation. We now apply ML across various business domains, including:

Personalization: Optimizing engagement and helping members discover content they’ll love
Studio: Pre and post-production workflows
Payments: Fraud detection, payment routing, and recurring billing optimization
Ads: Our newest domain, requiring real-time decisioning and targeting

… and a growing number of additional use cases across the company

Each domain operates with a different tech stack, different business metrics, and a distinct organizational structure. While this diversity is a testament to how machine learning has evolved to drive value across many verticals at Netflix, this growth introduces a new challenge: enabling cross-pollination of models and data across domains.

The Challenge: A Fragmented ML Landscape

As our ML investments scaled across these domains, a critical problem emerged: the models produced largely became black boxes. Without any discovery infrastructure, ML practitioners couldn’t easily collaborate or share work across business verticals.

Consider a concrete example: content embeddings. Our Studio teams create sophisticated embeddings that identify scene boundaries, detect visual transitions, and understand content structure. These embeddings were originally built for production workflows.

But those same embeddings could be incredibly valuable elsewhere. Ads could hypothetically use content embeddings for context matching (ensuring advertisements align with the tone and content of what’s currently playing). Personalization could leverage them for episodic merchandising and recommendations (matching the topic or mood of an episode with a user’s viewing preferences). Yet making this cross-pollination happen is extraordinarily difficult.

Why? Our ML tools exist in silos, each with its own backend services and user interface. The model registry is unaware of which A/B tests are using its models, and the pipeline orchestrator is unaware of downstream model dependencies. ML practitioners have to traverse multiple systems to answer basic questions about their work. Finding a model requires opening the model registry, understanding its lineage means switching to the pipeline orchestrator, and tracking which A/B tests use that model requires navigating to the experimentation platform. This fragmentation prevents practitioners from answering critical questions:

Discovery: What features exist? What data sources are available for generating features for a model?
Lineage: Which pipeline is generating data for a specific model? What data sources feed those features?
Impact: Which A/B tests are running this model? Which models will break if I change this feature? Who owns each piece of this chain?

The Hard Problem: Connecting everything

The real challenge wasn’t just building a consolidated UI. We needed to connect the different pieces of infrastructure our ML practitioners were using to perform different parts of the ML lifecycle.

Our ML ecosystem generates metadata from dozens of sources:

Pipeline orchestration systems emit execution details, stage dependencies, and data transformations
Deployed model registry tracks model versions, artifacts, staleness, and deployment history
Experimentation platform manages A/B tests and their configurations
Feature store catalogs feature definitions and usage
AI Dataset platform tracks the creation, management, discovery, and loading of datasets
Identity platform maintains user, team, and organization metadata

Each system employs different formats, identifiers, and mental models. The hard technical problem we had to solve was: How do we collect this heterogeneous metadata, transform it into a unified entity model, and build a connected graph that enables true exploration and collaboration across business domains?

The Solution: Metadata Service and the Model Lifecycle Graph

Our answer was the Metadata Service (MDS), which builds a Model Lifecycle Graph that indexes and connects ML-related entities across Netflix. MDS is optimized for real-time ingestion of ML metadata (e.g., models, features, pipelines, experiments, datasets) and for answering cross-domain questions such as “Which experiments are running this model?” or “Which models share these features?” It is the foundation that enables discovery: it ingests events from diverse sources, enriches them with context, and materializes relationships across entities.

Our vision: to make every ML asset at Netflix discoverable, understandable, and reusable by every ML practitioner, regardless of their team or domain.

Core Abstractions: The Vocabulary of the System

Before diving into the technical implementation, it’s helpful to understand the conceptual model that underpins MDS. This vocabulary enables consistent communication across teams and systems:

Component: Any object that is uniquely addressable using an AI Platform’s (AIP) Uniform Resource Identifier (URI). An AIP URI follows the format aip:////, ensuring global uniqueness. For example:

Models: aip://model/registry/ranking-v5
Users: aip://user/identity/alice
Pipelines: aip://pipeline/orchestrator/weekly-training

Entity: A component within the ML ecosystem, characterized by additional properties such as name, description, creation date, and owners. Entities represent ML-specific assets, such as models, features, and pipelines.

Entity Type: A group of entities that share the same data shape. A data shape is a set of property constraints that specify the attributes and relationships an entity must have.

Domain: A functional grouping of related entity types that defines the abstract interface for a category of ML assets. For example, the Models domain defines what a Model and Model Instance look like, while the Pipelines domain defines Schedules, Requests, and Executions.

Provider: A concrete implementation of a domain, backed by a specific source system. For example, the Models domain is currently backed by our internal model registry. This separation allows MDS to support multiple providers for the same domain. If a new model registry were introduced, it could be added as an additional provider without changing the domain interface.

We can summarize these concepts with a concrete example:

This URI-based addressing scheme is crucial as it allows any service to reference any ML asset with a single string, and MDS can resolve that reference back to rich, connected metadata.
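The addressing scheme can be sketched in a few lines of Java. This is a minimal illustration, not MDS code: the AipUri type and the segment names kind, provider, and id are assumptions inferred from the three example URIs above.

```java
import java.util.Objects;

// Hypothetical value type. The segment names are inferred from the example
// URIs above and are not taken from MDS itself.
record AipUri(String kind, String provider, String id) {
    private static final String SCHEME = "aip://";

    // Parses strings shaped like aip://model/registry/ranking-v5.
    static AipUri parse(String raw) {
        Objects.requireNonNull(raw, "raw");
        if (!raw.startsWith(SCHEME)) {
            throw new IllegalArgumentException("Not an AIP URI: " + raw);
        }
        // Limit the split to 3 parts so that an id may itself contain '/'.
        String[] parts = raw.substring(SCHEME.length()).split("/", 3);
        if (parts.length != 3 || parts[0].isEmpty() || parts[1].isEmpty() || parts[2].isEmpty()) {
            throw new IllegalArgumentException("Expected three segments: " + raw);
        }
        return new AipUri(parts[0], parts[1], parts[2]);
    }

    // Renders the value back into its single-string form.
    @Override
    public String toString() {
        return SCHEME + kind + "/" + provider + "/" + id;
    }

    public static void main(String[] args) {
        AipUri uri = AipUri.parse("aip://model/registry/ranking-v5");
        System.out.println(uri.kind() + " from " + uri.provider() + ": " + uri.id());
    }
}
```

Because the value round-trips through a single string, any service can store or log the reference without knowing anything about the entity behind it.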

From Events to Entities to Graph

The journey from raw system events to a queryable graph happens in stages. Let’s walk through each with a concrete example: connecting a model to its A/B tests through relationship inference.

1 Event Ingestion

MDS integrates with various source systems via Kafka and AWS SNS/SQS, consuming events in real time. Source systems emit thin events that include an identifier and an event type.

Example event:

{
  "event_type": "model_instance_created",
  "instance_id": "ranking-model-v5-20XX0101",
  ...
}

This design keeps producers simple. Source systems only need to announce that a change occurred, without building complete payloads or understanding downstream requirements.

Each source system has dedicated event handlers in MDS:

Pipeline Orchestration: Ingests pipeline execution events, including node definitions, schedules, requests, and job attempts
Model Registry: Captures model deployments, configurations, and version updates
Feature Store: Tracks feature definitions and their versions
Experimentation Platform: Monitors A/B test configurations and allocations
Datasets: Tracks ML datasets and their versions
Identity Platform: Maintains ownership and team membership information

2 Entity Enrichment

MDS implements a hydration contract for each event type. When an event arrives, MDS:

Validates the event schema
Calls the source system’s API to fetch the complete, current state
Transforms the response into a normalized entity

This design has a crucial property: the order of events doesn’t matter. MDS always fetches the latest facts from the source of truth. This pattern decouples the event stream from state consistency. If the event bus drops a message or delivers it out of order, the next event corrects the state. The event stream becomes a notification of change rather than a log of changes.

This notification of change pattern has a few important tradeoffs. On the plus side, it keeps producers simple, makes us robust to out-of-order or dropped events, and ensures that MDS can always reconcile to the latest state by reading from the source of truth. The tradeoff is that we place additional read load on source systems during hydration and need to be deliberate about rate limiting, caching, and backoff in our enrichment workers so that we don’t overload them.
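A minimal sketch of the notification-of-change pattern, under stated assumptions: HydratingHandler and ThinEvent are hypothetical names, a plain Function stands in for the source system’s API, and an in-memory map stands in for the MDS store. Rate limiting, caching, and backoff are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a handler that treats events as change notifications, not as a
// log of changes. All names here are hypothetical.
class HydratingHandler {
    // Thin event: only an event type and an identifier, as described above.
    record ThinEvent(String eventType, String instanceId) {}

    // Stand-in for a call to the source system's API.
    private final Function<String, Map<String, String>> sourceOfTruth;
    // Stand-in for the MDS store, keyed by source identifier.
    private final Map<String, Map<String, String>> store = new HashMap<>();

    HydratingHandler(Function<String, Map<String, String>> sourceOfTruth) {
        this.sourceOfTruth = sourceOfTruth;
    }

    void handle(ThinEvent event) {
        // The event carries no state, so always re-read the current state.
        Map<String, String> current = sourceOfTruth.apply(event.instanceId());
        if (current == null) {
            // Assumption: an entity missing upstream is dropped locally.
            store.remove(event.instanceId());
        } else {
            store.put(event.instanceId(), Map.copyOf(current));
        }
    }

    Map<String, String> get(String id) {
        return store.get(id);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> registry = new HashMap<>();
        registry.put("m1", Map.of("status", "DEPLOYED"));
        HydratingHandler handler = new HydratingHandler(registry::get);
        // Events arrive out of order; the stored state still matches the registry.
        handler.handle(new ThinEvent("model_instance_updated", "m1"));
        handler.handle(new ThinEvent("model_instance_created", "m1"));
        System.out.println(handler.get("m1"));
    }
}
```

Since handle never reads anything from the event except the identifier, replaying, reordering, or duplicating events leaves the store in the same final state.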

For our ranking model example, when the model_instance_created event arrives, MDS calls the Model Registry API: GET /api/v1/instances/ranking-model-v5-20XX0101

The registry responds with a full descriptor. Example response (key fields only):

{
  "id": "ranking-model-v5-20XX0101",
  "pipeline_run_id": "train-weekly-ranking-20XX0101",
  "owner_emails": ["alice@netflix.com"],
  "labels": [{"key": "team", "value": "personalization"}],
  ...
}

3 Data Transformation and Normalization

Raw events are heterogeneous and each source system has its own schema and semantics. MDS workers transform these events into a unified entity model with standardized fields.

Without normalization, downstream consumers would need to understand every source system’s schema. Normalization creates a consistent interface, allowing queries and relationships to work across all entity types. Here is an example.

Normalized MDS entity:

{
  "id": "aip://model/registry/ranking-model-v5-20XX0101",
  "pipeline_run": "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101",
  "entity_type": "ModelInstance",
  "owners": ["aip://user/identity/alice"],
  "tags": [{"tag": "team", "value": "personalization"}],
  ...
}

The normalization process standardizes field names and formats. For example, platform-specific IDs become global AIP URIs, owner_emails becomes owners with resolved user URIs, and labels become tags. Foreign keys like pipeline_run_id are transformed into entity references. However, there’s still no reference to which A/B tests are using this model. The Model Registry doesn’t track experiments, and the Experimentation Platform doesn’t track which pipeline produced a given model. This is where knowledge enrichment becomes critical.
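The mapping from the registry response to the normalized entity can be sketched as follows. The record types and URI prefixes mirror the two JSON examples above, but the class itself is hypothetical. Two simplifications are assumptions: tags are modeled as a plain map, and the user URI is derived from the local part of the owner’s email address.

```java
import java.util.List;
import java.util.Map;

// Sketch of the normalization step; types and URI-building rules are assumptions.
class ModelInstanceNormalizer {
    // Raw shape mirrors the example registry response above.
    record RawInstance(String id, String pipelineRunId,
                       List<String> ownerEmails, Map<String, String> labels) {}

    // Normalized shape mirrors the example MDS entity above.
    record Entity(String id, String pipelineRun, String entityType,
                  List<String> owners, Map<String, String> tags) {}

    // Assumption: the user id is the local part of the email address.
    private static String userUri(String email) {
        int at = email.indexOf('@');
        String local = at < 0 ? email : email.substring(0, at);
        return "aip://user/identity/" + local;
    }

    static Entity normalize(RawInstance raw) {
        List<String> owners = raw.ownerEmails().stream()
                .map(ModelInstanceNormalizer::userUri)
                .toList();
        return new Entity(
                "aip://model/registry/" + raw.id(),                       // platform id -> AIP URI
                "aip://pipeline-run/orchestrator/" + raw.pipelineRunId(), // foreign key -> reference
                "ModelInstance",
                owners,                                                   // owner_emails -> owners
                Map.copyOf(raw.labels()));                                // labels -> tags
    }

    public static void main(String[] args) {
        RawInstance raw = new RawInstance(
                "ranking-model-v5-20XX0101",
                "train-weekly-ranking-20XX0101",
                List.of("alice@netflix.com"),
                Map.of("team", "personalization"));
        System.out.println(normalize(raw));
    }
}
```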

4 Storage and Indexing

Once normalized, entities are persisted to Datomic and immediately indexed in Elasticsearch. This happens synchronously within the event processing flow.

Datomic for Caching and Relationships
Normalized entities are first written to Datomic, which serves as both a local cache and a graph database.

Why Datomic? Datomic serves as both the system of record for MDS and the working dataset for enrichment processes. Its immutable fact model means we can continuously add relationships without losing the original entity state.

What we store:

All entity attributes as facts
Entity references (foreign keys that may point to entities not yet fully resolved)
All relationships as reified edges (added by enrichment processes)
Entity lifecycle state (tracking which entities are fully enriched vs awaiting hydration)

This enables:

Complex graph traversals: Navigate from a model to its features to their data sources in a single query
Entity relationships: Join across multiple domains without N+1 query problems
Flexible schema evolution: Easy to add new entity types and attributes as the catalog grows
Progressive enrichment: Background jobs efficiently identify and process entities requiring additional hydration, enabling gradual graph completion without reprocessing fully enriched entities

In practice, we use Datomic for relationship-heavy, navigational queries such as:

Starting from this model instance, show me all upstream datasets and downstream experiments.
Given this feature, list all consuming models and their owning teams.

These queries often span multiple hops in the graph and benefit from Datomic’s immutable fact model and efficient joins across entity relationships.

Elasticsearch for Discovery
Immediately after writing to Datomic, entities are indexed in Elasticsearch to power fast, full-text search across the catalog.

What we index:

Primary fields: Entity name, description, entity type, owner names
Relationship metadata: Names of related entities (e.g., a model’s features, pipelines, A/B tests) stored in the related field
Tags: Domain-specific metadata stored as key-value pairs (e.g., team::personalization, env::production, model.state::released)

Index structure:

Single entities index: All entity types (models, features, pipelines, etc.) are indexed in one unified index, differentiated by the entityType field
Separate owners index: Dedicated index for users and groups to enable cross-entity owner searches
Relevance boosting: Exact name matches score higher than other relevant matches

This enables:

Multi-field text search across entity names, descriptions, tags, and related metadata
Relevance ranking with boosting (exact name matches score significantly higher)
Complex filtering by entity type, ownership, tags, and domain-specific attributes (stored as tags)
Fuzzy matching to handle typos and partial queries

Elasticsearch powers the entry point into the system: users typically start with a free-text search in the AIP Portal (for a model name, a team, or a domain term), and then switch to graph navigation once they land on an entity page. Indexing happens in near real-time as part of the ingestion and enrichment workflows, so changes are usually visible in the Portal with a short delay that is acceptable for interactive use.

5 Knowledge Enrichment and Graph Formation

Once entity metadata is persisted in Datomic, scheduled background processes take over to discover and materialize relationships. These enrichment jobs run periodically, scanning for uncached or partially resolved entities (entities that exist only as references without full metadata).

The enrichment workflow:

Identify candidates: Find entities marked as uncached or with unresolved references
Hydrate relationships: Query source-of-truth systems to fetch related entity details
Materialize edges: Write discovered relationships back to Datomic
Re-index: Trigger Elasticsearch indexing for updated entities
Mark as enriched: Update entity status to prevent redundant processing

This asynchronous approach allows MDS to handle the computational cost of graph formation without blocking real-time event ingestion. It also enables retry logic and gradual enrichment as new entities become available.
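The five workflow steps can be sketched as a single pass of a periodic job. This is an illustration only: EnrichmentJob, its State enum, and the in-memory maps are hypothetical stand-ins for the lifecycle state and edges that MDS keeps in Datomic, and the re-indexing step is reduced to a comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Sketch of one pass of a periodic enrichment job; all names are hypothetical.
class EnrichmentJob {
    enum State { UNCACHED, ENRICHED }

    // Stand-in for entity lifecycle state kept in the store.
    private final Map<String, State> states = new LinkedHashMap<>();
    // Stand-in for materialized edges: entity URI -> related entity URIs.
    private final Map<String, List<String>> edges = new LinkedHashMap<>();
    // Stand-in for a source-of-truth lookup; empty means "not available yet".
    private final Function<String, Optional<List<String>>> hydrate;

    EnrichmentJob(Function<String, Optional<List<String>>> hydrate) {
        this.hydrate = hydrate;
    }

    void track(String uri) {
        states.putIfAbsent(uri, State.UNCACHED);
    }

    List<String> edgesOf(String uri) {
        return edges.getOrDefault(uri, List.of());
    }

    // Returns the URIs enriched in this pass; the rest are retried next pass.
    List<String> runOnce() {
        List<String> enriched = new ArrayList<>();
        for (Map.Entry<String, State> entry : states.entrySet()) {
            if (entry.getValue() == State.ENRICHED) {
                continue;                                           // 1. identify candidates
            }
            Optional<List<String>> related = hydrate.apply(entry.getKey()); // 2. hydrate
            if (related.isEmpty()) {
                continue;                                           // left for a later pass
            }
            edges.put(entry.getKey(), List.copyOf(related.get())); // 3. materialize edges
            // 4. re-indexing in the search index would be triggered here
            entry.setValue(State.ENRICHED);                         // 5. mark as enriched
            enriched.add(entry.getKey());
        }
        return enriched;
    }

    public static void main(String[] args) {
        EnrichmentJob job = new EnrichmentJob(uri -> Optional.of(List.of()));
        job.track("aip://model/registry/ranking-v5");
        System.out.println(job.runOnce());
    }
}
```

Entities whose lookup returns nothing stay in the UNCACHED state, which is what gives the job its retry behavior without any extra bookkeeping.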

Because enrichment is asynchronous, newly discovered relationships may appear with a short delay after the underlying entities are created (typically minutes rather than seconds). We track when each entity was last enriched and surface this timestamp in the AIP Portal, so practitioners can reason about staleness and know when it’s safe to rely on a particular relationship for debugging or impact analysis.

Why enrich? Source systems are purpose-built and don’t know about entities in other domains. Enrichment discovers and materializes cross-system relationships that enable powerful lineage and impact queries.

Example: Connecting Models to A/B Tests

When MDS processes a new model instance, background enrichment jobs discover relationships through multi-hop inference:

Step 1: Direct link to pipeline

The model references a pipeline_run_id. An enrichment job hydrates the pipeline and discovers its A/B test associations: GET /api/v1/pipeline-runs/train-weekly-ranking-20XX0101

Response:

{
  "run_id": "train-weekly-ranking-20XX0101",
  "pipeline": "weekly-ranking-trainer",
  "ab_test_cells": [
    {"test_id": "12345", "cell_number": 2, "cell_name": "treatment_ranking_v5"}
  ],
  ...
}

Step 2: Discover A/B test context
The enrichment job discovers the pipeline ran for A/B test cell #2 and queries the Experimentation Platform for test details: GET /api/v1/tests/12345

{
  "test_id": "12345",
  "name": "Ranking Model v5 vs v4",
  "status": "ACTIVE",
  "cells": [{"cell_number": 1, "name": "control_ranking_v4"}],
  ...
}

Step 3: Infer transitive relationships
The enrichment job now has the complete chain:

Model Instance was produced by Pipeline Run
Pipeline Run was executed for A/B Test Cell #2
The A/B Test Cell #2 belongs to A/B Test “Ranking Model v5 vs v4”
Model Instance now gets associated with this A/B Test

The job writes the inferred relationship back to Datomic, materializing these edges in the graph, and then triggers re-indexing. MDS doesn’t just store what it’s told; it derives new knowledge by walking the graph in the background.
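The inference chain above amounts to a two-hop join. A minimal sketch follows, assuming the two existing edge sets are available as maps. AbTestInference, TestCell, and the parameter names producedBy and executedFor are hypothetical, and tests are identified by their raw test_id because the source does not show a URI form for them.

```java
import java.util.List;
import java.util.Map;

// Sketch of the multi-hop inference; the maps stand in for edges already in the graph.
class AbTestInference {
    // Mirrors one entry of ab_test_cells in the pipeline run response above.
    record TestCell(String testId, int cellNumber, String cellName) {}

    // producedBy:  model instance URI -> pipeline run URI that produced it
    // executedFor: pipeline run URI   -> A/B test cells the run was executed for
    static List<String> inferTestIds(String modelUri,
                                     Map<String, String> producedBy,
                                     Map<String, List<TestCell>> executedFor) {
        String runUri = producedBy.get(modelUri);
        if (runUri == null) {
            return List.of(); // no known pipeline run, nothing to infer yet
        }
        return executedFor.getOrDefault(runUri, List.of()).stream()
                .map(TestCell::testId)
                .distinct() // several cells may belong to the same test
                .toList();
    }

    public static void main(String[] args) {
        String model = "aip://model/registry/ranking-model-v5-20XX0101";
        String run = "aip://pipeline-run/orchestrator/train-weekly-ranking-20XX0101";
        List<String> tests = inferTestIds(
                model,
                Map.of(model, run),
                Map.of(run, List.of(new TestCell("12345", 2, "treatment_ranking_v5"))));
        System.out.println(tests);
    }
}
```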

Why this matters: Without MDS, answering “Which A/B tests are using this model?” requires:

Looking up the model in the Model Registry
Finding which pipeline produced it
Checking the Pipeline Orchestrator for A/B test tags
Querying the Experimentation Platform for test details

With the model lifecycle graph, it’s a single query:

query {
  model(id: "aip://model/registry/ranking-model-v5-20XX0101") {
    name
    owners { name }
    currentInstance {
      version
      pipeline {
        name
        owners { name }
      }
      features {
        edges {
          node {
            name
            data { edges { node { name } } }
          }
        }
      }
      associatedAbTests {
        name
        cells { number name }
      }
    }
  }
}

The reverse query also works: “What models are being tested in experiment 12345?”

Enabling Exploration, Not Just Search

With the Model Lifecycle Graph in place, we shift from entity search to entity exploration. Discovery isn’t just about finding a model; it’s about traversing relationships:

Start with a model, explore its features
From features, navigate to the core data driving them
From the data, trace back to the pipelines generating it
From pipelines, see which teams own and depend on them
From experiments, understand which models are being tested

For example, imagine an engineer investigating a degraded engagement metric for a personalization model. They might:

Start with the model instance powering the affected recommendations in the AIP Portal.
Inspect the model’s features and follow a suspicious feature to its upstream dataset.
From the dataset page, see that its pipeline recently had failed runs and identify the owning team.
Confirm which A/B tests are currently running this model instance to understand which members and surfaces are impacted.

Before MDS and the Model Lifecycle Graph, this required manual checks across multiple tools (model registry, pipeline orchestrator, experiment platform). Now it’s a continuous journey in a single interface.

This graph-based exploration answers questions that were previously impractical to answer:

Lineage queries: What is the complete lineage of this model, from training data to production experiments?
Impact analysis: Which models will be affected if I change this feature?
Usage discovery: Which A/B tests are using this model?
Dependency mapping: What data sources does my pipeline transitively depend on?
Deprecation planning: Which entities are no longer being used and can be retired?

Every entity has deep context: its creation time, ownership, update history, and most importantly, its relationships to other entities.

The Model Lifecycle Graph is surfaced to practitioners through the AIP Portal, a unified interface that provides full-text search across all entity types, detailed entity pages with navigable relationships, and personalized views for teams and individuals.

A typical interaction in the AIP Portal looks like:

Search: Type a model, feature, dataset, or team name into the single search box backed by Elasticsearch.
Inspect: Land on an entity page that shows key metadata (description, owners, domains, tags) alongside a relationships panel.
Explore: Click through to related entities (upstream datasets, downstream experiments, and sibling model versions) to navigate the Model Lifecycle Graph without leaving the portal.

When new entity types are introduced into MDS, the portal automatically provides baseline search, entity pages, and relationship navigation, and we can then layer on domain-specific visualizations (such as model deployment history or dataset version timelines) over time.

The Road Ahead: Open Challenges

Building the Model Lifecycle Graph is an ongoing journey. Significant challenges remain, and they represent future opportunities for us:

Tool Proliferation: As new ML tools emerge, we need robust integration patterns that scale. How do we design plugin architectures that make adding new sources seamless? If we don’t keep up with new tools, practitioners will be forced back into fragmented views, and the Model Lifecycle Graph will lose coverage and trust.
Domain-Specific Visualizations: Different entity types require distinct visualization experiences. Model pages should display deployment history, A/B test associations, and performance metrics. Feature pages should highlight data lineage and consuming models. Pipeline pages must show execution history, dependencies, and schedules. Dataset pages require versioning timelines and downstream consumers. How do we design a flexible UI framework that allows each entity type to have its own tailored experience while maintaining consistent navigation and interaction patterns across the portal? Without rich, domain-specific experiences, the portal risks becoming a generic catalog rather than a tool that ML practitioners rely on in their daily workflows.
Metadata Quality: Today, MDS ensures data consistency through source-of-truth hydration and schema validation at ingestion. Background enrichment jobs continuously infer relationships and materialize entities from source systems. However, challenges remain in ensuring completeness and timeliness at scale. When source systems fail to emit events, when ownership information becomes stale, or when entities lack descriptions and contextual metadata, the graph’s utility degrades. How do we build automated validation and enrichment systems to detect metadata anomalies, suggest missing relationships, and maintain quality benchmarks across millions of entities? Poor or stale metadata erodes practitioner trust: if the graph is incomplete or incorrect, teams will revert to ad hoc knowledge and one-off integrations rather than using MDS as their source of truth.
Advanced Relationship Inference: Beyond explicit relationships declared in source systems, how do we infer implicit connections? Can we detect that two models serve similar purposes based on shared features? Can we recommend features based on usage patterns from similar pipelines? We are in the early stages of exploring these ideas. Done well, they would turn MDS from a passive catalog into an active recommendation engine for ML assets, accelerating reuse and reducing duplicate work across domains.

Acknowledgments

This work represents the collective effort of stunning colleagues across the AI Platform organization: Emma Carney, Megan Ren, Nadeem Ahmad, Pat Oleniuk, Prateek Agarwal, Tigran Hakobyan, Yinglao Liu

Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph was originally published in Netflix TechBlog on Medium.

State of Routing in Model Serving

Netflix Technology Blog — Fri, 01 May 2026 21:03:13 GMT

By Nipun Kumar, Rajat Shah, Peter Chng

Introduction

This is the first blog post in a multi-part series that shares technical insights into how our ML model serving infrastructure powers several personalized experiences at scale across various domains (e.g., title recommendations, commerce). In this introductory blog post, we will dive into our domain-independent API abstraction and its traffic routing capabilities that the central ML model serving platform exposes to several domain-specific microservices for model inference. This singular API, or entry point, into the ML model serving platform has significantly increased the speed of innovation for iterating on newer versions of existing ML experiences, as well as enabling completely new product experiences with ML.

Machine Learning use cases powering member experiences on Netflix require rapid iteration and evolution in response to new learnings. The success of our ML model serving infrastructure largely depends on enabling researchers to rapidly experiment with new hypotheses and to release their models into production safely and at scale. Equally important is enabling multiple microservices at Netflix to obtain model inference seamlessly, without being exposed to its complexities. To achieve this in a uniform and scalable manner, we created a centralized ML serving platform. As of 2025, the platform serves hundreds of model types and versions, totaling 1 million requests per second. In this post, we’ll zoom in on a core challenge of any large-scale ML serving system: how to route traffic to the right model instance, on the right cluster shard, for the right user and use case, while preserving a simple abstraction for both client services and model researchers.

Background

Models at Netflix

To properly frame our discussion, let’s first clarify the distinction between model serving and model inference. At Netflix, the definition of an ML model has historically been somewhat unique. While model inference typically focuses only on an infer(features) -> score capability, models at Netflix act as self-contained workflows that transform inputs to outputs. A “model” encapsulates pre- and post-processing, feature computation logic, and an optional ML-trained component, all packaged in a standard format suitable for use across multiple contexts. We refer to the end-to-end execution of this workflow as model serving. This distinction matters because our routing and API abstractions operate at the level of workflows, not just individual scoring functions.
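To make the distinction concrete, here is a minimal sketch of a "model" packaged as a workflow rather than a bare scoring function. The `buildModel` helper, the step names, and the toy recency fact are all hypothetical and are not taken from the platform's actual packaging format.

```javascript
// Hypothetical sketch: a "model" as a self-contained workflow.
// buildModel and its step names are illustrative assumptions only.
function buildModel({ preprocess, computeFeatures, score, postprocess }) {
  return function serve(request, facts) {
    const cleaned = preprocess(request);
    const features = computeFeatures(cleaned, facts);
    // The ML-trained component is optional; without it, features pass through.
    const scored = score ? score(features) : features;
    return postprocess(scored);
  };
}

// Toy Continue Watching workflow: rank titles by a made-up "lastWatched" fact.
const continueWatching = buildModel({
  preprocess: (req) => ({ ...req, country: req.country.toUpperCase() }),
  computeFeatures: (req, facts) =>
    req.titleIds.map((id) => ({ id, recency: facts.lastWatched[id] ?? 0 })),
  score: (features) => features.map((f) => ({ id: f.id, score: f.recency })),
  postprocess: (scored) =>
    [...scored].sort((a, b) => b.score - a.score).map((s) => s.id),
});

const ranked = continueWatching(
  { userId: "u1", country: "us", titleIds: ["t1", "t2", "t3"] },
  { lastWatched: { t1: 10, t2: 30, t3: 20 } }
);
console.log(ranked); // [ 't2', 't3', 't1' ]
```

The caller only sees `serve(request, facts)`; everything between input and output is owned by the model package, which is the level at which routing operates.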

A few simplified examples of model serving use cases:

Use case: Personalized Continue Watching row on Netflix Homepage

Input: UserId, Country, Device ID
Output: Ranked List of movies and shows (aka title): [titleId1, titleId2, titleId3,…]

Use case: Payment Fraud Detection

Input: UserId, Country, Payment Transaction details
Output: Probability of the transaction being fraudulent

A typical flow of this serving workflow is depicted below:

To achieve this higher level of abstraction, the model definition contains a list of facts (raw, unprocessed data or observations built as states in different business workflows) that it needs to compute features, and it relies on the model serving platform to supply these facts at serving time by calling several other microservices. Likewise, during offline training, Netflix’s ML fact store provides snapshots for bulk access to facilitate feature computation.

The important takeaway from this model definition is that the calling services only need to provide standard request context (such as userId, country, device), and the relevant domain context (such as titles to rank, or payment transaction for fraud detection), and the model can itself compute features and perform inference as part of the execution flow. This common set of request contexts across domains enables them to share a standard API abstraction and standardizes how various client microservices can uniformly integrate with the serving app. Furthermore, clients are shielded from the model selection and execution, allowing the model architecture and data inputs to evolve with minimal client coordination.

This post focuses on showcasing the technical details to support this design paradigm. We’ll first describe how we implemented this abstraction with Switchboard, a centralized routing service, and then discuss the operational challenges we encountered at scale and how they led us to the Lightbulb architecture.

ML Model Serving Platform Principles

We envisioned a central model serving platform for all of Netflix’s member-facing ML Model serving needs. This ambitious effort required principled thinking to provide the right level of abstraction for both the researchers and client applications. The following ideas, which are relevant to the topic of this blog post, ensured that the platform acts as an enabler of rapid ML innovation and limits the exposure of ML model iterations to the client apps:

Model innovation independent of client apps: There should be only a one-time integration effort by the calling app with the ML serving platform for a new use case. After that, almost all model iterations, including intermediate model A/B experiments, should be mostly opaque to the calling apps. This implies that the platform should handle tasks such as model selection based on a user’s A/B allocation, fetching additional data needed by experimental models, logging for further training or observability, and more. This also benefits the ML researcher, as they only need to coordinate with one platform for model innovation.
Decouple clients from model sharding: Models are distributed across multiple serving compute cluster shards, each with its own Virtual IP (VIP) Address. Various factors, such as traffic patterns, SLAs, model architecture, and CPU/Memory availability, affect model-to-cluster mapping, and changes to this mapping result in changes to the VIP address at which a model is reachable. The serving platform should make clients agnostic to such frequent VIP address changes while ensuring high availability.
Flexible traffic routing rules: Support flexible mechanisms to introduce new traffic routing rules. This includes supporting traffic routing based on A/B experiments, providing a knob to slowly shift traffic to new models and VIP addresses, and allowing client overrides.

Introducing Switchboard

Standard out-of-the-box API Gateway solutions (such as AWS API Gateway or a standalone Service Mesh proxy) did not meet all our requirements. In particular, we needed first-class integration with Netflix’s experimentation platform, the ability to expose gRPC endpoints to clients, and the ability to use rich domain-specific context for routing customizations, which generic proxies were not designed to handle. Furthermore, the platform required customizations to model-specific lifecycle stages (shadow mode, canaries, rollbacks) to enable safe rollouts and migrations.

Hence, we embarked on building a custom service that serves as a flexible proxy layer for all traffic, handling over 1 million requests per second while maintaining high availability and reliability. We named it Switchboard.

Switchboard serves as the central entry point for the system, acting as a mandatory interface for all clients to access the appropriate model based on their context. Its role is to perform context-aware routing and to apply any configured context enrichment to the model inputs.

Here is a visual representation of the request flow from different clients to different serving clusters:

Objective Abstraction

To support this system design, we introduce the concept of an “Objective”. It’s an Enumeration defined by the serving platform that every request into the system must provide. It has three key purposes:

In short, an Objective is the serving platform’s name for a specific business use case (e.g., ContinueWatchingRanking), which decouples clients from concrete models and guides the platform’s routing and model selection decisions.
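As a rough illustration of how an Objective decouples clients from concrete models, the sketch below treats it as a frozen enumeration that every request must carry. Only `ContinueWatchingRanking` and the default Continue Watching model name appear in this post; the second Objective, its model name, the request shape, and both helper functions are assumptions.

```javascript
// Hypothetical Objective enumeration owned by the serving platform.
const Objectives = Object.freeze({
  ContinueWatchingRanking: "ContinueWatchingRanking",
  PaymentFraudDetection: "PaymentFraudDetection", // assumed name, not from the post
});

// Assumed platform-side mapping; clients never reference model names directly.
const defaultModelByObjective = {
  [Objectives.ContinueWatchingRanking]: "netflix-continue-watching-model-default",
  [Objectives.PaymentFraudDetection]: "payment-fraud-model-default", // made-up model name
};

// Reject requests without a known Objective or the standard request context.
function validateRequest(request) {
  if (!Object.values(Objectives).includes(request.objective)) {
    throw new Error(`Unknown objective: ${request.objective}`);
  }
  for (const field of ["userId", "country"]) {
    if (request.context?.[field] === undefined) {
      throw new Error(`Missing request context field: ${field}`);
    }
  }
  return request;
}

function resolveDefaultModel(request) {
  return defaultModelByObjective[validateRequest(request).objective];
}

const model = resolveDefaultModel({
  objective: Objectives.ContinueWatchingRanking,
  context: { userId: "u1", country: "US", device: "tv" },
});
console.log(model); // netflix-continue-watching-model-default
```

Because the client names only the business use case, the platform can change which model backs an Objective without any client-side change.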

Key Capabilities of Switchboard

To summarize, these are the key capabilities of Switchboard:

Common Client Abstraction: Switchboard provides a single point of contact for all our clients’ model needs. When clients wish to consume additional models for new ML applications addressing the same business need, there is no new service dependency to introduce or new clients to manage to make requests to the models. From an ML Ops perspective, this also gives us knobs to control client rate limits across model versions and manage central concurrency limits to deal with bad clients.
Context-Aware Routing: Switchboard can route a request based on a rich set of contextual features, such as the user’s current device, locale, ranking surface type (e.g., home page vs. search results), or the current A/B test a user is in.
Dynamic Traffic Splitting: It enables real-time traffic splitting for canary deployments and experimentation. This allows engineers to safely roll out a new model version to a small, controlled percentage of users before a full launch.
Model Versioning and Lifecycle Management: Switchboard inherently manages concurrent request traffic to multiple versions of the same model. This is crucial for:

Shadow Mode Testing: Routing production traffic to a new model version without affecting the user experience, enabling performance comparisons.
Instant Rollback: Immediate switching of traffic away from a problematic new model version back to a stable one.
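One common way to implement percentage-based traffic splitting is to hash a stable identifier into a bucket, so the same user always lands on the same model version. The sketch below uses that technique with an FNV-1a hash; it is an illustration under our own assumptions, not Switchboard's actual implementation, and `bucketFor`, `pickVersion`, and the `split` shape are hypothetical.

```javascript
// FNV-1a (32-bit) over UTF-16 code units: maps a user to a stable bucket in [0, 100).
function bucketFor(userId) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash % 100;
}

// split is a hypothetical config: { stable, canary, canaryPercent }.
function pickVersion(userId, split) {
  return bucketFor(userId) < split.canaryPercent ? split.canary : split.stable;
}

const split = { stable: "model-v1", canary: "model-v2", canaryPercent: 5 };
console.log(pickVersion("user-42", split));

// Instant rollback in this sketch is just setting the canary share to zero.
const rolledBack = { ...split, canaryPercent: 0 };
console.log(pickVersion("user-42", rolledBack)); // model-v1
```

Raising `canaryPercent` only ever adds users to the canary group, since each user's bucket never changes.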

But is this the whole story? Not quite. Introducing this routing layer adds complexity to our model deployment cycles. In addition, we need a mechanism to collect the context-based routing information from the researchers when they choose to deploy model variants.

The Glue — Switchboard Rules

Given that Objectives serve as the contract between clients and the serving platform, we needed a way for researchers to attach model variants, experiments, and traffic splits to those Objectives without changing client code. This is where Switchboard Rules comes in.

Model researchers define the models associated with an Objective through a JavaScript configuration, which we call Switchboard Rules. This configuration produces a set of rules (typically a JSON file) that primarily tell the serving platform the following:

The default model to use for a given Objective
A/B experiments to configure for a set of Objectives and the corresponding models to load for those experiments
Customizations to gradually shift traffic to a new model

Here is an example of an A/B test rule in the context of the Continue Watching row:

/**
 * Configuration rule written by a Model Researcher to add an A/B experiment in the Model Serving system.
 * Cell 1: Uses the default, currently productized model
 * Cell 2 and Cell 3: Use different experimental (candidate) models
 */

function defineAB12345Rule() {
    const abTestId = 12345;

    const objectives = Objectives.ContinueWatchingRanking;
    const abTestCellToModel = {
        1: {name: "netflix-continue-watching-model-default"},
        2: {name: "netflix-continue-watching-model-cell-2"},
        3: {name: "netflix-continue-watching-model-cell-3"}
    };

    return {
        cellToModel: abTestCellToModel,
        abTestId: abTestId,
        targetObjectives: [objectives],
        modelInputType: constants.TITLE_INPUT_TYPE,
        modelType: 'SCORER'
    };
}

These rules are consumed by both the Switchboard and the Model Serving clusters. Given these rules, the serving platform components can take various actions, some detailed below:

Control Plane Flow:

Assignment: Produce model-to-cluster shard assignment.
Validation: Load all specified models into the Serving Cluster Shard and validate model dependencies to ensure successful execution.
Mapping: Provide the model-to-shard VIP address mapping to Switchboard.

Data Plane Flow:

Allocation: If the request is for Objective=ContinueWatchingRanking, query the Experimentation Platform for the userId’s cell allocation.
Model Selection: Use the allocation and A/B test rule to select the appropriate model.
Request Routing: Route the request to the serving cluster shard with the selected model and context.
Model Execution (on the serving host): Run the model workflow steps and return the response.
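The first three Data Plane steps can be sketched as a single function. The rule below mirrors the A/B example from this post as plain data; the VIP addresses, the `allocations` lookup standing in for the Experimentation Platform, and the fallback to cell 1 for unallocated users are all assumptions made for illustration.

```javascript
// Rule shaped like the A/B example above, reduced to plain data.
const rule = {
  abTestId: 12345,
  targetObjectives: ["ContinueWatchingRanking"],
  cellToModel: {
    1: { name: "netflix-continue-watching-model-default" },
    2: { name: "netflix-continue-watching-model-cell-2" },
    3: { name: "netflix-continue-watching-model-cell-3" },
  },
};

// Assumed control-plane output: model name -> shard VIP (addresses are made up).
const modelToVip = {
  "netflix-continue-watching-model-default": "shard-a.example.internal",
  "netflix-continue-watching-model-cell-2": "shard-b.example.internal",
  "netflix-continue-watching-model-cell-3": "shard-b.example.internal",
};

// Stand-in for the Experimentation Platform; its real API is not described here.
function getCellAllocation(allocations, userId, abTestId) {
  return allocations[`${userId}:${abTestId}`] ?? 1; // assumed: unallocated users get cell 1
}

function route(request, allocations) {
  if (!rule.targetObjectives.includes(request.objective)) {
    throw new Error(`No rule for objective ${request.objective}`);
  }
  const cell = getCellAllocation(allocations, request.userId, rule.abTestId); // Allocation
  const model = (rule.cellToModel[cell] ?? rule.cellToModel[1]).name; // Model Selection
  return { model, vip: modelToVip[model] }; // Request Routing target
}

const allocations = { "u2:12345": 2 };
console.log(route({ objective: "ContinueWatchingRanking", userId: "u2" }, allocations));
console.log(route({ objective: "ContinueWatchingRanking", userId: "u9" }, allocations));
```

The first call resolves to the cell 2 model on its shard, while the second user, who has no allocation, falls back to the default model.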

A key highlight of this setup is the decoupling of the experimentation config from the serving platform code. This includes having an independent release cycle for the rules, separate from the code deployments. Netflix’s Gutenberg system provides an excellent ecosystem that enables a flexible pub-sub architecture, facilitating proper versioning, dynamic loading, easy rollbacks, and more. Both Switchboard and the Serving Cluster Host subscribe to the same Switchboard Rules configuration.
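To illustrate what versioning, dynamic loading, and rollback of a shared configuration look like from a subscriber's point of view, here is a small in-memory stand-in. It is not Gutenberg's API; the `VersionedConfig` class and its methods are hypothetical and exist only to show two subscribers observing the same version history.

```javascript
// Hypothetical in-memory stand-in for a versioned pub-sub config store.
// This is NOT Gutenberg's API; it only illustrates the behaviors named above.
class VersionedConfig {
  constructor() {
    this.versions = []; // every published payload, oldest first
    this.current = -1; // index of the active version
    this.subscribers = [];
  }

  subscribe(onChange) {
    this.subscribers.push(onChange);
    // Late subscribers immediately receive the active version, if any.
    if (this.current >= 0) onChange(this.versions[this.current], this.current + 1);
  }

  publish(payload) {
    this.versions.push(payload);
    this.activate(this.versions.length - 1);
  }

  rollback() {
    if (this.current <= 0) throw new Error("No earlier version to roll back to");
    this.activate(this.current - 1);
  }

  activate(index) {
    this.current = index;
    for (const notify of this.subscribers) notify(this.versions[index], index + 1);
  }
}

const rules = new VersionedConfig();
const seen = { switchboard: null, servingHost: null };
rules.subscribe((payload, version) => { seen.switchboard = version; });
rules.subscribe((payload, version) => { seen.servingHost = version; });

rules.publish({ defaultModel: "model-a" }); // version 1
rules.publish({ defaultModel: "model-b" }); // version 2
rules.rollback(); // back to version 1, no code deployment involved
console.log(seen); // { switchboard: 1, servingHost: 1 }
```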

To prevent race conditions and keep the dynamic Switchboard Rules configuration in sync across both subscribers, we use the following flow:

Evolving Challenges

Switchboard solved the primary problem of improving model iteration and innovation velocity, and provided an excellent ML serving abstraction to over 30 service clients. However, as the system scale increased, a few challenges and problems with this design became apparent:

Single point of failure: Because Switchboard sits in the critical request path, a failure there can cut off access to all serving hosts in extreme cases, such as unintentional bugs or noisy neighbors sending excessive traffic.
Why this matters: Switchboard became a shared dependency whose failure would degrade or disable multiple ML-powered experiences at Netflix.
Added latency due to additional network hop: Switchboard in the request path adds 10–20 ms of latency, depending on payload size, due to serialization and deserialization. It also exposes each request to tail latency amplification.
Why this matters: The added latency is unacceptable for some latency-sensitive clients, resulting in end-user impact due to service timeouts.
Reduced Client flexibility: Switchboard obscures visibility into client request origins from the serving clusters. Consequently, distinguishing data logged for real vs artificial traffic, which is essential for model training, is difficult and requires ongoing customization and increased MLOps overhead.
Why this matters: It makes it harder to do tenant separation and test traffic isolation.

What Next? — Lightbulb

The aforementioned challenges of operating Switchboard at scale forced us to rethink the core implementation while retaining its key features. Our goal was not to throw away Switchboard’s design, but to refactor where and how its responsibilities were executed, keeping the benefits while reducing risk and latency. In particular, we wanted to retain:

Common Client Abstraction
Decouple clients from model sharding
Flexible traffic routing rules
Lightweight system client
Single place to define model and experimentation config
Fast experimentation config propagation
Fallback and client-side caching in case of failures

However, we did want to revisit some of the previous design choices:

Remove the routing service from the direct request path: Having a single service in the active request path introduces another failure mode and limits fallback flexibility. While routing rules change infrequently, maintaining consistency comes at the cost of increased availability risks.
Separate model inputs from the request metadata: In certain cases, the request payload could be quite large. Needing to deserialize and then re-serialize the payload as it flowed through Switchboard to make a routing decision was a significant contributor to latency and increased serving costs.
Provide better isolation for the routing layer: Consolidating multiple use cases (tenants) into a single routing cluster posed two main challenges. First, errors could propagate: a surge of problematic requests from one tenant could cascade errors back to Switchboard, potentially impacting other tenants. Second, the cluster had to accommodate diverse latency requirements because the requests from different use cases varied significantly in complexity.

This required some changes to our setup flow. It remained largely unchanged, but we created separate components for Routing and Model Selection (Lightbulb):

We now take the rules for an Objective and break them into distinct sets of configuration:

Model Serving Configuration: This allows us to determine which model should be used at request time, along with the required metadata
Routing Rules: Given a model we want to serve at request time, this tells us which VIP the request should be routed to.

The Data Plane changes also reflect this separation, as we now rely on Envoy to take care of the routing details:

Envoy is already used for all egress communication between apps at Netflix, and it can route requests to different clusters (VIPs) based on the configurable Routing Rules published from our control plane. However, it lacks the information needed to make routing decisions and the ability to enrich the request body with additional serving parameters required for A/B testing model variants. We introduced Lightbulb to cover this gap:

Lightbulb consumes the minimal request context, which contains use-case information, and provides the metadata mapping required for routing at the Envoy layer.
Lightbulb resolves the request context to determine a routingKey configuration along with the ObjectiveConfig — this is where we place the model id along with other request-specific configurations required for model execution. This is done to separate the config resolution associated with the request from the placement and routing information needed to reach it on the inference cluster.
While the routingKey is added to the headers for Envoy proxy to consume, the client adds the ObjectiveConfig parameters to the request itself. This is done to avoid bloating the request headers while passing additional parameters for the model to process the request appropriately.
The routing of the actual request is performed by the Envoy proxy, which has the metadata to map the routingKey to the actual cluster VIP running the model. Because the routingKey is in a header, this determination can be made with minimal overhead.
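The split between config resolution and routing described above can be sketched as three small functions. The config shapes, the `x-routing-key` header name, the routing keys, the VIP addresses, and the idea of passing the A/B cell in as part of the request context are assumptions for illustration; the post does not specify Lightbulb's interface or the Envoy configuration.

```javascript
// Model Serving Configuration (assumed shape): model and routing key per objective and cell.
const modelServingConfig = {
  ContinueWatchingRanking: {
    1: { modelId: "netflix-continue-watching-model-default", routingKey: "cw-default" },
    2: { modelId: "netflix-continue-watching-model-cell-2", routingKey: "cw-experimental" },
  },
};

// Routing Rules (assumed shape): routingKey -> cluster VIP, as held by the proxy layer.
const routingRules = {
  "cw-default": "shard-a.example.internal",
  "cw-experimental": "shard-b.example.internal",
};

// Lightbulb's role in this sketch: minimal context in, routing metadata out.
// The (possibly large) model input payload never passes through it.
function lightbulbResolve({ objective, cell }) {
  const byCell = modelServingConfig[objective];
  if (!byCell) throw new Error(`Unknown objective: ${objective}`);
  const { modelId, routingKey } = byCell[cell] ?? byCell[1];
  return { routingKey, objectiveConfig: { modelId } };
}

// Client side: routingKey goes in a header, ObjectiveConfig goes in the request body.
function buildRequest(context, payload) {
  const { routingKey, objectiveConfig } = lightbulbResolve(context);
  return {
    headers: { "x-routing-key": routingKey }, // header name is made up
    body: { objectiveConfig, payload },
  };
}

// Proxy side: the routing decision reads only the header, never the body.
function proxyRoute(request) {
  const vip = routingRules[request.headers["x-routing-key"]];
  if (!vip) throw new Error("No route for routing key");
  return vip;
}

const request = buildRequest(
  { objective: "ContinueWatchingRanking", cell: 2 },
  { userId: "u2", country: "US" }
);
console.log(proxyRoute(request)); // shard-b.example.internal
```

Because `proxyRoute` never touches `request.body`, payload size has no effect on the routing decision, which is the property that removes the deserialize and re-serialize cost from the request path.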

These changes retain the advantages of Switchboard, such as a single integration point, abstraction of model id from use case, and context-aware routing, while addressing the challenges we observed over time.

Conclusion

The evolution from Switchboard to Lightbulb marks a significant architectural refinement in our ML model serving infrastructure. While Switchboard provided the initial abstraction layer critical for rapid innovation, its latency and single-point-of-failure risk posed scaling hurdles. The subsequent adoption of Lightbulb, a decoupled service focused solely on routing metadata, and its integration with Envoy successfully resolved these challenges. This sophisticated new architecture preserves the key benefits — seamless client integration and flexible experimentation — while ensuring reliable, efficient, and scalable delivery of personalized member experiences, positioning us well for future ML growth.

In future posts in this series, we’ll dive deeper into other aspects of our ML serving platform, including inference and feature fetching, and how they interact with the routing architecture described here.

Special thanks to Sura Elamurugu, Sri Krishna Vempati, Ed Maddox, and Sreepathi Prasanna for their invaluable feedback and partnership in iterating on this idea and bringing this blog post to life.

State of Routing in Model Serving was originally published in Netflix TechBlog on Medium.

Scaling Camera File Processing at Netflix

Netflix Technology Blog — Fri, 24 Apr 2026 15:06:01 GMT

Orchestrating Media Workflows Through Strategic Collaboration

Authors: Eric Reinecke, Bhanu Srikanth

Introduction to Content Hub’s Media Production Suite

At Netflix, we want to provide filmmakers with the tools they need to produce content at a global scale, with quick turnaround and choice from an extraordinary variety of cameras, formats, workflows, and collaborators. Every series or film arrives with its own creative ambitions and technical requirements. To reduce friction and keep productions moving smoothly, we built Netflix’s Media Production Suite (MPS) with the goal of automating repeatable tasks, standardizing key workflows, and giving productions more time to focus on creative collaboration and craftsmanship.

A critical part of this effort is how we handle image processing and camera metadata across the hundreds of hours and terabytes of camera footage that Netflix productions ingest on a daily basis. Rather than build every component from scratch, we chose to partner where it made sense, especially in areas where the industry already had trusted, battle-tested solutions.

This article explores how Netflix’s Media Production Suite integrates with FilmLight’s API (FLAPI) as the core studio media processing engine in Netflix’s cloud compute infrastructure, and how that collaboration helps us deliver smarter, more reliable workflows at scale.

Why We Built MPS

As Netflix’s production slate grew, so did the complexity of file-based workflows. We saw recurring challenges across productions:

File wrangling sapping time from creative decision-making
Inconsistent media handling across shows, regions, or vendors
Difficult-to-audit manual processes that are prone to human error
Duplication of effort as teams reinvented similar workflows for each production

Content Hub Media Production Suite was created to address these pain points. MPS is designed to:

Bring efficiency, consistency, and quality control to global productions
Streamline media management and movement from production through post-production
Reduce time spent on non-creative file management
Minimize human error while maximizing creative time

To achieve this, MPS needed a robust, flexible, and trusted way to handle camera-original media and metadata at scale.

The Right Tool for the Job

From the start, we knew that building a world-class image processing engine in-house is a significant, long-term commitment: one that would require deep, continuous collaboration with camera manufacturers and the wider industry.

When designing the system, we set out some core requirements:

Inspect, trim, and transcode original camera files and metadata for any Netflix production with trusted color science
Support a wide variety of cameras and recording formats used worldwide while staying current as new ones are released
Run well in our paved-path encoding infrastructure, enabling us to take advantage of proven compute and storage scalability with robust observability

FilmLight develops Baselight and Daylight, which are commonly used in the industry for color grading, dailies, and transcoding. Their FilmLight API (FLAPI) allows us to use that same media processing engine as a backend API.

Rather than duplicating that work, we chose to integrate. FilmLight became a trusted technology partner, and FLAPI is now a foundational part of how MPS processes media.

The Media Processing Engine

MPS is not a single application; it’s an ecosystem of tools and services that support Netflix productions globally. Within that ecosystem, the FilmLight API plays the following key roles.

1. Parsing camera metadata on ingest

Productions upload media to Netflix’s Content Hub with ASC MHL (Media Hash List) files to ensure completeness and integrity of initial ingest, but soon after, it’s important to understand the technical characteristics of each piece of media. We call this workflow phase “inspection.”

Footage ingested with MPS is inspected using FLAPI and all metadata is indexed and stored

At this stage, we:

Use FLAPI to gather camera metadata from the original camera files
Conform the workflow critical fields to Netflix’s normalized schema
Make it searchable and reusable for downstream processes

This metadata is integral to:

Matching footage based on timing and reel name for automated retrieval
Debugging (e.g., why a shot looks a certain way after processing)
Validations and checks across the pipeline
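As an illustration of the conform step, the sketch below maps differently named vendor fields onto one normalized shape and reports which workflow-critical fields are missing. The vendor keys, the normalized field names, and the `normalizeMetadata` helper are invented for this example; they are neither FLAPI output nor Netflix's actual schema.

```javascript
// Hypothetical per-camera field names mapped onto one normalized schema.
// These keys are invented for illustration, not FLAPI or Netflix field names.
const fieldAliases = {
  reelName: ["reel_name", "ReelName", "Reel"],
  startTimecode: ["start_tc", "StartTimecode", "TC In"],
  frameRate: ["fps", "FrameRate", "Frame Rate"],
};

function normalizeMetadata(raw) {
  const normalized = {};
  const missing = [];
  for (const [field, aliases] of Object.entries(fieldAliases)) {
    const key = aliases.find((alias) => raw[alias] !== undefined);
    if (key === undefined) {
      missing.push(field); // surfaced so validations can flag incomplete metadata
    } else {
      normalized[field] = field === "frameRate" ? Number(raw[key]) : String(raw[key]);
    }
  }
  return { normalized, missing };
}

const cameraA = normalizeMetadata({ reel_name: "A001", start_tc: "01:00:00:00", fps: "23.976" });
const cameraB = normalizeMetadata({ Reel: "B002", "TC In": "02:10:00:00" });
console.log(cameraA.normalized); // { reelName: 'A001', startTimecode: '01:00:00:00', frameRate: 23.976 }
console.log(cameraB.missing); // [ 'frameRate' ]
```

Once every clip shares the same field names, downstream steps such as matching by reel name and timecode can be written once rather than per camera format.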

FLAPI provides consistent, camera-aware insight into footage that may have originated anywhere in the world. Additionally, since we’re able to package FLAPI in a Docker image, we can deploy almost identical code to both cloud and our production compute and storage centers around the world, ensuring a consistent assessment of footage wherever it may exist.

2. Generating VFX plates and other deliverables

Visual effects workflows constantly push image processing pipelines to their absolute limits. For MPS to succeed, it must generate images with accurate framing, consistent color management, and correct debayering/decoding parameters — all while maintaining rapid turnaround times.

To achieve this, we leverage Netflix’s Cosmos compute and storage platform and use open standards to provide predictable and consistent creative control.

At this phase, we use the FilmLight API to:

Debayer original camera files with the correct format-specific decoding parameters
Crop and de-squeeze images using Framing Decision Lists (ASC FDL) to ensure spatial creative decisions are preserved
Apply ACES Metadata Files (AMF), providing repeatable color pipelines from dailies through finishing
Generate an array of media deliverables in varied formats

These processes are automated, repeatable, and auditable. We deliver AMFs alongside the OpenEXRs to ensure recipients know exactly what color transforms are already applied, and which need to be applied to match dailies.

Because we use FilmLight’s tools on the backend, our workflow specialists can use Baselight on their workstations to manually validate pipeline decisions for productions before the first day of principal photography.

The Media Processing Factory in the Cloud

Finding an engine that competently processes media in line with open standards is an important part of the equation. To maximize impact, we want to make these tools available to all of the filmmakers we work with. Luckily, we’re no strangers to scaled processing at Netflix, and our Cosmos compute platform was ready for the job!

Cloud-first integration

The traditional model for this kind of processing in filmmaking has been to invest in beefy computers with large GPUs and high-performance storage arrays to rip through debayering and encoding at breakneck speed. However, constraints in the cloud environment are different.

Factors that are essential for tools in our runtime environment include that they:

Are packageable as Serverless Functions in Linux Docker images that can be quickly invoked to run a single unit of work and shut down on completion
Can run on CPU-only instances to allow us to take advantage of a wide array of available compute
Support headless invocation via Java, Python, or CLI
Operate statelessly, so when things do go wrong, we can simply terminate and re-launch the worker

Operating within these constraints lets us focus on increasing throughput via parallel encoding rather than focusing on single-instance processing power. We can then target the sweet spot of the cost/performance efficiency curve while still hitting our target turnaround times.

When tools are API-driven, easily packaged in Linux containers, and don’t require a lot of external state management, Netflix can quickly integrate and deploy them with operational reliability. FilmLight API fit the bill for us. At Netflix, we leverage:

Java and Python as the primary integration languages
Ubuntu-based Docker images with Java and Python code to expose functionality to our workflows
CPU instances in the cloud and local compute centers for running inspection, rendering, and trimming jobs

While FLAPI also supports GPU rendering, CPU instances give us access to a much wider segment of Netflix’s vast encoding compute pool and free up GPU instances for other workloads.

To use FilmLight API, we bundle it in a package that can be easily installed via a Dockerfile. We then build Cosmos Stratum Functions that accept an input clip, an output location, and varying parameters such as frame ranges and AMF or FDL files when debayering footage. These functions can be quickly invoked to process a single clip or sub-segment of a clip and then shut down to free up resources.
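Processing a sub-segment of a clip per invocation implies some way of dividing a clip into frame ranges. A minimal sketch of that planning step follows; the work-item fields, the file paths, and the segment size are hypothetical and do not reflect the actual Stratum Function interface.

```javascript
// Split an inclusive frame range into segments of at most `segmentSize` frames.
function splitFrameRange(firstFrame, lastFrame, segmentSize) {
  if (segmentSize <= 0 || lastFrame < firstFrame) {
    throw new RangeError("Invalid range or segment size");
  }
  const segments = [];
  for (let start = firstFrame; start <= lastFrame; start += segmentSize) {
    segments.push({ start, end: Math.min(start + segmentSize - 1, lastFrame) });
  }
  return segments;
}

// Hypothetical work items, one per stateless function invocation.
function planRenderJobs(clip, outputLocation, segmentSize) {
  return splitFrameRange(clip.firstFrame, clip.lastFrame, segmentSize).map((frameRange) => ({
    inputClip: clip.path,
    outputLocation,
    frameRange,
  }));
}

const jobs = planRenderJobs(
  { path: "/media/A001C003.mov", firstFrame: 1001, lastFrame: 1250 }, // made-up clip
  "/renders/A001C003/",
  100
);
console.log(jobs.map((job) => job.frameRange));
// [ { start: 1001, end: 1100 }, { start: 1101, end: 1200 }, { start: 1201, end: 1250 } ]
```

Each work item is independent, so a failed segment can be retried on a fresh worker without affecting the others.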

Elastic scaling for production workloads

Production workloads are inherently spiky:

A quiet day on set may mean minimal new footage to inspect.
A full VFX turnover or pulling trimmed OCF for finishing might require thousands of parallel renders in a short time window.

By deploying FLAPI in the cloud as functions, MPS can:

Allocate compute on demand and release it when our work queue dies down
Avoid tying capacity to a fixed pool of local hardware
Smooth demand across many types of encoding workload in a shared resource pool

This elasticity lets us swarm pull requests to get them through quickly, then immediately yield resources back to lower priority workloads. Because capacity is not fixed, we avoid the pain of manually managing render queues and prioritization, even in peak production periods. All this means lightning-fast turnaround times and less anxiety around deadlines for our filmmakers.

Designed for Seasoned Pros and Emerging Filmmakers

Netflix productions range from highly experienced teams with very specific workflows to newer teams who may be less familiar with potential pitfalls in complex file-based pipelines.

MPS is designed to support both:

Industry veterans who need to configure precise, bespoke workflows and trust that underlying image processing will respect those decisions.
Productions without a color scientist on staff — those who benefit from guardrails and sane defaults that help them avoid common workflow issues (e.g., mismatched color transforms, inconsistent debayering, or incomplete metadata handling).

The partnership with FilmLight lets Netflix focus on workflow design, orchestration, and production support, while FilmLight focuses on providing competent handling of a wide variety of camera formats with world-class image science!

Collaboration and Co-Evolution

Netflix aimed to integrate MPS into a wider tool ecosystem by developing a comprehensive solution based on emerging open standards, rather than making MPS a self-contained system. Integrating FLAPI into our system requires more than an API reference–it requires ongoing partnership. FilmLight worked closely with Netflix teams to:

Align on feature roadmaps, particularly around new camera formats and open standards
Validate the accuracy and performance of key operations
Debug edge cases discovered in large-scale, real-world workloads
Evolve the API in ways that serve both Netflix and the wider industry
Create a positive feedback cycle with open standards like ACES and ASC FDL to solve for gaps when the rubber hits the road

One example of this has been with the implementation of ACES 2. FilmLight’s developers quickly provided a roadmap for support. As our engineering teams collaborated on integration, we also provided feedback to the ACES technical leadership to quickly address integration challenges and test drive updates in our pipeline.

This collaborative relationship–built on open communication, joint validation, and feedback to the greater industry–is how we routinely work with FilmLight to ensure we’re not just building something that works for our shows, but also driving a healthy tooling and standards ecosystem.

Impact

While much of this work takes place behind the scenes, its impact is felt directly by our productions. Our goal with building MPS is for producers, post supervisors, and vendors to experience:

Fewer delays caused by missing, incomplete, or incorrect media
Faster turnaround on VFX plates and other technical deliverables
More predictable, consistent handoffs between editorial, color, and VFX
Less time spent troubleshooting technical issues, and more time focused on creative review

In practice, this often shows up as the absence of crisis: the time a VFX vendor doesn’t have to request a re-delivery, or the time editorial doesn’t have to wait for corrected plates, or the time the color facility doesn’t have to reinvent a tone-mapping path because the AMF and ACES pipeline are already in place.

Looking Ahead

As camera technology, codecs, open standards, and production workflows continue to evolve, so will MPS. The guiding principles remain:

Automate what’s repeatable
Centralize what benefits from standardization
Partner where deep domain expertise already exists

The integration with FilmLight API is one example of this philosophy in action. By treating image processing as a specialized discipline and collaborating with a trusted industry partner, Netflix is delivering smarter, more reliable workflows to productions worldwide.

At its core, this partnership supports a simple goal: reduce manual workflow and tool management, giving filmmakers more time to tell stories.

Acknowledgements

This project is the result of collaboration and iteration over many years. In addition to the authors, the following people have contributed to this work:

Matthew Donato
Prabh Nallani
Andy Schuler
Jesse Korosi

Scaling Camera File Processing at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale

Netflix Technology Blog — Fri, 17 Apr 2026 15:01:02 GMT

By: Brett Axler, Casper Choffat, and Alo Lowry

In the three years since our first Live show, Chris Rock: Selective Outrage, we have witnessed an incredible expansion of our live content slate and the live operations that support it. From modest beginnings of streaming just one show per month, we are now capable of streaming over nine shows in a single day, reaching tens of millions of concurrent members. This post pulls back the curtain on the Live Operations teams that enable this rapid scale.

Humble Beginnings

In March 2023, the engineers who built Netflix’s first live streaming pipeline also operated it. There was no dedicated operations team or formal command center. All of our incident response playbooks were written for SVOD, and SLAs were not designed for the speed of live. For the first live shows on the platform, the engineers who designed what is described in earlier parts of this series monitored dashboards on laptops, coordinated over Slack, and troubleshot in real time while millions of members watched.

The physical setup matched the operational workflows: improvised. Temporary control rooms were put together in conference rooms. For larger events, Netflix rented third-party broadcast facilities, hardware control panels, multiviewers, and communication panels — the kind of infrastructure that established broadcast networks had built over decades. Every show was a team effort. Engineers and leadership at all levels were involved in every event. Each live show, regardless of size, was a massive effort to launch.

Netflix’s Early Live Operations

Last month, in March 2026, Netflix streamed the World Baseball Classic live to members in Japan. The tournament ran 47 matches over two weeks, with peak concurrent viewership exceeding 17.9 million for a single game and operations running 24/7 from permanent facilities in Los Gatos and Los Angeles, with international coverage extending to Tokyo. In March alone, Netflix launched approximately 70 live events. That is three events shy of the total number Netflix streamed live in all of 2024. The technical systems that make this possible have been covered in detail across this series. What hasn’t been told is the operational story: the people, procedures, and facilities Netflix built to run those systems in real time, under pressure, with no ability to pause or roll back.

The Architecture of Live Operations

The Architecture of Live Operations: Evolving the Broadcast Operations Center

When a technology company transitions into live broadcasting, it faces a unique challenge: blending traditional broadcast television practices with massive-scale live-streaming engineering. At the heart of this intersection is the Broadcast Operations Center (BOC).

The Transmission Operations Center in Los Angeles

The BOC serves as the critical “cockpit” for live events. It is the physical command center where a fully produced video feed is received directly from a stadium or venue and then handed off to the live streaming infrastructure. Everything from signal ingest, inspection, and conditioning to closed-captioning, graphics insertion, and ad management happens within these walls. By utilizing a hub-and-spoke model with highly redundant architectures, such as dual internet circuits and SMPTE 2022–7 seamless switching technologies, the BOC replaces direct, vulnerable paths from the venue to the live streaming pipeline, making each live event highly repeatable and far less dependent on the quirks of individual event locations.

Securing the Signal: Reliability from the Venue

Before the BOC can work its magic, we have to guarantee the video and audio feeds actually survive the journey from the production site to our facility. To ensure absolute reliability from the venue, Netflix enforces strict specifications for live signal contribution.

For any show-critical feed, meaning the primary feed our members will watch live, we require three completely discrete transmission paths. We utilize a strict hierarchy of approved transmission methods, prioritizing dedicated video fiber and single-feed satellite links, followed by dedicated enterprise-grade internet and robust SRT contribution systems.

We don’t just rely on redundant transport lines; we require full hardware redundancy out of the production truck itself. This includes using separate router line cards and discrete transmission hardware to prevent any single point of failure. Furthermore, every single piece of transmission hardware at the venue must be powered by two discrete power sources, protected by uninterruptible power supply (UPS) batteries, and surge-conditioned.
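The contribution requirements above read like a checklist, so they can be sketched as a simple validation pass. This is only an illustration: `TransmissionPath`, `contribution_issues`, and the field names are hypothetical, and the real specification covers far more (approved methods, surge conditioning, router line cards) than this sketch models.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TransmissionPath:
    method: str         # e.g. "fiber", "satellite", "internet", "srt"
    hardware_id: str    # hypothetical identifier for the transmission hardware
    power_sources: int  # number of discrete power sources feeding that hardware
    on_ups: bool        # True if protected by a UPS


def contribution_issues(paths: List[TransmissionPath],
                        required_paths: int = 3) -> List[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    issues = []
    if len(paths) < required_paths:
        issues.append(f"only {len(paths)} path(s); {required_paths} required")
    hardware = [p.hardware_id for p in paths]
    if len(set(hardware)) < len(hardware):
        issues.append("paths share transmission hardware")
    for p in paths:
        if p.power_sources < 2:
            issues.append(f"{p.method}: fewer than two power sources")
        if not p.on_ups:
            issues.append(f"{p.method}: not protected by a UPS")
    return issues


feed = [
    TransmissionPath("fiber", "enc-a", 2, True),
    TransmissionPath("satellite", "enc-b", 2, True),
    TransmissionPath("srt", "enc-c", 2, True),
]
print(contribution_issues(feed))
```

Returning a list of issues rather than a single boolean makes the result usable as a pre-show checklist during facilities checks.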

Finally, before we ever go live to millions of viewers, our operators execute exhaustive “FACS/FAX” (facilities checks) testing during rehearsals and before every show. This involves running specialized Audio/Video sync tests, latency tests, and quality tests to guarantee perfect audio and video synchronization, validating closed captions, and touring the backup switcher inputs.

Building the Human Infrastructure

Building the human operational model to run a facility like the BOC didn’t happen overnight. For a platform scaling from its very first live comedy special to streaming over 400 global events a year, the operational strategy had to undergo a massive, multi-year evolution.

Phase 1: The “All-Hands” Engineering Era. In the earliest days of live streaming, there was no dedicated operations team or formal broadcast operations center. The software engineers who wrote the code and built the live-streaming infrastructure were the same people manually operating the events on launch night. Every show was an “all-hands-on-deck” scenario. While this raw, startup-style approach worked for initial milestones, having core developers manually set up and tear down software configurations for every single broadcast was fundamentally incapable of scaling.

Phase 2: The Shift to Specialized Engineering (SOEs and BOEs). To separate event execution from core software development, the operational model matured to introduce specialized engineering teams. First, the Streaming Operations Engineering (SOE) team was established. These are highly skilled streaming engineers whose sole focus is to configure the full event on the live pipeline and support it during the broadcast. By having SOEs act as the first line of escalation, the core software developers were freed up to focus on building new live-streaming pipeline features.

However, as the physical broadcast facilities grew, it became clear that supporting the streaming pipeline wasn’t enough; the physical broadcast hardware and facility workflows needed dedicated oversight too. To solve this, Broadcast Operations Engineers (BOEs) were introduced to work alongside the SOEs. The BOE acts as the primary escalation point for all physical broadcast facility and hardware issues, overseeing the operation of all shows during a given shift.

Phase 3: The “Co-Pilot” Control Room Model. With specialized engineers in place to handle the deep technical infrastructure, the day-to-day operation of the actual video and audio feeds was handed over to dedicated operators. Initially, the Broadcast Control Rooms were structured much like an airplane cockpit.

This approach utilized a “first and second captain” workflow, pairing two Broadcast Control Operators (BCOs) together to run a single event, functioning exactly like a pilot and co-pilot. This collaborative model allowed for intense focus and high-quality execution, making it the ideal setup for running just one or two live events per day. However, as the ambition grew to stream up to 10 concurrent events a day for massive global tournaments, dedicating a pair of operators to every event simply required too much space and manpower. A new model had to be adopted.

Phase 4: The Transmission Operations Center (TOC) Fleet Model. To manage high-density event days and continuous tournament coverage, the workflow was completely reimagined with the launch of the Transmission Operations Center (TOC) model. Rather than treating every live broadcast as an isolated launch in its own room, the TOC treats live events like a fleet. It centralizes operations and distinctly separates the traditional broadcast functions from the streaming functions to maximize human efficiency.

The TOC model divides the labor across three highly specialized, tiered roles:

Transmission Control Operator (TCO): The TCO is responsible for managing all inbound signals arriving from the event venues, such as fiber optic, SRT, and satellite feeds. They ensure these incoming feeds meet strict quality, latency, and operational thresholds. Thanks to centralized dashboarding, a single TCO can manage up to five events concurrently.
Streaming Control Operator (SCO): While the TCO handles what comes in, the SCO manages what goes out. They oversee all outbound feeds, including the streams heading to the live streaming pipeline and any syndication feeds sent to third parties for commercial distribution. Like the TCOs, SCOs can manage up to five events concurrently.
Broadcast Control Operator (BCO): With the inbound and outbound transmission mechanics handled by the broader TOC, the BCO is able to focus entirely on the creative and qualitative execution of the event. Operating on a strict 1:1 ratio (one operator per event), the BCO seamlessly switches between backup inbound feeds if an issue arises, ensures audio and video remain in perfect synchronization, and performs rigorous quality control. They also monitor critical metadata, such as closed captions and digital ad-insertion messages (SCTE), right before the final polished feed is handed into the live streaming pipeline.
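To make the staffing ratios concrete, here is a small back-of-the-envelope calculation. It assumes every TCO and SCO is loaded to the stated maximum of five events, so the result is a lower bound on operators, and it ignores BOEs, SOEs, and any other supporting roles. The function names are illustrative only.

```python
import math
from typing import Dict


def toc_staffing(concurrent_events: int, events_per_tco: int = 5,
                 events_per_sco: int = 5) -> Dict[str, int]:
    """Minimum operators per role under the TOC fleet model."""
    return {
        "TCO": math.ceil(concurrent_events / events_per_tco),
        "SCO": math.ceil(concurrent_events / events_per_sco),
        "BCO": concurrent_events,  # strict one operator per event
    }


def copilot_staffing(concurrent_events: int) -> int:
    """Operators needed when two BCOs are paired on every event."""
    return 2 * concurrent_events


fleet = toc_staffing(10)
print(fleet, sum(fleet.values()), copilot_staffing(10))
```

For ten concurrent events this gives 2 TCOs, 2 SCOs, and 10 BCOs, or 14 operators, compared with 20 under the co-pilot model. The gap widens as concurrency grows, because only the BCO count scales linearly with events.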

The Big Bet Exception. While the fleet-style TOC model enables immense concurrency for daily programming, the most critical, high-visibility events, like major holiday football games, utilize a specialized Big Bet Model. For these flagship broadcasts, an entire Broadcast Operations Center is dedicated exclusively to a single event. This hyper-focused environment strips away the multi-event ratios, providing operators with advanced instrumentation and dedicated facility engineers to ensure the absolute highest level of reliability for events where failure is simply not an option.

Operational Workflow at a Glance (Courtesy of Melissa “Mouse” Merencillo)

The Live Command Center (LCC)

The Live Command Center (LCC) is not an MCR (Master Control Room). Nor is it a traditional Network Operations Center (NOC). The LCC holds the end-to-end view of quality, health metrics, and reliability for every live stream — from signal ingest at the production venue through cloud encoding, CDN delivery, and playback on member devices — and coordinates the human response when any part of that chain breaks.

What makes this hard is the data and speed requirements. Standard monitoring tools incur propagation delays of minutes. However, during a live stream, a signal degradation that goes undetected for three minutes can affect millions of members before any mitigation begins. The LCC runs a purpose-built observability stack, the Live Control Center, that aggregates telemetry from across the entire pipeline in near real time: concurrent viewer counts, start failure rates, rebuffer ratios, CDN health, encoder status, and signal path health from the contribution feed forward.

Live Control Center (Courtesy of Chris Carey)

During live events, the system ingests up to 38 million events per second. The LCC’s job is to make that volume of data meaningful and actionable for the small team of operators watching it live.

Two roles staff the LCC leading up to and during live events. LCC Operations Leads are the shift supervisors and incident commanders. They triage anomalies, make escalation decisions, and own the incident response process from detection through resolution.

Live Technical Launch Managers (TLMs) function as air traffic controllers: they maintain cross-functional context across more than 45 technical, product, and services teams from encoding, CDN, and playback to social media, customer service, and security teams. TLMs start coordinating with these teams months and sometimes years ahead of a live event to ensure escalation paths and playbooks are in place when the LCC needs to translate a CDN engineer’s concern into a product decision at 2am while a game is still in progress. Together, these roles form the operational leadership layer that keeps engineers focused on building rather than watching dashboards.

The live operations teams sort shows into three categories:

Low-Profile Events: These are lightweight, often lack new features, and anticipate low viewership. They are typically managed with a small team of 1–2 operators and automated alerting.
High-Profile Events: These are mid-tier events that warrant more attention due to their size, unique features, or anticipated viewership.
Big Bet Events: These represent the highest operational weight, such as an NFL game, with massive viewership expectations and special features. They require the full support of the LCC: a fully staffed physical operations room for the entire duration, active incident command structures, and key engineering teams on standby to support their specific product areas.

In addition to a show’s event category, the TLMs deployed a Live Operational Level (LOL) model that helps engineers determine whether they need to be on standby, live online, or even in the LCC for any given show.

Based on the show’s event category, special features, expected viewership, and overall risk, non-operational teams are put into one of four categories:

Red: Non-operational teams must remain online for the duration of the event. This is most often seen in large boxing matches and sporting events, such as the NFL Christmas Day games.

Orange: Non-operational teams are required to check in online ~30 minutes prior to the show and are asked to monitor the health of their systems through the first commercial breaks until the LCC releases them to LOL Yellow.

Yellow: Non-operational teams are not required to be online, but should be reachable by page within 2 minutes. Special PagerDuty rotations and verifications are in place to ensure these teams are reachable.

Grey: Business as usual. Teams will be contacted through their normal pager rotation if their help is needed during the show.
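The four levels can be captured as a small lookup table, which is one way such a model might be encoded in tooling. The sketch below is an assumption-laden illustration: the `LOL` enum, the `Expectation` fields, and the `release` helper are hypothetical names, and only the values stated above (the ~30 minute check-in and the 2 minute page response) come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LOL(Enum):
    RED = "red"
    ORANGE = "orange"
    YELLOW = "yellow"
    GREY = "grey"


@dataclass(frozen=True)
class Expectation:
    online_for_whole_event: bool
    checkin_minutes_before_show: Optional[int]  # None: no check-in time stated
    page_response_minutes: Optional[int]        # None: no special paging window stated


EXPECTATIONS = {
    LOL.RED: Expectation(True, None, None),
    LOL.ORANGE: Expectation(False, 30, None),
    LOL.YELLOW: Expectation(False, None, 2),
    LOL.GREY: Expectation(False, None, None),
}


def release(level: LOL) -> LOL:
    """Model the LCC releasing an Orange team to Yellow; other levels are unchanged."""
    return LOL.YELLOW if level is LOL.ORANGE else level


print(EXPECTATIONS[LOL.ORANGE], release(LOL.ORANGE))
```

How a show is assigned a level (event category, special features, expected viewership, risk) is not specified in enough detail to encode, so the sketch deliberately leaves that decision out.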

Visual Representation of LOL Levels (Courtesy of Gemini Nano Banana Pro)

By tiering events, Netflix ensures that resource allocation is proportionate to operational needs, preventing a continuous “crisis” mentality and allowing our non-operational partners to focus on their day jobs.

As of April 2026, most engineering teams are Yellow or Grey, with Ops and Site Reliability Engineers making up most of the teams online to support shows, in addition to engineers performing feature tests.

Building the Model

The first lesson from 2023 was straightforward: what worked for one show a month would not work for ten shows a week. The engineers who built the pipeline were also the ones operating it, which meant the people best positioned to fix problems were also the ones most likely to be paged at 2am. There was no operational layer to absorb that load.

In 2024, Netflix streamed 72 live events and began building the team that would eventually run them. The first version of the LCC looked nothing like it does today: a cluster of desks, monitors on stands, and laptops running dashboards, set up in the middle of the office. The TLM team was stood up to own cross-functional coordination for live launches and began formalizing the runbooks, event tiering structure, and incident management protocols that would later enable Netflix to scale operations to support hundreds of shows per year.

By the time Jake Paul vs. Mike Tyson and the first NFL Christmas Games arrived, the LCC had moved into a dedicated conference room, and partnerships with device and labs teams were producing more effective monitoring tools. But the biggest operational lesson of that period came from communications.

For Tyson/Paul, Netflix had over 300 people online across engineering, product, and business functions. Some people were online because their support was needed, while many others were just excited to be part of it. Coordinating that many people over Slack and Zoom during an active event with 64 million concurrent streams was unmanageable.

That experience drove the implementation of a squad model: defined teams with clear roles, scoped communication channels, and a single escalation path into the LCC. Around the same time, the LCC began integrating with IP-based communications systems, finally bridging the gap between the command center and the Broadcast Operations Center that had been operating largely in a fractured parallel until then.

Visual Representation of Squad Operations Model (Courtesy of Gemini Nano Banana Pro)

2025 brought 220 live events and a permanent LCC facility, along with a dedicated operations team, the Live Command Center Operations Leads. With the growing number of shows, TLMs were getting spread thin, spending more than half their week operating shows late into the evening and over weekends, then getting called back into the office at 9 am to lead critical launch meetings. The addition of the LCC Ops Leads resolved the bandwidth issue by separating planning and operations into distinct roles within a single centralized team.

As the slate continued to grow and large series like the World Baseball Classic and FIFA Women’s World Cup were announced, the vendor-operator model was introduced, creating an elastic workforce that could scale up for large series events without carrying full-time headcount year-round to support peak capacity. The key enabler was documentation: standardized runbooks and onboarding materials detailed enough that a trained operator could reach full effectiveness within their first week. WWE RAW became a weekly operation, normalizing what had previously felt exceptional. By early 2026, multi-event days were no longer a test of capacity but had become the expected operating condition.

The next chapter is international. Netflix has begun standing up regional Live Operations Center coverage to support live events outside North America, with EMEA operations soon running out of London. The model draws on the same runbooks, tooling, and escalation structures developed in Los Gatos, with follow-the-sun shift handoffs connecting EMEA and US teams across time zones. Looking further ahead, Netflix is planning to bring the LCC and BOC under one roof — a single integrated facility that combines broadcast operations and cloud monitoring into a unified space. The physical separation between those two functions has always introduced friction at the seams. Closing it is the logical next step.

Operational Principles for Live at Scale

Building a live operations discipline means accepting one constraint above all others: you cannot optimize for efficiency before you have built for reliability.

Netflix designed for quality first: Standardized runbooks, tiered event structures, pre-documented failure modes, so the 50th show runs as smoothly as the fifth. Off-the-shelf monitoring tools with propagation delays don’t meet that bar. The Netflix Live Control Center and Live Control Room platforms exist because observability at live scale is a product decision that demands the same design rigor as the pipeline it monitors, turning millions of telemetry events per second into something a small team can act on in real time. Technical systems and human systems have to scale together, and the most reliable incident response plan is always the one written before anyone needs it.

The operational model is also a cultural one. Bringing contingent operators into a proprietary tech stack requires deliberate onboarding design. The vendor model only works when documentation is built to be followed confidently by someone new within their first week. Beyond process, the most durable parts of how Netflix runs live operations reflect something the Netflix culture memo makes explicit: the best ideas come from anywhere. In practice, that means frontline operators catching issues that engineers miss, vendor staff surfacing workflow friction that improves the system for everyone who follows, and a team that treats candid feedback as standard practice rather than an exception. The technology, the slate, and the scale keep changing. The discipline stays current by staying curious and iterating on the tools, the runbooks, and the team.

Conclusion: What’s Next

With 2026 already off to a successful start in operational scaling, we’re excited to shift our focus to the upcoming launch of our new Live Broadcast Operations Center in Los Angeles and our new Live Operations Center (LOC) in West London. The LOC will initiate Netflix’s follow-the-sun coverage as live content continues to grow with over 400 live events in 2026, including the launch of 24/7 linear free-to-air broadcast channels with TF1 this summer. On the technical front, further development of automated alerting tools and monitoring by exception will continue to reduce operations’ manual workload.

In 2023, the engineers led the operations. By 2026, they had developed systems that mostly ran themselves, with a dedicated operational team ensuring they operated smoothly for millions of members. The technology behind Netflix’s Live content has been documented throughout this series, but what runs alongside the tech stack is a set of operational principles, rehearsed incident management processes, and monitoring infrastructure that had to be created from scratch and continues to develop.

A special thanks to Te-Yuan Huang, Rob Saltiel, Tara Kozuback, Chris Carey, Di Li, Patrick Li, Anne Aaron, and Melissa “Mouse” Merencillo for their support on this article.

The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Evaluating Netflix Show Synopses with LLM-as-a-Judge

Netflix Technology Blog — Fri, 10 Apr 2026 16:26:01 GMT

by Gabriela Alessio, Cameron Taylor, and Cameron R. Wolfe

Introduction

When members log into Netflix, one of the hardest choices is what to watch. The challenge isn’t a lack of options — there are thousands of titles — but finding the most intriguing one is complex and deeply personal. To help, we surface personalized promotional assets, especially the show synopsis — a brief description highlighting key plot elements, with cues like genre or talent.

Strong synopses help members scan, understand, and choose. Poor synopses frustrate, mislead, and drive abandonment. Ensuring high-quality synopses is essential, but scaling quality validation is hard. We host hundreds of thousands of synopses, usually with multiple variants per show. We need to ensure quality at scale so every member gets a consistently great experience every time they read a synopsis. This approach helps us scale high‑quality synopsis coverage for our rapidly expanding catalog, enabling greater speed and coverage without sacrificing quality.

This report outlines our LLM-based approach for evaluating synopsis quality. Using recent advances in agents, reasoning, and LLM-as-a-Judge, we score four key synopsis quality dimensions, achieving 85%+ agreement with creative writers. Additionally, we show that higher LLM-judged quality is correlated with key streaming metrics, allowing us to proactively identify and fix impactful issues weeks or months before a show debuts on Netflix.

The Making of a “Good” Synopsis

Writing high-quality synopses requires creative expertise. Our expert creative leads are best positioned to craft the creative approaches and define quality standards. However, AI can help us consistently evaluate these expert-driven quality criteria at scale. Synopsis quality at Netflix, which our system aims to predict, is viewed along two dimensions:

Creative Quality: members of our creative writing team assess synopsis quality according to our internal writing guidelines and rubrics.
Member Implicit Feedback: we measure the relative impact of a particular show synopsis on core streaming metrics.

These two definitions of quality capture distinct and important aspects of quality, one focused upon creative excellence and the other upon utility to members.

Creative Quality

For this project, we evaluate synopses against a subset of our creative writing quality rubric — the same criteria to which human writers would adhere. These quality rubrics change over time as quality standards evolve. Given Netflix’s distinctive voice and elevated editorial standards, the quality bar is high. Each criterion has extensive guidelines with examples across regions, genres, and synopsis types.

Human evaluation. We began by partnering with a group of creative writing experts to iteratively refine our definition of creative quality. We initially labeled ~1,000 diverse synopses, where three expert writers scored each against the criteria and explained their ratings. Due to the subjectivity of the task, early instance-level agreement was low. To reach a better consensus, we conducted calibration rounds (~50 synopses per round), surfaced disagreements, and evolved our quality scoring guidelines. Key interventions that were found to improve agreement include:

Using binary scores (instead of 1–4 Likert scores).
Allowing writers to reference past examples.
Maintaining a searchable taxonomy of common errors.

Golden evaluation data. After eight calibration rounds, writer agreement reached ~80%. To further stabilize labels, we used a model-in-the-loop consensus where:

Multiple writers score each synopsis.
An LLM, guided by the rubric, aggregates to a final label.
Writers review cases with substantial disagreement.

The result is a golden set of ~600 synopses with binary, criteria-level scores and explanations — our North Star for aligning an LLM judge with expert opinion.
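The text does not say exactly how writer agreement was measured, so the following is only one plausible reading: pairwise percent agreement over binary labels, plus a helper that flags synopses where writers were not unanimous as a stand-in for "substantial disagreement". Both function names and the sample labels are made up for illustration.

```python
from itertools import combinations
from typing import List, Sequence


def pairwise_agreement(labels_per_item: Sequence[Sequence[int]]) -> float:
    """Fraction of writer pairs, across all items, that gave the same binary label."""
    agree = 0
    total = 0
    for labels in labels_per_item:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more labels")
    return agree / total


def non_unanimous(labels_per_item: Sequence[Sequence[int]]) -> List[int]:
    """Indices of items where the writers did not all agree."""
    return [i for i, labels in enumerate(labels_per_item) if len(set(labels)) > 1]


# Made-up labels: four synopses, three writers each, for a single criterion.
labels = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 1, 1]]
print(pairwise_agreement(labels), non_unanimous(labels))
```

Percent agreement does not correct for chance, so a chance-corrected statistic may be a better fit when one label dominates; that choice is outside what the text describes.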

Member Implicit Feedback

Netflix gauges implicit member feedback on a synopsis with two metrics:

Take Fraction: how often members who see a title’s synopsis choose to start watching it.
Abandonment Rate: how often members start a title but stop watching soon after.

Higher take fraction indicates more choosing, while lower abandonment suggests authentic, non-misleading presentation. Both of these metrics have been validated via A/B testing to serve as short-term behavioral proxies for long-term member retention. As part of evaluating our system, we also study the ability of LLM-derived quality scores to predict short-term engagement metrics. This step confirms that our scores capture behaviorally meaningful signals and assesses our ability to forecast member response to a given synopsis.
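Read literally, both metrics are simple ratios. The sketch below computes them from aggregate counts; the exact production definitions (for example, what counts as an impression, or how soon "soon after" is) are not given in the text, so the argument names and the counts used here are assumptions.

```python
from typing import Optional


def take_fraction(impressions: int, starts: int) -> Optional[float]:
    """Share of synopsis impressions that led to the member starting the title."""
    if impressions == 0:
        return None  # undefined without any impressions
    return starts / impressions


def abandonment_rate(starts: int, early_stops: int) -> Optional[float]:
    """Share of starts where the member stopped watching soon after."""
    if starts == 0:
        return None  # undefined without any starts
    return early_stops / starts


# Made-up counts for a single synopsis variant.
print(take_fraction(2000, 300), abandonment_rate(300, 60))
```

The two ratios pull in different directions: a misleading synopsis can raise take fraction while also raising abandonment, which is why they are read together.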

Scaling Quality Scoring with LLM-as-a-Judge

We begin our experiments by creating simple, per-criterion prompts that:

Supply criterion-specific show metadata.
Summarize the relevant quality guidelines.
Use zero-shot chain-of-thought prompting to elicit an explanation.
Request a binary decision for the synopsis.
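The four prompt components above can be sketched as a plain template. The wording, the `SCORE:` marker, and the `build_judge_prompt` helper are illustrative assumptions, not the prompts used in production, which were further tuned as described later.

```python
from typing import Mapping


def build_judge_prompt(criterion: str, guidelines: str,
                       metadata: Mapping[str, str], synopsis: str) -> str:
    """Assemble a single-criterion judge prompt: metadata, guidelines, reasoning, score."""
    metadata_lines = "\n".join(f"- {key}: {value}" for key, value in metadata.items())
    return (
        f"You are evaluating a show synopsis on one criterion: {criterion}.\n\n"
        f"Show metadata:\n{metadata_lines}\n\n"
        f"Guidelines:\n{guidelines}\n\n"
        f"Synopsis:\n{synopsis}\n\n"
        # Zero-shot chain of thought: ask for the explanation before the score.
        "Think step by step and explain your reasoning first.\n"
        "Then, on the final line, write 'SCORE: 1' if the synopsis meets the "
        "criterion or 'SCORE: 0' if it does not."
    )


prompt = build_judge_prompt(
    criterion="tone",
    guidelines="The synopsis should match the tone of the show.",
    metadata={"genre": "comedy"},
    synopsis="A down-on-his-luck chef bets everything on a food truck.",
)
print(prompt)
```

Keeping one template per criterion means only the criterion-relevant metadata and guidelines are passed in, which keeps each prompt short.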

Using a single prompt to evaluate all quality criteria overloads the LLM and yields poor performance; a dedicated judge for each criterion performs better. Because criteria are unique, each task has its own setup, but there are some shared components:

We use the same LLM for all criteria.
The judge always outputs an explanation before its final score.
Final scores are binary.

Due to our use of binary scoring, judges can be evaluated with simple accuracy metrics over the golden dataset. Next, we summarize the experiments that led to our final system.
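With binary labels, accuracy is just the fraction of synopses where the judge matches the golden label. A minimal sketch, assuming results are available as `(criterion, judge_score, golden_score)` tuples (a hypothetical layout, with made-up sample rows):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_criterion(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Per-criterion fraction of rows where the judge score equals the golden score."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for criterion, judge_score, golden_score in rows:
        total[criterion] += 1
        correct[criterion] += int(judge_score == golden_score)
    return {criterion: correct[criterion] / total[criterion] for criterion in total}


# Made-up rows, not real evaluation data.
rows = [
    ("tone", 1, 1), ("tone", 0, 0), ("tone", 1, 0), ("tone", 1, 1),
    ("clarity", 0, 1), ("clarity", 1, 1),
]
print(accuracy_by_criterion(rows))
```

Reporting accuracy per criterion rather than pooled makes it visible when one judge lags the others.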

Prompt optimization. Because LLMs are sensitive to prompt phrasing, we apply Automatic Prompt Optimization (APO) over a ~300-sample dev set. Scoring guidelines are provided as additional context to the prompt optimizer. After APO, we manually refine candidate prompts with the help of an LLM, yielding initial prompts with accuracies shown below. These prompts work well for some criteria (e.g., precision) but poorly for others (e.g., clarity), highlighting criterion-specific nuances.

Improved reasoning. Many failures of our initial system arise from inaccurate reasoning on highly subjective evaluation examples. To improve reasoning accuracy, we leverage two forms of inference-time scaling:

Longer rationales: increase the length of the rationale or explanation generated by the LLM prior to producing a final score.
Consensus scoring: sample several outputs from the LLM and aggregate their scores to produce the final result.

Tiered rationales. Using tone as an example, we tested whether longer rationales are helpful by defining three rationale length tiers (shown above) and comparing their accuracies. Accuracy rises with longer rationales but returns are diminishing. Medium rationales noticeably outperform short ones, while long rationales offer only a slight additional gain; see below.

Longer rationales improve performance but degrade human-readability, which is problematic given that explanations are key pieces of evidence for creative experts. As a solution, we adopt tiered rationales: the judge reasons at any length but concisely summarizes its reasoning process prior to the final score. Tiered rationales preserve the benefits of extended reasoning, make outputs easier to inspect, and even benefit scoring accuracy. For example, our tone evaluator improves from 86.55% to 87.85% binary accuracy when using tiered rationales.

Consensus scoring. We can also allocate more inference-time compute by sampling multiple outputs per synopsis and aggregating their scores. We aggregate via a rounded average to ensure that the final score remains binary. For tone and clarity criteria with tiered rationales, 5× consensus scoring yields a clear accuracy boost as shown below.
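The rounded-average aggregation described above can be sketched as follows. This is a minimal illustration, assuming each sampled judge output has already been parsed into a 0 or 1; the function name consensus_score and the tie rule (ties round up to 1) are assumptions of the sketch, not details from the system. With an odd number of samples such as 5, ties cannot occur.

```python
from typing import Sequence


def consensus_score(scores: Sequence[int]) -> int:
    """Aggregate binary judge scores (0 or 1) into one binary score.

    Uses a rounded average. Ties (possible only with an even number of
    samples) round up to 1; that tie rule is an assumption of this sketch.
    """
    if not scores:
        raise ValueError("need at least one sampled score")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("scores must be binary (0 or 1)")
    # Compare 2 * sum against the count to avoid floating-point rounding.
    return 1 if 2 * sum(scores) >= len(scores) else 0


# Five sampled outputs for one synopsis: three passes, two fails.
print(consensus_score([1, 0, 1, 1, 0]))  # 1
```

Comparing integers rather than calling round() sidesteps Python's round-half-to-even behavior, which would otherwise send an even split to 0.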

Consensus scoring on the precision evaluator, which uses a vanilla (short) chain-of-thought, yields no benefit. As an explanation, we notice that longer rationales increase variance in scores across multiple outputs, while short rationales yield consistent scores. Consensus may be most useful for evaluators with longer rationales, where it helps to stabilize score variance. When shorter rationales are used, all scores tend to be the same, making consensus less meaningful.

What about reasoning models? While our setup elicits reasoning from a standard LLM, we also explored quality scoring with true reasoning models (i.e., models that generate long reasoning trajectories prior to final output). For tone, using a reasoning model with 5× consensus yields improving accuracy with increasing reasoning effort, even outperforming tiered rationales at the highest reasoning effort; see below. However, we skip reasoning models in our final system, as they significantly increase inference costs for only a marginal performance gain.

Agents-as-a-Judge for factuality. Synopses have four common types of factuality errors:

Incorrect plot information.
Incorrect metadata (e.g., genre, location, release date).
Incorrect on- or off-screen talent.
Incorrect award information.

Detecting these factuality errors requires comparing the synopsis to ground-truth context, where the necessary context varies per criterion. For example, plot information requires a plot summary or script, while award information needs a list of awards. As we have learned, simplicity drives reliability: too much context or too many criteria harm accuracy. Motivated by this idea, we adopt factuality agents, where each agent evaluates one narrow aspect of factuality.

An agent receives context tailored to one facet of factuality and produces both a rationale and a binary factuality score. The final score of the Agents-as-a-Judge system is the minimum factuality score across agents — any failed aspect yields an overall fail. All rationales are fed to an LLM aggregator to produce a combined rationale to accompany the final score. As shown below, leveraging factuality agents significantly benefits scoring accuracy. Further benefits are achieved by using tiered rationales and consensus scoring within each agent.
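The minimum-score rule can be illustrated with a short sketch. The AgentVerdict type, the aspect names, and combine_verdicts are hypothetical names introduced here; the LLM aggregator that merges rationales is omitted, and the sketch simply collects the rationales of failing agents instead.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AgentVerdict:
    aspect: str      # e.g. "plot", "metadata", "talent", "awards"
    score: int       # 1 = factual for this aspect, 0 = error found
    rationale: str


def combine_verdicts(verdicts: List[AgentVerdict]) -> Tuple[int, List[str]]:
    """Return the overall score and the rationales of failing agents."""
    if not verdicts:
        raise ValueError("at least one agent verdict is required")
    overall = min(v.score for v in verdicts)  # any 0 forces an overall 0
    failing = [f"{v.aspect}: {v.rationale}" for v in verdicts if v.score == 0]
    return overall, failing


verdicts = [
    AgentVerdict("plot", 1, "matches the plot summary"),
    AgentVerdict("awards", 0, "award not found in the awards list"),
]
print(combine_verdicts(verdicts))
```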

Final system. In summary, our automatic evaluation system uses a combination of standard LLM-as-a-Judge, tiered rationales, consensus scoring, and Agents-as-a-Judge to maximize binary scoring accuracy for each criterion. A summary of the techniques used for each criterion and the associated binary scoring accuracy is provided below.

Member Validation of LLM-as-a-Judge

Beyond expert agreement, we also study how LLM-as-a-Judge scores relate to member behavior. This analysis serves two goals:

Further validating LLM-judge accuracy.
Linking creative quality to member-perceived quality.

Framed as predictors of member outcomes, LLM judges help us assess how promotional assets affect viewing and determine which creative attributes matter most to members discovering content they enjoy. To perform this analysis, we take advantage of the fact that most shows have multiple, personalized synopses (i.e., a synopsis “suite”). Using this suite, we can measure the causal effect of synopsis selection on metrics like take fraction and abandonment rate.

Our methodology. We correlate synopsis performance (take fraction or abandonment) with LLM quality scores. Specifically, within each show s, we relate changes in a synopsis’s LLM score to changes in its performance, normalizing by the show-level standard deviation and clustering standard errors by show; see below.

β captures the average association between within-show changes in LLM score and changes in performance. While we don’t have clean, experimental variation in LLM scores, this analysis still validates predictive value and practical utility.

Member-focused results. We report correlations for individual LLM criteria and a “Weighted Score” that combines all criteria to reduce noise and maximize signal from behavioral data. As shown below, results show promising prediction of take fraction and abandonment. Precision and clarity are especially predictive, and the weighted score provides a statistically useful signal of higher take and lower abandonment. In short, LLM evaluators capture factors that matter to members, making them a valuable tool for monitoring synopsis quality and engagement.

Closing Remarks

The LLM-as-a-Judge system used to evaluate show synopses at Netflix is the result of extensive experimentation grounded in both creative expertise and member outcomes. Building an automatic evaluation system that works reliably in practice is hard, and the approach we have described reflects countless lessons learned through iteration to improve accuracy and scalability. We have validated the system extensively with human evaluation at both the system and component levels, and we have shown that its outputs correlate with key streaming metrics. As a result, we are confident that it captures the dimensions of synopsis quality that matter most — both creatively and from the member perspective — which has driven its widespread adoption in the Netflix synopsis authoring workflow.

Evaluating Netflix Show Synopses with LLM-as-a-Judge was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale

Netflix Technology Blog — Mon, 06 Apr 2026 22:15:14 GMT

By Ben Sykes

In a previous post, we described how Netflix uses Apache Druid to ingest millions of events per second and query trillions of rows, providing the real-time insights needed to ensure a high-quality experience for our members. Since that post, our scale has grown considerably.

With our database holding over 10 trillion rows and regularly ingesting up to 15 million events per second, the value of our real-time data is undeniable. But this massive scale introduced a new challenge: queries. The live show monitoring, dashboards, automated alerting, canary analysis, and A/B test monitoring that are built on top of Druid became so heavily relied upon that the repetitive query load started to become a scaling concern in itself.

This post describes an experimental caching layer we built to address this problem, and the trade-offs we chose to accept.

The Problem

Our internal dashboards are heavily used for real-time monitoring, especially during high-profile live shows or global launches. A typical dashboard has 10+ charts, each triggering one or more Druid queries; one popular dashboard with 26 charts and stats generates 64 queries per load. When dozens of engineers view the same dashboards and metrics for the same event, the query volume quickly becomes unmanageable.

Take the popular dashboard above: 64 queries per load, refreshing every 10 seconds, viewed by 30 people. That’s 192 queries per second from one dashboard, mostly for nearly identical data. We still need Druid capacity for automated alerting, canary analysis, and ad-hoc queries. And because these dashboards request a rolling last-few-hours window, each refresh changes slightly as the time range advances.

Druid’s built-in caches, both the full-result cache and the per-segment cache, are effective. But neither is designed to handle the continuous, overlapping time-window shifts inherent to rolling-window dashboards. The full-result cache misses for two reasons:

If the time window shifts even slightly, the query is different, so it’s a cache miss.
Druid deliberately refuses to cache results that involve realtime segments (those still being indexed), because it values deterministic, stable cache results and query correctness over a higher cache hit rate.

The per-segment cache does help avoid redundant scans on historical nodes, but we still need to collect those cached segment results from each data node and merge them in the brokers with data from the realtime nodes for every query.

During major shows, rolling-window dashboards can generate a flood of near-duplicate queries that Druid’s caches mostly miss, creating heavy redundant load. At our scale, solving this by simply adding more hardware is prohibitively expensive.

We needed a smarter approach.

The Insight

When a dashboard requests the last 3 hours of data, the vast majority of that data, everything except the most recent few minutes, is already settled. The data from 2 hours ago won’t change.

What if we could remember the older portions of the result and only ask Druid for the part that’s actually new?

This is the core idea behind a new caching service that understands the structure of Druid queries and serves previously-seen results from cache while fetching only the freshest portion from Druid.

A Deliberate Trade-Off

Before diving into the implementation, it’s worth being explicit about the trade-off we’re making. Caching query results introduces some staleness: up to 5 seconds for the newest data. This is acceptable for most of our operational dashboards, which refresh every 10 to 30 seconds. In practice, many of our queries already set an end time of now-1m or now-5s to avoid the “flappy tail” that can occur with currently-arriving data.

Since our end-to-end data pipeline latency is typically under 5 seconds at P90, a 5-second cache TTL on the freshest data introduces negligible additional staleness on top of what’s already inherent in the system. We decided it was better to accept this small amount of staleness in exchange for significantly lower query load on Druid. But a 5s cache on its own is not very useful.

Exponential TTLs

Not all data points are equally trustworthy. In real-time analytics, there’s a well-known late-arriving data problem. Events can arrive out of order or be delayed in the ingestion pipeline. A data point from 30 seconds ago might still change as late-arriving events trickle in. A data point from 30 minutes ago is almost certainly final.

We use this observation to set cache TTLs that increase exponentially with the age of the data. Data less than 2 minutes old gets a minimum TTL of 5 seconds. After that, the TTL doubles for each additional minute of age: 10 seconds at 2 minutes old, 20 seconds at 3 minutes, 40 seconds at 4 minutes, and so on, up to a maximum TTL of 1 hour.

The effect is that fresh data cycles through the cache rapidly, so any corrections from late-arriving events in the most recent couple of minutes are picked up quickly. Older data lingers much longer, because our confidence in its accuracy grows with time.

For a 3-hour rolling window, the exponential TTL ensures the vast majority of the query is served from the cache, leaving Druid to only scan the most recent, unsettled data.
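The TTL schedule described in this section can be written as a small function. This is a sketch under one assumption not stated above: the TTL steps at whole-minute boundaries rather than growing continuously. The names bucket_ttl_seconds, MIN_TTL_SECONDS, and MAX_TTL_SECONDS are illustrative.

```python
MIN_TTL_SECONDS = 5
MAX_TTL_SECONDS = 3600  # 1 hour


def bucket_ttl_seconds(age_seconds: float) -> int:
    """TTL for a cached bucket whose data is `age_seconds` old."""
    if age_seconds < 0:
        raise ValueError("age must be non-negative")
    age_minutes = int(age_seconds // 60)  # whole minutes (assumed step size)
    if age_minutes < 2:
        return MIN_TTL_SECONDS
    # One doubling at 2 minutes old, plus one more per additional minute.
    doublings = age_minutes - 1
    if doublings >= 10:  # 5 * 2**10 = 5120 already exceeds the 1-hour cap
        return MAX_TTL_SECONDS
    return min(MAX_TTL_SECONDS, MIN_TTL_SECONDS * 2 ** doublings)


for minutes in (0, 1, 2, 3, 4, 10, 11, 180):
    print(minutes, bucket_ttl_seconds(minutes * 60))
```

Under this schedule the cap is reached at 11 minutes of age (2560 seconds at 10 minutes, then 1 hour), so in a 3-hour window all but roughly the most recent ten minutes of buckets carry the maximum TTL.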

Bucketing

If we were to use a single-level cache key for the query and interval, similar to Druid’s existing result-level cache, we wouldn’t be able to extract only the relevant time range from cached results. A shifted window means a different key, which means a cache miss.

Instead, we use a map-of-maps. The top-level key is the query hash without the time interval; the inner keys are timestamps bucketed to the query granularity (or 1 minute, whichever is larger) and encoded as big-endian bytes so lexicographic order matches time. This enables efficient range scans: fetching all cached buckets between times A and B for a query hash. A 3-hour query at 1-minute granularity becomes 180 independent cached buckets, each with its own TTL; when the window shifts (e.g., 30 seconds later), we reuse most buckets from cache and only query Druid for the new data.
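To make the key layout concrete, here is a minimal in-memory sketch of the map-of-maps with big-endian bucket keys. It is a stand-in for illustration only: BucketCache, bucket_start_ms, and encode_key are hypothetical names, timestamps are assumed to be non-negative epoch milliseconds, and per-bucket TTLs are left out.

```python
import struct
from bisect import bisect_left
from typing import Dict, List, Tuple


def bucket_start_ms(timestamp_ms: int, granularity_ms: int) -> int:
    """Align a timestamp to its bucket; buckets are at least 1 minute wide."""
    width = max(granularity_ms, 60_000)
    return timestamp_ms - (timestamp_ms % width)


def encode_key(bucket_ms: int) -> bytes:
    """8-byte big-endian encoding, so byte order equals time order."""
    return struct.pack(">Q", bucket_ms)


class BucketCache:
    """In-memory stand-in for a two-level map: query hash -> bucket -> value."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[bytes, bytes]] = {}

    def put(self, query_hash: str, bucket_ms: int, value: bytes) -> None:
        self._data.setdefault(query_hash, {})[encode_key(bucket_ms)] = value

    def range_scan(
        self, query_hash: str, start_ms: int, end_ms: int
    ) -> List[Tuple[bytes, bytes]]:
        """Return (key, value) pairs with start <= bucket < end, in time order."""
        inner = self._data.get(query_hash, {})
        keys = sorted(inner)  # lexicographic sort of big-endian keys
        lo = bisect_left(keys, encode_key(start_ms))
        hi = bisect_left(keys, encode_key(end_ms))
        return [(k, inner[k]) for k in keys[lo:hi]]
```

A little-endian or decimal-string encoding would not sort correctly as bytes, which is why the big-endian choice matters for range scans.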

How It Works

Today, the cache runs as an external service integrated transparently by intercepting requests at the Druid Router and redirecting them to the cache. If the cache fully satisfies a request, it returns the result; otherwise it shrinks the time interval to the uncached portion and calls back into the Router, bypassing the redirect to query Druid normally. Non-cached requests (e.g., metadata queries or queries without time group-bys) pass straight through to Druid unchanged.

This intercepting proxy design allows us to enable or disable caching without any client changes and is key to its adoption. We see this setup as temporary while we work out a way to integrate this capability into Druid more natively.

When a cacheable query arrives (one that groups by time, such as timeseries or groupBy), the cache performs the following steps.

Parsing and Hashing. We parse each incoming query to extract the time interval, granularity, and structure, then compute a SHA-256 hash of the query with the time interval and parts of the context removed. That hash is the cache key: it encodes what is being asked (datasource, filters, aggregations, granularity) but not when, so the same logical query over different overlapping time windows maps to the same cache entry. Context properties that can alter the response structure or contents are kept in the hashed query, so they remain part of the cache key.
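A rough sketch of this hashing step for a native-style JSON query follows. It assumes the time range lives in a top-level "intervals" field and uses a hypothetical allow-list, CONTEXT_KEYS_IN_HASH, for the context properties that stay in the hash; the real set of retained properties is not specified here.

```python
import hashlib
import json
from typing import Any, Dict

# Hypothetical allow-list: context keys assumed to change the response.
CONTEXT_KEYS_IN_HASH = {"skipEmptyBuckets"}


def query_cache_key(query: Dict[str, Any]) -> str:
    """Hash what is asked (not when) for a native-style JSON query."""
    stripped = {k: v for k, v in query.items() if k not in ("intervals", "context")}
    context = query.get("context") or {}
    kept = {k: v for k, v in context.items() if k in CONTEXT_KEYS_IN_HASH}
    if kept:
        stripped["context"] = kept
    # Canonical JSON: sorted keys and fixed separators give stable bytes.
    canonical = json.dumps(stripped, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {
    "queryType": "timeseries",
    "dataSource": "playback_events",  # made-up datasource name
    "granularity": "minute",
    "aggregations": [{"type": "count", "name": "rows"}],
}
q1 = {**base, "intervals": ["2026-04-06T00:00Z/2026-04-06T03:00Z"], "context": {"queryId": "a"}}
q2 = {**base, "intervals": ["2026-04-06T00:01Z/2026-04-06T03:01Z"], "context": {"queryId": "b"}}
print(query_cache_key(q1) == query_cache_key(q2))  # True
```

Serializing with sorted keys matters: two clients that emit the same query with fields in a different order would otherwise produce different hashes.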

Cache Lookup. Using the cache key, we fetch cached points within the requested range, but only if they’re contiguous from the start. Because bucket TTLs can expire unevenly, gaps can appear; when we hit a gap, we stop and fetch all newer data from Druid. This guarantees a complete, unbroken result set while sending at most one Druid query, rather than “filling gaps” with multiple small, fragmented queries that would increase Druid load.

Fetching the Missing Tail. On a partial cache hit (e.g., 2h 50m of a 3h window), we rebuild the query with a narrowed interval for the missing 10 minutes and send only that to Druid. Since Druid then scans just the recent segments for a small time range, the query is usually faster and cheaper than the original.
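The contiguous-prefix lookup and the narrowed tail interval can be sketched together. This is an illustration under stated assumptions: the requested start is already aligned to a bucket boundary, the cache is modelled as a plain dict keyed by bucket start in milliseconds, and plan_lookup is a hypothetical name.

```python
from typing import Dict, List, Optional, Tuple


def plan_lookup(
    start_ms: int,
    end_ms: int,
    bucket_ms: int,
    cached: Dict[int, object],
) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Split [start, end) into a cached prefix and one missing tail.

    Returns the bucket starts served from cache and the narrowed
    (start, end) interval to request, or None on a full hit.
    """
    hits: List[int] = []
    cursor = start_ms
    while cursor < end_ms and cursor in cached:
        hits.append(cursor)
        cursor += bucket_ms
    # Stop at the first gap: everything from here on is re-fetched, even if
    # later buckets are cached, so at most one backend query is needed.
    tail = (cursor, end_ms) if cursor < end_ms else None
    return hits, tail


# Minutes 0 and 1 are cached, minute 2 expired, minute 3 is still cached.
cached = {0: "a", 60_000: "b", 180_000: "d"}
print(plan_lookup(0, 240_000, 60_000, cached))
# ([0, 60000], (120000, 240000))
```

In the example, the still-cached bucket at minute 3 is deliberately ignored, trading a little redundant work for a single contiguous query.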

Combining. The cached data and fresh data are concatenated, sorted by timestamp, and returned to the client. From the client’s perspective, the response looks identical to what Druid would have returned: same JSON format, same fields.

Asynchronous Caching. The fresh data from Druid is parsed into individual time-granularity buckets and written back to the cache asynchronously, so we don’t add latency to the response path.

Negative Caching

Some metrics are sparse. Certain time buckets may genuinely have no data. Without special handling, the cache would treat these empty buckets as gaps and re-query Druid for them every time.

We handle this by caching empty sentinel values for time buckets where Druid returned no data. Our gap-detection logic recognizes these empty entries as valid cached data rather than missing data, preventing needless re-queries for naturally sparse metrics.

However, we’re careful not to negative-cache trailing empty buckets. If a query returns data up to minute 45 and nothing after, we only cache empty entries for gaps between data points, not after the last one. This avoids incorrectly caching “no data” for time periods where events simply haven’t arrived yet, which would exacerbate the chart delays caused by late-arriving data.
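The rule for which empty buckets get a sentinel can be sketched as below. The EMPTY sentinel value and the entries_to_cache helper are hypothetical, and the sketch makes one assumption the text does not settle: empty buckets before the first data point in the fetched range are also negative-cached.

```python
from typing import Dict, List

EMPTY = b"\x00EMPTY"  # hypothetical sentinel meaning "no data in this bucket"


def entries_to_cache(
    bucket_starts: List[int],
    results: Dict[int, bytes],
) -> Dict[int, bytes]:
    """Build cache writes for the queried buckets.

    Buckets with data are cached as-is, empty buckets up to the last data
    point get the EMPTY sentinel, and buckets after it are skipped.
    """
    with_data = [b for b in bucket_starts if b in results]
    if not with_data:
        return {}  # nothing but trailing emptiness: cache nothing
    last_data = max(with_data)
    return {
        b: results.get(b, EMPTY)
        for b in bucket_starts
        if b <= last_data
    }


# Minutes 0..4 were queried; data came back only for minutes 0 and 2.
writes = entries_to_cache([0, 1, 2, 3, 4], {0: b"a", 2: b"c"})
print(writes)  # minute 1 gets the sentinel; minutes 3 and 4 are not cached
```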

The Storage Layer

For the backing store, we use Netflix’s Key-Value Data Abstraction Layer (KVDAL), backed by Cassandra. KVDAL provides a two-level map abstraction, a natural fit for our needs. The outer key is the query hash, and the inner keys are timestamps. Crucially, KVDAL supports independent TTLs on each inner key-value pair, eliminating the need for us to manage cache eviction manually.

This two-level structure gives us efficient range queries over the inner keys, which is exactly what we need for partial cache lookups: “give me all cached buckets between time A and time B for query hash X.”

Results

The biggest win is during high-volume events (e.g., live shows): when many users view the same dashboards, the cache serves most identical queries as full hits, so the query rate reaching Druid is essentially the same with 1 viewer or 100. The scaling bottleneck moves from Druid’s query capacity to the much cheaper-to-scale cache, and with ~5.5 ms P90 cache responses, dashboards load faster for everyone.

On a typical day, 82% of real user queries get at least a partial cache hit, and 84% of result data is served from cache. As a result, the queries that reach Druid scan much narrower time ranges, touching fewer segments and processing less data, freeing Druid to focus on aggregating the newest data instead of repeatedly re-querying historical segments.

An experiment validated this, showing about a 33% drop in queries to Druid and a 66% improvement in overall P90 query times. It also cut result bytes and segments queried, and in some cases, enabling the cache reduced result bytes by more than 14x. Caveat: the size of these gains depends heavily on how similar and repetitive the query workload is.

Looking Ahead

This caching layer is still experimental, but results are promising and we’re exploring next steps. We’ve added partial support for templated SQL so dashboard tools can benefit without writing native Druid queries.

Longer term, we’d like interval-aware caching to be built into Druid: an external proxy adds infrastructure to manage, extra network hops, and workarounds (like SQL templating) to extract intervals. Implemented inside Druid, it could be more efficient, with direct access to the query planner and segment metadata, and benefit the broader community without custom infrastructure. We’d likely ship it as an opt-in, configurable, result-level cache in the Brokers, with metrics to tune TTLs and measure effectiveness. Please leave a comment if you have a use-case that could benefit from this feature.

More broadly, this strategy, splitting time-series results into independently cached, granularity-aligned buckets with age-based exponential TTLs, isn’t Druid-specific and could apply to any time-series database with frequent overlapping-window queries.

Summary

As more Netflix teams rely on real-time analytics, query volume grows too. Dashboards are essential at our scale, but their popularity can become a scaling bottleneck. By inserting an intelligent cache between dashboards and Druid, one that understands query structure, breaks results into granularity-aligned buckets, and trades a small amount of staleness for much lower Druid load, we’ve increased query capacity without scaling infrastructure proportionally, and hope to deliver these benefits to the Druid community soon as a built-in Druid feature.

Sometimes the best way to handle a flood of queries is to stop answering the same question twice.

Stop Answering the Same Question Twice: Interval-Aware Caching for Druid at Netflix Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.