Introduce offloaded snapshot/restore by sboeuf · Pull Request #8264 · cloud-hypervisor/cloud-hypervisor

sboeuf · 2026-05-21T12:27:11Z

By relying on the existing local live migration support and reusing the
semantics and the protocol associated with it, we intend to provide a
way for snapshotting and restoring a VM to/from a dedicated process that
we can call the offload daemon.

By allowing an external process to perform the snapshot/restore actions on
behalf of Cloud Hypervisor, we give our users the opportunity to
implement their own offload daemon. The goal is to avoid bloating
Cloud Hypervisor with numerous features related to snapshot/restore, and
let the user decide how to perform the snapshot/restore actions. One
example is that we can decide to encrypt the guest RAM on the fly in
order to avoid writing an unencrypted version to local disk. Another
example is to be able to send guest RAM and associated state/config data
over the network without having to persist the data first to local
storage.

There might be other reasons to choose an offload daemon to
perform the snapshot/restore of the VM, but in every case, this empowers
the user to make their own choice.

Also, since we'd like to support the userfaultfd mechanism to reach
parity between the internal snapshot/restore implementation and the offload
daemon proposal, the post-copy feature had to be added to live migration.
With the live migration protocol now supporting post-copy, we can expect
remote migration over TCP to be performed with the post-copy mechanism,
and offloaded VM restore to let the daemon fault the pages in.

This is a large PR that might need to be cut into smaller pieces, but it gives a
global understanding of what the end goal is and what it takes to achieve
it.

likebreath

Very neat idea. I like how this lets us offload functionality out of the core VMM implementation.

Overall looks good, though I haven't dug into the reference restore_daemon implementation yet. One more thought: I think we should support 'keep-alive' on the offload_snapshot endpoint, which would align better with the generic 'snapshot' expectation.

sboeuf · 2026-05-22T09:43:46Z

@likebreath thanks for the quick review :)
After having an offline conversation with @rbradford, we agreed it would be even simpler to avoid introducing a new API endpoint given this was more of an alias rather than a completely new endpoint. Therefore, I've removed the commits related to ch-remote and adding the two new endpoints.
The summary is that Cloud Hypervisor can already support something like an offload daemon thanks to its migration protocol, and this PR only introduces a reference implementation for such a daemon so that we can run some integration tests.

likebreath

> we agreed it would be even simpler to avoid introducing a new API endpoint given this was more of an alias rather than a completely new endpoint.

Makes perfect sense. Some comments below about the reference daemon implementation.

sboeuf · 2026-06-03T15:36:41Z

Just a summary of the proposal from this updated PR:

Goal

We'd like a way to allow CH's users to implement the features they need for snapshot/restore (things like encrypting guest RAM on the fly, or sending the snapshot over the network instead of persisting it to local disk, etc.), without overloading CH with these features.

Offload daemon

One way we think this is achievable is by reusing the live migration protocol so that an offload daemon can behave as a destination VM (for the snapshot case), and as a source VM (for the restore case). The existing protocol gives us this ability and we've been able to verify that we can make snapshot/restore work from the offload daemon (almost) the same way it works with CH's internal snapshot/restore.

What's missing

One thing that is missing to be at parity with the current snapshot/restore support is userfaultfd. Since userfaultfd can't be entirely handled from the daemon (because the setup has to happen from CH's process to apply to the right VMAs), we must extend CH to support it. And given we're talking about using the live migration support, that basically means we would have to add post-copy (uffd) support to the current live migration protocol.

The proposal

Adding post-copy support to the current live migration protocol fits well with the live migration promise, and by adding it to the protocol, we can achieve both post-copy over the network AND fast restore from an offloaded daemon since we'd expect the daemon to serve pages on demand through the extended protocol.
I'd like to get some feedback since this is a first draft of how this could be shaped. Also, I've tried to keep things as simple as possible on the post-copy support for remote migration, but we could also think about pre-copy + post-copy if we wanted to optimize migration time.

saravan2 · 2026-06-04T18:46:46Z

+  Anonymous memory is rejected with the same error message that local
+  live migration produces.
+- Orchestrator-supplied network FDs (today carried by `vm.restore`'s
+  `net_fds` field) are **not** plumbed through `vm.receive-migration`,


Heads up that I have an upcoming PR plumbing vfio_fds through vm.receive-migration over the SCM_RIGHTS channel, mirroring how vm.restore handles fd substitution. The mechanism I implemented for vfio_fds can be extended to cover net_fds in vm.receive-migration. If you want, I can fold net_fds into my commits so that the offload daemon can receive orchestrator-supplied network FDs.

sboeuf · 2026-06-05

Ah, that's good to know! If you think that's not too much work then yes, otherwise we can still remove the limitation later.

rbradford · 2026-06-08T12:07:31Z

Still under active development so drafting.

sboeuf · 2026-06-09T13:12:55Z

Undrafted since it's now ready for reviews.

rbradford · 2026-06-10T12:02:38Z

@sboeuf Your new test failed.

sboeuf · 2026-06-10T12:04:55Z

> @sboeuf Your new test failed.

Yes it should be fixed now.

Expose VmMigrationConfig as a public facing structure that can be used by an offload daemon to act as if it was the VM to migrate to, or the VM to migrate from. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

Adding a new dedicated binary meant to serve as a reference implementation for validating that offloaded snapshot/restore works, and to be used by tests in general. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

Move next_data_extent and write_region_sparse out of memory_manager.rs into a new vmm::sparse module so the snapshot writer, the restore reader, and the offload daemon can share one implementation. No functional change intended. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

Copy only populated extents when writing the snapshot file and when filling the restore memfd, leaving unwritten ranges as holes. Both the on-disk snapshot and the restored guest RAM stay sparse, so that untouched guest pages cost no disk space or host memory. This brings the offload daemon to feature parity with CH's internal implementation of snapshot/restore. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
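To illustrate the sparse-copy idea from the commit message above, here is a minimal sketch. It is not the code from `vmm::sparse`: `populated_extents`, `write_sparse` and `PAGE_SIZE` are hypothetical names, and the sketch detects holes by scanning for all-zero pages in a buffer, which is a swapped-in technique chosen so the example only needs the standard library. The real `next_data_extent` helper may work differently (for instance by querying the file for data extents).

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Hypothetical page size used for hole detection in this sketch.
const PAGE_SIZE: usize = 4096;

/// Returns (offset, length) pairs covering the pages of `ram` that contain
/// at least one non-zero byte. Adjacent populated pages are merged.
fn populated_extents(ram: &[u8]) -> Vec<(usize, usize)> {
    let mut extents: Vec<(usize, usize)> = Vec::new();
    for (i, page) in ram.chunks(PAGE_SIZE).enumerate() {
        if page.iter().all(|&b| b == 0) {
            continue; // untouched page: leave it as a hole
        }
        let offset = i * PAGE_SIZE;
        if let Some(last) = extents.last_mut() {
            if last.0 + last.1 == offset {
                // This page directly follows the previous extent: extend it.
                last.1 += page.len();
                continue;
            }
        }
        extents.push((offset, page.len()));
    }
    extents
}

/// Writes only the populated extents to `out`, seeking over the gaps so a
/// file-backed writer ends up with holes instead of zero-filled blocks.
/// Returns the number of bytes actually written.
fn write_sparse<W: Write + Seek>(ram: &[u8], out: &mut W) -> io::Result<u64> {
    let mut written = 0u64;
    for (offset, len) in populated_extents(ram) {
        out.seek(SeekFrom::Start(offset as u64))?;
        out.write_all(&ram[offset..offset + len])?;
        written += len as u64;
    }
    Ok(written)
}
```

Passing a `std::fs::File` as `out` would leave the skipped ranges unallocated on filesystems that support sparse files. A trailing hole is not materialized by this sketch, so a real writer would also need to set the final file length.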

Extending the snapshot/restore documentation so that it explains the goals behind this offloaded snapshot/restore feature and how to use it in practice, and documents the protocol used by the offload daemon so that anyone can write their own daemon. By relying on the existing local live migration support and reusing the semantics and the protocol associated with it, we intend to provide a way for snapshotting and restoring a VM to/from a dedicated process that we can call the offload daemon. By allowing an external process to perform the snapshot/restore actions on behalf of Cloud Hypervisor, we give our users the opportunity to implement their own offload daemon. The goal is to avoid bloating Cloud Hypervisor with numerous features related to snapshot/restore, and let the user decide how to perform the snapshot/restore actions. One example is that we can decide to encrypt the guest RAM on the fly in order to avoid writing an unencrypted version to local disk. Another example is to be able to send guest RAM and associated state/config data over the network without having to persist the data first to local storage. There might be other reasons to choose an offload daemon to perform the snapshot/restore of the VM, but in every case, this empowers the user to make their own choice. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

Introducing PageFault as the new wire command needed by both post-copy live migration and on-demand (fast) restore from the offload daemon. This new command describes the destination's need to fault page content in. The request describes the page through a MemoryRange structure, and the response can be either 0 or the actual page size. If it is 0, the source had access to the guest memory and was able to copy the page content directly. If it is the actual page size, an associated payload contains the page content. We can expect local live migration and offload restore to run locally and therefore have access to the guest memory. Remote live migration over the network is the case where we would expect the page content to be sent over the wire. This command is served through an additional connection on the UNIX or TCP socket. The goal is to keep the same codepath between local and remote migrations. This additional channel allows PageFault commands to be issued asynchronously so they can be served without blocking the main connection. A connection role is introduced in order to distinguish an additional connection related to pre-copy memory from the newly introduced channel for serving post-copy requests. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
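The request/response semantics described in the commit message above could be sketched as follows. The byte layout here is an assumption made purely for illustration (two little-endian `u64` fields for the range, a little-endian `u64` length for the reply). The actual wire format in this PR may differ, and `MemoryRange`, `PageFaultReply` and the encode/decode helpers below are hypothetical stand-ins, not the types from the vm-migration crate.

```rust
/// Hypothetical stand-in for the MemoryRange structure named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MemoryRange {
    gpa: u64,
    length: u64,
}

/// Outcome of a PageFault request, following the two cases in the commit message.
#[derive(Debug, PartialEq, Eq)]
enum PageFaultReply {
    /// Response length 0: the source copied the page into guest memory directly.
    CopiedInPlace,
    /// Response length equal to the page size: the page content follows as payload.
    Payload(Vec<u8>),
}

/// Assumed layout: gpa then length, both little-endian u64.
fn encode_request(range: MemoryRange) -> [u8; 16] {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&range.gpa.to_le_bytes());
    buf[8..].copy_from_slice(&range.length.to_le_bytes());
    buf
}

fn decode_request(buf: &[u8; 16]) -> MemoryRange {
    let mut gpa = [0u8; 8];
    let mut length = [0u8; 8];
    gpa.copy_from_slice(&buf[..8]);
    length.copy_from_slice(&buf[8..]);
    MemoryRange {
        gpa: u64::from_le_bytes(gpa),
        length: u64::from_le_bytes(length),
    }
}

/// Assumed layout: little-endian u64 length, followed by that many payload bytes.
fn encode_reply(reply: &PageFaultReply) -> Vec<u8> {
    match reply {
        PageFaultReply::CopiedInPlace => 0u64.to_le_bytes().to_vec(),
        PageFaultReply::Payload(page) => {
            let mut out = (page.len() as u64).to_le_bytes().to_vec();
            out.extend_from_slice(page);
            out
        }
    }
}

/// Accepts only the two documented cases: length 0 with no payload, or
/// length == page_size with exactly one page of payload.
fn decode_reply(bytes: &[u8], page_size: usize) -> Option<PageFaultReply> {
    if bytes.len() < 8 {
        return None;
    }
    let mut len = [0u8; 8];
    len.copy_from_slice(&bytes[..8]);
    let len = u64::from_le_bytes(len) as usize;
    let payload = &bytes[8..];
    match len {
        0 if payload.is_empty() => Some(PageFaultReply::CopiedInPlace),
        n if n == page_size && payload.len() == n => {
            Some(PageFaultReply::Payload(payload.to_vec()))
        }
        _ => None,
    }
}
```

In this sketch a local peer (local live migration or an offload daemon sharing the memfd) would answer with `CopiedInPlace`, while a remote peer over TCP would answer with `Payload`, which keeps one decode path for both cases.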

Extract the page content provider out of the userfaultfd handler so it can be plugged with different backends in followup commits. No functional change intended. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

Adding the socket-backed UffdMemorySource that resolves each fault by sending a Command::PageFault request to the peer over a dedicated fault connection. This connection is brought up and ready to serve before restoring the VM. Also extending receive-migration to accept a new postcopy boolean parameter to let the destination know whether to expect post-copy migration or on-demand restore. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
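The two commits above (extracting the page content provider, then adding a socket-backed source) suggest a pluggable-backend shape along these lines. This is only a sketch under assumptions: `PageSource`, `LocalImage`, `PeerBacked` and `FaultHandler` are hypothetical names rather than the PR's actual types, and the peer connection is modelled as a closure so the example stays self-contained instead of opening a real socket or registering with userfaultfd.

```rust
use std::collections::HashMap;

/// Hypothetical page size for this sketch.
const PAGE_SIZE: usize = 4096;

/// Hypothetical trait: anything able to produce the content of a faulting page.
trait PageSource {
    fn fetch(&mut self, page_index: usize) -> Option<Vec<u8>>;
}

/// Backend reading pages from a local snapshot image held in memory.
struct LocalImage {
    data: Vec<u8>,
}

impl PageSource for LocalImage {
    fn fetch(&mut self, page_index: usize) -> Option<Vec<u8>> {
        let start = page_index.checked_mul(PAGE_SIZE)?;
        let end = start.checked_add(PAGE_SIZE)?;
        // `get` returns None when the page lies outside the image.
        self.data.get(start..end).map(|p| p.to_vec())
    }
}

/// Backend standing in for a peer reached over a dedicated fault connection.
/// The closure plays the role of "send a PageFault request, wait for the reply".
struct PeerBacked<F: FnMut(usize) -> Option<Vec<u8>>> {
    request: F,
    requests_sent: usize,
}

impl<F: FnMut(usize) -> Option<Vec<u8>>> PageSource for PeerBacked<F> {
    fn fetch(&mut self, page_index: usize) -> Option<Vec<u8>> {
        self.requests_sent += 1;
        (self.request)(page_index)
    }
}

/// Fault handler that does not care where the page content comes from.
struct FaultHandler<S: PageSource> {
    source: S,
    resolved: HashMap<usize, Vec<u8>>,
}

impl<S: PageSource> FaultHandler<S> {
    fn new(source: S) -> Self {
        FaultHandler { source, resolved: HashMap::new() }
    }

    /// Resolves a fault once; a repeated fault on the same page does not
    /// go back to the source. Returns false if the source has no such page.
    fn handle_fault(&mut self, page_index: usize) -> bool {
        if self.resolved.contains_key(&page_index) {
            return true;
        }
        match self.source.fetch(page_index) {
            Some(page) => {
                self.resolved.insert(page_index, page);
                true
            }
            None => false,
        }
    }
}
```

Keeping the handler generic over the source is what lets the same fault-handling path serve the internal restore and a peer-driven restore, which matches the stated goal of sharing one codepath.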

Add a --lazy flag to the offload daemon's restore subcommand to support the post-copy mechanism from the live migration protocol. In this lazy mode, the daemon creates empty memfds to back the guest memory and sends them over to the VMM. This allows the VM to be started quickly after the memfd is mapped into CH's address space. At runtime, when the guest accesses the pages (or when the prefault handler requests the pages), the daemon faults every page by copying the page content to its shared memory mapping. Once the page content is copied, it replies to the PageFault request to notify the VMM that it can consider the page present. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
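The daemon side of the lazy mode described above can be pictured with the following simulation. It is a sketch only: `LazyRestore` and `serve_fault` are hypothetical names, a plain `Vec<u8>` stands in for the memfd mapping shared with the VMM, and the snapshot length is assumed to be a multiple of the page size. The reply value 0 follows the "copied in place" case of the PageFault description.

```rust
use std::collections::HashSet;

/// Hypothetical page size for this sketch.
const PAGE_SIZE: usize = 4096;

struct LazyRestore {
    /// Saved guest RAM content the daemon restores from.
    snapshot: Vec<u8>,
    /// Stands in for the initially empty memfd mapping shared with the VMM.
    shared: Vec<u8>,
    /// Page-aligned offsets that have already been copied.
    present: HashSet<usize>,
}

impl LazyRestore {
    fn new(snapshot: Vec<u8>) -> Self {
        let shared = vec![0u8; snapshot.len()];
        LazyRestore { snapshot, shared, present: HashSet::new() }
    }

    /// Serves one PageFault request for the page containing `offset`.
    /// Returns the reply length (0: page copied in place, no payload),
    /// or None if the page lies outside the snapshot.
    fn serve_fault(&mut self, offset: usize) -> Option<u64> {
        let start = offset - offset % PAGE_SIZE;
        let end = start.checked_add(PAGE_SIZE)?;
        if end > self.snapshot.len() {
            return None;
        }
        // Copy the page only the first time it is requested.
        if self.present.insert(start) {
            self.shared[start..end].copy_from_slice(&self.snapshot[start..end]);
        }
        Some(0)
    }
}
```

Because nothing is copied up front, the cost of creating the restore state is independent of how much of the snapshot the guest will touch; pages that are never faulted stay unpopulated in the shared mapping.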

Adding a postcopy=on knob to the vm.send-migration endpoint so that a remote migration over TCP can resume the destination's VM and stream pages on demand instead of running the pre-copy dirty-tracking loop. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7

sboeuf requested a review from a team as a code owner May 21, 2026 12:27

sboeuf force-pushed the offload_snapshot branch from 6276033 to cb69c03 Compare May 21, 2026 13:35

phip1611 self-requested a review May 21, 2026 14:37

likebreath reviewed May 22, 2026

Comment thread cloud-hypervisor/src/bin/ch-remote.rs Outdated

Comment thread cloud-hypervisor/src/bin/ch-remote.rs Outdated

Comment thread vmm/src/lib.rs

sboeuf force-pushed the offload_snapshot branch from cb69c03 to 4dd352b Compare May 22, 2026 09:40

sboeuf force-pushed the offload_snapshot branch from 4dd352b to 2d8062b Compare May 22, 2026 09:45

likebreath reviewed May 28, 2026

sboeuf force-pushed the offload_snapshot branch from 2d8062b to 18152cb Compare June 3, 2026 14:53

sboeuf mentioned this pull request Jun 3, 2026

vm-migration: Add migration protocol versioning #8316

Open

saravan2 reviewed Jun 4, 2026

sboeuf force-pushed the offload_snapshot branch 4 times, most recently from f700950 to 069cbb4 Compare June 5, 2026 11:05

rbradford marked this pull request as draft June 8, 2026 12:07

sboeuf force-pushed the offload_snapshot branch 3 times, most recently from e53d618 to d8e6fb8 Compare June 9, 2026 13:12

sboeuf marked this pull request as ready for review June 9, 2026 13:12

sboeuf force-pushed the offload_snapshot branch from d8e6fb8 to 268583c Compare June 10, 2026 08:20

sboeuf force-pushed the offload_snapshot branch from 268583c to 5f47c5c Compare June 10, 2026 12:03

sboeuf added 3 commits June 10, 2026 08:38

vmm: Export VmMigrationConfig as public

f0d8761

ci: Add integration test for offload snapshot

790f688

sboeuf added 8 commits June 10, 2026 08:38

sboeuf force-pushed the offload_snapshot branch from 5f47c5c to cd9aea5 Compare June 10, 2026 15:39
