Introduce offloaded snapshot/restore #8264
likebreath
left a comment
Very neat idea. I like how this lets us offload functionality out of the core VMM implementation.
Overall looks good, though I haven't dug into the reference restore_daemon implementation yet. One more thought: I think we should support 'keep-alive' on the offload_snapshot endpoint, which would align better with the generic 'snapshot' expectation.
@likebreath thanks for the quick review :)
likebreath
left a comment
we agreed it would be even simpler to avoid introducing a new API endpoint given this was more of an alias rather than a completely new endpoint.
Makes perfect sense. Some comments below about the reference daemon implementation.
Just a summary of the proposal from this updated PR:

Goal
We'd like a way to allow CH's users to implement the features they need for snapshot/restore (things like encrypting guest RAM on the fly, or sending the snapshot over the network instead of persisting it to local disk), without overloading CH with these features.

Offload daemon
One way we think this is achievable is by reusing the live migration protocol so that an offload daemon can behave as a destination VM (for the snapshot case), and as a source VM (for the restore case). The existing protocol gives us this ability and we've been able to verify that we can make snapshot/restore work from the offload daemon (almost) the same way it works with CH's internal snapshot/restore.

What's missing
One thing that is missing to reach parity with the current snapshot/restore support is userfaultfd. And given userfaultfd can't be entirely handled from the daemon (because the setup has to happen from CH's process to apply to the right VMAs), we must extend CH to support it. And given we're talking about using the live migration support, that basically means we would have to add post-copy (uffd) support to the current live migration protocol.

The proposal
Adding post-copy support to the current live migration protocol fits well with the live migration promise, and by adding it to the protocol, we can achieve both post-copy over the network AND fast restore from an offload daemon, since we'd expect the daemon to serve pages on demand through the extended protocol.
Anonymous memory is rejected with the same error message that local
live migration produces.
- Orchestrator-supplied network FDs (today carried by `vm.restore`'s
`net_fds` field) are **not** plumbed through `vm.receive-migration`,
Heads up that I have an upcoming PR plumbing vfio_fds through vm.receive-migration over the SCM_RIGHTS channel, mirroring how vm.restore handles fd substitution. The mechanism I implemented for vfio_fds can be extended to cover net_fds in vm.receive-migration. If you want, I can fold net_fds into my commits so that the offload daemon can receive orchestrator-supplied network FDs.
Ah that's good to know! If you think that's not too much work then yes, otherwise we can still remove the limitation later.
f700950 to 069cbb4
Still under active development so drafting.
e53d618 to d8e6fb8
Undrafted since it's now ready for reviews.
@sboeuf Your new test failed.
Yes it should be fixed now. |
Expose VmMigrationConfig as a public-facing structure that an offload daemon can use to act as if it were the VM to migrate to, or the VM to migrate from. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Add a dedicated binary meant to serve as a reference implementation for validating that offloaded snapshot/restore works, and to be used by tests in general. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Move next_data_extent and write_region_sparse out of memory_manager.rs into a new vmm::sparse module so the snapshot writer, the restore reader, and the offload daemon can share one implementation. No functional change intended. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Copy only populated extents when writing the snapshot file and when filling the restore memfd, leaving unwritten ranges as holes. Both the on-disk snapshot and the restored guest RAM stay sparse, so that untouched guest pages cost no disk space or host memory. This brings the offload daemon to feature parity with CH's internal implementation of snapshot/restore. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
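To make the populated-extent idea concrete, here is a minimal sketch. It is not the `vmm::sparse` code: the signatures of `next_data_extent` and `write_region_sparse` are not shown in this PR, so `data_extents` and `copy_sparse` below are hypothetical names. As a stand-in technique, the sketch detects populated ranges by scanning an in-memory buffer for non-zero pages, which may differ from how the real helper finds data.

```rust
use std::io::{self, Seek, SeekFrom, Write};

/// Return (offset, length) pairs covering every run of pages in `mem`
/// that contains at least one non-zero byte. (Hypothetical helper.)
pub fn data_extents(mem: &[u8], page_size: usize) -> Vec<(usize, usize)> {
    assert!(page_size > 0, "page_size must be non-zero");
    let mut extents = Vec::new();
    let mut start: Option<usize> = None;
    for (i, page) in mem.chunks(page_size).enumerate() {
        let offset = i * page_size;
        if page.iter().any(|&b| b != 0) {
            // Populated page: open a new extent if none is in progress.
            if start.is_none() {
                start = Some(offset);
            }
        } else if let Some(s) = start.take() {
            // All-zero page: close the extent in progress, if any.
            extents.push((s, offset - s));
        }
    }
    // Close an extent that runs up to the end of the buffer.
    if let Some(s) = start {
        extents.push((s, mem.len() - s));
    }
    extents
}

/// Write only the populated extents of `mem` into `dst`, seeking over the
/// gaps so the destination can stay sparse. Returns the bytes written.
pub fn copy_sparse<W: Write + Seek>(
    dst: &mut W,
    mem: &[u8],
    page_size: usize,
) -> io::Result<u64> {
    let mut written = 0u64;
    for (offset, len) in data_extents(mem, page_size) {
        dst.seek(SeekFrom::Start(offset as u64))?;
        dst.write_all(&mem[offset..offset + len])?;
        written += len as u64;
    }
    Ok(written)
}
```

With a real file as destination, a region that ends in a hole would also need the file extended to the full region size (for example with `File::set_len`), since seeking alone does not grow it.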
Extending the snapshot/restore documentation so that it explains the goals behind this offloaded snapshot/restore feature and how to use it in practice, and documents the protocol used by the offload daemon so that anyone could write their own daemon. By relying on the existing local live migration support and reusing the semantics and the protocol associated with it, we intend to provide a way for snapshotting and restoring a VM to/from a dedicated process that we can call the offload daemon. By allowing an external process to perform the snapshot/restore actions on behalf of Cloud Hypervisor, we give our users the opportunity to implement their own offload daemon. The goal is to avoid bloating Cloud Hypervisor with numerous features related to snapshot/restore, and let the user decide how to perform the snapshot/restore actions. One example is that we can decide to encrypt the guest RAM on the fly in order to avoid writing an unencrypted version to local disk. Another example is to be able to send guest RAM and associated state/config data over the network without having to persist the data first to local storage. There might be other reasons to choose an offload daemon to perform the snapshot/restore of the VM, but in every case, this empowers the user to make their own choice. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Introducing PageFault as the new wire command needed by both post-copy live migration and on-demand (fast) restore from the offload daemon. This new command describes the destination's need to fault a page's content in. The request describes the page through a MemoryRange structure, and the response is either 0 or the actual page size. A response of 0 means the source had access to the guest memory and was able to copy the page content directly. A response equal to the page size is followed by a payload containing the page content. We can expect local live migration and offload restore to run locally and therefore have access to the guest memory. Remote live migration over the network is the case where we would expect the page content to be sent over the wire. This command is served through an additional connection on the UNIX or TCP socket. The goal is to keep the same codepath between local and remote migrations. This additional channel allows PageFault commands to be issued asynchronously so they can be served without blocking the main connection. A connection role is introduced in order to distinguish an additional connection related to pre-copy memory from the newly introduced channel for serving post-copy requests. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
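The request/response exchange described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual wire format: the field names of `MemoryRange`, the little-endian `u64` encoding, and the `PageFaultReply`, `write_request`, `read_request`, `write_reply`, and `read_reply` names are all hypothetical. Only the "0 or page size, with payload in the latter case" rule comes from the commit message.

```rust
use std::io::{self, Read, Write};

/// Hypothetical stand-in for the MemoryRange structure named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MemoryRange {
    pub gpa: u64,
    pub length: u64,
}

/// The two possible answers to a PageFault request.
#[derive(Debug, PartialEq, Eq)]
pub enum PageFaultReply {
    /// Length 0: the source copied the page directly into guest memory.
    CopiedInPlace,
    /// Length == page size: the page content follows as payload.
    Payload(Vec<u8>),
}

/// Encode a request as two little-endian u64 values (assumed layout).
pub fn write_request<W: Write>(w: &mut W, range: MemoryRange) -> io::Result<()> {
    w.write_all(&range.gpa.to_le_bytes())?;
    w.write_all(&range.length.to_le_bytes())
}

pub fn read_request<R: Read>(r: &mut R) -> io::Result<MemoryRange> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    let gpa = u64::from_le_bytes(buf);
    r.read_exact(&mut buf)?;
    let length = u64::from_le_bytes(buf);
    Ok(MemoryRange { gpa, length })
}

/// Encode a reply as a length prefix, followed by the page if non-zero.
pub fn write_reply<W: Write>(w: &mut W, reply: &PageFaultReply) -> io::Result<()> {
    match reply {
        PageFaultReply::CopiedInPlace => w.write_all(&0u64.to_le_bytes()),
        PageFaultReply::Payload(page) => {
            w.write_all(&(page.len() as u64).to_le_bytes())?;
            w.write_all(page)
        }
    }
}

/// Decode a reply, rejecting any length other than 0 or `page_size`.
pub fn read_reply<R: Read>(r: &mut R, page_size: u64) -> io::Result<PageFaultReply> {
    let mut buf = [0u8; 8];
    r.read_exact(&mut buf)?;
    let len = u64::from_le_bytes(buf);
    if len == 0 {
        return Ok(PageFaultReply::CopiedInPlace);
    }
    if len != page_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "unexpected PageFault reply length",
        ));
    }
    let mut page = vec![0u8; len as usize];
    r.read_exact(&mut page)?;
    Ok(PageFaultReply::Payload(page))
}
```

Because the functions are generic over `Read` and `Write`, the same code would apply to a UNIX or a TCP stream, which matches the stated goal of keeping one codepath for local and remote migrations.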
Extract the page content provider out of the userfaultfd handler so it can be plugged with different backends in followup commits. No functional change intended. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
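One way to picture a pluggable page content provider is a small trait with interchangeable backends. The sketch below is an assumption about the shape of such an abstraction, not the code from this commit: `PageSource`, `SnapshotSource`, `ZeroSource`, and `resolve_fault` are hypothetical names, and the actual userfaultfd interaction is left out entirely.

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical provider abstraction: anything able to supply page content.
pub trait PageSource {
    /// Fill `buf` with the guest memory content starting at `offset`.
    fn read_page(&mut self, offset: u64, buf: &mut [u8]) -> io::Result<()>;
}

/// Backend reading pages from a seekable snapshot (a file, or a buffer).
pub struct SnapshotSource<R> {
    inner: R,
}

impl<R: Read + Seek> SnapshotSource<R> {
    pub fn new(inner: R) -> Self {
        Self { inner }
    }
}

impl<R: Read + Seek> PageSource for SnapshotSource<R> {
    fn read_page(&mut self, offset: u64, buf: &mut [u8]) -> io::Result<()> {
        self.inner.seek(SeekFrom::Start(offset))?;
        self.inner.read_exact(buf)
    }
}

/// Backend that serves zero-filled pages.
pub struct ZeroSource;

impl PageSource for ZeroSource {
    fn read_page(&mut self, _offset: u64, buf: &mut [u8]) -> io::Result<()> {
        buf.fill(0);
        Ok(())
    }
}

/// Backend-agnostic part of a fault handler: align the faulting offset
/// down to its page boundary and fetch that page from the source.
pub fn resolve_fault(
    source: &mut dyn PageSource,
    fault_offset: u64,
    page_size: u64,
) -> io::Result<(u64, Vec<u8>)> {
    assert!(page_size.is_power_of_two(), "page_size must be a power of two");
    let page_start = fault_offset & !(page_size - 1);
    let mut page = vec![0u8; page_size as usize];
    source.read_page(page_start, &mut page)?;
    Ok((page_start, page))
}
```

With this split, a socket-backed source like the one introduced in a later commit would only need its own `PageSource` implementation, leaving the handler logic unchanged.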
Adding the socket-backed UffdMemorySource that resolves each fault by sending a Command::PageFault request to the peer over a dedicated fault connection. This connection is brought up and ready to serve before restoring the VM. Also extending receive-migration to accept a new postcopy boolean parameter that lets the destination know whether to expect post-copy migration or on-demand restore. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Add a --lazy flag to the offload daemon's restore subcommand to support the post-copy mechanism from the live migration protocol. In this lazy mode, the daemon creates empty memfds to back the guest memory and sends them over to the VMM. This allows the VM to be started quickly after the memfd is mapped into CH's address space. At runtime, when the guest accesses a page (or when the prefault handler requests it), the daemon faults the page in by copying its content to the daemon's shared memory mapping. Once the page content is copied, it replies to the PageFault request to notify the VMM that it can consider the page present. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
Add a postcopy=on knob to the vm.send-migration endpoint so that a remote migration over TCP can resume the destination's VM and stream pages on demand instead of running the pre-copy dirty-tracking loop. Signed-off-by: Sebastien Boeuf <sboeuf@meta.com> Assisted-by: Claude:claude-opus-4-7
By relying on the existing local live migration support and reusing the
semantics and the protocol associated with it, we intend to provide a
way for snapshotting and restoring a VM to/from a dedicated process that
we can call the offload daemon.
By allowing an external process to perform the snapshot/restore actions on
behalf of Cloud Hypervisor, we give our users the opportunity to
implement their own offloaded daemon. The goal is to avoid bloating
Cloud Hypervisor with numerous features related to snapshot/restore, and
let the user decide how to perform the snapshot/restore actions. One
example is that we can decide to encrypt the guest RAM on the fly in
order to avoid writing an unencrypted version to local disk. Another
example is to be able to send guest RAM and associated state/config data
over the network without having to persist the data first to local
storage.
There might be other reasons to choose going with an offloaded daemon to
perform the snapshot/restore of the VM, but in every case, this empowers
the user to make their own choice.
Also, since we'd like to support the userfaultfd mechanism in order to reach
parity between the internal snapshot/restore implementation and the offload
daemon proposal, the post-copy feature had to be added to live migration.
With the live migration protocol now supporting post-copy, both remote
migration over TCP and offloaded VM restore can use the post-copy mechanism,
the latter by letting the daemon fault the pages in on demand.
This is a large PR that might need to be cut into smaller pieces, but it gives a
global understanding of what the end goal is and what it takes to achieve
it.