NVIDIA Introduces Checkpointing for CUDA Applications with CRIU

Caroline Bishop  Jul 03, 2024 13:52  UTC 05:52

NVIDIA has introduced cuda-checkpoint, a command-line utility for checkpointing and restoring CUDA applications on Linux. Used together with the open-source checkpointing tool CRIU (Checkpoint/Restore in Userspace), it lets users save the state of a running CUDA application, including its GPU state, and restore it later.

Checkpointing Overview

Transparent, per-process checkpointing offers a middle ground between virtual machine checkpointing, which captures an entire machine, and application-driven checkpointing, which requires the application to save and reload its own state. It is particularly useful for fault tolerance, task preemption, and cluster scheduling with migration. By combining cuda-checkpoint with CRIU, users can checkpoint the state of complex applications without modifying them.

CRIU

CRIU, an open-source utility maintained outside of NVIDIA, checkpoints and restores Linux process trees. It handles kernel-mode resources such as anonymous memory, threads, regular files, sockets, and pipes. However, it has no native support for NVIDIA GPUs. This is the gap cuda-checkpoint fills: it handles the CUDA state of a process, and CRIU handles everything else.

cuda-checkpoint

The cuda-checkpoint utility requires display driver version 550 or higher. It toggles the CUDA state of a process between running and suspended. The transition from running to suspended is called a suspend, and the reverse is called a resume. During a suspend, CUDA driver APIs are locked, already-submitted CUDA work is completed, device memory is copied to the host, and all CUDA GPU resources are released. During a resume, the steps run in reverse: GPUs are re-acquired, device memory and GPU memory mappings are restored, CUDA objects are reinstated, and CUDA driver APIs are unlocked.
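To make the suspend and resume sequence concrete, here is a minimal toy model in Python. It is only an illustration of the ordering described above: the class ToyCudaProcess, its fields, and its method names are hypothetical and do not correspond to any NVIDIA API, and plain dictionaries stand in for device and host memory.

```python
from dataclasses import dataclass, field
from enum import Enum


class CudaState(Enum):
    RUNNING = "running"
    SUSPENDED = "suspended"


@dataclass
class ToyCudaProcess:
    """Hypothetical stand-in for a CUDA process; not an NVIDIA API."""

    state: CudaState = CudaState.RUNNING
    device_memory: dict = field(default_factory=dict)  # pretend GPU allocations
    host_copy: dict = field(default_factory=dict)      # holds the data while suspended
    log: list = field(default_factory=list)            # records the steps, in order

    def toggle(self) -> CudaState:
        """Flip between running and suspended, like a single toggle operation."""
        if self.state is CudaState.RUNNING:
            self._suspend()
        else:
            self._resume()
        return self.state

    def _suspend(self) -> None:
        self.log.append("lock CUDA driver APIs")
        self.log.append("wait for submitted CUDA work")
        self.host_copy = dict(self.device_memory)  # device memory -> host
        self.log.append("copy device memory to host")
        self.device_memory.clear()                 # nothing is left on the GPU
        self.log.append("release GPU resources")
        self.state = CudaState.SUSPENDED

    def _resume(self) -> None:
        self.log.append("re-acquire GPUs")
        self.device_memory = dict(self.host_copy)  # host -> device memory
        self.host_copy.clear()
        self.log.append("restore device memory and mappings")
        self.log.append("restore CUDA objects")
        self.log.append("unlock CUDA driver APIs")
        self.state = CudaState.RUNNING


if __name__ == "__main__":
    proc = ToyCudaProcess(device_memory={"counter": 7})
    proc.toggle()  # suspend
    proc.toggle()  # resume
    for step in proc.log:
        print(step)
```

The point of the model is that a suspended process holds no GPU resources at all, which is what allows a CPU-only tool such as CRIU to checkpoint it.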

Checkpointing Example

An example application, counter, demonstrates the checkpointing process. The application increments GPU memory upon receiving a packet and replies with the updated value. Users can build this application using nvcc and observe the checkpointing and restoration processes using cuda-checkpoint and CRIU commands.
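The command sequence for such a walkthrough can be sketched in Python as below. This is a hedged sketch, not an authoritative reference: the flags shown (--toggle and --pid for cuda-checkpoint; dump, restore, --shell-job, --restore-detached, --images-dir, and --tree for criu) follow NVIDIA's published example as best recalled and should be checked against the tools' own help output. The helper names and the directory name demo are assumptions made for illustration. The script only prints the commands; it does not run them.

```python
import shlex


def checkpoint_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Commands to suspend CUDA state, then dump the process with CRIU."""
    return [
        # Suspend: move the process's CUDA state off the GPU.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
        # Dump the now GPU-free process to an image directory.
        ["criu", "dump", "--shell-job", "--images-dir", images_dir,
         "--tree", str(pid)],
    ]


def restore_commands(pid: int, images_dir: str) -> list[list[str]]:
    """Commands to restore the process with CRIU, then resume CUDA state."""
    return [
        ["criu", "restore", "--shell-job", "--restore-detached",
         "--images-dir", images_dir],
        # Resume: CRIU restores the process under its original PID, so the
        # same PID is toggled back to the running state.
        ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
    ]


if __name__ == "__main__":
    pid = 12345  # hypothetical PID of the running counter process
    for cmd in checkpoint_commands(pid, "demo") + restore_commands(pid, "demo"):
        print(shlex.join(cmd))
```

Note the symmetry: cuda-checkpoint runs before the CRIU dump and after the CRIU restore, so CRIU only ever sees a process with no GPU resources attached. CRIU typically needs elevated privileges, so in practice these commands would usually be run as root.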

Functionality and Limitations

As of display driver version 550, the cuda-checkpoint utility is still under active development. Currently, it supports only the x64 architecture and acts on a single process rather than a process tree. It does not support UVM or IPC memory, and it does not support GPU migration. It also waits for already-submitted CUDA work to finish before completing a checkpoint. Future driver releases are expected to address these limitations without requiring updates to the utility itself.

Summary

The cuda-checkpoint utility, in combination with CRIU, enables transparent, per-process checkpointing of CUDA applications on Linux. For further information, visit the official NVIDIA Technical Blog.


