
Toolbx now enables the proprietary NVIDIA driver


… and why did it take so long for that to happen?

If you build Toolbx from Git and install it to your usual prefix on a host operating system with the proprietary NVIDIA driver, then you will be able to use the driver in all freshly started Toolbx containers. Just like that. There’s no need to recreate your containers or to use some special option. It will just work.

How does it work?

Toolbx uses the NVIDIA Container Toolkit to generate a Container Device Interface specification on the host during the toolbox enter and toolbox run commands. This is a JSON or YAML description of the environment variables, files and other miscellaneous things required by the user space part of the proprietary NVIDIA driver. Containers share the kernel space driver with the host, so we don’t have to worry about that. This specification is then shared with the Toolbx container’s entry point, which is the toolbox init-container command running inside the container. The entry point handles the hooks and bind mounts, while the environment variables are handled by the podman exec process running on the host.
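
For illustration, here’s a heavily abridged sketch of what such a specification can look like. The exact devices, environment variables, mounts and hooks vary with the hardware and the driver version, so treat every value below as a made-up example:

    {
      "cdiVersion": "0.5.0",
      "kind": "nvidia.com/gpu",
      "devices": [
        {
          "name": "all",
          "containerEdits": {
            "deviceNodes": [
              { "path": "/dev/nvidia0" },
              { "path": "/dev/nvidiactl" }
            ]
          }
        }
      ],
      "containerEdits": {
        "env": [
          "NVIDIA_VISIBLE_DEVICES=void"
        ],
        "hooks": [
          {
            "hookName": "createContainer",
            "path": "/usr/bin/nvidia-ctk",
            "args": ["nvidia-ctk", "hook", "update-ldcache"]
          }
        ],
        "mounts": [
          {
            "hostPath": "/usr/lib64/libcuda.so.1",
            "containerPath": "/usr/lib64/libcuda.so.1",
            "options": ["ro", "nosuid", "nodev", "bind"]
          }
        ]
      }
    }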

It’s worth pointing out that, right now, this uses neither nvidia-ctk cdi generate to generate the Container Device Interface specification nor podman create --device to consume it. We may decide to change this in the future, but that’s how things stand today.

The main problem with podman create is that the specification must be saved in /etc/cdi or /var/run/cdi for it to be visible to podman create --device, and writing to either location requires root access. Toolbx containers are often used rootless, so requiring root privileges for hardware support, something that isn’t necessary on the host, would be a problem.

Secondly, updating the toolbox(1) binary won’t enable the proprietary NVIDIA driver in existing containers, because podman create only affects new containers.

Therefore, Toolbx uses the tags.cncf.io/container-device-interface Go APIs, which are also used by podman create, to parse and apply the specification itself. The hooks in the specification are a bit awkward to deal with, so at the moment only ldconfig(8) is supported.
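
To give a rough idea of what consuming a specification with those APIs involves, here’s a minimal sketch, not the actual Toolbx code. It assumes the specification was saved to a user-writable directory; the path is made up for this example:

    package main

    import (
        "fmt"
        "log"

        oci "github.com/opencontainers/runtime-spec/specs-go"
        "tags.cncf.io/container-device-interface/pkg/cdi"
    )

    func main() {
        // Read CDI specifications from a user-writable directory instead
        // of the root-only /etc/cdi and /var/run/cdi. The path is made up
        // for this example.
        cache, err := cdi.NewCache(cdi.WithSpecDirs("/run/user/1000/toolbox/cdi"))
        if err != nil {
            log.Fatal(err)
        }

        // Resolve the named devices and apply their container edits
        // (device nodes, environment variables, mounts, hooks) to an OCI
        // runtime specification.
        ociSpec := &oci.Spec{}
        if _, err := cache.InjectDevices(ociSpec, "nvidia.com/gpu=all"); err != nil {
            log.Fatal(err)
        }

        if ociSpec.Process != nil {
            fmt.Println("environment variables:", ociSpec.Process.Env)
        }
    }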

The issue with nvidia-ctk is relatively minor: it’s a separate binary. Shelling out to it makes error handling more difficult, and downstream distributors and users of Toolbx need to be aware of the dependency. Instead, it’s better to directly use the github.com/NVIDIA/go-nvlib and github.com/NVIDIA/nvidia-container-toolkit Go APIs that nvidia-ctk itself uses. This offers all the usual error handling facilities in Go and ensures that the dependency can’t go missing.
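
As a sketch of what generating the specification in-process might look like with the nvcdi package from nvidia-container-toolkit (which pulls in go-nvlib underneath), again not the actual Toolbx code, and with a made-up save path:

    package main

    import (
        "log"

        "github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"
    )

    func main() {
        // nvcdi is the same library that nvidia-ctk cdi generate is
        // built on; it inspects the driver installation on the host.
        nvcdilib, err := nvcdi.New()
        if err != nil {
            log.Fatal(err)
        }

        // Generate a CDI specification for the GPUs and driver that
        // were found on the host.
        spec, err := nvcdilib.GetSpec()
        if err != nil {
            log.Fatal(err)
        }

        // Save it to a user-writable location (made up for this
        // example).
        if err := spec.Save("/run/user/1000/toolbox/cdi/nvidia.yaml"); err != nil {
            log.Fatal(err)
        }
    }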

Why did it take so long?

Well, hardware support needs hardware, and sometimes it takes time to get access to it. I didn’t want to optimistically throw together a soup of find(1), grep(1), sed(1), etc. calls without any testing, in the hope that it would all work out. That approach may be fine for some projects, but not for Toolbx.

Red Hat recently got me a ThinkPad P72 laptop with an NVIDIA Quadro P600 GPU, which let me proceed with this work.

Written by Debarshi Ray

17 June, 2024 at 20:50
