VCL with GPU
Packaging applications in Docker images has become very popular among software developers. The idea is to let an application run anywhere, in the cloud or on-premises, without worrying about missing libraries or modules, since any missing dependency can make the software unusable. Putting everything the software needs into a Docker image has proven to be a very useful way to deliver applications: once the image reaches end-users, the application can run on their machines regardless of how the OS is set up.
At the University of North Carolina at Chapel Hill, Docker images can be built and run in a special VCL image maintained by the ITS Research Computing Center: TarHeel Linux, CentOS 7 (Full Blade with GPU).
VCL has long been a valuable teaching tool. Here we describe a way to use VCL for research. We have completely overhauled and updated the TarHeel Linux, CentOS 7 (Full Blade with GPU) VCL image to include SSHFS, Podman, and other tools. SSHFS allows us to mount a remote filesystem securely. Podman is a tool for building and running Docker images. In our implementation, Podman is configured to run rootless containers, and Docker commands can be used with Podman as they are. Singularity, another way of working with containers, is also installed.
For long-running jobs, one can submit jobs to the DGX cluster directly from the VCL, and also check and monitor jobs running on the DGX cluster.
This page describes the basic usage of the VCL image. If you have a research project in mind and would like to see whether VCL can help, please email email@example.com. For other questions and comments about using the TarHeel Linux, CentOS 7 (Full Blade with GPU) VCL image, please email firstname.lastname@example.org.
Accessing VCL of TarHeel Linux, CentOS 7 (Full Blade with GPU)
Point your favorite web browser to https://vcl.unc.edu. Select “Shibboleth (UNC-Chapel Hill)” in the pull-down menu and click “Proceed to Login” to continue. Authenticate with your ONYEN and ONYEN password. Off-campus access requires a VPN connection established in advance.
In the menu on the left, click “Reservations”, then click “New Reservation”. In the “New Reservation” window, use the pull-down menu to choose “TarHeel Linux, CentOS 7 (Full Blade with GPU)”. You can also choose to start now or later and set the duration of the reservation. Click “Create Reservation” to continue.
When the VCL is ready for you to use, it presents a “Connect” button. Click “Connect” to see the IP address of the VCL, the username, and the password. The username is normally your ONYEN. For the password, you can use either the provided temporary password or your ONYEN password.
Use your favorite terminal to ssh to the machine at the provided IP address. The first command below assumes that the login ID on your local machine is also your ONYEN. Otherwise, specify your ONYEN as in the second command.
ssh <ip_address>
ssh <onyen>@<ip_address>
If graphics need to be displayed on your machine from the VCL, add “-X” to the command to enable X11 forwarding.
ssh -X <ip_address>
ssh -X <onyen>@<ip_address>
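The choice between the two forms of the ssh command can be sketched as a small shell helper. The function name vcl_ssh_cmd and its argument order are hypothetical and not part of the VCL image; the helper only prints the command so that it can be inspected before use.

```shell
# Hypothetical helper: print the ssh command for a VCL reservation.
# Usage: vcl_ssh_cmd <ip_address> [<onyen>] [x11]
vcl_ssh_cmd() {
    ip="$1"
    onyen="${2:-$(id -un)}"   # default to the local login name
    x11="${3:-}"
    target="$ip"
    # Spell out the user only when it differs from the local login ID.
    if [ "$onyen" != "$(id -un)" ]; then
        target="$onyen@$ip"
    fi
    if [ "$x11" = "x11" ]; then
        echo "ssh -X $target"   # with X11 forwarding
    else
        echo "ssh $target"
    fi
}
```

For example, vcl_ssh_cmd <ip_address> <onyen> x11 prints the command with “-X” and, when your local login ID is not your ONYEN, with the ONYEN in front of the address.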
Once you are done with the VCL, you can exit from the terminal. Then, click “Delete Reservation” to release the resources for others to use.
Accessing DGX Cluster with Kubernetes
A normal VCL session is limited to about 10 hours. If your job needs more time to run, you can submit it to the DGX cluster, which uses Kubernetes as its job manager. If you have already created a Kubernetes token for the DGX cluster on a Longleaf login node, you are ready to submit Kubernetes jobs from VCL. Run the following command to set up your Kubernetes environment for the DGX cluster by copying your Kubernetes token over from Longleaf.
The above command checks whether you have a Kubernetes config file on Longleaf. If you do, it copies the file over to VCL. You will have to enter your ONYEN password twice to get the file. Once you have your DGX cluster Kubernetes config file, run the following command to check it.
kubectl get node
If the above command returns with a list of nodes in the DGX cluster, you are ready to submit jobs to the DGX cluster.
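As an illustration of what a submitted job might look like, the sketch below writes a minimal Kubernetes Job manifest to a file. Everything specific in it is an assumption: the file name gpu-job.yaml, the job name example-gpu-job, the container image, and the nvidia.com/gpu resource name (the name conventionally exposed by the NVIDIA device plugin). The DGX cluster may require additional settings that are not covered here.

```shell
# Write a minimal Job manifest; all names and the image are placeholders.
cat > gpu-job.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: example-gpu-job
spec:
  backoffLimit: 0            # do not retry a failed pod
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: docker.io/nvidia/cuda:9.0-devel-ubuntu16.04
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1   # assumes the NVIDIA device plugin is deployed
EOF
# Submit it with:
#   kubectl apply -f gpu-job.yaml
```

Writing the manifest to a file first makes it easy to review before it is submitted with kubectl apply.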
To see all the jobs currently running on the DGX Kubernetes cluster, invoke the following command.
kubectl get job
If you have jobs that have completed or died, please delete them to clean up the job list.
kubectl delete job <job_name>
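When many jobs have accumulated, the completed ones can be picked out of the job list with a small filter. The sketch below is an assumption-laden helper, not a site-provided tool: the function name completed_jobs is hypothetical, and it assumes the job list has a header row followed by rows whose second column is COMPLETIONS in the form done/wanted, as printed by recent kubectl versions. It does not detect failed jobs.

```shell
# Print the names of jobs whose COMPLETIONS column reads N/N.
# Reads the output of "kubectl get job" on standard input.
completed_jobs() {
    awk 'NR > 1 { split($2, c, "/"); if (c[1] == c[2]) print $1 }'
}
# Usage (review the list before deleting anything):
#   kubectl get job | completed_jobs
#   kubectl get job | completed_jobs | xargs -r -n 1 kubectl delete job
```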
A Kubernetes job spins up one or more pods. You can see the status of all pods with the first command below. With the -owide option, as in the second command, you see more information about the pods.
kubectl get pod
kubectl get pod -owide
Please remove your jobs as soon as you are done using the DGX cluster resources. Any idle process on a CPU or GPU in the DGX cluster is a waste of resources.
In research, we normally have research data and files saved in the /proj filesystem. Storage space on the /proj filesystem is granted to projects and PI groups who request it. It is advantageous to be able to access /proj in the VCL if you already have space there. Also, the output files can be saved in /proj. Once we exit VCL, those output files will remain intact in /proj.
At the prompt running VCL of TarHeel Linux, CentOS 7 (Full Blade with GPU), you can type this command.
To unmount, use this command.
fusermount -u /proj
Accessing Longleaf Home Directory
If you have files in your Longleaf Home directory, you may want to access your Longleaf home directory in VCL. You may also want to save files in your Longleaf home directory. To access your Longleaf home directory, use the following command.
To navigate to your Longleaf home directory, use the cd command, replacing <onyen> with your ONYEN.
To unmount your Longleaf home directory, use this command.
fusermount -u /nas/longleaf/home/<onyen>
Checking Podman and Singularity
The TarHeel Linux, CentOS 7 (Full Blade with GPU) VCL image provides the basis for computing in research. Software applications can be loaded as Docker and Singularity containers. To check the availability of the Docker and Singularity commands, we can simply invoke the following.
Using this VCL image, one can pull or build Docker and Singularity images and run Docker and Singularity containers.
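One generic way to confirm that the container tools are on the PATH is sketched below. The function name check_tool is hypothetical, and this is not necessarily the check the image maintainers use; it relies only on the standard shell builtin "command -v".

```shell
# Report whether a command is available on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found"
    fi
}
# Usage:
#   for tool in docker podman singularity; do check_tool "$tool"; done
```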
Switching CUDA Version
In the VCL instance, CUDA version 11.2 is installed; it was the latest version as of February 2021. However, some applications need older versions of CUDA, so we have installed multiple versions of CUDA in the VCL instance. To check which versions of CUDA are installed, invoke the following command.
By default, CUDA version 11.2 is used when the VCL is started. To switch to a different version of CUDA, use the following command, replacing <version> with the version number you want.
source switch-cuda <version>
If the CUDA version you need is not installed, please email email@example.com to let us know.
Pulling Docker Images from DockerHub
Once you log into the VCL instance, you can pull Docker images from Docker registries. There are numerous Docker images one can get from various registries such as NVIDIA GPU Cloud (NGC) and DockerHub. To pull Docker images from NGC, you will need an account there. Here we demonstrate how we can pull a Docker image from DockerHub and use it in VCL.
Use the following command to pull an Ubuntu 16.04 Docker image with CUDA 9.0 and its development libraries from the DockerHub registry.
docker pull docker.io/nvidia/cuda:9.0-devel-ubuntu16.04
If the image has already been pulled before, it will not be pulled again. You can check whether the image is already there with this command.
Now, you are ready to run the image and create a container.
docker run -ti --rm docker.io/nvidia/cuda:9.0-devel-ubuntu16.04
The prompt will change to something like the following indicating you are inside the container with container ID as 07ac4ae5eea4. The prompt of the container changes to “#” at the end indicating root access. In fact, you are NOT running as root in the container since we have pre-configured Podman to run in rootless mode by default.
While you are in the container, you can run commands like you normally do in a Linux system. In the container, you can check the status of the GPUs with this command.
When you are done with this container, you can type exit.
The container is then removed, since we have “--rm” in the “docker run” command. However, the image stays on the node until we remove it with this command.
docker rmi docker.io/nvidia/cuda:9.0-devel-ubuntu16.04
Building Docker Image with GPU Access
There are numerous Docker images one can get from various registries such as NVIDIA GPU Cloud (NGC) or DockerHub. Sometimes, you want to build your own Docker images. In particular, you may want to have access to a GPU in your own containers. Here we demonstrate how to build Docker images in the VCL of TarHeel Linux, CentOS 7 (Full Blade with GPU).
Once you log in to the VCL of TarHeel Linux, CentOS 7 (Full Blade with GPU), change to a directory of your choice and create a file named “Dockerfile” with the following listing.
FROM docker.io/nvidia/cuda:9.2-devel-ubuntu18.04
# Avoid user interaction with tzdata
ARG DEBIAN_FRONTEND=noninteractive
# Update OS
RUN apt-get -y update && apt-get -y upgrade
# Set working directory
WORKDIR /usr/local
In this file, we ask to pull a basic Docker image from DockerHub which has CUDA 9.2 with development library in Ubuntu 18.04. Then, we update the OS.
We use either one of the following commands to build the Docker image. The “.” in each command is the build context, here the current directory. By default, the build reads the file named “Dockerfile” in that directory, as in the first command. If the file has a different name, we can use the “-f” option to give the filename, as in the second command below.
docker build . -t cuda:9.2-devel-ubuntu18.04
docker build -f Dockerfile -t cuda:9.2-devel-ubuntu18.04 .
It takes only a short while to build the Docker image. Once it is done, type the command “docker images” to check that the image exists.
$ docker images
REPOSITORY             TAG                    IMAGE ID      CREATED        SIZE
localhost/cuda         9.2-devel-ubuntu18.04  e7796838f843  1 minutes ago  2.37 GB
docker.io/nvidia/cuda  9.2-devel-ubuntu18.04  816085a0101a  8 weeks ago    2.21 GB
The one labelled as “localhost/cuda:9.2-devel-ubuntu18.04” is the one you have just created. The other one, listed as “docker.io/nvidia/cuda:9.2-devel-ubuntu18.04”, is what we pulled from DockerHub to build the local one. The new Docker image is saved locally.
We can then create a Docker container by running this image using this command.
docker run -ti --rm localhost/cuda:9.2-devel-ubuntu18.04
This Docker container also has access to the GPU, since Podman is pre-configured to allow it. In the container, invoke the “nvidia-smi” command to check the status of the GPU. When you are done with this container, type “exit” to finish and you will be back in the VCL. To list all dead and active containers, invoke “docker ps -a” at the VCL prompt. The list includes the container IDs. Use “docker stop <container_id>” to stop a running container and “docker rm <container_id>” to remove a dead container.
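Removing dead containers one at a time becomes tedious when several have piled up. The sketch below wraps the listing and removal steps in one function; the function name remove_exited_containers is hypothetical. It uses the “-q” and “--filter status=exited” options of “docker ps”, which Docker and Podman both provide.

```shell
# Remove every container that has exited; running containers are untouched.
remove_exited_containers() {
    ids="$(docker ps -a -q --filter status=exited)"
    if [ -z "$ids" ]; then
        echo "no exited containers"
        return 0
    fi
    # $ids is left unquoted on purpose so each ID becomes its own argument.
    docker rm $ids
}
# Usage:
#   remove_exited_containers
```

Containers started with “--rm” are removed automatically on exit, so this is only needed for containers started without that option.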
Running GUI Application in Docker Container
There are a lot of Docker images we can pull from various Docker registries. Each of those Docker images serves at least one application. Here, we demonstrate how to create a Docker image with an application built in. When the application has a GUI, we can display the graphics too. Since we would like to display graphics on our screen, we need to allow X11 forwarding from the remote VCL machine. To do that, use “ssh -X <ip_address>” to log into the VCL instance. Once we are in the VCL, create a file named “Dockerfile” with the following listing.
FROM fedora
RUN yum -y update
RUN yum -y install xorg-x11-apps && yum clean all
CMD [ "/usr/bin/xclock" ]
This Docker image is Fedora-based. We update the OS and then install xorg-x11-apps, which includes the GUI application “xclock”. The last line in the file says to run “xclock” when a container is started from the image. We can build a Docker image from this file using the following command.
docker build . -t xclockimage
When the build is finished, we can check the existence of the image with the “docker images” command.
$ docker images
REPOSITORY                TAG     IMAGE ID      CREATED         SIZE
localhost/xclockimage     latest  868d45fd0eb3  26 minutes ago  558 MB
docker.io/library/fedora  latest  33c4a622f37c  10 days ago     183 MB
You can see that the one labelled as localhost/xclockimage is the one we have just created. We can then use the following command to run “xclock” from the Docker image. The “--net=host” option indicates that the Docker container uses the VCL network. In the command, we have not specified which executable to run, so the container runs what CMD indicates in the last line of the “Dockerfile”, which is “xclock”.
docker run -ti --rm -e DISPLAY --net=host -v ~/.Xauthority:/root/.Xauthority:Z xclockimage
A graphical clock should appear on your screen.
This demonstrates that one can easily build Docker images with any desired application built in, with or without a GUI. Many users choose to distribute just the “Dockerfile” and build the Docker image from it whenever they want to invoke the application. If building the Docker image takes a long time, you may want to build it once and push it to one of the Docker registries, such as DockerHub. When you need to run the application, you can then just pull the image from the registry, with no need to build it again.
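Pushing a locally built image to DockerHub can be sketched as follows. The function name push_to_dockerhub and its arguments are hypothetical; the sketch assumes you have a DockerHub account, have already run “docker login docker.io”, and that the image is stored locally under the “localhost/” prefix as shown in the listings above.

```shell
# Tag a local image with a DockerHub repository name and push it.
# Usage: push_to_dockerhub <dockerhub_user> <image_name>
push_to_dockerhub() {
    user="$1"
    image="$2"
    docker tag "localhost/$image" "docker.io/$user/$image" &&
    docker push "docker.io/$user/$image"
}
# Example (after "docker login docker.io"):
#   push_to_dockerhub <dockerhub_user> xclockimage
# Later, on any machine:
#   docker pull docker.io/<dockerhub_user>/xclockimage
```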