Batch Job Submission with Volcano
In Spring 2021, we are migrating from native Kubernetes batch job scheduling to Volcano job scheduling. Volcano adds batch job scheduling capabilities to Kubernetes, which will let us use the DGX cluster more efficiently. Using Volcano on the DGX cluster is easy. Follow the steps below to learn how to submit batch jobs to the DGX cluster with Volcano.
First of all, to find out the queues configured in Volcano, use the following command.
kubectl get queue
The list shows the queues we can use in Volcano. To get detailed information about a particular queue, use the following command, replacing <queue> with the actual queue name.
kubectl describe queue <queue>
In the output of the above command, pay close attention to the capability of the queue: Cpu, Memory, nvidia.com/gpu, etc. Also check whether the queue is reclaimable. If it is, resources that your job is using can be reclaimed when another, non-reclaimable queue needs them, and your running job will fail as a result. Not every queue is usable by everyone. You should have been told which queue(s) you have access to.
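To see just these two items without reading the full description, a small shell helper can query them directly. This is a minimal sketch: queue_summary is a hypothetical helper name, and the field paths .spec.capability and .spec.reclaimable are assumptions about how the Queue resource is laid out on your cluster, so compare the result with the kubectl describe output.

```shell
# Hypothetical helper: print the capability and the reclaimable flag of one queue.
# Assumes the Queue resource stores them under .spec.capability and .spec.reclaimable.
queue_summary() {
    queue="$1"
    kubectl get queue "$queue" \
        -o jsonpath='{"capability: "}{.spec.capability}{"\n"}{"reclaimable: "}{.spec.reclaimable}{"\n"}'
}

# Usage (replace <queue> with the real queue name):
#   queue_summary <queue>
```

If the reclaimable line comes back empty, the field may simply be unset on that queue; in that case rely on the full kubectl describe output.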
To submit a batch job to Volcano, create a YAML file just as you would in native Kubernetes. The structure of the YAML is a little different, however. Here is an overview of a typical YAML file for submitting a job to Volcano.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job_name
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: queue_name
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: task_name
      policies:
        - event: TaskCompleted
          action: CompleteJob
      backoffLimit: 5
      activeDeadlineSeconds: time_limit
      template:
        metadata:
          name: volcano-job
          labels:
            environment: research
        spec:
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: your_secret
          volumes:
            - name: proj
              hostPath:
                path: /proj
                type: Directory
          containers:
            - name: container_name
              image: docker_name
              resources:
                requests:
                  cpu: 8
                  memory: 4Gi
                  nvidia.com/gpu: 1
                limits:
                  cpu: 8
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: proj
                  mountPath: /proj
                  readOnly: false
              command:
                - "/bin/bash"
                - "-c"
                - >
                  cd work_directory && command
The listing above shows the major components needed to configure a job in the YAML file. The job_name must be unique among all the jobs in Kubernetes. If a job with the same name already exists, whether running or failed, you will get an error when submitting the job. The queue_name is the queue you intend to use. In Volcano, you submit a task, and you give your task a name as task_name. If you pull your Docker image from a private Docker registry, you will have to put the credentials into a Kubernetes secret and reference it as your_secret. Then give the container a name as container_name and specify the Docker image to use as docker_name. Request CPU, memory, and GPU resources as needed. Finally, run your application with "cd work_directory && command". The YAML file above also mounts /proj in the container, which is desirable because our input files are in /proj and we will save the output files in /proj too.
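Because a job name that is already in use is rejected, it can help to generate the name instead of typing it by hand. The sketch below is one way to do that. make_job_name is a hypothetical helper, and the timestamp-plus-PID suffix is an illustrative choice, not a site convention. It also lowercases the name and replaces characters such as underscores with hyphens, since Kubernetes generally expects object names made of lowercase letters, digits, and hyphens.

```shell
# Hypothetical helper: build a job name that is unlikely to collide with an
# earlier submission and that avoids characters Kubernetes names usually reject.
make_job_name() {
    prefix="$1"
    # Append a timestamp and the PID of the current shell to the prefix.
    name="${prefix}-$(date +%Y%m%d-%H%M%S)-$$"
    # Lowercase everything, then replace any character that is not a
    # lowercase letter, digit, or hyphen with a hyphen.
    printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9-]/-/g'
}

# Usage: the result can be pasted into metadata.name in the YAML file.
#   make_job_name my_training_run
```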
We save the YAML in a file named <yaml_file>, and we use the following command to submit the job to Volcano.
kubectl create -f <yaml_file>
Once the job is submitted, we can check the status of the job with this command. In fact, this will list the status of all your submitted jobs.
kubelist
If you want to list all jobs by all users, here is the command.
kubectl get vcjob
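Rather than rerunning the status command by hand, the phase of a job can be polled from a script. The following is a minimal sketch, assuming the job reports its phase at .status.state.phase and uses phase names such as Completed, Failed, Aborted, and Terminated. wait_for_job is a hypothetical helper, so check the field path and the phase names against the output of kubectl get vcjob <job_name> -o yaml on the cluster.

```shell
# Hypothetical helper: block until a Volcano job reaches a final phase.
# Assumes the phase is stored at .status.state.phase (verify on your cluster).
# Returns 0 on Completed, 1 on Failed/Aborted/Terminated, 2 if kubectl fails.
wait_for_job() {
    job="$1"
    interval="${2:-30}"   # seconds between checks; defaults to 30
    while :; do
        phase="$(kubectl get vcjob "$job" -o jsonpath='{.status.state.phase}')" || return 2
        case "$phase" in
            Completed)
                echo "job $job completed"
                return 0 ;;
            Failed|Aborted|Terminated)
                echo "job $job ended in phase $phase" >&2
                return 1 ;;
        esac
        # Any other phase (for example Pending or Running): keep waiting.
        sleep "$interval"
    done
}

# Usage (replace <job_name> with the real job name):
#   wait_for_job <job_name> 60
```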
When the job is done, it is important to clean up by removing the job from the job list with the following command.
kubectl delete vcjob <job_name>
If you run the kubelist command again, that job should not be on the list anymore.
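To find jobs that are ready to be removed, the job list can be filtered by phase. A minimal sketch follows, assuming the phase of each job is stored at .status.state.phase. finished_jobs is a hypothetical helper. It only prints names and deletes nothing, and because kubectl get vcjob lists jobs from all users, make sure a job is yours before deleting it.

```shell
# Hypothetical helper: print the names of jobs whose phase is Completed or Failed.
# Assumes the phase is stored at .status.state.phase (verify on your cluster).
finished_jobs() {
    kubectl get vcjob \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.state.phase}{"\n"}{end}' |
        awk '$2 == "Completed" || $2 == "Failed" { print $1 }'
}

# Usage: review the printed names, then delete your own jobs one at a time.
#   finished_jobs
#   kubectl delete vcjob <job_name>
```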