
Starting in Spring 2021, we are migrating from native Kubernetes batch job scheduling to Volcano job scheduling.  Volcano adds batch job scheduling capabilities to Kubernetes, which should make more efficient use of the DGX cluster.  Using Volcano on the DGX cluster is straightforward.  Follow the steps below to learn how to submit a batch job to the DGX cluster with Volcano.

First, list the queues configured in Volcano with the following command.

kubectl get queue

The output lists the queues we can use in Volcano.  To get detailed information about a particular queue, use the following command, replacing <queue> with the actual queue name.

kubectl describe queue <queue>

In the output of this command, pay close attention to the capability of the queue (Cpu, Memory, nvidia.com/gpu, etc.) and to whether the queue is reclaimable.  If a queue is reclaimable, resources used by a job running in it can be taken back when another, non-reclaimable queue needs them, which causes the job to fail.  Not every queue is available to everyone; you should have been told which queue(s) you have access to.
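If we only want the two items discussed above, a small helper can pull them out of the queue object directly.  This is a minimal sketch, assuming the queue stores these values under .spec.reclaimable and .spec.capability (as in the Volcano Queue API); queue_summary is a hypothetical name chosen for this example.

```shell
# Print whether a Volcano queue is reclaimable and what its capability is.
# Assumption: the Queue object keeps both fields under .spec.
queue_summary() {
    queue="$1"
    printf 'reclaimable: '
    kubectl get queue "$queue" -o jsonpath='{.spec.reclaimable}'
    printf '\ncapability:  '
    kubectl get queue "$queue" -o jsonpath='{.spec.capability}'
    printf '\n'
}

# Usage (replace <queue> with the actual queue name):
#   queue_summary <queue>
```

An empty value after "reclaimable:" would mean the field is not set on that queue; in that case, check the full output of kubectl describe queue <queue>.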

To submit a batch job to Volcano, create a YAML file just as you would in native Kubernetes.  However, the structure of the YAML file is slightly different.  Here is an overview of a typical YAML file for submitting a job to Volcano.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job_name
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: queue_name
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: task_name
    policies:
    - event: TaskCompleted
      action: CompleteJob
    backoffLimit: 5
    activeDeadlineSeconds: time_limit
    template: 
      metadata:
        name: volcano-job
        labels:
          environment: research
      spec:
        restartPolicy: OnFailure
        imagePullSecrets:
        - name: your_secret
        volumes:
        - name: proj
          hostPath:
            path: /proj
            type: Directory
        containers:
        - name: container_name
          image: docker_name
          resources:
            requests:
              cpu: 8
              memory: 4Gi
              nvidia.com/gpu: 1
            limits:
              cpu: 8
              memory: 4Gi
              nvidia.com/gpu: 1
          volumeMounts:
          - name: proj
            mountPath: /proj
            readOnly: false
          command:
            - "/bin/bash"
            - "-c"
            - > 
              cd work_directory && command

The listing above shows the major components used to configure a job in the YAML file.  The job_name must be unique among all the jobs in Kubernetes; if a job with the same name already exists, whether running or failed, you will get an error when submitting the job.  The queue_name is the queue you intend to use.  In Volcano, a job runs one or more tasks, so give your task a name in place of task_name.  If your Docker image comes from a private Docker registry, store the registry credentials in a Kubernetes secret and put its name in place of your_secret.  Then give the container a name in place of container_name and specify the Docker image to use in place of docker_name.  Request CPU, memory, and GPU resources as needed.  Finally, the application is run as "cd work_directory && command".  The YAML file above also mounts /proj in the container, which is useful because the input files are in /proj and the output files will be saved in /proj too.
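Because a duplicate job name causes the submission to fail, it can help to check the name before submitting.  The sketch below is only an illustration: valid_name and job_exists are hypothetical helper names, and the naming rule it checks (lowercase letters, digits and hyphens, at most 63 characters) is an assumption based on the common Kubernetes DNS-label convention, so the cluster may accept a somewhat wider set of names.  Note that the placeholders in the listing, such as job_name, are stand-ins and would not pass this check because of the underscore.

```shell
# Return 0 if the name uses only lowercase letters, digits and hyphens,
# starts and ends with a letter or digit, and is at most 63 characters.
# Assumption: the cluster follows the usual DNS-label naming rule.
valid_name() (
    LC_ALL=C                            # make a-z mean ASCII lowercase only
    name="$1"
    [ -n "$name" ] || return 1
    [ "${#name}" -le 63 ] || return 1
    case "$name" in
        *[!a-z0-9-]*) return 1 ;;       # character outside the allowed set
        -*|*-)        return 1 ;;       # leading or trailing hyphen
    esac
    return 0
)

# Return 0 if a Volcano job with this name already exists.
job_exists() {
    kubectl get vcjob "$1" >/dev/null 2>&1
}

# Usage (replace <job_name> with the name used in the YAML file):
#   valid_name <job_name> || echo "name is not valid"
#   job_exists <job_name> && echo "name is already taken"
```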

Save the YAML in a file named <yaml_file> and submit the job to Volcano with the following command.

kubectl create -f <yaml_file>

Once the job is submitted, check its status with the following command, which lists the status of all your submitted jobs.

kubelist

To list the jobs of all users, use the following command.

kubectl get vcjob
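To look at a single job rather than the whole list, we can query its phase and find its pods.  This is a sketch under two assumptions about Volcano: that the job records its phase under .status.state.phase, and that it labels the pods it creates with volcano.sh/job-name.  The helper names job_phase and job_pods are made up for this example.

```shell
# Print the current phase of one Volcano job (for example Pending, Running or Completed).
# Assumption: the phase is stored under .status.state.phase.
job_phase() {
    kubectl get vcjob "$1" -o jsonpath='{.status.state.phase}'
}

# List the pods that belong to one Volcano job.
# Assumption: Volcano labels each pod with volcano.sh/job-name=<job_name>.
job_pods() {
    kubectl get pods -l "volcano.sh/job-name=$1"
}

# Usage (replace <job_name> with the name of your job):
#   job_phase <job_name>
#   job_pods <job_name>
```

Once a pod name is known, kubectl logs <pod_name> shows the output of the command that the job ran.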

When the job is done, it is important to clean up by removing the job from the job list with the following command.

kubectl delete vcjob <job_name>

If you run the kubelist command again, the job should no longer be on the list.
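The clean-up step can also be scripted so that a job is removed as soon as it completes.  The following is a minimal sketch, not a site-provided tool: wait_and_delete is a hypothetical name, and it assumes the job phase is available under .status.state.phase with the values Completed, Failed, Aborted and Terminated marking the end of a job.

```shell
# Poll a Volcano job until it finishes, then delete it if it completed.
# Arguments: job name, polling interval in seconds (default 30).
# Assumption: the phase is stored under .status.state.phase.
wait_and_delete() {
    job="$1"
    interval="${2:-30}"
    while :; do
        phase=$(kubectl get vcjob "$job" -o jsonpath='{.status.state.phase}') || return 1
        case "$phase" in
            Completed)
                break ;;
            Failed|Aborted|Terminated)
                # Keep the job so that it can still be inspected.
                echo "job $job ended in phase $phase; not deleting" >&2
                return 1 ;;
        esac
        sleep "$interval"
    done
    kubectl delete vcjob "$job"
}

# Usage (replace <job_name> with the name of your job):
#   wait_and_delete <job_name> 60
```

A job that did not complete is deliberately left in place so that it can be examined first; it still has to be removed afterwards with kubectl delete vcjob <job_name>.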