AlphaFold2 from DeepMind has been released as an open source application.  At UNC Research Computing Center, we are able to run AlphaFold2 on our machines to predict protein 3D structure from a chain of amino acids.  Following the steps below, we will be able to run AlphaFold2 on the Longleaf cluster.

First, log in to a Longleaf login node.

ssh longleaf.its.unc.edu
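
If the username on your local machine is different from your ONYEN, include your ONYEN explicitly (replace onyen below with your own).

ssh onyen@longleaf.its.unc.edu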

Then, create a directory for your project to keep the input and output files.  For example, we use this directory in the /pine scratch space to study the S protein structure of the COVID-19 virus.

mkdir -p /pine/scr/c/d/cdpoon/covid

Once we have the project working directory in place, we can download the fasta sequence of the S protein.

cd /pine/scr/c/d/cdpoon/covid
wget https://www.rcsb.org/fasta/entry/7DDD -O rcsb_pdb_7DDD.fasta

We now have a file named rcsb_pdb_7DDD.fasta in the working directory.
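
It is worth a quick sanity check that the download succeeded; the file should begin with a FASTA header line starting with ">".

head rcsb_pdb_7DDD.fasta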

We have packaged AlphaFold2 in a Singularity image.  To run the AlphaFold2 Singularity image on Longleaf, we submit the job to Slurm and let Slurm allocate resources and dispatch the job.  We provide an example Slurm job submission script for running AlphaFold2, which we can copy over and customize for this job.

cd /pine/scr/c/d/cdpoon/covid
cp /nas/longleaf/apps/alphafold/2.0/examples/run.slurm .

Use your favorite editor to edit the run.slurm file.  The most common text editor on Linux is vi; Emacs is also very popular.

vi run.slurm

There are a few things we would like to change in the run.slurm file; a scripted alternative to editing by hand is sketched after the list.

  1. The job-name can be changed to whatever you like, for example, AF2-7DDD.
  2. The WORKING_DIR should be edited to your project directory, for example, /pine/scr/c/d/cdpoon/covid.
  3. The FASTA_FILE should be changed to the name of the fasta file, for example, rcsb_pdb_7DDD.fasta.
  4. Since the S protein structure is already in the PDB database, we set MAX_TEMPLATE_DATE earlier than the S protein release date to prevent AlphaFold2 from picking up the experimental electron microscopy structure as a template.  In other words, we would like AlphaFold2 to predict the structure instead of copying the experimental one.  The experimental 7DDD structure was published near the end of 2020, so we set MAX_TEMPLATE_DATE to 2020-01-01.
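
If you prefer to script these edits rather than make them by hand, something along these lines should work with sed; the values shown are just the examples from this page, so substitute your own.

sed -i 's/^#SBATCH --job-name=.*/#SBATCH --job-name=AF2_7DDD/' run.slurm
sed -i 's|^WORKING_DIR=.*|WORKING_DIR=/pine/scr/c/d/cdpoon/covid|' run.slurm
sed -i 's/^FASTA_FILE=.*/FASTA_FILE=rcsb_pdb_7DDD.fasta/' run.slurm
sed -i 's/^MAX_TEMPLATE_DATE=.*/MAX_TEMPLATE_DATE=2020-01-01/' run.slurm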

The job is configured to use 1 GPU, 8 CPU cores, and 30GB of memory.  AlphaFold2 is hard-coded to use 8 CPU cores.  For the GPU, we will be using an NVIDIA Ampere A100 GPU in the beta-gpu partition.  This partition is under testing; once testing is finished, we will move these nodes to production and the name of the partition to use will change.
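
If you would like to check the nodes in the partition before submitting, sinfo can list them; keep in mind that the partition name may be different once the nodes move to production.

sinfo -p beta-gpu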

For the moment, we keep the AlphaFold2 databases in the /datacommons/alphafold/db_20210827 directory.  This may change in the future; if we download updated databases later, we will save them in a new directory.  Make sure that you point to the one you want to use.
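
You can list that directory to confirm the databases are where the script expects them.

ls /datacommons/alphafold/db_20210827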

The resulting run.slurm file will look something like the following.

#!/bin/bash

# AlphaFold 2.0 Slurm Job Submission Script Example for Ampere A100
# In Longleaf, submit job with "sbatch run.slurm" command.

#SBATCH --job-name=AF2_7DDD
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G
#SBATCH --time=2:00:00
#SBATCH --partition=beta-gpu
#SBATCH --output=log.%x.%j
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu_access

# View hostname and state of GPUs assigned
hostname
nvidia-smi

# Set working directory and filename of fasta file
# Save output in working directory
WORKING_DIR=/pine/scr/c/d/cdpoon/covid
FASTA_FILE=rcsb_pdb_7DDD.fasta

# Set job parameters, PRESET can be full_dbs, reduced_dbs, or casp14
MODEL_NAME=model_1
PRESET=full_dbs
MAX_TEMPLATE_DATE=2020-01-01

# Set environment variables for AlphaFold
export PYTHONNOUSERSITE=True
ALPHAFOLD_DATA_PATH=/datacommons/alphafold/db_20210827

# Set Singularity path and image name for AlphaFold
SINGULARITY_BASE=/nas/longleaf/apps/alphafold/2.0/sif
SINGULARITY_IMAGE=$SINGULARITY_BASE/alphafold2.0-cuda11.0-ubuntu20.04.sif

# Run Singularity image
singularity run --nv \
    -B $ALPHAFOLD_DATA_PATH:/data \
    -B $WORKING_DIR:/file \
    -B .:/etc \
    --pwd /app/alphafold $SINGULARITY_IMAGE \
    --fasta_paths=/file/$FASTA_FILE \
    --data_dir=/data \
    --output_dir=/file \
    --model_names=$MODEL_NAME \
    --preset=$PRESET \
    --max_template_date=$MAX_TEMPLATE_DATE \
    --uniref90_database_path=/data/uniref90/uniref90.fasta \
    --mgnify_database_path=/data/mgnify/mgy_clusters.fa \
    --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --pdb70_database_path=/data/pdb70/pdb70 \
    --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat

# Clean up files created in the working directory by the .:/etc bind
rm $WORKING_DIR/localtime
rm $WORKING_DIR/ld.so.cache

To submit the job to the Longleaf beta-gpu partition, run the following command.

sbatch run.slurm
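
While the job is pending or running, you can check on it with squeue and follow the log file as it is written; replace <job_id> with the job ID printed by sbatch.

squeue -u $USER
tail -f log.AF2_7DDD.<job_id>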

The job should take less than an hour to finish.  Once it is finished, a new output directory named rcsb_pdb_7DDD will appear, containing all the output files.  To visualize the output PDB files, we can use PyMOL in Open OnDemand, for example.  There is also a log file named log.AF2_7DDD.<job_id>, where <job_id> is the Slurm job ID.  It is a good idea to see how the job behaved in terms of memory and CPU usage with the following command.

seff <job_id>

Check the output of the seff command and adjust the memory and time requested for the job next time as needed.  The idea is not to request far more memory than the job actually uses.
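
As a purely hypothetical illustration, if seff were to report roughly 20GB of memory used and a 45-minute runtime, the request in run.slurm could be trimmed for the next run along these lines; the numbers here are made up for the example.

#SBATCH --mem=24G
#SBATCH --time=1:00:00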