AlphaFold2
AlphaFold2 from DeepMind has been released as an open source application. At UNC Research Computing Center, we are able to run AlphaFold2 on our machines to predict the 3D structure of a protein from a chain of amino acids. Following the steps below, we will be able to run AlphaFold2 on the Longleaf cluster.
First, log in to a Longleaf login node.
ssh longleaf.its.unc.edu
Then, create a directory for your project to keep the input and output files. For example, we use this directory in /pine scratch space to study the S protein structure of the COVID-19 virus.
mkdir -p /pine/scr/c/d/cdpoon/covid
Once we have the project working directory in place, we can download the fasta sequence of the S protein.
cd /pine/scr/c/d/cdpoon/covid
wget https://www.rcsb.org/fasta/entry/7DDD -O rcsb_pdb_7DDD.fasta
We now have a file named rcsb_pdb_7DDD.fasta in the working directory.
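As a quick check, we can look at the first few lines of the downloaded file to confirm it contains the expected sequence header.
head rcsb_pdb_7DDD.fasta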
We have packaged AlphaFold2 in a Singularity image. To run the AlphaFold2 Singularity image on Longleaf, we submit the job to Slurm and let Slurm allocate resources and dispatch it. We provide an example Slurm job submission script for running AlphaFold2, which we can copy over and customize for this job.
cd /pine/scr/c/d/cdpoon/covid
cp /nas/longleaf/apps/alphafold/2.0/examples/run.slurm .
Use your favorite editor to edit the run.slurm file. The most common text editor in Linux is vi. Emacs is also very popular.
vi run.slurm
There are a few things we would like to change in the run.slurm file.
- The job-name can be changed to whatever you like, for example, AF2_7DDD.
- The WORKING_DIR should be edited to your project directory, for example, /pine/scr/c/d/cdpoon/covid.
- The FASTA_FILE should be changed to the name of the fasta file, for example, rcsb_pdb_7DDD.fasta.
- Since the S protein structure is already in the PDB database, we set MAX_TEMPLATE_DATE earlier than the S protein release date to keep AlphaFold2 from picking up the experimental electron microscopy structure as a template. In other words, we would like AlphaFold2 to predict the structure instead of reusing the experimental one. The experimental 7DDD structure was published by the end of 2020, so we set MAX_TEMPLATE_DATE to 2020-01-01. The edited values are summarized in the snippet after this list.
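For this example, the edited lines in run.slurm read as follows (the same values appear in the full script further down).
#SBATCH --job-name=AF2_7DDD
WORKING_DIR=/pine/scr/c/d/cdpoon/covid
FASTA_FILE=rcsb_pdb_7DDD.fasta
MAX_TEMPLATE_DATE=2020-01-01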
The job is configured to use 1 GPU, 8 CPU cores, and 30GB of memory. For CPU, AlphaFold2 is hard-coded to use 8 cores. For GPU, we will be using an NVIDIA Ampere A100 GPU in the beta-gpu partition. This beta-gpu partition is under testing; once testing is finished, we will move these nodes to production, and the name of the partition to use will change.
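If you want to check the state of the nodes in the beta-gpu partition before submitting, the standard Slurm sinfo command can be used.
sinfo -p beta-gpu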
For the moment, we keep the AlphaFold2 databases in the /datacommons/alphafold/db_20210827 directory. This may change in the future; if we download updated databases later, we will save them in a new directory. Make sure that you point to the one you want to use.
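To see which database directories are currently available, you can list the parent directory (the exact listing may change over time).
ls /datacommons/alphafold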
The resulting run.slurm file will look something like the following.
#!/bin/bash

# AlphaFold 2.0 Slurm Job Submission Script Example for Ampere A100
# In Longleaf, submit job with "sbatch run.slurm" command.

#SBATCH --job-name=AF2_7DDD
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=30G
#SBATCH --time=2:00:00
#SBATCH --partition=beta-gpu
#SBATCH --output=log.%x.%j
#SBATCH --gres=gpu:1
#SBATCH --qos=gpu_access

# View hostname and state of GPUs assigned
hostname
nvidia-smi

# Set working directory and filename of fasta file
# Save output in working directory
WORKING_DIR=/pine/scr/c/d/cdpoon/covid
FASTA_FILE=rcsb_pdb_7DDD.fasta

# Set job parameters, PRESET can be full_dbs, reduced_dbs, or casp14
MODEL_NAME=model_1
PRESET=full_dbs
MAX_TEMPLATE_DATE=2020-01-01

# Set environment variables for AlphaFold
export PYTHONNOUSERSITE=True
ALPHAFOLD_DATA_PATH=/datacommons/alphafold/db_20210827

# Set Singularity path and image name for AlphaFold
SINGULARITY_BASE=/nas/longleaf/apps/alphafold/2.0/sif
SINGULARITY_IMAGE=$SINGULARITY_BASE/alphafold2.0-cuda11.0-ubuntu20.04.sif

# Run Singularity image
singularity run --nv \
    -B $ALPHAFOLD_DATA_PATH:/data \
    -B $WORKING_DIR:/file \
    -B .:/etc \
    --pwd /app/alphafold $SINGULARITY_IMAGE \
    --fasta_paths=/file/$FASTA_FILE \
    --data_dir=/data \
    --output_dir=/file \
    --model_names=$MODEL_NAME \
    --preset=$PRESET \
    --max_template_date=$MAX_TEMPLATE_DATE \
    --uniref90_database_path=/data/uniref90/uniref90.fasta \
    --mgnify_database_path=/data/mgnify/mgy_clusters.fa \
    --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --pdb70_database_path=/data/pdb70/pdb70 \
    --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
    --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat

# Clean up
rm $WORKING_DIR/localtime
rm $WORKING_DIR/ld.so.cache
To submit the job to the Longleaf beta-gpu partition, we run the following command.
sbatch run.slurm
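While the job is pending or running, its status can be checked with the standard Slurm squeue command.
squeue -u $USER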
The job will take less than an hour to finish. Once it finishes, a new output directory named rcsb_pdb_7DDD will appear, containing all the output files. To visualize the output PDB files, we can use PyMOL in Open OnDemand, for example. There is also a log file named log.AF2_7DDD.<job_id>, where <job_id> is the Slurm job ID. It is a good idea to see how the job behaved in terms of memory usage and CPU usage with the following command.
seff <job_id>
Check the output of the seff command and adjust the memory and time requested for the job as needed next time. The idea is not to request excessive memory for the job.
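For example, if seff reports that the job used well under the 30GB requested and finished in under an hour, the corresponding lines in run.slurm could be lowered for the next run (the values below are only illustrative; choose them based on what seff reports).
#SBATCH --mem=20G
#SBATCH --time=1:00:00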