Jupyter Notebooks to HPC
Erick Verleye gives a live coding demo utilizing Jupyter notebooks
Date: 01/03/2023
Speaker:
Erick Verleye, Earth Lab
Abstract:
Jupyter notebooks are great development and collaboration tools; however, they are often not suitable for long-running, computationally expensive workflows. In this talk I will show how to create executable Python scripts, runtime environments, and HPC job submissions for your Jupyter notebooks, allowing you to access incredibly powerful, and free, compute resources from your personal computer.
Bio:
Erick has a background in physics and programming, having completed a degree in physics from Michigan State University in 2019. Since joining Earth Lab in August 2022, Erick worked on NASA’s IXPE mission at Marshall Space Flight Center as a software engineer. Erick loves programming, and teaching people about it when he can.
Jupyter Notebooks to HPC
This post will explain how to get code from a Jupyter Notebook running on CU Boulder’s High Performance Computing environment. Although some steps will be specific to CU’s HPC, there is still value in this post for researchers working on other universities’ HPC, as most research computing environments are very similar. All files / code used in this demonstration can be found in the GitHub repo here: https://github.com/Ckster/jupyter_to_hpc_demo
Jupyter Notebooks are powerful development and collaboration tools, but in environments where setting up a Jupyter server can be anywhere from challenging to impossible, an alternative is needed. This tutorial shows how to convert your Notebook code to a lightweight and portable Python executable file that is well suited to deployment on the HPC.
Figure 1: Python executable script and Conda environment requirements file are created locally from the Jupyter Notebook. Files are transferred to the HPC using SCP and code is run using a SLURM batch file.
Prerequisites
·In order to run code on the HPC, you should be familiar with basic Linux commands and know how to navigate the file system.
·If you do not have an account on the HPC and are affiliated with CSU or CU Boulder, you can request an account at https://rcamp.rc.colorado.edu/accounts/account-request/create/organization
Connecting to the HPC
Before interacting with the HPC and its hardware, you will first need to use ssh to connect to a login node. From your local machine, using either Terminal (macOS / Linux) or PowerShell (Windows), type the command:
ssh <your_identikey>@login.rc.colorado.edu
Provisioning Project Files
Because the home directory doesn’t have much disk space allocated, new projects should be provisioned in your user’s ‘projects’ directory. ‘cd’ into the directory at /projects/<your_identikey> and then make a new directory that is the name of your project. This demo will deploy a project called vnir, so the path of the project directory will be /projects/erve3705/vnir.
After making a directory for your project, make the following directories within it:
· data_in – for storing any input data files
· data_out – for storing the results from the code runs
· logs – for storing error and log files for each run
and create a blank file called sbatch.txt which will be filled in later with SLURM directives for running a job on the HPC. At this point the project directory should look like this:
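The layout (data_in, data_out, logs, and an empty sbatch.txt inside the project directory) can also be created programmatically. The sketch below is only an illustration: `provision_project` is a hypothetical helper that is not part of the demo repo, and a temporary directory stands in for /projects/<your_identikey>/vnir so that the code can be run anywhere.

```python
import tempfile
from pathlib import Path


def provision_project(project_dir):
    """Create data_in, data_out, logs and an empty sbatch.txt under project_dir."""
    project_dir = Path(project_dir)
    for name in ("data_in", "data_out", "logs"):
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    # Blank file that will later hold the SLURM directives.
    (project_dir / "sbatch.txt").touch()
    # Return the top-level entries of the project, sorted for display.
    return sorted(p.name for p in project_dir.iterdir())


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the real project path.
    with tempfile.TemporaryDirectory() as tmp:
        for entry in provision_project(Path(tmp) / "vnir"):
            print(entry, flush=True)
```

On the HPC itself, the same result is usually obtained with a few mkdir and touch commands from the shell.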
Creating a Conda Environment
Next, we need to configure a conda environment for the code to run in. In order to do this, we need to be able to access the Anaconda installation on the HPC. Software installations on the HPC are managed as “modules” and in order to access these modules we must first load them. To load a module, first connect to a compile node by executing, from a login node, the command:
ssh scompile
Once connected, you can load anaconda using:
module load anaconda
You should now be in the base conda environment, indicated by the (base) next to your login name.
Before creating a new environment, we must configure conda to store packages and environments somewhere other than the home directory, which has limited disk space. To do this, open the .condarc file in your home directory with a text editor (like vim) using:
vim ~/.condarc
and copy the following lines into the file:
pkgs_dirs:
- /projects/$USER/.conda_pkgs
envs_dirs:
- /projects/$USER/software/anaconda/envs
If you are using vim, press esc, type :wq, and press enter to write to and quit editing the file. Now new software installed by conda will be written to your projects directory.
Next, create a new Python conda environment by running the following command with the name of your environment and Python version:
conda create -n <name_of_env> python==<desired_python_version>
Answer yes to any prompts. Once the environment has been successfully created, it is time to start writing the executable file that will be run on the HPC.
Preparing Files for the HPC
Setting Up a Local Project
First create a new project in a scripting environment where you will write the executable file. Popular environments are PyCharm, VSCode, Spyder, and IDLE. This demo will be using PyCharm. Move the original Jupyter Notebook and any data files that will be used by the code into the project. Create a new, blank Python file which will be the executable.
Here the input_files folder contains the input data for the Notebook, the .ipynb file is the original notebook, and the .py file is the newly created, blank Python file.
Writing the Executable File
Start with the following template for your executable. The code for this can be copied from the template.py file in the linked GitHub repo. Make sure to edit the PROJECT_DIR variable to define your own project’s directory:
The DATA_IN and DATA_OUT constants are defined as the absolute paths to the directories that were created earlier in the HPC projects directory. DATA_OUT contains one extra nested directory, named after the value of the SLURM_JOB_ID environment variable. This ensures that each run writes its output data to its own uniquely named directory.
Because PROJECT_DIR is defined as an absolute path, we will not have to worry about which working directory the script is run out of. In general, any i/o should not care about which working directory the script is running in, and the working directory should not be changed within a running executable script.
The first line of the main function creates this output file directory since there will be a new directory for each run.
The hook, if __name__ == '__main__', will only be true if this file is the entry point for execution. In other words, code indented underneath it will not be executed when this file is imported into another file. Because this file will be executed from the command line, if __name__ == '__main__' will evaluate to True.
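As a rough illustration of the structure just described, here is a minimal sketch of such a template. It is not the template.py from the repo: the `build_paths` helper, the "local" fallback used when SLURM_JOB_ID is not set, and the temporary directory standing in for PROJECT_DIR are all assumptions, made so that the sketch also runs off the HPC.

```python
import os
import tempfile


def build_paths(project_dir, job_id):
    """Return (DATA_IN, DATA_OUT) for one run of the job."""
    data_in = os.path.join(project_dir, "data_in")
    # One output directory per job keeps the results of each run separate.
    data_out = os.path.join(project_dir, "data_out", job_id)
    return data_in, data_out


def main(project_dir):
    # SLURM sets SLURM_JOB_ID inside a job; "local" is an assumed fallback.
    job_id = os.environ.get("SLURM_JOB_ID", "local")
    data_in, data_out = build_paths(project_dir, job_id)
    # First step: create this run's output directory.
    os.makedirs(data_out, exist_ok=True)
    print(f"Reading from {data_in}, writing to {data_out}", flush=True)
    return data_out


if __name__ == "__main__":
    # On the HPC, PROJECT_DIR would be an absolute path such as
    # /projects/<your_identikey>/vnir; a temporary directory stands in here.
    with tempfile.TemporaryDirectory() as PROJECT_DIR:
        main(PROJECT_DIR)
```

Because every path is built from the project directory, the script behaves the same regardless of the working directory it is launched from.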
Start filling out the executable file by copying over the import statements from the Notebook.
Now begin to copy the functions, classes, and code that needs to be run to the executable script. Make sure paths to input data now utilize the DATA_IN variable, like so:
Because the training-data folder will be placed directly into data_in, the ‘DC’ component of the Notebook path can be dropped.
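A minimal sketch of this path change follows. Only the training-data folder name comes from the demo; the DATA_IN value, the assumed form of the Notebook path, and the file name tile_001.tif are hypothetical placeholders.

```python
import os

# Assumption: DATA_IN is the absolute data_in path defined in the template.
DATA_IN = os.path.join("/projects", "<your_identikey>", "vnir", "data_in")

# In the Notebook (assumed form): os.path.join("DC", "training-data")
# In the executable the folder sits directly under data_in, so "DC" is dropped.
TRAINING_DIR = os.path.join(DATA_IN, "training-data")


def training_file(name):
    """Build the absolute path to one file in the training-data folder."""
    return os.path.join(TRAINING_DIR, name)


if __name__ == "__main__":
    # "tile_001.tif" is a hypothetical file name used only for illustration.
    print(training_file("tile_001.tif"), flush=True)
```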
Also make sure any print statements now include a flush=True argument so that the statements are flushed to stdout and written to the .out log file that SLURM will create when the job runs.
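A small illustration of the pattern is shown below. The `log` wrapper and the `report_progress` function are hypothetical conveniences, not something the demo script defines; passing flush=True directly to each print call works just as well.

```python
import functools

# A print variant that always flushes, so messages reach the SLURM .out
# file as they happen instead of sitting in a buffer.
log = functools.partial(print, flush=True)


def report_progress(epoch, total):
    """Emit a progress line that is flushed immediately."""
    message = f"epoch {epoch}/{total} finished"
    log(message)
    return message


if __name__ == "__main__":
    print("starting run", flush=True)  # explicit form
    report_progress(1, 10)             # via the wrapper
```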
Look out for any plots that were being viewed within the notebook – these will now have to be saved as images in order to view them after the job has run. Save them to the DATA_OUT directory like so:
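The sketch below shows one way to do this, assuming the plots are made with matplotlib. The `save_loss_plot` function, the loss values, and the file name are hypothetical, and a temporary directory stands in for DATA_OUT.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no display is needed
import matplotlib.pyplot as plt


def save_loss_plot(losses, out_dir, filename="loss.png"):
    """Plot the values in `losses` and save the figure as an image in out_dir."""
    fig, ax = plt.subplots()
    ax.plot(range(1, len(losses) + 1), losses)
    ax.set_xlabel("epoch")
    ax.set_ylabel("loss")
    path = os.path.join(out_dir, filename)
    fig.savefig(path)   # takes the place of viewing the plot in the Notebook
    plt.close(fig)      # release the figure's memory in long-running jobs
    return path


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for DATA_OUT.
    with tempfile.TemporaryDirectory() as DATA_OUT:
        print(save_loss_plot([0.9, 0.5, 0.3], DATA_OUT), flush=True)
```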
Make sure any other file paths used for writing data now include the DATA_OUT directory variable as well. In the example below, DATA_OUT is added to the save_dir path on line 358:
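A minimal sketch of the same idea, separate from the demo script: the `save_results` function, the metrics subdirectory, and the results.json file name are hypothetical, and a temporary directory stands in for DATA_OUT.

```python
import json
import os
import tempfile


def save_results(results, data_out, subdir="metrics"):
    """Write results as JSON under data_out/subdir and return the file path."""
    # Before: save_dir = "metrics"  (relative to wherever the Notebook ran)
    # After:  every write location is anchored at DATA_OUT.
    save_dir = os.path.join(data_out, subdir)
    os.makedirs(save_dir, exist_ok=True)
    path = os.path.join(save_dir, "results.json")
    with open(path, "w") as f:
        json.dump(results, f)
    return path


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for DATA_OUT.
    with tempfile.TemporaryDirectory() as DATA_OUT:
        print(save_results({"accuracy": 0.9}, DATA_OUT), flush=True)
```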
Dependencies
Create a new file called requirements.txt and write each required dependency’s name as it is registered in PyPI, since we will be using pip to install them:
Using pip and a requirements.txt file is usually the simplest and most convenient way to install dependencies. You can specify version numbers for packages by writing, for example:
numpy==1.23.5
matplotlib>=2.0.0
If you would like to use conda to install your packages, feel free to use another method for configuring your conda environment on the HPC.
Transferring Files to the HPC
Once all the code is transferred to the executable and all dependencies have been included in the requirements.txt file, it is time to transfer files from the local machine to the HPC. The “scp” program, which stands for secure copy, is a simple way to transfer files from one machine to another. The executable script, requirements.txt file, and the input data directory must all be transferred to the projects directory created earlier. Directories can be transferred with scp by adding the -r flag to the command:
scp -r input_files/ erve3705@login.rc.colorado.edu:/projects/erve3705/vnir
scp requirements.txt erve3705@login.rc.colorado.edu:/projects/erve3705/vnir
scp vnir_resnet_DC_planet.py erve3705@login.rc.colorado.edu:/projects/erve3705/vnir
Running the Code on HPC
Installing Dependencies in Conda Environment
Now that the files have been transferred, ssh back into the HPC and connect to a compile node:
ssh scompile
load anaconda:
module load anaconda
and activate the environment created earlier:
conda activate vnirDemo
cd into the project directory containing the requirements.txt file transferred earlier and install the dependencies with the command:
pip install -r requirements.txt
Pip should begin installing all the listed dependencies. This process could take a long time but keep an eye out for any install failures. You can verify that the packages have been installed successfully by trying to import each of them in a Python session.
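The import check can be scripted rather than typed by hand. In this sketch, `failed_imports` is a hypothetical helper and the module list is a placeholder; replace it with the import names your own script uses.

```python
import importlib


def failed_imports(module_names):
    """Try to import each module and return the names that could not be imported."""
    failed = []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Assumption: placeholder list. Import names can differ from PyPI names
    # (for example, the PyPI package "scikit-learn" is imported as "sklearn").
    wanted = ["json", "csv"]
    print("failed to import:", failed_imports(wanted), flush=True)
```

An empty list means every listed module imported successfully in the active environment.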
Submitting a Job
NOTE: Do NOT run code on the login or compile nodes directly. There are only a small number of resources reserved for each. You will be contacted by the HPC administrators if you run computationally expensive code on either of these nodes.
Once the runtime environment has been successfully configured, make sure that any input files have been moved to the project’s data_in directory.
Now that all the necessary files are in place and the runtime environment has been created, it’s time to submit the job to SLURM. To do this, edit the sbatch.txt file created earlier, using the example below as a guide for your own project. CURC documentation on sbatch files can be found here: https://curc.readthedocs.io/en/latest/running-jobs/batch-jobs.html
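To give a sense of what such a file contains, the sketch below generates the text of a batch file from Python so that its contents can be inspected. It is not the sbatch.txt from the demo: the `render_sbatch` helper, the job name, the amilan partition, the one-hour time limit, the task count, and the log file names are all assumptions to adapt to your own job. The script name and environment name are the ones used earlier in this post. The same lines can simply be typed into sbatch.txt by hand.

```python
def render_sbatch(project_dir, script, env_name,
                  partition="amilan", time_limit="01:00:00", ntasks=1):
    """Return the text of a SLURM batch file for one run of `script`."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=vnir",
        f"#SBATCH --partition={partition}",
        "#SBATCH --nodes=1",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --time={time_limit}",
        # SLURM replaces %j with the job ID, so each run gets its own logs.
        f"#SBATCH --output={project_dir}/logs/vnir.%j.out",
        f"#SBATCH --error={project_dir}/logs/vnir.%j.error",
        "",
        # Commands the job runs on the compute node.
        "module purge",
        "module load anaconda",
        f"conda activate {env_name}",
        f"python {project_dir}/{script}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch("/projects/<your_identikey>/vnir",
                        "vnir_resnet_DC_planet.py", "vnirDemo"),
          end="", flush=True)
```

Using absolute paths for the log files and the script keeps the job independent of the directory from which it is submitted.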
If your code stands to gain from using a lot of compute resources, don’t be scared to ask for them. The worst that will happen is that SLURM will estimate that your job won’t be run for a long time, in which case the job can be cancelled and resubmitted asking for fewer resources.
Information about the Alpine partitions, including partition names and hardware specifications useful for submitting jobs can be found here: https://curc.readthedocs.io/en/stable/clusters/alpine/alpine-hardware.html .
If your job will be running on GPUs, high memory nodes, or for a long period of time ( > 24h ) please see the previous link for adding special SLURM directives to the sbatch.txt file.
Assuming that an Alpine partition was specified, the Alpine module must be loaded before the sbatch.txt file is submitted. From a login node (if on a compile node, use the exit command to get back to a login node), load the Alpine module with the command:
module load slurm/alpine
You can now submit your job by executing the command:
sbatch sbatch.txt
Job Management
Job status can be viewed with the command:
squeue -u <identikey>
If SLURM is saying your job will take an unacceptable amount of time to be processed, try asking for fewer resources in the sbatch.txt file.
Jobs can be canceled with the command:
scancel <job_id>
After a Job Ends
If the job ends with an error, look for the stack trace in the .error log file that was named in the sbatch.txt’s ‘--error’ directive. Anything written to stdout will be stored in the corresponding .out file.
Output data will be written to the subdirectory named with the job ID within the data_out directory. After a successful run, scp can be used to transfer the data from the HPC to your local machine or another server.
Conclusion
The HPC is a powerful resource and learning how to utilize it adds a useful tool to your coding repertoire. If questions remain after reading this tutorial, please feel free to reach out to me over email at erve3705@colorado.edu, or read the CURC HPC documentation at https://curc.readthedocs.io/en/latest/index.html.