EDS Seminar Series
Python

Jupyter Notebooks to HPC

Erick Verleye gives a live coding demo utilizing Jupyter notebooks

Date: 01/03/2023
Speaker:
Erick Verleye, Earth Lab
 

Abstract:
Jupyter notebooks are great development and collaboration tools, however they are often not suitable for long-running, computationally expensive workflows. In this talk I will show how to create executable python scripts, runtime environments, and HPC job submissions for your Jupyter notebooks – allowing you to access incredibly powerful, and free, compute resources from your personal computer.
 

Bio:
Erick has a background in physics and programming, having completed a degree in physics from Michigan State University in 2019. Before joining Earth Lab in August 2022, Erick worked on NASA’s IXPE mission at Marshall Space Flight Center as a software engineer. Erick loves programming, and teaching people about it when he can.

Jupyter Notebooks to HPC

This post will explain how to get code from a Jupyter Notebook running on CU Boulder’s High Performance Computing environment. Although some steps will be specific to CU’s HPC, there is still value in this post for researchers working on other universities’ HPC, as most research computing environments are very similar. All files / code used in this demonstration can be found in the GitHub repo here: https://github.com/Ckster/jupyter_to_hpc_demo 
Jupyter Notebooks are powerful development and collaboration tools, but in environments where setting up a Jupyter server can be anywhere from challenging to impossible, an alternative is needed. This tutorial will therefore show how to convert your Notebook code into a lightweight, portable Python executable file perfect for deployment on the HPC.

Figure 1: Python executable script and Conda environment requirements file are created locally from the Jupyter Notebook. Files are transferred to the HPC using SCP and code is run using a SLURM batch file.

Prerequisites
·  In order to run code on the HPC, you should be familiar with basic Linux commands and know how to navigate the file system.
·  If you do not have an account on the HPC and are affiliated with CSU or CU Boulder, you can request an account at https://rcamp.rc.colorado.edu/accounts/account-request/create/organization

Connecting to the HPC
Before interacting with the HPC and its hardware, you will first need to use ssh to connect to a login node. From your local machine, using either the terminal (macOS / Linux) or PowerShell (Windows), type the command:
ssh <your_identikey>@login.rc.colorado.edu

After entering your password and accepting the Duo push notification, you should now have control of a Linux command line logged into your CURC account and be in your home directory.

Provisioning Project Files
Because the home directory doesn’t have much disk space allocated, new projects should be provisioned in your user’s ‘projects’ directory. ‘cd’ into the directory at /projects/<your_identikey> and then make a new directory that is the name of your project. This demo will deploy a project called vnir, so the path of the project directory will be /projects/erve3705/vnir.

After making a directory for your project, make the following directories within it:
·  data_in – for storing any input data files
·  data_out – for storing the results from code runs
·  logs – for storing error and log files for each run


and create a blank file called sbatch.txt which will be filled in later with SLURM directives for running a job on the HPC. At this point the project directory should look like this:
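The layout above can be created with a few shell commands. The erve3705/vnir path is this demo’s example; substitute your own identikey and project name. (The temporary-directory fallback below is only so this sketch can run off the HPC.)

```shell
# On the HPC this would be /projects/<your_identikey>/<project_name>,
# e.g. /projects/erve3705/vnir; the mktemp fallback lets the sketch run anywhere.
PROJECT_DIR="${PROJECT_DIR:-$(mktemp -d)/vnir}"

mkdir -p "$PROJECT_DIR/data_in"    # input data files
mkdir -p "$PROJECT_DIR/data_out"   # results from each code run
mkdir -p "$PROJECT_DIR/logs"       # error and log files for each run
touch "$PROJECT_DIR/sbatch.txt"    # SLURM directives, filled in later
```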

Creating a Conda Environment
Next, we need to configure a conda environment for the code to run in. In order to do this, we need to be able to access the Anaconda installation on the HPC. Software installations on the HPC are managed as “modules” and in order to access these modules we must first load them. To load a module, first connect to a compile node by executing, from a login node, the command:
ssh scompile

Once connected, you can load anaconda using:
module load anaconda

You should now be in the base conda environment, indicated by the (base) next to your login name.

Before creating a new environment, we must configure conda to install packages for new environments somewhere other than the home directory, which has limited disk space. To do this, open the .condarc file in your home directory with a text editor (like vim) using:


vim ~/.condarc


and copy the following lines into the file:
pkgs_dirs:
  - /projects/$USER/.conda_pkgs
envs_dirs:
  - /projects/$USER/software/anaconda/envs

If you are using vim, press esc, type :wq, and press enter to save the file and quit. New packages and environments installed by conda will now be written to your projects directory.

Next, create a new python conda environment by running the following command with the name of your environment and python version:


conda create -n <name_of_env> python==<desired_python_version>

Answer yes to any prompts. Once the environment has been successfully created, it is time to start writing the executable file that will be run on the HPC.

Preparing Files for the HPC


Setting Up a Local Project


First create a new project in a scripting environment where you will write the executable file. Popular environments are PyCharm, VSCode, Spyder, and IDLE. This demo will be using PyCharm. Move the original Jupyter Notebook and any data files that will be used by the code into the project. Create a new, blank Python file which will be the executable.

Here the input_files folder contains the input data for the Notebook, the .ipynb file is the original notebook, and the .py file is the newly created, blank Python file.

Writing the Executable File


Start with the following template for your executable (the code can be copied from the template.py file in the linked GitHub repo), and make sure to edit the PROJECT_DIR variable to point to your own project’s directory:

The DATA_IN and DATA_OUT constants are defined as the absolute paths to the directories created earlier in the HPC projects directory. DATA_OUT contains one extra nested directory, named after the SLURM_JOB_ID environment variable, which ensures that each run writes its output to its own uniquely named directory.

Because PROJECT_DIR is defined as an absolute path, we do not have to worry about which working directory the script is run from. In general, I/O should not depend on the current working directory, and the working directory should not be changed while the script is running.

The first line of the main function creates this output directory, since a new one is needed for each run.

The guard, if __name__ == '__main__':, will only be true if this file is the entry point for execution. In other words, code indented under this block will not be executed if code from this file is imported into another file. Because this file will be executed from the command line, if __name__ == '__main__' will evaluate to True.
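Based on the description above, a minimal sketch of the template might look like this. The authoritative version is template.py in the linked GitHub repo; on the HPC, PROJECT_DIR is hardcoded to your project path, and the temporary-directory fallback here is only so the sketch runs on any machine.

```python
import os
import tempfile

# On the HPC this would be hardcoded to your project directory, e.g.
# PROJECT_DIR = '/projects/erve3705/vnir'. The temporary-directory fallback
# below is only so this sketch can run off the HPC.
PROJECT_DIR = os.environ.get('PROJECT_DIR', os.path.join(tempfile.gettempdir(), 'vnir'))

DATA_IN = os.path.join(PROJECT_DIR, 'data_in')

# Nest the output under the SLURM job ID so each run writes to its own
# uniquely named directory; 'local' is a fallback outside of a SLURM job.
DATA_OUT = os.path.join(PROJECT_DIR, 'data_out', os.environ.get('SLURM_JOB_ID', 'local'))


def main():
    # First create the per-run output directory, since it is new for each run
    os.makedirs(DATA_OUT, exist_ok=True)
    # ... code ported from the Notebook goes here ...


if __name__ == '__main__':
    # Only runs when this file is the entry point, not when it is imported
    main()
```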

Start filling out the executable file by copying over the import statements from the Notebook.
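One caveat when copying cells over: Jupyter magic commands (lines starting with % or !) are not valid Python and must be removed or replaced in the script. The imports below are purely illustrative; use your own Notebook’s imports.

```python
# Imports copied from the Notebook (illustrative -- use your Notebook's own)
import os
import json

# Lines like these from a Notebook are NOT valid Python in a script
# and must be deleted or replaced:
#   %matplotlib inline
#   !pip install rasterio
```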