Note: ROBO GPU (H100 nodes) needs CUDA 11.8 or above. A good starting Docker image is here. It includes Ubuntu 20.04, CUDA 12.3, PyTorch 2.1, ROS Noetic, Open3D 0.18, and so on. You can pull it using:
Docker images (containing CUDA for GPU support built on a local machine) pulled as singularity images can have compatibility issues on PSC GPU instances. To deal with this problem, it is recommended to use the Nvidia NGC images to bootstrap your singularity images or use the pre-built ones available under /ocean/containers/ngc.
4.1 Pre-built NGC Images
The /ocean/containers/ngc directory contains several popular pre-built NGC images:
The stable way to use Singularity images is with a virtual environment. Since the Singularity images don't come with venv by default, you will have to install the virtual environment Python package in your /jet/home directory:
PSC by default offers 3 kinds of allocations, specified here. You can choose among them based on your needs with the "-p" and "--gres" command-line arguments.
6.1 ROBO GPU Partition
In some groups, users are automatically granted ROBO GPU access. If you cannot access the ROBO GPU partition, please talk to your PI or email help@psc.edu.
You can request the H100 resources using "-p ROBO --gres=gpu:h100:1"
6.2 Short-term Interactive Sessions
Short-term interactive sessions have a maximum time limit of 8 hours, but are fast to acquire.
Start a GPU interactive session:
interact -p GPU-shared -t 4:00:00 -n 2 --gres=gpu:v100-32:1  # requests 1 V100-32GB GPU for 4 hours, with 2 CPUs.
# Below is the command for the ROBO Cluster
# Please don't use for debugging or non-polished code under dev (This is expensive!)
# Please replace the XXXXX.X.XXXXXXX with the Oracle String from Basti or Admins.
srun --partition=ROBO --mem=64G -t 1:00:00 --mincpus=8 --gres=gpu:h100:1 --job-name=YOUR_JOB_NAME --pty /bin/bash  # requests 1 H100 GPU for 1 hour, with 8 CPUs.
Common workflow: Detach from a session without killing processes
local terminal → ssh to login node of PSC → start tmux session → GPU allocation → start singularity → attach to singularity → script/running experiments → detach tmux (Ctrl+b, then d).
The GPU allocation ends when the requested time limit expires.
Note which login node you started your tmux session from; you will need to log back in to that same node in order to re-attach to your script's process. Once there, use tmux attach -t <number> to re-attach.
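One way to avoid losing track of the login node is to record its name at the moment you start tmux. The following is a minimal sketch, not part of the original instructions: the session name "exp" and the file ~/.psc_tmux_login_node are hypothetical choices made for this example. It only writes the node name to a file and prints the tmux commands to run by hand.

```shell
# Sketch only: SESSION_NAME and NODE_FILE are hypothetical names chosen for this example.
SESSION_NAME="exp"
NODE_FILE="${NODE_FILE:-${HOME}/.psc_tmux_login_node}"

# Remember which login node this tmux session will live on.
uname -n > "${NODE_FILE}"

# Print the commands to run by hand (nothing is started automatically).
echo "Start a named session on this node:  tmux new -s ${SESSION_NAME}"
echo "Detach without killing processes:    Ctrl+b, then d"
echo "Later, log in to $(cat "${NODE_FILE}") and run:  tmux attach -t ${SESSION_NAME}"
```

Using a named session instead of a number makes the re-attach command easier to remember; tmux ls lists the sessions running on the current node if you forget the name.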
6.3 Long-term Batched Jobs
Long-term batched jobs have a maximum time limit of 48 hours (longer jobs are possible), but they are more complex to set up and usually wait longer in the queue.
Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits). All jobs must have the same initial sbatch options (e.g. size, time limit, etc.).
This is a powerful tool for running a large number of processes with various config files or changed parameters. The example below can be further extended with for loops in the bash scripts and other recursive elements.
An example way of setting up and running job arrays:
Things contained in <> are placeholders, please replace them with your own values.
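As a rough illustration of the idea, a job array script might look like the sketch below. This is an assumption-laden example rather than the original one: the partition and GPU type are taken from the interactive example earlier in this guide, while the array range, the configs/run_<index>.yaml naming scheme, and train.py are hypothetical. Slurm sets SLURM_ARRAY_TASK_ID separately for each task in the array; the script defaults it to 0 so it can be dry-run outside Slurm.

```shell
#!/bin/bash
# Sketch of a Slurm job array script. Values in <> are placeholders.
#SBATCH --job-name=<job-name>
#SBATCH --partition=GPU-shared
#SBATCH --gres=gpu:v100-32:1
#SBATCH --time=08:00:00
# Run three tasks, with array indices 0, 1 and 2.
#SBATCH --array=0-2
# %A is the array job ID, %a is the array task index.
#SBATCH --output=%A_%a.out

# Slurm sets SLURM_ARRAY_TASK_ID per task; default to 0 for a local dry run.
TASK_ID="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical naming scheme: one config file per array index.
CONFIG="configs/run_${TASK_ID}.yaml"

# Print the command instead of running it; swap in your real command here.
# train.py is a hypothetical script name.
echo "python train.py --config ${CONFIG}"
```

Saved as, say, array_job.sh (a hypothetical file name), it would be submitted once with sbatch array_job.sh, and Slurm would launch one task per index in the --array range.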
b) Install Rerun
Install Rerun on your local machine using the following instructions:
pip install rerun-sdk (via pip)
conda install -c conda-forge rerun-sdk (via Conda)
Likewise, install Rerun in the virtual environment, Docker image, or Singularity image used on the remote cluster.
c) Port forward compute node & serve Rerun websocket server on remote device
First, request compute for an interactive debugging session:
salloc -p RM-small -t 06:00:00 -n 5  # Asks for a CPU job
sleep 365d # Keeps the interactive job from timing out
Then, in a new terminal on the head/login node, SSH into the compute node which has been allocated to you.
You can find the compute node's name in the output of the interactive session requested above (for example, r002.ib.bridges2.psc.edu or v001.ib.bridges2.psc.edu).
ssh -L <rr-port>:localhost:<rr-port> \
    -L <ws-server-port>:localhost:<ws-server-port> \
    -L <rr-viewer-port>:localhost:<rr-viewer-port> \
    <node-name>.ib.bridges2.psc.edu
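To make the placeholders concrete, here is a small sketch that fills them in and prints the resulting command without running it. All values are assumptions for illustration: v001 is one of the example node names above, and the port numbers 9876, 9877, and 9090 are example choices that should be replaced with whatever ports your Rerun setup actually uses.

```shell
# Sketch only: every value below is an example and should be replaced with your own.
NODE_NAME="v001"     # compute node allocated to you
RR_PORT=9876         # <rr-port>
WS_PORT=9877         # <ws-server-port>
VIEWER_PORT=9090     # <rr-viewer-port>

# Build the port-forwarding command as a string and print it for inspection.
TUNNEL_CMD="ssh -L ${RR_PORT}:localhost:${RR_PORT} -L ${WS_PORT}:localhost:${WS_PORT} -L ${VIEWER_PORT}:localhost:${VIEWER_PORT} ${NODE_NAME}.ib.bridges2.psc.edu"
echo "${TUNNEL_CMD}"
```

Each -L option forwards one local port to the same port on the compute node, so all three ports must be free on the machine where you run ssh.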
Now, in the SSH session, launch your code environment and start tmux to create a terminal for running the Rerun server.
For example:
cd path/to/singularity
singularity instance start --nv example.sif example