Tutorial: Fast Setup on PSC

This is a fast-setup tutorial for PSC that covers only the most important guidance.
For more detailed information, please see the PSC Bridges2 User Guide.

1. Preparation

  • Get an ACCESS account
      • Sign up for an ACCESS account
  • Ask the Allocation Managers to add your ACCESS account
      • For AirLab: Basti, Wenshan, Yaoyu, Bowen, and Nikhil
      • For other labs: please contact your respective PI or lab PoC for this step
  • Enable DUO two-factor authentication
  • Receive an e-mail from PSC and register the PSC account
  • Other helpful resources:
      • Additional Other Slides on Intro to PSC
          • Please do not fill out the Google form on this slide deck
      • PSC User Guide

2. Log in to PSC

ssh [PSC username]@bridges2.psc.edu
This method prompts for your PSC password.

How to switch between different login nodes

The login command does not let you choose which login node you land on, but once logged in you can switch to a different one:
[bli5@bridges2-login011 ~]$ hostname
br011.ib.bridges2.psc.edu
[bli5@bridges2-login011 ~]$ ssh br012.ib.bridges2.psc.edu
(enter your PSC password when prompted)
[bli5@bridges2-login012 ~]$
Note: Tmux sessions and Singularity containers running on one login node (e.g., login012) are not visible from other login nodes (e.g., login011), so you may need to switch to the login node where you started them.
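
The session above suggests that the short name in the shell prompt (bridges2-login012) corresponds to the internal host name br012.ib.bridges2.psc.edu. The sketch below encodes that naming pattern so you can record which node holds your tmux session and build the matching ssh target later. The pattern is inferred only from the example above, and the function name is hypothetical.

```python
import re


def login_node_fqdn(short_name):
    """Map a prompt-style name such as 'bridges2-login012' to its internal host name.

    Assumption: the 'loginNNN' -> 'brNNN' pattern seen in the example session
    holds for every login node.
    """
    match = re.fullmatch(r"bridges2-login(\d{3})", short_name)
    if match is None:
        raise ValueError(f"unrecognized login node name: {short_name!r}")
    return f"br{match.group(1)}.ib.bridges2.psc.edu"


if __name__ == "__main__":
    # Print the ssh command for the node that holds your tmux session.
    print("ssh", login_node_fqdn("bridges2-login012"))
```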

3. Build Singularity Images from Dockerhub (Recommended)

You can use it by clicking the "Use this template" button. Submit a GitHub issue if you encounter any problems, and continue reading for advanced use cases.
Please request an interactive session before trying the steps below:
    Push your docker image to docker hub. For example:
docker push nkeetha/daa:22.12
    Set the Singularity cache and tmp directories to avoid filling up /jet/home/<user name> when pulling large Docker images. For example:
export APPTAINER_CACHEDIR="/ocean/projects/cis220039p/nkeetha/data/singularity"
export APPTAINER_TMPDIR="/ocean/projects/cis220039p/nkeetha/data/singularity/tmp"
    Change to the path where you want to save the Singularity image
cd /ocean/projects/cis220039p/nkeetha/data/singularity
    Pull the Docker image as a Singularity image (.sif or .simg format)
singularity pull <singularity image name>.sif docker://<dockerhub path>
Note: If you are using the AirLab's private Docker registry, use the following command and log in with your Andrew ID and password.
singularity build --docker-login helloworld.sif docker://airlab-storage.andrew.cmu.edu:5001/basti/hello-world:latest
Note: The ROBO GPU partition (H100 nodes) needs CUDA 11.8 or above. A good starting Docker image is here. It has Ubuntu 20.04, CUDA 12.3, PyTorch 2.1, ROS Noetic, Open3D 0.18, and so on. You can pull it using:
singularity build noeticcuda12.sif docker://amigoshan/noeticcuda12:latest
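
The steps above can be sketched as a small Python helper that prints the export lines and the pull command for a given image. It only assembles strings and does not run Singularity. The function names and the image name "daa" are hypothetical; the paths are the ones from the example.

```python
import posixpath
import shlex


def apptainer_env(cache_root):
    """Return the cache/tmp variables from the steps above, rooted at cache_root."""
    return {
        "APPTAINER_CACHEDIR": cache_root,
        "APPTAINER_TMPDIR": posixpath.join(cache_root, "tmp"),
    }


def pull_command(image_name, dockerhub_path):
    """Build the `singularity pull` command line for a Docker Hub image."""
    return shlex.join(
        ["singularity", "pull", f"{image_name}.sif", f"docker://{dockerhub_path}"]
    )


if __name__ == "__main__":
    cache_root = "/ocean/projects/cis220039p/nkeetha/data/singularity"
    for key, value in apptainer_env(cache_root).items():
        print(f"export {key}={shlex.quote(value)}")
    print(f"cd {shlex.quote(cache_root)}")
    print(pull_command("daa", "nkeetha/daa:22.12"))
```

Printing the commands instead of executing them lets you review them before pasting them into the interactive session.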

4. Build Custom Singularity Images on PSC

Docker images that contain CUDA for GPU support and were built on a local machine can have compatibility issues on PSC GPU instances once they are pulled as Singularity images. To avoid this problem, it is recommended to bootstrap your Singularity images from the Nvidia NGC images, or to use the pre-built ones available under /ocean/containers/ngc.

4.1 Pre-built NGC Images

The /ocean/containers/ngc directory contains several popular pre-built NGC images:
[nkeetha@bridges2-login012 ngc]$ pwd
/ocean/containers/ngc
[nkeetha@bridges2-login012 ngc]$ ls
benchmarks caffe2 digits inferenceserver mxnet rapidsai tensorrt theano
caffe cntk fresh_containers marked-for-deletion pytorch tensorflow tensorrtserver torch

4.2 Remote Build Custom Singularity Images

    Create a definition file (*.def) with your desired packages and their installation commands (see the Singularity Definition File Docs). Here's an example:
Bootstrap: docker
From: nvcr.io/nvidia/jax:23.08-py3

%post
    apt-get update
    apt-get install -y ffmpeg
    Log in for Singularity remote builds: create a Sylabs account and generate an access token, then run:
[nkeetha@bridges2-login014 singularity]$ singularity remote login
    Start an interactive job, because compute on the login node is limited
interact -t 4:00:00 -n 4
    Set the Singularity cache and tmp directories to avoid filling up /jet/home/<user name> when pulling large Docker images. For example:
export SINGULARITY_CACHEDIR="/ocean/projects/cis220039p/nkeetha/data/singularity"
export SINGULARITY_TMPDIR="/ocean/projects/cis220039p/nkeetha/data/singularity/tmp"
    Change to the path where you want to save the Singularity image
cd /ocean/projects/cis220039p/nkeetha/data/singularity
    Build the custom Singularity image
singularity build --remote <singularity image name>.sif <path to definition file>.def
Example command:
cd /ocean/projects/cis220039p/nkeetha/data/singularity
singularity build --remote jax_ngc_23_08_tapnet.sif def/jax_ngc_23_08_tapnet.def
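
If you maintain several definition files that differ only in the base image and the apt packages, generating them can reduce copy-paste errors. Below is a minimal sketch that renders a file in the same shape as the example above. The function name is hypothetical, and it only covers the Bootstrap/From header and an apt-based %post section.

```python
def render_definition(base_image, apt_packages):
    """Render a minimal Singularity definition file as a string.

    base_image:   Docker image reference, e.g. 'nvcr.io/nvidia/jax:23.08-py3'
    apt_packages: list of apt package names to install in %post
    """
    lines = [
        "Bootstrap: docker",
        f"From: {base_image}",
        "",
        "%post",
        "    apt-get update",
    ]
    if apt_packages:
        lines.append("    apt-get install -y " + " ".join(apt_packages))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Prints the definition; redirect to a .def file to use it with
    # `singularity build --remote`.
    print(render_definition("nvcr.io/nvidia/jax:23.08-py3", ["ffmpeg"]), end="")
```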

5. Setting up Venv

The stable way to use Singularity images is with a virtual environment. Since the Singularity images don't come with venv by default, you will have to install the virtualenv Python package in your /jet/home directory:
[nkeetha@bridges2-login012 nkeetha]$ interact -p GPU-shared -t 00:30:00 -n 5 --gres=gpu:v100-32:1
[nkeetha@bridges2-login012 nkeetha]$ cd /ocean/containers/ngc/pytorch
[nkeetha@bridges2-login012 nkeetha]$ singularity instance start --nv pytorch_22.12-py3.sif venv
[nkeetha@bridges2-login012 nkeetha]$ singularity run --nv instance://venv
Singularity> pip install virtualenv --user
Singularity> cd <directory for venv, e.g., /ocean/projects/cis220039p/nkeetha/data/singularity/venv>
Singularity> virtualenv <venv name> --system-site-packages
Singularity> source <venv name>/bin/activate
Singularity> pip install ...
Once the venv is set up, it can always be sourced within a Singularity container to run your code. You don't have to set it up every time.

Example scripts that use the above setup:
  • Sbatch Script: /ocean/projects/cis220039p/shared/examples/train.sbatch
  • Job Script: /ocean/projects/cis220039p/shared/examples/train.job
  • Bash Script: /ocean/projects/cis220039p/shared/examples/train.sh

6. Using Resources on PSC

PSC offers 3 kinds of allocations by default, specified here. You can choose among them based on your needs with the "-p" and "--gres" command-line arguments.

6.1 ROBO GPU Partition

In some groups, users are automatically granted ROBO GPU access. If you cannot access the ROBO GPU partition, please talk to your PI or email help@psc.edu.
You can request the H100 resources using "-p ROBO --gres=gpu:h100:1".

6.2 Short-term Interactive Sessions

Short-term interactive sessions have a maximum time limit of 8 hours, but are fast to acquire.

Start a GPU interactive session:
interact -p GPU-shared -t 4:00:00 -n 2 --gres=gpu:v100-32:1 # requests 1 V100-32GB GPU for 4 hours, with 2 CPUs

# Below is the command for the ROBO Cluster
# Please don't use it for debugging or for unpolished code under development (this is expensive!)
# Please replace the XXXXX.X.XXXXXXX with the Oracle String from Basti or Admins.
srun --partition=ROBO --mem=64G -t 1:00:00 --mincpus=8 --gres=gpu:h100:1 --job-name=YOUR_JOB_NAME --pty /bin/bash # requests 1 H100 GPU for 1 hour, with 8 CPUs
-p: Partition / resource type (suggested: GPU-shared)
-t: Time limit for the job (hh:mm:ss)
-n: Number of CPUs (strongly suggested: 2 * number of GPUs)
--gres: gpu:<GPU type (v100-32 or v100-16)>:<number of GPUs>
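
To show how the options fit together, here is a minimal sketch that assembles an interact command and applies the suggested ratio of 2 CPUs per GPU. The function name is hypothetical, and it only builds the string; it does not submit anything.

```python
import shlex


def interact_command(gpu_type, num_gpus, time_limit, partition="GPU-shared"):
    """Build an `interact` command using the suggested 2 CPUs per GPU.

    gpu_type:   e.g. 'v100-32' or 'v100-16'
    time_limit: hh:mm:ss string (interactive sessions are limited to 8 hours)
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    num_cpus = 2 * num_gpus  # rule of thumb from the option list above
    return shlex.join([
        "interact",
        "-p", partition,
        "-t", time_limit,
        "-n", str(num_cpus),
        f"--gres=gpu:{gpu_type}:{num_gpus}",
    ])


if __name__ == "__main__":
    print(interact_command("v100-32", 1, "4:00:00"))
```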

Start a Singularity instance from an image:
singularity instance start --nv /path/to/your/singularity/image/[singularity image name].sif [instance name]
The --nv flag enables GPU support, similar to nvidia-docker.

Attach to running Singularity instance:
singularity run --nv instance://[instance name]
Now you can use the shell to run any command/code within the singularity instance.

Exit the Singularity instance (exit the shell, then stop the instance):
Singularity> exit
[nkeetha@v016 ~]$ singularity instance stop [instance name]

Common workflow: Detach from a session without killing processes
  • local terminal → ssh to a PSC login node → start tmux session → GPU allocation → start Singularity instance → attach to the instance → run scripts/experiments → detach tmux (Ctrl+b, then d).
  • The GPU allocation will end based on the time limit you requested.
  • Note the login node you started your tmux session from; you will need to log back in to that same node to re-attach to your script's process. Once there, use tmux attach -t <number> to re-attach.

6.3 Long-term Batched Jobs

Long-term batch jobs have a maximum time limit of 48 hours (longer jobs are possible), but they are more complex to set up and wait longer in the queue.

Write a job script <job name>.job:
#!/bin/bash
# echo commands to stdout
set -x
# create singularity container
source /etc/profile.d/modules.sh
SIF=/path/to/your/singularity/image/[singularity image name].sif
S_EXEC="singularity exec -B /ocean:/ocean --nv ${SIF}"
# implement the job in the container
YOUR_SCRIPT=/path/to/your/script
${S_EXEC} 'bash' ${YOUR_SCRIPT}
#END

Start the job:
# Note: sbatch options must come before the job script; anything after it is passed to the script as arguments.
# This asks for 4 V100 (16GB) GPUs with 10 CPUs to run a job for 48 hours
sbatch -p GPU-shared -n 10 --gpus=v100-16:4 -t 48:00:00 -o </path/to/output_files/output_name>.out <job name>.job

# Below is the command for the ROBO Cluster
# Please don't use it for debugging or for unpolished code under development (this is expensive!)
# Please replace the XXXXX.X.XXXXXXX with the Oracle String from Basti or Admins.
# This asks for 1 H100 GPU with 10 CPUs to run a job for 48 hours
sbatch -p ROBO --comment="XXXXX.X.XXXXXXX" -n 10 --gres=gpu:h100:1 -t 48:00:00 -o </path/to/output_files/output_name>.out <job name>.job
-p: Partition / resource type (suggested: GPU-shared)
-t: Time limit (hh:mm:ss)
-n: Number of CPUs (strongly suggested: 5 * number of GPUs)
--gpus: <GPU type (v100-32 or v100-16)>:<number of GPUs>
-o: Path of the output file (records the terminal output)
You can also specify the arguments in the .job file with "#SBATCH" directives (instead of on the command line). For example, the following .job file:
#!/bin/bash

#SBATCH -N 1 # Number of nodes
#SBATCH -n 1
#SBATCH -p ROBO
#SBATCH --gpus=h100:1 #GPU specification. H100
#SBATCH -t 2-00:00 # Estimated time, 48hour max. DD-HH:MM.
#SBATCH --job-name test
#SBATCH -o job_%j.out
#SBATCH -e job_%j.err
#SBATCH --mail-type=END
#SBATCH --mail-user=xxx@andrew.cmu.edu

# echo commands to stdout
set -x

EXE=/bin/bash

WORKING_DIR=$PROJECT/SLURM/test_robo

cd $WORKING_DIR

singularity exec \
--nv xxx.sif \
$EXE \
$WORKING_DIR/xxx.sh

Note: If you use a conda virtual environment in the Singularity container, add the following at the top of YOUR_SCRIPT:
source /opt/conda/etc/profile.d/conda.sh
conda activate your_env
When you build the Docker image, it is strongly recommended to install conda in "/opt/conda".
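
When you submit many similar jobs, generating the "#SBATCH" header avoids typos in the time format. The sketch below formats whole hours in Slurm's D-HH:MM form, enforces the 48-hour limit mentioned above, and renders a header like the one in the example. The function names are hypothetical, and only a subset of the directives is covered.

```python
def slurm_time(hours):
    """Format whole hours as Slurm's D-HH:MM, enforcing the 48-hour limit."""
    if not 0 < hours <= 48:
        raise ValueError("hours must be between 1 and 48")
    days, remainder = divmod(hours, 24)
    return f"{days}-{remainder:02d}:00"


def sbatch_header(partition, gpus, hours, job_name):
    """Render the leading #SBATCH lines of a .job file.

    gpus: value for --gpus, e.g. 'h100:1' or 'v100-16:4'
    """
    lines = [
        "#!/bin/bash",
        "#SBATCH -N 1",
        f"#SBATCH -p {partition}",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH -t {slurm_time(hours)}",
        f"#SBATCH --job-name {job_name}",
        "#SBATCH -o job_%j.out",
        "#SBATCH -e job_%j.err",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(sbatch_header("ROBO", "h100:1", 48, "test"), end="")
```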

6.4 Watch your jobs


In jobs/active jobs you'll see both your interactive sessions and batch jobs (which may still be queuing).

You can also log in to the GPU compute nodes from the PSC login nodes using:
[bli5@bridges2-login011 ~]$ ssh [GPU hostname].ib.bridges2.psc.edu
Use Wandb
Tensorboard is for boomers. Use Wandb:  https://wandb.ai 

7. Job Arrays

Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits). All jobs must have the same initial sbatch options (e.g. size, time limit, etc.)

This is a powerful tool for running a large number of processes with different config files or parameters. The example below can be extended further with for loops in the bash scripts and other recursive elements.

An example way of setting up and running job arrays:

[nkeetha@bridges2-login013 sbatch]$ sbatch extract_frames.sbatch

extract_frames.sbatch
#!/bin/bash

#SBATCH -p GPU-shared
#SBATCH -t 1:00:00
#SBATCH -n 5
#SBATCH -J extract_frames
#SBATCH --gpus=v100-16:1
#SBATCH --output=/ocean/projects/cis220039p/nkeetha/sbatch/outputs/extract_frames/%A_%a.out
#SBATCH --array=1-12 # job array index
'bash' /ocean/projects/cis220039p/nkeetha/jobs/extract_frames.job ${SLURM_ARRAY_TASK_ID}

extract_frames.job
#!/bin/bash

A_ID=$1

#echo commands to stdout
set -x

# create singularity container
source /etc/profile.d/modules.sh
SIF=/ocean/projects/cis220039p/nkeetha/data/singularity/nvidia_frames_base.sif
S_EXEC="singularity exec -B /ocean:/ocean --nv ${SIF}"

# implement the job in the container
SCRIPT=/ocean/projects/cis220039p/nkeetha/scripts/extract_frames.sh
${S_EXEC} 'bash' ${SCRIPT} ${A_ID}

# END

extract_frames.sh
#!/bin/bash

A_ID=$1

cd /ocean/projects/cis220039p/nkeetha/daa/nvidia_to_frames

python3 run.py configs/nea_heli/config_${A_ID}.yaml
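
In the chain above, the array index travels from sbatch through the job script into the config file name. As a sketch of the same mapping on the Python side, the helper below reads SLURM_ARRAY_TASK_ID (the variable Slurm sets for each array task) and resolves the matching config path. The function names are hypothetical; the directory and naming scheme are the ones from extract_frames.sh.

```python
import os
from pathlib import PurePosixPath


def task_id_from_env(environ=None):
    """Read the array index that Slurm exports for each array task."""
    if environ is None:
        environ = os.environ
    return int(environ["SLURM_ARRAY_TASK_ID"])


def config_for_task(task_id, config_dir="configs/nea_heli"):
    """Return the config path for one array task, e.g. config_3.yaml for task 3."""
    if task_id < 1:
        raise ValueError("array indices in this example start at 1")
    return str(PurePosixPath(config_dir) / f"config_{task_id}.yaml")


if __name__ == "__main__":
    # List the configs that --array=1-12 would cover.
    for task_id in range(1, 13):
        print(task_id, config_for_task(task_id))
```

Listing the paths before submitting makes it easy to check that every index in the --array range has a config file.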

8. Show job history

Use the following command to show your job history:
sacct -S 2023-01-01 \
--format="jobid,jobname%-12,partition,Start,ElapsedRaw,AllocTRES%-70"
More details on the sacct command can be found in the [SLURM documentation]( https://slurm.schedmd.com/sacct.html ).
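
To total your usage, sacct's machine-readable output is easier to process than the padded table. The sketch below assumes you run sacct with -X (one line per job, no job steps), -n (no header), and -P (pipe-delimited output), and sums the ElapsedRaw column, which is in seconds. The function name and the reduced field list are choices made for this example.

```python
# Command whose output the parser below expects (run it on PSC yourself):
SACCT_COMMAND = [
    "sacct", "-X", "-n", "-P",
    "-S", "2023-01-01",
    "--format=jobid,elapsedraw",
]


def total_elapsed_seconds(sacct_output):
    """Sum the ElapsedRaw column of pipe-delimited `jobid|elapsedraw` lines."""
    total = 0
    for line in sacct_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        _job_id, elapsed_raw = line.split("|")
        total += int(elapsed_raw)
    return total


if __name__ == "__main__":
    import subprocess

    output = subprocess.run(
        SACCT_COMMAND, capture_output=True, text=True, check=True
    ).stdout
    print(f"{total_elapsed_seconds(output) / 3600:.1f} hours of wall-clock time")
```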

9. Example Discover Proposal

10. Rerun for 3D/4D Visualizations on Remote/Headless Cluster

Downloading files and visualizing on the local machine is for boomers. Same case for Open3D.
Instead, use Rerun! Super easy multi-modal 4D visualizations that are snappy to interact with and also easy to share with others!
Try the interactive demos here:  https://rerun.io/viewer 

Steps to setup Rerun for any cluster or remote computing device:

a) Setup port forwarding

To enable port forwarding for rerun, you can either ssh using the command line or  configure VSCode to always forward a port .
Command line example:
ssh -L <rr-port>:localhost:<rr-port> \
-L <ws-server-port>:localhost:<ws-server-port> \
-L <rr-viewer-port>:localhost:<rr-viewer-port> \
<user-name>@bridges2.psc.edu
VSCode config example:
Host bridges2.psc.edu
HostName bridges2.psc.edu
User <user-name> # For example, nkeetha
LocalForward localhost:<rr-port> localhost:<rr-port>
LocalForward localhost:<ws-server-port> localhost:<ws-server-port>
LocalForward localhost:<rr-viewer-port> localhost:<rr-viewer-port>
Things contained in <> are placeholders; please replace them with your own values.
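
Since the same three ports appear in every forwarding command, a small helper can keep them consistent. This is a minimal sketch that builds the ssh command line from a list of ports. The function name is hypothetical, and the port numbers in the usage example are placeholders, not required values.

```python
import shlex


def forward_command(user, host, ports):
    """Build `ssh -L <port>:localhost:<port> ... <user>@<host>` for each port."""
    args = ["ssh"]
    for port in ports:
        if not 0 < port < 65536:
            raise ValueError(f"invalid port: {port}")
        args += ["-L", f"{port}:localhost:{port}"]
    args.append(f"{user}@{host}")
    return shlex.join(args)


if __name__ == "__main__":
    # Example values only: substitute your own <rr-port>, <ws-server-port>,
    # and <rr-viewer-port>.
    print(forward_command("nkeetha", "bridges2.psc.edu", [9876, 9877, 9090]))
```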

b) Install Rerun

Install Rerun on your local machine using the following instructions:
  • via pip: pip install rerun-sdk
  • via Conda: conda install -c conda-forge rerun-sdk
Likewise, install Rerun in your virtual environment or docker image or singularity image used on the remote cluster.

c) Port forward compute node & serve Rerun websocket server on remote device

Firstly, request compute for an interactive debugging session:
salloc -p RM-small -t 06:00:00 -n 5 # Asks for a CPU job
sleep 365d # Keeps the interactive job from timing out
Then, in a new terminal on the head/login node, SSH into the compute node that has been allocated to you.
You can find the compute node's name in the interactive session you requested above (for example, r002.ib.bridges2.psc.edu or v001.ib.bridges2.psc.edu).
ssh -L <rr-port>:localhost:<rr-port> \
-L <ws-server-port>:localhost:<ws-server-port> \
-L <rr-viewer-port>:localhost:<rr-viewer-port> \
<node-name>.ib.bridges2.psc.edu
Now, in the SSH session, launch your code environment and run tmux to create a terminal for serving the rerun server.
For example:
cd path/to/singularity
singularity instance start --nv example.sif example
singularity run --nv instance://example
cd path/to/venvs
source venv_name/bin/activate
tmux new-session -s debug
rerun --serve --port <rr-port> --ws-server-port <ws-server-port> --web-viewer-port <rr-viewer-port>
Ctrl+b, then c # To create another terminal in tmux for running other code

d) Access Rerun viewer on local machine

Now, the Rerun Viewer should be accessible on your local machine by running the following command:
rerun ws://localhost:<ws-server-port>
Another option for viewing the viewer is to use your local browser:

You should see that the connection has been established with the remote server:
Connection to ws://localhost:<ws-server-port> established

e) Log interactive visualizations and much more!

Now you are all set to use Rerun! Start by trying out the demo code at the end of this section.
To learn more about Rerun try out the simple tutorials here:  https://rerun.io/docs/getting-started/data-in/python 
You should now have a nice tool for rich visualizations!
python3 demo.py --addr "0.0.0.0:<rr-port>"
demo.py
from __future__ import annotations

import argparse
from math import tau

import numpy as np
import rerun as rr  # pip install rerun-sdk
from rerun.utilities import bounce_lerp, build_color_spiral

DESCRIPTION = """
# DNA
This is a minimal example that logs synthetic 3D data in the shape of a double helix. The underlying data is generated
using numpy and visualized using Rerun.

The full source code for this example is available
[on GitHub](https://github.com/rerun-io/rerun/blob/latest/examples/python/dna).
""".strip()


def str2bool(v: str) -> bool:
    """Parse a command-line boolean such as "true", "0", or "yes"."""
    value = str(v).strip().lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return True
    if value in ("n", "no", "f", "false", "off", "0"):
        return False
    raise argparse.ArgumentTypeError(f"invalid boolean value: {v!r}")


def script_add_rerun_args(parser: argparse.ArgumentParser) -> None:
    """
    Add common Rerun script arguments to `parser`.

    Parameters
    ----------
    parser : argparse.ArgumentParser
        The parser to add arguments to.

    """
    parser.add_argument("--headless", type=str2bool, nargs="?", const=True, default=True, help="Don't show GUI")
    parser.add_argument(
        "--connect",
        dest="connect",
        type=str2bool, nargs="?", const=True, default=True,
        help="Connect to an external viewer",
    )
    parser.add_argument(
        "--serve",
        dest="serve",
        type=str2bool, nargs="?", const=True, default=False,
        help="Serve a web viewer (WARNING: experimental feature)",
    )
    # Replace <rr-port> with the port you forwarded, or pass --addr on the command line.
    parser.add_argument("--addr", type=str, default="0.0.0.0:<rr-port>", help="Connect to this ip:port")
    parser.add_argument("--save", type=str, default=None, help="Save data to a .rrd file at this path")
    parser.add_argument(
        "-o",
        "--stdout",
        dest="stdout",
        action="store_true",
        help="Log data to standard output, to be piped into a Rerun Viewer",
    )


def log_data() -> None:
    rr.log("description", rr.TextDocument(DESCRIPTION, media_type=rr.MediaType.MARKDOWN), static=True)

    rr.set_time_seconds("stable_time", 0)

    NUM_POINTS = 100

    # points and colors are both np.array((NUM_POINTS, 3))
    points1, colors1 = build_color_spiral(NUM_POINTS)
    points2, colors2 = build_color_spiral(NUM_POINTS, angular_offset=tau * 0.5)
    rr.log("helix/structure/left", rr.Points3D(points1, colors=colors1, radii=0.08))
    rr.log("helix/structure/right", rr.Points3D(points2, colors=colors2, radii=0.08))

    rr.log("helix/structure/scaffolding", rr.LineStrips3D(np.stack((points1, points2), axis=1), colors=[128, 128, 128]))

    # Animate beads bouncing between the two strands while the helix rotates.
    time_offsets = np.random.rand(NUM_POINTS)
    for i in range(400):
        time = i * 0.01
        rr.set_time_seconds("stable_time", time)

        times = np.repeat(time, NUM_POINTS) + time_offsets
        beads = [bounce_lerp(points1[n], points2[n], times[n]) for n in range(NUM_POINTS)]
        colors = [[int(bounce_lerp(80, 230, times[n] * 2))] for n in range(NUM_POINTS)]
        rr.log(
            "helix/structure/scaffolding/beads", rr.Points3D(beads, radii=0.06, colors=np.repeat(colors, 3, axis=-1))
        )

        rr.log(
            "helix/structure",
            rr.Transform3D(rotation=rr.RotationAxisAngle(axis=[0, 0, 1], radians=time / 4.0 * tau)),
        )


def main() -> None:
    parser = argparse.ArgumentParser(description="Logs rich data using the Rerun SDK.")
    script_add_rerun_args(parser)  # Options: --headless, --connect, --serve, --addr, --save, --stdout
    args = parser.parse_args()

    rr.script_setup(args, "rerun_example_dna_abacus")
    log_data()
    rr.script_teardown(args)


if __name__ == "__main__":
    main()