Tutorial: Fast Setup on PSC

This tutorial covers the most important steps for getting set up quickly on PSC.
For more detailed information, please see the PSC Bridges2 User Guide.
﻿
Table of contents
1. Preparation
2. Log in to PSC
How to switch between different login nodes
3. Build Singularity Images from Dockerhub (Recommended)
4. Build Custom Singularity Images on PSC
4.1 Pre-built NGC Images
4.2 Remote Build Custom Singularity Images
5. Setting up Venv
6. Using Resources on PSC
6.1 ROBO GPU Partition
6.2 Short-term Interactive Sessions
6.3 Long-term Batched Jobs
6.4 Watch your jobs
7. Job Arrays
8. Show job history
9. Example Discover Proposal
10. Rerun for 3D/4D Visualizations on Remote/Headless Cluster
Steps to setup Rerun for any cluster or remote computing device:
a) Setup port forwarding
b) Install Rerun
c) Port forward compute node & serve Rerun websocket server on remote device
d) Access Rerun viewer on local machine
e) Log interactive visualizations and much more!
1. Preparation
Get an ACCESS account
Sign up for an ACCESS account
Ask the Allocation Managers to add your ACCESS account
For AirLab: Basti, Wenshan, Yaoyu, Bowen, and Nikhil
For other labs: Please contact your respective PI or lab PoC for this step
Enable DUO two-factor authentication
Receive an e-mail from PSC and register the PSC account
Other helpful resources:
Additional slides on Intro to PSC
Please do not fill out the Google Form in this slide deck
PSC User Guide
2. Log in to PSC
ssh [PSC username]@bridges2.psc.edu
This method requires your PSC password.
How to switch between different login nodes
The login command does not let you choose which login node you land on, but once you are logged in you can switch to a different one:
[bli5@bridges2-login011 ~]$ hostname
br011.ib.bridges2.psc.edu
[bli5@bridges2-login011 ~]$ ssh br012.ib.bridges2.psc.edu
:password for psc
[bli5@bridges2-login012 ~]$
Note: Tmux sessions and Singularity containers running on one login node (e.g. login012) are not visible from other login nodes (e.g. login011), so you may need to switch login nodes after logging in to reach them.
3. Build Singularity Images from Dockerhub (Recommended)
Take a look at this repository template: https://github.com/castacks/Cloud-Computing-Repository-Template
﻿
You can use it by clicking the "Use this template" button. Submit a GitHub issue if you run into any problems, and continue reading for advanced use cases.
Please request an interactive session before trying the steps below:
Push your Docker image to Docker Hub. For example:
docker push nkeetha/daa:22.12
Set the Singularity cache and tmp directories to avoid filling up /jet/home/<user name> when pulling large Docker images. For example:
export APPTAINER_CACHEDIR="/ocean/projects/cis220039p/nkeetha/data/singularity"
export APPTAINER_TMPDIR="/ocean/projects/cis220039p/nkeetha/data/singularity/tmp"
Change to the directory where you want to save the Singularity image:
cd /ocean/projects/cis220039p/nkeetha/data/singularity
Pull the Docker image as a Singularity image (.sif or .simg format):
singularity pull <singularity image name>.sif docker://<dockerhub path>
Note: If you are using the AirLab's private Docker registry, please use the following command and log in with your Andrew ID and password.
singularity build --docker-login helloworld.sif docker://airlab-storage.andrew.cmu.edu:5001/basti/hello-world:latest
Note: The ROBO GPU partition (H100 nodes) needs CUDA 11.8 or above. A good starting Docker image is amigoshan/noeticcuda12. It has Ubuntu 20.04, CUDA 12.3, PyTorch 2.1, ROS Noetic, Open3D 0.18, and so on. You can pull it using:
singularity build noeticcuda12.sif docker://amigoshan/noeticcuda12:latest
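The pull steps above can also be scripted. The following is a minimal Python sketch, not part of the original setup: the function names `cache_env` and `pull_command` and the image name `daa_22.12` are hypothetical, and the paths are the example values used above. It only assembles the environment variables and the `singularity pull` command; actually running it (inside an interactive session) is left as a commented `subprocess.run` call.

```python
import os
import shlex


def cache_env(cache_dir: str) -> dict:
    """Return a copy of the environment with the Apptainer cache and tmp dirs redirected."""
    env = dict(os.environ)
    env["APPTAINER_CACHEDIR"] = cache_dir
    env["APPTAINER_TMPDIR"] = f"{cache_dir}/tmp"
    return env


def pull_command(image_name: str, dockerhub_path: str) -> list:
    """Build the argument list for `singularity pull <name>.sif docker://<path>`."""
    return ["singularity", "pull", f"{image_name}.sif", f"docker://{dockerhub_path}"]


# Example values taken from this tutorial; replace them with your own.
env = cache_env("/ocean/projects/cis220039p/nkeetha/data/singularity")
cmd = pull_command("daa_22.12", "nkeetha/daa:22.12")  # image name is hypothetical
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, env=env, check=True)
```

Building the command as a list (rather than one string) avoids shell-quoting problems if a path ever contains spaces.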
4. Build Custom Singularity Images on PSC
Docker images built on a local machine (with CUDA for GPU support) and pulled as Singularity images can have compatibility issues on PSC GPU instances. To avoid this, it is recommended to bootstrap your Singularity images from the Nvidia NGC images, or to use the pre-built ones available under /ocean/containers/ngc.
4.1 Pre-built NGC Images
The /ocean/containers/ngc directory contains several popular pre-built NGC images:
[nkeetha@bridges2-login012 ngc]$ pwd
/ocean/containers/ngc
[nkeetha@bridges2-login012 ngc]$ ls
benchmarks  caffe2  digits            inferenceserver      mxnet    rapidsai    tensorrt        theano
caffe       cntk    fresh_containers  marked-for-deletion  pytorch  tensorflow  tensorrtserver  torch
4.2 Remote Build Custom Singularity Images
Create a definition file (*.def) with your desired packages and their installation commands (see the Singularity Definition File Docs). Here's an example:
BootStrap: docker
From: nvcr.io/nvidia/jax:23.08-py3

%post
    apt-get update
    apt-get install -y ffmpeg
Log in for Singularity remote builds. Create a Sylabs account and generate an access token, then run:
[nkeetha@bridges2-login014 singularity]$ singularity remote login
Start an interactive job, because compute on the login node is limited:
interact -t 4:00:00 -n 4
Set the Singularity cache and tmp directories to avoid filling up /jet/home/<user name> when pulling large Docker images. For example:
export SINGULARITY_CACHEDIR="/ocean/projects/cis220039p/nkeetha/data/singularity"
export SINGULARITY_TMPDIR="/ocean/projects/cis220039p/nkeetha/data/singularity/tmp"
Change to the directory where you want to save the Singularity image:
cd /ocean/projects/cis220039p/nkeetha/data/singularity
Build the custom Singularity image:
singularity build --remote <singularity image name>.sif <path to definition file>.def
Example command:
cd /ocean/projects/cis220039p/nkeetha/data/singularity
singularity build --remote jax_ngc_23_08_tapnet.sif def/jax_ngc_23_08_tapnet.def
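If you maintain several definition files that differ only in the base image and the apt packages, they can be generated rather than written by hand. The sketch below is an illustration under that assumption; `definition_file` is a hypothetical helper, and it only reproduces the `BootStrap`/`From`/`%post` layout of the example definition file above.

```python
def definition_file(base_image: str, apt_packages: list) -> str:
    """Render a minimal Singularity definition file that bootstraps from a Docker image."""
    lines = [
        "BootStrap: docker",
        f"From: {base_image}",
        "",
        "%post",
        "    apt-get update",
        f"    apt-get install -y {' '.join(apt_packages)}",
    ]
    return "\n".join(lines) + "\n"


# Same base image and package as the example definition file above.
text = definition_file("nvcr.io/nvidia/jax:23.08-py3", ["ffmpeg"])
print(text)
# Write `text` to a .def file and pass it to `singularity build --remote`.
```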
5. Setting up Venv
The stable way to use Singularity images is with a virtual environment. Since the Singularity images don't come with venv by default, you will have to install the virtualenv Python package in your /jet/home directory:
[nkeetha@bridges2-login012 nkeetha]$ interact -p GPU-shared -t 00:30:00 -n 5 --gres=gpu:v100-32:1
[nkeetha@bridges2-login012 nkeetha]$ cd /ocean/containers/ngc/pytorch
[nkeetha@bridges2-login012 nkeetha]$ singularity instance start --nv pytorch_22.12-py3.sif venv
[nkeetha@bridges2-login012 nkeetha]$ singularity run --nv instance://venv
Singularity> pip install virtualenv --user
Singularity> cd <directory for venv, e.g /ocean/projects/cis220039p/nkeetha/data/singularity/venv>
Singularity> virtualenv <venv name> --system-site-packages
Singularity> source <venv name>/bin/activate
Singularity> pip install ...
Once the venv is set up, it can always be sourced within a Singularity container for running your code. You don't have to set it up every time.
﻿
For example scripts that use the above setup:
Sbatch Script: /ocean/projects/cis220039p/shared/examples/train.sbatch
Job Script: /ocean/projects/cis220039p/shared/examples/train.job
Bash Script: /ocean/projects/cis220039p/shared/examples/train.sh 
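A quick way to confirm that a job is really using the venv (and not the container's system Python) is to check the interpreter prefix from inside your code. This is a minimal sketch; `in_virtualenv` is a hypothetical helper name, and it relies only on the standard `sys.prefix` / `sys.base_prefix` attributes.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv/virtualenv environment."""
    # Modern venv and virtualenv set sys.base_prefix to the parent interpreter's prefix;
    # very old virtualenv releases set sys.real_prefix instead.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


print("inside virtual environment:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

Calling this at the top of a training script makes a forgotten `source <venv name>/bin/activate` show up in the job log right away.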
6. Using Resources on PSC
By default, PSC offers 3 kinds of allocations, specified here. You can choose among them based on your needs with the "-p" and "--gres" command-line arguments.
6.1 ROBO GPU Partition
In some groups, users are automatically granted ROBO GPU access. If you cannot access the ROBO GPU partition, please talk to your PI or email help@psc.edu.
You can request the H100 resources using "-p ROBO --gres=gpu:h100:1"

6.2 Short-term Interactive Sessions
Short-term interactive sessions have a maximum time limit of 8 hours, but are fast to acquire.
﻿
Start a GPU interactive session:
interact -p GPU-shared -t 4:00:00 -n 2 --gres=gpu:v100-32:1 # requests 1 V100-32GB GPU for 4 hours, with 2 CPUs

# Below is the command for the ROBO Cluster
# Please don't use it for debugging or unpolished code under development (this is expensive!)
# Please replace the XXXXX.X.XXXXXXX with the Oracle String from Basti or Admins.
srun --partition=ROBO --comment="XXXXX.X.XXXXXXX" --mem=64G -t 1:00:00 --mincpus=8 --gres=gpu:h100:1 --job-name=YOUR_JOB_NAME --pty /bin/bash # requests 1 H100 GPU for 1 hour, with 8 CPUs
-p: Partition / resource type (suggested: GPU-shared)
-t: Time limit for the job (hh:mm:ss)
-n: Number of CPUs (strongly suggested: 2 * number of GPUs)
--gres: gpu:<GPU type>:<number of GPUs>, where the GPU type is v100-32 or v100-16
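The options above combine mechanically, so the command line can be assembled from a GPU type and count. The sketch below is illustrative only: `interact_command` is a hypothetical helper, and its default of 2 CPUs per GPU simply follows the suggestion in the list above.

```python
def interact_command(gpu_type: str, num_gpus: int, time_limit: str,
                     partition: str = "GPU-shared", cpus_per_gpu: int = 2) -> str:
    """Assemble an `interact` command line from the options listed above."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    num_cpus = cpus_per_gpu * num_gpus  # guideline above: 2 CPUs per GPU
    return (f"interact -p {partition} -t {time_limit} -n {num_cpus} "
            f"--gres=gpu:{gpu_type}:{num_gpus}")


# Reproduces the interactive-session example: 1 V100-32GB GPU for 4 hours.
cmd = interact_command("v100-32", 1, "4:00:00")
print(cmd)
```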
﻿
Start a Singularity instance from an image:
singularity instance start --nv /path/to/your/singularity/image/[singularity image name].sif [instance name]
The --nv flag enables GPU support, similar to nvidia-docker.
﻿
Attach to running Singularity instance:
singularity run --nv instance://[instance name]
Now you can use the shell to run any command/code within the singularity instance.
﻿
Exit the Singularity instance (exit the shell, then stop the instance):
Singularity> exit
[nkeetha@v016 ~]$ singularity instance stop [instance name]
﻿
Common workflow: Detach from a session without killing processes
local terminal → ssh to a PSC login node → start tmux session → GPU allocation → start Singularity instance → attach to Singularity instance → run scripts/experiments → detach tmux (Ctrl+b, then d).
The GPU allocation will end based on the time limit you requested.
Note the login node you started your tmux session from; you will need to log back in to that same node to re-attach to your script's process. Once there, use tmux attach -t <number> to re-attach.
﻿
6.3 Long-term Batched Jobs
Long-term batched jobs have a maximum time limit of 48 hours (longer jobs are possible), but they are more complex to set up and usually wait longer in the queue.
﻿
Write a job script <job name>.job:
#!/bin/bash
# echo commands to stdout
set -x
# create singularity container
source /etc/profile.d/modules.sh
SIF=/path/to/your/singularity/image/[singularity image name].sif
S_EXEC="singularity exec -B /ocean:/ocean --nv ${SIF}"
# implement the job in the container
YOUR_SCRIPT=/path/to/your/script
${S_EXEC} 'bash' ${YOUR_SCRIPT}
# END
﻿
Start the job:
# This asks for 4 V100 (16GB) GPUs with 10 CPUs to run a job for 48 hours.
# sbatch options must come before the job script; anything after it is passed to the script as arguments.
sbatch -p GPU-shared -n 10 --gpus=v100-16:4 -t 48:00:00 -o </path/to/output_files/output_name>.out <job name>.job

# Below is the command for the ROBO Cluster
# Please don't use it for debugging or unpolished code under development (this is expensive!)
# Please replace the XXXXX.X.XXXXXXX with the Oracle String from Basti or Admins.
# This asks for 1 H100 with 10 CPUs to run a job for 48 hours.
sbatch -p ROBO --comment="XXXXX.X.XXXXXXX" -n 10 --gres=gpu:h100:1 -t 48:00:00 -o </path/to/output_files/output_name>.out <job name>.job
-p: Partition / resource type (suggested: GPU-shared)
-t: Time limit (hh:mm:ss)
-n: Number of CPUs (strongly suggested: 5 * number of GPUs)
--gpus: <GPU type>:<number of GPUs>, where the GPU type is v100-32 or v100-16
-o: Output file (records the terminal output)
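One detail worth keeping in mind when assembling these commands: sbatch treats everything after the job script path as arguments to the script itself, so options such as -o only take effect when they come before it. The sketch below builds the argument list in that order; `sbatch_command` and the file names `train.job` and `outputs/train.out` are hypothetical, and the defaults mirror the GPU-shared example.

```python
import shlex


def sbatch_command(job_script, output_file, partition="GPU-shared",
                   num_cpus=10, gpus="v100-16:4", time_limit="48:00:00",
                   script_args=()):
    """Build an sbatch argument list with every sbatch option ahead of the job script."""
    options = [
        "-p", partition,
        "-n", str(num_cpus),
        f"--gpus={gpus}",
        "-t", time_limit,
        "-o", output_file,
    ]
    # Anything placed after the script path is handed to the script as "$1", "$2", ...
    return ["sbatch", *options, job_script, *script_args]


cmd = sbatch_command("train.job", "outputs/train.out")  # hypothetical file names
print(shlex.join(cmd))
```

Passing `script_args=("3",)` would append `3` after the script, which is how a value reaches `$1` inside the job script.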
You can also specify the arguments in the .job file by "#SBATCH" (instead of through command line), for example, the following .job:
#!/bin/bash

#SBATCH -N 1 # Number of nodes
#SBATCH -n 1
#SBATCH -p ROBO
#SBATCH --gpus=h100:1 # GPU specification: H100
#SBATCH -t 2-00:00 # Estimated time, 48 hours max. Format: D-HH:MM
#SBATCH --job-name test
#SBATCH -o job_%j.out
#SBATCH -e job_%j.err
#SBATCH --mail-type=END
#SBATCH --mail-user=xxx@andrew.cmu.edu

# echo commands to stdout
set -x

EXE=/bin/bash

WORKING_DIR=$PROJECT/SLURM/test_robo

cd $WORKING_DIR

singularity exec \
        --nv xxx.sif \
        $EXE \
        $WORKING_DIR/xxx.sh
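The value given to `#SBATCH -t` in the script above uses Slurm's days-hours:minutes form. As a worked example of that format, the sketch below converts a duration in hours into such a string; `slurm_time` is a hypothetical helper, not something provided by Slurm or PSC.

```python
def slurm_time(hours: float) -> str:
    """Format a duration in hours as Slurm's days-hours:minutes string (D-HH:MM)."""
    total_minutes = round(hours * 60)
    days, rem = divmod(total_minutes, 24 * 60)
    hh, mm = divmod(rem, 60)
    return f"{days}-{hh:02d}:{mm:02d}"


print(slurm_time(48))   # 2-00:00, the 48-hour limit used above
print(slurm_time(1.5))  # 0-01:30
```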
﻿
Note: If you use a conda virtual environment in the Singularity container, add the following at the top of YOUR_SCRIPT:
source /opt/conda/etc/profile.d/conda.sh
conda activate your_env
When you build the Docker image, it is strongly recommended to install conda in "/opt/conda".
6.4 Watch your jobs
Log in to OnDemand: https://ondemand.bridges2.psc.edu/pun/sys/dashboard

Under Jobs / Active Jobs you'll see both your interactive sessions and your batch jobs (some may still be queued).
﻿
You can also log in to the GPU compute nodes running your jobs from the PSC login nodes using:
[bli5@bridges2-login011 ~]$ ssh [GPU hostname].ib.bridges2.psc.edu
Use Wandb
Tensorboard is for boomers. Use Wandb:  https://wandb.ai 
Placeholder:  https://medium.com/mlearning-ai/remote-tensorboard-viewing-on-your-local-browser-b0dc5c5a634a 
7. Job Arrays
Job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily; job arrays with millions of tasks can be submitted in milliseconds (subject to configured size limits). All jobs must have the same initial sbatch options (e.g. size, time limit, etc.).

This is a powerful tool for running a large number of processes with different config files or parameters. The example below can be extended further with for loops in the bash scripts and other recursive elements.
﻿
An example way of setting up and running job arrays:
﻿
[nkeetha@bridges2-login013 sbatch]$ sbatch extract_frames.sbatch
﻿
extract_frames.sbatch
#!/bin/bash

#SBATCH -p GPU-shared
#SBATCH -t 1:00:00
#SBATCH -n 5
#SBATCH -J extract_frames
#SBATCH --gpus=v100-16:1
#SBATCH --output=/ocean/projects/cis220039p/nkeetha/sbatch/outputs/extract_frames/%A_%a.out
#SBATCH --array=1-12 # job array index

# pass this task's array index to the job script
'bash' /ocean/projects/cis220039p/nkeetha/jobs/extract_frames.job ${SLURM_ARRAY_TASK_ID}

extract_frames.job
#!/bin/bash

A_ID=$1

# echo commands to stdout
set -x

# create singularity container
source /etc/profile.d/modules.sh
SIF=/ocean/projects/cis220039p/nkeetha/data/singularity/nvidia_frames_base.sif
S_EXEC="singularity exec -B /ocean:/ocean --nv ${SIF}"

# implement the job in the container
SCRIPT=/ocean/projects/cis220039p/nkeetha/scripts/extract_frames.sh
${S_EXEC} 'bash' ${SCRIPT} ${A_ID}

# END

extract_frames.sh
#!/bin/bash

A_ID=$1

cd /ocean/projects/cis220039p/nkeetha/daa/nvidia_to_frames

python3 run.py configs/nea_heli/config_${A_ID}.yaml
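The array index can also be read directly inside Python instead of being passed down through each script. The sketch below shows that mapping under stated assumptions: `config_for_task` and `current_task_id` are hypothetical helpers, and the `configs/nea_heli/config_<id>.yaml` layout is the one used in extract_frames.sh above.

```python
import os
from pathlib import Path


def config_for_task(config_dir: str, task_id: int) -> Path:
    """Map a job-array index to its config file, mirroring config_${A_ID}.yaml above."""
    return Path(config_dir) / f"config_{task_id}.yaml"


def current_task_id(default: int = 1) -> int:
    """Read SLURM_ARRAY_TASK_ID, falling back to `default` outside a job array."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


# With `#SBATCH --array=1-12`, tasks see the indices 1 through 12.
for task_id in range(1, 13):
    print(task_id, config_for_task("configs/nea_heli", task_id))
```

The fallback in `current_task_id` lets the same script run in an interactive session, where Slurm does not set the variable.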
8. Show job history
Use the following command to show your job history:
sacct -S 2023-01-01 \
--format="jobid,jobname%-12,partition,Start,ElapsedRaw,AllocTRES%-70"
More details on the sacct command can be found in the [SLURM documentation](https://slurm.schedmd.com/sacct.html).
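The ElapsedRaw (seconds) and AllocTRES columns are enough to estimate how many GPU-hours your jobs consumed. The sketch below assumes the output was produced with sacct's pipe-delimited mode (`-P`) and one row per job (`-X`), and that AllocTRES contains a `gres/gpu=<count>` entry; the sample text and the `gpu_hours` helper are hypothetical.

```python
import re

# Hypothetical sample of pipe-delimited sacct output; real output will differ.
SAMPLE = """JobID|JobName|Partition|Start|ElapsedRaw|AllocTRES
1001|train|GPU-shared|2023-01-02T10:00:00|7200|billing=5,cpu=5,gres/gpu=1,mem=60G,node=1
1002|extract|GPU-shared|2023-01-03T09:00:00|1800|billing=20,cpu=20,gres/gpu=4,mem=240G,node=1
"""


def gpu_hours(sacct_text: str) -> float:
    """Sum elapsed time multiplied by allocated GPU count over all rows, in GPU-hours."""
    lines = [line for line in sacct_text.splitlines() if line.strip()]
    header = [name.lower() for name in lines[0].split("|")]
    total = 0.0
    for line in lines[1:]:
        row = dict(zip(header, line.split("|")))
        match = re.search(r"gres/gpu=(\d+)", row.get("alloctres", ""))
        gpus = int(match.group(1)) if match else 0  # CPU-only jobs count as 0 GPUs
        total += int(row["elapsedraw"] or 0) * gpus / 3600
    return total


print(gpu_hours(SAMPLE))  # 2 h on 1 GPU plus 0.5 h on 4 GPUs
```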
9. Example Discover Proposal
If you would like to set up a new partition for a new investigator or group, please follow the PSC onboarding document located here. It also contains a sample ACCESS DISCOVER proposal at the end of the document.
10. Rerun for 3D/4D Visualizations on Remote/Headless Cluster
Downloading files and visualizing on the local machine is for boomers. Same goes for Open3D.
Instead use Rerun! Super easy multi-modal 4D visualizations which are snappy to interact with and also easy to share with others!
Try the interactive demos here: https://rerun.io/viewer
About Rerun: https://rerun.io/docs/getting-started/what-is-rerun
﻿
Steps to setup Rerun for any cluster or remote computing device:
a) Setup port forwarding
To enable port forwarding for Rerun, you can either ssh from the command line or configure VSCode to always forward a port.
Command line example:
ssh -L <rr-port>:localhost:<rr-port> \
    -L <ws-server-port>:localhost:<ws-server-port> \
    -L <rr-viewer-port>:localhost:<rr-viewer-port> \
    <user-name>@bridges2.psc.edu
VSCode config example:
Host bridges2.psc.edu
    HostName bridges2.psc.edu
    User <user-name> # For example, nkeetha
    LocalForward localhost:<rr-port> localhost:<rr-port>
    LocalForward localhost:<ws-server-port> localhost:<ws-server-port>
    LocalForward localhost:<rr-viewer-port> localhost:<rr-viewer-port>
Values enclosed in <> are placeholders; please replace them with your own values.
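Because the same three ports are forwarded in several places, it can help to generate the ssh command from one list of ports. This is a minimal sketch: `forward_args` and `ssh_command` are hypothetical helpers, and the port numbers 9876, 9877, and 9090 are arbitrary stand-ins for <rr-port>, <ws-server-port>, and <rr-viewer-port>.

```python
import shlex


def forward_args(ports):
    """Return `-L <port>:localhost:<port>` pairs for each port to forward."""
    args = []
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        args += ["-L", f"{port}:localhost:{port}"]
    return args


def ssh_command(user, host, ports):
    """Build the full ssh argument list with one local forward per port."""
    return ["ssh", *forward_args(ports), f"{user}@{host}"]


# Hypothetical port numbers for <rr-port>, <ws-server-port>, and <rr-viewer-port>.
cmd = ssh_command("nkeetha", "bridges2.psc.edu", [9876, 9877, 9090])
print(shlex.join(cmd))
```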
b) Install Rerun
Install Rerun on your local machine using one of the following:
Via pip:
pip install rerun-sdk
Via Conda:
conda install -c conda-forge rerun-sdk
Likewise, install Rerun in the virtual environment, Docker image, or Singularity image that you use on the remote cluster.
c) Port forward compute node & serve Rerun websocket server on remote device
First, request compute for an interactive debugging session:
salloc -p RM-small -t 06:00:00 -n 5 # Asks for a CPU job
sleep 365d # Keeps the interactive job from timing out
Then, in a new terminal on the head/login node, SSH into the compute node that has been allocated to you.
You can find the compute node's name in the output of the interactive session you requested above (for example, r002.ib.bridges2.psc.edu or v001.ib.bridges2.psc.edu).
ssh -L <rr-port>:localhost:<rr-port> \
    -L <ws-server-port>:localhost:<ws-server-port> \
    -L <rr-viewer-port>:localhost:<rr-viewer-port> \
    <node-name>.ib.bridges2.psc.edu
Now, in the SSH session, launch your code environment and run tmux to create a terminal for the Rerun server.
For example:
cd path/to/singularity
singularity instance start --nv example.sif example
singularity run --nv instance://example
cd path/to/venvs
source venv_name/bin/activate
tmux new-session -s debug
rerun --serve --port <rr-port> --ws-server-port <ws-server-port> --web-viewer-port <rr-viewer-port>
# Press Ctrl+b, then c to create another tmux window for running other code
d) Access Rerun viewer on local machine
Now, the Rerun Viewer should be accessible on your local machine by running the following command:
rerun ws://localhost:<ws-server-port>
Alternatively, open the viewer in your local browser:
http://localhost:<rr-viewer-port>?url=ws://localhost:<ws-server-port>

You should see that the connection has been established with the remote server:
Connection to ws://localhost:<ws-server-port> established
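Both viewer addresses are derived from the same two ports, so they can be generated together to avoid typos. The sketch below is only string formatting following the two address patterns shown above; `viewer_urls` is a hypothetical helper and the port numbers are arbitrary examples.

```python
def viewer_urls(ws_server_port: int, rr_viewer_port: int) -> dict:
    """Return the WebSocket address and the browser URL described above."""
    ws_url = f"ws://localhost:{ws_server_port}"
    return {
        "native": ws_url,  # pass to: rerun <ws_url>
        "browser": f"http://localhost:{rr_viewer_port}?url={ws_url}",
    }


# Hypothetical port numbers; use the ones you forwarded over ssh.
urls = viewer_urls(ws_server_port=9877, rr_viewer_port=9090)
print(urls["native"])
print(urls["browser"])
```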
e) Log interactive visualizations and much more!
Now you are all set to use Rerun! Start by trying out the demo code at the end of this section.
To learn more about Rerun, try out the simple tutorials here: https://rerun.io/docs/getting-started/data-in/python
You should now have a nice tool for rich visualizations!
﻿
Save the demo code below as demo.py and run it with:
python3 demo.py --addr "0.0.0.0:<rr-port>"
demo.py
from __future__ import annotations

import argparse
from math import tau

import numpy as np
import rerun as rr  # pip install rerun-sdk
from rerun.utilities import bounce_lerp, build_color_spiral

DESCRIPTION = """
# DNA
This is a minimal example that logs synthetic 3D data in the shape of a double helix. The underlying data is generated
using numpy and visualized using Rerun.

The full source code for this example is available
[on GitHub](https://github.com/rerun-io/rerun/blob/latest/examples/python/dna).
""".strip()


def str2bool(v: str) -> bool:
    """Parse a command-line boolean such as "true", "yes", "1", "false", "no", or "0"."""
    value = v.strip().lower()
    if value in ("y", "yes", "t", "true", "on", "1"):
        return True
    if value in ("n", "no", "f", "false", "off", "0"):
        return False
    raise argparse.ArgumentTypeError(f"invalid boolean value: {v!r}")


def script_add_rerun_args(parser: argparse.ArgumentParser) -> None:
    """
    Add common Rerun script arguments to `parser`.

    Parameters
    ----------
    parser : ArgumentParser
        The parser to add arguments to.

    """
    parser.add_argument('--headless', type=str2bool, nargs='?', const=True, default=True, help="Don't show GUI")
    parser.add_argument(
        "--connect",
        dest="connect",
        type=str2bool, nargs='?', const=True, default=True,
        help="Connect to an external viewer",
    )
    parser.add_argument(
        "--serve",
        dest="serve",
        type=str2bool, nargs='?', const=True, default=False,
        help="Serve a web viewer (WARNING: experimental feature)",
    )
    # Replace <rr-port> with your own port, or pass --addr on the command line
    parser.add_argument("--addr", type=str, default="0.0.0.0:<rr-port>", help="Connect to this ip:port")
    parser.add_argument("--save", type=str, default=None, help="Save data to a .rrd file at this path")
    parser.add_argument(
        "-o",
        "--stdout",
        dest="stdout",
        action="store_true",
        help="Log data to standard output, to be piped into a Rerun Viewer",
    )


def log_data() -> None:
    rr.log("description", rr.TextDocument(DESCRIPTION, media_type=rr.MediaType.MARKDOWN), static=True)

    rr.set_time_seconds("stable_time", 0)

    NUM_POINTS = 100

    # points and colors are both np.array((NUM_POINTS, 3))
    points1, colors1 = build_color_spiral(NUM_POINTS)
    points2, colors2 = build_color_spiral(NUM_POINTS, angular_offset=tau * 0.5)
    rr.log("helix/structure/left", rr.Points3D(points1, colors=colors1, radii=0.08))
    rr.log("helix/structure/right", rr.Points3D(points2, colors=colors2, radii=0.08))

    rr.log("helix/structure/scaffolding", rr.LineStrips3D(np.stack((points1, points2), axis=1), colors=[128, 128, 128]))

    time_offsets = np.random.rand(NUM_POINTS)
    for i in range(400):
        time = i * 0.01
        rr.set_time_seconds("stable_time", time)

        times = np.repeat(time, NUM_POINTS) + time_offsets
        beads = [bounce_lerp(points1[n], points2[n], times[n]) for n in range(NUM_POINTS)]
        colors = [[int(bounce_lerp(80, 230, times[n] * 2))] for n in range(NUM_POINTS)]
        rr.log(
            "helix/structure/scaffolding/beads", rr.Points3D(beads, radii=0.06, colors=np.repeat(colors, 3, axis=-1))
        )

        rr.log(
            "helix/structure",
            rr.Transform3D(rotation=rr.RotationAxisAngle(axis=[0, 0, 1], radians=time / 4.0 * tau)),
        )


def main() -> None:
    parser = argparse.ArgumentParser(description="Logs rich data using the Rerun SDK.")
    script_add_rerun_args(parser)  # Options: --headless, --connect, --serve, --addr, --save, --stdout
    args = parser.parse_args()

    rr.script_setup(args, "rerun_example_dna_abacus")
    log_data()
    rr.script_teardown(args)


if __name__ == "__main__":
    main()