Healthtech cluster
Getting an account
Contact Gabriel Renaud, and we will then contact Peter Wad Sackett.
First time login
- Do you have a DTU account? If not, contact Gabriel.
- Have you changed your DTU password after August 2021? If not, change it at password.dtu.dk.
- If you can log in to DTU webmail https://mail.dtu.dk, your credentials should be in order.
- If you cannot log in to DTU webmail, you need to set up MFA (2-factor authentication). We refer to the general DTU instructions on how to do that.
- At this stage, try to log in using your DTU username and password:
ssh -XC <username>@login.healthtech.dtu.dk
Example: ssh -XC gabre@login.healthtech.dtu.dk
- You will be prompted for the password (you will not see the characters you type) and for the MFA code.
- If you cannot log in at this stage using your DTU username, password and MFA, then you are probably not enrolled in Microsoft Azure. Enroll here. Use your DTU email as username. You will be taken to the DTU login procedure for verification of your identity. After the enrollment, try to log in again.
- Log out again using Ctrl-D or by typing "logout".
Subsequent logins
- Login using:
ssh -XC <username>@login.healthtech.dtu.dk
- Enter your DTU password and the MFA code when prompted.
- You should see:
<username>@login ~$
This is the login node; do not run anything there.
- Then select a node from 01 to 14:
ssh -XC <username>@nodeX
where X is 01, 02, 03, ..., 14
- If this is your first time logging in, read:
/home/README.txt
very carefully once you are on a node. It will tell you where the different directories are.
- Check whether the node is busy using:
htop
You will see the CPU usage at the top and the memory usage below.
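Putting the steps together, a typical session looks like this (node05 is just an example, and gabre is the example username from above):
ssh -XC gabre@login.healthtech.dtu.dk
ssh -XC gabre@node05
htop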
Nodes
NodeID | CPUs | GHz (7z) | MIPS (7z) | Memory (MB) |
---|---|---|---|---|
node01 | 24 | 2.67GHz | 3197 | 64,304 |
node02 | 16 | 2.40GHz | 2898 | 64,402 |
node03 | 24 | 3.07GHz | 3713 | 96,656 |
node04 | 24 | 2.93GHz | 3551 | 193,424 |
node05 | 24 | 3.33GHz | 3773 | 193,424 |
node06 | 24 | 2.40GHz | 3017 | 64,400 |
node07 | 64 | 2.70GHz | 3463 | 515,942 |
node08 | 20 | 2.30GHz | 3456 | 128,796 |
node09 | 16 | 2.53GHz | 2991 | 120,853 |
node10 | 16 | 2.53GHz | 2985 | 120,853 |
node11 | 40 | 2.60GHz | 3872 | 257,847 |
node12 | 16 | 2.80GHz | 3458 | 24,081 |
node13 | 16 | 2.53GHz | 3036 | 24,081 |
node14 | 16 | 2.43GHz | 3002 | 128,913 |
compute01 | 128 | 3.53GHz | 3671 | 1,056,315 |
compute02 | 192 | 2.62GHz | 2917 | 1,031,526 |
compute03 | 128 | 3.20GHz | 2449 | 515,453 |
Running things in parallel versus interactively
You can log in to node07, node10, node12 and compute01 directly. The rest are non-interactive, meaning you need to use SLURM (see below). On the interactive nodes, you can use GNU parallel. First, write a file with the commands, let's call it "list_cmds". Use full paths, not relative paths.
Then pick an interactive server to run the commands on and write:
cat list_cmds | parallel
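As a sketch, a list_cmds file could contain lines like these (the paths are made-up placeholders; the point is simply one full command per line):
/usr/bin/gzip -k /full/path/to/data/sample1.fastq
/usr/bin/gzip -k /full/path/to/data/sample2.fastq
/usr/bin/gzip -k /full/path/to/data/sample3.fastq
By default GNU parallel runs one command per CPU core; to avoid taking over the whole node you can limit the number of simultaneous jobs, for example to 8:
cat list_cmds | parallel -j 8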
General considerations
You can ssh to the login node and, once on the login node, to the other nodes without 2-factor authentication or a password by using ssh keys. To set up the ssh keys and avoid typing your password+2FA, find instructions here: Sshnopassword.
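The gist of the procedure, sketched with standard OpenSSH commands (the Sshnopassword page is the authoritative guide and may differ in details), is to generate a key pair on your own machine and copy the public key to the cluster:
ssh-keygen -t ed25519
ssh-copy-id <username>@login.healthtech.dtu.dk
Once on the login node, the same ssh-copy-id step towards a node (or the same authorized key, if the home directories are shared) removes the password prompt for the internal hops as well.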
Do not forget to nice -19 your commands.
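For example, to run the commands from the list_cmds file above at the lowest priority, it is enough to nice the parallel process itself, since its child processes inherit the niceness:
cat list_cmds | nice -19 parallel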
Using SLURM
cancel your jobs
For a specific job:
scancel JOBID
For all your jobs:
scancel -u gabre
get info on the nodes
To get info on the nodes:
sinfo -N -l
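If you only care about a few nodes, sinfo can be restricted to a node list (the node names here are just an example):
sinfo -N -l -n node01,node02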
submit jobs using sbatch
We will launch 100 jobs that each generate random numbers and check whether they are prime. First make sure that you have a directory called ~/slurm_out/ and then run:
for i in `seq 100`; do echo "/home/ctools/bin//python3 /home/projects/MAAG/FileDescriptors/random_int_generator.py --min 1000000000000 --max 100000000000000 1000 | /home/ctools/bin//python3 /home/projects/MAAG/FileDescriptors/prime_checker.py"; done | \
xargs -I CMD sbatch --job-name=gabriel_job --mem=2G --time=00:10:00 --cpus-per-task=1 \
--output=$HOME/slurm_out/slurm_output_%j.out --error=$HOME/slurm_out/slurm_error_%j.err --wrap="CMD"
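The same submission can also be written as a job script, which is easier to read and rerun. A minimal sketch, assuming you call the file primes.sh and replace <full-path-to-your-home> with the full path to your home directory (~ is generally not expanded inside #SBATCH lines):
#!/bin/bash
#SBATCH --job-name=gabriel_job
#SBATCH --mem=2G
#SBATCH --time=00:10:00
#SBATCH --cpus-per-task=1
#SBATCH --output=<full-path-to-your-home>/slurm_out/slurm_output_%j.out
#SBATCH --error=<full-path-to-your-home>/slurm_out/slurm_error_%j.err
/home/ctools/bin/python3 /home/projects/MAAG/FileDescriptors/random_int_generator.py --min 1000000000000 --max 100000000000000 1000 | /home/ctools/bin/python3 /home/projects/MAAG/FileDescriptors/prime_checker.py
Then the 100 jobs can be launched with:
for i in `seq 100`; do sbatch primes.sh; done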
checking jobs
Either use:
squeue
or:
watch -n1 "squeue"
to refresh every 1s. You can limit the output to a single user with -u [username].
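For example, to watch only your own jobs, using the example username gabre:
watch -n1 "squeue -u gabre"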