Healthtech cluster

Getting an account

Contact Gabriel Renaud, and we will then contact Peter Wad Sackett.

First time login

  • Do you have a DTU account? If not, contact Gabriel.
  • Have you changed your DTU password since August 2021? If not, change it at password.dtu.dk
  • If you can log in to DTU webmail https://mail.dtu.dk, your credentials should be in order.
  • If you cannot log in to DTU webmail, you should set up MFA (2-factor authentication). We refer to DTU's general instructions on how to do that.
  • At this stage, try to log in using your DTU username and password:
ssh -XC  <username>@login.healthtech.dtu.dk
Example: ssh -XC gabre@login.healthtech.dtu.dk
  • You will be prompted for the password (you cannot see the characters you type) and the MFA code.
  • If you cannot log in at this stage using your DTU username, password and MFA, then you are probably not enrolled in Microsoft Azure. Enroll here. Use your DTU email as your username. You will be taken to the DTU login procedure for verification of your identity.
    After the enrollment, try to log in again.
  • Log out again using Ctrl-D or by typing "logout".

Subsequent logins

  • Log in using:
ssh -XC  <username>@login.healthtech.dtu.dk
  • Enter your DTU password and the MFA code when prompted.
  • You should see:
<username>@login ~$ 

This is the login node. Do not run anything there.

  • Then select a node from 01 to 14:
ssh -XC  <username>@nodeX

where X is 01, 02, 03, ..., 14
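Example: ssh -XC gabre@node07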

  • If this is your first time logging in, read the file:
/home/README.txt

very carefully while on a node. It will tell you where the different directories are.

  • Check if the node is busy using:
htop

You will see the load on each CPU at the top and the memory usage below it.
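
If you prefer a quick, non-interactive check instead of htop, the standard uptime and free commands show the load average and the amount of free memory:

uptime
free -g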

Nodes

NodeID     CPUs  GHz (7z)  MIPS (7z)  Memory (MB)
node01       24   2.67GHz       3197       64,304
node02       16   2.40GHz       2898       64,402
node03       24   3.07GHz       3713       96,656
node04       24   2.93GHz       3551      193,424
node05       24   3.33GHz       3773      193,424
node06       24   2.40GHz       3017       64,400
node07       64   2.70GHz       3463      515,942
node08       20   2.30GHz       3456      128,796
node09       16   2.53GHz       2991      120,853
node10       16   2.53GHz       2985      120,853
node11       40   2.60GHz       3872      257,847
node12       16   2.80GHz       3458       24,081
node13       16   2.53GHz       3036       24,081
node14       16   2.43GHz       3002      128,913
compute01   128   3.53GHz       3671    1,056,315
compute02   192   2.62GHz       2917    1,031,526
compute03   128   3.20GHz       2449      515,453


Running things in parallel versus interactively

You can log in to node07, node10, node12 and compute01 directly. The rest are non-interactive, meaning you need to use SLURM (see below). On the interactive nodes, you can use parallel. First, write a file with the commands, let's call it "list_cmds". Use full paths, not relative paths.

Then pick an interactive server on which to run the commands and write:

   cat list_cmds | parallel 
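
For example, a minimal "list_cmds" could contain the random number generator from the sbatch example further down, once per line (in practice each line would typically be a different command or input file, always with full paths):

   /home/ctools/bin/python3 /home/projects/MAAG/FileDescriptors/random_int_generator.py --min 1000000000000 --max 100000000000000 1000
   /home/ctools/bin/python3 /home/projects/MAAG/FileDescriptors/random_int_generator.py --min 1000000000000 --max 100000000000000 1000

You can also combine parallel with nice and limit the number of simultaneous jobs, for example:

   cat list_cmds | nice -19 parallel -j 8

Here -j 8 runs at most 8 commands at a time; by default parallel starts one job per CPU core.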

General considerations

You can ssh to the login node, and from the login node to the other nodes, without 2-factor authentication or a password by using ssh keys. To set up the ssh keys and avoid having to type your password+2FA, find instructions here: Sshnopassword.

Do not forget to nice -19 your commands.
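
For example, prefix the command you want to run:

    nice -19 <your command>

This runs it at the lowest CPU priority, so it does not slow down other users on the node.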


Using SLURM

cancel your jobs

For a specific job:

    scancel JOBID

For all your jobs:


    scancel -u gabre


get info on the nodes

To get info on the nodes:

    sinfo -N -l


submit jobs using sbatch

We will launch 100 jobs that generate random numbers and check whether they are prime. First make sure that you have a directory called ~/slurm_out/ (see below) and then run:

     for i in `seq 100`; do
         echo "/home/ctools/bin/python3 /home/projects/MAAG/FileDescriptors/random_int_generator.py --min 1000000000000 --max 100000000000000 1000 | /home/ctools/bin/python3 /home/projects/MAAG/FileDescriptors/prime_checker.py"
     done | xargs -I CMD sbatch \
         --job-name=gabriel_job \
         --mem=2G \
         --time=00:10:00 \
         --cpus-per-task=1 \
         --output=$HOME/slurm_out/slurm_output_%j.out \
         --error=$HOME/slurm_out/slurm_error_%j.err \
         --wrap="CMD"
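
If ~/slurm_out/ does not exist yet, create it before submitting:

     mkdir -p ~/slurm_out

Once the jobs finish, the output of each job ends up in ~/slurm_out/slurm_output_<jobid>.out and its error messages in ~/slurm_out/slurm_error_<jobid>.err, since %j is replaced by the job ID.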

checking jobs

Either use:

     squeue 

or:


     watch -n1 "squeue"

to refresh every second. You can limit the output to one user using -u [username].
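
For example, to watch only your own jobs (using gabre, the username from the examples above):

     watch -n1 "squeue -u gabre"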