Frequently Asked Questions

  1. What GPU cards are installed on the GPU hosts?

    Answer: Nvidia Tesla T4 (16GB vRAM), Tesla A16 (16GB vRAM), Tesla A30 (24GB vRAM), Tesla A40 (48GB vRAM) and Tesla A100 (80GB vRAM)

  2. What are the general platform characteristics of the GPU hosts?

    Answer: The GPU hosts use a mix of 24-core/48-thread Intel Xeon CPUs with 256GB RAM, 64-core AMD EPYC 7702P CPUs, and 96-core AMD 9564 CPUs.

  3. How do I see what Slurm jobs are running?

    Answer: invoke any one of the following commands on gpucluster:

    # List all jobs currently in the queue (brief format)
    squeue
    # List all jobs currently in the queue (extended format)
    squeue -l
    # List only your own jobs
    squeue --me
    

    Please run 'man squeue' for additional information.

  4. How do I delete a Slurm job?

    Answer: First, run squeue to get the Slurm job ID from the JOBID column, then run:

    scancel <id of my job>
    

    You can only delete your own Slurm jobs.

  5. How many GPU hosts are there?

    Answer: As of July 2023, there are 12 GPU host servers, some of which also host DoC Cloud GPU virtual machines.

  6. How do I analyse a specific error in the Slurm output file/e-mail after running a Slurm job?

    Answer: If the reason for the error is not apparent from your job’s output, make a post on the Edstem CSG board, including all relevant information, for example:

    • the context of the Slurm command that you are running. That is, what are you trying to achieve and how have you gone about achieving it? Have you created a Python virtual environment? Are you using a particular server or deep learning framework?
    • the Slurm script/command that you used to submit the job. Please include the full paths to the scripts if they live under /vol/bitbucket.
    • what you believe the expected output should be.
    • the details of any error message displayed. You would be surprised at how many people forget to include this.

  7. I receive no output from a Slurm job. How do I go about debugging that?

    Answer: This is an open-ended question. Please first confirm that your Slurm job does indeed generate output when run interactively. You may be able to use one of the ‘gpu01-36’ interactive lab computers to perform an interactive test. If you still need assistance, please follow the advice in the preceding FAQ entry (number 6).
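
    As a first check, you can also ask Slurm where it wrote (or tried to write) your job's output; this works for running jobs and for jobs that finished very recently. The job ID below is a placeholder:

    # Show the output, error and working-directory paths recorded for the job
    scontrol show job <jobid> | grep -E 'StdOut|StdErr|WorkDir'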

  8. How do I customise my job submission options?

    Answer: Add Slurm directives (as #SBATCH comments) to your job script, for example:

    # To request 1 or more GPUs (default is 1):
    #SBATCH --gres=gpu:1
    
    # To request a 24GB Tesla A30 GPU:
    #SBATCH --partition a30
    
    # Please note: only a few 48GB/80GB GPUs are available, and interactive jobs on them are not permitted
    # For other GPUs, refer to 6b. GPU types, including the research equivalents of the above
    
    # To receive email notifications
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=<your_username>
    
    # Customise the job output file name
    #SBATCH --output=<your_job_name>%j.out
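
    Putting these options together, a minimal example job script might look like the following. The script name, virtual-environment path and Python script are placeholders, not CSG-provided files:

    #!/bin/bash
    #SBATCH --gres=gpu:1
    #SBATCH --partition a30
    #SBATCH --mail-type=ALL
    #SBATCH --mail-user=<your_username>
    #SBATCH --output=<your_job_name>%j.out

    # Activate your own virtual environment (example path only)
    source /vol/bitbucket/<your_username>/myvenv/bin/activate

    # Confirm which GPU was allocated, then run your program
    nvidia-smi
    python my_training_script.py

    Save this as, for example, my_job.sh and submit it with:

    sbatch my_job.sh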
    
  9. How do I run a job interactively?

    Answer: Use srun and specify a GPU and any other resources you need, e.g. for a bash shell:

    srun --pty --gres=gpu:1 bash
    

    Update: use 'salloc' as detailed in Step 1c.
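
    A minimal salloc-based sketch (assuming one GPU is sufficient; see Step 1c for the authoritative workflow) is:

    # Request an allocation with one GPU, then start a shell on the allocated node
    salloc --gres=gpu:1
    srun --pty bash
    # ... run your interactive work, then ...
    exit   # leave the shell on the GPU node
    exit   # release the allocation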

    Please quit an interactive job when finished to free up the GPU for others, e.g.:

    exit
    
  10. I need a particular software package to be installed on a GPU host.

    Answer: Have you first tried installing the package in a Python virtual environment or in your own home directory with the command:

    pip install --user <packagename>
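
    Alternatively, a minimal virtual-environment sketch (the environment path below is only an example) is:

    # Create and activate a virtual environment, then install the package into it
    python3 -m venv /vol/bitbucket/<your_username>/myvenv
    source /vol/bitbucket/<your_username>/myvenv/bin/activate
    pip install <packagename>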
    

    If the above options do not work then make a post on the Edstem CSG board with details of the package that you would like to be installed on the GPU server(s).

    Please note: CSG are only able to install standard Ubuntu packages if doing so does not conflict with any existing package or functionality on all the GPU servers.

  11. My job is stuck in queued status. What does this mean?

    Answer: This could be because all GPUs are in use. PD (pending) status also occurs if you are already running two jobs; your job will move to R (running) when one of your previous jobs completes. A reason of (QOSMaxGRESPerUser) means you are already using your maximum of two GPUs at any one time.
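
    To see the exact reason Slurm reports for a pending job, check the REASON column, for example:

    # The reason appears in the NODELIST(REASON) column
    squeue --me
    # Or print the job ID, state and reason explicitly (the format string is just an example)
    squeue --me --format="%.10i %.12T %.30R"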

  12. When will my job start?

    Answer: Run:

    squeue --me --start
    

    An estimated start time is listed, based on the maximum runtime of the jobs currently running, but your job may start sooner if those jobs finish before their maximum runtime, or 'walltime', is reached.

  13. What are the CUDA compute capabilities for each GPU?

    Answer: Please consult the NVIDIA Compatibility Index for more information.

    The cluster GPUs support the following levels:

    7.5 (T4), 8.0 (A30, A100), 8.6 (A40, A16)

    These should be considered if, for example, you are using an older version of PyTorch and receiving 'not supported' errors.
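
    As a quick check, you can print the compute capability of the GPU allocated to your job from within PyTorch (assuming PyTorch is installed in your environment):

    # Prints a (major, minor) tuple, e.g. (8, 6) on an A40
    python3 -c "import torch; print(torch.cuda.get_device_capability())"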

  14. How many GPUs/CPUs/RAM may I use for my jobs?

    Answer: Three GPUs per user, 32 CPU cores, and 200GB RAM (subject to change due to teaching requirements).

    Please note: FairShare is now enabled. The more resources you consume, the longer your future jobs will wait to run, i.e. heavy consumption will lead to longer wait times; after a week of lower usage, your FairShare score will 'decay' back to normal levels. Please read the ICT HPC FAQ for the rationale.
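
    If FairShare accounting is exposed to users (an assumption about the cluster configuration), you can inspect your current FairShare score with sshare, for example:

    # Show your share of the cluster and current FairShare factor
    sshare -u $USER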
