| twister1; twister2; kronos | Tesla K40c | cc3.5 |
| dll1; dll2 | GeForce GTX 1080 | cc6.1 |
| titan | GeForce GTX 1080 | cc6.1 |
| dll3; dll4; dll5 | GeForce GTX 1080 Ti | cc6.1 | 10 | 11 GB | dll3 has only 9 GPUs since 2017/07 |
| dll6 | GeForce GTX 1080 Ti | cc6.1 |
Desktop machines:
All machines have CUDA 8.0 and should support both Theano and TensorFlow.
===== Rules =====

  * First, read [[internal:...]]
  * All the rules from [[:Grid]] apply, even more strictly than for CPU, because there are too many GPU users and not as many GPUs available. So as a reminder: always...
  * Always specify the number of GPU cards (e.g. ''...'')
  * If you need more than one GPU card, always request as many CPU cores as GPU cards. E.g. <code>...</code>
  * For interactive jobs, you can use ''...''
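The per-card rules above can be sketched as a submission helper. This is a hypothetical example: the queue name ''gpu*'', the resource name ''gpu='', and the parallel environment ''smp'' are assumptions, not the grid's confirmed names; check the truncated examples above for the real ones.

```shell
# Build a submission command that follows the rule above: request as many
# CPU cores (-pe smp) as GPU cards (-l gpu=...).
# NOTE: queue/resource names here are illustrative assumptions.
gpus=2
cmd="qsub -q 'gpu*' -l gpu=$gpus -pe smp $gpus train.sh"
echo "$cmd"
```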
===== How to use cluster =====
- | |||
- | In this section will be explained how to use cluster properly. | ||
==== Set-up CUDA and CUDNN ====
This environment has TensorFlow 1.0 and all necessary requirements for NeuralMonkey.
+ | |||
+ | ==== Pytorch Environment ==== | ||
+ | |||
+ | If you want to use pytorch, there is a ready-made environment in | ||
+ | |||
+ | / | ||
+ | | ||
+ | It does rely on the CUDA and CuDNN setup above. | ||
==== Using cluster ====
  /...
  # shows CUDA capability etc.
  ssh dll1; ~popel/...
  # who occupies which card on a given machine
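The occupancy listing above can feed a simple selection step. A sketch of picking the card with the most free memory; the canned ''$sample'' stands in for real output of ''nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits'' (a standard nvidia-smi query), so on a GPU machine replace it with that command:

```shell
# Pick the GPU index with the most free memory.
# $sample mimics `nvidia-smi --query-gpu=index,memory.free --format=csv,noheader,nounits`
sample='0, 512
1, 10240
2, 4096'
free_gpu=$(printf '%s\n' "$sample" | sort -t, -k2 -nr | head -n1 | cut -d, -f1)
echo "$free_gpu"
```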
=== Select GPU device ===
The variable CUDA_VISIBLE_DEVICES selects which GPU cards your job sees, e.g.:

  export CUDA_VISIBLE_DEVICES=0
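Several cards can be listed, comma-separated; note that inside the process CUDA renumbers the visible cards starting from 0:

```shell
# Restrict this shell (and anything launched from it) to physical cards 0 and 2;
# inside the process CUDA renumbers them as devices 0 and 1.
export CUDA_VISIBLE_DEVICES=0,2
echo "visible: $CUDA_VISIBLE_DEVICES"
```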
To list available devices, use:
GPU specs for those GPUs we have:
  * [[http://...]]

==== Individual acquisitions: NVIDIA Academic Hardware Grants ====

There is an easy way to get one high-end GPU: [[https://...]]

Take care, however, to coordinate the grant applications a little, so that not too many arrive from UFAL within a short time: these grants are explicitly //not// intended to build GPU clusters, they are "..."

Known NVIDIA Academic Hardware Grants:

  * Ondřej Plátek - granted (2015)
  * Jan Hajič jr. - granted (early 2016)