gpu [2017/10/17 16:37] popel
All machines have CUDA 8.0 and should support both Theano and TensorFlow.

===== Disk space =====

All the GPU machines are at Malá Strana (not at Troja), so you should not use ''/lnet/tspec/work/''; instead use:
  - ''/lnet/spec/work/'' (alias ''/net/work/'') - Lustre disk space at Malá Strana
  - ''/net/cluster/TMP'' - NFS hard disk for temporary files, slower than Lustre for most tasks
  - ''/net/cluster/SSD'' - also NFS, but faster than TMP thanks to SSDs
  - ''/COMP.TMP'' - local (per-machine) space for temporary files (use it instead of ''/tmp''; over-filling ''/COMP.TMP'' should not halt the system)

===== Rules =====

  * First, read [[internal:Linux network]] and [[:Grid]].
  * All the rules from [[:Grid]] apply, even more strictly than for CPU jobs, because there are many GPU users and not as many GPUs available. As a reminder: always use GPUs via ''qsub'' (or ''qrsh''), never via ''ssh''. You can ssh to any machine e.g. to run ''nvidia-smi'' or ''htop'', but not to start computing on a GPU. Don't forget to specify your RAM requirements, e.g. ''-l mem_free=8G,act_mem_free=8G,h_vmem=12G''.
  * Always specify the number of GPU cards (e.g. ''gpu=1''), the minimal CUDA capability you need (e.g. ''gpu_cc_min3.5=1'') and your GPU memory requirements (e.g. ''gpu_ram=2G''). E.g. <code>qsub -q gpu.q -l gpu=1,gpu_cc_min3.5=1,gpu_ram=2G</code>
  * If you need more than one GPU card, always request as many CPU cores as GPU cards. E.g. <code>qsub -q gpu.q -l gpu=4,gpu_cc_min3.5=1,gpu_ram=7G -pe smp 4</code>
  * For interactive jobs you can use ''qrsh'', but make sure to end your job as soon as you don't need the GPU. E.g. <code>qrsh -q gpu.q -l gpu=1,gpu_ram=2G -pty yes bash</code>
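Putting the flags above together, a complete batch submission might look as follows. This is a sketch: the script name ''train.sh'' is a placeholder, and the resource values should be adjusted to your job.

<code>
# Hypothetical example: one GPU with CUDA capability >= 3.5 and 2 GB of GPU RAM,
# plus the RAM flags recommended above. -cwd runs the job in the current
# directory, -j y joins stdout and stderr. train.sh is a placeholder name.
qsub -q gpu.q -cwd -j y \
     -l gpu=1,gpu_cc_min3.5=1,gpu_ram=2G,mem_free=8G,act_mem_free=8G,h_vmem=12G \
     train.sh
</code>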
===== How to use cluster =====

This section explains how to use the cluster properly.

==== Set-up CUDA and CUDNN ====

This environment has TensorFlow 1.0 and all necessary requirements for NeuralMonkey.
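The exact environment variables behind this setup are not preserved in this revision; a typical sketch, assuming CUDA is installed under ''/usr/local/cuda'' (the path used by ''deviceQuery'' below) and CuDNN is unpacked into it:

<code>
# Assumed install location -- verify against the actual setup on the cluster.
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
</code>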

==== Pytorch Environment ====

If you want to use PyTorch, there is a ready-made environment in

  /home/hajicj/anaconda3/envs/pytorch/bin

It relies on the CUDA and CuDNN setup above.
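A quick way to use and sanity-check this environment (run inside a GPU job obtained via ''qrsh'', not over plain ssh):

<code>
# Put the environment first on PATH (path from above), then check that
# PyTorch was built with CUDA support and actually sees a card.
export PATH=/home/hajicj/anaconda3/envs/pytorch/bin:$PATH
python -c 'import torch; print(torch.__version__, torch.cuda.is_available())'
</code>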
==== Using the cluster ====
<code>
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
# shows CUDA capability etc.
ssh dll1; ~popel/bin/gpu_allocations
# shows who occupies which card on a given machine
</code>

=== Select GPU device ===

The variable CUDA_VISIBLE_DEVICES constrains TensorFlow and other toolkits to compute only on the selected GPUs. **Do not set this variable yourself** (unless debugging SGE); it is set for you automatically by SGE if you ask for some GPUs (see above).

To list available devices, use:
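The exact command is not preserved in this revision; one common option (an assumption, not necessarily what the page originally listed) is:

<code>
# Lists all cards, their memory usage, and the processes using them.
nvidia-smi
</code>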
GPU specs for those GPUs we have:
  * [[http://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf|Tesla K40c]]

==== Individual acquisitions: NVIDIA Academic Hardware Grants ====

There is an easy way to get one high-end GPU: [[https://developer.nvidia.com/academic_gpu_seeding|ask NVIDIA for an Academic Hardware Grant]]. All it takes is writing a short grant application (at most ~2 hrs of work from scratch; if you have a GAUK, ~15 minutes of copy-pasting). Due to the GPU housing issues (mainly rack space and cooling), Milan F. said we should request the Tesla-line cards (2017: check with Milan about this issue). If you want to have a look at an application, feel free to ask at hajicj@ufal.mff.cuni.cz :)

Take care, however, to coordinate the grant applications a little, so that not too many arrive from UFAL within a short time: these grants are explicitly //not// intended to build GPU clusters; they are "seeding" grants meant for researchers to try out GPUs (and fall in love with them, and buy a cluster later). If you are planning to submit the hardware grant, have submitted one, or have already been awarded one, please add yourself here.

Known NVIDIA Academic Hardware Grants:

  * Ondřej Plátek - granted (2015)
  * Jan Hajič jr. - granted (early 2016)