ASL_leonhard_euler
Leonhard and Euler Guide
- Leonhard and Euler Guide
- Preliminary
- Resources
- Connecting to the cluster
- Modules
- Python
- Using the pre-compiled binaries
- Setup Miniconda
- Creating Conda Environment
- Test your Python Installation
- Using Jupyter Notebook for Visualizing Results
- Storing Data
- General Procedure
- Implementation Commands
- Tarring a folder without compression
- Copying a folder from local computer to the cluster
- Copying a folder from cluster to your local computer
- Untarring the tar file to the local storage of a node
- Accessing the local scratch directory in Python
- Performance
- Scheduling Jobs
- Interactive jobs
- Monitoring Jobs
- Scheduling Python-Job Manual
- Python Debugging Tips
- Template Environment
- Template Project Overview
- Setting up Environment Variables on the cluster
- Using Ansible
- Installation
- Configure
- Testing the settings
- Scheduling Jobs Using Playbooks
Preliminary
This is the Readme copied from If you feel like you have found a nice tool or trick to help other people using the cluster, feel free to open an issue!
If you would like to contribute just feel free to mail me: [email protected]
Feel free to give the repository a star if it helps you!
What is coming:
- Example Project Tensorflow
- Example Project PyTorch
- Example Project PyTorch lightning
- Neptune.ai Logging
- Dataset Examples
This repository is under construction.
Resources
Read the Leonhard/Euler cluster guides: Getting_started_with_clusters
Connecting to the cluster

To connect, follow the guide on the official cluster website, which shows how to generate your local SSH key and copy it to the cluster.
Steps in short:
- Connect to the ETH network via VPN. Cisco AnyConnect is highly recommended (most stable).
- Generate your local ssh key.
- Copy your local ssh key to the cluster by running:
cat ~/.ssh/id_rsa.pub | ssh [email protected] "mkdir -p ~/.ssh && chmod 700 ~/.ssh && cat >> ~/.ssh/authorized_keys"
Here we assume you already created your SSH key at '~/.ssh/id_rsa.pub' on your local machine.
- Try to connect. 'ssh [email protected]'
Modules
When you connect to the cluster you connect to a login node. There exists a variety of modules pre-installed.
(https://scicomp.ethz.ch/wiki/Leonhard_applications_and_libraries)
First, make sure to use the new software stack with the following command:
env2lmod
You can list the currently loaded modules with:
module list
When you want to develop something in Python you can either use pre-compiled binaries by loading the correct python module
module load gcc/6.3.0 python_gpu/3.7.4 cuda/10.1.243
module load gcc/6.3.0 python_gpu/3.8.5 cuda/11.0.3
or create your own Python installation.
The job execution nodes are not directly connected to the internet, but you can access the internet by loading the proxy module.
module load eth_proxy
Python
Using the pre-compiled binaries
You can take a look into the provided pre-compiled python binaries here: https://scicomp.ethz.ch/wiki/Python_on_Euler
In general, we recommend setting up Miniconda to manage your Python environment. This allows you to fully match your cluster and local setups.
Setup Miniconda
Use Miniconda to set up a custom Python environment. (https://docs.conda.io/en/latest/miniconda.html)
To install miniconda:
- Connect to the cluster
- Navigate to $HOME
- Run the following:
cd ~ && wget https://repo.anaconda.com/miniconda/Miniconda3-py38_4.9.2-Linux-x86_64.sh
chmod +x Miniconda3-py38_4.9.2-Linux-x86_64.sh
./Miniconda3-py38_4.9.2-Linux-x86_64.sh
In general:
It is important to install the conda environment (which will contain a lot of small files) in your $HOME folder (/cluster/home/username/miniconda3). This directory is always copied to the compute node before a job runs. Your home folder is quite small (< 15 GB) but perfect for storing your code and your Python environments.
- Source the .bashrc file or open a new shell.
source ~/.bashrc
- Verify your installation:
You should now see the currently loaded conda environments in brackets before your username.
(base) [username@login-noden ~]$
Creating Conda Environment
Follow this guide on how to set up a new environment. When using GPUs, make sure to match the CUDA version. You can load different CUDA versions with module load. Also be aware of the GCC version. We recommend GCC 6.3.0 and CUDA 11.0.
Guide on how to manage conda environments
Execute the following command to create your Python environment named myenv (you can change the name):
conda create -n myenv python=3.8.5
conda activate myenv
Install some packages:
Example PyTorch Installation (Here it's important to match the cudatoolkit version!):
conda install pytorch==1.7.1 \
torchvision==0.8.2 \
torchaudio==0.7.2 \
cudatoolkit=11.0 -c pytorch
Test your Python Installation
- First, check your Python path. Command:
which python
Result:
/cluster/home/username/miniconda3/envs/myenv/bin/python
If a different path is shown, run conda deactivate.
Then reactivate your environment: conda activate myenv
- Open an interactive python shell:
python
import torch
torch.__version__
This checks that you have installed the correct PyTorch version.
Exit the shell with exit()
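The path check above can also be done from inside Python. The following is a minimal sketch and not part of the original guide: the helper name interpreter_in_env is hypothetical, and it assumes the conda layout shown above (.../miniconda3/envs/<name>/bin/python).

```python
import sys
from pathlib import PurePosixPath


def interpreter_in_env(executable: str, env_name: str) -> bool:
    """Return True if `executable` looks like .../envs/<env_name>/bin/python."""
    parts = PurePosixPath(executable).parts
    # Look for the consecutive path components "envs", <env_name>, "bin".
    for i in range(len(parts) - 2):
        if parts[i] == "envs" and parts[i + 1] == env_name and parts[i + 2] == "bin":
            return True
    return False


if __name__ == "__main__":
    # "myenv" is the example environment name used in this guide.
    print(sys.executable, interpreter_in_env(sys.executable, "myenv"))
```

Running this at the top of a job script gives an early hint that the wrong interpreter was picked up, before any heavy imports fail.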
Using Jupyter Notebook for Visualizing Results
Jupyter_on_Euler_and_Leonhard_Open
Storing Data
General Procedure
It's important to manage your data storage correctly on the cluster.
All large datasets should be stored under the /cluster/work/riner folder.
Also if your experiment results are large store them under the /cluster/work/riner as well.
It's important not to store many small files on the network storage. When you need to train your model on a large dataset, the workflow is the following:
- Tar the dataset folder without compression!
- Schedule the job and request SCRATCH storage (will be discussed in the job-section)
- Untar the dataset to the SCRATCH partition of the compute node, which is mounted under $TMPDIR.
- Now you can access the small files individually very fast given that they are on the SSD directly on the compute-node and no network transfer is needed.
If you don't follow this procedure and try to access a lot of small files on a network storage (/cluster/work/riner) you will slow down the network and your bandwidth will be massively reduced when you hit a certain file number limit.
Implementation Commands
Tarring a folder without compression
cd directory/containing/datasets
tar -cvf dataset.tar dataset_folder
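If you prefer to create the archive from a Python script, the standard library tarfile module can do the same thing. This is a minimal sketch under the assumption that the dataset fits the same folder layout as above; the function name pack_uncompressed is hypothetical.

```python
import tarfile
from pathlib import Path


def pack_uncompressed(folder: str, tar_path: str) -> int:
    """Pack `folder` into an uncompressed tar archive and return the member count."""
    folder_path = Path(folder)
    # Mode "w" writes a plain tar archive without gzip/bzip2/xz compression.
    with tarfile.open(tar_path, mode="w") as archive:
        archive.add(folder_path, arcname=folder_path.name)
    # Mode "r:" only accepts uncompressed archives, so this doubles as a check.
    with tarfile.open(tar_path, mode="r:") as archive:
        return len(archive.getnames())
```

The returned count includes the top-level folder itself as well as every file and subfolder inside it. Write the archive outside the folder being packed so it does not include itself.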
Copying a folder from local computer to the cluster
Open a shell on your local computer
scp -r ./path/to/local_folder [email protected]:/cluster/work/riner/some_folder
Copying a folder from cluster to your local computer
Open a shell on your local computer
scp -r [email protected]:/cluster/work/riner/results ./path/to/local_results
Untarring the tar file to the local storage of a node
tar -xvf /cluster/work/riner/datasets.tar -C $TMPDIR
Accessing the local scratch directory in Python
Given that the TMPDIR variable is automatically set you can access the location of the dataset as follows:
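import os
# TMPDIR points to the local scratch directory of the compute node.
tmpdir = os.getenv('TMPDIR')
# Extract the uncompressed tar archive into the local scratch directory.
os.system(f'tar -xf /cluster/work/riner/yourtarfile -C {tmpdir}')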
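As an alternative to calling tar through the shell, the extraction can be done with the standard library. This is a sketch, not the guide's own method: the function name extract_to_scratch is hypothetical, and the fallback to the system temp directory is an assumption added so the code also runs on a workstation where TMPDIR is not set.

```python
import os
import tarfile
import tempfile


def extract_to_scratch(tar_path: str) -> str:
    """Extract `tar_path` into $TMPDIR and return the extraction directory."""
    # Fall back to the system temp directory when TMPDIR is unset,
    # e.g. when testing on a local workstation (assumption, see above).
    scratch = os.environ.get("TMPDIR", tempfile.gettempdir())
    # Mode "r:" reads an uncompressed tar archive.
    with tarfile.open(tar_path, mode="r:") as archive:
        archive.extractall(path=scratch)
    return scratch
```

Your data loader can then build all file paths relative to the returned directory.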
Performance
Don't use compression if your files are already compressed, such as images stored as JPGs or PNGs.
HDF5 files are also handy to use.
If your dataset is small you can consider loading all files into the RAM given that you can request a huge amount of RAM.
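For the small-dataset case, loading everything into RAM could look roughly like this. It is only a sketch: the function name load_folder_into_ram is hypothetical, and it keeps the raw bytes of each file, leaving the decoding (for example of images) to your own code.

```python
from pathlib import Path
from typing import Dict


def load_folder_into_ram(folder: str) -> Dict[str, bytes]:
    """Read every file below `folder` into memory, keyed by relative path."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): path.read_bytes()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }
```

Make sure the requested memory of the job is larger than the total size of the folder.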
Scheduling Jobs
Read the Using the batch system section. Getting_started_with_clusters
Interactive jobs
First, let's start an interactive job running a shell.
bsub -n 16 -W 1:00 -R "rusage[mem=5000,ngpus_excl_p=2]" -R "select[gpu_mtotal0>=10000]" -R "rusage[scratch=10000]" -Is bash
This command returns an interactive bash session (-Is) with 16 cores (-n 16) that runs for 1 hour (-W 1:00) with 2 GPUs with at least 10 GB of memory. Memory and scratch are requested per core, so the job gets a total of 16 x 5000 MB of RAM and 16 x 10000 MB of SSD scratch.
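Because the mem and scratch values are multiplied by the number of cores, it is easy to request far more than intended. The arithmetic can be sketched as follows; the function name total_request is hypothetical and the per-core behaviour is taken from the description above.

```python
def total_request(cores: int, mem_mb_per_core: int, scratch_mb_per_core: int) -> dict:
    """Return the total RAM and scratch (in MB) a per-core request adds up to."""
    return {
        "mem_mb": cores * mem_mb_per_core,
        "scratch_mb": cores * scratch_mb_per_core,
    }


# The interactive job above: -n 16, mem=5000, scratch=10000.
print(total_request(16, 5000, 10000))
```

For the example command this gives 80000 MB of RAM and 160000 MB of scratch in total.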
Run nvidia-smi to see the GPU utilization and htop to see the CPU usage.
You can now simply activate the correct conda environment and run your Python code as on your local computer. This is especially useful for debugging. If your code crashes, the terminal might freeze, and you then have to submit a new interactive session.
If you know a workaround for this freezing problem, please share it!
Monitoring Jobs
You can see the running jobs with bjobs, or with bbjobs for more details.
You can use the JOB-ID to stop or peek at a job.
bkill JOB-ID # Sends stop signal to the selected job
bkill 0 # Sends stop signals to ALL-jobs.
bpeek JOB-ID # Prints STD OUT of the selected job to the terminal.
When you want to evaluate or debug certain problems, it's helpful to connect to the job execution node directly.
bjob_connect JOB-ID
You will see in brackets how the node changes from a login node to the execution node.
Scheduling Python-Job Manual
To schedule a Python job, create a shell script named submit.sh.
Don't forget to set the correct permissions for execution:
chmod +x submit.sh
# Always reload all-modules before execution for consistency.
module list &> /dev/null || source /cluster/apps/modules/init/bash
module purge
module load legacy new gcc/6.3.0 hdf5 eth_proxy
# Navigate to the folder containing your python project.
cd $HOME/my_project/
# Run main.py with the Python interpreter of your conda environment.
# "$@" forwards all arguments given to this script to the Python file.
$HOME/miniconda3/envs/myenv/bin/python main.py "$@"
Scheduling the Job:
bsub -I -n 4 -W 1:00 -R "rusage[mem=5000]" $HOME/submit.sh --env=hello --exp=world
main.py
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--exp', help='Some flag.')
    parser.add_argument('--env', help='Other flag')
    args = parser.parse_args()
    print(args.exp, args.env)
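To see how the arguments given to bsub end up in main.py without submitting a job, the parser can be fed an explicit list. This is a small sketch; build_parser is a hypothetical helper that mirrors the parser above, and the list stands in for what "$@" expands to in the shell script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Create the same parser as main.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--exp", help="Some flag.")
    parser.add_argument("--env", help="Other flag")
    return parser


# The explicit list replaces sys.argv[1:], i.e. the arguments forwarded by "$@".
args = build_parser().parse_args(["--env=hello", "--exp=world"])
print(args.exp, args.env)  # prints "world hello"
```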
Python Debugging Tips
When using interactive bash sessions, you will want to stop the program with Ctrl-C without freezing the terminal. It helps to catch the signal explicitly.
Add the following to the main script:
import signal
import sys

def signal_handler(signum, frame):
    print('exiting on CTRL-C')
    # `logger` is assumed to be defined elsewhere in your script (your experiment logger).
    logger.experiment.stop()
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
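If your script does not use a logger with an experiment attribute, the same idea works with any cleanup function. The following is a sketch under that assumption; make_handler and install are hypothetical names, and cleanup stands for whatever your script needs to close before exiting.

```python
import signal
import sys
from typing import Callable


def make_handler(cleanup: Callable[[], None]):
    """Build a handler that runs `cleanup` and then exits with status 0."""
    def handler(signum, frame):
        print(f"exiting on signal {signum}")
        cleanup()
        sys.exit(0)
    return handler


def install(handler) -> None:
    """Register `handler` for CTRL-C (SIGINT) and for SIGTERM."""
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
```

Call install(make_handler(my_cleanup)) once at the start of the main script, where my_cleanup is your own function. Note that signal.signal can only be called from the main thread.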
Template Environment
Tested on Leonhard and Euler.
Machine learning and vision tasks.
Python 3.8.5 & GCC/6.3.0
Install:
conda env create -f ./conda/py38.yml
FRAMEWORKS:
- torch=1.7.1+cu110
- scikit-learn=0.24
- scipy=1.6.1
- numpy=1.19.2
- pandas=1.2.3
- pytorch-lightning=1.2.3
- opencv=4.5.1
UTILS:
- imageio=2.9.0
- pillow=8.1.2
- torchvision=0.8.2+cu110
- h5py=h5py
- matplotlib=3.3.4
MONITORING:
- neptune-client=0.5.1
- tensorboard=2.4.1
Template Project Overview
Setting up Environment Variables on the cluster
Append the following lines to the end of your ~/.bashrc file. vi ~/.bashrc
export NEPTUNE_API_TOKEN="token"
export ENV_WORKSTATION_NAME="""leonhard"""
NEPTUNE_API_TOKEN: Specify your neptune.ai key for debugging (only necessary if you want to use neptune).
ENV_WORKSTATION_NAME: Specify the name of the cluster. You can read this variable from your Python script to keep track of which cluster you're on. It is also used to load the environment YAML file with the same name (e.g. /home/jonfrey/ASL_leonhard_euler/cfg/env/euler.yml), where you can specify cluster-specific paths and settings.
This allows you to easily move between your workstation and the cluster.
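Reading the variable and locating the matching YAML file could look like the sketch below. The function name env_config_path is hypothetical, and the cfg/env/<name>.yml layout is inferred from the example path above rather than taken from the template project's code.

```python
import os
from pathlib import Path


def env_config_path(project_root: str, environ=os.environ) -> Path:
    """Return <project_root>/cfg/env/<ENV_WORKSTATION_NAME>.yml."""
    name = environ.get("ENV_WORKSTATION_NAME")
    if not name:
        raise RuntimeError("ENV_WORKSTATION_NAME is not set; add it to ~/.bashrc")
    return Path(project_root) / "cfg" / "env" / f"{name}.yml"
```

Passing the environment as an argument keeps the function easy to test on a workstation without touching the real shell configuration.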
Using Ansible
Installation
Follow the installing ansible on Ubuntu guide.
(https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html)
Configure
Configure ansible settings by modifying the following files.
sudo vi /etc/ansible/ansible.cfg
[defaults]
remote_user=username
host_key_checking = False
sudo_flags=-H -S
private_key_file = /home/jonfrey/.ssh/id_rsa
[ssh_connection]
pipelining = True
sudo vi /etc/ansible/hosts
[leonhard]
login.leonhard.ethz.ch ansible_ssh_user=username
[euler]
euler.ethz.ch ansible_ssh_user=username
Replace the username with your ETH email abbreviation.
Testing the settings
You should now be able to ping the configured hosts: Command:
sudo ansible all -m ping
Result:
euler.ethz.ch | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"ping": "pong"
}
login.leonhard.ethz.ch | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python"
},
"changed": false,
"ping": "pong"
}
Scheduling Jobs Using Playbooks
First, have a look at the official documentation (https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html)
Example Playbook (ansible/queue_jobs.yml):
---
- name: Schedule Experiments
  hosts: euler
  vars:
    - project_dir: "{{ ansible_env.HOME }}/"
  tasks:
    - name: Sync
      synchronize:
        src: /home/jonfrey/ASL_leonhard_euler
        dest: "{{ project_dir }}"
    - name: Load variables
      include_vars:
        file: /home/jonfrey/ASL_leonhard_euler/ansible/experiments.yml
        name: experiments
    - name: Schedule all experiments
      shell: >
        bsub -n 1 -W 0:10 -R "rusage[mem=5000,ngpus_excl_p=2]" -R "select[gpu_mtotal0>=10000]" -R "rusage[scratch=1000]" $HOME/ASL_leonhard_euler/scripts/submit.sh --exp={{ item.exp }}
      loop: "{{ experiments.jobs }}"
Playbook Explanation:
- Specify the execution host:
hosts: euler
The available hosts can be found in the previously set up /etc/ansible/hosts file.
- Synchronize your local code with the cluster:
synchronize
You can modify the dest and src paths as needed.
It is also possible to use rsync instead here.
- Load variables:
Loads ansible/experiments.yml, where the paths to the experiment files are listed.
Each entry in the jobs list is handled separately. We loop over the jobs list in the next command.
- Scheduling:
Schedules the job with the bash command and sets the correct exp file path for each experiment. The scripts/submit.sh file loads the correct modules and starts main.py with the template conda environment. The arguments passed to the script (--exp=) are passed on to main.py. The loop keyword tells ansible to loop over the list.
loop: "{{ experiments.jobs }}"
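To preview which commands the loop will run before touching the cluster, the expansion can be reproduced in a few lines of Python. This is only an illustration of what the playbook does per item: build_commands is a hypothetical helper, and the jobs list is assumed to be a list of entries with an exp key, as the item.exp reference in the playbook suggests.

```python
from typing import Dict, List

# Same command as in the playbook; {exp} takes the place of {{ item.exp }}.
BSUB_TEMPLATE = (
    'bsub -n 1 -W 0:10 -R "rusage[mem=5000,ngpus_excl_p=2]" '
    '-R "select[gpu_mtotal0>=10000]" -R "rusage[scratch=1000]" '
    "$HOME/ASL_leonhard_euler/scripts/submit.sh --exp={exp}"
)


def build_commands(jobs: List[Dict[str, str]]) -> List[str]:
    """Return one bsub command line per entry of the jobs list."""
    return [BSUB_TEMPLATE.format(exp=job["exp"]) for job in jobs]


if __name__ == "__main__":
    # Hypothetical experiment files, for illustration only.
    for command in build_commands([{"exp": "cfg/exp/a.yml"}, {"exp": "cfg/exp/b.yml"}]):
        print(command)
```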
Command:
sudo ansible-playbook ansible/queue_jobs.yml
Result:
PLAY [Schedule Experiments] *********************************************************************************
TASK [Gathering Facts] **************************************************************************************
ok: [euler.ethz.ch]
TASK [Sync] *************************************************************************************************
changed: [euler.ethz.ch]
TASK [Load experiments] *************************************************************************************
ok: [euler.ethz.ch]
TASK [Schedule all experiments] ******************************************************************************
changed: [euler.ethz.ch] => (item={u'exp': u'/home/jonfrey/ASL_leonhard_euler/cfg/exp/exp.yml'})
changed: [euler.ethz.ch] => (item={u'exp': u'/home/jonfrey/ASL_leonhard_euler/cfg/exp/exp.yml'})
PLAY RECAP **************************************************************************************************
euler.ethz.ch : ok=4 changed=2 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
(base) [jonfrey@eu-login-11 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
165381072 jonfrey PEND gpu.4h eu-login-21 *p/exp.yml Mar 15 07:00
165381081 jonfrey PEND gpu.4h eu-login-21 *p/exp.yml Mar 15 07:00
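When many experiments are queued, it can help to read the bjobs listing from a script. The sketch below parses text like the listing above; parse_bjobs is a hypothetical helper, and it only extracts the first four columns because the EXEC_HOST column is empty for pending jobs, which makes the later columns ambiguous when splitting on whitespace.

```python
from typing import Dict, List


def parse_bjobs(output: str) -> List[Dict[str, str]]:
    """Extract JOBID, USER, STAT and QUEUE from `bjobs` text output."""
    jobs = []
    for line in output.splitlines():
        fields = line.split()
        # Skip the header and anything that does not start with a numeric job ID.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        jobs.append(dict(zip(("jobid", "user", "stat", "queue"), fields[:4])))
    return jobs
```

The text to parse can be captured on the cluster, for example by redirecting the output of bjobs to a file.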
TODO:
- git clone instructions and repository overview
- Workflow diagram ansible
- Add my paramiko script
- Extend ansible to use -o correctly
- neptuneai