name,url,text AI tooling and methodology handbook,https://docs.mila.quebec/Handbook.html#ai-tooling-and-methodology-handbook,"AI tooling and methodology handbook This section seeks to provide researchers with insightful articles pertaining to aspects of methodology in their work. " What is a computer cluster?,https://docs.mila.quebec/Theory_cluster.html#what-is-a-computer-cluster,"What is a computer cluster? A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. " Parts of a computing cluster,https://docs.mila.quebec/Theory_cluster.html#parts-of-a-computing-cluster,"Parts of a computing cluster To provide high performance computation capabilities, clusters can combine hundreds to thousands of computers, called nodes, which are all inter-connected with a high-performance communication network. Most nodes are designed for high-performance computations, but clusters can also use specialized nodes to offer parallel file systems, databases, login nodes and even the cluster scheduling functionality, as pictured in the image below. We will give an overview of the different types of nodes you can encounter on a typical cluster. " The login nodes,https://docs.mila.quebec/Theory_cluster.html#the-login-nodes,"The login nodes To execute computing processes on a cluster, you must first connect to it, and this is accomplished through a login node. These so-called login nodes are the entry point to most clusters. Another entry point to some clusters, such as the Mila cluster, is the JupyterHub web interface, but we’ll read about that later. For now, let’s return to the subject of this section: login nodes. To connect to these, you would typically use a remote shell connection. The most common tool to do so is SSH. You’ll hear and read a lot about this tool. Imagine it as a very long (and somewhat magical) extension cord which connects the computer you are using now, such as your laptop, to a remote computer’s terminal shell. You might already know what a terminal shell is if you have ever used the command line. " The compute nodes,https://docs.mila.quebec/Theory_cluster.html#the-compute-nodes,"The compute nodes In the field of artificial intelligence, you will usually be on the hunt for GPUs. In most clusters, the compute nodes are the ones with GPU capacity. While clusters generally tend towards a homogeneous configuration for nodes, this is not always possible in the field of artificial intelligence, as the hardware evolves rapidly and is continually complemented by new hardware. Hence, you will often read about classes of compute nodes, some of which might have different GPU models or even no GPU at all. For the Mila cluster you will find this information in the Node profile description section. For now, simply keep in mind that you should be aware of which nodes your code is running on. More on that later. " The storage nodes,https://docs.mila.quebec/Theory_cluster.html#the-storage-nodes,"The storage nodes Some computers on a cluster serve only to store and serve files. While the names of these computers might matter to some, as a user, you’ll only be concerned about the path to the data. More on that in the Processing data section. 
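To make "be aware of which nodes your code is running on" concrete, here is a minimal, hedged sketch of commands you could run from inside a job allocation to confirm where you landed (the environment variables shown are standard Slurm ones, but what they contain depends on the cluster's configuration):

```bash
# Run these inside a job (e.g. in an interactive srun/salloc session):
hostname                                      # which compute node am I on?
echo "$SLURM_JOB_ID on $SLURM_JOB_NODELIST"   # job ID and allocated node(s)
nvidia-smi                                    # which GPU(s) can I see, if any?
```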
" Different nodes for different uses,https://docs.mila.quebec/Theory_cluster.html#different-nodes-for-different-uses,"Different nodes for different uses It is important to note here the difference in intended uses between the compute nodes and the login nodes. While the compute nodes are meant for heavy computation, the login nodes are not. The login nodes however are used by everyone who uses the cluster and care must be taken not to overburden these nodes. Consequently, only very short and light processes should be run on these otherwise the cluster may become inaccessible. In other words, please refrain from executing long or compute intensive processes on login nodes because it affects all other users. In some cases, you will also find that doing so might get you into trouble. " UNIX,https://docs.mila.quebec/Theory_cluster.html#unix,"UNIX All clusters typically run on GNU/Linux distributions. Hence a minimum knowledge of GNU/Linux and BASH is usually required to use them. See the following tutorial for a rough guide on getting started with Linux. " The workload manager,https://docs.mila.quebec/Theory_cluster.html#the-workload-manager,"The workload manager On a cluster, users don’t have direct access to the compute nodes but instead connect to a login node and add jobs to the workload manager queue. Whenever there are resources available to execute these jobs they will be allocated to a compute node and run, which can be immediately or after a wait of up to several days. A job is comprised of a number of steps that will run one after the other. This is done so that you can schedule a sequence of processes that can use the results of the previous steps without having to manually interact with the scheduler. Each step can have any number of tasks which are groups of processes that can be scheduled independently on the cluster but can run in parallel if there are resources available. The distinction between steps and tasks is that multiple tasks, if they are part of the same step, cannot depend on results of other tasks because there are no guarantees on the order in which they will be executed. Finally each process group is the basic unit that is scheduled in the cluster. It comprises of a set of processes (or threads) that can run on a number of resources (CPU, GPU, RAM, …) and are scheduled together as a unit on one or more machines. Each of these concepts lends itself to a particular use. For multi-gpu training in AI workloads you would use one task per GPU for data paralellism or one process group if you are doing model parallelism. Hyperparameter optimisation can be done using a combination of tasks and steps but is probably better left to a framework outside of the scope of the workload manager. If this all seems complicated, you should know that all these things do not need to always be used. It is perfectly acceptable to sumbit jobs with a single step, a single task and a single process. The available resources on the cluster are not infinite and it is the workload manager’s job to allocate them. Whenever a job request comes in and there are not enough resources available to start it immediately, it will go in the queue. Once a job is in the queue, it will stay there until another job finishes and then the workload manager will try to use the newly freed resources with jobs from the queue. 
The exact order in which the jobs will start is not fixed, because it depends on the local policies, which can take into account the user priority, the time since the job was requested, the amount of resources requested and possibly other things. There should be a tool that comes with the workload manager which lets you see the status of your queued jobs and why they remain in the queue. The workload manager will divide the cluster into partitions according to the configuration set by the admins. A partition is a set of machines typically reserved for a particular purpose. An example might be CPU-only machines for preprocessing, set up as a separate partition. It is possible for multiple partitions to share resources. There will always be at least one partition that is the default partition, into which jobs without a specific request will go. Other partitions can be requested, but might be restricted to a group of users, depending on policy. Partitions are useful from a policy standpoint to ensure efficient use of the cluster resources and to avoid one resource type being used up while blocking the use of another. They are also useful for heterogeneous clusters where different hardware is mixed in and not all software is compatible with all of it (for example x86 and POWER CPUs). To ensure a fair share of the computing resources for all, the workload manager establishes limits on the amount of resources that a single user can use at once. These can be hard limits, which prevent running jobs when you go over them, or soft limits, which will let you run jobs, but only until some other job needs the resources. Admin policy will determine what those exact limits are for a particular cluster or user and whether they are hard or soft limits. Soft limits are enforced using preemption: when another job with higher priority needs the resources that your job is using, your job will receive a signal that it needs to save its state and exit. It will be given a certain amount of time to do this (the grace period, which may be 0s) and then forcefully terminated if it is still running. Depending on the workload manager in use and the cluster configuration, a job that is preempted like this may be automatically rescheduled to have a chance to finish, or it may be up to the job to reschedule itself. The other limit you can encounter is a job that goes over its declared limits. When you schedule a job, you declare how much of each resource it will need (RAM, CPUs, GPUs, …). Some of those may have default values and not be explicitly defined. For certain types of devices, like GPUs, access to units over your job limit is made unavailable. For others, like RAM, usage is monitored and your job will be terminated if it goes too much over. This makes it important to ensure you estimate resource usage accurately. Mila, as well as the Digital Research Alliance of Canada, uses the workload manager Slurm to schedule and allocate resources on their infrastructure. Slurm client commands are available on the login nodes for you to submit jobs to the main controller and add your job to the queue. Jobs are of 2 types: batch jobs and interactive jobs. For practical examples of Slurm commands on the Mila cluster, see Running your code." 
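To make the job/step/task vocabulary above concrete, here is a minimal, hedged Slurm sketch (the module, script names and resource values are placeholders): each srun invocation is a step, the steps run one after the other, and --ntasks controls how many tasks a step launches in parallel.

```bash
#!/bin/bash
#SBATCH --ntasks=4            # at most 4 tasks can run in parallel in a step
#SBATCH --cpus-per-task=2
#SBATCH --time=01:00:00

module load python/3.10                 # placeholder module

srun --ntasks=1 python preprocess.py    # step 1: a single task
srun --ntasks=4 python train.py         # step 2: four tasks running in parallel
srun --ntasks=1 python evaluate.py      # step 3: starts only after step 2 ends
```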
Processing data,https://docs.mila.quebec/Theory_cluster.html#processing-data,"Processing data Several techniques exist for processing the large amounts of data common in deep learning, whether for dataset preprocessing or training. Each has typical uses and limitations. " Data parallelism,https://docs.mila.quebec/Theory_cluster.html#data-parallelism,"Data parallelism The first technique is called data parallelism (aka task parallelism in formal computer science). You simply run many processes, each handling a portion of the data you want to process. This is by far the easiest technique to use and should be favored whenever possible. A common example of this is hyperparameter optimisation. For really small computations, the time to set up multiple processes might be longer than the processing time itself and lead to waste. This can be addressed by bunching some of the processes together, doing sequential processing of sub-partitions of the data. On cluster systems it is also inadvisable to launch thousands of jobs; even if each job runs for a reasonable amount of time (several minutes at minimum), it is best to make larger groups until the number of jobs is in the low hundreds at most. Finally, another thing to keep in mind is that the transfer bandwidth between the filesystems (see Filesystem concerns) and the compute nodes is limited; if you run too many jobs using too much data at once, they may end up being no faster because they will spend their time waiting for data to arrive. " Model parallelism,https://docs.mila.quebec/Theory_cluster.html#model-parallelism,"Model parallelism The second technique is called model parallelism (which doesn’t have a single equivalent in formal computer science). It is used mostly when a single instance of a model will not fit in a computing resource (such as the GPU memory being too small for all the parameters). In this case, the model is split into its constituent parts, each processed independently, with their intermediate results communicated to each other to arrive at a final result. This is generally harder to do, but necessary to work with larger, more powerful models like GPT. " Communication concerns,https://docs.mila.quebec/Theory_cluster.html#communication-concerns,"Communication concerns The main difference between these two approaches is the need for communication between the multiple processes. Some common training methods, like stochastic gradient descent, sit somewhere between the two, because they require some communication, but not a lot. Most people classify it as data parallelism since it sits closer to that end. In general, for data-parallel tasks or tasks that communicate infrequently, it doesn’t make a lot of difference where the processes sit, because the communication bandwidth and latency will not have a lot of impact on the time it takes to complete the job. The individual tasks can generally be scheduled independently. On the contrary, for model parallelism you need to pay more attention to where your tasks are. In this case it is usually required to use the facilities of the workload manager to group the tasks so that they are on the same machine, or on machines that are closely linked, to ensure optimal communication. 
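As a hedged illustration of "using the facilities of the workload manager to group the tasks", a Slurm request for a model-parallel job can pin every task to a single node so that inter-GPU communication stays local (GPU count, resources and script name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1               # keep all tasks on the same machine
#SBATCH --ntasks-per-node=4     # one task per GPU
#SBATCH --gres=gpu:4            # request 4 GPUs on that node
#SBATCH --cpus-per-task=4

srun python train_model_parallel.py   # placeholder training script
```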
The best allocation depends on the specific cluster architecture available and the technologies it supports (such as InfiniBand, RDMA, NVLink or others). " Filesystem concerns,https://docs.mila.quebec/Theory_cluster.html#filesystem-concerns,"Filesystem concerns When working on a cluster, you will generally encounter several different filesystems. Usually there will be names such as ‘home’, ‘scratch’, ‘datasets’, ‘projects’, ‘tmp’. The reason for having different filesystems available instead of a single giant one is to provide for different use cases. For example, the ‘datasets’ filesystem would be optimized for fast reads but have slow write performance. This is because datasets are usually written once and then read very often for training. Different filesystems have different performance levels. For instance, backed-up filesystems (such as $PROJECT in Digital Research Alliance of Canada clusters) provide more space and can handle large files, but cannot sustain the highly parallel accesses typically required for high-speed model training. The set of filesystems provided by the cluster you are using should be detailed in the documentation for that cluster, and the names can differ from those above. You should pay attention to their recommended use cases in the documentation and use the appropriate filesystem for the appropriate job. There are cases where a job ran hundreds of times slower because it tried to use a filesystem that wasn’t a good fit for the job. One last thing to pay attention to is the data retention policy for the filesystems. This has two subpoints: how long the data is kept, and whether there are backups. Some filesystems will have a limit on how long they keep their files. Typically the limit is some number of days (like 90 days) but can also be ‘as long as the job runs’ for some. As for backups, some filesystems will have no retention limit, but will also have no backups. For those it is important to maintain a copy of any crucial data somewhere else. The data will not be purposefully deleted, but the filesystem may fail and lose all or part of its data. If you have any data that is crucial for a paper or your thesis, keep an additional copy of it somewhere else. " Software on the cluster,https://docs.mila.quebec/Theory_cluster.html#software-on-the-cluster,"Software on the cluster This section aims to raise awareness of problems one can encounter when trying to run software on different computers, and how this is dealt with on typical computation clusters. The Mila cluster and the Digital Research Alliance of Canada clusters both provide various useful software and computing environments, which can be activated through the module system. Alternatively, you may build containers with your desired software and run them on compute nodes. Regarding Python development, we recommend using virtual environments to install Python packages in isolation. " Cluster software modules,https://docs.mila.quebec/Theory_cluster.html#cluster-software-modules,"Cluster software modules Modules are small files which modify your environment variables to point to specific versions of various software and libraries. For instance, a module might provide the python command to point to Python 3.7, another might activate CUDA version 11.0, another might provide the torch package, and so on. For more information, see The module command. " Containers,https://docs.mila.quebec/Theory_cluster.html#containers,"Containers Containers are a special form of isolation of software and its dependencies. 
A container is essentially a lightweight virtual machine: it encapsulates a virtual file system for a full OS installation, as well as a separate network and execution environment. For example, you can create an Ubuntu container in which you install various packages using apt, modify settings as you would as a root user, and so on, but without interfering with your main installation. Once built, a container can be run on any compatible system. For more information, see Using containers on clusters. " Python Virtual environments,https://docs.mila.quebec/Theory_cluster.html#python-virtual-environments,"Python Virtual environments A virtual environment in Python is a local, isolated environment in which you can install or uninstall Python packages without interfering with the global environment (or other virtual environments). In order to use a virtual environment, you first have to activate it. For more information, see Virtual environments. " "Who, what, where is IDT",https://docs.mila.quebec/IDT.html#who-what-where-is-idt,"Who, what, where is IDT This section seeks to help Mila researchers understand the mission and role of the IDT team. " IDT’s mission,https://docs.mila.quebec/IDT.html#idt-s-mission,"IDT’s mission " The IDT team,https://docs.mila.quebec/IDT.html#the-idt-team,"The IDT team See https://mila.quebec/en/mila/team/?cat_id=143 " Purpose of this documentation,https://docs.mila.quebec/Purpose.html#purpose-of-this-documentation,"Purpose of this documentation This documentation aims to cover the information required to run scientific and data-intensive computing tasks at Mila and the available resources for its members. It also aims to be an outlet for sharing know-how, tips and tricks and examples from the IDT team to the Mila researcher community. " Intended audience,https://docs.mila.quebec/Purpose.html#intended-audience,"Intended audience This documentation is mainly intended for Mila researchers who have access to the Mila cluster. This access is determined by your researcher status. See Roles and authorizations for more information. The core of the information serving this purpose can be found in the following section: Computing infrastructure and policies. However, we also aim to provide more general information which can be useful outside the scope of using the Mila cluster, for instance more general theory on computational considerations. With this in mind, we hope the documentation can be of use to all Mila members. " Contributing,https://docs.mila.quebec/Purpose.html#contributing,"Contributing See the following file for contribution guidelines: # Contributing to the Mila Docs Thank you for your interest in making the documentation better for everyone at Mila. Here are some guidelines to help bring your contributions to life. ## What should be included in the Mila Docs * Mila cluster usage * Digital Research Alliance of Canada cluster usage * Job management tips / tricks * Research good practices * Software development good practices * Useful tools **_NOTE_**: Examples should aim to not consume much more than 1 GPU/hour and 2 CPU/hour ## Issues / Pull Requests ### Issues Issues can be used to report any error in the documentation, missing or unclear sections, broken tools or other suggestions to improve the overall documentation. ### Pull Requests PRs are welcome and we value the contents of contributions over the appearance or functionality of the pull request. 
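If you are comfortable with git, preparing such a pull request typically looks like the hedged sketch below (the fork URL and branch name are placeholders; use your own fork of the documentation repository):

```bash
git clone <your-fork-url> mila-docs    # placeholder: URL of your fork
cd mila-docs
git checkout -b improve-storage-docs   # placeholder branch name
# ... edit the relevant files under docs/ ...
git add docs/
git commit -m "Clarify the storage section"
git push origin improve-storage-docs
# then open the pull request from your fork's web page
```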
If you don't know how to write the proper markup in reStructuredText, simply provide the content you would like to add in the PR description, which supports markdown, or with instructions on how to format the content. In the PR, reference the related issues like this: ``` Resolves: #123 See also: #456, #789 ``` If you would like to contribute directly to the source of the documentation, keep the line width to 80 characters or less. You can attempt to build the docs yourself to see if the formatting is right: ```console python3 -m pip install -r docs/requirements.txt sphinx-build -b html docs/ docs/_build/ ``` This will produce the HTML version of the documentation, which you can navigate by opening the local file `docs/_build/index.html`. If you have any trouble building the docs, don't hesitate to open an issue to request help. Regarding the reStructuredText format, you can simply provide the content you would like to add in markdown or plain text format if that is more convenient for you, and someone down the line should take responsibility for converting the format. ## Sphinx / reStructuredText (reST) The markup language used for the Mila Docs is [reStructuredText](http://docutils.sourceforge.net/rst.html) and we follow [Python’s Style Guide for documenting](https://docs.python.org/devguide/documenting.html#style-guide). Here are some of the reST syntax directives which are useful to know (more can be found in [Sphinx's reST Primer](https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html)): ### Inline markup * one asterisk: `*text*` for *emphasis* (italics), * two asterisks: `**text**` for **strong emphasis** (boldface), and * backquotes: ` ``text`` ` for `code samples`, and * external links: `` `Link text <URL>`_ ``. ### Lists ```reST * this is * a list * with a nested list * and some subitems * and here the parent list continues ``` ### Sections ```reST ################# This is a heading ################# ``` There are no heading levels assigned to certain characters, as the structure is determined from the succession of headings. However, the Python documentation suggests the following convention: * `#` with overline, for parts * `*` with overline, for chapters * `=`, for sections * `-`, for subsections * `^`, for subsubsections * `""`, for paragraphs ### Note box ```reST .. note:: This is a long long long note ``` ### Collapsible boxes This is a local extension, not part of Sphinx itself. It works like this: ```reST .. container:: toggle .. container:: header **Show/Hide Code** .. code-block:: ... ``` " Visual Studio Code,https://docs.mila.quebec/VSCode.html#visual-studio-code,"Visual Studio Code One editor of choice for many researchers is VSCode. One feature of VSCode is remote editing through SSH. This allows you to edit files on the cluster as if they were local. You can also debug your programs using VSCode’s debugger, open terminal sessions, etc. " Connecting to the cluster,https://docs.mila.quebec/VSCode.html#connecting-to-the-cluster,"Connecting to the cluster VSCode cannot be used to edit code on the login nodes, because it is a heavy enough process (a node process, plus the language server, linter, and possibly other plugins depending on your configured environment) that there is a risk of overloading the login nodes if too many researchers do it at the same time. Therefore, to use VSCode on the cluster, you first need to allocate a compute node, then connect to that node. 
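Without milatools, the manual version of "allocate a compute node, then connect to that node" might look like the hedged sketch below (resource values are placeholders, and the SSH setup it relies on is described in the SSH sections of this documentation):

```bash
# 1. From a login node, request an allocation and note the node name
#    printed by Slurm (e.g. cn-c001):
salloc --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=2:00:00

# 2. From your local machine, point VSCode's Remote-SSH extension at that
#    node (e.g. cn-c001.server.mila.quebec), using the ProxyJump/wildcard
#    SSH configuration shown later in this documentation.
```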
The milatools package provides a command to make the operation easier. More info can be found here. " Activating an environment,https://docs.mila.quebec/VSCode.html#activating-an-environment,"Activating an environment Reference To activate a conda or pip environment, you can open the command palette with Ctrl+Shift+P and type “Python: Select interpreter”. This will prompt you for the path to the Python executable for your environment. Tip If you already have the environment activated in a terminal session, you can run the command which python to get the path for this environment. This path can be pasted into the interpreter selection prompt in VSCode to use that same environment. " Troubleshooting,https://docs.mila.quebec/VSCode.html#troubleshooting,"Troubleshooting " “Cannot reconnect”,https://docs.mila.quebec/VSCode.html#cannot-reconnect,"“Cannot reconnect” When connecting to multiple compute nodes (and/or from multiple computers), some instances may crash with that message because of conflicts in the lock files VSCode installs in ~/.vscode-server (which is shared across all compute nodes). To fix this issue, you can change this setting in your settings.json file: { ""remote.SSH.lockfilesInTmp"": true } This will store the necessary lockfiles in /tmp on the compute nodes (which is local to each node). " Debugger timeouts,https://docs.mila.quebec/VSCode.html#debugger-timeouts,"Debugger timeouts Sometimes, slowness on the compute node or the networked filesystem might cause the VSCode debugger to time out when starting a remote debug process. As a quick fix, you can add this to your ~/.bashrc or ~/.profile or equivalent resource file for your preferred shell, to increase the timeout delay to 500 seconds: export DEBUGPY_PROCESS_SPAWN_TIMEOUT=500 " Computational resources outside of Mila,https://docs.mila.quebec/Extra_compute.html#computational-resources-outside-of-mila,"Computational resources outside of Mila This section seeks to provide insights and information on computational resources outside the Mila cluster itself. " Digital Research Alliance of Canada Clusters,https://docs.mila.quebec/Extra_compute.html#digital-research-alliance-of-canada-clusters,"Digital Research Alliance of Canada Clusters Beluga, Cedar, Graham, Narval and Niagara are clusters provided by the Digital Research Alliance of Canada organisation (the Alliance). For Mila researchers, these clusters are to be used for larger experiments having many jobs, multi-node computation and/or multi-GPU jobs, as well as long-running jobs. If you use these resources for your research, please remember to acknowledge their use in your papers. Note Compute Canada ceased its operational responsibilities for supporting Canada’s national advanced research computing (ARC) platform on March 31, 2022. The services are now supported by the Digital Research Alliance of Canada. https://ace-net.ca/compute-canada-operations-move-to-the-digital-research-alliance-of-canada-(the-alliance).html " Current allocation description,https://docs.mila.quebec/Extra_compute.html#current-allocation-description,"Current allocation description Clusters of the Alliance are shared with researchers across the country. Allocations are given by the Alliance to selected research groups to guarantee them a minimum amount of computational resources throughout the year. Depending on your affiliation, you will have access to different allocations. 
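A hedged way to see which Slurm accounts (allocations) your username is associated with on a given cluster, assuming the accounting database can be queried by regular users there:

```bash
# List the accounts you can submit with on this cluster
sacctmgr show associations user=$USER format=Account%25,Cluster,Partition
```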
If you are a student at the University of Montreal, you can have access to the rrg-bengioy-ad allocation described below. If you are a student from another university, you should ask your advisor which allocations you could have access to. From the Alliance’s documentation: An allocation is an amount of resources that a research group can target for use for a period of time, usually a year. To be clear, it is not a maximum amount of resources that can be used simultaneously; it is a weighting factor used by the workload manager to balance jobs. For instance, even though we are allocated 400 GPU-years across all clusters, we can use more or less than 400 GPUs simultaneously depending on the history of usage from our group and other groups using the cluster at a given period of time. Please see the Alliance’s documentation for more information on how allocations and resource scheduling are configured for these installations. The table below provides information on the allocation for rrg-bengioy-ad for the period which spans from April 2022 to April 2023. Note that there are no special allocations for GPUs on Graham and therefore jobs with GPUs should be submitted with the account def-bengioy.
Cluster | CPUs (#, account) | GPUs (model, #, SLURM type specifier, account)
Beluga | 238, rrg-bengioy-ad | V100-16G, 77, v100, rrg-bengioy-ad
Cedar | 34, rrg-bengioy-ad | V100-32G, 138, v100l, rrg-bengioy-ad
Graham | 34, rrg-bengioy-ad | various, –, –, def-bengioy
Narval | 34, rrg-bengioy-ad | A100-40G, 185, a100, rrg-bengioy-ad
" Account Creation,https://docs.mila.quebec/Extra_compute.html#account-creation,"Account Creation To access the Alliance clusters you have to first create an account at https://ccdb.computecanada.ca. Use a password with at least 8 characters, mixed case letters, digits and special characters. Later you will be asked to create another password with those rules, and it’s really convenient if the two passwords are the same. Then, you have to apply for a role at https://ccdb.computecanada.ca/me/add_role, which basically means telling the Alliance that you are part of the lab so they know which clusters you can have access to, and can track your usage. You will be asked for the CCRI (see screenshot below). Please reach out to your sponsor to get the CCRI. You will need to wait for your sponsor to accept before being able to log in to the Alliance clusters. " Clusters,https://docs.mila.quebec/Extra_compute.html#clusters,"Clusters Beluga: (Mila doc) (Digital Research Alliance of Canada doc) For most students, Beluga is the best choice for both CPU and GPU jobs because of larger allocations on this cluster. Narval: (Mila doc) (Digital Research Alliance of Canada doc) Narval is the newest cluster, and contains the most powerful GPUs (A100). If your job can benefit from the A100’s features, such as TF32 floating-point math, Narval is the best choice. Cedar: (Mila doc) (Digital Research Alliance of Canada doc) Cedar is a good alternative to Beluga if you absolutely need to have an internet connection on the compute nodes. Graham: (Mila doc) (Digital Research Alliance of Canada doc) We do not have a GPU allocation on Graham anymore, but it remains an alternative for CPU jobs. Niagara: (Mila doc) (Digital Research Alliance of Canada doc) Niagara is not recommended for most students. It is a CPU-only cluster with unusual configurations. Access is not automatic; it is opt-in and must be requested via CCDB manually. Compute resources on Niagara are not assigned to jobs on a per-CPU basis, but on a per-node basis. 
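Because Niagara hands out whole nodes rather than individual CPUs, a job request there is expressed in nodes and tasks per node; a minimal hedged sketch (core count and script name are placeholders):

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=40    # placeholder: use all cores of the node
#SBATCH --time=1:00:00

srun python preprocess_dataset.py   # placeholder CPU-only workload
```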
" Beluga,https://docs.mila.quebec/Extra_compute.html#beluga,"Beluga Beluga is a cluster located at ÉTS in Montreal. It uses SLURM to schedule jobs. Its full documentation can be found here, and its current status here. You can access Beluga via ssh: ssh @beluga.computecanada.ca Where is the username you created previously (see Account Creation). " Launching Jobs,https://docs.mila.quebec/Extra_compute.html#launching-jobs,"Launching Jobs Users must specify the resource allocation Group Name using the flag --account=rrg-bengioy-ad. To launch a CPU-only job: sbatch --time=1:0:0 --account=rrg-bengioy-ad job.sh Note The account name will differ based on your affiliation. To launch a GPU job: sbatch --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 job.sh And to get an interactive session, use the salloc command: salloc --time=1:0:0 --account=rrg-bengioy-ad --gres=gpu:1 The full documentation for jobs launching on Beluga can be found here. " Beluga nodes description,https://docs.mila.quebec/Extra_compute.html#beluga-nodes-description,"Beluga nodes description Each GPU node consists of: 40 CPU cores 186 GB RAM 4 GPU NVIDIA V100 (16GB) Tip You should ask for max 10 CPU cores and 32 GB of RAM per GPU you are requesting (as explained here), otherwise, your job will count for more than 1 allocation, and will take more time to get scheduled. " Beluga Storage,https://docs.mila.quebec/Extra_compute.html#beluga-storage,"Beluga Storage Storage Path Usage $HOME /home// Code Specific libraries $HOME/projects /project/rpp-bengioy Compressed raw datasets $SCRATCH /scratch/ Processed datasets Experimental results Logs of experiments $SLURM_TMPDIR Temporary job results They are roughly listed in order of increasing performance and optimized for different uses: The $HOME folder on NFS is appropriate for codes and libraries which are small and read once. Do not write experiemental results here! The $HOME/projects folder should only contain compressed raw datasets (processed datasets should go in $SCRATCH). We have a limit on the size and number of file in $HOME/projects, so do not put anything else there. If you add a new dataset there (make sure it is readable by every member of the group using chgrp -R rpp-bengioy ). The $SCRATCH space can be used for short term storage. It has good performance and large quotas, but is purged regularly (every file that has not been used in the last 3 months gets deleted, but you receive an email before this happens). $SLURM_TMPDIR points to the local disk of the node on which a job is running. It should be used to copy the data on the node at the beginning of the job and write intermediate checkpoints. This folder is cleared after each job. When an experiment is finished, results should be transferred back to Mila servers. More details on storage can be found here. " Modules,https://docs.mila.quebec/Extra_compute.html#modules,"Modules Many software, such as Python or MATLAB are already compiled and available on Beluga through the module command and its subcommands. Its full documentation can be found here. module avail Displays all the available modules module load Loads module spider Shows specific details about In particular, if you with to use Python 3.6 you can simply do: module load python/3.6 Tip If you wish to use Python on the cluster, we strongly encourage you to read Alliance Python Documentation, and in particular the Pytorch and/or Tensorflow pages. The cluster has many Python packages (or wheels), such already compiled for the cluster. 
See here for the details. In particular, you can browse the available packages with: avail_wheels <name> Such wheels can be installed using pip. Moreover, the most efficient way to use modules on the cluster is to build your environment inside your job. See the script example below. " Script Example,https://docs.mila.quebec/Extra_compute.html#script-example,"Script Example Here is a sbatch script that follows good practices on Beluga:
#!/bin/bash
#SBATCH --account=rrg-bengioy-ad          # Yoshua pays for your job
#SBATCH --cpus-per-task=6                 # Ask for 6 CPUs
#SBATCH --gres=gpu:1                      # Ask for 1 GPU
#SBATCH --mem=32G                         # Ask for 32 GB of RAM
#SBATCH --time=3:00:00                    # The job will run for 3 hours
#SBATCH -o /scratch/<user>/slurm-%j.out   # Write the log in $SCRATCH

# 1. Create your environment locally
module load python/3.6
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index torch torchvision

# 2. Copy your dataset on the compute node
# IMPORTANT: Your dataset must be compressed in one single file (zip, hdf5, ...)!!!
cp $SCRATCH/<dataset.zip> $SLURM_TMPDIR

# 3. Eventually unzip your dataset
unzip $SLURM_TMPDIR/<dataset.zip> -d $SLURM_TMPDIR

# 4. Launch your job, tell it to save the model in $SLURM_TMPDIR
#    and look for the dataset into $SLURM_TMPDIR
python main.py --path $SLURM_TMPDIR --data_path $SLURM_TMPDIR

# 5. Copy whatever you want to save on $SCRATCH
cp $SLURM_TMPDIR/<to_save> $SCRATCH
" Using CometML and Wandb,https://docs.mila.quebec/Extra_compute.html#using-cometml-and-wandb,"Using CometML and Wandb The compute nodes for Beluga don’t have access to the internet, but there is a special module that can be loaded in order to allow training scripts to access some specific servers, which include the necessary servers for using CometML and Wandb (“Weights and Biases”). module load httpproxy More documentation about this can be found here. " Graham,https://docs.mila.quebec/Extra_compute.html#graham,"Graham Graham is a cluster located at the University of Waterloo. It uses SLURM to schedule jobs. Its full documentation can be found here, and its current status here. You can access Graham via ssh: ssh <username>@graham.computecanada.ca where <username> is the username you created previously (see Account Creation). Since its structure is similar to Beluga, please look at the Beluga documentation, as well as relevant parts of the Digital Research Alliance of Canada Documentation. Note For GPU jobs the resource allocation Group Name is the same as on Beluga, so you should use the flag --account=rrg-bengioy-ad for GPU jobs. " Cedar,https://docs.mila.quebec/Extra_compute.html#cedar,"Cedar Cedar is a cluster located at Simon Fraser University. It uses SLURM to schedule jobs. Its full documentation can be found here, and its current status here. You can access Cedar via ssh: ssh <username>@cedar.computecanada.ca where <username> is the username you created previously (see Account Creation). Since its structure is similar to Beluga, please look at the Beluga documentation, as well as relevant parts of the Digital Research Alliance of Canada Documentation. Note However, we don’t have any CPU priority on Cedar; in this case you can use --account=def-bengioy for CPU jobs. Thus, it might take some time before they start. " Niagara,https://docs.mila.quebec/Extra_compute.html#niagara,"Niagara Niagara is a cluster located at the University of Toronto. It uses SLURM to schedule jobs. Its full documentation can be found here, and its current status here. 
You can access Niagara via ssh: ssh <username>@niagara.computecanada.ca where <username> is the username you created previously (see Account Creation). Since its structure is similar to Beluga, please look at the Beluga documentation, as well as relevant parts of the Digital Research Alliance of Canada Documentation. " FAQ,https://docs.mila.quebec/Extra_compute.html#faq,"FAQ " What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found?,https://docs.mila.quebec/Extra_compute.html#what-to-do-with-importerror-lib64-libm-so-6-version-glibc-2-23-not-found,"What to do with ImportError: /lib64/libm.so.6: version GLIBC_2.23 not found? The structure of the file system is different from that of a classical Linux distribution, so your code has trouble finding libraries. See how to install binary packages. " Disk quota exceeded error on /project file systems,https://docs.mila.quebec/Extra_compute.html#disk-quota-exceeded-error-on-project-file-systems,"Disk quota exceeded error on /project file systems You have files in /project with the wrong permissions. See how to change permissions. " Computing infrastructure and policies,https://docs.mila.quebec/Information.html#computing-infrastructure-and-policies,"Computing infrastructure and policies This section seeks to provide factual information and policies on the Mila cluster computing environments. " Roles and authorizations,https://docs.mila.quebec/Information.html#roles-and-authorizations,"Roles and authorizations There are mainly two types of researcher status at Mila: core researchers and affiliated researchers. This is determined by Mila policy. Core researchers have access to the Mila computing cluster. See your supervisor’s Mila status to know what your own status is. " Overview of available computing resources at Mila,https://docs.mila.quebec/Information.html#overview-of-available-computing-resources-at-mila,"Overview of available computing resources at Mila The Mila cluster is to be used for regular development and a relatively small number of jobs (< 5). It is a heterogeneous cluster. It uses SLURM to schedule jobs. " Mila cluster versus Digital Research Alliance of Canada clusters,https://docs.mila.quebec/Information.html#mila-cluster-versus-digital-research-alliance-of-canada-clusters,"Mila cluster versus Digital Research Alliance of Canada clusters There are a lot of commonalities between the Mila cluster and the clusters from the Digital Research Alliance of Canada (the Alliance). At the moment, the Alliance clusters where we have a large allocation of resources are beluga, cedar, graham and narval. We also have comparable computational resources in the Mila cluster, with more to come. The main distinguishing factor is that we have more control over our own cluster than we have over the ones at the Alliance. Notably, also, the compute nodes in the Mila cluster all have unrestricted access to the Internet, which is not the case in general for the Alliance clusters (although cedar does allow it). At the time of this writing (June 2021), Mila students are advised to use a healthy mix of Mila and Alliance clusters. This is especially true in times when your favorite cluster is oversubscribed, because you can easily switch over to a different one if you are used to it. 
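Whichever cluster you end up on, a quick hedged way to see what partitions and GPU types it offers is sinfo from a login node:

```bash
# Partitions, node counts, GPU types (GRES) and node names
sinfo -o "%.15P %.6D %.25G %N"
```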
" Guarantees about one GPU as absolute minimum,https://docs.mila.quebec/Information.html#guarantees-about-one-gpu-as-absolute-minimum,"Guarantees about one GPU as absolute minimum There are certain guarantees that the Mila cluster tries to honor when it comes to giving at minimum one GPU per student, all the time, to be used in interactive mode. This is strictly better than “one GPU per student on average” because it’s a floor meaning that, at any time, you should be able to ask for your GPU, right now, and get it (although it might take a minute for the request to be processed by SLURM). Interactive sessions are possible on the Alliance clusters, and there are generally special rules that allow you to get resources more easily if you request them for a very short duration (for testing code before queueing long jobs). You do not get the same guarantee as on the Mila cluster, however. " Node profile description,https://docs.mila.quebec/Information.html#node-profile-description,"Node profile description Name GPU CPUs Sockets Cores/Socket Threads/Core Memory (GB) TmpDisk (TB) Arch Slurm Features Model Mem # GPU Arch and Memory GPU Compute Nodes cn-a[001-011] RTX8000 48 8 40 2 20 1 384 3.6 x86_64 turing,48gb cn-b[001-005] V100 32 8 40 2 20 1 384 3.6 x86_64 volta,nvlink,32gb cn-c[001-040] RTX8000 48 8 64 2 32 1 384 3 x86_64 turing,48gb cn-g[001-026] A100 80 4 64 2 32 1 1024 7 x86_64 ampere,nvlink,80gb DGX Systems cn-d[001-002] A100 40 8 128 2 64 1 1024 14 x86_64 ampere,nvlink,40gb cn-d[003-004] A100 80 8 128 2 64 1 2048 28 x86_64 ampere,nvlink,80gb cn-e[002-003] V100 32 8 40 2 20 1 512 7 x86_64 volta,32gb CPU Compute Nodes cn-f[001-004] 32 1 32 1 256 10 x86_64 rome cn-h[001-004] 64 2 32 1 768 7 x86_64 milan Legacy GPU Compute Nodes kepler5 V100 16 2 16 2 4 2 256 3.6 x86_64 volta,16gb TITAN RTX rtx[1,3-5,7] titanrtx 24 2 20 1 10 2 128 0.93 x86_64 turing,24gb " Special nodes and outliers,https://docs.mila.quebec/Information.html#special-nodes-and-outliers,"Special nodes and outliers " DGX A100,https://docs.mila.quebec/Information.html#dgx-a100,"DGX A100 DGX A100 nodes are NVIDIA appliances with 8 NVIDIA A100 Tensor Core GPUs. Each GPU has 40 GB of memory, for a total of 320 GB per appliance. The GPUs are interconnected via 6 NVSwitches which allows 4.8 TB/s bi-directional bandwidth. In order to run jobs on a DGX A100, add the flags below to your Slurm commands: --gres=gpu:a100: --reservation=DGXA100 " MIG,https://docs.mila.quebec/Information.html#mig,"MIG MIG (Multi-Instance GPU) is an NVIDIA technology allowing certain GPUs to be partitioned into multiple instances, each of which has a roughly proportional amount of compute resources, device memory and bandwidth to that memory. NVIDIA supports MIG on its A100 GPUs and allows slicing the A100 into up to 7 instances. Although this can theoretically be done dynamically, the SLURM job scheduler does not support doing so in practice as it does not model reconfigurable resources very well. Therefore, the A100s must currently be statically partitioned into the required number of instances of every size expected to be used. The cn-g series of nodes include A100-80GB GPUs. One third have been configured to offer regular (non-MIG mode) a100l GPUs. 
The other two-thirds have been configured in MIG mode, and offer the following profiles:
Name | GPU | Model | Memory | Compute | Cluster-wide #
a100l.1g.10gb | a100l.1 | A100 | 10GB (1/8th) | 1/7th of full | 72
a100l.2g.20gb | a100l.2 | A100 | 20GB (2/8th) | 2/7th of full | 108
a100l.3g.40gb | a100l.3 | A100 | 40GB (4/8th) | 3/7th of full | 72
These can be requested using a SLURM flag such as --gres=gpu:a100l.1 The partitioning may be revised as needs and SLURM capabilities evolve. Other MIG profiles exist and could be introduced. Warning MIG has a number of important limitations, most notably that a GPU in MIG mode does not support graphics APIs (OpenGL/Vulkan), nor P2P over NVLink and PCIe. We have therefore chosen to limit every MIG job to exactly one MIG slice and no more. Thus, --gres=gpu:a100l.3 will work (and request a size-3 slice of an a100l GPU) but --gres=gpu:a100l.1:3 (with :3 requesting three size-1 slices) will not. " AMD,https://docs.mila.quebec/Information.html#amd,"AMD Warning As of August 20, 2019, the GPUs had to be returned to AMD. Mila will get more samples. You can join the AMD Slack channels to get the latest information. Mila has a few nodes equipped with MI50 GPUs. srun --gres=gpu -c 8 --reservation=AMD --pty bash First-time setup of the AMD stack: conda create -n rocm python=3.6 conda activate rocm pip install tensorflow-rocm pip install /wheels/pytorch/torch-1.1.0a0+d8b9d32-cp36-cp36m-linux_x86_64.whl " Data sharing policies,https://docs.mila.quebec/Information.html#data-sharing-policies,"Data sharing policies Note /network/scratch aims to support Access Control Lists (ACLs) to allow collaborative work on rapidly changing data, e.g. work-in-progress datasets, model checkpoints, etc. /network/projects aims to offer a collaborative space for long-term projects. Data that should be kept for a period longer than 90 days can be stored in that location, but a request to Mila’s helpdesk has to be made first to create the project directory. " Monitoring,https://docs.mila.quebec/Information.html#monitoring,"Monitoring Every compute node on the Mila cluster has a Netdata monitoring daemon allowing you to get a sense of the state of the node. This information is exposed in two ways: For every node, there is a web interface from Netdata itself at <node>.server.mila.quebec:19999. This is accessible only when using the Mila wifi or through SSH tunnelling. SSH tunnelling: on your local machine, run ssh -L 19999:<node>.server.mila.quebec:19999 -p 2222 login.server.mila.quebec or ssh -L 19999:<node>.server.mila.quebec:19999 mila if you have already set up your SSH login, then open http://localhost:19999 in your browser. The Mila dashboard at dashboard.server.mila.quebec exposes aggregated statistics with the use of Grafana. These are collected internally by an instance of Prometheus. In both cases, those graphs are not editable by individual users, but they provide valuable insight into the state of the whole cluster or the individual nodes. One of the important uses is to collect data about the health of the Mila cluster and to sound the alarm if outages occur (e.g. if the nodes crash or if GPUs mysteriously become unavailable for SLURM). " Example with Netdata on cn-c001,https://docs.mila.quebec/Information.html#example-with-netdata-on-cn-c001,"Example with Netdata on cn-c001 For example, if we have a job running on cn-c001, we can type cn-c001.server.mila.quebec:19999 in a browser address bar and the following page will appear. 
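For instance, with the mila SSH host configured as described elsewhere in this documentation, the corresponding tunnel for cn-c001 can be opened like this (replace cn-c001 with the node actually running your job):

```bash
ssh -L 19999:cn-c001.server.mila.quebec:19999 mila
# then browse to http://localhost:19999 on your local machine
```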
" Example watching the CPU/RAM/GPU usage,https://docs.mila.quebec/Information.html#example-watching-the-cpu-ram-gpu-usage,"Example watching the CPU/RAM/GPU usage Given that compute nodes are generally shared with other users who are also running jobs at the same time and consuming resources, this is not generally a good way to profile your code in fine details. However, it can still be a very useful source of information for getting an idea of whether the machine that you requested is being used in its full capacity. Given how expensive the GPUs are, it generally makes sense to try to make sure that this resources is always kept busy. CPU iowait (pink line): High values means your model is waiting on IO a lot (disk or network). CPU RAM You can see how much CPU RAM is being used by your script in practice, considering the amount that you requested (e.g. `sbatch --mem=8G ...`). GPU usage is generally more important to monitor than CPU RAM. You should not cut it so close to the limit that your experiments randomly fail because they run out of RAM. However, you should not request blindly 32GB of RAM when you actually require only 8GB. GPU Monitors the GPU usage using an nvidia-smi plugin for Netdata. Under the plugin interface, select the GPU number which was allocated to you. You can figure this out by running echo $SLURM_JOB_GPUS on the allocated node or, if you have the job ID, scontrol show -d job YOUR_JOB_ID | grep 'GRES' and checking IDX You should make sure you use the GPUs to their fullest capacity. Select the biggest batch size if possible to increase GPU memory usage and the GPU computational load. Spawn multiple experiments if you can fit many on a single GPU. Running 10 independent MNIST experiments on a single GPU will probably take less than 10x the time to run a single one. This assumes that you have more experiments to run, because nothing is gained by gratuitously running experiments. You can request a less powerful GPU and leave the more powerful GPUs to other researchers who have experiments that can make best use of them. Sometimes you really just need a k80 and not a v100. Other users or jobs If the node seems unresponsive or slow, it may be useful to check what other tasks are running at the same time on that node. This should not be an issue in general, but in practice it is useful to be able to inspect this to diagnose certain problems. " Example with Mila dashboard,https://docs.mila.quebec/Information.html#example-with-mila-dashboard,"Example with Mila dashboard " Storage,https://docs.mila.quebec/Information.html#storage,"Storage Path Performance Usage Quota (Space/Files) Backup Auto-cleanup /network/datasets/ High Curated raw datasets (read only) $HOME or /home/mila/// Low Personal user space Specific libraries, code, binaries 100GB/1000K Daily no $SCRATCH or /network/scratch/// High Temporary job results Processed datasets Optimized for small Files no no 90 days $SLURM_TMPDIR Highest High speed disk for temporary job results 4TB/- no at job end /network/projects// Fair Shared space to facilitate collaboration between researchers Long-term project storage 200GB/1000K Daily no $ARCHIVE or /network/archive/// Low Long-term personal storage 500GB no no Note The $HOME file system is backed up once a day. For any file restoration request, file a request to Mila’s IT support with the path to the file or directory to restore, with the required date. Warning Currently there is no backup system for any other file systems of the Mila cluster. 
Storage local to personal computers, Google Drive and other related solutions should be used to back up important data. " $HOME,https://docs.mila.quebec/Information.html#home,"$HOME $HOME is appropriate for code and libraries which are small and read once, as well as experimental results that will be needed at a later time (e.g. the weights of a network referenced in a paper). Quotas are enabled on $HOME for both disk capacity (blocks) and number of files (inodes). The limits for blocks and inodes are respectively 100GiB and 1 million per user. The command to check the quota usage from a login node is: beegfs-ctl --cfgFile=/etc/beegfs/home.d/beegfs-client.conf --getquota --uid $USER " $SCRATCH,https://docs.mila.quebec/Information.html#scratch,"$SCRATCH $SCRATCH can be used to store processed datasets, work-in-progress datasets or temporary job results. Its block size is optimized for small files, which minimizes the performance hit of working on extracted datasets. Note Auto-cleanup: this file system is cleared on a weekly basis; files not used for more than 90 days will be deleted. " $SLURM_TMPDIR,https://docs.mila.quebec/Information.html#slurm-tmpdir,"$SLURM_TMPDIR $SLURM_TMPDIR points to the local disk of the node on which a job is running. It should be used to copy the data onto the node at the beginning of the job and to write intermediate checkpoints. This folder is cleared after each job. " projects,https://docs.mila.quebec/Information.html#projects,"projects projects can be used for collaborative projects. It aims to ease the sharing of data between users working on a long-term project. Quotas are enabled on projects for both disk capacity (blocks) and number of files (inodes). The limits for blocks and inodes are respectively 200GiB and 1 million per user and per group. Note It is possible to request higher quota limits if the project requires it. File a request to Mila’s IT support. " $ARCHIVE,https://docs.mila.quebec/Information.html#archive,"$ARCHIVE The purpose of $ARCHIVE is to store data other than datasets that has to be kept long-term (e.g. generated samples, logs, data relevant for paper submission). $ARCHIVE is only available on the login nodes. Because this file system is tuned for large files, it is recommended to archive your directories. For example, to archive the results of an experiment in $SCRATCH/my_experiment_results/, run the commands below from a login node: cd $SCRATCH tar cJf $ARCHIVE/my_experiment_results.tar.xz --xattrs my_experiment_results Disk capacity quotas are enabled on $ARCHIVE. The soft limit per user is 500GB, the hard limit is 550GB. The grace time is 7 days. This means that one can use more than 500GB for 7 days before the file system enforces the quota. However, it is not possible to use more than 550GB. The command to check the quota usage from a login node is df: df -h $ARCHIVE Note There is NO backup of this file system. " datasets,https://docs.mila.quebec/Information.html#datasets,"datasets datasets contains curated datasets for the benefit of the Mila community. To request the addition of a dataset or a preprocessed dataset you think could benefit the research of others, you can fill out this form. Datasets can also be browsed from the web: Mila Datasets Datasets in datasets/restricted are restricted and require an explicit request to gain access. Please submit a support ticket mentioning the dataset’s access group (e.g. scannet_users), your cluster username and the approval of the group owner. 
You can find the dataset’s access group by listing the content of /network/datasets/restricted with the ls command. Those datasets are mirrored to the Alliance clusters in ~/projects/rrg-bengioy-ad/data/curated/ if they follow the Digital Research Alliance of Canada’s good practices on data. To list the local datasets on an Alliance cluster, you can execute the following command: ssh [CLUSTER_LOGIN] -C ""projects/rrg-bengioy-ad/data/curated/list_datasets_cc.sh"" " Data Transmission,https://docs.mila.quebec/Information.html#data-transmission,"Data Transmission Multiple methods can be used to transfer data to/from the cluster: rsync --bwlimit=10mb is the favored method, since the bandwidth can be limited to prevent impacting the usage of the cluster (see rsync); for the Digital Research Alliance of Canada clusters, Globus can also be used. " Getting started,https://docs.mila.quebec/Getting_started.html#getting-started,"Getting started See User’s guide. " User’s guide,https://docs.mila.quebec/Userguide.html#user-s-guide,"User’s guide …or IDT’s list of opinionated howtos This section seeks to provide users of the Mila infrastructure with practical knowledge, tips and tricks and example commands. " Quick Start,https://docs.mila.quebec/Userguide.html#quick-start,"Quick Start Users first need login access to the cluster. It is recommended to install milatools, which will help in setting up the SSH configuration needed to securely and easily connect to the cluster. " mila code,https://docs.mila.quebec/Userguide.html#mila-code,"mila code milatools also makes it easy to run and debug code on the Mila cluster. Using the mila code command will allow you to use VSCode on the server. Simply run: mila code path/on/cluster The details of the command can be found on the GitHub page of the package. Note that you need to first set up your SSH configuration using mila init before the mila code command can be used. The initialisation of the SSH configuration is explained here and on the GitHub page of the package. " Logging in to the cluster,https://docs.mila.quebec/Userguide.html#logging-in-to-the-cluster,"Logging in to the cluster To access the Mila cluster, you will need a Mila account. Please contact the Mila systems administrators if you don’t have one already. Our IT support service is available here: https://it-support.mila.quebec/ You will also need to complete and return an IT Onboarding Training to get access to the cluster. Please refer to the Mila Intranet for more information: https://sites.google.com/mila.quebec/mila-intranet/it-infrastructure/it-onboarding-training IMPORTANT: Your access to the cluster is granted based on your status at Mila (for students, your status is the same as your main supervisor’s status), and on the duration of your stay, set during the creation of your account. The following have access to the cluster: current students of core professors, core professors, and staff. " SSH Login,https://docs.mila.quebec/Userguide.html#ssh-login,"SSH Login You can access the Mila cluster via ssh:
# Generic login, will send you to one of the 4 login nodes to spread the load
ssh <username>@login.server.mila.quebec -p 2222
# To connect to a specific login node, X in [1, 2, 3, 4]
ssh <username>@login-X.login.server.mila.quebec -p 2222
Four login nodes are available and accessible behind a load balancer. At each connection, you will be redirected to the least-loaded login node. 
The ECDSA, RSA and ED25519 fingerprints for Mila’s login nodes are: SHA256:baEGIa311fhnxBWsIZJ/zYhq2WfCttwyHRKzAb8zlp8 (ECDSA) SHA256:Xr0/JqV/+5DNguPfiN5hb8rSG+nBAcfVCJoSyrR0W0o (RSA) SHA256:gfXZzaPiaYHcrPqzHvBi6v+BWRS/lXOS/zAjOKeoBJg (ED25519) Important Login nodes are merely entry points to the cluster. They give you access to the compute nodes and to the filesystem, but they are not meant to run anything heavy. Do not run compute-heavy programs on these nodes, because in doing so you could bring them down, impeding cluster access for everyone. This means no training or experiments, no compiling programs, no Python scripts, but also no zipping of large folders or anything else that demands a sustained amount of computation. Rule of thumb: never run a program that takes more than a few seconds on a login node. Note In a similar vein, you should not run VSCode remote SSH instances directly on login nodes, because even though they are typically not very computationally expensive, when many people do it, they add up! See Visual Studio Code for specific instructions. " mila init,https://docs.mila.quebec/Userguide.html#mila-init,"mila init To make it easier to set up a productive environment, Mila publishes the milatools package, which defines a mila init command that will automatically perform some of the steps below for you. You can install it with pip and use it, provided your Python version is at least 3.8: $ pip install milatools $ mila init " SSH Config,https://docs.mila.quebec/Userguide.html#ssh-config,"SSH Config The login nodes support the following authentication mechanisms: publickey,keyboard-interactive. If you would like to set an entry in your .ssh/config file, please use the following recommendation: Host mila User YOUR-USERNAME Hostname login.server.mila.quebec PreferredAuthentications publickey,keyboard-interactive Port 2222 ServerAliveInterval 120 ServerAliveCountMax 5 Then you can simply write ssh mila to connect to a login node. You will also be able to use mila with scp, rsync and other such programs. Tip You can run commands on the login node with ssh directly, for example ssh mila squeue -u '$USER' (remember to put single quotes around any $VARIABLE you want to evaluate on the remote side, otherwise it will be evaluated locally before ssh is even executed). " Passwordless login,https://docs.mila.quebec/Userguide.html#passwordless-login,"Passwordless login To save you some repetitive typing, it is highly recommended to set up public key authentication, which means you won’t have to enter your password every time you connect to the cluster. # ON YOUR LOCAL MACHINE # You might already have done this in the past, but if you haven't: ssh-keygen # Press ENTER 3x # Copy your public key over to the cluster # You will need to enter your password ssh-copy-id mila " Connecting to compute nodes,https://docs.mila.quebec/Userguide.html#connecting-to-compute-nodes,"Connecting to compute nodes If (and only if) you have a job running on compute node “cnode”, you are allowed to SSH to it directly if, for some reason, you need a second terminal. That session will be automatically ended when your job is relinquished. First, however, you need to have password-less SSH set up, either with a key present in your home directory or with an ssh-agent. To generate a key pair on the login node: # ON A LOGIN NODE ssh-keygen # Press ENTER 3x cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys chmod 600 ~/.ssh/authorized_keys chmod 700 ~/.ssh Then from the login node you can write ssh cnode.
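For example, assuming you already have a job running, a quick way to get that second terminal is to look up the allocated node with squeue and then connect to it; here cnode stands for whatever name appears in the NODELIST column for your job:

# ON A LOGIN NODE
squeue -u $USER   # the NODELIST column shows the compute node(s) allocated to your job
ssh cnode         # opens a second shell on that node; the session ends when the job is relinquished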
From your local machine, you can use ssh -J mila USERNAME@cnode (-J represents a “jump” through the login node, necessary because the compute nodes are behind a firewall). If you wish, you may also add the following wildcard rule in your .ssh/config: Host *.server.mila.quebec !*login.server.mila.quebec HostName %h User YOUR-USERNAME ProxyJump mila This will let you connect to a compute node with ssh cnode.server.mila.quebec. " Running your code,https://docs.mila.quebec/Userguide.html#running-your-code,"Running your code " SLURM commands guide,https://docs.mila.quebec/Userguide.html#slurm-commands-guide,"SLURM commands guide " Basic Usage,https://docs.mila.quebec/Userguide.html#basic-usage,"Basic Usage The SLURM documentation provides extensive information on the available commands to query the cluster status or submit jobs. Below are some basic examples of how to use SLURM. " Submitting jobs,https://docs.mila.quebec/Userguide.html#submitting-jobs,"Submitting jobs " Batch job,https://docs.mila.quebec/Userguide.html#batch-job,"Batch job In order to submit a batch job, you have to create a script containing the main command(s) you would like to execute on the allocated resources/nodes. #!/bin/bash #SBATCH --job-name=test #SBATCH --output=job_output.txt #SBATCH --error=job_error.txt #SBATCH --ntasks=1 #SBATCH --time=10:00 #SBATCH --mem=100Gb module load python/3.5 python my_script.py Your job script is then submitted to SLURM with sbatch (ref.): sbatch job_script sbatch: Submitted batch job 4323674 The working directory of the job will be the one where you executed sbatch. Tip Slurm directives can be specified on the command line alongside sbatch or inside the job script with a line starting with #SBATCH. " Interactive job,https://docs.mila.quebec/Userguide.html#interactive-job,"Interactive job Workload managers usually run batch jobs to avoid having to watch their progression and to let the scheduler run them as soon as resources are available. If you want to get access to a shell while leveraging cluster resources, you can submit an interactive job where the main executable is a shell, using the srun/salloc commands. salloc will start an interactive job on the first node available with the default resources set in SLURM (1 task/1 CPU). srun accepts the same arguments as sbatch, with the exception that the environment is not passed. Tip To pass your current environment to an interactive job, add --preserve-env to srun. salloc can also be used; it is mostly a wrapper around srun when called without further arguments, but it gives more flexibility if, for example, you want to get an allocation on multiple nodes. " Job submission arguments,https://docs.mila.quebec/Userguide.html#job-submission-arguments,"Job submission arguments In order to accurately select the resources for your job, several arguments are available. The most important ones are: Argument Description -n, --ntasks= The number of tasks in your script, usually =1 -c, --cpus-per-task= The number of cores for each task -t, --time=