Docker for researchers

Jørgen Aarmo Lund

5/9/23

Introduction

  • Jørgen Aarmo Lund, industry PhD student at the UiT Machine Learning group for DIPS AS
    • “Data-driven pathways”: inferring usage patterns in patient record systems from auditing logs
    • Researching explainability, natural language processing
  • DIPS develops e-health systems: patient records, laboratory services, hospital kiosks, and more
    • Gradually moving applications over to containers

Agenda

  • Part 1: Introduction to containers
    • What are containers?
    • Why are they useful for ML research?
    • Getting started with Docker
  • Part 2: Putting together our own container images
    • Basic Dockerfile syntax
    • Where does the model go?
    • Debugging tips
  • Part 3: Deploying containers for ML research
    • GPU and device access
    • Deploying to UiT’s GPU cluster
    • Deploying to NRIS HPC clusters

Follow along

Files available on

https://github.com/jaalu/vigs-docker-workshop

Motivation for software developers

  • IT around 2008: developers handing applications to sysadmins maintaining long-lived servers
    • Downtime for manual installation
    • Server and application maintenance intertwined
    • Conflicts between dependencies
  • Containers allow isolating applications and running them with their own set of dependencies

Motivation for ML researchers

  • Replicability: making experimental conditions visible
  • Flexibility: easing transition from laptop tests to HPC training
  • Reusability: showing findings work in other settings too!

Docker

  • Docker allows isolating your script into a container, which:
    • Runs isolated from other processes while sharing the OS
    • Can package its own set of dependencies
    • Can be packaged and started on other servers, including HPC clusters
  • Maintained by Docker Inc., runtime open source
  • Docker Desktop packages the software with a GUI, free for researchers

Docker - structure

Key concepts

We distinguish between containers and images:

  • A container is a standalone environment with your script and the dependencies it needs
  • An image is the template for making your container
    • Images can be saved to a registry, like Docker Hub

Containers are meant to be disposable: changes you want to keep - like your trained model - should be outside of the container!

Installing Docker - options

  • Docker Desktop: https://www.docker.com/
  • Play With Docker: https://labs.play-with-docker.com
    • Free online lab with VMs provisioned
  • Docker also provides an apt repository

Checking that Docker works

  • When Docker is running, we can get a list of running containers with
$ docker ps 
CONTAINER ID   IMAGE      COMMAND                  CREATED          STATUS         PORTS      NAMES
  • We can then retrieve an image with docker pull:
$ docker pull hello-world
  • We can then create and start a container from the image with docker run:
$ docker run hello-world

Running containers - custom commands

  • Images specify a default command, but we can specify one ourselves in docker run:
$ docker run ubuntu echo Hello!
Hello!

Running containers - detached

  • Default: containers do not accept any input, but write to the terminal
  • More likely you want a container which runs detached in the background, with --detach or -d:
$ docker run -d hello-world
$ docker ps -a
CONTAINER ID   IMAGE         COMMAND         CREATED              STATUS                          PORTS     NAMES
d3a5ee04babd   hello-world   "/hello"        About a minute ago   Exited (0) About a minute ago             elated_feistel
$ docker logs elated_feistel
  • NOTE: Docker options go before the image name; anything after the image name is treated as the command

Running containers - interactive

  • Alternatively, we can specify that the container should set up a shell and accept input with --interactive --tty, or -it for short:
$ docker run -it python:3.9
Python 3.9.16 (main, May  4 2023, 06:16:43) 
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.

Running containers - cleaning up

  • Containers will stick around after they finish running
  • Handy for checking logs or restarting, but the container list fills up quickly
  • Passing --rm will delete the container after it exits:
$ docker run --rm hello-world
$ docker ps -a

Running containers - configuration

  • We can set environment variables in the container with --env or -e:
$ docker run -e MODEL_ARCH=resnet ubuntu
  • If we want to expose network ports (e.g. for dashboards) we can map ports from the container to the host with -p:
$ docker run -p 8080:80 httpd
  • NOTE: the order is host-container, so -p 8080:80 will connect port 80 on the container to port 8080 on the host
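
On the receiving end, the script inside the container has to read these variables itself. A minimal Python sketch, assuming a hypothetical training script that takes MODEL_ARCH (as in the example above) and a hypothetical BATCH_SIZE setting; the function name and the default values are illustrative only:

```python
import os


def load_settings(environ=os.environ):
    """Read settings from environment variables, falling back to defaults."""
    return {
        # MODEL_ARCH matches the variable passed with `docker run -e` above.
        "model_arch": environ.get("MODEL_ARCH", "resnet"),
        # BATCH_SIZE is a hypothetical setting; environment values are always
        # strings, so numeric settings have to be converted explicitly.
        "batch_size": int(environ.get("BATCH_SIZE", "128")),
    }


if __name__ == "__main__":
    print(load_settings())
```

Passing the mapping as an argument keeps the function easy to try out without touching the real environment.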

Where do we keep the model?

“Containers are meant to be disposable: changes you want to keep should be outside of the container”

So where do we store the trained models?

Where could we keep the model?

  • Embed it as part of the image
    • Not an option for training, gives us large images
  • Copy it to/from the container after start
    • docker cp can copy files
    • Can bump into runtime storage limits
  • Upload to/download from online server
    • Extra warmup time, network traffic
    • Weights & Biases, Hugging Face libraries provide functionality for this

Where should we keep the model?

  • Bind mounts
    • Creates a temporary link between a directory on the host PC and a directory in the container
    • Pros: Can see the directory, pull files quickly
    • Cons: Assumes the storage is on your PC, not as flexible as volumes
  • Volumes
    • Docker creates and manages a persistent directory
    • Pros: More flexible, can set up plugins to mount cloud storage as volumes
    • Cons: Requires a running (temporary) container to copy files to host

Mapping bind mounts

To mount a directory with a bind mount we can use --mount:

$ docker run --mount type=bind,source=$(pwd)/assets/,target=/pictures/ ubuntu

source points to the folder on the host (assets in the working directory), and target is the folder it will appear as in the container (/pictures/)

Setting up a volume

To set up a volume we run docker volume create:

$ docker volume create my-volume
$ docker volume inspect my-volume

We can then mount it in the same way, but with type=volume:

$ docker run --mount type=volume,source=my-volume,target=/results/ ubuntu
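
From the script's point of view, a mounted volume or bind mount is just a directory. A minimal Python sketch, assuming the /results/ target from the command above; the function name, the file name and the JSON format are hypothetical placeholders for whatever your framework uses to save a model:

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(state, results_dir):
    """Write `state` as JSON into the mounted directory and return the path."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    target = results_dir / "checkpoint.json"
    # Write to a temporary file first, then rename it into place, so a
    # container that is killed mid-write does not leave a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=results_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_name, target)
    return target
```

Inside the container, the script would call something like save_checkpoint({"epoch": 1}, "/results/"); the file then outlives the container because it lives in the volume.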

Break

Creating an image - FROM

  • Template file conventionally named Dockerfile (no extension)
  • Images start with a FROM statement, which specifies which image (ubuntu) and tag (22.04) to build on:
FROM ubuntu:22.04
  • If you omit the tag, Docker will automatically use latest - but use one if you can!

Building an image

  • This turns out to be a complete Dockerfile:
FROM ubuntu:22.04
  • Save this as Dockerfile (no extension!) in a project directory, and run
$ docker build . -t my-image
  • docker build expects a build context - the directory to pull project files from when building the image - like the current directory .
  • -t ties the new image to a name and a tag

Creating an image - RUN

  • Once we have a base, we specify which commands to run to build the environment
  • RUN statements are followed by commands to run - for instance, we can install packages:
RUN apt-get update && apt-get install -y python3

Creating an image - COPY

  • Earlier, we specified the build context (usually the project directory)
  • COPY copies files from the build context to the container:
COPY train.py /experiment/train.py
  • We can set the working directory in the container with WORKDIR
WORKDIR /experiment/
  • The .dockerignore file specifies which files not to copy from the context

Creating an image - ENV

  • We can also specify which settings the container expects/respects when it runs - conventionally done through environment variables
  • ENV sets the default value for an environment variable:
ENV BATCH_SIZE=128
  • Default values can be overridden with docker run -e

Creating an image - CMD

  • Finally, we specify what should happen when the container starts with CMD:
CMD python3 train.py

Full image example

Putting the instructions together:

FROM ubuntu:22.04
ENV BATCH_SIZE=128
RUN apt-get update && apt-get install -y python3
COPY train.py /experiment/train.py
CMD python3 /experiment/train.py

Debugging tip 1

  • What if: your image fails to build on step 14?
  • You can temporarily turn off BuildKit to get intermediate images for each layer:

On Linux:

$ DOCKER_BUILDKIT=0 docker build .

On Windows:

set DOCKER_BUILDKIT=0& docker build .
  • You can create a container from the intermediate image and retry the step:
$ docker run -it <intermediate-image-id> bash
# pip install ...

Debugging tip 2

  • What if: the training suddenly stops and nothing happens?
  • Find the running container
$ docker ps -a
e593fff04794   postgres   "docker-entrypoint.s…"   10 seconds ago   Up 9 seconds   5432/tcp   stupefied_elion
  • and start an interactive shell inside it:
$ docker exec -it stupefied_elion bash
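
It can also help to prepare the script for this situation in advance. A minimal sketch, assuming the training script is written in Python: the standard library's faulthandler module can periodically print the stack of every thread, so docker logs shows where the script was when it stalled. The function names and the ten-minute interval are hypothetical choices:

```python
import faulthandler
import sys


def start_watchdog(timeout_s=600.0, file=None):
    """Dump all thread stack traces to `file` every `timeout_s` seconds.

    The dumps are written by a background watchdog thread, so they still
    appear when the main thread is stuck in a blocking call.
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)


def stop_watchdog():
    """Cancel the pending dumps, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling start_watchdog() at the top of the training script and stop_watchdog() at the end is enough; the output goes to stderr by default, which Docker captures.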

Archiving images

You can also save images as archives with docker save

$ docker save mnist-demo > mnist-demo.tar

and load them with docker load

$ docker load < mnist-demo.tar

Making cache-friendly images

  • Images are composed of multiple layers: each set of changes made by RUN and COPY makes up a layer. Try:

$ docker image inspect python

  • To avoid rebuilding everything on every build, Docker saves extra info for each layer:
    • For RUN, the command is saved
    • For COPY, a checksum of the added files is saved
  • If the command run or the files added by the layer are the same, and the previous layer is also unchanged, the layer is reused
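
The reuse rule can be illustrated with a small Python sketch. This is a simplified model of the idea, not Docker's actual implementation: each layer gets a key computed from the previous layer's key, the instruction text and (for COPY) the contents of the copied files. All function names are hypothetical.

```python
import hashlib


def layer_key(parent_key, instruction, file_contents=b""):
    """Key for one layer: depends on the parent layer, the instruction text
    and, for COPY, the bytes of the files being added."""
    digest = hashlib.sha256()
    digest.update(parent_key.encode())
    digest.update(instruction.encode())
    digest.update(hashlib.sha256(file_contents).digest())
    return digest.hexdigest()


def build_keys(steps):
    """Return the chain of layer keys for a list of (instruction, files)."""
    keys, parent = [], ""
    for instruction, file_contents in steps:
        parent = layer_key(parent, instruction, file_contents)
        keys.append(parent)
    return keys


def reused_layers(old_steps, new_steps):
    """Count how many leading layers of the new build match the old build."""
    count = 0
    for old, new in zip(build_keys(old_steps), build_keys(new_steps)):
        if old != new:
            break
        count += 1
    return count
```

In this model, editing only the copied script leaves every earlier layer reusable, while editing an early RUN line invalidates that layer and everything after it, because each key includes its parent's key.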

Making cache-friendly images pt. 2

  • For this reason, the most frequent and smallest changes should come last in your image, e.g.
  1. Installing system packages
  2. Installing Python packages
  3. Adding your script
  4. Running your script
  • The commands should, as far as possible, have the same results each time you run them

System package installation

  • When using apt, group together updating and package installation:
apt-get update && apt-get install -y python3
  • NOTE: This doesn’t guarantee that you get the latest packages every time you build, but makes sure the package index makes sense when you install the packages

  • Possible to lock apt packages to specific versions, but specifying the distro version is usually sufficient

Language package installation

  • Generally we want a lock file with exact package versions, which we can then restore as part of building the image
  • In Python, pip freeze produces a list of all packages installed
$ pip freeze > requirements.txt
  • Good practice to set up a virtual environment with venv to isolate exactly which packages the project needs
  • In R, renv (the successor to packrat) can create similar virtual environments with lock files:
renv::snapshot() # saves project deps to renv.lock
renv::restore() # restores deps from renv.lock
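
For the Python case, it can be worth verifying inside the image that the installed packages match the lock file. A minimal sketch, assuming a requirements.txt with simple name==version lines as written by pip freeze; the helper names are hypothetical, and other requirement syntax (URLs, editable installs) is skipped rather than handled:

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Extract {name: version} from `name==version` lines; skip the rest."""
    pins = {}
    for line in requirements_text.splitlines():
        # Drop comments and environment markers before looking for the pin.
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line and not line.startswith("-"):
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {name: (pinned, installed)} for packages that differ."""
    return {
        name: (pinned, lookup(name))
        for name, pinned in pins.items()
        if lookup(name) != pinned
    }
```

An empty result from find_mismatches(parse_pins(text)) suggests the environment matches the lock file, at least for the pins this simple parser understands.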

M1 Mac: installing amd64 packages

  • Common problem on M1 Macs: older packages/libraries without ARM binaries
  • We can ask Docker to run the container as an amd64 (Intel/AMD) Linux machine through emulation with --platform=linux/amd64:
$ docker run --platform=linux/amd64 ubuntu uname -a
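
To confirm which architecture a container actually reports, the Python standard library can be queried from inside it. A small sketch; the helper name and the alias table are illustrative, and it assumes it is run inside the container, for example in a python image:

```python
import platform

# Common spellings of the same two architectures (illustrative, not exhaustive).
ALIASES = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def docker_arch(machine=None):
    """Map the machine name reported by the OS to Docker's platform naming."""
    machine = (machine or platform.machine()).lower()
    return ALIASES.get(machine, machine)


if __name__ == "__main__":
    print(f"{platform.system().lower()}/{docker_arch()}")
```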

Building images - conclusion

  • Order changes from least to most frequent
  • Commands should be deterministic
  • Virtual environments and lock files useful to have reproducible containers

GPU/device access

  • Nvidia Container Runtime lets you run CUDA code in containers:
$ apt-get install nvidia-container-runtime
  • Can use the --gpus switch to grant access to GPU:
$ docker run --gpus all ubuntu
  • Windows: requires WSL 2, newer CUDA driver
  • Premade CUDA Docker images: https://github.com/NVIDIA/nvidia-docker/wiki/CUDA
  • Also possible to grant access to USB/hardware devices with --device
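
A quick way to check from inside the container whether a GPU is visible is to call nvidia-smi, which the Nvidia runtime normally makes available in containers started with --gpus. A minimal Python sketch using only the standard library; the function name is hypothetical, and if an ML framework is installed, its own check (such as torch.cuda.is_available() in PyTorch) is the more direct test:

```python
import shutil
import subprocess


def visible_gpus():
    """Return the GPU lines printed by `nvidia-smi -L`, or [] if unavailable."""
    smi = shutil.which("nvidia-smi")
    if smi is None:
        return []
    try:
        result = subprocess.run(
            [smi, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    # Each GPU is listed on its own line starting with "GPU <index>:".
    return [line for line in result.stdout.splitlines() if line.startswith("GPU ")]
```

An empty list at the start of a training script is a cheap early warning that the container was started without GPU access.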

Packaging for cloud services

  • You can set up your own container registry with the cloud provider:
    • Amazon ECR
    • Azure Container Registry
    • Google Cloud Container Registry
  • To push the image to your cloud registry, tag your image with the URL of the registry and push:
$ docker tag mnist-demo jludemo.azurecr.io/mnist-demo
$ docker push jludemo.azurecr.io/mnist-demo

Packaging for Springfield (UiT)

  • UiT’s Springfield cluster uses Kubernetes for orchestration across multiple nodes
  • Docker Desktop allows setting up your PC as a single-node Kubernetes cluster
  • Kubernetes lets us define a Job which runs one or more containers until completion:
kind: Job
apiVersion: batch/v1
metadata:
  name: your-training-job
spec:
  template:
    spec:
      containers:
      - name: your-training
        image: "your-training-image"
        workingDir: /storage
        command: ["sh", "train.sh"]
        volumeMounts:
        - name: storage
          mountPath: /storage
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: storage
      restartPolicy: OnFailure
  backoffLimit: 0

Packaging for NRIS HPC (Betzy, LUMI)

  • As long as the architecture is the same, we can now push images and run the containers on HPC clusters!
  • However: container libraries most likely not tuned for the HPC cluster
    • If libraries are available as Lmod modules, the modules will be faster
  • But containers are still useful for
    • Portability
    • Specifying package/library versions

Packaging for NRIS HPC - Singularity (Betzy, LUMI)

  • Singularity, the container runtime installed on NRIS HPC computers, supports converting Docker images to Singularity .sif images. Running
$ singularity pull --name train.sif docker://jlu015/train:latest

will retrieve the image jlu015/train from Docker Hub and save it as train.sif

  • To run the default command specified by CMD, we can call singularity run with the image:
$ singularity run train.sif
  • Alternatively, singularity exec will run a specific command:
$ singularity exec train.sif echo Hello world!

Packaging for NRIS HPC - Writing a SLURM job (Betzy, LUMI)

See https://documentation.sigma2.no/code_development/guides/containers.html#singularity-in-job-scripts

Resources

  • Docker in Y Minutes:
    • https://learnxinyminutes.com/docs/docker/
  • The Play with Docker exercises:
    • https://training.play-with-docker.com/
  • NRIS’ documentation on containers:
    • https://documentation.sigma2.no/