mirror of https://github.com/PrivSec-dev/privsec.dev synced 2025-01-22 03:52:04 -05:00
privsec.dev/public/os/docker-and-oci-hardening/index.html
Tommy 6b30fb93fe
New Hugo build
Signed-off-by: Tommy <contact@tommytran.io>
2022-07-17 06:00:20 -04:00

<!doctype html><html lang=en dir=auto><head><meta charset=utf-8><meta http-equiv=x-ua-compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots content="index, follow"><title>Docker and OCI Hardening | PrivSec.dev</title><meta name=keywords content="operating systems,linux,container,security"><meta name=description content="Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:
- Hey, your software doesn&rsquo;t work&mldr;
- Sorry, it works on my computer! Can&rsquo;t help you.
Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries."><meta name=author content="Wonderfall"><link rel=canonical href=https://wonderfall.dev/docker-hardening/><link crossorigin=anonymous href=/assets/css/stylesheet.8b523f1730c922e314350296d83fd666efa16519ca136320a93df674d00b6325.css integrity="sha256-i1I/FzDJIuMUNQKW2D/WZu+hZRnKE2MgqT32dNALYyU=" rel="preload stylesheet" as=style><script defer crossorigin=anonymous src=/assets/js/highlight.f413e19d0714851f6474e7ee9632408e58ac146fbdbe62747134bea2fa3415e0.js integrity="sha256-9BPhnQcUhR9kdOfuljJAjlisFG+9vmJ0cTS+ovo0FeA=" onload=hljs.initHighlightingOnLoad()></script>
<link rel=icon href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=icon type=image/png sizes=16x16 href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=icon type=image/png sizes=32x32 href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=apple-touch-icon href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=mask-icon href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><meta name=theme-color content="#2e2e33"><meta name=msapplication-TileColor content="#2e2e33"><noscript><style>#theme-toggle,.top-link{display:none}</style></noscript><meta property="og:title" content="Docker and OCI Hardening"><meta property="og:description" content="Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:
- Hey, your software doesn&rsquo;t work&mldr;
- Sorry, it works on my computer! Can&rsquo;t help you.
Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries."><meta property="og:type" content="article"><meta property="og:url" content="https://privsec.dev/os/docker-and-oci-hardening/"><meta property="article:section" content="os"><meta name=twitter:card content="summary"><meta name=twitter:title content="Docker and OCI Hardening"><meta name=twitter:description content="Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:
- Hey, your software doesn&rsquo;t work&mldr;
- Sorry, it works on my computer! Can&rsquo;t help you.
Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries."><script type=application/ld+json>{"@context":"https://schema.org","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":2,"name":"Operating Systems","item":"https://privsec.dev/os/"},{"@type":"ListItem","position":3,"name":"Docker and OCI Hardening","item":"https://privsec.dev/os/docker-and-oci-hardening/"}]}</script><script type=application/ld+json>{"@context":"https://schema.org","@type":"BlogPosting","headline":"Docker and OCI Hardening","name":"Docker and OCI Hardening","description":"Containers aren\u0026rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:\n- Hey, your software doesn\u0026rsquo;t work\u0026hellip;\n- Sorry, it works on my computer! Can\u0026rsquo;t help you.\nWhether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries.","keywords":["operating systems","linux","container","security"],"articleBody":"Containers aren’t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:\n- Hey, your software doesn’t work…\n- Sorry, it works on my computer! Can’t help you.\nWhether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries. The developer can therefore provide a known-good environment where it is expected that their software “just works”. 
That is particularly useful for development to eliminate environment-related issues, and that is often used in production as well.\nContainers are often perceived as a great tool for isolation, that is, they can provide an isolated workspace that won’t pollute your host OS - all that without the overhead of virtual machines. Security-wise: containers, as we know them on Linux, are glorified namespaces at their core. Containers usually share the same kernel with the host, and namespaces are the kernel feature for separating kernel resources across containers (IDs, networks, filesystems, IPC, etc.). Containers also leverage the features of cgroups to separate system resources (CPU, memory, etc.), and security features such as seccomp to restrict syscalls, or MACs (AppArmor, SELinux).\nAt first, it seems that containers may not provide the same isolation boundary as virtual machines. That’s fine, they were not designed to. But they can’t be reduced to a simple chroot either. We’ll see that a “container” can mean a lot of things, and their definition may vary a lot depending on the implementation: as such, containers are mostly defined by their semantics.\nDocker is dead, long live Docker… and OCI! When people think of containers, a large group of them may think of Docker. While Docker played a big role in the popularity of containers a few years ago, it didn’t introduce the technology: on Linux, LXC (Linux Containers) did. In fact, Docker in its early days was a high-level wrapper for LXC which already combined the power of namespaces and cgroups. Docker then replaced LXC with libcontainer which does more or less the same, plus extra features.\nThen, what happened? Open Container Initiative (OCI). That is the current standard that defines the container ecosystem. That means that whether you’re using Docker, Podman, or Kubernetes, you’re in fact running OCI-compliant tools. 
That is a good thing, as it saves a lot of interoperability headaches.\nDocker is no longer the monolithic platform it once was. libcontainer was absorbed by runc, the reference OCI runtime. The high-level components of Docker split into different parts related to the upstream Moby project (Docker is the “assembled product” of the “Moby components”). When we refer to Docker, we are in fact referring to the powerful high-level API that manages OCI containers. By design, Docker is a daemon that communicates with containerd, a lower-level layer, which in turn communicates with the OCI runtime. That also means that you could very well skip Docker altogether and use containerd or even runc directly.\nDocker client \u003c=\u003e Docker daemon \u003c=\u003e containerd \u003c=\u003e containerd-shim \u003c=\u003e runc Podman is an alternative to Docker developed by Red Hat, that also intends to be a drop-in replacement for Docker. It doesn’t work with a daemon, and can work rootless by design (Docker has support for rootless too, but that is not without caveats). I would largely recommend Podman over Docker for someone who wants a simple tool to run containers and test code on their machine.\nKubernetes (also known as K8S) is the container platform made by Google. It is designed with scaling in mind, and is about running containers across a cluster whereas Docker focuses on packaging containers on a single node. Docker Swarm is the direct alternative to that, but it never really took off due to the popularity of K8S.\nFor the rest of this article, we will use Docker as the reference for our examples, along with the Compose specification format. Most of these examples can be adapted to other platforms without issues.\nThe nightmare of dependencies Containers are made from images, and images are typically built from a Dockerfile. Images can be built and distributed through OCI registries: Docker Hub, Google Container Registry, GitHub Container Registry, and so on. 
You can also set up your own private registry, but the reality is that people often pull images from these public registries.\nImages, immutability and versioning Images are what make containers, well, containers. Containers made from the same image should behave similarly on different machines. Images can have tags, which are useful for software versioning. The usage of generic tags such as latest is often discouraged because it defeats the purpose of the expected behavior of the container. Tags are not necessarily immutable by design, and they shouldn’t be (more on that below). Digest, however, is the attribute of an immutable image, and is often generated with the SHA-256 algorithm.\ndocker.io/library/golang:1.17.1@sha256:232a180dbcbcfa7250917507f3827d88a9ae89bb1cdd8fe3ac4db7b764ebb25 ^ ^ ^ ^ | | | | Registry Image Tag Digest (immutable) Now onto why tags shouldn’t be immutable: as written above, containers bring us an abstraction over the OS dependencies that are used by the packaged software. That is nice indeed, but this shouldn’t lure us into believing that we can forget security updates. The fact is, there is still a whole OS to care about, and we can’t just think of the container as a simple package tool for software.\nFor these reasons, good practices were established:\nAn image should be as minimal as possible (Alpine Linux, or scratch/distroless). An image, with a given tag, should be regularly built, without cache to ensure all layers are freshly built. An image should be rebuilt when the images it’s based on are updated. A minimal base system Alpine Linux is often the choice for official images for the first reason. This is not a typical Linux distribution as it uses musl as its C library, but it works quite well. Actually, I’m quite fond of Alpine Linux and apk (its package manager). If a supervision suite is needed, I’d look into s6. If you need a glibc distribution, Debian provides slim variants for lightweight base images. 
We can do even better than using Alpine by using distroless images, allowing us to have state-of-the-art application containers.\n“Distroless” is a fancy name referring to an image with a minimal set of dependencies, from none (for fully static binaries) to some common libraries (typically the C library). Google maintains distroless images you can use as a base for your own images. If you were wondering, the difference with scratch (empty starting point) is that distroless images contain common dependencies that “almost-statically compiled” binaries may need, such as ca-certificates.\nHowever, distroless images are not suited for every application. In my experience though, distroless is an excellent option with pure Go binaries. Going with minimal images drastically reduces the available attack surface in the container. For example, here’s a multi-stage Dockerfile resulting in a minimal non-root image for a simple Go project:\nFROM golang:alpine as build WORKDIR /app COPY . . RUN CGO_ENABLED=0 go build -o /my_app ./cmd/my_app FROM gcr.io/distroless/static COPY --from=build /my_app / USER nobody ENTRYPOINT [\"/my_app\"] The main drawback of using minimal images is the lack of tools that help with debugging, which also constitute the very attack surface we’re trying to get rid of. The trade-off is probably not worth the hassle for development-focused containers, and if you’re running such images in production, you have to be confident enough to operate with them. Note that the gcr.io/distroless images have a :debug tag to help in that regard.\nKeeping images up-to-date The two other points are highly problematic, because most software vendors just publish an image on release, and forget about it. You should take it up to them if you’re running images that are versioned but not regularly updated. I’d say running scheduled builds once a week is the bare minimum to make sure dependencies stay up-to-date. 
Alpine Linux is a better choice than most other “stable” distributions because it usually has more recent packages.\nStable distributions often rely on backporting security fixes from CVEs, which is known to be a flawed approach to security since CVEs aren’t always assigned or even taken care of. Alpine has more recent packages, and it has versioning, so it’s once again a particularly good choice as long as musl doesn’t cause issues.\nIs it really a security nightmare? When people say Docker is a security nightmare because of that, that’s a fair point. On a traditional system, you could upgrade your whole system with a single command or two. With Docker, you’ll have to recreate several containers… if the images were kept up-to-date in the first place. Recreating itself is not a big deal actually: hot upgrades of binaries and libraries often require the services that use them to restart, otherwise they could still use an old (and vulnerable) version of them in memory. But yeah, the fact is most people are running outdated containers, and more often than not, they don’t have the choice if they rely on third-party images.\nTrivy is an excellent tool to scan images for a subset of known vulnerabilities an image might have. You should play with it and see for yourself how outdated many publicly available images are.\nSupply-chain attacks As with any code downloaded from a software vendor, OCI images are not exempt from supply-chain attacks. The good practice is quite simple: rely on official images, and ideally build and maintain your own images. One should definitely not automatically trust random third-party images they can find on Docker Hub. Half of these images, if not more, contain vulnerabilities, and I bet a good portion of them contain malware such as miners or worse.\nAs an image maintainer, you can sign your images to improve the authenticity assurance. 
Most official images make use of Docker Content Trust, which works with an OCI registry attached to a Notary server. With the Docker toolset, setting the environment variable DOCKER_CONTENT_TRUST=1 enforces signature verification (a signature is only good if it’s checked in the first place). The SigStore initiative is developing cosign, an alternative that doesn’t require a Notary server because it works with features already provided by the registry such as tags. Kubernetes users may be interested in Connaisseur to ensure all signatures have been validated.\nLeave my root alone! Attack surface Traditionally, Docker runs as a daemon owned by root. That also means that root in the container is actually the root on the host and may be a few commands away from compromising the host. More generally, the attacker has to exploit the available attack surface to escape the container. There is a huge attack surface, actually: the Linux kernel. Someone wise once said:\nThe kernel can effectively be thought of as the largest, most vulnerable setuid root binary on the system.\nThat applies particularly to traditional containers which weren’t designed to provide a robust level of isolation. A recent example was CVE-2022-0492: the attacker could abuse root in the container to exploit cgroups v1, and compromise the host. Of course defense-in-depth measures would have prevented that, and we’ll mention them. But fundamentally, container escapes are possible by design.\nBreaking out via the OCI runtime runc is also possible, although CVE-2019-5736 was a particularly nasty bug. The attacker had to gain access to root in the container first in order to access /proc/[runc-pid]/exe, which tells them where to overwrite the runc binary.\nGood practices have been therefore established:\nAvoid using root in the container, plain and simple. Keep the host kernel, Docker and the OCI runtime updated. Consider the usage of user namespaces. 
By the way, it goes without saying that any user who has access to the Docker daemon should be considered as privileged as root. Mounting the Docker socket (/var/run/docker.sock) in a container makes it highly privileged, and so it should be avoided. The socket should only be owned by root, and if that doesn’t work with your environment, use Docker rootless or Podman.\nAvoiding root root can be avoided in different ways in the final container:\nImage creation time: setting the USER instruction in the Dockerfile. Container creation time: via the tools available (user: in the Compose file). Container runtime: degrading privileges with entrypoint scripts (gosu UID:GID). Well-made images with security in mind will have a USER instruction. In my experience, most people will run images blindly, so it’s good harm reduction. Setting the user manually works in some images that aren’t designed to run without root, and it’s also great to mitigate some scenarios where the image is controlled by an attacker. You also won’t have surprises when mounting volumes, so I highly recommend setting the user explicitly and making sure volume permissions are correct once.\nSome images allow users to define their own user with UID/GID environment variables, with an entrypoint script that runs as root and takes care of the volume permissions before dropping privileges. While technically fine, it is still attack surface, and it requires the SETUID/SETGID capabilities to be available in the container.\nUser namespaces: sandbox or paradox? As mentioned just above, user namespaces are a solution to ensure root in the container is not root on the host. Docker supports user namespaces, for instance you could set the default mapping in /etc/docker/daemon.json:\n\"userns-remap\": \"default\" whoami \u0026\u0026 sleep 60 in the container will return root, but ps -fC sleep on the host will show us the PID of another user. 
That is nice, but it has limitations and therefore shouldn’t be considered as a real sandbox. In fact, the paradox is that user namespaces are attack surface (and vulnerabilities are still being found years later), and it’s common wisdom to restrict them to privileged users (kernel.unprivileged_userns_clone=0). That is fine for Docker with its traditional root daemon, but Podman expects you to let unprivileged users interact with user namespaces (so essentially privileged code).\nEnabling userns-remap in Docker shouldn’t be a substitute for running unprivileged application containers (where applicable). User namespaces are mostly useful if you intend to run full-fledged OS containers which need root in order to function, but that is out of the scope of the container technologies mentioned in this article; for them, I’d argue exposing such a vulnerable attack surface from the host kernel for dubious sandboxing benefits isn’t an interesting trade-off to make.\nThe no_new_privs bit After ensuring root isn’t used in your containers, you should look into setting the no_new_privs bit. This Linux feature restricts syscalls such as execve() from granting privileges, which is what you want to restrict in-container privilege escalation. This flag can be set for a given container in a Compose file:\nsecurity_opt: - no-new-privileges:true Gaining privileges in the container will be much harder that way.\nCapabilities Furthermore, we should mention capabilities: root powers are divided into distinct units by the Linux kernel, called capabilities. Each granted capability also grants privilege and therefore access to a significant amount of attack surface. Security researcher Brad Spengler enumerates 19 important capabilities. Docker restricts certain capabilities by default, but some of the most important ones are still available to a container by default.\nYou should consider the following rule of thumb:\nDrop all capabilities by default. Allow only the ones you really need to. 
If you already run your containers unprivileged without root, your container will very likely work fine with all capabilities dropped. That can be done in a Compose file:\ncap_drop: - ALL #cap_add: # - CHOWN # - DAC_READ_SEARCH # - SETUID # - SETGID Never use the --privileged option unless you really need to: a privileged container is given access to almost all capabilities, kernel features and devices.\nOther security features MACs and seccomp are robust tools that may vastly improve container security.\nMandatory Access Control MAC stands for Mandatory Access Control: traditionally a Linux Security Module that will enforce a policy to restrict the userspace. Examples are AppArmor and SELinux: the former being easier to use, the latter being more fine-grained. Both are strong tools that can help… Yet, their sole presence does not mean they’re really effective. A robust policy starts from a deny all policy, and only allows the necessary resources to be accessed.\nseccomp seccomp (short for secure computing mode) on the other hand is a much simpler and complementary tool, and there is no reason not to use it. It restricts a process to a set of system calls, thus drastically reducing the attack surface available.\nDocker provides default profiles for AppArmor and seccomp, and they’re enabled by default for newly created containers unless the unconfined option is explicitly passed. Note: Kubernetes doesn’t enable the default seccomp profile by default, so you should probably try it.\nThese profiles are a great start, but you should do much more if you take security seriously, because they were made to not break compatibility with a large range of images. The default seccomp profile only disables around 44 syscalls, which are mostly not very common and/or obsolete. Of course, the best profile you can get is supposed to be written for a given program. 
It also doesn’t make sense to insist on the permissiveness of the default profiles, and a lot of work has gone into hardening containers.\ncgroups Use cgroups to restrict access to hardware and system resources. You likely don’t want a guest container to monopolize the host resources. You also don’t want to be vulnerable to stupid fork bomb attacks. In a Compose file, consider setting these limits:\nmem_limit: 4g cpus: 4 pids_limit: 256 More runtime options can be found in the official documentation. All of them should have a Compose spec equivalent.\nThe --cgroup-parent option should be avoided as it uses the host cgroup and not the one configured from Docker (or else), which is the default.\nRead-only filesystem It is good practice to treat the image as what some refer to as the “golden image”.\nIn other words, you’ll run containers in read-only mode, with an immutable filesystem inherited from the image. Only the mounted volumes will be read/write accessible, and those should ideally be mounted with the noexec, nosuid and nodev options for extra security. If read/write access isn’t needed, mount these volumes as read-only too.\nHowever, the image may not be perfect and still require read/write access to some parts of the filesystem, likely directories such as /tmp, /run or /var. You can make a tmpfs for those (a temporary filesystem in the container attributed memory), because they’re not persistent data anyway.\nIn a Compose file, that would look like the following settings:\nread_only: true tmpfs: - /tmp:size=10M,mode=0770,uid=1000,gid=1000,noexec,nosuid,nodev That is quite verbose indeed, but that’s to show you the different options for a tmpfs mount. You want to restrict them in size and permissions ideally.\nNetwork isolation By default, all Docker containers will use the default network bridge. They will see and be able to communicate with each other. 
Each container should have its own user-defined bridge network, and each connection between containers should have an internal network. If you intend to run a reverse proxy in front of several containers, you should make a dedicated network for each container you want to expose to the reverse proxy.\nThe --network host option also shouldn’t be used for obvious reasons since the container would share the same network as the host, providing no isolation at all.\nAlternative runtimes (gVisor) runc is the reference OCI runtime, but that means other runtimes can exist as well as long as they’re compliant with the OCI standard. These runtimes can be interchanged quite seamlessly. There are a few alternatives, such as crun or youki, respectively implemented in C and Rust (runc is a Go implementation). However, there is one particular runtime that does a lot more for security: runsc, provided by the gVisor project by the folks at Google.\nContainers are not a sandbox, and while we can improve their security, they will fundamentally share a common attack surface with the host. Virtual machines are a solution to that problem, but you might prefer container semantics and ecosystem. gVisor can be perceived as an attempt to get the “best of both worlds”: containers that are easy to manage while providing a native isolation boundary. gVisor did just that by implementing two things:\nSentry: an application kernel in Go, a language known to be memory-safe. It implements the Linux logic in userspace such as various system calls. Gofer: a host process which communicates with Sentry and the host filesystem, since Sentry is restricted in that aspect. A platform like ptrace or KVM is used to intercept system calls and redirect them from the application to Sentry, which is running in the userspace. This has some costs: there is a higher per-syscall overhead, and compatibility is reduced since not all syscalls are implemented. 
On top of that, gVisor employs security mechanisms we’ve glanced over above, such as a very restrictive seccomp profile between Sentry and the host kernel, the no_new_privs bit, and isolated namespaces from the host.\nThe security model of gVisor is comparable to what you would expect from a virtual machine. It is also very easy to install and use. The path to runsc along with its different configuration flags (runsc flags) should be added to /etc/docker/daemon.json:\n\"runtimes\": { \"runsc-ptrace\": { \"path\": \"/usr/local/bin/runsc\", \"runtimeArgs\": [ \"--platform=ptrace\" ] }, \"runsc-kvm\": { \"path\": \"/usr/local/bin/runsc\", \"runtimeArgs\": [ \"--platform=kvm\" ] } } runsc needs to start with root to set up some mitigations, including the use of its own network stack separated from the host. The sandbox itself drops privileges to nobody as soon as possible. You can still use runsc rootless if you want (which should be needed for Podman):\n./runsc --rootless do uname -a *** Warning: sandbox network isn't supported with --rootless, switching to host *** Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux Linux 4.4.0 is shown because that is the version of the Linux API that Sentry tries to mimic. As you’ve probably guessed, you’re not really using Linux 4.4.0, but the application kernel that behaves like it. By the way, gVisor is of course compatible with cgroups.\nConclusion: what’s a container after all? Like I wrote above, a container is mostly defined by its semantics and ecosystem. Containers shouldn’t be solely defined by the OCI reference runtime implementation, as we’ve seen with gVisor that provides an entirely different security model.\nStill not convinced? What if I told you a container can leverage the same technologies as a virtual machine? 
That is exactly what Kata Containers does by using a VMM like QEMU-lite to provide containers that are in fact lightweight virtual machines, with their traditional resources and security model, compatibility with container semantics and toolset, and an optimized overhead. While not in the OCI ecosystem, Amazon achieves quite the same with Firecracker.\nIf you’re running untrusted workloads, I highly suggest you consider gVisor instead of a traditional container runtime. Your definition of “untrusted” may vary: for me, almost everything should be considered untrusted. That is how modern security works, and how mobile operating systems work. It’s quite simple, security should be simple, and gVisor simply offers native security.\nContainers are a popular, yet strange world. They revolutionized the way we make and deploy software, but one should not lose sight of what they really are and aren’t. This hardening guide is non-exhaustive, but I hope it can make you aware of some aspects you’ve never thought of.\n","wordCount":"3925","inLanguage":"en","datePublished":"0001-01-01T00:00:00Z","dateModified":"0001-01-01T00:00:00Z","author":{"@type":"Person","name":"Wonderfall"},"mainEntityOfPage":{"@type":"WebPage","@id":"https://privsec.dev/os/docker-and-oci-hardening/"},"publisher":{"@type":"Organization","name":"PrivSec.dev","logo":{"@type":"ImageObject","url":"https://privsec.dev/%3Clink%20/%20abs%20url%3E"}}}</script></head><body class=dark id=top><script>localStorage.getItem("pref-theme")==="light"&&document.body.classList.remove("dark")</script><header class=header><nav class=nav><div class=logo><a href=https://privsec.dev accesskey=h title="PrivSec.dev (Alt + H)">PrivSec.dev</a><div class=logo-switches><button id=theme-toggle accesskey=t title="(Alt + T)"><svg id="moon" xmlns="http://www.w3.org/2000/svg" width="24" height="18" viewBox="0 0 24 24" fill="none" stroke="currentcolor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M21 12.79A9 9 0 
1111.21 3 7 7 0 0021 12.79z"/></svg><svg id="sun" xmlns="http://www.w3.org/2000/svg" width="24" height="18" viewBox="0 0 24 24" fill="none" stroke="currentcolor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><circle cx="12" cy="12" r="5"/><line x1="12" y1="1" x2="12" y2="3"/><line x1="12" y1="21" x2="12" y2="23"/><line x1="4.22" y1="4.22" x2="5.64" y2="5.64"/><line x1="18.36" y1="18.36" x2="19.78" y2="19.78"/><line x1="1" y1="12" x2="3" y2="12"/><line x1="21" y1="12" x2="23" y2="12"/><line x1="4.22" y1="19.78" x2="5.64" y2="18.36"/><line x1="18.36" y1="5.64" x2="19.78" y2="4.22"/></svg></button></div></div><ul id=menu><li><a href=https://privsec.dev/knowledge/ title="Knowledge Base"><span>Knowledge Base</span></a></li><li><a href=https://privsec.dev/os/ title="Operating Systems"><span>Operating Systems</span></a></li><li><a href=https://privsec.dev/apps/ title=Applications><span>Applications</span></a></li><li><a href=https://privsec.dev/providers/ title=Providers><span>Providers</span></a></li></ul></nav></header><main class=main><article class=post-single><header class=post-header><div class=breadcrumbs><a href=https://privsec.dev>Home</a>&nbsp;»&nbsp;<a href=https://privsec.dev/os/>Operating Systems</a></div><h1 class=post-title>Docker and OCI Hardening</h1><div class=post-meta>19 min&nbsp;·&nbsp;3925 words&nbsp;·&nbsp;Wonderfall&nbsp;|&nbsp;<a href=https://github.com/PrivSec-dev/privsec.dev/blob/main/content/os/Docker%20and%20OCI%20Hardening.md rel="noopener noreferrer" target=_blank>Suggest Changes</a>
&nbsp;|&nbsp;<span>Originally published at&nbsp;<a href=https://wonderfall.dev/docker-hardening/ title=https://wonderfall.dev/docker-hardening/ target=_blank rel="noopener noreferrer">wonderfall.dev</a></span></div></header><div class=toc><details><summary accesskey=c title="(Alt + C)"><span class=details>Table of Contents</span></summary><div class=inner><ul><li><a href=#docker-is-dead-long-live-docker-and-oci aria-label="Docker is dead, long live Docker&amp;hellip; and OCI!">Docker is dead, long live Docker&mldr; and OCI!</a></li><li><a href=#the-nightmare-of-dependencies aria-label="The nightmare of dependencies">The nightmare of dependencies</a><ul><li><a href=#images-immutability-and-versioning aria-label="Images, immutability and versioning">Images, immutability and versioning</a></li><li><a href=#a-minimal-base-system aria-label="A minimal base system">A minimal base system</a></li><li><a href=#keeping-images-up-to-date aria-label="Keeping images up-to-date">Keeping images up-to-date</a></li><li><a href=#is-it-really-a-security-nightmare aria-label="Is it really a security nightmare?">Is it really a security nightmare?</a></li><li><a href=#supply-chain-attacks aria-label="Supply-chain attacks">Supply-chain attacks</a></li></ul></li><li><a href=#leave-my-root-alone aria-label="Leave my root alone!">Leave my root alone!</a><ul><li><a href=#attack-surface aria-label="Attack surface">Attack surface</a></li><li><a href=#avoiding-root aria-label="Avoiding root">Avoiding root</a></li><li><a href=#user-namespaces-sandbox-or-paradox aria-label="User namespaces: sandbox or paradox?">User namespaces: sandbox or paradox?</a></li><li><a href=#the-no_new_privs-bit aria-label="The no_new_privs bit">The no_new_privs bit</a></li><li><a href=#capabilities aria-label=Capabilities>Capabilities</a></li></ul></li><li><a href=#other-security-features aria-label="Other security features">Other security features</a><ul><li><a href=#mandatory-access-control aria-label="Mandatory 
Access Control">Mandatory Access Control</a></li><li><a href=#seccomp aria-label=seccomp>seccomp</a></li><li><a href=#cgroups aria-label=cgroups>cgroups</a></li><li><a href=#read-only-filesystem aria-label="Read-only filesystem">Read-only filesystem</a></li><li><a href=#network-isolation aria-label="Network isolation">Network isolation</a></li></ul></li><li><a href=#alternative-runtimes-gvisor aria-label="Alternative runtimes (gVisor)">Alternative runtimes (gVisor)</a></li><li><a href=#conclusion-whats-a-container-after-all aria-label="Conclusion: what&amp;rsquo;s a container after all?">Conclusion: what&rsquo;s a container after all?</a></li></ul></div></details></div><div class=post-content><p>Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:</p><blockquote><p>- Hey, your software doesn&rsquo;t work&mldr;</p><p>- Sorry, it works on my computer! Can&rsquo;t help you.</p></blockquote><p>Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries. The developer can therefore provide a known-good environment where it is expected that their software &ldquo;just works&rdquo;. That is particularly useful for development to eliminate environment-related issues, and that is often used in production as well.</p><p>Containers are often perceived as a great tool for isolation, that is, they can provide an isolated workspace that won&rsquo;t pollute your host OS - all that without the overhead of virtual machines. Security-wise: containers, as we know them on Linux, are glorified namespaces at their core. Containers usually share the same kernel with the host, and <strong>namespaces</strong> is the kernel feature for separating kernel resources across containers (IDs, networks, filesystems, IPC, etc.). 
Containers also leverage the features of <strong>cgroups</strong> to separate system resources (CPU, memory, etc.), and security features such as seccomp to restrict syscalls, or MACs (AppArmor, SELinux).</p><p>At first, it seems that containers may not provide the same isolation boundary as virtual machines. That&rsquo;s fine, they were not designed to. But they can&rsquo;t be simplified to a simple <code>chroot</code> either. We&rsquo;ll see that a &ldquo;container&rdquo; can mean a lot of things, and their definition may vary a lot depending on the implementation: as such, containers are mostly defined by their semantics.</p><h2 id=docker-is-dead-long-live-docker-and-oci>Docker is dead, long live Docker&mldr; and OCI!<a hidden class=anchor aria-hidden=true href=#docker-is-dead-long-live-docker-and-oci>#</a></h2><p>When people think of containers, a large group of them may think of Docker. While Docker played a big role in the popularity of containers a few years ago, it didn&rsquo;t introduce the technology: on Linux, LXC did (<em>Linux Containers</em>). In fact, Docker in its early days was a high-level wrapper for LXC which already combined the power of namespaces and cgroups. Docker then replaced LXC with <code>libcontainer</code> which does more or less the same, plus extra features.</p><p>Then, what happened? <em>Open Container Initiative</em> (OCI). That is the current standard that defines the container ecosystem. That means that whether you&rsquo;re using Docker, Podman, or Kubernetes, you&rsquo;re in fact running OCI-compliant tools. That is a good thing, as it saves a lot of interoperability headaches.</p><p><strong>Docker</strong> is no longer the monolithic platform it once was. <code>libcontainer</code> was absorbed by <code>runc</code>, the reference OCI runtime. The high-level components of Docker split into different parts related to the upstream Moby project (Docker is the &ldquo;assembled product&rdquo; of the &ldquo;Moby components&rdquo;). 
When we refer to Docker, we are in fact referring to the powerful high-level API that manages OCI containers. By design, Docker is a daemon that communicates with <code>containerd</code>, a lower-level layer, which in turn communicates with the OCI runtime. That also means that you could very well skip Docker altogether and use <code>containerd</code> or even <code>runc</code> directly.</p><pre tabindex=0><code>Docker client &lt;=&gt; Docker daemon &lt;=&gt; containerd &lt;=&gt; containerd-shim &lt;=&gt; runc
</code></pre><p><strong>Podman</strong> is an alternative to Docker developed by Red Hat that also intends to be a drop-in replacement for Docker. It doesn&rsquo;t work with a daemon, and can work rootless by design (Docker has support for rootless too, but that is not without caveats). I would generally recommend Podman over Docker for someone who wants a simple tool to run containers and test code on their machine.</p><p><strong>Kubernetes</strong> (also known as K8S) is the container platform made by Google. It is designed with scaling in mind, and is about running containers across a cluster whereas Docker focuses on packaging containers on a single node. Docker Swarm is the direct alternative to that, but it never really took off due to the popularity of K8S.</p><p>For the rest of this article, we will use Docker as the reference for our examples, along with the <a href=https://docs.docker.com/compose/compose-file/>Compose specification</a> format. Most of these examples can be adapted to other platforms without issues.</p><h2 id=the-nightmare-of-dependencies>The nightmare of dependencies<a hidden class=anchor aria-hidden=true href=#the-nightmare-of-dependencies>#</a></h2><p>Containers are made from images, and images are typically built from a Dockerfile. Images can be built and distributed through OCI registries: <a href=https://hub.docker.com/>Docker Hub</a>, <a href=https://cloud.google.com/container-registry>Google Container Registry</a>, <a href=https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry>GitHub Container Registry</a>, and so on. You can also set up your own private registry, but the reality is that people often pull images from these public registries.</p><h3 id=images-immutability-and-versioning>Images, immutability and versioning<a hidden class=anchor aria-hidden=true href=#images-immutability-and-versioning>#</a></h3><p>Images are what make containers, well, containers. 
Containers made from the same image should behave similarly on different machines. Images can have <strong>tags</strong>, which are useful for software versioning. The usage of generic tags such as <code>latest</code> is often discouraged because it undermines the predictable behavior expected of the container. Tags are not necessarily immutable by design, and they shouldn&rsquo;t be (more on that below). The <strong>digest</strong>, however, identifies an immutable image, and is often generated with the SHA-256 algorithm.</p><pre tabindex=0><code>docker.io/library/golang:1.17.1@sha256:232a180dbcbcfa7250917507f3827d88a9ae89bb1cdd8fe3ac4db7b764ebb25
    ^           ^          ^            ^
    |           |          |            |
Registry      Image       Tag     Digest (immutable)
</code></pre><p>Now onto why tags shouldn&rsquo;t be immutable: as written above, containers bring us an abstraction over the OS dependencies that are used by the packaged software. That is nice indeed, but this shouldn&rsquo;t lure us into believing that we can forget security updates. The fact is, <strong>there is still a whole OS to care about</strong>, and we can&rsquo;t just think of the container as a simple package tool for software.</p><p>For these reasons, good practices were established:</p><ul><li>An image should be as minimal as possible (Alpine Linux, or scratch/distroless).</li><li>An image, with a given tag, should be regularly built, without cache to ensure all layers are freshly built.</li><li>An image should be rebuilt when the images it&rsquo;s based on are updated.</li></ul><h3 id=a-minimal-base-system>A minimal base system<a hidden class=anchor aria-hidden=true href=#a-minimal-base-system>#</a></h3><p><a href=https://alpinelinux.org/>Alpine Linux</a> is often the choice for official images for the first reason. This is not a typical Linux distribution as it uses musl as its C library, but it works quite well. Actually, I&rsquo;m quite fond of Alpine Linux and <code>apk</code> (its package manager). If a supervision suite is needed, I&rsquo;d look into <code>s6</code>. If you need a glibc distribution, Debian provides slim variants for lightweight base images. We can do even better than using Alpine by using <strong>distroless images</strong>, allowing us to have state-of-the-art application containers.</p><p>&ldquo;Distroless&rdquo; is a fancy name referring to an image with a minimal set of dependencies, from none (for fully static binaries) to some common libraries (typically the C library). Google maintains <a href=https://github.com/GoogleContainerTools/distroless>distroless images</a> you can use as a base for your own images. 
If you were wondering, the difference with <code>scratch</code> (empty starting point) is that distroless images contain common dependencies that &ldquo;almost-statically compiled&rdquo; binaries may need, such as <code>ca-certificates</code>.</p><p>However, distroless images are not suited for every application. In my experience though, distroless is an excellent option with pure Go binaries. Going with minimal images drastically reduces the available attack surface in the container. For example, here&rsquo;s a <a href=https://docs.docker.com/develop/develop-images/multistage-build/>multi-stage Dockerfile</a> resulting in a minimal non-root image for a simple Go project:</p><pre tabindex=0><code>FROM golang:alpine as build
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o /my_app ./cmd/my_app
FROM gcr.io/distroless/static
COPY --from=build /my_app /
USER nobody
ENTRYPOINT [&#34;/my_app&#34;]
</code></pre><p>The main drawback of using minimal images is the lack of tools that help with debugging, which also constitute the very attack surface we&rsquo;re trying to get rid of. The trade-off is probably not worth the hassle for development-focused containers, and if you&rsquo;re running such images in production, you have to be confident enough to operate with them. Note that the <code>gcr.io/distroless</code> images have a <code>:debug</code> tag to help in that regard.</p><h3 id=keeping-images-up-to-date>Keeping images up-to-date<a hidden class=anchor aria-hidden=true href=#keeping-images-up-to-date>#</a></h3><p>The other two points are highly problematic, because most software vendors just publish an image on release, and forget about it. You should take it up with them if you&rsquo;re running images that are versioned but not regularly updated. I&rsquo;d say running scheduled builds <strong>once a week</strong> is the bare minimum to make sure dependencies stay up-to-date. Alpine Linux is a better choice than most other &ldquo;stable&rdquo; distributions because it usually has more recent packages.</p><p>Stable distributions often rely on backporting security fixes from CVEs, which is known to be a flawed approach to security since CVEs aren&rsquo;t always assigned or even taken care of. Alpine has more recent packages, and it has versioning, so it&rsquo;s once again a particularly good choice as long as <code>musl</code> doesn&rsquo;t cause issues.</p><h3 id=is-it-really-a-security-nightmare>Is it really a security nightmare?<a hidden class=anchor aria-hidden=true href=#is-it-really-a-security-nightmare>#</a></h3><p>When people say Docker is a security nightmare because of that, that&rsquo;s a fair point. On a traditional system, you could upgrade your whole system with a command or two. With Docker, you&rsquo;ll have to recreate several containers&mldr; if the images were kept up-to-date in the first place. 
Recreating containers is not a big deal in itself: hot upgrades of binaries and libraries often require the services that use them to restart, otherwise they could still use an old (and vulnerable) version of them in memory. But yeah, the fact is most people are running outdated containers, and more often than not, they don&rsquo;t have the choice if they rely on third-party images.</p><p><a href=https://github.com/aquasecurity/trivy>Trivy</a> is an excellent tool to scan images for a subset of <strong>known vulnerabilities</strong> an image might have. You should play with it and see for yourself how outdated many publicly available images are.</p><h3 id=supply-chain-attacks>Supply-chain attacks<a hidden class=anchor aria-hidden=true href=#supply-chain-attacks>#</a></h3><p>As with any code downloaded from a software vendor, OCI images are not exempt from supply-chain attacks. The good practice is quite simple: rely on official images, and ideally build and maintain your own images. One should definitely not automatically trust random third-party images they can find on Docker Hub. Half of these images, if not more, contain vulnerabilities, and I bet a good portion of them contain malware <a href=https://www.trendmicro.com/vinfo/fr/security/news/virtualization-and-cloud/malicious-docker-hub-container-images-cryptocurrency-mining>such as miners</a> or worse.</p><p>As an image maintainer, you can sign your images to give users stronger assurance of their authenticity. Most official images make use of <a href=https://docs.docker.com/engine/security/trust/>Docker Content Trust</a>, which works with an OCI registry attached to a <a href=https://github.com/notaryproject/notary>Notary server</a>. With the Docker toolset, setting the environment variable <code>DOCKER_CONTENT_TRUST=1</code> enforces signature verification (a signature is only good if it&rsquo;s checked in the first place). 
The Sigstore initiative is developing <a href=https://github.com/sigstore/cosign>cosign</a>, an alternative that doesn&rsquo;t require a Notary server because it works with features already provided by the registry such as tags. Kubernetes users may be interested in <a href=https://github.com/sse-secure-systems/connaisseur>Connaisseur</a> to ensure all signatures have been validated.</p><h2 id=leave-my-root-alone>Leave my root alone!<a hidden class=anchor aria-hidden=true href=#leave-my-root-alone>#</a></h2><h3 id=attack-surface>Attack surface<a hidden class=anchor aria-hidden=true href=#attack-surface>#</a></h3><p>Traditionally, Docker runs as a daemon owned by root. That also means that root in the container is actually the root on the host and may be a few commands away from compromising the host. More generally, the attacker has to exploit the available attack surface to escape the container. There is a huge attack surface, actually: the Linux kernel. <a href=https://grsecurity.net/huawei_hksp_introduces_trivially_exploitable_vulnerability>Someone wise once said</a>:</p><blockquote><p>The kernel can effectively be thought of as the largest, most vulnerable setuid root binary on the system.</p></blockquote><p>That applies particularly to traditional containers which weren&rsquo;t designed to provide a robust level of isolation. A recent example was <a href=https://unit42.paloaltonetworks.com/cve-2022-0492-cgroups/>CVE-2022-0492</a>: the attacker could abuse root in the container to exploit cgroups v1, and compromise the host. Of course defense-in-depth measures would have prevented that, and we&rsquo;ll mention them. But fundamentally, container escapes are possible by design.</p><p>Breaking out via the OCI runtime <code>runc</code> is also possible: <a href=https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736/>CVE-2019-5736</a> was a particularly nasty bug. 
The attacker had to gain access to root in the container first in order to access <code>/proc/[runc-pid]/exe</code>, which tells them where to overwrite the <code>runc</code> binary.</p><p>Good practices have therefore been established:</p><ul><li>Avoid using root in the container, plain and simple.</li><li>Keep the host kernel, Docker and the OCI runtime updated.</li><li>Consider the usage of user namespaces.</li></ul><p>By the way, it goes without saying that any user who has access to the Docker daemon should be considered as privileged as root. Mounting the Docker socket (<code>/var/run/docker.sock</code>) in a container makes it highly privileged, and so it should be avoided. The socket should only be owned by root, and if that doesn&rsquo;t work with your environment, use Docker rootless or Podman.</p><h3 id=avoiding-root>Avoiding root<a hidden class=anchor aria-hidden=true href=#avoiding-root>#</a></h3><p>root can be avoided in different ways in the final container:</p><ul><li>Image creation time: setting the <code>USER</code> instruction in the Dockerfile.</li><li>Container creation time: via the tools available (<code>user:</code> in the Compose file).</li><li>Container runtime: dropping privileges with entrypoint scripts (<code>gosu UID:GID</code>).</li></ul><p>Well-made images with security in mind will have a <code>USER</code> instruction. In my experience, most people will run images blindly, so it&rsquo;s good harm reduction. Setting the user manually works in some images that weren&rsquo;t designed to run without root, and it&rsquo;s also great to mitigate some scenarios where the image is controlled by an attacker. 
You also won&rsquo;t have surprises when mounting volumes, so I highly recommend setting the user explicitly and making sure volume permissions are set correctly once.</p><p>Some images allow users to define their own user with UID/GID environment variables, with an entrypoint script that runs as root and takes care of the volume permissions before dropping privileges. While technically fine, it is still attack surface, and it requires the <code>SETUID</code>/<code>SETGID</code> capabilities to be available in the container.</p><h3 id=user-namespaces-sandbox-or-paradox>User namespaces: sandbox or paradox?<a hidden class=anchor aria-hidden=true href=#user-namespaces-sandbox-or-paradox>#</a></h3><p>As mentioned just above, <a href=https://www.man7.org/linux/man-pages/man7/user_namespaces.7.html>user namespaces</a> are a solution to ensure root in the container is not root on the host. Docker supports user namespaces: for instance, you could set the default mapping in <code>/etc/docker/daemon.json</code>:</p><pre tabindex=0><code> &#34;userns-remap&#34;: &#34;default&#34;
</code></pre><p><code>whoami && sleep 60</code> in the container will return root, but <code>ps -fC sleep</code> on the host will show that the process belongs to another, unprivileged user. That is nice, but it has limitations and therefore shouldn&rsquo;t be considered as a real sandbox. In fact, the paradox is that <a href=https://lists.archlinux.org/pipermail/arch-general/2017-February/043066.html>user namespaces are attack surface</a> (and vulnerabilities are still being found <a href=https://www.openwall.com/lists/oss-security/2022/01/29/1>years later</a>), and it&rsquo;s common wisdom to restrict them to privileged users (<code>kernel.unprivileged_userns_clone=0</code>). That is fine for Docker with its traditional root daemon, but Podman expects you to let unprivileged users interact with user namespaces (so essentially privileged code).</p><p>Enabling <code>userns-remap</code> in Docker shouldn&rsquo;t be a substitute for running unprivileged application containers (where applicable). User namespaces are mostly useful if you intend to run full-fledged OS containers which need root in order to function, but that is out of the scope of the container technologies mentioned in this article; for them, I&rsquo;d argue exposing such a vulnerable attack surface from the host kernel for dubious sandboxing benefits isn&rsquo;t an interesting trade-off to make.</p><h3 id=the-no_new_privs-bit>The no_new_privs bit<a hidden class=anchor aria-hidden=true href=#the-no_new_privs-bit>#</a></h3><p>After ensuring root isn&rsquo;t used in your containers, you should look into setting the <code>no_new_privs</code> bit. <a href=https://docs.kernel.org/userspace-api/no_new_privs.html>This Linux feature</a> prevents syscalls such as <code>execve()</code> from granting privileges, which is exactly what you want in order to restrict privilege escalation inside the container. This flag can be set for a given container in a Compose file:</p><pre tabindex=0><code> security_opt:
   - no-new-privileges:true
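 # Illustrative addition, not part of the original snippet: the same service can
 # also set an unprivileged user explicitly, as recommended in the Avoiding root
 # section. 1000:1000 is a placeholder UID:GID; it is quoted so that YAML reads
 # it as a string.
 user: &#34;1000:1000&#34;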
</code></pre><p>Gaining privileges in the container will be much harder that way.</p><h3 id=capabilities>Capabilities<a hidden class=anchor aria-hidden=true href=#capabilities>#</a></h3><p>Furthermore, we should mention capabilities: root powers are divided into distinct units by the Linux kernel, called capabilities. Each granted capability also grants privilege and therefore access to a significant amount of attack surface. Security researcher Brad Spengler enumerates <a href="https://forums.grsecurity.net/viewtopic.php?f=7&t=2522#p10271">19 important capabilities</a>. Docker <strong>restricts certain capabilities by default</strong>, but <a href=https://github.com/moby/moby/blob/1308a3a99faa13ff279dcb4eb5ad23aee3ab5cdb/oci/caps/defaults.go>some of the most important ones</a> are still available to a container by default.</p><p>You should consider the following rule of thumb:</p><ul><li>Drop all capabilities by default.</li><li>Allow only the ones you really need to.</li></ul><p>If you already run your containers unprivileged without root, your container will very likely work fine with all capabilities dropped. That can be done in a Compose file:</p><pre tabindex=0><code> cap_drop:
   - ALL
 #cap_add:
 #  - CHOWN
 #  - DAC_READ_SEARCH
 #  - SETUID
 #  - SETGID
</code></pre><p>Never use the <code>--privileged</code> option unless you really need to: a privileged container is given access to almost all capabilities, kernel features and devices.</p><h2 id=other-security-features>Other security features<a hidden class=anchor aria-hidden=true href=#other-security-features>#</a></h2><p>MACs and seccomp are robust tools that may vastly improve container security.</p><h3 id=mandatory-access-control>Mandatory Access Control<a hidden class=anchor aria-hidden=true href=#mandatory-access-control>#</a></h3><p>MAC stands for Mandatory Access Control: traditionally, a Linux Security Module that enforces a policy to restrict userspace. Examples are <strong>AppArmor</strong> and <strong>SELinux</strong>: the former is easier to use, the latter is more fine-grained. Both are strong tools that can help&mldr; Yet, their mere presence does not mean they&rsquo;re really effective. A robust policy starts from a <em>deny all</em> policy, and only allows the necessary resources to be accessed.</p><h3 id=seccomp>seccomp<a hidden class=anchor aria-hidden=true href=#seccomp>#</a></h3><p>seccomp (short for secure computing mode) on the other hand is a much simpler and complementary tool, and there is no reason not to use it. It restricts a process to a set of system calls, thus drastically reducing the attack surface available.</p><p>Docker provides default profiles for <a href=https://github.com/moby/moby/tree/85eaf23bf46b12827273ab2ff523c753117dbdc7/profiles/apparmor>AppArmor</a> and <a href=https://github.com/moby/moby/blob/85eaf23bf46b12827273ab2ff523c753117dbdc7/profiles/seccomp/default.json>seccomp</a>, and they&rsquo;re enabled by default for newly created containers unless the <code>unconfined</code> option is explicitly passed. 
Note: Kubernetes doesn&rsquo;t enable the default seccomp profile by default, so you should probably <a href=https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads>enable it</a>.</p><p>These profiles are a great start, but you should do much more if you take security seriously, because they were made to not break compatibility with a large range of images. The default seccomp profile only disables <a href=https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile>around 44 syscalls</a>, which are mostly uncommon or obsolete. Of course, the best profile is one written for a specific program. It also doesn&rsquo;t make sense to dwell on the permissiveness of the default profiles, and <a href=https://blog.jessfraz.com/post/containers-security-and-echo-chambers/>a lot of work has gone</a> into hardening containers.</p><h3 id=cgroups>cgroups<a hidden class=anchor aria-hidden=true href=#cgroups>#</a></h3><p>Use cgroups to restrict access to hardware and system resources. You likely don&rsquo;t want a guest container to monopolize the host resources. You also don&rsquo;t want to be vulnerable to stupid fork bomb attacks. In a Compose file, consider setting these limits:</p><pre tabindex=0><code> mem_limit: 4g
 cpus: 4
 pids_limit: 256
</code></pre><p>More runtime options can be found in <a href=https://docs.docker.com/config/containers/resource_constraints/>the official documentation</a>. All of them should have a <a href=https://github.com/compose-spec/compose-spec/blob/master/spec.md>Compose spec</a> equivalent.</p><p>The <code>--cgroup-parent</code> option should be avoided as it uses the host cgroup rather than the one configured by Docker (or another tool), which is the default.</p><h3 id=read-only-filesystem>Read-only filesystem<a hidden class=anchor aria-hidden=true href=#read-only-filesystem>#</a></h3><p>It is good practice to treat the image as what some refer to as the &ldquo;golden image&rdquo;.</p><p>In other words, you&rsquo;ll run containers in <em>read-only</em> mode, with an immutable filesystem inherited from the image. Only the mounted volumes will be read/write accessible, and those should ideally be mounted with the <code>noexec</code>, <code>nosuid</code> and <code>nodev</code> options for extra security. If read/write access isn&rsquo;t needed, mount these volumes as read-only too.</p><p>However, the image may not be perfect and still require read/write access to some parts of the filesystem, likely directories such as <code>/tmp</code>, <code>/run</code> or <code>/var</code>. You can make a <strong>tmpfs</strong> for those (a temporary filesystem backed by the memory allotted to the container), because they&rsquo;re not persistent data anyway.</p><p>In a Compose file, that would look like the following settings:</p><pre tabindex=0><code> read_only: true
 tmpfs:
   - /tmp:size=10M,mode=0770,uid=1000,gid=1000,noexec,nosuid,nodev
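 # Hedged sketch, not from the original snippet: a volume the container only
 # needs to read can be mounted read-only with the :ro suffix of the short
 # syntax. ./config and /config are placeholder paths chosen for illustration.
 volumes:
   - ./config:/config:ro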
</code></pre><p>That is quite verbose indeed, but that&rsquo;s to show you the different options for a tmpfs mount. Ideally, you want to restrict them in size and permissions.</p><h3 id=network-isolation>Network isolation<a hidden class=anchor aria-hidden=true href=#network-isolation>#</a></h3><p>By default, all Docker containers will use the default network bridge. They will see and be able to communicate with each other. Each container should have its own user-defined bridge network, and each connection between containers should have an internal network. If you intend to run a reverse proxy in front of several containers, you should make a dedicated network for each container you want to expose to the reverse proxy.</p><p>The <code>--network host</code> option shouldn&rsquo;t be used either, since the container would share the same network as the host, providing no isolation at all.</p><h2 id=alternative-runtimes-gvisor>Alternative runtimes (gVisor)<a hidden class=anchor aria-hidden=true href=#alternative-runtimes-gvisor>#</a></h2><p><code>runc</code> is the reference OCI runtime, which means other runtimes can exist as long as they&rsquo;re compliant with the OCI standard. These runtimes can be interchanged quite seamlessly. There are a few alternatives, such as <a href=https://github.com/containers/crun>crun</a> or <a href=https://github.com/containers/youki>youki</a>, respectively implemented in C and Rust (<code>runc</code> is a Go implementation). However, there is one particular runtime that does a lot more for security: <code>runsc</code>, provided by the <a href=https://gvisor.dev/>gVisor project</a> from the folks at Google.</p><p><strong>Containers are not a sandbox</strong>, and while we can improve their security, they will fundamentally share a common attack surface with the host. Virtual machines are a solution to that problem, but you might prefer container semantics and ecosystem. 
gVisor can be perceived as an attempt to get the &ldquo;best of both worlds&rdquo;: containers that are easy to manage while providing a native isolation boundary. gVisor did just that by implementing two things:</p><ul><li><strong>Sentry</strong>: an application kernel in Go, a language known to be memory-safe. It implements the Linux logic in userspace such as various system calls.</li><li><strong>Gofer</strong>: a host process which communicates with Sentry and the host filesystem, since Sentry is restricted in that aspect.</li></ul><p>A platform like ptrace or KVM is used to intercept system calls and redirect them from the application to Sentry, which is running in the userspace. This has some costs: there is a higher per-syscall overhead, and compatibility is reduced since not all syscalls are implemented. On top of that, gVisor employs security mechanisms we&rsquo;ve glanced over above, such as a <a href=https://github.com/google/gvisor/blob/86ad7d5b5838da1b539e976886d04b93c939ca3d/runsc/boot/filter/config.go>very restrictive seccomp profile</a> between Sentry and the host kernel, the <a href=https://github.com/google/gvisor/blob/6ef268409620c57197b9d573e23be8cb05dbf381/pkg/sentry/kernel/task_identity.go#L464>no_new_privs bit</a>, and isolated namespaces from the host.</p><p>The security model of gVisor is comparable to what you would expect from a virtual machine. It is also very easy to <a href=https://gvisor.dev/docs/user_guide/install/>install and use</a>. The path to runsc along with its different configuration flags (<code>runsc flags</code>) should be added to <code>/etc/docker/daemon.json</code>:</p><pre tabindex=0><code> &#34;runtimes&#34;: {
     &#34;runsc-ptrace&#34;: {
         &#34;path&#34;: &#34;/usr/local/bin/runsc&#34;,
         &#34;runtimeArgs&#34;: [
             &#34;--platform=ptrace&#34;
         ]
     },
     &#34;runsc-kvm&#34;: {
         &#34;path&#34;: &#34;/usr/local/bin/runsc&#34;,
         &#34;runtimeArgs&#34;: [
             &#34;--platform=kvm&#34;
         ]
     }
 }
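</code></pre><p>As a rough sketch of how these entries are then used: once the Docker daemon has been restarted to pick up the new configuration, a runtime can be selected per container with the <code>--runtime</code> flag and one of the names declared above. The <code>alpine</code> image and the systemd-based restart are only assumptions for the sake of the example; adapt them to your setup.</p><pre tabindex=0><code># Restart the daemon so it re-reads /etc/docker/daemon.json (assumes a systemd host)
sudo systemctl restart docker

# Run a throwaway container under gVisor with the KVM platform
docker run --rm --runtime=runsc-kvm alpine uname -a

# Same container, this time with the ptrace platform
docker run --rm --runtime=runsc-ptrace alpine uname -a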
</code></pre><p><code>runsc</code> needs to be started as root to set up some mitigations, including its own network stack, separate from the host&rsquo;s. The sandbox itself drops privileges to nobody as soon as possible. You can still run <code>runsc</code> rootless if you want (which should be needed for Podman):</p><pre tabindex=0><code>./runsc --rootless do uname -a
*** Warning: sandbox network isn&#39;t supported with --rootless, switching to host ***
Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux
</code></pre><p>Linux 4.4.0 is shown because that is the version of the Linux API that Sentry tries to mimic. As you&rsquo;ve probably guessed, you&rsquo;re not really using Linux 4.4.0, but the application kernel that behaves like it. By the way, gVisor is of course compatible with cgroups.</p><h2 id=conclusion-whats-a-container-after-all>Conclusion: what&rsquo;s a container after all?<a hidden class=anchor aria-hidden=true href=#conclusion-whats-a-container-after-all>#</a></h2><p>As I wrote above, a container is mostly defined by its semantics and ecosystem. Containers shouldn&rsquo;t be solely defined by the OCI reference runtime implementation, as we&rsquo;ve seen with gVisor, which provides an entirely different security model.</p><p>Still not convinced? What if I told you a container can leverage the same technologies as a virtual machine? That is exactly what <a href=https://katacontainers.io/>Kata Containers</a> does by using a VMM like QEMU-lite to provide containers that are in fact lightweight virtual machines, with their traditional resources and security model, compatibility with container semantics and toolset, and an optimized overhead. While not in the OCI ecosystem, Amazon achieves much the same with <a href=https://firecracker-microvm.github.io/>Firecracker</a>.</p><p>If you&rsquo;re running untrusted workloads, I highly suggest you consider gVisor instead of a traditional container runtime. Your definition of &ldquo;untrusted&rdquo; may vary: for me, almost everything should be considered untrusted. That is how modern security works, and how mobile operating systems work. It&rsquo;s quite simple: security should be simple, and gVisor simply offers native security.</p><p>Containers are a popular, yet strange world. They revolutionized the way we make and deploy software, but one should not lose sight of what they really are and aren&rsquo;t. 
This hardening guide is non-exhaustive, but I hope it can make you aware of some aspects you&rsquo;ve never thought of.</p></div><footer class=post-footer><ul class=post-tags><li><a href=https://privsec.dev/tags/operating-systems/>operating systems</a></li><li><a href=https://privsec.dev/tags/linux/>linux</a></li><li><a href=https://privsec.dev/tags/container/>container</a></li><li><a href=https://privsec.dev/tags/security/>security</a></li></ul><nav class=paginav><a class=prev href=https://privsec.dev/os/choosing-your-desktop-linux-distribution/><span class=title>« Prev</span><br><span>Choosing Your Desktop Linux Distribution</span></a>
<a class=next href=https://privsec.dev/os/linux-insecurities/><span class=title>Next »</span><br><span>Linux Insecurities</span></a></nav></footer></article></main><footer class=footer><span>&copy; 2022 <a href=https://privsec.dev>PrivSec.dev</a></span>
<span>Powered by
<a href=https://gohugo.io/ rel="noopener noreferrer" target=_blank>Hugo</a> &
<a href=https://github.com/adityatelange/hugo-PaperMod/ rel=noopener target=_blank>PaperMod</a></span></footer><a href=#top aria-label="go to top" title="Go to Top (Alt + G)" class=top-link id=top-link accesskey=g><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 12 6" fill="currentcolor"><path d="M12 6H0l6-6z"/></svg></a><script>let menu=document.getElementById("menu");menu&&(menu.scrollLeft=localStorage.getItem("menu-scroll-position"),menu.onscroll=function(){localStorage.setItem("menu-scroll-position",menu.scrollLeft)}),document.querySelectorAll('a[href^="#"]').forEach(e=>{e.addEventListener("click",function(e){e.preventDefault();var t=this.getAttribute("href").substr(1);window.matchMedia("(prefers-reduced-motion: reduce)").matches?document.querySelector(`[id='${decodeURIComponent(t)}']`).scrollIntoView():document.querySelector(`[id='${decodeURIComponent(t)}']`).scrollIntoView({behavior:"smooth"}),t==="top"?history.replaceState(null,null," "):history.pushState(null,null,`#${t}`)})})</script><script>var mybutton=document.getElementById("top-link");window.onscroll=function(){document.body.scrollTop>800||document.documentElement.scrollTop>800?(mybutton.style.visibility="visible",mybutton.style.opacity="1"):(mybutton.style.visibility="hidden",mybutton.style.opacity="0")}</script><script>document.getElementById("theme-toggle").addEventListener("click",()=>{document.body.className.includes("dark")?(document.body.classList.remove("dark"),localStorage.setItem("pref-theme","light")):(document.body.classList.add("dark"),localStorage.setItem("pref-theme","dark"))})</script><script>document.querySelectorAll("pre > code").forEach(e=>{const n=e.parentNode.parentNode,t=document.createElement("button");t.classList.add("copy-code"),t.innerHTML="copy";function s(){t.innerHTML="copied!",setTimeout(()=>{t.innerHTML="copy"},2e3)}t.addEventListener("click",t=>{if("clipboard"in navigator){navigator.clipboard.writeText(e.textContent),s();return}const 
n=document.createRange();n.selectNodeContents(e);const o=window.getSelection();o.removeAllRanges(),o.addRange(n);try{document.execCommand("copy"),s()}catch{}o.removeRange(n)}),n.classList.contains("highlight")?n.appendChild(t):n.parentNode.firstChild==n||(e.parentNode.parentNode.parentNode.parentNode.parentNode.nodeName=="TABLE"?e.parentNode.parentNode.parentNode.parentNode.parentNode.appendChild(t):e.parentNode.appendChild(t))})</script></body></html>