mirror of https://github.com/PrivSec-dev/privsec.dev synced 2025-01-24 21:11:33 -05:00
privsec.dev/public/os/docker-and-oci-hardening/index.html


<!doctype html><html lang=en dir=auto><head><meta charset=utf-8><meta http-equiv=x-ua-compatible content="IE=edge"><meta name=viewport content="width=device-width,initial-scale=1,shrink-to-fit=no"><meta name=robots content="index, follow"><title>Docker and OCI Hardening | PrivSec.dev</title><meta name=keywords content="operating systems,linux,container,security"><meta name=description content="Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:
- Hey, your software doesn&rsquo;t work&mldr;
- Sorry, it works on my computer! Can&rsquo;t help you.
Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries."><meta name=author content="Wonderfall"><link rel=canonical href=https://wonderfall.dev/docker-hardening/><link crossorigin=anonymous href=/assets/css/stylesheet.8b523f1730c922e314350296d83fd666efa16519ca136320a93df674d00b6325.css integrity="sha256-i1I/FzDJIuMUNQKW2D/WZu+hZRnKE2MgqT32dNALYyU=" rel="preload stylesheet" as=style><script defer crossorigin=anonymous src=/assets/js/highlight.f413e19d0714851f6474e7ee9632408e58ac146fbdbe62747134bea2fa3415e0.js integrity="sha256-9BPhnQcUhR9kdOfuljJAjlisFG+9vmJ0cTS+ovo0FeA=" onload=hljs.initHighlightingOnLoad()></script>
<link rel=icon href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=icon type=image/png sizes=16x16 href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=icon type=image/png sizes=32x32 href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=apple-touch-icon href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><link rel=mask-icon href=https://privsec.dev/%3Clink%20/%20abs%20url%3E><meta name=theme-color content="#2e2e33"><meta name=msapplication-TileColor content="#2e2e33"><noscript><style>#theme-toggle,.top-link{display:none}</style></noscript><meta property="og:title" content="Docker and OCI Hardening"><meta property="og:description" content="Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:
- Hey, your software doesn&rsquo;t work&mldr;
- Sorry, it works on my computer! Can&rsquo;t help you.
Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries."><meta property="og:type" content="article"><meta property="og:url" content="https://privsec.dev/os/docker-and-oci-hardening/"><meta property="article:section" content="os"><meta name=twitter:card content="summary"><meta name=twitter:title content="Docker and OCI Hardening"><meta name=twitter:description content="Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:
- Hey, your software doesn&rsquo;t work&mldr;
- Sorry, it works on my computer! Can&rsquo;t help you.
Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries."><script type=application/ld+json>{"@context":"https://schema.org","@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":2,"name":"Operating Systems","item":"https://privsec.dev/os/"},{"@type":"ListItem","position":3,"name":"Docker and OCI Hardening","item":"https://privsec.dev/os/docker-and-oci-hardening/"}]}</script><script type=application/ld+json>{"@context":"https://schema.org","@type":"BlogPosting","headline":"Docker and OCI Hardening","name":"Docker and OCI Hardening","description":"Containers aren\u0026rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:\n- Hey, your software doesn\u0026rsquo;t work\u0026hellip;\n- Sorry, it works on my computer! Can\u0026rsquo;t help you.\nWhether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries.","keywords":["operating systems","linux","container","security"],"articleBody":"Containers arent that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:\n- Hey, your software doesnt work\n- Sorry, it works on my computer! Cant help you.\nWhether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries. The developer can therefore provide a known-good environment where it is expected that their software just works. 
That is particularly useful for development to eliminate environment-related issues, and that is often used in production as well.\nContainers are often perceived as a great tool for isolation, that is, they can provide an isolated workspace that wont pollute your host OS - all that without the overhead of virtual machines. Security-wise: containers, as we know them on Linux, are glorified namespaces at their core. Containers usually share the same kernel with the host, and namespaces is the kernel feature for separating kernel resources across containers (IDs, networks, filesystems, IPC, etc.). Containers also leverage the features of cgroups to separate system resources (CPU, memory, etc.), and security features such as seccomp to restrict syscalls, or MACs (AppArmor, SELinux).\nAt first, it seems that containers may not provide the same isolation boundary as virtual machines. Thats fine, they were not designed to. But they cant be simplified to a simple chroot either. Well see that a container can mean a lot of things, and their definition may vary a lot depending on the implementation: as such, containers are mostly defined by their semantics.\nDocker is dead, long live Docker and OCI! When people think of containers, a large group of them may think of Docker. While Docker played a big role in the popularity of containers a few years ago, it didnt introduce the technology: on Linux, LXC did (Linux Containers). In fact, Docker in its early days was a high-level wrapper for LXC which already combined the power of namespaces and cgroups. Docker then replaced LXC with libcontainer which does more or less the same, plus extra features.\nThen, what happened? Open Container Initiative (OCI). That is the current standard that defines the container ecosystem. That means that whether youre using Docker, Podman, or Kubernetes, youre in fact running OCI-compliant tools. 
That is a good thing, as it saves a lot of interoperability headaches.\nDocker is no longer the monolithic platform it once was. libcontainer was absorbed by runc, the reference OCI runtime. The high-level components of Docker split i
&nbsp;|&nbsp;<span>Originally published at&nbsp;<a href=https://wonderfall.dev/docker-hardening/ title=https://wonderfall.dev/docker-hardening/ target=_blank rel="noopener noreferrer">wonderfall.dev</a></span></div></header><div class=toc><details><summary accesskey=c title="(Alt + C)"><span class=details>Table of Contents</span></summary><div class=inner><ul><li><a href=#docker-is-dead-long-live-docker-and-oci aria-label="Docker is dead, long live Docker&amp;hellip; and OCI!">Docker is dead, long live Docker&mldr; and OCI!</a></li><li><a href=#the-nightmare-of-dependencies aria-label="The nightmare of dependencies">The nightmare of dependencies</a><ul><li><a href=#images-immutability-and-versioning aria-label="Images, immutability and versioning">Images, immutability and versioning</a></li><li><a href=#a-minimal-base-system aria-label="A minimal base system">A minimal base system</a></li><li><a href=#keeping-images-up-to-date aria-label="Keeping images up-to-date">Keeping images up-to-date</a></li><li><a href=#is-it-really-a-security-nightmare aria-label="Is it really a security nightmare?">Is it really a security nightmare?</a></li><li><a href=#supply-chain-attacks aria-label="Supply-chain attacks">Supply-chain attacks</a></li></ul></li><li><a href=#leave-my-root-alone aria-label="Leave my root alone!">Leave my root alone!</a><ul><li><a href=#attack-surface aria-label="Attack surface">Attack surface</a></li><li><a href=#avoiding-root aria-label="Avoiding root">Avoiding root</a></li><li><a href=#user-namespaces-sandbox-or-paradox aria-label="User namespaces: sandbox or paradox?">User namespaces: sandbox or paradox?</a></li><li><a href=#the-no_new_privs-bit aria-label="The no_new_privs bit">The no_new_privs bit</a></li><li><a href=#capabilities aria-label=Capabilities>Capabilities</a></li></ul></li><li><a href=#other-security-features aria-label="Other security features">Other security features</a><ul><li><a href=#mandatory-access-control aria-label="Mandatory 
Access Control">Mandatory Access Control</a></li><li><a href=#seccomp aria-label=seccomp>seccomp</a></li><li><a href=#cgroups aria-label=cgroups>cgroups</a></li><li><a href=#read-only-filesystem aria-label="Read-only filesystem">Read-only filesystem</a></li><li><a href=#network-isolation aria-label="Network isolation">Network isolation</a></li></ul></li><li><a href=#alternative-runtimes-gvisor aria-label="Alternative runtimes (gVisor)">Alternative runtimes (gVisor)</a></li><li><a href=#conclusion-whats-a-container-after-all aria-label="Conclusion: what&amp;rsquo;s a container after all?">Conclusion: what&rsquo;s a container after all?</a></li></ul></div></details></div><div class=post-content><p>Containers aren&rsquo;t that new fancy thing anymore, but they were a big deal. And they still are. They are a concrete solution to the following problem:</p><blockquote><p>- Hey, your software doesn&rsquo;t work&mldr;</p><p>- Sorry, it works on my computer! Can&rsquo;t help you.</p></blockquote><p>Whether we like them or not, containers are here to stay. Their expressiveness and semantics allow for an abstraction of the OS dependencies that a software has, the latter being often dynamically linked against certain libraries. The developer can therefore provide a known-good environment where it is expected that their software &ldquo;just works&rdquo;. That is particularly useful for development to eliminate environment-related issues, and that is often used in production as well.</p><p>Containers are often perceived as a great tool for isolation, that is, they can provide an isolated workspace that won&rsquo;t pollute your host OS - all that without the overhead of virtual machines. Security-wise: containers, as we know them on Linux, are glorified namespaces at their core. Containers usually share the same kernel with the host, and <strong>namespaces</strong> is the kernel feature for separating kernel resources across containers (IDs, networks, filesystems, IPC, etc.). 
Containers also leverage the features of <strong>cgroups</strong> to separate system resources (CPU, mem
</code></pre><p><strong>Podman</strong> is an alternative to Docker developed by Red Hat that also intends to be a drop-in replacement for Docker. It doesn&rsquo;t work with a daemon, and can work rootless by design (Docker has support for rootless too, but that is not without caveats). I would largely recommend Podman over Docker for someone who wants a simple tool to run containers and test code on their machine.</p><p><strong>Kubernetes</strong> (also known as K8S) is the container platform made by Google. It is designed with scaling in mind, and is about running containers across a cluster whereas Docker focuses on packaging containers on a single node. Docker Swarm is the direct alternative to that, but it never really took off due to the popularity of K8S.</p><p>For the rest of this article, we will use Docker as the reference for our examples, along with the <a href=https://docs.docker.com/compose/compose-file/>Compose specification</a> format. Most of these examples can be adapted to other platforms without issues.</p><h2 id=the-nightmare-of-dependencies>The nightmare of dependencies<a hidden class=anchor aria-hidden=true href=#the-nightmare-of-dependencies>#</a></h2><p>Containers are made from images, and images are typically built from a Dockerfile. Images can be built and distributed through OCI registries: <a href=https://hub.docker.com/>Docker Hub</a>, <a href=https://cloud.google.com/container-registry>Google Container Registry</a>, <a href=https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry>GitHub Container Registry</a>, and so on. You can also set up your own private registry, but the reality is that people often pull images from these public registries.</p><h3 id=images-immutability-and-versioning>Images, immutability and versioning<a hidden class=anchor aria-hidden=true href=#images-immutability-and-versioning>#</a></h3><p>Images are what make containers, well, containers. 
Containers made from the same image should behave similarly on different machines. Images can have <strong>tags</strong>, which are useful for software versioning. The usage of generic tags such as <code>latest</code> is often discouraged because it defeats the purpose of the expected behavior of the container. Tags are not necessarily immutable by design, and they shouldn&rsquo;t be (more on that below). <strong>Digest</strong>, however, is the attribute of an immutable image, and is often generated with the SHA-256 algorithm.</p><pre tabindex=0><code>docker.io/library/golang:1.17.1@sha256:232a180dbcbcfa7250917507f3827d88a9ae89bb1cdd8fe3ac4db7b764ebb25
    ^            ^         ^                                      ^
    |            |         |                                      |
Registry       Image      Tag                            Digest (immutable)
</code></pre><p>Now onto why tags shouldn&rsquo;t be immutable: as written above, containers bring us an abstraction over the OS dependencies that are used by the packaged software. That is nice indeed, but this shouldn&rsquo;t lure us into believing that we can forget security updates. The fact is, <strong>there is still a whole OS to care about</strong>, and we can&rsquo;t just think of the container as a simple package tool for software.</p><p>For these reasons, good practices were established:</p><ul><li>An image should be as minimal as possible (Alpine Linux, or scratch/distroless).</li><li>An image, with a given tag, should be regularly built, without cache to ensure all layers are freshly built.</li><li>An image should be rebuilt when the images it&rsquo;s based on are updated.</li></ul><h3 id=a-minimal-base-system>A minimal base system<a hidden class=anchor aria-hidden=true href=#a-minimal-base-system>#</a></h3><p><a href=https://alpinelinux.org/>Alpine Linux</a> is often the choice for official images for the first reason. This is not a typical Linux distribution as it uses musl as its C library, but it works quite well. Actually, I&rsquo;m quite fond of Alpine Linux and <code>apk</code> (its package manager). If a supervision suite is needed, I&rsquo;d look into <code>s6</code>. If you need a glibc distribution, Debian provides slim variants for lightweight base images. We can do even better than using Alpine by using <strong>distroless images</strong>, allowing us to have state-of-the-art application containers.</p><p>&ldquo;Distroless&rdquo; is a fancy name referring to an image with a minimal set of dependencies, from none (for fully static binaries) to some common libraries (typically the C library). Google maintains <a href=https://github.com/GoogleContainerTools/distroless>distroless images</a> you can use as a base for your own images. 
If you were wondering, the difference with <code>scratch</code> (empty starting point) is that distroless images contain common dependencies that &ldquo;almost-statically compiled&rdquo; binaries may need, such as <code>ca-certificates</code>.</p><p>However, distroless images are not suited for every application. In my experience though, distroless is an excellent option with pure Go binaries. Going with minimal images drastically reduces the available attack surface in the container. For example, here&rsquo;s a <a href=https://docs.docker.com/develop/develop-images/multistage-build/>multi-stage Dockerfile</a> resulting in a minimal non-root image for a simple Go project:</p><pre tabindex=0><code>FROM golang:alpine as build
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 go build -o /my_app ./cmd/my_app
FROM gcr.io/distroless/static
COPY --from=build /my_app /
USER nobody
ENTRYPOINT [&#34;/my_app&#34;]
</code></pre><p>The main drawback of using minimal images is the lack of tools that help with debugging, which also constitute the very attack surface we&rsquo;re trying to get rid of. The trade-off is probably not worth the hassle for development-focused containers, and if you&rsquo;re running such images in production, you have to be confident enough to operate with them. Note that the <code>gcr.io/distroless</code> images have a <code>:debug</code> tag to help in that regard.</p><h3 id=keeping-images-up-to-date>Keeping images up-to-date<a hidden class=anchor aria-hidden=true href=#keeping-images-up-to-date>#</a></h3><p>The two other points are highly problematic, because most software vendors just publish an image on release, and forget about it. You should take it up with them if you&rsquo;re running images that are versioned but not regularly updated. I&rsquo;d say running scheduled builds <strong>once a week</strong> is the bare minimum to make sure dependencies stay up-to-date. Alpine Linux is a better choice than most other &ldquo;stable&rdquo; distributions because it usually has more recent packages.</p><p>Stable distributions often rely on backporting security fixes from CVEs, which is known to be a flawed approach to security since CVEs aren&rsquo;t always assigned or even taken care of. Alpine has more recent packages, and it has versioning, so it&rsquo;s once again a particularly good choice as long as <code>musl</code> doesn&rsquo;t cause issues.</p><h3 id=is-it-really-a-security-nightmare>Is it really a security nightmare?<a hidden class=anchor aria-hidden=true href=#is-it-really-a-security-nightmare>#</a></h3><p>When people say Docker is a security nightmare because of that, that&rsquo;s a fair point. On a traditional system, you could upgrade your whole system with a single command or two. With Docker, you&rsquo;ll have to recreate several containers&mldr; if the images were kept up-to-date in the first place. 
Recreating containers is not a big deal in itself: hot upgrades of binaries and libraries often require the services that use them to restart, otherwise they could still use an old (and vulnerable) version of them in memory. But yeah, the fact is most people are running outdated containers, and more often than not, they don&rsquo;t have the choice if they rely on third-party images.</p><p><a href=https://github.com/aquasecurity/trivy>Trivy</a> is an excellent tool to scan images for a subset of <strong>known vulnerabilities</strong> an image might have. You should play with it and see for yourself how outdated many publicly available images are.</p><h3 id=supply-chain-attacks>Supply-chain attacks<a hidden class=anchor aria-hidden=true href=#supply-chain-attacks>#</a></h3><p>As with any code downloaded from a software vendor, OCI images are not exempt from supply-chain attacks. The good practice is quite simple: rely on official images, and ideally build and maintain your own images. One should definitely not automatically trust random third-party images they can find on Docker Hub. Half of these images, if not more, contain vulnerabilities, and I bet a good portion of them contain malware <a href=https://www.trendmicro.com/vinfo/fr/security/news/virtualization-and-cloud/malicious-docker-hub-container-images-cryptocurrency-mining>such as miners</a> or worse.</p><p>As an image maintainer, you can sign your images to improve the authenticity assurance. Most official images make use of <a href=https://docs.docker.com/engine/security/trust/>Docker Content Trust</a>, which works with an OCI registry attached to a <a href=https://github.com/notaryproject/notary>Notary server</a>. With the Docker toolset, setting the environment variable <code>DOCKER_CONTENT_TRUST=1</code> enforces signature verification (a signature is only good if it&rsquo;s checked in the first place). 
The SigStore initiative is developing <a href=https://github.com/sigstore/cosign>cosign</a>, an alternative that doesn&rsquo;t require a Notary server because it works with features already provided by the registry such as tags. Kubernetes use
</code></pre><p><code>whoami && sleep 60</code> in the container will return root, but <code>ps -fC sleep</code> on the host will show us the PID of another user. That is nice, but it has limitations and therefore shouldn&rsquo;t be considered as a real sandbox. In fact, the paradox is that <a href=https://lists.archlinux.org/pipermail/arch-general/2017-February/043066.html>user namespaces are attack surface</a> (and vulnerabilities are still being found <a href=https://www.openwall.com/lists/oss-security/2022/01/29/1>years later</a>), and it&rsquo;s common wisdom to restrict them to privileged users (<code>kernel.unprivileged_userns_clone=0</code>). That is fine for Docker with its traditional root daemon, but Podman expects you to let unprivileged users interact with user namespaces (so essentially privileged code).</p><p>Enabling <code>userns-remap</code> in Docker shouldn&rsquo;t be a substitute for running unprivileged application containers (where applicable). User namespaces are mostly useful if you intend to run full-fledged OS containers which need root in order to function, but that is out of the scope of the container technologies mentioned in this article; for them, I&rsquo;d argue exposing such a vulnerable attack surface from the host kernel for dubious sandboxing benefits isn&rsquo;t an interesting trade-off to make.</p><h3 id=the-no_new_privs-bit>The no_new_privs bit<a hidden class=anchor aria-hidden=true href=#the-no_new_privs-bit>#</a></h3><p>After ensuring root isn&rsquo;t used in your containers, you should look into setting the <code>no_new_privs</code> bit. <a href=https://docs.kernel.org/userspace-api/no_new_privs.html>This Linux feature</a> restricts syscalls such as <code>execve()</code> from granting privileges, which is what you want to restrict in-container privilege escalation. This flag can be set for a given container in a Compose file:</p><pre tabindex=0><code> security_opt:
   - no-new-privileges:true
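 # Sketch (assumption, not from the original article): the flag pairs
 # naturally with a non-root user for the service. The 1000:1000 UID/GID
 # pair is hypothetical; pick one that matches the image.
 user: &#34;1000:1000&#34;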
</code></pre><p>Gaining privileges in the container will be much harder that way.</p><h3 id=capabilities>Capabilities<a hidden class=anchor aria-hidden=true href=#capabilities>#</a></h3><p>Furthermore, we should mention capabilities: root powers are divided into distinct units by the Linux kernel, called capabilities. Each granted capability also grants privilege and therefore access to a significant amount of attack surface. Security researcher Brad Spengler enumerates <a href="https://forums.grsecurity.net/viewtopic.php?f=7&t=2522#p10271">19 important capabilities</a>. Docker <strong>restricts certain capabilities by default</strong>, but <a href=https://github.com/moby/moby/blob/1308a3a99faa13ff279dcb4eb5ad23aee3ab5cdb/oci/caps/defaults.go>some of the most important ones</a> are still available to a container by default.</p><p>You should consider the following rule of thumb:</p><ul><li>Drop all capabilities by default.</li><li>Allow only the ones you really need to.</li></ul><p>If you already run your containers unprivileged without root, your container will very likely work fine with all capabilities dropped. That can be done in a Compose file:</p><pre tabindex=0><code> cap_drop:
   - ALL
 #cap_add:
 #  - CHOWN
 #  - DAC_READ_SEARCH
 #  - SETUID
 #  - SETGID
</code></pre><p>Never use the <code>--privileged</code> option unless you really need to: a privileged container is given access to almost all capabilities, kernel features and devices.</p><h2 id=other-security-features>Other security features<a hidden class=anchor aria-hidden=true href=#other-security-features>#</a></h2><p>MACs and seccomp are robust tools that may vastly improve container security.</p><h3 id=mandatory-access-control>Mandatory Access Control<a hidden class=anchor aria-hidden=true href=#mandatory-access-control>#</a></h3><p>MAC stands for Mandatory Access Control: traditionally a Linux Security Module that will enforce a policy to restrict the userspace. Examples are <strong>AppArmor</strong> and <strong>SELinux</strong>: the former being easier to use, the latter being more fine-grained. Both are strong tools that can help&mldr; Yet, their sole presence does not mean they&rsquo;re really effective. A robust policy starts from a <em>deny all</em> policy, and only allows the necessary resources to be accessed.</p><h3 id=seccomp>seccomp<a hidden class=anchor aria-hidden=true href=#seccomp>#</a></h3><p>seccomp (short for secure computing mode) on the other hand is a much simpler and complementary tool, and there is no reason not to use it. It restricts a process to a set of system calls, thus drastically reducing the attack surface available.</p><p>Docker provides default profiles for <a href=https://github.com/moby/moby/tree/85eaf23bf46b12827273ab2ff523c753117dbdc7/profiles/apparmor>AppArmor</a> and <a href=https://github.com/moby/moby/blob/85eaf23bf46b12827273ab2ff523c753117dbdc7/profiles/seccomp/default.json>seccomp</a>, and they&rsquo;re enabled by default for newly created containers unless the <code>unconfined</code> option is explicitly passed. 
Note: Kubernetes doesn&rsquo;t enable the default seccomp profile by default, so you should probably <a href=https://kubernetes.io/docs/tutorials/security/seccomp/#enable-the-use-of-runtimedefault-as-the-default-seccomp-profile-for-all-workloads>try it</a>.</p><p>These profiles are a great start, but you should do much more if you take security seriously, because they were made to not break compatibility with a large range of images. The default seccomp profile only disables <a href=https://docs.docker.com/engine/security/seccomp/#significant-syscalls-blocked-by-the-default-profile>around 44 syscalls</a>, which are mostly not very common and/or obsolete. Of course, the best profile you can get is supposed to be written for a given program. It also doesn&rsquo;t make sense to insist on the permissiveness of the default profiles, and <a href=https://blog.jessfraz.com/post/containers-security-and-echo-chambers/>a lot of work has gone</a> into hardening containers.</p><h3 id=cgroups>cgroups<a hidden class=anchor aria-hidden=true href=#cgroups>#</a></h3><p>Use cgroups to restrict access to hardware and system resources. You likely don&rsquo;t want a guest container to monopolize the host resources. You also don&rsquo;t want to be vulnerable to stupid fork bomb attacks. In a Compose file, consider setting these limits:</p><pre tabindex=0><code> mem_limit: 4g
 cpus: 4
 pids_limit: 256
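 # Sketch (assumption, not from the original article): one more cgroup-backed
 # Compose key. Setting memswap_limit equal to mem_limit should leave the
 # container with no extra swap; the value shown is illustrative.
 memswap_limit: 4g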
</code></pre><p>More runtime options can be found in <a href=https://docs.docker.com/config/containers/resource_constraints/>the official documentation</a>. All of them should have a <a href=https://github.com/compose-spec/compose-spec/blob/master/spec.md>Compose spec</a> equivalent.</p><p>The <code>--cgroup-parent</code> option should be avoided as it uses the host cgroup and not the one configured by Docker (or another engine), which is the default.</p><h3 id=read-only-filesystem>Read-only filesystem<a hidden class=anchor aria-hidden=true href=#read-only-filesystem>#</a></h3><p>It is good practice to treat the image as what some refer to as the &ldquo;golden image&rdquo;.</p><p>In other words, you&rsquo;ll run containers in <em>read-only</em> mode, with an immutable filesystem inherited from the image. Only the mounted volumes will be read/write accessible, and those should ideally be mounted with the <code>noexec</code>, <code>nosuid</code> and <code>nodev</code> options for extra security. If read/write access isn&rsquo;t needed, mount these volumes as read-only too.</p><p>However, the image may not be perfect and still require read/write access to some parts of the filesystem, likely directories such as <code>/tmp</code>, <code>/run</code> or <code>/var</code>. You can make a <strong>tmpfs</strong> for those (a temporary filesystem backed by the memory allotted to the container), because they&rsquo;re not persistent data anyway.</p><p>In a Compose file, that would look like the following settings:</p><pre tabindex=0><code> read_only: true
 tmpfs:
   - /tmp:size=10M,mode=0770,uid=1000,gid=1000,noexec,nosuid,nodev
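 # Sketch (assumption, not from the original article): a volume that only
 # needs to be read can be mounted read-only with the ":ro" suffix.
 # The ./config host path and /config mount point are hypothetical.
 volumes:
   - ./config:/config:ro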
</code></pre><p>That is quite verbose indeed, but that&rsquo;s to show you the different options for a tmpfs mount. You want to restrict them in size and permissions ideally.</p><h3 id=network-isolation>Network isolation<a hidden class=anchor aria-hidden=true href=#network-isolation>#</a></h3><p>By default, all Docker containers will use the default network bridge. They will see and be able to communicate with each other. Each container should have its own user-defined bridge network, and each connection between containers should have an internal network. If you intend to run a reverse proxy in front of several containers, you should make a dedicated network for each container you want to expose to the reverse proxy.</p><p>The <code>--network host</code> option also shouldn&rsquo;t be used for obvious reasons since the container would share the same network as the host, providing no isolation at all.</p><h2 id=alternative-runtimes-gvisor>Alternative runtimes (gVisor)<a hidden class=anchor aria-hidden=true href=#alternative-runtimes-gvisor>#</a></h2><p><code>runc</code> is the reference OCI runtime, but that means other runtimes can exist as well as long as they&rsquo;re compliant with the OCI standard. These runtimes can be interchanged quite seamlessly. There are a few alternatives, such as <a href=https://github.com/containers/crun>crun</a> or <a href=https://github.com/containers/youki>youki</a>, respectively implemented in C and Rust (<code>runc</code> is a Go implementation). However, there is one particular runtime that does a lot more for security: <code>runsc</code>, provided by the <a href=https://gvisor.dev/>gVisor project</a> by the folks at Google.</p><p><strong>Containers are not a sandbox</strong>, and while we can improve their security, they will fundamentally share a common attack surface with the host. Virtual machines are a solution to that problem, but you might prefer container semantics and ecosystem. 
gVisor can be perceived as an attempt to get the &ldquo;best of both worlds&rdquo;: containers that are easy to manage while providing a native isolation boundary. gVisor did just that by implementing two things:</p><ul><li><strong>Sentry</strong>: an application kernel in Go, a language known to be memory-safe. It implements the Linux logic in userspace such as various system calls.</li><li><strong>Gofer</strong>: a host process which communicates with Sentry and the host filesystem, since Sentry is restricted in that aspect.</li></ul><p>A platform like ptrace or KVM is used to intercept system calls and redirect them from the application to Sentry, which is running in the userspace. This has some costs: there is a higher per-syscall overhead, and compatibility is reduced since not all syscalls are implemented. On top of that, gVisor employs security mechanisms we&rsquo;ve glanced over above, such as a <a href=https://github.com/google/gvisor/blob/86ad7d5b5838da1b539e976886d04b93c939ca3d/runsc/boot/filter/config.go>very restrictive seccomp profile</a> between Sentry and the host kernel, the <a href=https://github.com/google/gvisor/blob/6ef268409620c57197b9d573e23be8cb05dbf381/pkg/sentry/kernel/task_identity.go#L464>no_new_privs bit</a>, and isolated namespaces from the host.</p><p>The security model of gVisor is comparable to what you would expect from a virtual machine. It is also very easy to <a href=https://gvisor.dev/docs/user_guide/install/>install and use</a>. The path to runsc along with its different configuration flags (<code>runsc flags</code>) should be added to <code>/etc/docker/daemon.json</code>:</p><pre tabindex=0><code> &#34;runtimes&#34;: {
     &#34;runsc-ptrace&#34;: {
         &#34;path&#34;: &#34;/usr/local/bin/runsc&#34;,
         &#34;runtimeArgs&#34;: [
             &#34;--platform=ptrace&#34;
         ]
     },
     &#34;runsc-kvm&#34;: {
         &#34;path&#34;: &#34;/usr/local/bin/runsc&#34;,
         &#34;runtimeArgs&#34;: [
             &#34;--platform=kvm&#34;
         ]
     }
 }
</code></pre><p><code>runsc</code> needs to start as root to set up some mitigations, including the use of its own network stack, separate from the host&rsquo;s. The sandbox itself drops privileges to nobody as soon as possible. You can still run <code>runsc</code> rootless if you want (which you will probably need with Podman):</p><pre tabindex=0><code>./runsc --rootless do uname -a
*** Warning: sandbox network isn&#39;t supported with --rootless, switching to host ***
Linux 4.4.0 #1 SMP Sun Jan 10 15:06:54 PST 2016 x86_64 GNU/Linux
</code></pre><p>Linux 4.4.0 is shown because that is the version of the Linux API that Sentry tries to mimic. As you&rsquo;ve probably guessed, you&rsquo;re not really using Linux 4.4.0, but an application kernel that behaves like it. By the way, gVisor is of course compatible with cgroups.</p><h2 id=conclusion-whats-a-container-after-all>Conclusion: what&rsquo;s a container after all?<a hidden class=anchor aria-hidden=true href=#conclusion-whats-a-container-after-all>#</a></h2><p>As I wrote above, a container is mostly defined by its semantics and ecosystem. Containers shouldn&rsquo;t be solely defined by the OCI reference runtime implementation, as we&rsquo;ve seen with gVisor, which provides an entirely different security model.</p><p>Still not convinced? What if I told you a container can leverage the same technologies as a virtual machine? That is exactly what <a href=https://katacontainers.io/>Kata Containers</a> does by using a VMM like QEMU-lite to provide containers that are in fact lightweight virtual machines, with their traditional resources and security model, compatibility with container semantics and toolset, and an optimized overhead. While not in the OCI ecosystem, Amazon achieves much the same with <a href=https://firecracker-microvm.github.io/>Firecracker</a>.</p><p>If you&rsquo;re running untrusted workloads, I highly suggest you consider gVisor instead of a traditional container runtime. Your definition of &ldquo;untrusted&rdquo; may vary: for me, almost everything should be considered untrusted. That is how modern security works, and how mobile operating systems work. It&rsquo;s quite simple: security should be simple, and gVisor simply offers native security.</p><p>Containers are a popular, yet strange world. They revolutionized the way we make and deploy software, but one should not lose sight of what they really are and aren&rsquo;t. 
This hardening guide is non-exhaustive, but I hope it can make you aware of some aspects you&rsquo;ve never thought of.</p></div><footer class=post-footer><ul class=post-tags><li><a href=https://privsec.dev/tags/operating-systems/>operating systems</a></li><li><a href=https://privsec.dev/tags/linux/>linux</a></li><li><a href=https://privsec.dev/tags/container/>container</a></li><li><a href=https://privsec.dev/tags/security/>security</a></li></ul><nav class=paginav><a class=prev href=https://privsec.dev/os/choosing-your-desktop-linux-distribution/><span class=title>« Prev</span><br><span>Choosing Your Desktop Linux Distribution</span></a>
<a class=next href=https://privsec.dev/os/linux-insecurities/><span class=title>Next »</span><br><span>Linux Insecurities</span></a></nav></footer></article></main><footer class=footer><span>&copy; 2022 <a href=https://privsec.dev>PrivSec.dev</a></span>
<span>Powered by
<a href=https://gohugo.io/ rel="noopener noreferrer" target=_blank>Hugo</a> &
<a href=https://github.com/adityatelange/hugo-PaperMod/ rel=noopener target=_blank>PaperMod</a></span></footer><a href=#top aria-label="go to top" title="Go to Top (Alt + G)" class=top-link id=top-link accesskey=g><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 12 6" fill="currentcolor"><path d="M12 6H0l6-6z"/></svg></a><script>let menu=document.getElementById("menu");menu&&(menu.scrollLeft=localStorage.getItem("menu-scroll-position"),menu.onscroll=function(){localStorage.setItem("menu-scroll-position",menu.scrollLeft)}),document.querySelectorAll('a[href^="#"]').forEach(e=>{e.addEventListener("click",function(e){e.preventDefault();var t=this.getAttribute("href").substr(1);window.matchMedia("(prefers-reduced-motion: reduce)").matches?document.querySelector(`[id='${decodeURIComponent(t)}']`).scrollIntoView():document.querySelector(`[id='${decodeURIComponent(t)}']`).scrollIntoView({behavior:"smooth"}),t==="top"?history.replaceState(null,null," "):history.pushState(null,null,`#${t}`)})})</script><script>var mybutton=document.getElementById("top-link");window.onscroll=function(){document.body.scrollTop>800||document.documentElement.scrollTop>800?(mybutton.style.visibility="visible",mybutton.style.opacity="1"):(mybutton.style.visibility="hidden",mybutton.style.opacity="0")}</script><script>document.getElementById("theme-toggle").addEventListener("click",()=>{document.body.className.includes("dark")?(document.body.classList.remove("dark"),localStorage.setItem("pref-theme","light")):(document.body.classList.add("dark"),localStorage.setItem("pref-theme","dark"))})</script><script>document.querySelectorAll("pre > code").forEach(e=>{const n=e.parentNode.parentNode,t=document.createElement("button");t.classList.add("copy-code"),t.innerHTML="copy";function s(){t.innerHTML="copied!",setTimeout(()=>{t.innerHTML="copy"},2e3)}t.addEventListener("click",t=>{if("clipboard"in navigator){navigator.clipboard.writeText(e.textContent),s();return}const 
n=document.createRange();n.selectNodeContents(e);const o=window.getSelection();o.removeAllRanges(),o.addRange(n);try{document.execCommand("copy"),s()}catch{}o.removeRange(n)}),n.classList.contains("highlight")?n.appendChild(t):n.parentNode.firstChild==n||(e.parentNode.parentNode.parentNode.parentNode.parentNode.nodeName=="TABLE"?e.parentNode.parentNode.parentNode.parentNode.parentNode.appendChild(t):e.parentNode.appendChild(t))})</script></body></html>