Converting privileged LXC containers to unprivileged containers

I use LXC (Linux Containers) on a small server to host several Linux systems on the same hardware. While upgrading to the recently released Debian Stretch and reviewing the configuration, I thought I should really make these containers unprivileged. What does that mean? Usually, LXC containers use the same UID namespace as the host, i.e., root inside the container is root outside the container. This is obviously not optimal, since any bug that allows a breakout from the container may give an attacker the same privileges on the host as inside the container. Even worse, the UID-to-name mapping can differ between container and host. UID 1000, for example, could be “alice” on the host and “bob” in the container. This can give users on the host too much access to the container’s file system. Luckily, LXC has a solution: UID mappings.

Concept

LXC can map from internal to external UID/GID ranges:

IDs on the host        IDs in the container
10000000–10065535      0–65535

You enable this mapping by adding the following to the container’s config file (usually /var/lib/lxc/<name>/config):

lxc.id_map = u 0 10000000 65536
lxc.id_map = g 0 10000000 65536

The u/g stands for UIDs or GIDs, respectively. The first number is the first ID in the container that should be mapped; for my purposes that should always be zero so that root is included. The second number is the first ID on the host. The third number is the number of IDs in the range and should be at least 65536 for a full Linux system. In fact, I use 10⁷ IDs for every container (10,000,000–19,999,999 for the first one, 20,000,000–29,999,999 for the second one, etc.), since a Linux host supports 2³² = 4,294,967,296 IDs.
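The numbering scheme can be written down as a small helper. This is only a sketch: the function name id_map_lines and its parameters are made up for illustration, and it assumes that container number n gets the host block starting at n × 10⁷. The key name lxc.id_map is copied from the config lines above; as far as I know, newer LXC releases spell it lxc.idmap.

```python
def id_map_lines(index, block_size=10_000_000, count=65536):
    """Return the lxc.id_map lines for container number `index` (1-based).

    Hypothetical helper: container 1 gets host IDs starting at 10,000,000,
    container 2 at 20,000,000, and so on.
    """
    if index < 1:
        raise ValueError("container index starts at 1")
    if count > block_size:
        raise ValueError("mapped range does not fit into the block")
    host_start = index * block_size
    # Host IDs are 32-bit numbers, so the whole block has to stay below 2**32.
    if host_start + block_size > 2**32:
        raise ValueError("block exceeds the 32-bit ID space")
    # One line for UIDs ("u") and one for GIDs ("g"), both starting at ID 0
    # inside the container.
    return [f"lxc.id_map = {kind} 0 {host_start} {count}" for kind in ("u", "g")]


for line in id_map_lines(2):
    print(line)
```

With blocks of 10⁷ IDs, the check against 2³² leaves room for a few hundred containers, which is far more than a small server will ever host.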

This alone will not work, though. The user that starts the LXC container must be allowed to use this range. In most current Linux distributions this is managed by the files /etc/subuid and /etc/subgid. The entries in both files have the same format. One thing that is not obvious from the LXC documentation is what to put there if you want system containers, i.e., containers that are auto-started on boot and that live in /var/lib/lxc. Contrary to what I intuitively thought, you do not create a new user for every container; instead, you allow root to use the ID ranges you want to assign. Yes, even root needs explicit permission for this feature. So for the above, we add something like this to the two files:

root:10000000:65536

The range is defined in the same way as in the LXC config file. This is all we need to start the container in its own UID/GID range.
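To double-check that a range is really delegated before starting the container, the entries can be parsed with a few lines of Python. This is a sketch under some assumptions: the function name range_is_delegated is made up, it only understands the name:start:count format shown above, and it only accepts a range that fits into a single entry (it does not merge adjacent entries).

```python
def range_is_delegated(subid_text, user, first_id, count):
    """Check whether one entry for `user` covers first_id..first_id+count-1.

    `subid_text` is the content of /etc/subuid or /etc/subgid.
    """
    for line in subid_text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3 or parts[0] != user:
            # Not an entry for this user (or not an entry at all).
            continue
        try:
            start, length = int(parts[1]), int(parts[2])
        except ValueError:
            # Skip lines whose numbers cannot be parsed.
            continue
        if start <= first_id and first_id + count <= start + length:
            return True
    return False


sample = "root:10000000:65536\n"
print(range_is_delegated(sample, "root", 10000000, 65536))
```

On a real system you would pass in the contents of both /etc/subuid and /etc/subgid, since the container needs the UID and the GID range.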

Converting an existing container

Still, to convert an existing container one thing is missing: correct file ownership. The files need to be owned by users and groups in the new ID range. There is a little trap here, so I wrote a Python script to shift the IDs correctly. The problem is that when files marked with the setuid or the setgid bit change ownership, these bits are stripped. This avoids accidental security holes and leaves it to the user to set these permissions again after changing the file owner. A container will obviously not work without all its setuid binaries (mount, sudo, …), so these bits need to be preserved. This Python snippet does the trick:

import os

# "filename", "new_uid" and "new_gid" are expected to be set by the caller.
# Read permissions and other metadata of the file.
stat = os.stat(filename, follow_symlinks=False)
# Change the UID and GID.
os.chown(filename, new_uid, new_gid, follow_symlinks=False)
# Restore the permissions, including setuid/setgid (not for symlinks).
if not os.path.islink(filename):
    os.chmod(filename, stat.st_mode)

A small but important detail is hard links, which allow a file (more correctly: an inode) to have several names. To avoid shifting the IDs of the same inode several times, you can use the stat attributes st_dev and st_ino, which together uniquely identify an inode. This is implemented in the complete script at the end of the post.
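The inode check can be tried out in isolation. The following sketch creates a file and a hard link to it in a temporary directory and shows that only one of the two names survives the filter. The names unique_inodes and demo are made up for this illustration and do not appear in the complete script.

```python
import os
import tempfile


def unique_inodes(paths):
    """Yield every path whose inode has not been seen before."""
    seen = set()
    for path in paths:
        st = os.stat(path, follow_symlinks=False)
        # Device number plus inode number identify the inode.
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue
        seen.add(key)
        yield path


def demo():
    """Create a file with a second hard-linked name and filter both names."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original")
        alias = os.path.join(tmp, "alias")
        with open(original, "w") as handle:
            handle.write("data")
        # Second name for the same inode.
        os.link(original, alias)
        return [os.path.basename(p) for p in unique_inodes([original, alias])]


print(demo())
```

Without this filter, a file with two names would have the offset added twice and end up outside the intended ID range.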

Another set of permissions that is relevant on current systems is defined by access control lists (ACLs). I never really used them or knew anybody who did, but nowadays systemd’s journal sets them for its log files. Even if you don’t need the journal, system initialization can go wrong if the ACL IDs lie outside the valid range for the container. Thus, we need to make sure to fix them if they are set. Python supports them through a third-party package (pylibacl), which provides the posix1e module. The API of this thing is a bit idiosyncratic for Python, so let’s see how it works. The access ACL of a file is read with

acl = posix1e.ACL(file=filename)

But wait, there is another kind of ACL, the default ACL. This one only exists for directories and determines the ACL of new files inside it. Get it like this:

acl_def = posix1e.ACL(filedef=filename)

In an ACL, the entries ACL_USER_OBJ and ACL_GROUP_OBJ just denote the standard Unix permissions. Therefore, the only entries we need to change are ACL_USER and ACL_GROUP:

for entry in acl:
    if entry.tag_type in (posix1e.ACL_USER,
                          posix1e.ACL_GROUP):
        entry.qualifier += offset

Finally, the modified ACL can be applied to a file:

acl.applyto(filename, posix1e.ACL_TYPE_ACCESS)
acl_def.applyto(filename, posix1e.ACL_TYPE_DEFAULT)

Once again, the IDs of the default ACL need to be shifted in the same way before it is applied. I put all of this into a Python script. The first command-line parameter is the directory on which the conversion is performed recursively, the second parameter is the offset for UIDs/GIDs (e.g., 10000000 in the example above):

#!/usr/bin/env python3
#
# Written 2017 by Tobias Brink
#
# To the extent possible under law, the author(s) have dedicated
# all copyright and related and neighboring rights to this software
# to the public domain worldwide. This software is distributed
# without any warranty.
#
# You should have received a copy of the CC0 Public Domain
# Dedication along with this software. If not, see
# <http://creativecommons.org/publicdomain/zero/1.0/>.

import sys
import os

try:
    import posix1e # "pylibacl" package
except ImportError:
    print("WARNING: pylibacl missing, cannot update ACLs!")
    has_posix1e = False
else:
    has_posix1e = True

# Call the script with two arguments:
#   1) the root directory of the container, e.g.,
#      /var/lib/lxc/foo/rootfs/
#   2) the UID/GID offset, in the example case from
#      above that would be 10000000
directory = sys.argv[1]
offset = int(sys.argv[2])

# Preview action and require confirmation.
print("directory     :", directory)
print("UID/GID offset:", offset)
print()
ans = input("Is this correct? [y/N] ")
if ans.lower() != "y":
    sys.exit()


# This function shifts either an access or a default ACL. This is
# determined by the "flag" parameter.
def shift_acl(fp, offset, flag):
    """Shift either access or default ACL by offset."""
    # What kind of ACL do we have?
    acl = (posix1e.ACL(file=fp)
           if flag == posix1e.ACL_TYPE_ACCESS
           else posix1e.ACL(filedef=fp))
    # See if we need to update ACL and add offsets.
    acl_needs_update = False
    for entry in acl:
        if entry.tag_type in {posix1e.ACL_USER,
                              posix1e.ACL_GROUP}:
            acl_needs_update = True
            entry.qualifier += offset
    if acl_needs_update:
        # Something has changed, update the file.
        acl.applyto(fp, flag)


# This one shifts the UIDs/GIDs and corrects the setuid/setgid
# bits.
def shift_ids(fp, offset, seen_inodes):
    """Shift the UIDs/GIDs of the file "fp" by "offset"."""
    # Get the current UID/GID/permissions.
    stat = os.stat(fp, follow_symlinks=False)
    # Ensure that we haven't already shifted the IDs of
    # this inode.
    if (stat.st_dev, stat.st_ino) in seen_inodes:
        # Already changed that file.
        return
    else:
        # Remember this inode to avoid shifting its
        # IDs again.
        seen_inodes.add( (stat.st_dev, stat.st_ino) )
    # Add the offset.
    new_uid = stat.st_uid + offset
    new_gid = stat.st_gid + offset
    # Change the UID and GID.
    os.chown(fp, new_uid, new_gid, follow_symlinks=False)
    # Permissions are not relevant for symlinks on Linux.
    if not os.path.islink(fp):
        # Restore mode, because (e.g.) setuid may be stripped
        # by chown.
        os.chmod(fp, stat.st_mode)
        # Update ACL.
        if not has_posix1e:
            return
        shift_acl(fp, offset, posix1e.ACL_TYPE_ACCESS)
        if os.path.isdir(fp):
            # Directories can also have a "default ACL".
            shift_acl(fp, offset, posix1e.ACL_TYPE_DEFAULT)


# Apply transformation to the directory itself.
seen_inodes = set()
shift_ids(directory, offset, seen_inodes)
# Recursively walk through the directory tree of the container.
for root, dirs, files in os.walk(directory):
    for obj in dirs + files:
        fp = os.path.join(root, obj)
        # Modify.
        shift_ids(fp, offset, seen_inodes)

This script should be a more or less safe way to update the file ownership in your container. Note that the script does not check if all new IDs lie inside valid ID ranges, so if you are in doubt check this yourself. Additionally, there could be other problems, so have a backup handy!
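For the range check mentioned above, something like the following sketch can be run after the conversion. It is not part of the original script. The function name find_out_of_range is made up, and it only looks at file owners and groups, not at ACL entries.

```python
import os


def find_out_of_range(directory, first_id, count):
    """Return (path, uid, gid) for every entry owned outside the host range."""
    valid = range(first_id, first_id + count)
    offenders = []

    def check(path):
        # Do not follow symlinks, they have their own owner.
        st = os.stat(path, follow_symlinks=False)
        if st.st_uid not in valid or st.st_gid not in valid:
            offenders.append((path, st.st_uid, st.st_gid))

    # Check the directory itself, then everything below it.
    check(directory)
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            check(os.path.join(root, name))
    return offenders
```

For the example container, calling find_out_of_range("/var/lib/lxc/foo/rootfs/", 10000000, 65536) should return an empty list after a successful conversion. Every entry it does return is a file that the container would not see with a valid owner.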