I discussed some aspects of this with Patrick further. I’m not sure if we’re getting closer to a concrete implementation plan or not, but here are some rough notes about what we were thinking:
Right now we have two privilege levels, user and sysmaint. sysmaint has the power most administrative users on other distros have; it’s “root with light safeguards to make it harder to do something foolish by accident”. user is more like what non-administrative users on Windows are like; they can run arbitrary applications, work with files, use the network, etc., but they can’t mess with the guts of the system. This provides some additional security since it’s harder to get a foothold in the root account, but ultimately it doesn’t do much to fight malware that wants to steal or encrypt the user’s data, except for making it more likely that a user will be mindful about what they install.
Our ultimate goal, however we implement it, is to add a third privilege level to this model. Let’s call it untrusted-user (bad name, but it gets the point across for now).
- The standard user privilege level will retain all of the power it currently has. It is also where all of the user’s trusted data will be stored.
- The untrusted-user privilege level is where more dangerous applications (such as web browsers) should run. This privilege level should be resistant to attacker-incited persistent compromise, while simultaneously expecting temporary compromise. That is, if an attacker manages to compromise a legitimate application running as untrusted-user, that compromise must be able to “just go away” somehow. While the attacker has a foothold in untrusted-user, our implementation must contain that compromise if at all possible so that it cannot gain access to user data.
- This implies that untrusted-user will be at least semi-ephemeral; it must be possible to reliably erase changes made by malware in some fashion.
- It is likely that the untrusted-user environment will suffer from user-incited persistent compromise (i.e. someone intentionally installed malware into the environment). It therefore must be possible to get rid of malware of this kind in some manner.
- Users may work with data they want to store long-term when using untrusted-user, so they must be able to save that data persistently. Some of that data may be trusted, in which case there needs to be a way to move it into the user privilege level (users will obviously have to be very careful about what data they move), while some of that data may be untrusted and needs to stay in the untrusted-user privilege level. Users will also need to be able to move files from the user privilege level to the untrusted-user level.
- Users will be running resource-intensive apps (most notably web browsers) in the untrusted-user privilege level. They will also be running these apps within potentially resource-constrained virtual machines. Thus, performance concerns have to be taken into account. Any solution based on virtualization will have to take into account the fact that nested virtualization may be disabled or may not be available at all (such as on current versions of Qubes OS).
- Users will be running apps that make use of Linux’s native sandboxing capabilities. The implementation of untrusted-user must not break or weaken these sandboxes if at all possible.
- Applications running as untrusted-user are going to have to do IPC with things that run as user (most notably Wayland, PipeWire, and terminal emulator software, but probably also others). This provides substantial attack surface; Wayland is not a simple protocol, and labwc is not written in a memory-safe language, so I would not be surprised if a motivated attacker could attack the compositor to escape the sandbox.
- We do not know what applications the user will run in the sandbox. The sandbox should work transparently with as many existing applications as possible.
- Users will have legitimate reasons to run things in the user privilege level for compatibility or performance reasons. Not everything can be pushed into a container or VM and just work. However, a subset of users will only want to use the user privilege level for managing applications that run as untrusted-user, or for very basic tasks. For those users who can afford to lock down user, there should be an option that uses a project like apparmor.d to lock down the user privilege level. This might be worthwhile to enable by default and make opt-out.
There are two main sandboxing models that I can see, per-application sandboxing and per-usecase sandboxing. Firejail, Bubblewrap, Flatpak, and sandbox-app-launcher are examples of the former kind. I personally do not believe this is the best way of doing things, for a few reasons:
- Per-application sandboxing generally involves figuring out everything a particular application needs, allowlisting it, and then denying everything else. This means that a special profile has to be written for every application, and that lots of fiddling is needed to ensure good application compatibility. sandbox-app-launcher doesn’t seem to have had this problem, but Firejail and Bubblejail certainly do.
- Per-application sandboxing makes it confusing, difficult, or impossible to run the same app in two different “trust domains”. It raises similar problems when trying to run multiple apps in the same “trust domain”.
- Because each application is isolated independently of others, one cannot easily sandbox utilities that are primarily used as parts of larger systems. For instance, it’s hard to imagine a useful generic sandbox for netcat. Per-usecase sandboxing allows one to sandbox these lower-level utilities meaningfully (for instance, maybe my “software-dev” environment should only allow netcat to reach out to termbin.com, while my “document-processing” environment should allow it to reach out to a local network printer for… some reason… this is a bad example but you get my point).
- Because each application is isolated independently of others, IPC becomes a concern. Should a sandboxed Tor Browser be able to ask my file manager to open a directory and show me the contents? On the surface, the answer is “yes”, because I might want to use its “Open containing folder” feature after downloading something. But what if the browser is compromised, and it’s able to use the ability to open my file manager by itself as part of a social engineering attack, trying to convince me to upload sensitive data somewhere? It shouldn’t be able to do that. With per-usecase sandboxing, I can say that Tor Browser should be allowed to communicate with anything within its sandbox, for any reason, but nothing outside of its sandbox. If the browser then decides to open a file manager window out of nowhere, it will be recognizably not my normal file manager, and it won’t be able to show me any sensitive files, so the attack’s likelihood of success goes down.
- Sandboxing of server-like “applications” becomes harder, especially if they depend on and integrate with things like systemd.
Per-usecase sandboxing is basically what containers and VMs give us. VMs are better from a security standpoint in a lot of ways, but due to resource constraints and compatibility issues, we can’t rely on them alone like I was suggesting previously. Containers are somewhat scary because they run directly on the host kernel, meaning that any host kernel bug in code accessible to the container can be exploited, potentially allowing an attacker to obtain kernel-level privileges. That being said, they’re better than nothing, and they allow working around many of the limitations of virtual machines (for instance, 3d acceleration can work inside containers).
There are a lot of existing containerization systems out there; Docker, LXC, LXD, Incus, systemd-nspawn, libvirt-lxc, etc. Which one of these would be good to build on top of, I’m not quite sure yet, but I initially think either systemd-nspawn or libvirt-lxc would be a good option. One nice thing about libvirt-lxc is that it allows defining a container in much the same way one would define a virtual machine, and libvirt provides both container and virtual machine functionality. This might permit implementing both VM-based sandboxing and container-based sandboxing in the same application. Something similar could be done with systemd-nspawn and systemd-vmspawn possibly, but unfortunately systemd-vmspawn is not present in Debian Trixie, so this may not be the best option.
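To make the systemd-nspawn option more concrete, here’s a rough sketch of the commands such a tool might construct. This is illustrative only: the `/var/lib/sandboxd` layout and the sandbox names are hypothetical, though `mmdebstrap --variant=minbase` and systemd-nspawn’s `--private-users=pick` / `--private-users-ownership=auto` are real flags that handle the unprivileged-UID-range problem for us.

```python
from pathlib import Path

STATE_DIR = Path("/var/lib/sandboxd")  # hypothetical state directory


def bootstrap_cmd(name: str, suite: str = "trixie") -> list[str]:
    """Command to create a minimal Debian rootfs for a new sandbox.
    mmdebstrap's --variant=minbase keeps the image small."""
    rootfs = STATE_DIR / name / "rootfs"
    return ["mmdebstrap", "--variant=minbase", suite, str(rootfs)]


def boot_cmd(name: str) -> list[str]:
    """Command to boot the sandbox as an unprivileged container.
    --private-users=pick makes systemd-nspawn choose an unused UID
    range, and --private-users-ownership=auto shifts the rootfs's
    file ownership into that range."""
    rootfs = STATE_DIR / name / "rootfs"
    return [
        "systemd-nspawn",
        "--directory", str(rootfs),
        "--private-users=pick",
        "--private-users-ownership=auto",
        "--boot",
    ]
```

Doing libvirt-lxc instead would replace these argv lists with generated domain XML, but the shape of the tool (bootstrap once as root, then boot unprivileged) would stay the same.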
In my mind, an ideal sandboxing solution would look something like this:
- There’s a privileged daemon (let’s call it sandboxd), which listens for requests to create and manage sandboxes. sandboxd creates sandboxes by using mmdebstrap to create a Debian rootfs with proper UIDs and GIDs so that an unprivileged sandbox can be booted. This daemon runs as root so that it can make the needed ownership changes after bootstrapping a new sandbox.
- There’s a user-accessible application (call it sandboxctl) that talks to the daemon in order to do things like create, delete, rename, query information about, start, and stop sandboxes. It can also request files to be moved between the user’s home folder and the sandbox’s home folder, or can request a directory to be transparently passed through.
- A number of validating proxies for basic, unavoidable IPC are provided with the system. These proxies do things like virtualize Wayland, PipeWire, and console access, preventing malicious interactions with parts of the system that run under the user privilege level.
- A number of basic permissions exist (“allow network”, “allow GUI”, “allow audio”, “allow mic”, “allow 3d accel”, etc.). These permissions can be toggled on or off at the trust domain level. Wherever possible, validating proxies are used to enable these permissions rather than just passing through bits of the host system to the container. In some situations, though, host passthrough will likely be unavoidable, for instance passing through something like /dev/dri/renderD128 for 3d acceleration.
- A helper application using seccomp-bpf will be provided that denies most applications the ability to create user namespaces in some way (handwaving here, haven’t thought this through fully). Only specific applications that need user namespaces, such as web browsers, will be run without this application wrapping them, thus allowing us to have the lessened kernel attack surface of “no user namespaces”, and the better in-browser security that comes with user namespaces.
- Sandboxes can be booted with an optional RAM-based overlay within the sandbox to make them ephemeral. Only specific folders within the sandbox’s home folder such as Documents would be exempted (bind mounting would be needed for this). This would allow a user to make a sandbox “forget” a compromise, even one that used files like ~/.bashrc for persistence, by simply rebooting the sandbox. For software updates and installation, the sandbox could be booted in a fully persistent mode.
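The sandboxctl surface described above could be quite small. Here’s a hypothetical sketch of its command-line interface; the subcommand names (and the push/pull model for file movement between the user and untrusted-user home folders) are assumptions, not a settled design.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical command-line surface for sandboxctl; each
    subcommand would be turned into a request to the privileged
    sandboxd daemon over some IPC channel."""
    parser = argparse.ArgumentParser(prog="sandboxctl")
    sub = parser.add_subparsers(dest="command", required=True)

    # Lifecycle operations all take a sandbox name.
    for cmd in ("create", "delete", "rename", "info", "start", "stop"):
        p = sub.add_parser(cmd)
        p.add_argument("sandbox")

    # Move a file from the user's home into the sandbox's home...
    push = sub.add_parser("push")
    push.add_argument("sandbox")
    push.add_argument("path")

    # ...or pull a file back out (the dangerous direction; the user
    # must be careful about what they promote to trusted storage).
    pull = sub.add_parser("pull")
    pull.add_argument("sandbox")
    pull.add_argument("path")

    return parser
```
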
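The no-user-namespaces helper could start from a classic seccomp-bpf filter. The sketch below is illustrative only: it assumes amd64 syscall numbers, covers just unshare() and clone3(), and only builds the filter bytes. A real implementation would also need to inspect clone()’s flags for CLONE_NEWUSER, handle other architectures, and install the program via prctl(PR_SET_NO_NEW_PRIVS) followed by prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...).

```python
import struct

# amd64 syscall numbers (assumption: this sketch is amd64-only)
SYS_unshare = 272
SYS_clone3 = 435

# Classic BPF opcodes as used by seccomp filters
BPF_LD_W_ABS = 0x20   # load 32-bit word at absolute offset
BPF_JMP_JEQ_K = 0x15  # jump-if-equal against constant
BPF_RET_K = 0x06      # return constant

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ERRNO = 0x00050000
EPERM = 1


def insn(code: int, jt: int, jf: int, k: int) -> bytes:
    # struct sock_filter { __u16 code; __u8 jt; __u8 jf; __u32 k; }
    return struct.pack("<HBBI", code, jt, jf, k)


def build_filter() -> bytes:
    """BPF program that fails unshare() and clone3() with EPERM and
    allows everything else. Offset 0 of seccomp_data is the syscall
    number. (Blocking clone() outright would break threading, which
    is why its flags argument would need real inspection.)"""
    prog = [
        insn(BPF_LD_W_ABS, 0, 0, 0),                        # A = nr
        insn(BPF_JMP_JEQ_K, 2, 0, SYS_unshare),             # nr == unshare -> deny
        insn(BPF_JMP_JEQ_K, 1, 0, SYS_clone3),              # nr == clone3  -> deny
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),           # allow
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO | EPERM),   # deny with EPERM
    ]
    return b"".join(prog)
```

Returning EPERM rather than killing the process is deliberate: most applications handle a failed unshare() gracefully, whereas SECCOMP_RET_KILL would make the wrapper far more compatibility-hostile.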
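The ephemeral-boot idea in the last bullet amounts to a small mount plan: a tmpfs-backed overlayfs over the rootfs, with selected home subdirectories bind-mounted back in from persistent storage. The sketch below only constructs mount(8) command lines; the paths, the `-persist` naming, and the in-sandbox username `user` are all made up for illustration.

```python
def ephemeral_mounts(rootfs: str,
                     ram: str = "/run/sandboxd/overlay",
                     keep: tuple = ("Documents",)) -> list:
    """Build the mount(8) command lines for an ephemeral boot.
    All writes land in a tmpfs upperdir and vanish on reboot,
    except for the bind-mounted 'keep' folders. (upper/ and work/
    would have to be created inside the tmpfs before the overlay
    mount runs.)"""
    upper = f"{ram}/upper"
    work = f"{ram}/work"
    cmds = [
        # RAM-backed storage for all changes made during this boot
        ["mount", "-t", "tmpfs", "tmpfs", ram],
        # Overmount the rootfs with an overlay whose lower layer is
        # the rootfs itself
        ["mount", "-t", "overlay", "overlay",
         "-o", f"lowerdir={rootfs},upperdir={upper},workdir={work}",
         rootfs],
    ]
    for sub in keep:
        # Exempted folders come from persistent storage via bind mounts
        persistent = f"{rootfs}-persist/home/user/{sub}"
        cmds.append(["mount", "--bind", persistent,
                     f"{rootfs}/home/user/{sub}"])
    return cmds
```

A fully persistent boot (for updates and software installation) would simply skip the first two mounts.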
This is the basic architecture in my mind. This would meet all of the criteria above AFAICT.
The hardest part of this is probably going to be the validating proxies. What might be doable instead is to run a real Wayland compositor and audio server within the sandbox, then use simpler protocols like VNC and audio streaming to get video and audio out of the container. A terminal emulator could then be run within the sandbox rather than using a user-privileged terminal emulator to interface with the sandbox. The performance impact of this might be substantial, and it could make things like the clipboard harder to use, but perhaps not impossible. Research would have to be done into this.
Everything after this is research I did into a bunch of existing sandboxing mechanisms and why I think they probably aren’t suitable for what we want to do.
- VirtualBox and KVM are both very resource intensive. They require lots of disk space, lots of memory, lots of CPU power, and destroy the ability to use graphics hardware acceleration (which is necessary to do things like watch videos at reasonable resolutions). They require hardware virtualization features, making them unsuitable for use within Qubes or environments where nested virtualization isn’t available or is painfully slow.
- Flatpak uses a combination of namespaces, cgroups, and seccomp to sandbox applications. While good in theory, in practice this results in a lot of problems:
- Many Flatpaks have very loose permissions, and those that don’t oftentimes don’t work right.
- seccomp is used to deny access to features that may increase kernel attack surface, like user namespaces. This breaks Chrome’s sandbox among other things, making it more dangerous to use a Flatpak-packaged browser than an apt-packaged one.
- Sandboxed applications connect directly to the Wayland compositor, which is unsafe as described above.
- The way in which Flatpaks bundle dependencies means they can often ship insecure, outdated libraries, resulting in security problems. https://flatkill.org/ goes into a lot of detail on that.
- Snap suffers from similar issues as Flatpak, though it may not be as bad from a dependency standpoint if it uses packages from the Ubuntu archive. It may also handle nested sandboxing better. It’s a semi-closed-source system dependent upon Canonical though, so it may be better avoided.
- Bubblewrap can be used on its own, without Flatpak. Depending on what exactly is being done, this can be a viable sandboxing mechanism, and it might be usable instead of systemd-nspawn or libvirt-lxc. I’m not sure yet whether it has advantages over those mechanisms, and it might complicate attempts to make VM-based sandboxing work.
- I took a look at Landlock. It looks like a rather interesting way to do some of the things AppArmor does, but without needing privileges. It can be used to do things like make an application that says “I need to be able to read from this dir, read and write this other dir, and read, write, and execute this third dir, and that’s it.” Then the kernel will keep it from doing anything other than the things it locked itself into doing.
- This isn’t really what we’re looking for I would argue. It might be interesting for some specific usecases, but… we already have AppArmor for this sort of thing. This is a variant on mandatory access control, which doesn’t provide the compatibility we’re looking for.
- AppArmor is something we’re already familiar with, as we actively use it to do things like sandbox Tor Browser.
- This requires a custom policy to be written for each sandboxed application, which is a pain, and it doesn’t help with applications that break when denied potentially dangerous resources. It also doesn’t allow isolating data into separate untrusted-user environments.
- Firejail is designed to sandbox individual applications using individually written profiles. It also allows the user to create custom profiles for applications that don’t have profiles written for them already. It sandboxes the applications that already exist on the root filesystem.
- It’s SUID-root, which is problematic since we’re trying to get rid of such applications.
- While simple to use in theory, it seems to require the user to know what they’re doing a bit more than I personally would prefer. Rather than saying “this set of apps has its own environment”, it says “here are the parts of the user environment this particular app can access.” This is possibly dangerous since it blurs the lines between the privilege levels we want to establish. For instance, a browser should naturally be allowed to save things in Downloads, but what if you have sensitive company data in your Downloads folder, and your browser gets compromised and uploads it?
- There has been at least one TOCTOU vulnerability in Firejail which allowed an attacker to escalate their privileges to root; “Rigged Race Against Firejail for Local Root” is a good writeup of it. This is arguably proof that Firejail as an SUID application is a less-than-great idea, and there have been other vulnerabilities in Firejail that make this look dangerous.
This is a bit long, but hopefully it will be useful.