Learning About Syscall Filtering With Seccomp

Posted on Saturday June 27, 2020
Updated on Wednesday April 20, 2022

I’d heard about being able to run Docker containers with a custom security profile, but wasn’t really sure what that meant or what was happening behind the scenes, so I decided to do some experimentation to find out.

It turns out that the Linux kernel includes a feature called “secure computing mode,” or seccomp for short. Using seccomp lets you tell the kernel that you only expect your program to use a specific set of system calls, and if your program makes any system calls that aren’t in your approved list, the kernel should kill your program.

But why would you want to do this? I think if you had a pretty simple program, using seccomp might be overkill. But if your program makes different system calls depending on possibly-untrustworthy user input, it might make sense to try to limit what the program is allowed to do. Looking at a list of software using seccomp on Wikipedia backs this up: the software listed is mostly hypervisors/container runners (like Docker), web browsers, etc.

By reading the manual page for the seccomp(2) system call, we can learn how to write a program to try this out. The simplest action is to enter “strict mode,” which prevents all system calls except for read(2), write(2), _exit(2), and sigreturn(2). In other words, what I think should be just enough to write hello world! Let’s give it a shot:

#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <stdio.h>

int
main()
{
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
                perror("prctl");
                return 1;
        }
        printf("hello, world!\n");
        return 0;
}

When I compile and run my program, I just see Killed being printed, not hello, world!. Well, this is pretty good evidence that seccomp is doing something: it’s at least killing my program! Let’s try to find out why it’s being killed using strace, a program that shows you all of the system calls being made:

$ strace ./hello
execve("./hello", ["./hello"], 0x7fff77b754b0 /* 20 vars */) = 0
brk(NULL)                               = 0x559e08463000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=25762, ...}) = 0
mmap(NULL, 25762, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fe65b9f0000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fe65b9ee000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fe65b3df000
mprotect(0x7fe65b5c6000, 2097152, PROT_NONE) = 0
mmap(0x7fe65b7c6000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fe65b7c6000
mmap(0x7fe65b7cc000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fe65b7cc000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7fe65b9ef4c0) = 0
mprotect(0x7fe65b7c6000, 16384, PROT_READ) = 0
mprotect(0x559e077b9000, 4096, PROT_READ) = 0
mprotect(0x7fe65b9f7000, 4096, PROT_READ) = 0
munmap(0x7fe65b9f0000, 25762)           = 0
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
fstat(1,  <unfinished ...>)             = ?
+++ killed by SIGKILL +++
Killed

There’s a lot at the beginning about loading dynamically linked libraries, reading the program binary, and mapping it into memory that I don’t fully understand. But the last few syscalls provide some clues: right after prctl is called, we see fstat being called! fstat is a system call for getting the status of a file, and 1 happens to be the file descriptor for standard output. It makes sense that calling printf might involve checking the status of standard output, so I tried commenting out the call to printf in hello.c. When I compiled and ran the new version, it still just printed Killed, so I used strace again. Just looking at the last few lines:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) = 0
exit_group(0)                           = ?
+++ killed by SIGKILL +++
Killed

Now my program is making the exit_group system call. Thinking back to the manual page for seccomp, it said:

The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2).
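That list is still enough for a hello world, as long as the program avoids stdio and the usual exit path. Here is a minimal sketch of that idea, assuming x86-64 Linux with glibc. The names strict_hello and run_strict_hello are made up for illustration, and the fork is only there so the parent can report what happened to the child:

```c
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs in the child: enter strict mode, print with write(2), and leave with
 * the raw exit syscall instead of exit_group. Does not return. */
static void
strict_hello(void)
{
        static const char msg[] = "hello, world!\n";

        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0) {
                _exit(2);       /* strict mode unavailable; no filter active yet */
        }
        if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0) {
                syscall(SYS_exit, 1);
        }
        /* glibc's _exit() and exit() both end up in exit_group, which strict
         * mode does not allow, so call the plain exit syscall directly. */
        syscall(SYS_exit, 0);
}

/* Forks a child that runs strict_hello. Returns the child's exit status
 * (0 on success, 2 if strict mode could not be entered), or -1 if the child
 * was killed by a signal or could not be created. */
int
run_strict_hello(void)
{
        int status;
        pid_t pid = fork();

        if (pid < 0) {
                return -1;
        }
        if (pid == 0) {
                strict_hello();
        }
        if (waitpid(pid, &status, 0) < 0) {
                return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_strict_hello() from main should print the greeting and return 0 where strict mode is available. It works, but only by giving up printf and the normal way of exiting.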

It looks like I’ll need to actually do some real filtering if I want to run my hello world program and not just use strict mode. To do this, we need to use SECCOMP_MODE_FILTER and pass a pointer to a struct sock_fprog, which according to the manpage is “a Berkeley Packet Filter program designed to filter arbitrary system calls and system call arguments.”
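To get a feel for what such a BPF program looks like, here is a rough sketch of an allow-list for fstat, write, and exit_group written by hand. It assumes x86-64, and the names hello_filter, hello_prog, and install_hello_filter are mine, not from the manpage. The filter reads fields of struct seccomp_data (the architecture and the syscall number) and returns an action:

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

/* Each BPF_JUMP is "if equal, skip jt instructions, else skip jf". */
static struct sock_filter hello_filter[] = {
        /* 0: load the architecture and kill anything that is not x86-64 */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* 3: load the syscall number and compare it with the allow-list */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_fstat, 3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
        /* 7: no match, kill; 8: match, allow */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

static const struct sock_fprog hello_prog = {
        .len = sizeof(hello_filter) / sizeof(hello_filter[0]),
        .filter = hello_filter,
};

/* Installs the filter for the calling thread. Returns 0 on success and -1
 * on failure. Setting no_new_privs first lets an unprivileged process
 * install a filter. */
int
install_hello_filter(void)
{
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) {
                return -1;
        }
        return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &hello_prog);
}
```

Nine instructions for three syscalls is manageable, but every jump offset has to be counted by hand, and one miscount silently changes which syscalls are allowed.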

While we could construct a BPF program using an array of struct sock_filters, looking at the chain of instructions we’d need made me think it would be much easier to enlist the services of libseccomp, a library designed for just this purpose. Let’s try rewriting hello.c to use libseccomp and allowing the three syscalls our program needs: fstat and exit_group, which we saw in the strace output, plus write, which printf uses to actually emit the text:

#include <seccomp.h>
#include <stdio.h>
#include <stdlib.h>

scmp_filter_ctx ctx;

/* graceful_exit cleans up our seccomp context before exiting */
void
graceful_exit(int rc)
{
        seccomp_release(ctx);
        exit(rc);
}

/* setup_seccomp initializes seccomp and loads our BPF program that filters
 * syscalls into the kernel */
void
setup_seccomp()
{
        int rc;

        /* Initialize the seccomp filter state */
        if ((ctx = seccomp_init(SCMP_ACT_KILL)) == NULL) {
                graceful_exit(1);
        }
        if ((rc = seccomp_reset(ctx, SCMP_ACT_KILL)) != 0) {
                graceful_exit(1);
        }

        /* Add allowed system calls to the BPF program */
        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(fstat), 0)) != 0) {
                graceful_exit(1);
        }
        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0)) != 0) {
                graceful_exit(1);
        }
        if ((rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0)) != 0) {
                graceful_exit(1);
        }

        /* Load the BPF program for the current context into the kernel */
        if ((rc = seccomp_load(ctx)) != 0) {
                graceful_exit(1);
        }
}

int
main()
{
        setup_seccomp();
        printf("hello, world!\n");
        graceful_exit(0);
}

Since we’re now using libseccomp, we need to tell our C compiler to link the library:

$ cc -o hello hello.c -lseccomp
$ ./hello
hello, world!

Success! Our program compiles and runs, and all of the necessary syscalls have been allowed. Now let’s try modifying the main() function of our program to do something bad, like trying to read the password file /etc/shadow:

int
main()
{
        FILE *fd;
        setup_seccomp();
        printf("hello, world!\n");
        if ((fd = fopen("/etc/shadow", "r")) == NULL) {
                perror("fopen");
                graceful_exit(1);
        }
        fclose(fd);
        graceful_exit(0);
}

Now when we compile and run our program, we get:

$ ./hello
hello, world!
Bad system call (core dumped)

Nice! The kernel killed our program when we tried to use a system call (openat) that we didn’t plan on!

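I wanted to figure out how to allow openat to open only one specific file name, but I couldn’t find a way to compare string system call arguments. Thanks to Isaiah Bell for referring me to the explanation for why this isn’t possible: a seccomp filter sees only the pointer value, not the memory it points to, because another thread could change the string between the filter’s check and the kernel’s use of it (a time-of-check-to-time-of-use problem).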
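Arguments that are plain numbers can be compared, since their values are copied into struct seccomp_data before the filter runs. As a rough sketch of that, here is a hand-written filter that allows write only when the file descriptor is 1. It assumes a little-endian x86-64 or aarch64 machine, and the macro and function names (LOAD, IF_EQ, RET, stdout_only, stderr_write_outcome) are invented for this example:

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86-64 and aarch64"
#endif

/* Arguments are 64 bits wide but BPF loads 32 bits at a time. On a
 * little-endian machine (assumed here) the low half comes first. */
#define ARG0_LO offsetof(struct seccomp_data, args[0])
#define ARG0_HI (offsetof(struct seccomp_data, args[0]) + 4)
#define LOAD(off) BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (off))
#define IF_EQ(val, jt, jf) BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (val), (jt), (jf))
#define RET(val) BPF_STMT(BPF_RET | BPF_K, (val))

static struct sock_filter stdout_only[] = {
        LOAD(offsetof(struct seccomp_data, arch)),      /*  0 */
        IF_EQ(NATIVE_ARCH, 1, 0),                       /*  1: ok -> 3 */
        RET(SECCOMP_RET_KILL),                          /*  2 */
        LOAD(offsetof(struct seccomp_data, nr)),        /*  3 */
        IF_EQ(SYS_exit_group, 7, 0),                    /*  4: yes -> 12 */
        IF_EQ(SYS_write, 1, 0),                         /*  5: yes -> 7 */
        RET(SECCOMP_RET_KILL),                          /*  6 */
        LOAD(ARG0_HI),                                  /*  7 */
        IF_EQ(0, 0, 2),                                 /*  8: nonzero -> 11 */
        LOAD(ARG0_LO),                                  /*  9 */
        IF_EQ(STDOUT_FILENO, 1, 0),                     /* 10: fd 1 -> 12 */
        RET(SECCOMP_RET_KILL),                          /* 11 */
        RET(SECCOMP_RET_ALLOW),                         /* 12 */
};

/* Forks a child that installs the filter, writes to fd 1, then to fd 2.
 * Returns the signal that ended the child (SIGSYS is expected), 0 if it
 * exited normally, or -1 if setup failed. */
int
stderr_write_outcome(void)
{
        struct sock_fprog prog = {
                .len = sizeof(stdout_only) / sizeof(stdout_only[0]),
                .filter = stdout_only,
        };
        int status;
        pid_t pid = fork();

        if (pid < 0) {
                return -1;
        }
        if (pid == 0) {
                if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
                    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
                        _exit(2);
                }
                if (write(STDOUT_FILENO, "fd 1 ok\n", 8) < 0) {
                        _exit(3);
                }
                /* The filter should kill the child on this call. */
                if (write(STDERR_FILENO, "fd 2?\n", 6) < 0) {
                        _exit(3);
                }
                _exit(0);
        }
        if (waitpid(pid, &status, 0) < 0) {
                return -1;
        }
        if (WIFSIGNALED(status)) {
                return WTERMSIG(status);
        }
        return WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

With libseccomp the same rule should be a single seccomp_rule_add call with an argument comparison, which is another reason I’d rather not write these by hand.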

Now let’s go back to how this all fits into Docker. Looking at Docker’s default seccomp profile, a lot of it starts to make more sense. In fact, it looks like they’re using the exact same names from libseccomp that we used in our program! If we search the moby source code for libseccomp, we can see that it is indeed being used (via Go bindings).

Let’s try to use a custom seccomp profile to prohibit programs in our Docker container from listening for network connections. To start, I want to make sure I can accept network connections, then modify my profile and watch it break. I downloaded the default seccomp profile to use as a starting point for tweaking, started a container with port 4000 open, then used nc to try communicating from my host machine to a listener in the Docker container:

$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000

When I run echo hi | nc 127.0.0.1 4000 in a separate terminal, my greeting is printed by the netcat listener in the Docker container. Success! Now that I know my basic TCP server works, let’s try blocking it with seccomp! To start listening on a TCP port, I know that nc has to use the socket, bind, and listen system calls (which we can verify using strace). I’ll try removing them from the list of allowed system calls in the default profile, and run the Docker container again with the modified profile:

$ docker run --rm -it -p 4000:4000 --security-opt seccomp=seccomp.json alpine
/ # nc -l -p 4000
nc: socket(AF_INET,1,0): Operation not permitted

Awesome! We just used seccomp to control what our Docker container is allowed to do!
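Notice that nc got an error back instead of being killed like my hello world program was. As far as I can tell, that is because Docker’s default profile uses an errno action (SCMP_ACT_ERRNO) as its default rather than a kill action. Here is a small sketch of the same behavior with a hand-written filter. It is a deny-list for a single syscall, which is the opposite of how Docker’s profile is organized, and the names no_socket and socket_errno_under_filter are made up:

```c
#include <errno.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define NATIVE_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define NATIVE_ARCH AUDIT_ARCH_AARCH64
#else
#error "this sketch only covers x86-64 and aarch64"
#endif

static struct sock_filter no_socket[] = {
        /* 0: kill anything that is not the native architecture */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, NATIVE_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* 3: socket fails with EPERM, everything else is allowed */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_socket, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Forks a child that installs the filter and calls socket(). Returns the
 * errno the child saw (0 if socket() succeeded), 255 if the filter could
 * not be installed, or -1 if the child did not exit normally. */
int
socket_errno_under_filter(void)
{
        struct sock_fprog prog = {
                .len = sizeof(no_socket) / sizeof(no_socket[0]),
                .filter = no_socket,
        };
        int status;
        pid_t pid = fork();

        if (pid < 0) {
                return -1;
        }
        if (pid == 0) {
                if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
                    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
                        _exit(255);
                }
                if (socket(AF_INET, SOCK_STREAM, 0) >= 0) {
                        _exit(0);
                }
                _exit(errno);
        }
        if (waitpid(pid, &status, 0) < 0) {
                return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the filter installs, socket_errno_under_filter() should return EPERM, which is the “Operation not permitted” that nc reported.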

I can imagine this might be helpful if you had an environment where security was extremely important and wanted to really lock down your containers, but it’s hard to imagine that writing custom seccomp profiles for every container in your production environment is the best use of time without having some specific situation you’re trying to address.
