Dhananjaay's Blog

If you are a backend or frontend engineer, chances are you have heard of Docker. For DevOps or infrastructure engineers, Docker needs no introduction. Docker has revolutionized the software deployment and packaging industry. However, as an engineer, you should always strive to understand the underlying mechanisms that make these tools work, rather than just memorizing commands.

As someone who enjoys diving deep into topics, I started exploring Docker a while ago. I wanted to understand what makes a container a container. It turns out that containers are powered by a handful of features of the Linux kernel. Isn't it fascinating that the software that drives cloud systems is built on top of the Linux kernel?

Core Features of Containerization

Containerization is built on three key features provided by the Linux kernel:

Namespaces
Cgroups
Chroot

Let's break down these concepts in simpler terms.

Namespaces

Namespaces control which system resources a process can see. A process can either share a resource with the host (and with other processes in the same namespace) or get its own isolated view of it. Put simply, what a program can see and share is determined by the namespaces it belongs to.

There are eight types of namespaces, but for our purposes, we need to understand three of them. Here's a quick overview of the flags that create a new namespace of each type, and what that namespace isolates:

CLONE_NEWCGROUP : Isolates the cgroup root directory
CLONE_NEWIPC : Isolates System V IPC objects and POSIX message queues
CLONE_NEWNET : Isolates network devices, stacks, and ports
CLONE_NEWNS : Isolates mount points
CLONE_NEWPID : Isolates process IDs
CLONE_NEWTIME : Isolates the boot and monotonic clocks
CLONE_NEWUSER : Isolates user and group IDs
CLONE_NEWUTS : Isolates the hostname and NIS domain name
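To make namespaces a little more concrete, here is a minimal sketch that reads which namespace the current process belongs to. It assumes a Linux system with /proc mounted, and the helper name namespace_id is hypothetical; it is not part of the runtime we build later.

```cpp
#include <cassert>
#include <climits>    // PATH_MAX
#include <string>
#include <unistd.h>   // readlink

// Returns the identifier of one of the calling process's namespaces,
// for example "pid:[4026531836]", or an empty string if it cannot be read.
// `kind` is a file name under /proc/self/ns, such as "pid", "uts" or "mnt".
std::string namespace_id(const std::string& kind) {
    assert(!kind.empty());
    const std::string link = "/proc/self/ns/" + kind;
    char buf[PATH_MAX];
    const ssize_t len = readlink(link.c_str(), buf, sizeof(buf) - 1);
    if (len < 0) {
        return "";   // unknown namespace type or /proc not available
    }
    return std::string(buf, static_cast<std::size_t>(len));
}
```

Two processes that report the same identifier for a given type share that namespace. Calling namespace_id("pid") on the host and inside a container would typically give different values.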

Chroot

Chroot, short for "Change Root", allows you to change your root directory to a custom location. This creates an isolated environment within the filesystem, which is essential for containers.

Cgroups

Cgroups, or control groups, allow the kernel to limit how much of the system's resources (such as CPU, memory, and the number of processes) a group of processes can use. This ensures that containers do not exceed their allocated resources and keeps the host stable.

With an understanding of these three features, you have the foundational knowledge to implement a container runtime in your favorite language. For this tutorial, I will use C++. Why not Rust or Go? Because the code we'll be writing involves syscalls and low-level system interactions, where the real fun lies in handling segmentation faults and debugging BSODs (cough cough, CrowdStrike).

So, fasten your seat belts because this is going to be a fun and exciting blog where we'll understand and implement our own container runtime.

Before we begin, let's set up a few things. You will need a Linux machine with root access (the namespace calls we use require it) and a C++ compiler with C++17 support.

Understanding Docker Run Command

Let's take a look at a simple Docker run command and break down what happens behind the scenes.

Run the following command:

sudo docker run ubuntu echo "Hello Dhananjay"

running docker run command

Here's what happens step-by-step:

Docker searches for the image 'ubuntu': If Docker cannot find the image locally, it fetches the image from a remote repository.
Docker loads the image: Once the image is fetched, Docker loads it into the system.
Docker runs the command: Docker runs the command echo "Hello Dhananjay" within the container.
Container exits: After executing the command, the container exits.

Let's implement similar behavior in our own container runtime.

Writing a C++ Program for Containerization

Let's start writing our C++ program. The first step is to extract the arguments and print them on the screen.

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

int main(int argc, char** argv) {

    // Skip argv[0] (the program name) and keep the remaining arguments
    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Got Arguments: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::endl;
    return EXIT_SUCCESS;

}

Compile and run the program with:

g++ main.cpp -std=c++17 -o knocker && ./knocker arg1 arg2

Running C++ Program

Now, let's modify the program to execute whatever is passed as an argument. We will use execvp, which is part of the exec family of functions. execvp replaces the current process with a new one specified by its arguments.


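#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>
#include <unistd.h>
#include <vector>

int main(int argc, char** argv) {

    // Without a command there is nothing to execute
    if (argc < 2) {
        std::cerr << "Usage: " << argv[0] << " <command> [args...]" << std::endl;
        return EXIT_FAILURE;
    }
    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    // execvp expects a null-terminated array of char*
    std::vector<char*> run_arg;
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    execvp(run_arg[0], run_arg.data());
    // execvp only returns if it failed to start the command
    perror("execvp");
    return EXIT_FAILURE;

}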

This code may look a bit complex due to the use of pointers, but here's what it does:

It stores the arguments in a vector of strings
It converts the vector of strings to a vector of C-style strings (char*)
It terminates the vector with a nullptr
It calls execvp to execute the command
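The conversion step above can be pulled into a small helper. This is only a sketch: the name to_exec_argv is hypothetical, and it relies on the non-const std::string::data() that C++17 provides, which avoids the const_cast.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Builds the null-terminated char* array that execvp expects.
// The returned pointers refer to the strings in `args`, so `args` must
// outlive the result and must not be modified while the result is in use.
std::vector<char*> to_exec_argv(std::vector<std::string>& args) {
    std::vector<char*> argv;
    argv.reserve(args.size() + 1);
    for (auto& arg : args) {
        argv.push_back(arg.data());   // non-const data() is available since C++17
    }
    argv.push_back(nullptr);          // execvp stops reading at the null pointer
    assert(argv.size() == args.size() + 1);
    return argv;
}
```

With this helper, the call site becomes: build the vector once, then pass argv[0] and argv.data() to execvp.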

Calling echo command

Now, let's start virtualizing the process using namespaces. We will use a syscall called clone. The clone syscall is used to create a new process, and it takes several parameters:

Function to run in the cloned process
Pointer to stack memory allocation for the cloned process
Flags/namespaces
Argument pointer to pass to the function

First, let's define the stack memory


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    execvp(run_arg[0], run_arg.data());
    return EXIT_SUCCESS;

}

We allocate 1MB of memory for the stack. malloc returns a pointer to the start of the allocated block, but the stack grows downwards on most architectures, so clone needs a pointer to the top of the block. That is why we add the size of the block to the pointer.
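Before adding any namespace flags, it can help to see clone working on its own. The sketch below runs a function in a cloned child and returns its exit status. It passes only SIGCHLD, so it should not need root. The names run_in_clone and exit_with_seven are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <sched.h>      // clone
#include <signal.h>     // SIGCHLD
#include <sys/wait.h>   // waitpid, WIFEXITED, WEXITSTATUS

// Runs fn(arg) in a child created with clone() and returns the child's exit
// status, or -1 if the child could not be created or did not exit normally.
int run_in_clone(int (*fn)(void*), void* arg) {
    assert(fn != nullptr);
    constexpr std::size_t kStackSize = 1024 * 1024;   // 1MB, as in the post
    char* stack = static_cast<char*>(std::malloc(kStackSize));
    if (stack == nullptr) {
        return -1;
    }
    char* stack_top = stack + kStackSize;   // the stack grows down from here

    int status = 0;
    const pid_t pid = clone(fn, stack_top, SIGCHLD, arg);
    const bool reaped = pid != -1 && waitpid(pid, &status, 0) == pid;
    std::free(stack);
    return (reaped && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}

// Hypothetical child function: its return value becomes the exit status.
int exit_with_seven(void*) {
    return 7;
}
```

Calling run_in_clone(exit_with_seven, nullptr) returns 7, which shows that the value returned by the child function is what the parent later reads with waitpid.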

Next, we need to declare a function to run in the cloned process:


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>

int runtime(void* args) {
    return 1;
}

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    execvp(run_arg[0], run_arg.data());
    return EXIT_SUCCESS;

}

Now, let's create process isolation. Open a terminal and run hostname to see the current hostname of your system

hostname

Running hostname command

Set a new hostname in a different terminal session

sudo hostname new-hostname

Setting new hostname

Notice that the hostname is updated in the original session as well, which is not what we want.

To isolate the hostname between processes, we will use the clone syscall with the CLONE_NEWUTS flag. Modify the code as follows


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args; 
    execvp((*arg)[0], (*arg).data()); 
    return 1; 
} 

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    void* arg = static_cast<void*>(&run_arg); 
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS, arg); 
    waitpid(child_process_id, nullptr, 0); 
    free(stack); 
    return EXIT_SUCCESS;

}

In this code:

We cast the address of run_arg to void* and pass it to clone, together with two flags: CLONE_NEWUTS, which places the child in a new UTS namespace, and SIGCHLD, the signal the kernel sends to the parent when the child exits.
Inside runtime, we cast the pointer back to its original type and call execvp. Back in main, we wait for the child to finish with waitpid and then free the stack memory.
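The same idea can be checked programmatically instead of by hand. The sketch below changes the hostname inside a child that has its own UTS namespace and then verifies that the caller's hostname is untouched. The function names are hypothetical, and creating the namespace normally requires root, so the helper reports a missing privilege separately.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <sched.h>      // clone, CLONE_NEWUTS
#include <signal.h>     // SIGCHLD
#include <sys/wait.h>   // waitpid
#include <unistd.h>     // gethostname, sethostname

// Child: rename the host inside the new UTS namespace.
// Exit status 0 means sethostname succeeded.
int rename_host(void* arg) {
    const char* name = static_cast<const char*>(arg);
    return sethostname(name, std::strlen(name)) == 0 ? 0 : 1;
}

// Returns 1 if the child renamed its host while the caller's hostname stayed
// the same, 0 if the child failed or the caller's hostname changed, and -1 if
// the child could not be created (typically because of missing privileges).
int hostname_is_isolated(const char* child_name) {
    assert(child_name != nullptr);
    char before[256] = {0};
    char after[256] = {0};
    gethostname(before, sizeof(before) - 1);

    constexpr std::size_t kStackSize = 1024 * 1024;
    char* stack = static_cast<char*>(std::malloc(kStackSize));
    if (stack == nullptr) {
        return -1;
    }
    const pid_t pid = clone(rename_host, stack + kStackSize,
                            SIGCHLD | CLONE_NEWUTS, const_cast<char*>(child_name));
    if (pid == -1) {
        std::free(stack);
        return -1;
    }
    int status = 0;
    waitpid(pid, &status, 0);
    std::free(stack);

    gethostname(after, sizeof(after) - 1);
    const bool child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    return (child_ok && std::strcmp(before, after) == 0) ? 1 : 0;
}
```

Run as root, hostname_is_isolated("knocker-test") is expected to return 1. Without the required privileges, clone fails and the function returns -1.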

Now, let's try setting a hostname as we did earlier but this time after running bash as the program argument. This should isolate the hostname changes to the new process

sudo ./knocker /bin/bash

Setting hostname in bash

Notice that the hostname is not updated in the host system.

Changing the hostname inside the spawned process didn't change the host system's hostname. To isolate the process further, let's look at the CLONE_NEWPID flag.

Isolating Processes in Containers

Running ps command

When we run ps inside the spawned bash, it shows every process running on the host, which means the spawned process isn't isolated yet. To isolate it, add the CLONE_NEWPID flag to the clone call.


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>
#include <sched.h> 
#include <signal.h> 
#include <sys/wait.h> 

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args; 
    execvp((*arg)[0], (*arg).data()); 
    return 1; 
} 

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    void* arg = static_cast<void*>(&run_arg); 
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID, arg);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;

}

However, running the program with the CLONE_NEWPID flag doesn't seem to change anything: ps still lists the host's processes. Why? In Linux, almost everything is exposed as a file, including process information, which the kernel publishes under /proc. ps simply reads the contents of /proc, and our child process still sees the host's /proc.
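To see why /proc matters, here is a rough sketch of how a ps-like tool discovers processes: it lists the directories in /proc whose names are numbers. The function names are hypothetical, and this is a simplification of what ps actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <filesystem>
#include <string>
#include <system_error>
#include <unistd.h>   // getpid
#include <vector>

// Returns the sorted PIDs visible under `proc_root`: every entry in /proc
// whose name consists only of digits is a process ID.
std::vector<int> visible_pids(const std::filesystem::path& proc_root = "/proc") {
    assert(!proc_root.empty());
    std::vector<int> pids;
    std::error_code ec;
    for (std::filesystem::directory_iterator it(proc_root, ec), end;
         !ec && it != end; it.increment(ec)) {
        const std::string name = it->path().filename().string();
        const bool numeric = !name.empty() &&
            std::all_of(name.begin(), name.end(),
                        [](unsigned char c) { return std::isdigit(c) != 0; });
        if (numeric) {
            pids.push_back(std::stoi(name));
        }
    }
    std::sort(pids.begin(), pids.end());
    return pids;
}

// True if the calling process finds its own PID in the listing.
bool sees_itself() {
    const std::vector<int> pids = visible_pids();
    return std::binary_search(pids.begin(), pids.end(), static_cast<int>(getpid()));
}
```

As long as the child reads the host's /proc, a listing like this contains the host's processes, no matter which PID namespace the child lives in.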

Creating a Custom Filesystem Structure

To isolate the process environment completely, including the /proc filesystem, we need to instruct runtime to use a custom filesystem. This involves changing the root of the filesystem.

Let's examine how Docker does this and what's inside the container filesystem:

docker run ubuntu echo "Hello World"
docker container ps --all
sudo docker export {container_id} > ubuntu_fs.tar
mkdir dock_ubuntu_tar
tar -xvf ubuntu_fs.tar -C dock_ubuntu_tar

Extracting filesystem

Running ls on extracted filesystem

Linux users will recognize this folder structure easily: it's a root filesystem. So, to make the OS use our own /proc, we need to change the root filesystem of the process.

CHRoot → Changing Root

First, we need an Ubuntu base image to use as the root filesystem of our isolated system. Run the following commands to download and extract it:

wget https://cdimage.ubuntu.com/ubuntu-base/releases/24.04/release/ubuntu-base-24.04-base-amd64.tar.gz
mkdir ubuntu_fs
tar -xf ubuntu-base-24.04-base-amd64.tar.gz -C ubuntu_fs/

Now let's update our code to use this new filesystem


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>
#include <sched.h> 
#include <signal.h> 
#include <sys/wait.h> 
#include <sys/mount.h>

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args; 
    std::string hostName = "Knocker-Host"; 
    sethostname(hostName.c_str(), hostName.length()); 
    chroot("/home/dhananjay/ubuntu_fs"); 
    chdir("/"); 
    mount("proc", "proc", "proc", 0, ""); 
    execvp((*arg)[0], (*arg).data()); 
    return 1; 
} 

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    void* arg = static_cast<void*>(&run_arg); 
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID, arg);
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;

}

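In this code, we set the hostname, change the root to the new filesystem, set the current working directory to /, and mount a fresh proc filesystem at /proc inside the new root, with no special flags. Because the mount happens inside the new PID namespace, this /proc only lists the processes in that namespace. Remember to replace /home/dhananjay/ubuntu_fs with the path where you extracted the base image.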
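The calls in runtime ignore their return values. If chroot fails, for example because the path does not exist, the command would run against the host's filesystem instead. Here is a minimal sketch of a checked version of that step; the helper name enter_root is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <unistd.h>   // chroot, chdir

// Changes the root directory to `new_root` and then moves into it.
// Returns true only if both steps succeed; otherwise prints the reason.
bool enter_root(const std::string& new_root) {
    assert(!new_root.empty());
    if (chroot(new_root.c_str()) != 0) {
        std::cerr << "chroot(" << new_root << ") failed: "
                  << std::strerror(errno) << '\n';
        return false;
    }
    if (chdir("/") != 0) {
        std::cerr << "chdir(/) failed: " << std::strerror(errno) << '\n';
        return false;
    }
    return true;
}
```

Inside runtime, the idea would be to return early when enter_root fails, so that proc is never mounted and the command is never started outside the new root.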

Running ps in the child process shows that bash is running as PID 1, indicating successful isolation of the child process from the parent.

Running ps command

However, if you look at the mount points on the host, you can see that proc is mounted on the host system as well. This is because the child still shares the host's mount namespace, so its mounts are visible to the host. To prevent this, we need to give the child its own mount namespace, mark its mounts as private so they don't propagate back, and unmount the proc filesystem once the command exits to release the resources.

Mount points


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>
#include <sched.h> 
#include <signal.h> 
#include <sys/wait.h> 
#include <sys/mount.h> 

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args; 
    unshare(CLONE_NEWNS); 
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL); 
    std::string hostName = "Knocker-Host"; 
    sethostname(hostName.c_str(), hostName.length()); 
    chroot("/home/dhananjay/ubuntu_fs"); 
    chdir("/"); 
    mount("proc", "proc", "proc", 0, ""); 
    pid_t pid = fork(); 
    if (pid == 0) {
        execvp((*arg)[0], (*arg).data());
        _exit(EXIT_FAILURE); // reached only if execvp fails
    } else {
        waitpid(pid, nullptr, 0);
        std::cout << "Cleanup Running" << std::endl;
        umount("proc");
    }
    return EXIT_SUCCESS; 
} 

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    void* arg = static_cast<void*>(&run_arg); 
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS, arg); 
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;

}

In this code snippet, we:

Clone the process with new UTS, PID, and mount namespaces.
Unshare the mount namespace from the host.
Mark every mount as private so that changes don't propagate back to the host.
Change root, set the hostname, and fork: the child runs the command while the parent waits and then cleans up.
This way, the runtime unmounts proc at the end of its lifetime, and we don't see the mount point cluttering the host system.
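One way to confirm that nothing leaked onto the host is to look for the mount point in /proc/self/mounts from a host shell. The sketch below does the same check in code. The helper name is_mounted is hypothetical, and the parsing is deliberately simple: it does not handle mount points that contain spaces, which the kernel writes in escaped form.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Returns true if `mount_point` appears as a mount target in `mounts_file`.
// Each line of /proc/self/mounts has the form:
//   device mount_point fstype options 0 0
bool is_mounted(const std::string& mount_point,
                const std::string& mounts_file = "/proc/self/mounts") {
    assert(!mount_point.empty());
    std::ifstream mounts(mounts_file);
    std::string line;
    while (std::getline(mounts, line)) {
        std::istringstream fields(line);
        std::string device;
        std::string target;
        if (fields >> device >> target && target == mount_point) {
            return true;
        }
    }
    return false;   // not listed, or the file could not be read
}
```

On the host, is_mounted with the path of the proc directory inside the extracted root filesystem should return false once the mount lives in the child's private mount namespace.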

Restricting Resources with Cgroups

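To add resource restrictions to our minimal container runtime, we will use the cgroup filesystem mounted at /sys/fs/cgroup. Cgroups can limit CPU, memory, and the number of processes; here we limit memory and the number of processes. The paths below follow the cgroup v1 layout, with one directory per controller. On systems that use cgroup v2, the directory layout and file names differ (for example, memory.max instead of memory.limit_in_bytes). Here's how to integrate cgroups into the code: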

Helper Function for applying limits automatically

The rule_set function creates the cgroups, adds the child's process ID to them, and configures the resource limits.


#include <vector>
#include <cstdlib>
#include <iterator>
#include <iostream>
#include <cstring>
#include <unistd.h>
#include <sched.h> 
#include <signal.h> 
#include <sys/wait.h> 
#include <sys/mount.h> 
#include <filesystem>
#include <fstream>

void rule_set(pid_t child_pid) {
    // cgroup v1 layout: each controller has its own hierarchy
    std::filesystem::path pids_path{"/sys/fs/cgroup/pids/knocker"};
    std::filesystem::path memory_path{"/sys/fs/cgroup/memory/knocker"};
    std::filesystem::create_directories(pids_path);
    std::filesystem::create_directories(memory_path);
    // Add the child to the pids cgroup, then cap the number of processes at 3
    std::ofstream ofs(pids_path / "cgroup.procs");
    ofs << child_pid;
    ofs.close();
    ofs.open(pids_path / "pids.max");
    ofs << "3";
    ofs.close();
    // Add the child to the memory cgroup, then cap memory at 200 MiB (209715200 bytes)
    ofs.open(memory_path / "cgroup.procs");
    ofs << child_pid;
    ofs.close();
    ofs.open(memory_path / "memory.limit_in_bytes");
    ofs << "209715200";
    ofs.close();
}

int runtime(void* args) {
    std::vector<char*>* arg = (std::vector<char*>*)args; 
    unshare(CLONE_NEWNS); 
    mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL); 
    std::string hostName = "Knocker-Host"; 
    sethostname(hostName.c_str(), hostName.length()); 
    chroot("/home/dhananjay/ubuntu_fs"); 
    chdir("/"); 
    mount("proc", "proc", "proc", 0, ""); 
    pid_t pid = fork(); 
    if (pid == 0) { 
        execvp((*arg)[0], (*arg).data()); 
        _exit(EXIT_FAILURE); // reached only if execvp fails
    } else { 
        waitpid(pid, nullptr, 0); 
        std::cout << "Cleanup Running" << std::endl; 
        umount("proc"); 
    } 
    return EXIT_SUCCESS; 
} 

int main(int argc, char** argv) {

    std::vector<std::string> args(argv + 1, argv + argc);
    std::cout << "Running Argument: ";
    std::copy(args.begin(), args.end(), std::ostream_iterator<std::string>(std::cout, " "));
    std::cout << std::flush;
    std::vector<char*> run_arg; 
    for (auto& str : args) {
        run_arg.push_back(const_cast<char*>(str.c_str()));
    }
    run_arg.push_back(nullptr);
    auto* stack = (char*)malloc(sizeof(char) * 1024 * 1024); 
    auto* stackTop = stack + sizeof(char) * 1024 * 1024; 
    void* arg = static_cast<void*>(&run_arg); 
    pid_t child_process_id = clone(runtime, stackTop, SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID | CLONE_NEWNS, arg); 
    rule_set(child_process_id); 
    waitpid(child_process_id, nullptr, 0);
    free(stack);
    return EXIT_SUCCESS;

}

In rule_set, we:

Create two new cgroup directories: /sys/fs/cgroup/pids/knocker and /sys/fs/cgroup/memory/knocker.
Assign the child process to each cgroup by writing its PID to cgroup.procs.
Set the limits: writing 3 to pids.max caps the number of processes, and writing 209715200 (200 MiB) to memory.limit_in_bytes caps memory usage.
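Writes to cgroup files can fail, for example when the program is not run as root or when the system uses a different cgroup layout, and rule_set would not notice. A minimal sketch of a checked write, using a hypothetical helper name write_control_file:

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

// Writes `value` to a cgroup control file and reports whether it worked.
// Flushing before checking the stream surfaces errors from the write itself,
// not only from opening the file.
bool write_control_file(const std::filesystem::path& file, const std::string& value) {
    assert(!value.empty());
    std::ofstream out(file);
    if (!out) {
        return false;   // missing cgroup directory, different layout, or no permission
    }
    out << value;
    out.flush();
    return out.good();
}
```

With a helper like this, rule_set could stop and report an error as soon as one of the limits cannot be applied, instead of starting the container without it.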

And that's it! We have successfully implemented a minimal container runtime in C++ that isolates processes, hostnames and restricts resources using cgroups. By understanding the underlying mechanisms of containerization, you can build your own container runtime from scratch. This is a great way to learn about the Linux kernel, system calls, and low-level programming. I hope you enjoyed this blog and learned something new.

Writing Container Runtime in C++