My Own Private Binary: Appendix: Making the Flat Binary Format Usable

Okay. At this point, we have a working kernel module that establishes a new, metadata-free binary format for Linux. But without the tools and libraries that we normally have at our disposal when writing programs, trying to use this format is an utter pain. We're forced to do everything from scratch, in bare assembly. How can we improve this situation?

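Two problems stand out: we have no way to turn a compiler's output into a binary in our format, and we have no system library for our programs to call. In theory these problems are orthogonal to each other, but in practice they become somewhat intertwined. But we need to start somewhere, so let's begin with the first one.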

Creating Binaries

The process of transforming an object file into an executable is usually done by a linker, such as ld(1). Now in this case "linker" is something of a misnomer, since we're not even trying to link anything. We just want to generate binary files that work with our kernel module. If we could get that from a standard object file, then at a stroke we would be able to use compilers again. The irony is that a binary image is present inside of an object file — if only we could remove all of the guff surrounding it.

And hey, we can do exactly that, thanks to objcopy(1). This is a utility program that few programmers ever need, but it's a nice little program that's part of the standard GNU build tools. Its purpose is to translate object files across different formats.

Are you familiar with the GNU BFD library? ("BFD" doesn't stand for what you're probably thinking; it's short for "binary file descriptor".) libbfd is a foundational hunk of code used in the GNU toolchain. This library provides details about all the various binary file formats — object files and executables alike. All of the GNU build tools depend on libbfd for reading and writing these file types, allowing them all to support binary file formats across a variety of platforms.

objcopy provides a simple command-line interface to a major piece of functionality provided by libbfd. It allows you to pull apart an object file, and then stuff those contents back into another object file using a different format. And as it happens, objcopy provides a null format called "binary" — a blank-slate format with no intrinsic associated metadata. So, in theory, we should be able to ask objcopy to extract the code section (which, according to tradition, is named .text) from an ELF object file, and place it in a "binary" object file, and that should give us a valid executable file that our kernel module will load and run.

Before we can properly test this idea with an object file generated by gcc(1), though, we'd need to have the C code actually, you know, do something. (If nothing else, it would need to exit safely.) Which brings us face-to-face with the second problem. So let's set objcopy aside for a moment, and consider our lack of a system library.

Utility Functions

It's rough for a C programmer to suddenly be denied access to the standard functions. However, it is true that a number of the things we're missing the most are provided by system calls, with the standard library functions being little more than wrappers. So let's consider those functions first.

We'll start with the exit system call, since it's quite simple and used by nearly every program. We want to wrap a C function around it. While we could use assembly language to write a function that can be called from C, the number of instructions we need is small enough that it makes more sense to use gcc's inline assembly feature instead:

tiny.c

#define exit(exitcode)  asm volatile ("syscall" : : "a" (60), "D" (exitcode))

The inline assembly feature of gcc is notorious for being complicated, but for our purposes it's absolutely worth it to understand what it can do. In this example, we have on the left a string that contains a single assembly-language instruction. But the section on the right tells the compiler what you want the registers to contain before it runs. In this case, we have a list of two registers, with "a" indicating the register rax and "D" referring to rdi. (There are varying levels of specificity available — for example, "r" requests any general-purpose register, while "U" requests a register that doesn't need to be preserved across function calls.) The compiler will follow these requests and add instructions to initialize rax and rdi with the given values right before the syscall instruction.

We could have included instructions in our string to initialize rax and rdi explicitly. But doing it this way allows the compiler, and in particular the optimizer, to better merge our assembly with the surrounding code. So, for example, knowing that the exit status has to end up in rdi may influence the compiler to compute it in rdi to begin with, thus allowing it to optimize away a mov instruction. (Of course, once you invite the optimizer to your party, you have to dance to its tune. This is one reason that the inline assembly statement is marked as volatile: an optimizer that judges an asm statement only by its outputs could deduce that ours, having none, has no effect, and remove it entirely. gcc does treat an asm statement with no output operands as implicitly volatile, but spelling out the qualifier makes the intent explicit: the inline assembly has side effects, and thus its usefulness cannot be judged on its outputs alone.)
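To try the macro out before we have a binary format to put it in, here is a minimal sketch that runs it in a forked child of an ordinary libc-hosted program on x86-64 Linux. The names raw_exit, raw_exit_explicit, and child_exit_status are made up for this illustration (the rename avoids colliding with libc's exit()), and the keyword is spelled __asm__ only so that it also compiles under strict -std= modes. The second macro is the hand-loaded alternative described above, included purely for comparison.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The article's macro under a hypothetical name. The compiler itself
   loads 60 (the exit system call) into rax and the status into rdi. */
#define raw_exit(code) \
    __asm__ volatile ("syscall" : : "a" (60), "D" (code))

/* Hand-loaded variant: the register moves live inside the template, so
   the optimizer cannot merge them with the surrounding code. The clobber
   list keeps the compiler from placing the input in rax or rdi. */
#define raw_exit_explicit(code) \
    __asm__ volatile ("mov %0, %%edi\n\t" \
                      "mov $60, %%eax\n\t" \
                      "syscall" \
                      : : "r" ((int)(code)) : "rax", "rdi")

/* Test harness only: fork a child, have it exit through one of the
   macros, and return the exit status seen by the parent (-1 on error). */
static int child_exit_status(int code, int explicit_moves)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (explicit_moves)
            raw_exit_explicit(code);
        else
            raw_exit(code);
        _exit(127);     /* only reached if the system call returned */
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling child_exit_status(42, 0) should report 42 back to the parent. The fork() and waitpid() calls come from libc and exist only to observe the result; nothing like them is available inside our flat binaries.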

Linking Without Linkers

So, armed with this macro, let's return to our proposed objcopy experiment. We'll make a C function that requires no assistance from libc, or indeed any external entity:

tiny.c

#define exit(exitcode)  asm volatile ("syscall" : : "a" (60), "D" (exitcode))

void foo(void)
{
    exit(42);
}

The name of the function here isn't important, since we won't actually be calling it as a function — it'll just run when the file is executed. (It might have been more intuitive to call it main(), but compilers give that name special treatment.) We'll compile it, and then use objcopy to extract the object file's .text section.

While 21 bytes is quite a bit larger than our own 12-byte and 7-byte creations, it's not bad for unoptimized compiler output. And it suggests that objcopy did extract precisely what we wanted it to. And when we test it:

We can see that we really do have a working binary, created using standard build tools. Let's try it again, but this time inviting the optimizer to get involved.

Down to 13 bytes — the C compiler is only one byte away from the version that we created manually on our first try. That's quite respectable! Let's look at the disassembly to see where the thirteenth byte came from:

It's actually the exact same program that we wrote, the only difference being a useless ret instruction tacked on at the end. That's not surprising, in retrospect, as the compiler has no way of knowing that the inline assembly will never finish. But guess what: we can actually remedy that. gcc has a special pseudo-function named __builtin_unreachable(). Using it in your code is a promise to the compiler that control can never reach it. If we place a call to this function at the end of foo(), the optimizer will take advantage of our guarantee:

Writing our programs in C is now looking rather attractive, when the compiler can do such a good job at whittling away extra instructions.

However, this objcopy technique is not going to scale up. Not only is it a terrible hack, it's also dependent on there being only one function in our program. If the code contained multiple functions, we couldn't guarantee that our top-level function would be placed first in the .text section. Moreover, if our code makes use of any global variables, the compiler will almost certainly place them in a separate .data section, making their addresses incompatible with addresses in the .text section.

It might seem that we've landed back at square one, with no easy way to extract usable binaries out of the compiler's object files. But not so: we're just getting started here. Using objcopy allowed us to avoid the linker entirely, but the truth is that we still want the linker's help with things like address fixups. Fortunately for us, the linker is remarkably amenable to this kind of detailed customization, thanks to the existence of linker scripts.

Linker Scripts

If you are not already familiar with linker scripts, you may be surprised to learn that a significant block of a linker's logic resides not in the code itself, or even in a library like libbfd, but rather in simple, textual configuration files. Every time you run your linker, it uses the appropriate linker script as a guide for what to put where. While linker scripts are mostly stored internally, you can typically also find copies of them under the linker's search paths. On my machine, they are under /usr/lib/x86_64-linux-gnu/ldscripts/. Linker scripts can and do get extremely complicated, but for basic needs like our own they can be quite simple:

comfile.x

/* Linker script for command executable files */
OUTPUT_FORMAT(binary)
OUTPUT(a.com)
SECTIONS
{
  .main 0x10000 : { *(.main) }
  .text : { *(.text) *(.rodata) }
  .data : { *(.data) *(.bss) }
  /DISCARD/ : { *(.eh_frame) }
}

The OUTPUT_FORMAT command sets the "binary" format as the BFD-provided file format to output. Using this null format ensures that the resulting file will contain only what our linker script explicitly asks for. The OUTPUT command provides a default filename if none is provided on the command line.

The SECTIONS block lays out what the output file should contain, like a blueprint. This script specifies that the linker should place all of the .text and .rodata sections together before any .data and .bss sections, for example. (The asterisks before the names are there because there can be more than one section with the same name when multiple object files are involved — i.e. the situation in which there is actual linking being done.) Other sections not named in the blueprint do not make it into the output file, for the most part. Some sections, like .eh_frame, are still included in the BFD-defined binary format, but fortunately the linker provides an explicit /DISCARD/ destination that causes sections to be forcibly omitted.

But then what's this .main section at the top, with the hexadecimal annotation? This is how we can choose which function will land at the start of the generated binary. We will define a separate section that is specifically set aside for our top-level function. The linker can then ensure that that function is placed in front of everything else. (The 0x10000 annotation is the value of loadaddr in our kernel module. It informs the linker where the binary will be loaded into memory, so that the linker can calculate absolute addresses if called upon to do so.)

gcc, like most compilers, provides an extension that allows us to mark objects for placement in non-standard sections. We can use this in our C program like so:

tiny.c

static void exit(int exitcode)
{
    asm ("syscall" : : "a" (60), "D" (exitcode));
    __builtin_unreachable();
}

void __attribute__((section(".main"))) _main(void)
{
    exit(42);
}

After compiling, we can use objdump(1) to look at the list of sections created by the compiler:

As you can see, there are a fair number of sections, though several are actually empty. But the important point is that there is both a .text section and a .main section, and both of them are non-empty. This shows that our two functions did in fact wind up in separate sections. The .text section appears earlier in the object file, but our linker script will ensure that the contents of .main are placed first.

Typically the linker will determine which kind of output file it's being asked to create from context, and then load the appropriate linker script from its own store. We can explicitly tell the linker to use our hand-rolled script instead, via the -T command-line option:

The binary's increased size is mainly due to exit() being written as a separate function. If we turn optimization back on, the optimizer will see that the exit() function can be inlined instead, and we will quickly get our 12-byte executable again.

More Inline Assembly

Now that we can safely write code with multiple functions, we can create proper wrapper functions for system calls. We've got a working exit() function; now let's see what a wrapper function for write() might look like:

write.c

long write(int fd, void const *buf, long size)
{
    long r;

    asm volatile ("syscall" : "=a" (r)
                            : "0" (1), "D" (fd), "S" (buf), "d" (size)
                            : "rcx", "r11", "memory");

    if (r < 0) {
        errno = -r;
        r = -1;
    }
    return r;
}

This one is a bit more involved. That's partly because write() takes more arguments, but it's mainly because unlike exit() it actually returns. The part after the first colon declares the assembly's outputs — i.e. the values that need to be transferred into C variables afterwards. As with all system calls, write's return value is stored in rax, which is indicated by the constraint string "=a". Now rax also appears in the second list, the list of inputs, but this time the constraint string is "0" instead of "a". This is because registers cannot be repeated across constraint strings: if a register is both an output and an input, the latter needs to be referred to by its (zero-based) index instead.

The final clause in the asm statement is a list of operands that are neither inputs nor outputs but are nonetheless modified by the inline assembly. The optimizer will assume that any register not mentioned will be diligently preserved, so it's important to be thorough here. As it happens, Linux system calls are documented as preserving all registers except rcx and r11 (and rax, obviously), so our list is mercifully short. The last item, "memory", is a general indicator that the inline assembly reads and/or writes memory locations other than the explicitly named C variables. (If we didn't include this constraint, the optimizer might feel justified in reordering statements so that, for example, the write system call is made before the code that actually populates the buffer contents.)

Finally, our function checks for a negative return, and stores such values in a global errno variable, so that it matches the behavior of the standard library function.

hello.c

static int errno = 0;

static long write(int fd, void const *buf, long size)
{
    long r;
    asm volatile ("syscall" : "=a" (r)
                            : "0" (1), "D" (fd), "S" (buf), "d" (size)
                            : "rcx", "r11", "memory");
    if (r < 0) {
        errno = -r;
        r = -1;
    }
    return r;
}

static __attribute__((noreturn)) void exit(int exitcode)
{
    asm volatile ("syscall" : : "a" (60), "D" (exitcode));
    __builtin_unreachable();
}

void __attribute__((section(".main"))) _main(void)
{
    write(1, "hello, world\n", 13);
    exit(errno ? 1 : 0);
}

With the extra error-handling, our file size has ballooned to a whopping 72 bytes — but the fact that the C compiler generated it all for us is a breath of fresh air.
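The pattern behind write() carries over to any system call of up to three arguments, so one way to avoid repeating the asm statement is to factor it out. What follows is only a sketch of that idea, not code from the tarball: raw_syscall3, check_result, sys_errno, sys_write, and sys_close are all hypothetical names, chosen so that the block can also be compiled alongside libc for testing. The system call numbers (1 for write, 3 for close) are the x86-64 Linux ones.

```c
/* Hypothetical stand-in for the errno global used in hello.c. */
static int sys_errno = 0;

/* Issue a system call with up to three arguments. Everything is passed
   as long, so each operand fills its whole 64-bit register. */
static long raw_syscall3(long nr, long a, long b, long c)
{
    long r;

    __asm__ volatile ("syscall" : "=a" (r)
                                : "0" (nr), "D" (a), "S" (b), "d" (c)
                                : "rcx", "r11", "memory");
    return r;
}

/* Translate the kernel's negative return value into the libc
   convention: store the error code and return -1. */
static long check_result(long r)
{
    if (r < 0) {
        sys_errno = (int)-r;
        return -1;
    }
    return r;
}

static long sys_write(int fd, void const *buf, long size)
{
    return check_result(raw_syscall3(1, fd, (long)buf, size));
}

static long sys_close(int fd)
{
    return check_result(raw_syscall3(3, fd, 0, 0));
}
```

Called on an invalid descriptor, for example sys_close(-1), the wrapper returns -1 and leaves EBADF (9 on Linux) in sys_errno. With optimization turned on, the compiler is free to inline the helpers, so the extra layer need not cost anything in the final binary.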

Arguing About Arguments

If we want to move on to a program that actually does something useful, however, we are once again faced with the issue of accessing argc, argv, and envp. Ideally we would like to access those values as parameters to our top-level function, just as we normally do with main(). But right now those values are stored on the stack, and under the x86 64-bit calling convention, function arguments are passed through registers instead. How can we fix this?

One way would be to add some inline assembly at the beginning of our top-level function to grab these three values from the stack and cram them into variables. This is not a great solution, because functions usually set up a stack frame first thing, the size of which varies. So it would be preferable to address the issue before the function runs, instead of during.

Another possible approach would be to work the opposite end, and modify our kernel module so that it stores those three values in registers in the first place. Is that even possible? It actually is. You may remember that the final step in our kernel module's loader function is calling start_thread(), and that the first argument to that function is a pointer to a struct holding the process's register contents. Nothing is stopping us from modifying those values before passing them along to start_thread(). In fact, it's actually considered good policy to do just that, and set all of the general-purpose registers to known values (zero if nothing else). Otherwise, the registers will contain whatever values were left in them from the parent process, and this could, at least in theory, leak information and become a security issue. Okay, so why haven't we done that already? Well, the issue is that registers are architecture-specific, and so naturally the struct storing their values is too. The kernel source tree is set up so that most of the code is architecture-neutral, with the minimal amount of necessary architecture-specific code being relegated to separate directory subtrees. Right now, our kernel module is architecture-neutral, so it would be preferable not to embed x86-specific code in the middle of it.

(Of course, if I thought there was a snowball's chance in hell that this code could ever become an official kernel module, I would happily break out the architecture-specific code into separate files and integrate them into the full directory structure, not to mention doing the necessary research to determine the proper calling convention for the Tensilica Xtensa architecture and all the other less-popular platforms that Linux runs on. But you and I both know that this flat binary file format is never going to be officially adopted, so I'd prefer to find a solution that doesn't involve the kernel module code.)

A third possibility would be to provide a tiny bit of prolog code, to be inserted at the top of the binary, that pops the stack values into registers. This is more or less what the implicit object file crt1.o does for normal C programs, via the _start() function that in turn calls main(). In fact, it's such a common thing to need that the linker script has a feature supporting it. The STARTUP() command can be used at the top level to implicitly include an object file and ensure that it will be linked first, before the other input files. The nice thing about this approach is that if we did want to add support for another architecture, we would just need to provide a different startup object file.

For the x86 64-bit architecture, the calling convention is that the first six function arguments are stored in rdi, rsi, rdx, rcx, r8, and r9, with further arguments stored on the stack. (Note that this isn't the whole story: SSE registers are used for floating-point arguments, for example. But these details are sufficient for our purposes.) So, converting our three stack entries to function parameters simply requires the following:

startup.asm

BITS 64
SECTION .main	
	pop	rdi
	pop	rsi
	pop	rdx

This reduces down to a measly three bytes of machine code, but we need to house it in an ELF object file so that the linker can use it at link time:

comfile.x

/* Linker script for command executable files */
OUTPUT_FORMAT(binary)
OUTPUT(a.com)
STARTUP(startup.o)
SECTIONS
{
  .main 0x10000 : { *(.main) }
  .text : { *(.text) *(.rodata) }
  .data : { *(.data) *(.bss) }
  /DISCARD/ : { *(.eh_frame) }
}

Putting It All Together

In order to verify that this change really does give us working function arguments, we'll write a quick-and-dirty test program:

test.c

static void println(char *str)
{
    int n, r;

    for (n = 0 ; str[n] ; ++n) ;
    str[n] = '\n';
    asm volatile ("syscall" : "=a" (r)
                            : "0" (1), "D" (1), "S" (str), "d" (n + 1)
                            : "rcx", "r11", "memory");
}

static void exit(int exitcode)
{
    asm volatile ("syscall" : : "a" (60), "D" (exitcode));
    __builtin_unreachable();
}

void __attribute__((section(".main"))) _main(int argc, char *argv[], char *envp[])
{
    int i;

    for (i = 0 ; i < argc ; ++i)
        println(argv[i]);
    for (i = 0 ; envp[i] ; ++i)
        println(envp[i]);
    exit(0);
}

And with that, we now have a working system for writing (almost) idiomatic C code, with the output being a binary file format of our own design.

Of course, this is still little more than a proof of concept. The next step would be to create a larger library of wrapper functions around the system calls, as well as providing popular standard functions like strlen(). (We should also define a nice macro to hide the messiness of the _main() function declaration.) A dozen or so functions is enough to be able to start writing some non-trivial programs.
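As a rough idea of what the first entries in such a library might look like, here is a small sketch. The names COM_MAIN, str_length, and str_compare are invented for this example; they are named differently from their libc counterparts so that the block can also be compiled and tested in an ordinary hosted program.

```c
/* Hypothetical macro hiding the section attribute on the top-level
   function. Intended use in a program:
       COM_MAIN(int argc, char *argv[], char *envp[]) { ... }        */
#define COM_MAIN  void __attribute__((section(".main"))) _main

/* Count the bytes before the terminating NUL, like strlen(). */
static unsigned long str_length(char const *s)
{
    unsigned long n = 0;

    while (s[n])
        ++n;
    return n;
}

/* Compare two strings byte by byte, like strcmp(): negative, zero, or
   positive according to how a sorts relative to b. */
static int str_compare(char const *a, char const *b)
{
    while (*a && *a == *b) {
        ++a;
        ++b;
    }
    return (unsigned char)*a - (unsigned char)*b;
}
```

With str_length() available, the hand-written counting loop in println() could be replaced by a single call, and hello.c would no longer need a hard-coded 13.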

In fact, I did enough of this to be able to build my own replacement version of factor(1). A standalone binary with all the same features as the standard utility, it manages to squeeze in at just under 1k in size — whereas the /usr/bin/factor on my machine is dependent on libc and is still over 74k. What a grotesque cyclopean boat anchor of a binary, am I right? (Of course, the standard utility includes a bunch of complicated math functions that allow it to complete quickly even when given very large numbers, but you don't really notice the difference in speed until you reach the trillions. Hey, there are always tradeoffs.)

That said, my version is a standalone executable out of necessity. If it could have linked with a shared system library, like libc, it might have been smaller still. But our binary file format has no support for dynamic linking. Nor is it likely to in the future, as that would require the ability to look up functions by name, and to identify addresses needing fixups at runtime. All of which calls for … metadata. So, static libraries for us it is. If you'd like to see and/or build any of this code yourself, by the way, I've provided a tarball with all of the source code.

It would be awfully nice, though, if we had the option to statically link with libc itself. In theory this ought to be possible, since our private binary file format doesn't really require any special handling of code before the linking stage, and building a static library doesn't involve the linker. Unfortunately, however, libc is a bit of a special case, as the library depends on special features of ELF executables (such as defining initialization functions that run in advance of main()). So even though I was able to get my programs to statically link with libc.a, they invariably crashed on startup. It's possible that someone else's nonstandard, less-featureful implementation of the C library would work for our programs, but so far I haven't found a decent solution. I'll keep looking, though. This binary file format may never be installed on any Linux machine besides my own, but I'm fond of it nonetheless.
