Okay. At this point, we have a working kernel module that establishes a new, metadata-free binary format for Linux. But without the tools and libraries that we normally have at our disposal when writing programs, trying to use this format is an utter pain. We're forced to do everything from scratch, in bare assembly. How can we improve this situation?
To be specific, there are two problems here:
1. The standard build tools don't produce our format, so we can't use a compiler.
2. We don't have libc, so none of our familiar system functions are accessible.
In theory these problems are orthogonal to each other, but in practice they become somewhat intertwined. But we need to start somewhere, so let's begin with the first one.
The process of transforming an object file into an executable is
usually done by a linker, such as ld(1). Now in this case "linker"
is something of a misnomer, since we're not even trying to link
anything. We just want to generate binary files that work with our
kernel module. If we could get that from a standard object file, then
at a stroke we would be able to use compilers again. The irony is that
a binary image is present inside of an object file — if only we could
remove all of the guff surrounding it.
And hey, we can do exactly that, thanks to objcopy(1). This is a
utility program that few programmers ever need, but it's a nice little
program that's part of the standard GNU build tools. Its purpose is to
translate object files across different formats.
Are you familiar with the GNU BFD library? ("BFD" doesn't stand for
what you're probably thinking; it's short for "binary file
descriptor".) libbfd is a foundational hunk of code used in the GNU
toolchain. This library provides details about all the various binary
file formats, object files and executables alike. All of the GNU
build tools depend on libbfd for reading and writing these file
types, allowing them all to support binary file formats across a
variety of platforms.
objcopy provides a simple command-line interface to a major piece of
functionality provided by libbfd. It allows you to pull apart an
object file, and then stuff those contents back into another object
file using a different format. And as it happens, objcopy provides a
null format called "binary": a blank-slate format with no intrinsic
associated metadata. So, in theory, we should be able to ask objcopy
to extract the code section (which, according to tradition, is named
.text) from an ELF object file, and place it in a "binary" object
file, and that should give us a valid executable file that our kernel
module will load and run.
Before we can properly test this idea with an object file generated by
gcc(1), though, we'd need to have the C code actually, you know, do
something. (If nothing else, it would need to exit safely.) Which
brings us face-to-face with the second problem. So let's set objcopy
aside for a moment, and consider our lack of a system library.
It's rough for a C programmer to suddenly be denied access to the standard functions. However, it is true that a number of the things we're missing the most are provided by system calls, with the standard library functions being little more than wrappers. So let's consider those functions first.
We'll start with the exit system call, since it's quite simple and
used by nearly every program. We want to wrap a C function around it.
While we could use assembly language to write a function that can be
called from C, the number of instructions we need is small enough that
it makes more sense to use gcc's inline assembly feature instead:
[Listing: tiny.c]
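As a minimal sketch of what such a statement could look like, assuming x86-64 Linux (where the exit system call is number 60): the macro name sys_exit is a hypothetical placeholder, and the exact form used in tiny.c may differ.

```c
/*
 * Sketch only: invoke the exit system call from C via inline assembly.
 * Assumes x86-64 Linux; sys_exit is a hypothetical name.
 *   "a"(60)     -> load the system call number (60 = exit) into rax
 *   "D"(status) -> load the exit status into rdi
 */
#define sys_exit(status) \
    __asm__ volatile ("syscall" : : "a"(60), "D"(status))
```

The string on the left is the single instruction to emit; the two constraints on the right are the register requests.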
The inline assembly feature of gcc is notorious for being
complicated, but for our purposes it's absolutely worth it to
understand what it can do. In this example, we have on the left a
string that contains a single assembly-language instruction. But the
section on the right tells the compiler what you want the registers to
contain before it runs. In this case, we have a list of two registers,
with "a" indicating the register rax and "D" referring to rdi.
(There are varying levels of specificity available: for example,
"r" requests any general-purpose register, while "U" requests a
register that doesn't need to be preserved across function calls.) The
compiler will follow these requests and add instructions to initialize
rax and rdi with the given values right before the syscall
instruction.
We could have included instructions in our string to initialize rax
and rdi explicitly. But doing it this way allows the compiler, and
in particular the optimizer, to better merge our assembly with the
surrounding code. So, for example, knowing that the exit status has to
be stored in rdi may influence the compiler to keep it in rdi to
begin with, thus allowing it to optimize away a mov
instruction. (Of course, once you invite the optimizer to your party,
you have to dance to its tune. This is one reason that the inline
assembly statement is marked as volatile: an assembly statement whose
outputs go unused can be judged to have no effect, and removed
entirely. The volatile qualifier warns the compiler that the inline
assembly has side effects, and thus its usefulness cannot be judged on
its outputs alone. gcc already treats an assembly statement with no
outputs at all, like this one, as implicitly volatile, but spelling it
out does no harm.)
So, armed with this macro, let's return to our proposed objcopy
experiment. We'll make a C function that requires no assistance from
libc, or indeed any external entity:
[Listing: tiny.c]
The name of the function here isn't important, since we won't actually
be calling it as a function; it'll just run when the file is
executed. (It might have been more intuitive to call it main(), but
compilers give that name special treatment.) We'll compile it, and
then use objcopy to extract the object file's .text section.
While 21 bytes is quite a bit larger than our own 12-byte and 7-byte
creations, it's not bad for unoptimized compiler output. And it
suggests that objcopy did extract precisely what we wanted it to.
And when we test it:
We can see that we really do have a working binary, created using standard build tools. Let's try it again, but this time inviting the optimizer to get involved.
(The -Os option tells the compiler to optimize for size instead of performance.)
Down to 13 bytes: the C compiler is only one byte away from the version that we created manually on our first try. That's quite respectable! Let's look at the disassembly to see where the thirteenth byte came from:
It's actually the exact same program that we wrote, the only
difference being a useless ret instruction tacked on at the end.
That's not surprising, in retrospect, as the compiler has no way of
knowing that the inline assembly will never finish. But guess what: we
can actually remedy that. gcc has a special pseudo-function named
__builtin_unreachable(). Using it in your code is a promise to the
compiler that control can never reach it. If we place a call to this
function at the end of foo(), the optimizer will take advantage of
our guarantee:
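In source form, the change might look like the following sketch. It assumes x86-64 Linux; the sys_exit macro name and the exit status of 42 are placeholders chosen for illustration, not values taken from the original tiny.c.

```c
/* Hypothetical exit macro: 60 is the exit system call on x86-64 Linux. */
#define sys_exit(status) \
    __asm__ volatile ("syscall" : : "a"(60), "D"(status))

void foo(void)
{
    sys_exit(42);             /* arbitrary status, for illustration only */
    __builtin_unreachable();  /* promise: control never gets this far */
}
```

With the promise in place, the compiler is free to omit the function's return sequence, since nothing after the system call can ever run.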
Writing our programs in C is now looking rather attractive, when the compiler can do such a good job at whittling away extra instructions.
However, this objcopy technique is not going to scale up. Not only
is it a terrible hack, it's also dependent on there being only one
function in our program. If the code contained multiple functions, we
couldn't guarantee that our top-level function would be placed first
in the .text section. Moreover, if our code makes use of any global
variables, the compiler will almost certainly place them in a separate
.data section, making their addresses incompatible with addresses in
the .text section.
It might seem that we've landed back at square one, with no easy way
to extract usable binaries out of the compiler's object files. But not
so: we're just getting started here. Using objcopy allowed us to
avoid the linker entirely, but the truth is that we still want the
linker's help with things like address fixups. Fortunately for us, the
linker is remarkably amenable to this kind of detailed customization,
thanks to the existence of linker scripts.
If you are not already familiar with linker scripts, you may be
surprised to learn that a significant block of a linker's logic
resides not in the code itself, or even in a library like libbfd,
but rather in simple, textual configuration files. Every time you run
your linker, it uses the appropriate linker script as a guide for what
to put where. While linker scripts are mostly stored internally, you
can typically also find copies of them under the linker's search
paths. On my machine, they are under
/usr/lib/x86_64-linux-gnu/ldscripts/. Linker scripts can and do get
extremely complicated, but for basic needs like our own they can be
quite simple:
[Listing: comfile.x]
The first line sets the "binary" format as the BFD-provided file format to output. Using this null format ensures that the resulting file will contain only what our linker script explicitly asks for. The second line provides a default filename if none is provided on the command line.
The SECTIONS block lays out what the output file should contain,
like a blueprint. This script specifies that the linker should place
all of the .text and .rodata sections together before any .data
and .bss sections, for example. (The asterisk before each name is a
wildcard that matches every input file, since there can be more than
one section with the same name when multiple object files are
involved, i.e. the situation in which there is actual linking being
done.) Other sections not named in the blueprint do not make it into
the output file, for the most part. Some sections, like .eh_frame,
are still included in the BFD-defined binary format, but fortunately
the linker provides an explicit /DISCARD/ destination that causes
sections to be forcibly omitted.
But then what's this .main section at the top, with the hexadecimal
annotation? This is how we can choose which function will land at the
start of the generated binary. We will define a separate section that
is specifically set aside for our top-level function. The linker can
then ensure that that function is placed in front of everything else.
(The 0x10000 annotation is the value of loadaddr in our kernel
module. It informs the linker where the binary will be loaded into
memory, so that the linker can calculate absolute addresses if called
upon to do so.)
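Putting that description together, a linker script along these lines might look like the sketch below. This is a reconstruction from the prose rather than the original comfile.x: the default output name a.com, the wildcard section patterns, and the exact list of discarded sections are all assumptions.

```
/* Sketch of a linker script for the flat format; details are assumptions. */
OUTPUT_FORMAT(binary)
OUTPUT(a.com)                   /* hypothetical default output name */

SECTIONS
{
    /* The top-level function goes first, at the kernel module's loadaddr. */
    .main 0x10000 : { *(.main) }

    /* Code and read-only data together, then writable data. */
    .text : { *(.text*) *(.rodata*) }
    .data : { *(.data*) }
    .bss  : { *(.bss*) *(COMMON) }

    /* Forcibly drop sections that would otherwise end up in the output. */
    /DISCARD/ : { *(.eh_frame) *(.comment) *(.note*) }
}
```

The trailing wildcards (.text*, .rodata*) are there because compilers sometimes emit suffixed section names, such as those used for string literals.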
gcc, like most compilers, provides an extension that allows us to
mark objects for placement in non-standard sections. We can use this
in our C program like so:
[Listing: tiny.c]
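A minimal sketch of the idea, assuming gcc on x86-64 Linux. The helper is named my_exit here, rather than exit, only so that the fragment can also be compiled alongside a regular C library; the name _main and the exit status 42 are likewise illustrative choices.

```c
/* Ordinary helper: with no attribute, the compiler places it in .text. */
void my_exit(int status)
{
    __asm__ volatile ("syscall" : : "a"(60), "D"(status));
    __builtin_unreachable();
}

/* Top-level function: the attribute moves it into its own .main section. */
__attribute__((section(".main")))
void _main(void)
{
    my_exit(42);    /* arbitrary status, for illustration only */
}
```

Because the two functions live in differently named sections, a linker script can order them independently of where they appear in the source file.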
After compiling, we can use objdump(1) to look at the list of
sections created by the compiler:
As you can see, there are a fair number of sections, though several
are actually empty. But the important point is that there is both a
.text section and a .main section, and both of them are non-empty.
This shows that our two functions did in fact wind up in separate
sections. The .text section appears earlier in the object file, but
our linker script will ensure that the contents of .main are placed
first.
Typically the linker will determine which kind of output file it's
being asked to create from context, and then load the appropriate
linker script from its own store. We can explicitly tell the linker to
use our hand-rolled script instead, via the -T command-line option:
The binary's increased size is mainly due to exit() being written as
a separate function. If we turn optimization back on, the optimizer
will see that the exit() function can be inlined instead, and we
will quickly get our 12-byte executable again.
Now that we can safely write code with multiple functions, we can
create proper wrapper functions for system calls. We've got a working
exit() function; now let's see what a wrapper function for write()
might look like:
[Listing: write.c]
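The following is a sketch of such a wrapper, assuming x86-64 Linux (where write is system call number 1). The names my_write and my_errno are stand-ins for write and errno, chosen so that the fragment does not collide with the real C library if compiled alongside it. Negating the kernel's return value before storing it is an assumption made to mirror the standard library's convention.

```c
/* Stand-in for the C library's errno; the name is hypothetical. */
int my_errno;

long my_write(int fd, const void *buf, unsigned long count)
{
    long ret;

    __asm__ volatile ("syscall"
                      : "=a"(ret)                 /* output: rax -> ret */
                      : "0"(1L),                  /* input: rax = 1 (write) */
                        "D"(fd), "S"(buf), "d"(count)
                      : "rcx", "r11", "memory");  /* clobbered by syscall */

    /* The kernel reports failure as a negative error code in rax. */
    if (ret < 0) {
        my_errno = (int)-ret;
        ret = -1;
    }
    return ret;
}
```

A call such as my_write(1, "hello\n", 6) would then behave like the familiar library function, returning the number of bytes written or -1 on failure.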
This one is a bit more involved. That's partly because write() takes
more arguments, but it's mainly because unlike exit() it actually
returns. The part after the first colon declares the assembly's
outputs, i.e. the values that need to be transferred into C
variables afterwards. As with all system calls, write's return value
is stored in rax, which is indicated by the constraint string
"=a". Now rax also appears in the second list, the list of inputs,
but this time the constraint string is "0" instead of "a". This is
because registers cannot be repeated across constraint strings: if a
register is both an output and an input, the input needs to refer to
the output operand by its (zero-based) index instead.
The final clause in the asm statement is a list of things that are
neither inputs nor outputs but are nonetheless modified by the inline
assembly. The optimizer will assume that any register not mentioned
will be diligently preserved, so it's important to be thorough here.
As it happens, Linux system calls are documented as preserving all
registers except rcx and r11 (and rax, obviously), so our list
is mercifully short. The last item, "memory", is a general indicator
that the inline assembly reads and/or writes memory locations other
than the explicitly named C variables. (If we didn't include this
clobber, the optimizer might feel justified in reordering
statements so that, for example, the write system call is made
before the code that actually populates the buffer contents.)
Finally, our function checks for a negative return, and stores such
values in a global errno variable, so that it matches the behavior
of the standard library function.
Let's put all this together into a self-contained hello-world C program:
[Listing: hello.c]
And test it:
With the extra error-handling, our file size has ballooned to a whopping 72 bytes — but the fact that the C compiler generated it all for us is a breath of fresh air.
If we want to move on to a program that actually does something
useful, however, we are once again faced with the issue of accessing
argc, argv, and envp. Ideally we would like to access those
values as parameters to our top-level function, just as we normally do
with main(). But right now those values are stored on the stack, and
under the x86 64-bit calling convention, function arguments are passed
through registers instead. How can we fix this?
One way would be to add some inline assembly at the beginning of our top-level function to grab these three values from the stack and cram them into variables. This is not a great solution, because functions usually set up a stack frame first thing, the size of which varies. So it would be preferable to address the issue before the function runs, instead of during.
Another possible approach would be to work the opposite end, and
modify our kernel module so that it stores those three values in
registers in the first place. Is that even possible? It actually is.
You may remember that the final step in our kernel module's loader
function is calling start_thread(), and that the first argument to
that function is a pointer to a struct holding the process's register
contents. Nothing is stopping us from modifying those values before
passing them along to start_thread(). In fact, it's actually
considered good policy to do just that, and set all of the
general-purpose registers to known values (zero if nothing else).
Otherwise, the registers will contain whatever values were left in
them from the parent process, and this could, at least in theory, leak
information and become a security issue. Okay, so why haven't we done
that already? Well, the issue is that registers are
architecture-specific, and so naturally the struct storing their
values is too. The kernel source tree is set up so that most of the
code is architecture-neutral, with the minimal amount of necessary
architecture-specific code being relegated to separate directory
subtrees. Right now, our kernel module is architecture-neutral, so it
would be preferable not to embed x86-specific code in the middle of
it.
(Of course, if I thought there was a snowball's chance in hell that this code could ever become an official kernel module, I would happily break out the architecture-specific code into separate files and integrate them into the full directory structure, not to mention doing the necessary research to determine the proper calling convention for the Tensilica Xtensa architecture and all the other less-popular platforms that Linux runs on. But you and I both know that this flat binary file format is never going to be officially adopted, so I'd prefer to find a solution that doesn't involve the kernel module code.)
(For the record, the Xtensa calling convention passes arguments in
registers a2 through a7, in order, with
further arguments stored on the stack. However, the register window
can shift these forward depending on the call instruction, so I
recommend consulting a reference on the Xtensa calling convention for
further details.)
A third possibility would be to provide a tiny bit of prolog code, to
be inserted at the top of the binary, that pops the stack values into
registers. This is more or less what the implicit object file crt1.o
does for normal C programs, via the _start() function that in turn
calls main(). In fact, it's such a common thing to need that the
linker script has a feature supporting it. The STARTUP() command can
be used at the top level to implicitly include an object file and
ensure that it will be linked first, before the other input files. The
nice thing about this approach is that if we did want to add support
for another architecture, we would just need to provide a different
startup object file.
For the x86 64-bit architecture, the calling convention is that the
first six function arguments are stored in rdi, rsi, rdx, rcx,
r8, and r9, with further arguments stored on the stack. (Note that
this isn't the whole story: SSE registers are used for floating-point
arguments, for example. But these details are sufficient for our
purposes.) So, converting our three stack entries to function
parameters simply requires the following:
[Listing: startup.asm]
This reduces down to a measly three bytes of machine code, but we need to house it in an ELF object file so that the linker can use it at link time:
[Listing: comfile.x]
In order to verify that this change really does give us working function arguments, we'll write a quick-and-dirty test program:
[Listing: test.c]
When we build everything, we get:
And with that, we now have a working system for writing (almost) idiomatic C code, with the output being a binary file format of our own design.
Of course, this is still little more than a proof of concept. The next
step would be to create a larger library of wrapper functions around
the system calls, as well as providing popular standard functions like
strlen(). (We should also define a nice macro to hide the messiness
of the _main() function declaration.) A dozen or so functions is
enough to be able to start writing some non-trivial programs.
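As a small illustration of both ideas, here is a sketch of a strlen() replacement together with one possible shape for such a macro. The names my_strlen and COMFILE_MAIN are hypothetical, and the macro is only one guess at what a declaration helper could look like.

```c
/* Minimal strlen() replacement; assumes s points to a NUL-terminated string. */
unsigned long my_strlen(const char *s)
{
    const char *p = s;

    while (*p != '\0')
        ++p;
    return (unsigned long)(p - s);
}

/*
 * Hypothetical macro hiding the section attribute on the top-level
 * function. Intended use:
 *     COMFILE_MAIN(int argc, char *argv[], char *envp[]) { ... }
 */
#define COMFILE_MAIN \
    __attribute__((section(".main"))) void _main
```

The function returns unsigned long rather than size_t simply to avoid depending on any header.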
In fact, I did enough of this to be able to build my own replacement
version of factor(1). A standalone binary with all the same features
as the standard utility, it manages to squeeze in at just under 1k in
size, whereas the /usr/bin/factor on my machine is dependent on
libc and is still over 74k. What a grotesque cyclopean boat anchor
of a binary, am I right? (Of course, the standard utility includes a
bunch of complicated math functions that allow it to complete quickly
even when given very large numbers, but you don't really notice the
difference in speed until you reach the trillions. Hey, there are
always tradeoffs.)
That said, my version is a standalone executable out of necessity. If
it could have linked with a shared system library, like libc, it
might have been smaller still. But our binary file format has no
support for dynamic linking. Nor is it likely to in the future, as
that would require the ability to look up functions by name, and to
identify addresses needing fixups at runtime. All of which calls for
… metadata. So, static libraries for us it is. If you'd like to see
and/or build any of this code yourself, by the way, I've provided a
tarball with all of the source code.
Click here to download comfile-0.3.tar.gz.
It would be awfully nice, though, if we had the option to statically
link with libc itself. In theory this ought to be possible, since
our private binary file format doesn't really require any special
handling of code before the linking stage, and building a static
library doesn't involve the linker. Unfortunately, however, libc is
a bit of a special case, as the library depends on special features of
ELF executables (such as defining initialization functions that run in
advance of main()). So even though I was able to get my programs to
statically link with libc.a, they invariably crashed on startup.
It's possible that someone else's nonstandard, less-featureful
implementation of the C library would work for our programs, but so
far I haven't found a decent solution. I'll keep looking, though. This
binary file format may never be installed on any Linux machine besides
my own, but I'm fond of it nonetheless.