Interpretation and compilation

Bash, Python, Lua, and Perl (among many, many others) are interpreted languages. When we "run" a script written in an interpreted language, a separate program called an "interpreter" reads the script and carries out its instructions.

A compiled program is different: its code is loaded into memory and executed directly as machine language, literally a sequence of bytes that the hardware processor understands as instructions.

Turning text into bits

Let's look at our obligatory "helloworld.c":

#include <stdio.h>

int main(int argc, char **argv)
{
  printf("Hello world!\n");
}

Peering inside

We can use the program xxd to illustrate the difference between a Bash script and a compiled C program:

$ xxd ~/anagrams/anagram-build.sh | head -7
0000000: 2321 2f62 696e 2f62 6173 680a 0a64 6563  #!/bin/bash..dec
0000010: 6c61 7265 202d 4120 6469 6374 696f 6e61  lare -A dictiona
0000020: 7279 0a0a 2366 6f72 2828 203b 203b 2029  ry..#for(( ; ; )
0000030: 290a 7768 696c 6520 7265 6164 0a64 6f0a  ).while read.do.
0000040: 2320 2020 2072 6561 6420 0a23 2020 2020  #    read .#    
0000050: 6966 205b 2024 3f20 2d67 7420 3020 5d0a  if [ $? -gt 0 ].
0000060: 2320 2020 2074 6865 6e0a 2309 6563 686f  #    then.#.echo
$ xxd ~/helloworld | head -3
0000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
0000010: 0200 3e00 0100 0000 1004 4000 0000 0000  ..>.......@.....
0000020: 4000 0000 0000 0000 4811 0000 0000 0000  @.......H.......
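
The two dumps differ from their very first bytes: the script begins with the text "#!", while the compiled program begins with the ELF signature 0x7f 'E' 'L' 'F'. As a minimal sketch (the names file_kind and classify_header are made up for this illustration), a C function could tell the two apart by looking only at those leading bytes:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch. */
enum file_kind { KIND_UNKNOWN, KIND_SCRIPT, KIND_ELF };

/* Classify a file by its leading bytes, as seen in the xxd dumps:
 * scripts begin with "#!" and ELF binaries with 0x7f 'E' 'L' 'F'. */
enum file_kind classify_header(const unsigned char *buf, size_t len)
{
    static const unsigned char elf_magic[4] = { 0x7f, 'E', 'L', 'F' };

    if (len >= 4 && memcmp(buf, elf_magic, 4) == 0)
        return KIND_ELF;
    if (len >= 2 && buf[0] == '#' && buf[1] == '!')
        return KIND_SCRIPT;
    return KIND_UNKNOWN;
}
```

Given the first bytes of anagram-build.sh this would be expected to report KIND_SCRIPT, and given the first bytes of helloworld, KIND_ELF.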

Simple compilation

Compiling a C program can be (though it is not necessarily!) a simple process.

Compiling a "Hello World" can be done in a single line:

$ gcc -o helloworld helloworld.c

More details on compilation

We are hiding some important steps here. The first thing that happens is that the C preprocessor (cpp) is run over the helloworld.c file; we can ask the compiler to do just this stage with the -E option:

$ gcc -E helloworld.c
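
To make the preprocessor's job more visible, here is a small sketch using two macros that are not part of helloworld.c (GREETING, SQUARE, and both function names are invented for this illustration). The preprocessor works purely on text, so running gcc -E over a file like this should show the macro names already replaced by their definitions before the compiler proper sees the code:

```c
/* Hypothetical macros, not part of helloworld.c. */
#define GREETING "Hello world!"
#define SQUARE(x) ((x) * (x))

/* After preprocessing, this body reads: return ((n) * (n)); */
int square(int n)
{
    return SQUARE(n);
}

/* After preprocessing, this body reads: return "Hello world!"; */
const char *greeting(void)
{
    return GREETING;
}
```

The same textual substitution is what happens to #include <stdio.h> in helloworld.c: the directive is replaced by the contents of the header.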

Assembly language

The next stage is the translation of the pre-processed C source into assembly language (a human-readable representation of actual machine language); we can ask the compiler to stop after this stage with the -S option:

$ gcc -S helloworld.c
$ cat helloworld.s
	.file	"helloworld.c"
	.section	.rodata
.LC0:
	.string	"Hello world!"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$16, %rsp
	movl	%edi, -4(%rbp)
	movq	%rsi, -16(%rbp)
	movl	$.LC0, %edi
	call	puts
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3"
	.section	.note.GNU-stack,"",@progbits
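
One detail in the listing is worth a note: the call is to puts, not printf, and the stored string has lost its "\n". GCC can replace a printf call whose format string contains no conversions and ends in a newline with an equivalent puts call. A sketch of the C that the assembly above effectively corresponds to (the function name say_hello is made up for this illustration):

```c
#include <stdio.h>

/* puts() appends the newline itself, so the literal drops its "\n". */
int say_hello(void)
{
    return puts("Hello world!");
}
```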

From human-readable to computer-readable

We can also ask the compiler to stop after the next stage, the creation of actual machine language in an object file, before we "link" the C program with the C runtime and its shared libraries:

$ gcc -c helloworld.c
$ file helloworld.o
$ xxd helloworld.o
0000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
0000010: 0100 3e00 0100 0000 0000 0000 0000 0000  ..>.............
0000020: 0000 0000 0000 0000 3801 0000 0000 0000  ........8.......
0000030: 0000 0000 4000 0000 0000 4000 0d00 0a00  ....@.....@.....
0000040: 5548 89e5 4883 ec10 897d fc48 8975 f0bf  UH..H....}.H.u..
0000050: 0000 0000 e800 0000 00c9 c300 4865 6c6c  ............Hell
0000060: 6f20 776f 726c 6421 0000 4743 433a 2028  o world!..GCC: (
0000070: 5562 756e 7475 2f4c 696e 6172 6f20 342e  Ubuntu/Linaro 4.
0000080: 362e 332d 3175 6275 6e74 7535 2920 342e  6.3-1ubuntu5) 4.
0000090: 362e 3300 0000 0000 1400 0000 0000 0000  6.3.............
00000a0: 017a 5200 0178 1001 1b0c 0708 9001 0000  .zR..x..........
00000b0: 1c00 0000 1c00 0000 0000 0000 1b00 0000  ................
00000c0: 0041 0e10 8602 430d 0656 0c07 0800 0000  .A....C..V......
00000d0: 002e 7379 6d74 6162 002e 7374 7274 6162  ..symtab..strtab
00000e0: 002e 736

Linking/Loading

The final stage is the linking/loading stage, where we resolve any outstanding references and combine any other needed modules with our code modules to make our final executable. (With C, we generally need at least the C runtime startup files, such as crt1.o.)
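
An "outstanding reference" can be sketched in a few lines of C. In the hypothetical file below (message_length is an invented helper), strlen is only declared, never defined. Compiling it with gcc -c would typically leave strlen as an undefined symbol in the object file, which a tool such as nm can list; the linker later resolves it against the definition in the C library:

```c
#include <stddef.h>

/* Declaration only: promises the compiler that strlen exists somewhere.
 * The definition lives in the C library and is found at link time. */
size_t strlen(const char *s);

/* Hypothetical helper for this sketch. */
size_t message_length(const char *msg)
{
    return strlen(msg);
}
```

This is the same situation as the call to puts in helloworld.o: the object file records the name, and the linker supplies the address.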

Automating all of this!

The traditional program in the Unix world to automate the process of compilation is called make. It lets us write a set of rules describing how the units in a compilation (or compilations!) depend on each other and how to create each one.

Our first Makefile

helloworld: helloworld.c
<tab> gcc -o helloworld helloworld.c

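This is a complete Makefile. Note that the command line must begin with a literal tab character (shown above as <tab>); spaces will not work. Let's try it out: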

$ make
make: `helloworld' is up to date.
$ rm helloworld
$ make
gcc -o helloworld helloworld.c
$ make
make: `helloworld' is up to date.
$ touch helloworld.c
$ make
gcc -o helloworld helloworld.c

So make is quite intelligent about when to re-create a binary: using the dependency information we have provided in the first line, it rebuilds helloworld only when it is missing or older than helloworld.c.
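
The decision make took in the session above can be sketched in C. This is a simplified illustration, not make's actual implementation: the function name needs_rebuild is invented, it handles a single dependency, and it compares modification times at one-second resolution using the POSIX stat() call.

```c
#include <sys/stat.h>

/* Simplified version of make's decision for one target and one dependency.
 * Returns 1 if the target must be rebuilt, 0 if it is up to date,
 * and -1 if the dependency itself is missing. */
int needs_rebuild(const char *target, const char *dependency)
{
    struct stat target_info, dep_info;

    if (stat(dependency, &dep_info) != 0)
        return -1;              /* nothing to build from */
    if (stat(target, &target_info) != 0)
        return 1;               /* target does not exist yet */

    /* Rebuild only when the dependency is newer than the target. */
    return dep_info.st_mtime > target_info.st_mtime;
}
```

In these terms, "rm helloworld" makes the second stat() fail, and "touch helloworld.c" makes the dependency newer than the target; both lead to a rebuild.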

More with `make`

We can stop make from echoing a command before running it by prefixing the command with the "@" sign:

helloworld: helloworld.c
	@gcc -o helloworld helloworld.c

We can also use the very powerful pattern rules (the "%" wildcard) to automate compilation; in the command, $* stands for whatever "%" matched:

%.o: %.c
	gcc -c $*.c 

helloworld: helloworld.o
	gcc -o helloworld helloworld.o

Targets of convenience

Another popular thing to do is add targets that are merely conveniences, such as a clean target:

%.o: %.c
	gcc -c $*.c 

helloworld: helloworld.o
	gcc -o helloworld helloworld.o

# clean names an action, not a file, so mark it as phony
.PHONY: clean
clean:
	@rm -f helloworld helloworld.o helloworld.s

Now when we run "make clean", all of the generated files that might be lingering around are removed:

$ make clean
$ make
gcc -c helloworld.c 
gcc -o helloworld helloworld.o

Even more automation with CMake

In recent years, cmake has been gaining some ground as a tool to automate the creation of Makefiles.

We can use cmake with our helloworld program:

$ mkdir helloworld.d
$ cd helloworld.d
$ cp ~/helloworld.c .
$ cat > CMakeLists.txt <<EOF
cmake_minimum_required(VERSION 3.10)  # assumed minimum; adjust to your CMake
project(helloworld C)
add_executable(helloworld helloworld.c)
EOF
$ mkdir build-dir
$ cd build-dir
$ cmake ..
$ make
$ ./helloworld
Hello world!