Session 1: C Pitfalls

Caution

You are not authorized to feed any of this content to any online LLM for any purpose. If, for some reason, you need to interact with an LLM, you may run an open-source model on your local machine and feed it the course content. See llama.cpp to install a CPU-efficient LLM inference engine and use your own computer to ask your questions.

In this first practical session, we’ll learn what a memory-safe programming language is. And the best way to understand memory-safety is to first have a look at memory-unsafety.

Is my RAM Unsafe?

Let’s open Godbolt’s compiler explorer, write a little C function that returns a character, then look at the produced x86 assembly code:

An invalid memory access.

Evidently, trying to access index 400 of a 3-item array won’t work. However, the code happily compiles! Look at the assembly code, instruction by instruction:

  1. subl: We reserve 16 bytes on the stack by decrementing the ESP register (i.e., the stack pointer).
  2. movzbl: We read a byte at address ESP + 413, then put it into the EAX register (at this point, we just need to know that the first byte of the list array is located at ESP + 13).
  3. addl: We free 16 bytes from the stack.
  4. ret: We finally jump back to the caller. The 32-bit return address is read then removed from the stack, all in one instruction. The caller expects the return value to be written in EAX, so that’s OK.

The movzbl instruction here shows us that C is a memory-unsafe programming language. Indeed, the memory address is invalid but the compiled program will try to read it nonetheless.
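To make the point concrete, here is a minimal sketch of the address arithmetic behind an array access. The helper names `element_address` and `offset_of_index_400` are hypothetical and not part of the course material. The code only computes a number and never reads the resulting address, so it is safe to run even with an out-of-range index.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes the address that base[index] refers to,
 * using the same "base + index * element size" arithmetic as the
 * compiled code. It never dereferences the result. */
uintptr_t element_address(const char *base, size_t index) {
    return (uintptr_t)base + index * sizeof(char);
}

/* Distance in bytes between list[0] and the element at index 400 of a
 * 3-byte array. Nothing in the arithmetic depends on the array's size. */
size_t offset_of_index_400(void) {
    char list[3] = { 24, 75, 3 };
    return (size_t)(element_address(list, 400) - (uintptr_t)list);
}
```

The size of the array (3) appears nowhere in the computation, which is why the compiler can emit a read at base + 400 without complaint.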

Now, imagine that instead of a trivially hard-coded 400, we add a new index parameter. And suppose that, at some point, a malicious user can provide whatever value they want for it. Now you’ve got yourself a memory-safety issue:

char get_char(int index) {
    char list[3] = { 24, 75, 3 };
    return list[index];
}

Given an invalid index, in the best case, the operating system will notice at runtime and immediately kill the program. In the worst case, the program will read data it is not supposed to. To ensure index is valid, we must manually and explicitly do something such as:

#include <assert.h>

char get_char(int index) {
    assert(index >= 0 && index <= 2); // Kills the program if not true.
    char list[3] = { 24, 75, 3 };
    return list[index];
}
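One caveat with `assert`: the C standard specifies that it expands to nothing when the `NDEBUG` macro is defined, which is common in release builds. As a hedged alternative, the sketch below keeps the check in every build and reports failure through its return value. The name `get_char_checked` and its signature are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical variant of get_char: returns false for an invalid index
 * instead of reading out of bounds, and writes the element to *out on
 * success. The check is plain code, so NDEBUG does not remove it. */
bool get_char_checked(int index, char *out) {
    static const char list[3] = { 24, 75, 3 };
    if (index < 0 || index >= 3) {
        return false; /* Invalid index: the array is never touched. */
    }
    *out = list[index];
    return true;
}
```

The caller must test the returned boolean before using `*out`; forgetting to do so is one more thing C will not catch for you.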

The burden of memory-safety is on the programmers’ shoulders.

A memory-safe language such as Python, Java, Go, or Rust would have injected and executed the check automatically and implicitly, stopping with an error on invalid indices. Keep in mind that this is a trivial example where list’s size is known beforehand, at compilation time. Memory-safe languages also handle cases where the size is only known at runtime. They achieve this with runtime information: basically, they keep the size of list alongside it, in another variable.
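The idea of keeping the size in another variable can be imitated by hand in C. The following is a rough sketch, not how any particular language implements it: `struct char_slice` and `slice_get` are hypothetical names, and the approach only helps if every access goes through the checked function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "fat pointer": the data pointer travels with its length. */
struct char_slice {
    const char *data;
    size_t len;
};

/* Checked access: compares the index against the stored length before
 * reading. Returns false and leaves *out untouched when out of range. */
bool slice_get(struct char_slice s, size_t index, char *out) {
    if (index >= s.len) {
        return false;
    }
    *out = s.data[index];
    return true;
}
```

Because the length is stored at runtime, the same function works for arrays whose size is not known at compilation time. Using an unsigned index also removes the need for a separate negative-index check.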

Memory-safety doesn’t stop at array bounds checking. There are other issues that are usually taken care of using a mix of: virtual machine (Python, Java, WebAssembly), garbage collection (Python, Java, Go), runtime abstractions (modern C++, Rust), and/or compile-time rules and checks (Rust).

In this course, we have decided to make you use Rust. The aim of this programming language is to deliver performance comparable to C, with no garbage collector or virtual machine, while still being memory-safe.

So, the goal of this first practical session is to introduce you to several types of bugs that are often encountered in the C programming language and that can lead to vulnerabilities. To do this, you will be asked to complete several exercises designed to help you detect bugs in C programs and debug them. We’ll also discover how the Rust programming language avoids these pitfalls.

So, let’s write C, but…

Isn’t C a Dead Language?

Well, for better or for worse, C is not dead. The whole world runs on it.

More specifically, let’s check our favorite kernel (or, if not your favorite yet, the one that makes servers, and Android, work):

$ git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ scc linux
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                        36322  25771328  3716880   2853379 19201069    2521084
C Header                 26355  10362694   774845   1551184  8036665      57824
Assembly                  1360    381699    42568     50352   288779       3489
Rust                       338    135822    10993     35002    89827       9261
...
───────────────────────────────────────────────────────────────────────────────
Total                    86869  41257447  5188536   4660857 31408054    2622443
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $1 423 929 234
Estimated Schedule Effort (organic) 217,14 months
Estimated People Required (organic) 582,59

What about the most reliable media player?

$ git clone https://code.videolan.org/videolan/vlc.git
$ scc vlc
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                         1254    595881    85875     65071   444935      83892
C Header                   852    136722    17191     51278    68253       3828
C++                        476    166643    23613     17074   125956      21161
C++ Header                 431     46244     7711     12282    26251        583
Assembly                    20      4850      449       435     3966         90
Rust                        20      3221      344       594     2283        270
...
───────────────────────────────────────────────────────────────────────────────
Total                     4513   1166304   165722    168953   831629     124108
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $31 442 815
Estimated Schedule Effort (organic) 50,99 months
Estimated People Required (organic) 54,79

Let’s not forget its internal library that can read whatever media file you throw at it:

$ git clone https://git.ffmpeg.org/ffmpeg.git
$ scc ffmpeg
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                         3338   1523389   189760    122465  1211164     216203
C Header                  1187    247514    21001     66554   159959       2954
Assembly                   400    182553    14770     13003   154780       1926
...
───────────────────────────────────────────────────────────────────────────────
Total                     5230   2005313   233212    205006  1567095     224570
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $61 156 893
Estimated Schedule Effort (organic) 65,65 months
Estimated People Required (organic) 82,76

Do you know Python?

$ git clone https://github.com/python/cpython.git
$ scc cpython
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Python                    2217   1089091   114654     91597   882840      87172
C Header                   637    356738    31076     18746   306916      20635
C                          485    653964    64822     80599   508543     104795
...
───────────────────────────────────────────────────────────────────────────────
Total                     4955   2904156   367555    197045  2339556     221030
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $93 150 554
Estimated Schedule Effort (organic) 77,04 months
Estimated People Required (organic) 107,42

What about databases?

$ git clone https://git.postgresql.org/git/postgresql.git
$ scc postgresql
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C                         1584   1557272   187314    393892   976066     163484
C Header                   989    200845    18181     65439   117225       2665
...
───────────────────────────────────────────────────────────────────────────────
Total                     4579   2170000   261470    508166  1400364     173853
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $54 343 588
Estimated Schedule Effort (organic) 62,77 months
Estimated People Required (organic) 76,91

Machine learning?

$ git clone https://github.com/pytorch/pytorch
$ scc pytorch
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Python                    4296   2238376   245271    242866  1750239     169393
C Header                  2264    394339    49378     61164   283797      20369
C++                       2152    849234    83222     64656   701356      78927
C                          193     41985     4149      2662    35174       3631
C++ Header                  67     12602     2031      1931     8640        506
Assembly                    34      9603     1420       410     7773         25
...
───────────────────────────────────────────────────────────────────────────────
Total                    11251   3964150   441933    395132  3127085     292013
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $126 325 782
Estimated Schedule Effort (organic) 86,49 months
Estimated People Required (organic) 129,76

And last, but not least, let’s have a look at the 60 GiB source code (don’t try this at home) of the base browser for Chrome, Edge, Brave, Opera, et al. (we don’t forget you either, Samsung Internet):

$ git clone https://chromium.googlesource.com/chromium/src
$ scc src
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
C++                      66418  21007579  2940888   1954168 16112523    1165580
C Header                 58223   6637079  1186021   1454297  3996761      88573
Rust                      5147   2225829   145434    332292  1748103     125898
C                         1551    855738   110201    124513   621024      90471
C++ Header                 195     19990     3714      2128    14148        406
...
───────────────────────────────────────────────────────────────────────────────
Total                   359565  54543287  6579284   5691291 42272712    2085308
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $1 945 173 939
Estimated Schedule Effort (organic) 244,47 months
Estimated People Required (organic) 706,90

A meme about C being everywhere.

Without counting C++, that’s already 49,335,488 lines of C across source and header files, blank lines and comments included (or $3,735,522,805, but please don’t trust that number). And the list goes on and on. Also, this is only free and open source software. The Windows kernel itself is written in C, and the same goes for macOS, iOS, a vending machine (probably), your PS5, etc. Heck, even your car probably makes HTTP requests using C. New C code keeps getting written and old C code keeps getting fixed everywhere. We have to deal with it. That makes C a relevant language for developers.
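For the curious, the line total can be reproduced from the listings above. The sketch below assumes the figure is the sum of the Lines column for the C and C Header rows of the seven projects; the array and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* "Lines" column of the C and C Header rows, copied from the scc
 * outputs above (first value: C, second value: C Header). */
static const uint64_t c_lines[] = {
    25771328, 10362694, /* linux */
    595881,   136722,   /* vlc */
    1523389,  247514,   /* ffmpeg */
    653964,   356738,   /* cpython */
    1557272,  200845,   /* postgresql */
    41985,    394339,   /* pytorch */
    855738,   6637079,  /* chromium */
};

/* Sums every entry of c_lines. */
uint64_t total_c_lines(void) {
    uint64_t total = 0;
    for (size_t i = 0; i < sizeof c_lines / sizeof c_lines[0]; i++) {
        total += c_lines[i];
    }
    return total;
}
```

Under that assumption, `total_c_lines()` returns 49335488, matching the figure quoted in the text.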

Today, C is still one of the most widespread, performant, and close-to-hardware languages that one could use. It is simple to learn but hard to master. It is the perfect way to understand how a computer and its operating system work. And it inspired so many languages that came after it. That makes C a foundational language for computer science.

Finally, it’s easy to “shoot yourself in the foot” using C. Indeed, security is far from built-in because it wasn’t even a concern at the time C was invented (1972). And with all this written-and-running software, many developers actually shot themselves in the foot. That makes C a critical language for cybersecurity. So let’s learn from its past mistakes and try solving them.