1

I wanted to use sharable plugins for an application. However, the plugins were extremely time critical, so I wanted to distribute them as C libraries. Now, these are not guaranteed to be safe (the programmer could user system calls or cause segfaults).

Then I had this idea: The application puts its source code through a parser which defines what is allowed - not by making restrictions (subtractive), but by enabling features(additive) - then outputs the code, and finally, this output is compiled using an ordinary compiler. How safe would this be? (Compared to using no parser in between, or using a scripting language, like e.g. python)

Johannes
  • 165
  • 3
  • 1
    The easiest way is just to make all plugins submit their source code for review before you accept them, though for the sake of sanity I suggest you look at some scripting languages like Lua or Perl. Modern scripting languages can be very fast thanks to the amount of research that has been put into JIT compilation techniques. – Pharap Apr 15 '14 at 16:14

2 Answers2

11

In order to make C code "safe", even against a malicious developer, then you have to fix the core issues of C, which have all been repeatedly exploited for arbitrary code execution (exactly what you want to avoid):

  • Weak types. In C, you can take a bunch of bytes, and interpret them as a pointer, an integer, or whatever. All it takes is a cast. As long as such casts are possible, then the language won't be safe. You need to enforce strict unescapable types.

  • Unchecked array accesses. C gives you pointers and from these you can read and write arbitrary bytes in RAM, with no control. From these follow buffer overflows, which can lead to security holes. In your new language, you must enforce strict array bound checks.

  • Manual memory management. In C you dynamically allocate memory with malloc(), and then release it with free(). Nothing prevents you from trying to access a memory block after having freed it, or freeing it twice (these are the classic use-after-free and double-free, both leading to serious consequences). Lack of free() can also imply memory leaks, which can be bothersome. In your new language, you must do something to prevent these memory management issues, and that is called automatic memory management, which is most often done with a garbage collector.

Down this road you get Java. Or C#, if you prefer. This works, but it is a considerable development effort. Moreover, the systematic checks on array bounds can imply a non-negligible overhead. For purely computational tasks with no I/O bottleneck (no disk access, no network access, and all data fits in the CPU L1 cache), I typically get a factor of 2 to 4 between optimized C and optimized Java (tested on numerous cryptographic algorithms such as hash functions; also on zlib-compatible compression). For some algorithms the slowdown was smaller (Java code was achieving up to 70% of the speed of C code), but that's rare. For the specific task of computations with big integers, Java is more like 6 times slower than C+assembly, because of the lack of support for the 64x64->128 multiplication opcode.

There are two main roads out of this (assuming that the overhead of a Java-like language is not acceptable to you). The first one is to try to use formal verification. The concept is that the code will be executed only if it can be proven that it "behaves properly". In essence, Java is doing a weak form of verification, in that the JVM proves that a piece of code does not do anything wrong like using a reference with a wrong type; but that verification relies on systematic checks, e.g. on casts and array accesses, and also the presence of the GC. Formal verification is an active research area, whose scope extends beyond security: basically, this is about proving correctness, i.e. absence of any bug.

There are good mathematical reasons why a generic verifier which works on all possible programs cannot exist. Researcher still hope to improve things in several directions:

  • There are programs for which proofs are feasible. For instance, a program with only if but no loop, and no recursive call, cannot loop "forever": that's an example of an easy proof of correctness for programs which follow some specific rules. Researchers try to define more useful sets of rules which still allow for proofs.

  • The human developer may provide proof elements. There are constructions for which no algorithm is known to generate the proof, but a developer-provided proof can still be verified. Assertions fall in that category.

  • We may find optimizations: starting from a generic "safe" language like Java, an automatic prover might find "occasions", e.g. code chunks where it can be proven that some array accesses are necessarily fine, thus allowing for removal of the corresponding checks. In fact, modern JVM apply such strategies, which is why they are not as slow as could be feared. Many researchers believe that there still is room for improvements.

However, right now, don't hope for too much: existing top-of-the-line Java and .NET virtual machines are about the best that can be done for now.

The other method is to surrender the CPU to the potentially hostile code, and instead move the security perimeter at an upper layer. The "plugin" is executed in an address space of its own; the MMU blocks unwanted accesses. The plugin code can do all the memory accesses that it wishes: it will see only its own code and data. If the code tries system calls, then the kernel knows not to honour them.

Among incarnations of this method are FreeBSD's jails and Linux's seccomp. The same concept at a still higher level is virtualization with hypervisors: the guest can be a full operating systems, with access to disks, network... and yet all this hardware is virtual and the guest cannot escape from the closely guarded perimeter set by the hypervisor.

The main problem with that kind of isolation is overhead for data exchanges. If your "plugin" must consume a lot of data from the application into which it is plugged, or if it produces a lot of data for that application (typical example: video codecs), then the moving of data between the two worlds, beyond the MMU/hypervisor barrier, can kill performance.

Tom Leek
  • 170,038
  • 29
  • 342
  • 480
  • So if my parser does not provide functionality for the first 3 issues you mention, the resulting code "should" be save? – Johannes Apr 12 '14 at 22:37
  • 2
    @Johannes, yes...but that's a bit like asking if it's okay to have a bucket without the bottom. – Paul Draper Apr 12 '14 at 22:44
  • @Johannes When your parser does not provide functionality for the first three issues, there wouldn't be much left of the features which make C as fast as it is. The main reason why C is so fast is because it trades security for speed. – Philipp Apr 12 '14 at 23:48
3

Safer than not using a parser in that it will stop casual attackers, but not as safe as using a sandboxed scripting language.

The big problem you'll run into is that some fairly low-level features of the C language, such as pointer arithmetic, can be used in nefarious ways. For example, you might forbid the use of open(), but a plugin could find a point in your program's code that uses it and hijack it to get access to the filesystem -- and the only effective* way to prevent this at the source level is to forbid the use of pointers and arrays. Or consider the first winner of the IOCCC: it looks like it's defining an array of integers, but it's actually creating a block of assembly-language code that the compiler then uses as the body of the program.

*I'm assuming you haven't solved the halting problem.

Mark
  • 34,513
  • 9
  • 86
  • 135