My previous post here and some recent posts on misc@openbsd got me thinking again about significantly mitigating buffer overflows and the like by properly separating the parameters and the return addresses into separate stacks.
I posted my thoughts and asked for suggestions. Philip Guenther pointed out that I needed to unpack my thoughts a bit better.
I responded to several of his points, but I was asleep at the keyboard and didn't get my example of a function call using the separated stacks posted in my reply.
I don't want to add to the chatter on list, so I decided to unpack things here.
This is a lot more work than it seems on the surface.
First, typing in Euclid's method for greatest common divisors was faster than looking it up someplace where I've typed it in before:
Then I compiled to assembler. Don't even have to think to do that:
cc -Wall -S gcd.c
Then I modified the internal gcd() routine to use a separate parameter stack.
Here's how to call the gcd() routine:
subl $2*4, %ebp /* allocate parameters */ movl 3*4(%ebp), %eax /* copy n2 for the call to gcd */ movl %eax, 1*4(%ebp) movl 4*4(%ebp), %eax /* copy n1 for the call to gcd */ movl %eax, (%ebp) call gcd /* return address on the flow-of-control stack */ addl $2*4, %ebp /* de-allocate parameters */ /* Recover the return value. */ movl %eax, (%ebp) /* move the result to divisor */-------------------------------
Here's the gcd() hand-compiled to access it's parameters on the parameter stack:
gcd: pushl %ebp /* Frame for unwinding */ subl $4, %ebp /* Locals */ jmp .L2 .L3: movl 4(%ebp), %eax /* numA */ cmpl 8(%ebp), %eax /* numB */ jge .L4 movl 4(%ebp), %eax /* numA */ movl %eax, (%ebp) /* temp */ movl 8(%ebp), %eax /* numB */ movl %eax, 4(%ebp) /* numA */ movl (%ebp), %eax /* temp */ movl %eax, 8(%ebp) /* numB */ .L4: movl 8(%ebp), %eax /* numB */ subl %eax, 4(%ebp) /* numA */ .L2: movl 4(%ebp), %eax /* numA */ cmpl 8(%ebp), %eax /* numB */ jne .L3 movl 4(%ebp), %eax /* numA */ popl %ebp /* Restore previous frame */ ret-------------------------------
(I'm not putting the full source of that up here because I'm lazy. Sorry. This is taking more time than I have.)
That wasn't too bad, but it really didn't answer Philip's objections.
From there, things got time consuming.
Providing shims to the C library functions was easy. I put generalized shims at the top of the assembler file because it was easy, but the generalized shims require loading the call address to %eax before calling the shim. Specific shims would be one instruction shorter and faster on the call, but would (potentially) require more shims, depending on how many calls of each number of parameters occurs.
There are other ways to do the shims, of course, but the shims should go away once the whole OS is compiled with the separated stacks.
Tracking which variables were where, so I could demonstrate the shims by keeping them all on the parameter stack that I allocated, required a lot of grunt work -- hand de-compiling and re-compiling.
And the comments I inserted were for my own benefit, more than for yours. Without them, I could not have tracked the variables on the stack. Thankfully, current gcc makes it easier, allocating the whole rack of local variables at function entry. I should have done the same, but I found it easier to focus on small bits of code at a time.
Note that I initially just used calloc() to allocate space for the parameter stack in a c source file I called gcddummy.c, and then shifted to using mmap, so I could map the stack up high, around 64M below the return address stack.
Placement of the two stacks requires some thinking. Putting the parameter stack above the return address stack will leave you vulnerable to certain kinds of collisions, primarily deep recursion with large local variables.
Putting the parameter stack below the return address stack will leave you vulnerable to buffer overflows more than recursion -- large overflows, that is, in the 64 megabyte range in this example. A 256M gap would be even better, but the theoretical vulnerability remains.
I chose below because it would have been much more work to move the return address stack down.
But there should be a gap of unallocated memory between the stacks, and the illegal access exception that should happen (if you have proper memory management hardware) when a buffer overflow hits the gap should prevent anything more than a denial of service to the attacker.
Recursion may be more difficult to handle, but, if the OS provides automatic extension of the stack, one might hope that it would also provide some way to detect such recursions.
Anyway, ultimately, the OS should support the second stack, so that the compiler and the source code programmer won't have to deal with the allocation, other than compiler switches for the rare case when the default size or placement of the gap is not desired.
64-bit addresses, even if not fully decoded, should help with the collision issues.
This looks pretty simple.
The problem is that no current compiler I know of does it this way. That means I have to write such a compiler myself, or I have to go find the production rules in the source of an existing compiler and see whether I can successfully modify the function call and return and parameter access without breaking the compiler.
Learning (finally) how to use compiler-compilers and lexical analyzers would be the easy part. Not breaking the compiler is the hard part.
Writing a new compiler myself might be easier, considering how familiar I'd have to get with the compiler of choice.
And then I would have to go digging through the library source, looking for all the places that directly access the stack and fixing the code that avoids the return address that isn't there any more. va_arg is just the tip of the iceburg.
It's a big project, but I think it needs to be done. It's just too easy to walk on the return addresses (and do bad things with them) otherwise.
Now it doesn't fix the general problem of using buffer overflows to modify the behavior of programs, it just makes the easiest way to do so significantly harder.
It also opens the C language to some paradigms that are currently blocked, and those paradigms just happen to make it easier to write robust programs. But that is a discussion for a later date.