Preserving sparsity in bounded difference constraints

June 7th, 2009

I discovered a neat technique in program analysis today, so I’ll try to explain it here. The first section gives a general overview of inferring integer properties of programs, and the second part describes what I found.

Numerical Domains and Difference Constraints

Inferring numerical properties of programs is a pretty important problem in program analysis. Here’s is an example.

if (x > 0) { y = x; } else { y = 10; }

We can infer from that after this program executes, y > 0. Why? Because in the true branch, y is assigned x, which we know to be positive. In the false branch, y is assigned 10, which is also positive.

There are a variety of techniques to infer properties like this. The most common ones are based on a technique called abstract interpretation. The most well-known technique of all is called the polyhedron domain, discovered by Cousot and Halbwachs. It is quite complex to implement, but it is guaranteed to infer all possible linear inequalities satisfied by a program. Unfortunately, it’s also very slow.

A more efficient and simpler technique goes by the name of “difference constraints” or “bounded difference matrices” or the like. In this post I will call it DC. DC can infer any invariant of the form x – y <= c or x <=c or x >= c. Most of the invariants that programmers rely on are of this form, so DC is kind of a sweet spot domain. (Note that constraints like x == y can be converted to the conjunction of x -y <= 0 and y – x <= 0.)

The domain works by putting all constraints into the form x – y <= c. Constraints of the form x <= c or x >= c become x – 0 <= c or 0 – x <= -c (i.e., zero is just treated as a variable). Then it saturates the constraint system by generating all constraints that follow from the known ones by linear combination. For example, if you know that x – y <= 5 and that y – z <= -3, then you can infer that x – z <= 2. The algorithm for saturation is pretty simple: you express the constraints as a weighted directed graph (where the variables are nodes and the constraints are edges) and then use the Floyd-Warshall all-pairs shortest path algorithm.

Once the constraint system has been saturated, you can perform all sorts of operations. First, it’s easy to check if the constraints are satisfiable by looking for a constraint of the form x – x <= c, where c < 0. If you want to see if one set of constraints C1 implies another set C2, you just take every constraint x – y <= c in C2 and make sure that C1 has a constraint of the form x – y <= c’, where c’ <= c.

Probably the most common operation is assignment. If you know that some set of constraints C is true before an assignment, then you want to know the set C’ that holds after the assignment. Assume the assignment is x := e. First we add a new constraint “tmp == e” to the system (using the translation of equality above) and then we saturate the system. Then we eliminate any constraints that talk about x from the system, and rename tmp to x.

Another important operation is the join. This operation takes the constraints generated by two branches of a conditional statement (call them C1 and C2) and tries to find the best set of constraints C that abstracts both branches. For this to be true, the constraints in C1 must imply C and the constraints in C2 must also imply C. To generate C, we take all pairs of variables x and y and look for a constraint of the form x – y <= c1 in C1 and x – y <= c2 in C2. Then we add x – y <= max(c1, c2) to C.

Sparsity

The problem with DC is that saturation is expensive: it’s an O(n^3) algorithm. My goal was to fix this problem, at least in the common case. In the common case, the initial set of constraints is sparse: when you form the directed graph of constraints, there aren’t very many edges. The problem is that saturation adds a lot more edges, and those edges make later operations much more expensive.

In my experience with difference constraints, most of the extra edges added by saturation are caused by the zero node (remember that we treat zero as a variable). As an example, consider these constraints: 0 <= x <= 10, 5 <= y <= 20, 10 <= z <= 30. If we draw the graph for this system, there will be edges between x and the zero node, between y and the zero node, and between z and the zero node. But once we saturate, there will be edges between all pairs of variables saying things like y – z <= 10 and z – y <= 25.

It turns out that for some operations like assignment and checking satisfiability, there’s no need to saturate across the zero node. That is, we allow saturation to take the linear combination of x – y <= c1 and y – z <= c2 as long as y is not the zero node. It’s pretty easy to modify the Floyd-Warshall algorithm to do this. And when you do, the edges that were added in the example in the last paragraph will no longer be added, so everything will be faster and we won’t have lost any precision.

Things are a bit more complicated for join. We can still do the same trick of restricting saturation, but now we will actually lose the ability to infer some invariants. Consider the case where we know x = 1 and y = 2 on one branch and x = 5 and y = 6 on the other. We would like to infer y – x = 1. With normal saturation we will, but not with restricted saturation.

Here’s the solution. We use restricted saturation to get a joined result C from the inputs C1 and C2. Then we iterate over all pairs of variables x and y. We find the shortest path from x to y in C1 by looking for an edge directly from x to y as well as edges from x to the zero node and zero to y. Call this distance c1. Do the same in C2 to get distance c2. Now do the same in C to get c. If max(c1, c2) < c, then we simply add an edge in C directly from x to y with distance max(c1, c2). Otherwise we do nothing.

This technique achieves the same precision as using full saturation, but it adds many fewer edges in practice. My hope is that will be much faster as well, although I haven’t tested it much yet.

Here’s one way to think of this technique. Saturation produces the transitive closure of our constraint graph. However, after we perform operations like join, we’d like to take the transitive reduction so that they have a minimal representation. Unfortunately, the transitive reduction is expensive to compute and isn’t even unique for our graphs. But since we know that zero is the crucial node in most of our graphs, we can compute a sort of transitive reduction, except that the result is only reduced at the zero node.

Kernel of Truth

March 27th, 2009

One topic that has always interested me is performance analysis. In particular, it always frustrates me when I can’t get consistent timings for a piece of code. No matter how much work I do, the numbers are always all over the map. My friend Dave had an interesting blog post related to this subject. He showed that the running times for typical benchmarks do not fall into a Gaussian distribution as we typically expect. He estimated the actual distribution by doing lots of runs and then used it to compute things like confidence intervals, which was cool.

However, I’m a perfectionist and I’d like to know exactly how long my code really takes under ideal conditions. Unfortunately, there are many sources of non-determinism nowadays. First, the CPU executes instructions out of order, so the time your code takes may depend on what instructions were in the pipeline before. Cache effects and swapping are huge sources of non-determinism if they occur. Laptops may throttle the clock to save power. The OS may schedule another process in the middle of your timing loop. And finally, an interrupt may arrive and ruin everything.

Some of these issues are not too hard to address. Pipeline non-determinism isn’t a big deal, since it usually only causes the timinings to change by a few cycles. You can mitigate cache effects by warming the cache before doing the actual timing. Your laptop can be instructed to run in “performance mode” so the clock speed never changes. And you can run your benchmark at real-time priority so that the scheduler never steals a slice from you.

But even after doing all these steps I still see timing variations in the millisecond range. Given the speed of modern CPUs, that’s a huge amount of time. By a process of elimination, it would seem that interrupts are a major source of latency.

To verify if the problem really is with interrupts, I wrote a kernel module that disables interrupts and then runs my benchmark. It’s a very simple benchmark that runs entirely in cache and does nothing but additions and comparisons. You can find it here. (Instructions for compiling your own Linux kernel module can be found in Chapter 2 of Linux Device Drivers.) The next few paragraphs assume you’ve read the code.

For the benchmark to work properly, you have to have a single core with no hyperthreading (a laptop typically works). To disable interrupts, the code acquires a spinlock. It uses the rdtsc instruction to get a cycle-accurate timestamp. The cpuid instruction is used to serialize execution, as described by Intel.

I observed a few things. First, I was able to get a very accurate measurement for the latency of the CPUID/RDTSC instruction pair. On my machine, it always takes 166 cycles. The benchmark loop itself is a little more variable. It usually takes 2,000,014,244 cycles (about 1.07 seconds), but on rare occasions (1 in 10 times or so) it varies by 10 cycles or so. Being mostly ignorant of computer architecture, I have to assume that this small bit of non-determinism is caused by pipelining, or branch prediction, or something else that I can’t control.

Even though the benchmark doesn’t use the memory at all, warming the loop turns out to be important. Without this step, it takes 2,000,025,251 cycles in the common case, and it varies by as much as 150 cycles. I guess this is due to pipelining.

Disabling interrupts has a much greater effect. If I comment out the spinlock, then the running time is considerably longer. I observed differences of nearly 4 milliseconds.

In the future, I would like to investigate a bit more. First, what effect does using a real-time kernel have? Hopefully the interrupt latency would go down significantly. Second, I would like to track down which interrupts are actually causing the problem. Finally, I would like to learn enough about computer architecture to understand the sources of variation in the interrupt-free cases above. (As an aside, while working on this today I found a neat web page about the x86 architecture.)

Finally, I’d like to say that I realize this experiment was completely useless for doing actual benchmarks. No sane person would ever run their benchmark suite in a kernel module. My real goal was to understand the sources of non-determinism in timing experiments.

Engineering Software Engineers

March 20th, 2009

I was talking to my advisor yesterday about the software engineering class that he’s teaching. It seems like no one really knows what to teach in these classes lately. Methodologies like the waterfall model have gone out of fashion, and academia isn’t quite ready to embrace newer ideas like extreme programming, patterns, and all that agile development business. It’s hard to blame the professors, given how little evidence there is that any of these techniques is actually beneficial.

I’m actually not all that interested in what gets taught in introductory software engineering classes. Most students probably learn far more from the project than from the lectures. But I am interested in improving my own programming abilities. And unfortunately there aren’t any classes for advanced software engineers. So here is my question: what sort of material should go in such a class? That is, what would you try to teach someone who has already been programming for five years or so? Or to put it yet another way, how do you turn good programmers into great ones?

I’m not really thinking of a lecture-style class, but more of a web resource. In some ways, I think that a site like stackoverflow.com follows this idea. But it doesn’t really have the format that I’m interested in. I think something like a Wiki would be more valuable. But more likely than not it would have to be moderated, perhaps by the community; the people with the best ideas are rarely the loudest.

Since I’m too lazy to create such a site, here are some topics that I’ve found valuable in the past. I think they’d be interesting material for such a course.

  • Language and syntax matter. Sometimes inventing your own little language can really boost productivity. Many many people have written about this before, so I won’t belabor the point.
  • Tools also matter a lot. If you’re writing C code, Valgrind will change your life.
  • Over-design is a far more pernicious evil than under-design in the long run. This is a lesson that I unfortunately have learned over and over. As such, I still don’t know how to teach it.
  • A few years ago, my friend Chet introduced me to some great papers from the ’80s on system design. They are case studies of systems of that day. Mostly they’re written as interviews with the designers and they detail what worked and what didn’t work. Although they’re old, the material is still surprisingly relevant. All of them were written by David Gifford and Alfred Spector. A list of them appears at the bottom of Spector’s web page. Unfortunately, you need ACM access to read them.
  • A blog post by my friend Dave perfectly illustrates another topic: the importance of logging, tracing, and visualization.

I’d like to elaborate on the last point in the rest of this post. Dave created a tool that gives you a nice picture of how Mozilla’s TraceMonkey JavaScript JIT spends its time. It makes it much easier to quickly get a sense of what’s going on and where the performance problems are.

A more advanced version of this idea appears in IBM’s TuningFork project. Originally, TuningFork was used to visualize the performance of the Metronome real-time garbage collector. I’ve worked with it in this context as an intern, and it’s incredibly valuable. However, TF is now freely available and you can use it for all sorts of projects.

Tools like TuningFork are valuable whenever your software is doing lots of complicated stuff very quickly. Without a tool, the only way you know whether your software is correct is by ensuring that the final result is correct. With TF, you can watch your program’s execution and make sure that it’s doing all the right steps. And since you can track how long these steps take, you can debug performance problems.

Some people argue that printf is the only kind of debugging that they need. Although it’s extreme, this is not an entirely invalid point. It’s interesting to examine how a tool like TF improves on printf. The first and most obvious way is that graphs are usually easier for your brain to process than text. But this reasoning isn’t always applicable, since sometimes text is the most effective way to represent data. In those cases, I think that there’s one additional advantage that custom tracing tools have over printf: they can summarize the data and report it in different ways depending on what the user is interested in at the time. You can do this with printf, too, but it requires changing your printf code each time you want to view the data in a different way, and that gets unwieldy very quickly.

In my own work, I’ve found that a very simple form of summarization is surprisingly effective. I write all my trace data out in a tree-structured form. Then my trace viewer allows me to hide and show subtrees. (It also permits searching, but you can do that with printf too.) The ability to inspect a particular subtree and ignore the rest is pretty useful. Rather than wading through a giant log file, you can selectively view only the sections that matter. I’ve also added features so that I can highlight a particular part of the trace and immediately re-run it inside the debugger, allowing me to see stack traces and inspect data.

I’d like to end the post with a link to a slightly different tool. It’s from a company called Azul Systems, and they call it Real-Time Performance Monitor (RTPM). Unfortunately, you have to sign a contract with Azul for like $1,000,000 just to be able to use the tool. But they have a video that shows how the tool is used. Simply watching it was a learning experience for me, since I realized how effective performance monitoring tools can be. This one is the best I’ve ever seen. Although the video is long, I promise you’ll be amazed at what they can do.

On …

March 15th, 2009

For the last few weeks I’ve been wondering about how exactly the compiler implements varargs functions. Tonight, I finally got a chance to sit down and take a look at the code it generates. But first let’s look at how varargs works in theory.

To write a varargs function, you include <stdarg.h>. Typically, you’ll have a non-varying argument that will describe the number of and types of the rest of the arguments (the format string for printf, say). Then, for each variable argument, you call va_arg(T), where T is the type of the argument (va_arg is a macro that expands to a compiler intrinsic, so it can do funny things like take a type as an argument). The compiler keeps a pointer to the argument on the stack that you’re currently dealing with. Each time you call va_arg, that pointer moves down according to the size of the type that you specify. This works because the arguments all appear in sequence on the stack. Seems pretty straightforward.

To understand the source of my confusion, consider the call “printf(”%c%c”, 97, 97)”. This will print the letter ‘a’ twice. Now, the compiler doesn’t know anything about printf–it just sees one fixed argument and two varying arguments. It doesn’t know that the integer 97 is a char and not an int or a long long. This is an issue because if 97 is passed as an int while printf interprets it as a char, then the pointer into the stack that va_arg keeps will get screwed up. Why doesn’t this happen in practice?

It turns out that gcc uses an argument passing convention that manages to avoid this problem most of the time, but not always. All integral types (except long long) are passed on the stack as 4-byte integers. And printf interprets them all as 4-byte integers. Even though you write %c, printf calls va_arg(int) for that argument. It does a similar thing for floating-point numbers: they’re all promoted to doubles. And %f is actually calling va_arg(double).

This still leaves a few corner cases. First, what happens if you call printf(”%lld %lld”, 12, 12)? Note that although gcc could be conservative and promote all integral types to long long, it doesn’t go quite that far, presumably for efficiency. As a result, the printf call won’t work–it prints garbage. Luckily, gcc prints a warning because it has special knowledge of printf-style format strings and it knows that you really should have written printf(”%lld %lld”, 12LL, 12LL). Unfortunately, it can’t handle this problem if you use your own kind of format string (or some other kind of calling convention all together).

The other corner case is what happens when you call va_arg(char) or va_arg(short) or va_arg(double). Once again, the gcc developers provide a helpful warning:

test.c: In function ‘f’:
test.c:7: warning: ‘short int’ is promoted to ‘int’ when passed through ‘...’
test.c:7: note: (so you should pass ‘int’ not ‘short int’ to ‘va_arg’)
test.c:7: note: if this code is reached, the program will abort

And indeed, this code generates an illegal instruction error if you reach it.

What does this mean in practice? If you write your own function, then either declare it to take printf-style format strings or else don’t ever try passing a long long to it. Otherwise you’re setting yourself up for weird bugs.

GCC Stack Protection

March 4th, 2009

I did some more playing around with assembly code today, and I found some interesting stuff.

First, I discovered that GCC will use a jump table for switch statements even when no optimization is enabled. The choice is based purely on the number of cases and how close together the discriminator values are. This is annoying for my decompilation work, since jump tables require a pointer analysis just to figure out the control flow. Luckily you can tell GCC not to use jump tables with the -fno-jump-tables command lines switch. (As an aside, GCC’s documentation on code generation was a very interesting read.)

Second, I found out that GCC uses stack smashing protection by default (at least in Ubuntu). You can turn off this feature with the option -fno-stack-protector, which is what I do when decompiling. Wikipedia has more information about stack protection and about GCC’s own stack protector, called ProPolice.

How does stack protection work? It seems to be pretty simple. When the program is started, GCC chooses a canary value, probably by getting it from /dev/urandom. This value is stored in a special location: 0×14 bytes from the start of the GS segment. I’m not sure where the value actually gets initialized. I looked around in libc and libgcc for a while but I didn’t find anything.

If a function uses a stack-allocated buffer, GCC enables stack protection. It adds the following code to the function prologue. (I’m showing this code using Intel syntax.)

 804a654:       65 a1 14 00 00 00       mov    eax,gs:0x14
 804a65a:       89 45 ec                mov    DWORD PTR [ebp-0x4],eax
 804a65d:       31 c0                   xor    eax,eax

First it loads the canary value into eax. Then it copies eax to a location on the stack which is after the stack-allocated buffer. Finally it zeroes-out eax, perhaps so that it’s harder for an attacker to read the canary.

The hope is that if an attacker manages to overflow one of the buffers on the stack, he will overwrite not only the return address (which he intended) but also the canary value. As long as he can’t guess the canary value, it’s very likely that he will write a bogus value there. So we can detect stack corruption by checking if the canary has changed during function execution.

The canary check is done right before returning (at which time the possibly malicious return address is used). GCC generates these instructions.

 804aec0:       8b 45 fc                mov    eax,DWORD PTR [ebp-0x4]
 804aec3:       65 33 05 14 00 00 00    xor    eax,DWORD PTR gs:0x14
 804aeca:       74 05                   je     804aed1
 804aecc:       e8 07 e6 ff ff          call   80494d8 <__stack_chk_fail@plt>
 804aed1:       c9                      leave
 804aed2:       c3                      ret

First, it copies the canary into eax. Then it XORs eax with the original canary value. If the canary has not changed, the two values should be the same and eax will be zero. In that case, we jump directly to the return instructions. However, if the canary has changed from its original value, then we call the function __stack_chk_fail, which is part of glibc. This function does its best to print a stack trace and then it aborts the program.

It’s interesting to see projects like this. Many people debate about how much of a performance penalty programmer’s will accept for greater safety. In this case, the GCC developers have made a choice in favor of safety. Stack protection is on by default even at the highest optimization level. There doesn’t seem to be have much of a stir about this, since most people probably don’t even know the checking is being done (I certainly didn’t). Perhaps the rationale is that performance-critical code shouldn’t be doing lots of calls-and-returns anyway, so the cost of stack protection should be minimal.

Why Does GCC LEA EIZ?

February 27th, 2009

I’ll start this blog with something mundane to avoid arousing any unwarranted excitement.

I’ve spent some time recently looking at the ELF binary format for Linux executables, as well as the DWARF format used for storing debugging information. In my gut, I feel like there’s a lot of interesting information to be mined here, particularly for program analysis. Some possible ideas:

  • Write an infrastructure similar to CIL, but operating on binaries instead of C code. In the future I’ll write more about this topic. There are several advantages of this approach. First, you’re seeing the actual code that the machine executes, instead of the CIL parser’s own interpretation. Second, you don’t have to go to the trouble of inserting any tools in the build process; you just need to ensure that the program is compiled with “-g”. This is especially important for bigger programs that use lots of libraries. Third, it’s easier to support other languages, particularly C++.
  • Write a program understanding tool to look up definitions and uses of symbols, fields, macros, etc. DWARF includes all the necessary information. Eventually such a tool could be integrated with emacs like ctags currently is. I pretty much want something like cscope, but without having to worry about the fidelity of cscope’s “fuzzy parser.”
  • Generate stack maps for precise garbage collection. Currently the only GC that works with GCC by default is Boehm’s, which is conservative and not particularly snappy in my experience. Accurate garbage collection requires a map describing all the locations on the stack containing pointers. GCC does not generate such maps, and modifying it to do so would probably be hard. LLVM claims that they support accurate collection, but their GC page suggests that they do so via a shadow stack, which is no better than what GCC offers. Instead, I would like to use DWARF information to figure out where the pointers are. I’ve done a few tests at -O3 to see how much information GCC actually preserves. As far as I can tell, it gives you everything you need, even after optimizations like function inlining and scalar replacement of aggregates, which you might expect to be problematic.

Perhaps none of these ideas will pan out, but I think it’s an interesting area to explore. There’s a lot of research on binary analysis and instrumentation in academia (mostly for security and optimization), but mostly the code seems to be proprietary. I’ve already developed some simple tools in this direction. I plan to release them when they mature a little. I’ll try to post on this tomorrow.

Now to the topic at hand. I was disassembling some code the other day (just using objdump, nothing fancy) and I noticed a weird pattern emitted by GCC.

billm@plums:~$ objdump -d /bin/ls | less
...
8049ba9:       8d bc 27 00 00 00 00    lea    0x0(%edi,%eiz,1),%edi
...

This pattern appears all over the place (though sometimes with other registers subsituted for edi). Mostly I’ve noticed it in between functions (after one returns and before the next begins) but it also appears within function bodies.

What does this instruction do? The weirdo syntax makes it particularly difficult to determine. Apparently %eiz is a pseudo-register that just evaluates to zero at all times (like r0 on MIPS). I decoded the same instruction using an older version of objdump, and it instead gave the more normal result “lea 0×0(%edi),%edi”. I think the binutils people must pull a lot of all-nighters to make their crazy AT&T syntax even more inscrutable with each new version. Why can’t we just use Intel syntax? It seems way more intuitive to me.

Returning to the problem at hand, the instruction “lea 0×0(%edi),%edi” still doesn’t make much sense. Why would you ever want to load the effective address of [edi] into edi? This is semantically equivalent to “mov edi, edi”, or, more simply, to a NOP.

I eventually found a mailing list post by binutils guru Ian Lance Taylor that reveals the answer. Sometimes GCC inserts NOP instructions into the code stream to ensure proper alignment and stuff like that. The NOP instruction takes one byte, so you would think that you could just add as many as needed. But according to Ian Lance Taylor, it’s faster for the chip to execute one long instruction than many short instructions. So rather than inserting seven NOP instructions, they instead use one bizarro LEA, which uses up seven bytes and is semantically equivalent to a NOP.

I must admit to being a bit confused still. Most of the time, these semantic NOPs are not executed so you might as well use regular NOPs. Occasionally, the LEAs do occur in places where they might be executed, but in those cases I don’t always understand why a NOP is needed. Sometimes the next instruction after the NOP is a branch target, and so I can understand that the compiler might want the CPU to start loading those instructions on a cache line boundary if possible. But sometimes the instruction after the NOP is not a branch target. In that case, why is there a random NOP in the instruction stream? It just seems wasteful of cache. Perhaps someone can enlighten me.

Anyway, wasn’t that interesting?

Statement of Purpose

February 26th, 2009

I have a lot of ideas about programming and program analysis that typically end up in the bit bucket. Many of them are likely to be bad, but I rarely get time to explore them as fully as I would like. It seems a shame that they should go to waste.

I also write a lot of code that never makes it off my hard drive. This also seems shameful, particularly in this age of distributed version management and publicly hosted repositories.

Hence, the purpose of this blog is two-fold: to act as a forum for interesting ideas that would otherwise languish, and to link to projects that are unfinished but that may still be of interest to people.

Undoubtedly, this blog will end up as a sad testament to my own fecklessness: one of those sad little sites with a few posts from years ago, serving mostly as evidence of the author’s pathetically unrealistic aspirations. Nevertheless, the existentialists tell us that we must press on despite the pointlessness of our endeavor. I guess I should feel lucky, since writing a blog is a lot easier than fighting in the resistance or whatever it is Sartre supposedly did.