The Tragedy of Humans Programming

I've been working on Ocaml bindings for kqueue in FreeBSD lately (found here). It's nothing all that complicated, just a struct that needs wrapping and then calling through to a C library. The Ctypes library worked like a charm and made the whole thing quite pleasant.

While playing with the binding I managed to cause kevent to error out (EINTR) and my little test program stopped working. "That's odd" I thought. I ran a ktrace on it and noticed it was in an infinite loop, calling kqueue with a zero length event list.

The way kqueue works is that it takes two lists of kevent structs. One defines what changes to listen to and the other can be filled in with any events that have triggered. If both are empty and the timeout is empty, kevent returns immediately. In my program's case, I had set the change's in the beginning and I was just calling it with a list of events to be filled in, so the fact that ktrace told me I was calling it with an empty event list made no sense.

Luckily, the kqueue bindings are pretty small so I poked around the code and immediately found the bug. The event list data type I have has a capacity and a size. The capacity is how much memory is allocated and the size is how much of that capacity is being used. The way the kqueue bindings are supposed to work is it takes the capacity of the passed in event list and sends it to the underlying C function. The kevent syscall returns the number of events that were triggered and sets that to the size of the event list. On error, it sets the size of the event list to zero, so a user doesn't accidentally try to use values that were never set. With the size of the memory region set to zero, the next call to kevent passes in the event list, but it passes a zero-length to the C call, which returns immediately because it has no space to put anything.

Well, that is a silly bug. I just had to fix the code to replace accessing the size for accessing the capacity. Now on the next call the capacity will be given to the kevent call and everything works. Great.

What is so frustrating about this is that I thought very hard about this. In fact, I even have a very long comment describing this. Look at what it says:

In the call, the size of changelist is used as input, and the capacity of eventlist is used as input. In output, if the call to kevent was successful, the size of eventlist will be set to the return value of kevent.

Not only that, I even had it in a README file and moved it into a comment and rewrote it! What is this malarkey?!

And that's the problem with us. Even though I thought heavily about this, and even remember experimenting with the idea of having the user set the size every call or just use the capacity and decided using the capacity was right, and then writing documentation about this, I still didn't write the code correctly.

What would have been the outcome if this made it to production? Well, it's a C binding, and the C code is accessing the raw memory, so it would have been a really unfortunate experience trying to debug memory errors in, what should be, a memory safe language.

Is there a lesson here? I don't know. I'm choosing to interpret this as a sign to write less code. If it's so easy to mess up such a small thing, what hope does one have in a larger project? This was a personal project, so very little code review, but hopefully that would have caught it in a larger project. There are a lot of tools out there to help us humans reduce the amount of bugs, but in the end we're at the mercy of our own fallibility.

Author: orbitz
Updated: 2016-06-25 Sat 16:58