The Tragedy of Humans Programming
I've been working on Ocaml bindings for kqueue in FreeBSD lately (found here).
It's nothing all that complicated, just a struct that needs wrapping and then
calling through to a C library. The Ctypes library worked like a charm and made
the whole thing quite pleasant.
While playing with the binding I managed to cause kevent to error out
(EINTR) and my little test program stopped working. "That's odd" I thought.
I ran a ktrace on it and noticed it was in an infinite loop, calling kqueue
with a zero length event list.
The way kqueue works is that it takes two lists of kevent structs. One
defines what changes to listen to and the other can be filled in with any events
that have triggered. If both are empty and the timeout is empty, kevent
returns immediately. In my program's case, I had set the change's in the
beginning and I was just calling it with a list of events to be filled in, so
the fact that ktrace told me I was calling it with an empty event list made no
sense.
Luckily, the kqueue bindings are pretty small so I poked around the code and
immediately found the bug. The event list data type I have has a capacity and a
size. The capacity is how much memory is allocated and the size is how much of
that capacity is being used. The way the kqueue bindings are supposed to work
is it takes the capacity of the passed in event list and sends it to the
underlying C function. The kevent syscall returns the number of events that
were triggered and sets that to the size of the event list. On error, it sets
the size of the event list to zero, so a user doesn't accidentally try to use
values that were never set. With the size of the memory region set to zero, the
next call to kevent passes in the event list, but it passes a zero-length to
the C call, which returns immediately because it has no space to put anything.
Well, that is a silly bug. I just had to fix the code to replace accessing the
size for accessing the capacity. Now on the next call the capacity will be
given to the kevent call and everything works. Great.
What is so frustrating about this is that I thought very hard about this. In fact, I even have a very long comment describing this. Look at what it says:
In the call, the
sizeofchangelistis used as input, and thecapacityofeventlistis used as input. In output, if the call tokeventwas successful, the size ofeventlistwill be set to the return value ofkevent.
Not only that, I even had it in a README file and moved it into a comment and rewrote it! What is this malarkey?!
And that's the problem with us. Even though I thought heavily about this, and even remember experimenting with the idea of having the user set the size every call or just use the capacity and decided using the capacity was right, and then writing documentation about this, I still didn't write the code correctly.
What would have been the outcome if this made it to production? Well, it's a C binding, and the C code is accessing the raw memory, so it would have been a really unfortunate experience trying to debug memory errors in, what should be, a memory safe language.
Is there a lesson here? I don't know. I'm choosing to interpret this as a sign to write less code. If it's so easy to mess up such a small thing, what hope does one have in a larger project? This was a personal project, so very little code review, but hopefully that would have caught it in a larger project. There are a lot of tools out there to help us humans reduce the amount of bugs, but in the end we're at the mercy of our own fallibility.
