I find it surreal to have people guessing at my motivations when they could just ask me. On top of all that, I find it weird that people spend so much time with in-fighting. I just want my system not to suck.
Some time ago (during in the Ubuntu Karmic development cycle maybe in September 2009), I started having giant problems with my build system. All I/O would start to stall, wait times would surge, and usually my entire system would just go unresponsive with the disk light on solid. This scared the crap out of me, and it wasn’t entirely obvious what was triggering it. No one else seemed to be seeing it. I managed to start tracking things using “latencytop”, and saw stuff like liferea going crazy. As I eliminated more and more things, I eventually settled on it being a problem with umount, and I reported an Ubuntu bug. It seemed to look like an upstream bug that no one else but the reporter could reproduce either.
Since no one else was seeing this issue, and it seemed related to LVM snapshots, I migrated off of snapshots, and started using aufs overlays for my builds. For a while, it seemed like things went away. It didn’t, and I started hitting it again. I opened a new (now famous) bug in Ubuntu, since now snapshots weren’t in the picture, and I didn’t want to confuse the earlier history. I managed to find a relatively minimal test-case too. A few other people commenting on the bug were seeing the problem now too, but it was less pronounced for them.
As an aside, this wasn’t a “just wait a few seconds longer” kind of issue on my system. A single umount would last 30-40 minutes. And when I’m doing parallel builds of security updates, this would turn into my system being unavailable for hours at a stretch.
Since none of the kernel developers I was in contact with were able to track down the root cause, I asked Ted Ts’o in email if he could just quickly peek in on this for me, since I figured he’d be in a good position to confirm or deny it. I didn’t want to start wasting upstream time with this if it wasn’t reproducible (see earlier upstream kernel bug). To my great relief, Ted found a few minutes to check it, and was able to immediately confirm it and give me a viable work-around (“sync; umount …”) for the time being. I confirmed the work-around, and went off to do other things.
A while later, Ted came back to deliver a bit of a rant, the purpose of which was not clear to me, but I ultimately ignored it — I didn’t seem directed at me. I just wanted my system operating normally, and he’d done me a favor to check in on it and got me a work-around.
More time passes, and I eventually get caught in another I/O-wait melt-down. On investigation, it seemed that the in-kernel work-around in the Ubuntu kernel totally back-fired on me in some cases, rendering even the user-space work-around useless. After investigating the Ubuntu-specific work-around, I re-read Ted’s rant in the course of researching what had happened during this bug’s triage.
It seemed that Ted was basically saying:
– this is an upstream problem
– RedHat hasn’t run into it and he didn’t know why
I figured I should confirm for myself if Fedora was affected, so downloaded and installed Fedora to double-check there. Since I was able to reproduce it there, I opened an upstream bug, linking back to the original Ubuntu bug, and then went to open a bug in the Fedora tracker, linking back to upstream.
And it seems to be these actions that everyone has jumped on. I will now bore you with the reality of my motivations: I wanted to fix the bug so no one would end up experiencing the same pain I’d been through over the last 6 months.
The bug was, from my perspective, a serious issue. Since I’d managed to reproduce it in another distro, it was my duty as a Free Software developer to report it to them. And, in what I felt was an unambiguous gesture, I made sure to include the link to the upstream kernel bug. Reproducing it in Ubuntu, in Fedora, and with a stock kernel had me confident that it was an upstream issue. While Ted did correctly suspect the issue was upstream, I really didn’t want to just open an upstream bug and have it be ignored. I wanted some additional proof of reproduction, which I got when I tested it on Fedora.
So, I’m rather saddened that so many people spent so much time questioning my motivations, making fun of Canonical, or doing anything other than trying to just simply solve this problem. I’m totally disinterested in inter-distro fighting. Instead, I continue to assume we’re all on the same team, fighting a philosophical battle against close-source software. And in that regard, I think it’s still true. If I ignore the rants and jeering, I come away thankful for all the people that spent time trying reproduce the issue at Canonical, at RedHat, and in the larger community. I’m hugely thankful that Ted made some time to let me know I wasn’t crazy, and there was actually a problem. I’m thankful for having some work-arounds, and I’m thankful that the root cause was eventually ferreted out, with some possible solutions. I’m even thankful that some people on the LWN thread saw that, far from malicious, I was trying to be helpful with the bug.
I just wanted my filesystem not to eat my computer. And I was hoping other people could maybe help me, since I’m not a filesystem expert. The drama around this bug is pathetic, and now by talking about it for almost 1000 words, I’m just as guilty.
© 2010, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.