Kernel Crash Fixed! Lessons Learned

It all started back in January 2021, when I saw Greg’s tweet about a Kernel mentorship program for Spring 2021.

@gregkh

It was a huge opportunity for me, a developer who loves Linux and enjoys working on open source. I had once tried to submit a patch, but it was rejected because I didn’t have the knowledge required for working on the Kernel. So I applied for the program so quickly that I didn’t even check the deadline and the required tasks. Those tasks included contributing to the Kernel, completing code challenges, reproducing crashes, and writing reports about all of these activities. There were a lot of tasks, and the deadline was close.

Among all the required tasks, the most interesting ones for me were those I got from Little Penguin[1]. Another one that caught my attention was studying and reproducing Kernel crashes reported on syzbot. syzbot is a bot that continuously fuzzes Kernel branches and generates crash reports automatically. Such an amazing place to study! Although I was not chosen for the program (simply because I was too busy as a full-time student and developer to make a new commitment), I never stopped following the bugs reported on syzbot.

After some weeks of being busy writing my Master’s thesis, I got some free time to work on a recently reported NULL pointer dereference crash. When I checked the bug, I saw that it contained everything needed to work on it: a crash report with a stack trace, syz and C reproducers, and the config that builds the same Kernel that syzbot had tested.

The first patch I wrote was a horrible mistake. I had not even tested it before I sent it. I blindly assumed that syzbot’s stack trace showed the culprit and that I just needed to add one if statement to ensure the instance was not NULL when it was called. I was too naive. My patch did not resolve the crash. It was tested by other developers[2] and by syzbot, and both showed that the bug still occurred. “I can’t afford another silly mistake on this path”, I told myself.
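To make the idea concrete, here is a minimal userspace sketch of that kind of "one if statement" guard. It is not the patch I sent: the names conn, conn_ops, conn_send, and demo_ops are hypothetical and only stand in for whatever object a stack trace points at.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical types: they stand in for the object named in a stack trace. */
struct conn_ops {
    int (*send)(int value);
};

struct conn {
    const struct conn_ops *ops; /* may be NULL if setup never finished */
};

/* Trivial callback so the guarded call has something to invoke. */
static int double_it(int value)
{
    return value * 2;
}

static const struct conn_ops demo_ops = { .send = double_it };

/*
 * Guarded call: refuse to dereference anything that is NULL.
 * This only hides the symptom at this call site; it says nothing
 * about why the pointer was NULL in the first place.
 */
int conn_send(const struct conn *c, int value)
{
    if (!c || !c->ops || !c->ops->send)
        return -EINVAL;

    return c->ops->send(value);
}
```

A guard like this only helps if it sits on the line that really dereferences the NULL pointer, which is exactly the assumption I had not verified.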

I had made a mistake, but it was not the end of my passion. Instead, I decided to make a good patch for the same crash. It no longer mattered whether the crash had been resolved by others (it had not); what mattered to me was fixing it myself. I decided that next time I would test the patch both locally and with syzbot before sending it. But how could I test a patch? Should I make my own Kernel crash?

Fortunately no, I should not. The solution was to build the Kernel (using the same config from the crash report[3]), boot it in a virtual machine (QEMU, VMware, VirtualBox), run the reproducer, and watch it crash. Although it took me some days to reproduce the crash, the result was good: I could make the Kernel crash using the provided C reproducer.

kernel-crashed!

Now it was time to make the patch. I noticed that the stack trace I got from the local reproducer had one more frame: the guilty line was in another file. It was no surprise that the old patch did not work; it was pointing at the wrong place! After several tries (each involved adding pr_info messages, rebuilding the Kernel, booting it in QEMU, running the reproducer, and checking the Kernel messages), I found and patched the line that caused the crash. The patch was successfully tested both on my local virtual machine and on syzbot.
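The tracing loop looked roughly like the sketch below. It is a userspace illustration, not the code I was debugging: pr_info is redefined here as a stand-in that prints to stderr (in a Kernel build the real macro comes from <linux/printk.h>), and dev, devices, lookup_dev, and read_stats are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in for the Kernel's pr_info(); assumption for this sketch. */
#define pr_info(...) fprintf(stderr, __VA_ARGS__)

/* Hypothetical object whose pointer field may be NULL. */
struct dev {
    int id;
    int *stats;
};

static int stats_value = 7;

static struct dev devices[] = {
    { .id = 0, .stats = &stats_value },
    { .id = 1, .stats = NULL }, /* the suspicious case */
};

/* Return the matching device, or NULL if the id is unknown. */
static struct dev *lookup_dev(int id)
{
    size_t i;

    for (i = 0; i < sizeof(devices) / sizeof(devices[0]); i++) {
        if (devices[i].id == id)
            return &devices[i];
    }
    return NULL;
}

/*
 * Print each pointer right before it is used, so the last message in
 * the log shows which dereference was about to happen.
 */
int read_stats(int id)
{
    struct dev *d = lookup_dev(id);

    pr_info("%s: id=%d dev=%p\n", __func__, id, (void *)d);
    if (!d)
        return -1;

    pr_info("%s: stats=%p\n", __func__, (void *)d->stats);
    if (!d->stats)
        return -1;

    return *d->stats;
}
```

In the sketch the function returns -1 instead of crashing so that it can run safely in userspace; in the Kernel, the last message printed before the oops is what narrows down the guilty line.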

I submitted the patch to Greg. He kindly got back to me soon with some questions about the root cause. I didn’t know the root cause. I had just reproduced the crash and fixed the culprit line, but I had never thought about the root cause. As I write this report, the patch has been rejected, mainly because we need to find the root cause that made the crash happen. We don’t need painkillers; we need to cure the illness.

Although the patch was rejected, I learned valuable lessons along the way.


Footnotes