Hello,
Today, I spent 6 hours hammering through the toughest bug that I have ever solved. Ever. Period. The leads were all dead ends, and there was absolutely nothing remaining except for a call stack reading 00000000().
The Output window wasn't much better with its:
First-chance exception at 0x00000000 in TheGame.exe: 0xC0000005: Access violation reading location 0x00000000.
The app was big and messy, and the error messages were worthless. Since the application was haphazardly multithreaded, there wasn't even one logical path that I could follow along until the app crashed. It only crashed if the mouse was created in exclusive mode on DirectInput8, and worked perfectly if the mouse was created in non-exclusive mode.
To trigger the bug, the app had to either lose focus before it was finished loading, or I had to manually take away focus. The instant that it regained focus, it would crash.
Step one, find a reproducible case, and see what other things always occur at the same time
In a multithreaded application, the first thing to do is start writing stuff to the Visual Studio Output window. It is an integrated thread-safe logger that you should use when it is the right tool for the job. Four times in a row, I was able to get the crash to occur right after the first render after the window had regained focus. A lead!
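A minimal sketch of that kind of logging, not the code from the app itself. The helper names formatLogLine and debugLog are made up for illustration; OutputDebugStringA is the Win32 call that writes to the debugger's Output window, and the stderr fallback is only there so the sketch also builds outside Windows. Tagging each line with the thread id makes it easier to tell which thread logged last before the crash.

```cpp
#include <cassert>
#include <iostream>
#include <mutex>
#include <sstream>
#include <string>
#include <thread>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: builds one log line tagged with the calling thread's id.
std::string formatLogLine(const std::string& where, const std::string& what) {
    assert(!where.empty());
    std::ostringstream line;
    line << "[thread " << std::this_thread::get_id() << "] "
         << where << ": " << what << "\n";
    return line.str();
}

// Hypothetical helper: sends the line to the debugger's Output window on
// Windows, and to stderr elsewhere so the sketch still runs.
void debugLog(const std::string& where, const std::string& what) {
    const std::string line = formatLogLine(where, what);
#ifdef _WIN32
    OutputDebugStringA(line.c_str());
#else
    static std::mutex guard;  // keeps lines from interleaving on stderr
    std::lock_guard<std::mutex> lock(guard);
    std::cerr << line;
#endif
}
```

A call such as debugLog("Render", "frame done") at the end of each major step leaves a trail where the last entry before the crash is the one to look at.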
But then I was able to reproduce a crash at a different point in the execution (use the last item logged to narrow down the crash).
Step two, exhaust all the easy options first
Although this bug would occur at different points in the main thread's execution (or in a thread whose activities I was carefully logging), it was reproducible 100% of the time. It never failed to occur. Since the bug could be toggled on and off by changing m_bExclusiveMouse to true or false (which is used to decide which flags to send to DirectX), I had two places to look.
- Make sure that the mouse was being created properly: from a valid DI8 device, with good flags, etc.
- Remember, detailed logging was useless for this bug, at least in the places that I was logging heavily, because the last log entry before the crash was not consistent.
- Check for WM_MOUSEMOVE or other message handlers and hooks that could be triggering the fault.
- In Visual Studio, you can go to Debug/Exceptions/Win32 Exceptions/c0000005 Access violation and check it, which will cause a dialog to come up whenever that exception occurs. This is disabled by default, because many garbage collectors use a memory write barrier. As an aside, 80000001 is thrown for the same reason, but by different GCs.
- I posted the error on gamedev.net, in hopes of an answer. I actually got one right away, and ignored it until Google confirmed it.
- I feel really dumb for not looking at this first, but I had been dismissing a dialog box that comes up when the error happens, and it contained slightly new information. When the exception first comes up, there is an option to hit Break at the assert dialog. If I choose that option, a second dialog comes up, saying "No symbols are loaded for any call stack frame. The source code cannot be displayed." Turns out, unlike "00000000()", that string is actually searchable in Google with decent results. A little research confirmed that I was kinda on the wrong track. I was looking for a pure multithreading bug, but apparently this bug is caused by stack corruption of a sort, which in turn may be a multithreading bug.
Unfortunately, after step 5, all that I really knew is that I had a complex bug on my hands, and no clue how to solve it. Time to break out the big guns. There is only one way that I know of to solve an unsolvable bug. By this time, I had already spent 6 hours digging in.
Divide and conquer.
When you cannot solve a bug, there are two useful things to do. One is to start from square one, find the minimal application that reproduces the error, and debug that. The other is to start ripping out parts of the program, commenting out whole .h and .cpp files, until the error stops occurring. Make a backup first, or better yet, use a source control system. Divide and conquer did some very nice things for me:
- First, many of the places that I suspected to be at fault simply weren't. Whole dependencies on 3rd party libraries were removed.
- Surprisingly, the very very very last spot in the entire codebase or dependency codebase that created a secondary thread was commented out, and the error still occurred!
- Finally, I found a single line of code that caused the problem.
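The ripping-out process can be thought of as a binary search over the parts of the program. The sketch below is an idealized version, assuming a single culprit, a crash that reproduces every time (as this one did), and parts that can be switched off independently. Real code rarely splits this cleanly, so in practice each round means commenting out and stubbing by hand. The names findCulprit and CrashTest are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical predicate: build and run the app with only the given parts
// enabled, and report whether the crash still reproduces.
using CrashTest = std::function<bool(const std::vector<std::string>&)>;

// Returns the one part whose presence triggers the crash.
// Assumes exactly one culprit and a deterministic repro.
std::string findCulprit(std::vector<std::string> suspects,
                        const CrashTest& stillCrashes) {
    assert(!suspects.empty());
    while (suspects.size() > 1) {
        const auto half = static_cast<std::ptrdiff_t>(suspects.size() / 2);
        std::vector<std::string> firstHalf(suspects.begin(),
                                           suspects.begin() + half);
        std::vector<std::string> secondHalf(suspects.begin() + half,
                                            suspects.end());
        // Keep whichever half still reproduces the crash on its own.
        suspects = stillCrashes(firstHalf) ? firstHalf : secondHalf;
    }
    return suspects.front();
}
```

Each round halves the number of suspects, so even a large codebase narrows down to one file, and then one line, in a modest number of build-and-run cycles.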
The cause of the bug
Microsoft! Their Detours library caused the problem. Well, more correctly, someone got in over his head with *drum roll* ---> a custom hacked version of DetourFunction! Look!
/////////////////////////////////////////////////////////////////
//Function : DetourFunction()
//Description : Detours a function by putting a jmp in the
// function to our detour function, and copying
// the old function so we can still use that too.
//
//addroffunction : Address of function to detour
//addrofdetour : address of where to detour it to
//addrofreal : address returned of original function
/////////////////////////////////////////////////////////////////
And then it just goes on with memcpy calls and a memory permission unlock so that a code segment can be made writable. Basically, this function is designed so that you can provide a hook for something that Windows doesn't provide a hook for, but this hook is installed processwide. I'm guessing that one part of this Rube Goldberg contraption (look at the source for yourself if you don't believe me!) was off by just a little bit, and somehow made the return value for one of the hacked-in functions be 0x00000000, which is how I got my nice useful call stack. And since it wasn't a function pointer, like I had also considered at one point, MSVC was completely oblivious until after I lost all stack information.
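To give a feel for how little room for error this kind of patching leaves, here is a sketch of the arithmetic behind "putting a jmp in the function". It is not the DetourFunction source and it does not touch real code memory; it only encodes a 32-bit x86 near jump (opcode 0xE9 followed by a displacement) into a byte array. The helper names and the addresses used with them are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size of an x86 near jump: one opcode byte (0xE9) plus a 32-bit displacement.
constexpr std::size_t kJmpSize = 5;

// Hypothetical helper: encodes "jmp detourAddr" as it would appear at
// patchAddr. Addresses are plain 32-bit integers; nothing is written to code.
std::array<std::uint8_t, kJmpSize> encodeNearJmp(std::uint32_t patchAddr,
                                                 std::uint32_t detourAddr) {
    // The displacement is relative to the instruction *after* the jump.
    const std::uint32_t rel =
        detourAddr - (patchAddr + static_cast<std::uint32_t>(kJmpSize));
    return {{
        0xE9,
        static_cast<std::uint8_t>(rel & 0xFF),          // little-endian bytes
        static_cast<std::uint8_t>((rel >> 8) & 0xFF),
        static_cast<std::uint8_t>((rel >> 16) & 0xFF),
        static_cast<std::uint8_t>((rel >> 24) & 0xFF),
    }};
}

// Inverse: the address a jump encoded at patchAddr actually lands on.
std::uint32_t decodeNearJmpTarget(
    std::uint32_t patchAddr, const std::array<std::uint8_t, kJmpSize>& bytes) {
    assert(bytes[0] == 0xE9);
    const std::uint32_t rel = static_cast<std::uint32_t>(bytes[1]) |
                              (static_cast<std::uint32_t>(bytes[2]) << 8) |
                              (static_cast<std::uint32_t>(bytes[3]) << 16) |
                              (static_cast<std::uint32_t>(bytes[4]) << 24);
    return patchAddr + static_cast<std::uint32_t>(kJmpSize) + rel;
}
```

Forgetting the 5-byte adjustment, or copying too few bytes of the original function before overwriting it, sends execution somewhere slightly wrong, and the debugger has no symbols to describe where.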
The library was Detouring ::GetCursorPos, which was being used in other places. I just wanted to thank you Microsoft for making libraries that only a very clever person could use, and I thank you, you very clever person, for choosing to use it. That is all.

technochakra said 8/3/08
Interesting problem. I ran across your blog today. Thought you'd want to see my article http://www.technochakra.com/debugging-divide-and-conquer-the-input-data/ which also talks about divide and conquer but not with code as you do.