How to fix bugs in someone else's software

Posted by Tim Kerchmar.

Hi Reader,

 

Occasionally, you may run across a commercial software vendor who is a diamond in the rough. Roger Corman (author of Corman Common Lisp) is just such a person. No matter how time-consuming or unimportant a particular stability fix might seem to him, he's never told me to go away, and he has always at least helped me help myself or given me a fix to test. He also gets props for having distributed the source to the Lisp runtime to all users, since without the source, I would have been so screwed.

 

Here are the steps for fixing bugs in someone else's software:

 

  1. Check their forums and newsgroup postings. Are they helpful and informative? Are there any long support threads where the vendor helped someone work through a tricky issue? Is the software complex enough that you could imagine complex failure states? If the software is badly enough written that the author can barely walk through the code, or the vendor is unresponsive, you're out of luck.
  2. When you ask for help, remember a few things:
    1. If the software vendor is a one-man shop, support is a distraction from other work. He is not sitting around hitting refresh on his email waiting for the next question to answer. Your good attitude about this frustrating bug in his product will go a long way toward earning his goodwill.
    2. You don't want to piss him off. He's probably the only guy in the world who knows this software product deeply, and if you couldn't trivially solve the bug from header files or documentation, you're going to need his help.
    3. When it comes to support, you get what you pay for, although the author's pride in his product can go a small ways.
    4. He is one person. The official term for you is "user". There are lots of you guys clamoring for his time.
    5. If you do your homework, he will do his.
    6. Don't distract him with your bad spelling or grammar.
  3. Your posts or emails to this vendor should be informative and should make it easy for him to determine the nature of the problem.
    1. State "If I do X, Y occurs. But if I don't do X, then Z occurs." in specific terms.
    2. Give some context, somewhere between "I was trying to do this", and the source code for your whole application. Tell him which version of the product you were using.
    3. Give him specific error messages and debug output. If you observe anything else out of the ordinary, mention that, too. He might not know to ask you.
    4. Give a theory about what you think is happening. That will often trigger him to educate you about that part of the product.
    5. If you don't get a response within 2 business days, it is appropriate to append a "did you get this" to your original message and resend it. They're busy people, and your email might have gotten lost. Bonus points if you did more research about how to trigger the problem.
    6. Make a copy of your app, and strip away extraneous stuff until you have the minimum amount of code required to reproduce the error. This builds confidence on his and your parts that you did in fact find a bug in his software, or he will quickly be able to educate you about how to use it properly. Don't send this to him until he asks for it. He'll ask for it if he wants it.
  4. If you get a response from the vendor asking you to try something, then do it! I work support for my day job, and I can't tell you how frustrating it is when I lay out exactly what I would like the customer to try, and the customer basically says, "that looks like work, you do it for me".
  5. A vendor educating you about how his software works in the context of a strange bug is a lot like when you learned calculus in school. It has the potential for an AHA! moment, but requires some digging in and comprehending on your part. You are a programmer, and source code can often be read and comprehended. If he's given you a technical explanation of the part of his product that is likely to be the cause of the failure, then it is your job to fire up a debugger and watch the code do what he said it would do. See where things diverge from what he says is expected behavior.
    1. Ask questions to clarify whether you really saw the execution path diverge from the expected.
    2. Ask him for clarification on confusing points. Now that you've done your homework, you won't come across like an idiot, and he will want to help you help yourself.
  6. If you comprehend and send an informative response, but you desperately need the bug resolved, then keep working on it. Try stuff. See if you can reduce the failure-causing program down to the smallest thing that still triggers the bug. If you are desperate but polite, and you keep doing your homework, a good vendor will stick with you to the end.
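
The "strip away extraneous stuff" step can be partly automated. What follows is only a rough sketch of one way to do it, a simplified greedy reduction and not anything from the original debugging session. The names `reduce_case` and `still_fails` are hypothetical; `still_fails` stands in for whatever you do to check that the trimmed-down program still triggers the bug (rebuild it, run it, look for the crash).

```python
from typing import Callable, List


def reduce_case(lines: List[str],
                still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop chunks of lines while the failure still reproduces.

    `still_fails` is a hypothetical predicate supplied by the caller. It
    must return True when the given lines still trigger the bug.
    """
    if not still_fails(lines):
        raise ValueError("the starting case must reproduce the bug")

    chunk = max(len(lines) // 2, 1)
    while True:
        i = 0
        removed_any = False
        while i < len(lines):
            # Try the program with lines[i : i + chunk] taken out.
            candidate = lines[:i] + lines[i + chunk:]
            if still_fails(candidate):
                lines = candidate      # chunk was extraneous, keep it out
                removed_any = True
            else:
                i += chunk             # chunk is needed, move past it
        if chunk > 1:
            chunk //= 2                # retry with finer-grained chunks
        elif not removed_any:
            return lines               # nothing more can be removed
```

For example, if the bug only needs a `load_dll` line and a `call_fn` line, calling `reduce_case(program_lines, still_fails)` on a six-line program whittles it down to those two. In practice each call to `still_fails` may mean a rebuild and a run lasting minutes, so it is worth making that check as cheap and as deterministic as you can before looping over it.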

 

Here was my first email to Roger on the latest thread:

 

Hi Roger,

I've been using CCL 3.01, and I think that I'm encountering some sort of GC failure that causes a crash. I forced a gc each frame (tried all levels), and there is no immediate crash. I did notice a curiously consistent memory access pattern right before the crash, and I was wondering if you had a theory about it.

Top of call stack:

Address, code bytes, code

All zeros below this point....
02E2FFF7 00 00 add byte ptr [eax],al
02E2FFF9 00 00 add byte ptr [eax],al
02E2FFFB 00 00 add byte ptr [eax],al
02E2FFFD 00 00 add byte ptr [eax],al
02E2FFFF 00 08 add byte ptr [eax],cl
02E30001 00 00 add byte ptr [eax],al
02E30003 00 00 add byte ptr [eax],al
02E30005 01 00 add dword ptr [eax],eax
02E30007 01 EE add esi,ebp
02E30009 ?? db ffh <-- debugger dumps me here
02E3000A EE out dx,al
02E3000B FF 00 inc dword ptr [eax]
02E3000D 00 00 add byte ptr [eax],al
02E3000F 00 00 add byte ptr [eax],al
02E30011 00 C3 add bl,al
02E30013 01 00 add dword ptr [eax],eax
02E30015 50 push eax
02E30016 06 push es
02E30017 00 00 add byte ptr [eax],al
02E30019 00 E3 add bl,ah
02E3001B 02 00 add al,byte ptr [eax]
The previous frame contained a function pointer that was returned from GetProcAddress a few minutes prior. That function pointer had been valid and frequently called during those minutes.

The Visual Studio Output window contains blocks like this:

First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x02972464.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297300c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297400c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297500c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297600c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297700c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297800c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297900c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297a00c.
First-chance exception at 0x01c76dae (CormanLispServer.dll) in CarrotRun.exe: 0xC0000005: Access violation writing location 0x0297b00c.
First-chance exception at 0x202dd2b1 in CarrotRun.exe: 0xC0000005: Access violation writing location 0x02730008.
First-chance exception at 0x028c8645 in CarrotRun.exe: 0xC0000005: Access violation writing location 0x028c8ff0.
First-chance exception at 0x2033d63c in CarrotRun.exe: 0xC0000005: Access violation writing location 0x02830924.
First-chance exception at 0x028c8ce1 in CarrotRun.exe: 0xC0000005: Access violation writing location 0x028c9000.
First-chance exception at 0x028c84fa in CarrotRun.exe: 0xC0000005: Access violation writing location 0x02872c58.
First-chance exception at 0x028a933c in CarrotRun.exe: 0xC0000005: Access violation writing location 0x028add3c.
They would correspond with frame spikes, suggesting that they were ordinary (gc 0) calls triggered by the small number of single-float heap allocations that occur each frame.

Now one really frustrating thing is that Microsoft uses the same code for access violations writing a location as it does for reading one, even though the first access violation from reading a memory location is part of the sequence immediately prior to the crash. This is what I see right before the crash:

First-chance exception at 0x027730e1 in CarrotRun.exe: 0xC0000005: Access violation reading location 0x027730e1.
First-chance exception at 0x02773fff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02774000.
First-chance exception at 0x02774fff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02775000.
First-chance exception at 0x02775fff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02776000.
First-chance exception at 0x02776fff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02777000.
First-chance exception at 0x02777fff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02778000.
...
First-chance exception at 0x02e2dfff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02e2e000.
First-chance exception at 0x02e2efff in CarrotRun.exe: 0xC0000005: Access violation reading location 0x02e2f000.
First-chance exception at 0x02e30009 in CarrotRun.exe: 0xC000001D: Illegal Instruction.
Unhandled exception at 0x02e30009 in CarrotRun.exe: 0xC000001D: Illegal Instruction.
A read is attempted at the beginning of each 4k block. I don't know enough about CCL's GC to say that this is definitely a GC problem, but I cannot think of anything in my app that has a memory access pattern like this. Before I rebuilt the CCL.DLL, when I was just debugging from the .exe, the problem manifested itself as an invalid function pointer. Anyhow, I'll look for more evidence about reproducibility. In the meantime, if you can think of a theory and ways for me to test those theories, that would be helpful, since I know very little about the internals of the GC. Thank you Roger!

-Tim Kerchmar
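
The "read at the beginning of each 4k block" observation in the email can be checked mechanically instead of by eyeballing the Output window. Here is a minimal sketch, assuming the log lines look exactly like the ones quoted above. The helper names `fault_addresses` and `dominant_stride` are made up for illustration and are not part of Visual Studio or Corman Lisp.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple

# Matches the "Access violation reading/writing location 0x..." part of a
# first-chance exception line, in the format quoted in the email above.
ACCESS_RE = re.compile(
    r"Access violation (reading|writing) location 0x([0-9a-fA-F]+)"
)


def fault_addresses(log: str, kind: str = "reading") -> List[int]:
    """Return the faulting addresses of the given kind, in log order."""
    return [int(m.group(2), 16)
            for m in ACCESS_RE.finditer(log)
            if m.group(1) == kind]


def dominant_stride(addresses: List[int]) -> Optional[Tuple[int, int]]:
    """Return (most common gap between consecutive faults, its count)."""
    deltas = Counter(b - a for a, b in zip(addresses, addresses[1:]))
    if not deltas:
        return None
    return deltas.most_common(1)[0]
```

Fed the first six "reading location" lines from the email, `dominant_stride(fault_addresses(log))` reports a gap of 0x1000 (4096 bytes) occurring four times, which matches the one-fault-per-4k-block pattern described there. A summary like that is also easier to paste into a support email than a screenful of raw exception lines.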

 

Roger goes on to educate me about the product and ask for specific information. The debugging process was basically detecting the error state earlier and earlier before the crash occurred, using theories and tests to find oddities, and reducing the test application down to the simplest possible case. Eventually, the error state was detected right at the incorrect line of code in the foreign function interface, and the problem was solved. It did take three weeks to find, though.

 

-Tim Kerchmar


  1. John Pallister said 9/29/08  

    I agree, Roger Corman is a great guy and Corman Lisp is a nice product, especially for the price and the fact that you get all the source. I have also found him pleasant and helpful (and he is a busy guy...).

     


  2. Tim Kerchmar said  

    You should vote on the implementation of choice poll at lispforum.com. I was the only one claiming to use CormanCL. :) Do you have a tech blog?



