The Internet is amazing | Rickard Andersson

The Internet is amazing

Much of what I do at work is problem solving. Something doesn’t work the way it should and I dive in and try to figure out what the problem is, what’s causing the problem and ultimately what to do about it. I love this part of my job. Whether it’s digging through code, browsing logs or troubleshooting application errors, I’m a happy camper.

Some time ago, we noticed that virtually all of the many hundreds of machines on campus, regardless of hardware configuration, were intermittently crashing. We were never able to reproduce the problem, but it was happening at a rate of maybe one or two machines a day. We started out by trying all the standard fixes, such as updating drivers and the BIOS, as well as completely reinstalling some of the machines. Nothing seemed to help. The machines blue-screened and reported a variety of error codes (7e, 50, 0a, etc.). We were stumped.

At this point, I was getting pretty frustrated at not being able to solve the problem. In a desperate attempt to find the cause, I installed WinDbg (part of the Debugging Tools for Windows) and loaded up minidumps from a handful of machines. Using the !analyze command, you can get WinDbg to parse the memory dump and output what it thinks might be the culprit behind the crash. I was hoping that the different memory dumps would point to some kind of common driver or executable, but some of them blamed fastfat.sys, others pointed a finger at ntkrpamp.exe and some put the blame on “memory_corruption”. I was getting nowhere and I needed help.
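When you have dumps from many machines, it can help to tally which image each analysis blames before reading them one by one. The following is only a minimal sketch, not what I actually ran: it assumes you have saved the text output of the analysis for each dump, and that each report contains an "IMAGE_NAME:" line. The function names and the sample reports are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each saved report contains a line such as
# "IMAGE_NAME:  fastfat.sys". Reports without one count as "<unknown>".
IMAGE_NAME = re.compile(r"^IMAGE_NAME:[ \t]*(\S+)", re.MULTILINE)


def blamed_image(report):
    """Return the lower-cased IMAGE_NAME value from one report, or None."""
    match = IMAGE_NAME.search(report)
    return match.group(1).lower() if match else None


def tally_blamed_images(reports):
    """Count how often each image is blamed across a list of reports."""
    counts = Counter()
    for report in reports:
        counts[blamed_image(report) or "<unknown>"] += 1
    return counts


if __name__ == "__main__":
    # Made-up excerpts for illustration, not real dump output.
    samples = [
        "MODULE_NAME: fastfat\nIMAGE_NAME:  fastfat.sys\n",
        "MODULE_NAME: nt\nIMAGE_NAME:  ntkrpamp.exe\n",
        "IMAGE_NAME:  memory_corruption\n",
        "IMAGE_NAME:  fastfat.sys\n",
    ]
    for image, count in tally_blamed_images(samples).most_common():
        print(f"{count:3d}  {image}")
```

A tally like this only shows what the debugger blames. As it turned out in my case, the blamed image can be a victim of another driver's bug and not the culprit.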

I searched around for a good discussion forum to ask for help and ended up in the troubleshooting section of the Sysinternals forums. More or less immediately, I got a response from someone called Scott. He directed me to enable full memory dumps on a couple of machines and to enable Driver Verifier for all non-Microsoft drivers. At the time, I had never even heard of Driver Verifier, but apparently it’s been included in Windows since Windows 2000. Here’s what Wikipedia has to say about it:

Driver Verifier is a tool included in Microsoft Windows that replaces the default operating system subroutines with ones that are specifically developed to catch device driver bugs. [1] Once enabled, it monitors and stresses drivers to detect illegal function calls or actions that may be causing system corruption. It acts within the kernel mode and can target specific device drivers for continual checking or make driver verifier functionality multithreaded, so that several device drivers can be monitored at the same time. [1] It can simulate certain conditions such as low memory, I/O verification, pool tracking, IRQL checking, deadlock detection, DMA checks, IRP logging etc.

So I enabled it and after a couple of days I had a number of full memory dumps created while Driver Verifier was running. I loaded them up in WinDbg, but I was none the wiser. I needed more help so I took the liberty of sending Scott a private message asking him if he would be willing to take a quick look at the dumps for me. Scott replied that he actually enjoyed groveling through memory dumps and that he in fact taught a week-long crash dump analysis lab! Talk about finding the right man for the job. So I sent the dumps to Scott, who got back to me shortly thereafter with a theory.

At the time of the crash, it appeared that a ZIP file was being flushed out to a removable FAT drive, but the in-memory structures for the file had been torn down already, causing the crash. Scott was able to trace the memory address of the prematurely torn down structure to an “SRTSP structure”. SRTSP.sys is a Symantec Antivirus filter driver. It seemed to make sense. Symantec does indeed check files before they are saved to removable drives. Scott also informed me that the driver in question was about a year old and that we could try upgrading it to the latest version. We did, and about two weeks later we have yet to experience a single crash.

The moral of the story, I guess, is that I should have known better than to use an almost year-old version of the Symantec Endpoint Protection client, but the cool thing about the story is Scott. I was stuck and asked for help in an online discussion forum. To my rescue came a complete stranger who not only put time and effort into helping me, but also turned out to be extremely competent at what he did. Amazing!

Thank you very much Scott!

One comment

  1. Hugh Jackson
    Posted March 30, 2010 at 13:58

    The use of internet in my life to solve technical problems has become so pervasive, I can’t imagine how the people who came before you and I dealt with ‘stumping’ problems. Thank goodness for the interwebs.
