How The Hell Things Work: How Windows Debugger Finds Symbols For Your Code

Every once in a while, when I am debugging something at work, I run into the problem of missing symbols. All of my products build symbols should be copied up to a symbol server, which makes everything “just work.” When it doesn’t just work, it always takes a manually process of figuring out how the hell symbols are supposed to work… again.

The basic scenario is that a debugger needs to use information it has available (e.g., in a crash dump file) to find the pdb file that matches the code you are debugging. Often, not a lot of information is available to the debugger, but where we are going to start is the header for the image. Let us first examine the module information…

0:017> lmvm mysvc
Browse full module list
start end module name
00007ff6`1a070000 00007ff6`1a2c2000 mysvc T (no symbols)
   Loaded symbol image file: mysvc.exe
   Image path: mysvc.exe
   Image name: mysvc.exe
   Browse all global symbols functions data
   Timestamp: Sun May 15 23:47:13 2022 (6281E561)
   CheckSum: 00254124
   ImageSize: 00252000
   File version: 2.9.8171.14983
   Product version: 2.9.8171.14983
   File flags: 0 (Mask 3F)
   File OS: 40004 NT Win32
   File type: 1.0 App
   File date: 00000000.00000000
   Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
   Information from resource tables:

Notice the checksum and image size here. These two pieces of information will be used by the debugger to build a hash that it uses to locate the indexed image file on the symbol server. To see this in action, we can configure the debugger to point to an empty symbol server path and turn on noisy symbol output…

0:017> .sympath SRV*D:\dust
DBGHELP: Symbol Search Path: srv*d:\dust
Symbol search path is: SRV*D:\dust
Expanded Symbol search path is: srv*d:\dust
************* Path validation summary **************
Response Time (ms) Location
Deferred SRV*D:\dust
0:017> !sym noisy
0:017> .reload /f mysvc.exe
SYMSRV: BYINDEX: 0x7
   d:\dust
   mysvc.exe
   6281E561252000
SYMSRV: UNC: d:\dust\mysvc.exe\6281E561252000\mysvc.exe - path not found
SYMSRV: UNC: d:\dust\mysvc.exe\6281E561252000\mysvc.ex_ - path not found
SYMSRV: UNC: d:\dust\mysvc.exe\6281E561252000\file.ptr - path not found
SYMSRV: RESULT: 0x80070003
DBGHELP: DBGENG: mysvc.exe - Image mapping disallowed by non-local path.
Unable to load image mysvc.exe, Win32 error 0n2
DBGENG: mysvc.exe - Partial symbol image load missing image info
DBGHELP: No header for mysvc.exe. Searching for dbg file
DBGHELP: .\mysvc.dbg - file not found
DBGHELP: .\exe\mysvc.dbg - path not found
DBGHELP: .\symbols\exe\mysvc.dbg - path not found
DBGHELP: mysvc.exe missing debug info. Searching for pdb anyway
DBGHELP: Can't use symbol server for mysvc.pdb - no header information available
DBGHELP: mysvc.pdb - file not found
*** WARNING: Unable to verify timestamp for mysvc.exe
DBGHELP: mysvc- no symbols loaded

We can see that the debugger tries to find the indexed binary file based on a concatenated image timestamp and image size. If it can’t find this (i.e., if the binary itself is not indexed), then the debugger is not going to be able to figure out the hash to lookup your pdb. So next, let us manually add the binary to our symbol store and see what the next step looks like.

D:\dev> & symstore.exe add /f mysvc.exe /s D:\dust /t ProductName
Finding ID... 0000000001

SYMSTORE: Number of files stored = 1
SYMSTORE: Number of errors = 0
SYMSTORE: Number of files ignored = 0
0:017> .reload /f mysvc.exe
SYMSRV: BYINDEX: 0x8
   d:\dust
   mysvc.exe
   6281E561252000
SYMSRV: PATH: d:\dust\mysvc.exe\6281E561252000\mysvc.exe
SYMSRV: RESULT: 0x00000000
DBGHELP: d:\dust\mysvc.exe\6281E561252000\mysvc.exe - OK
DBGENG: d:\dust\mysvc.exe\6281E561252000\mysvc.exe - Mapped image memory
SYMSRV: BYINDEX: 0x9
   d:\dust
   mysvc.pdb
   B8FA9DF507C542C097244F72C28E50EE1
SYMSRV: UNC: d:\dust\mysvc.pdb\B8FA9DF507C542C097244F72C28E50EE1\mysvc.pdb - path not found
SYMSRV: UNC: d:\dust\mysvc.pdb\B8FA9DF507C542C097244F72C28E50EE1\mysvc.pd_ - path not found
SYMSRV: UNC: d:\dust\mysvc.pdb\B8FA9DF507C542C097244F72C28E50EE1\file.ptr - path not found
SYMSRV: RESULT: 0x80070003
DBGHELP: mysvc.pdb - file not found
DBGHELP: D:\a\_work\1\s\bld\x64\Release\mysvc.pdb - file not found
DBGHELP: mysvc- no symbols loaded

Ok, so the debugger was able to find the indexed binary, and somehow got this magical hash, B8FA9DF507C542C097244F72C28E50EE1, that it is trying to use to find the pdb file. To see where this comes from, we can examine the headers of the binary file using dumpbin.exe. For brevity, I will filter the output to just find the section I am interested in…

dumpbin /headers mysvc.exe | Select-String "Format: RSDS"

6281E561 cv            42 001FAA10   1F9610    Format: RSDS, {B8FA9DF5-07C5-42C0-9724-4F72C28E50EE}, 1, D:\a\_work\1\s\bld\x64\Release\mysvc.pdb

Here the exe file has embedded a link to the debug information. The guid here is taken as a pure string of hex digits (strip out the dashes and braces) and used as part of the file path to find the pdb. Again, to confirm, we will manually add the pdb file to our symbol store and retry the reload command.

D:\dev> & symstore.exe add /f mysvc.pdb /s D:\dust /t ProductName
Finding ID... 0000000002

SYMSTORE: Number of files stored = 1
SYMSTORE: Number of errors = 0
SYMSTORE: Number of files ignored = 0
0:017> .reload /f mysvc.exe
SYMSRV: BYINDEX: 0xA
   d:\dust
   mysvc.exe
   6281E561252000
SYMSRV: PATH: d:\dust\mysvc.exe\6281E561252000\mysvc.exe
SYMSRV: RESULT: 0x00000000
DBGHELP: d:\dust\mysvc.exe\6281E561252000\mysvc.exe - OK
DBGENG: d:\dust\mysvc.exe\6281E561252000\mysvc.exe - Mapped image memory
SYMSRV: BYINDEX: 0xB
   d:\dust
   mysvc.pdb
   B8FA9DF507C542C097244F72C28E50EE1
SYMSRV: PATH: d:\dust\mysvc.pdb\B8FA9DF507C542C097244F72C28E50EE1\mysvc.pdb
SYMSRV: RESULT: 0x00000000
DBGHELP: mysvc- private symbols & lines
d:\dust\mysvc.pdb\B8FA9DF507C542C097244F72C28E50EE1\mysvc.pdb

Now remember in elementary school how they used to teach you all the basic ways to do things that mostly worked, but then after all that they would just show you the “standard algorithm” that just always works? Yeah, well apparently, I do the same thing, so here is a nifty little debugger command that just kind does everything above…

0:017> !lmi mysvc
Loaded Module Info: [mysvc] 
         Module: mysvc
   Base Address: 00007ff61a070000
     Image Name: mysvc.exe
   Machine Type: 34404 (X64)
     Time Stamp: 6281e561 Sun May 15 23:47:13 2022
           Size: 252000
       CheckSum: 254124
Characteristics: 22  
Debug Data Dirs: Type  Size     VA  Pointer
             CODEVIEW    42, 1faa10,  1f9610 RSDS - GUID: {B8FA9DF5-07C5-42C0-9724-4F72C28E50EE}
               Age: 1, Pdb: D:\a\_work\1\s\bld\x64\Release\mysvc.pdb
           VC_FEATURE    14, 1faa54,  1f9654 [Data not mapped]
                 POGO   414, 1faa68,  1f9668 [Data not mapped]
    Symbol Type: DEFERRED - No error - symbol load deferred
    Load Report: no symbols loaded

Well, that about does it for the basics of how symbol resolution works. In a future article I will address some more advanced topics, such as indexing pointers to files and compressed files and such advanced topics.

Debug Break on Access

I had to diagnose a crash in my driver code recently. Every time it happened, it was because a particular pointer value inside a structure had been decremented incorrectly. Decrementing the actual pointer is something that should never happen, and I couldn’t figure out how the hell it was. I had tried breakpoints in all the code that I thought was potentially problematic, and turned up nothing. Finally I came up with the following gem…

ba w 8 poi(frxdrvvt!ActiveRedirections)+0x78 ".if (poi(poi(frxdrvvt!ActiveRedirections)+0x78) % 2 == 0) { g }"

Basically, it says break on write access to the pointer in question, and if the value is divisible by 2 (since pointers have to be aligned on 8-byte boundaries) then just continue. It broke on the decrement I suspected was there and was immediately obvious what was happening. (It was my new code, but it was code I never suspected of being problematic because it was so simple, lol.)

C# Garbage Collection, or Why Did My Object Just Disappear?

I ran across a bug in a project just the other day that I thought others could find interesting. In this project, I had a main thread that was listening for connections and then serving back some data. I also had a timer that would periodically trigger an update of the data that was being served. This was using a System.Threading.Timer and therefore was running on a secondary thread from the thread pool.

The problem was that the timer would run two or three times (in fifteen minute intervals) and then it would just magically stop running. I initially thought perhaps locking issues between the threads, so I went through and locked everything that was shared, all to no avail.

And to make the problem even more frustrating, I couldn’t reproduce it in a debugger. I initially thought that this was perhaps because I was not patient enough to wait 45 minutes for it to happen. But it turned out to be a release vs. debug kind of problem: the release build had the problem, while the debug build didn’t seem to.

For research purposes, take as an example the follow little program. This program should have a main thread that just sleeps the day away, and a timer that prints out a debug message every 5 seconds. If I run the debug build of this, it works great, but running the release build on my machine, the timer thread didn’t even run a single time! Waahh?!?!

class Program
{
  static long i = 0;

  static void TimerCallback(object state)
  {
    Debug.WriteLine("{0:D5}: TimerCallback", i);
  }

  static void Main(string[] args)
  {
    // Trigger the callback every 5 seconds
    System.Threading.Timer t = new System.Threading.Timer(TimerCallback, null, 0, 5000); 

    while (true)
    {
      Thread.Sleep(2500);
    }
  }
}

It turns out what is going on here is that the system is happily garbage collecting my Timer object. According to the system, that t variable never gets used after it is initialized, so it’s safe to just throw it away. If you look at the MSIL using the ILDASM tool, you see the following for the release build. Notice that it does a newobj to create the Timer object, and then rather than storing it in a local with something like stloc.0, it just pops it off the stack and doesn’t keep any reference on it.

IL_0013:  newobj     instance void [mscorlib]System.Threading.Timer::.ctor(class [mscorlib]System.Threading.TimerCallback,
                                                                             object,
                                                                             int32,
                                                                             int32)
IL_0018:  pop

The debug version of the same code like the following, and note that it declares a local object, and then stores the reference to the Timer object in that local object.

.locals init ([0] class [mscorlib]System.Threading.Timer t,
           [1] bool CS$4$0000)
...
IL_0014:  newobj     instance void [mscorlib]System.Threading.Timer::.ctor(class [mscorlib]System.Threading.TimerCallback,
                                                                             object,
                                                                             int32,
                                                                             int32)
IL_0019:  stloc.0

Now once I figured out what was going on, fixing it was trivial. A using statement around the disposable Timer object keeps it in scope, and deterministically cleans it up when appropriate. (Of course, this is how the code should have been written in the first place, but look at the cool problem I got to figure out as a result of my lazy coding.)

Finding the size of a list in Kernel debugger

When I am kernel debugging, I use the !list command a LOT. It basically walks a doubly-linked list and dumps the memory of each list entry. You can also have it dump a more readable type, or have it run commands for each list entry. Today I ran across a list that was extremely large. I wasn’t really interested in the data so much as understanding just how large it was. So I had to figure out how to count the number of list entries and I figured I would share.

r $t0 = 0; !list " -x \"dd @$extret L0; r $t0 = @$t0 + 1\" 0xffffffff'0a1bcdef "; ? @$t0

Gibberish? Let’s dig in and analyze the statement. For starters, it’s a multi-part command, separated by semi-colons. The first statement sets a pseudo-register to 0, the next part uses the !list command to hit every list entry, and the last part just evaluates the pseudo-registry and dumps it’s value afterward.

r $t0 = 0
!list " -x \"dd @$extret L0; r $t0 = @$t0 + 1\" 0xffffffff'0a1bcdef "
? @$t0

Now the guts of this is in the !list command, which runs a multi-part command for each list entry. The first part of the command is just there to use up the @$extret parameter, so windbg doesn’t try to tack it onto the end of our second command, which is simply incrementing the value in the pseudo-register.

Voila! Now I know that my list has 21,000 or so entries in it, and I now understand why there just might be a performance problem. 😉

Happy debugging!

Signing Kernel-mode Drivers with SHA-2/SHA-256

I hit a pretty frustrating problem the other day when I renewed my digital certificate for code signing and my drivers stopped working. When I renewed the certificate through VeriSign (Symantec), I was given the option of choosing a SHA-1 or a SHA-2 (with 256-bit digest) hashing algorithm for the certificate. After reading through the background information on these options, and understanding that the new and improved SHA-2 was supported on Vista and above, I determined that SHA-2 would meet my requirements and clicked the button.

I got the new cert, put it in my cert store, and reconfigured all my projects in Visual Studio to sign with the new cert. Everything was working perfectly. It installed and run fine on my test machine, but then one of my co-workers complained that the driver wouldn’t load on his machine. After a short while I realized that I had the same problem on my machine, I just wasn’t seeing it because I had a kernel debugger attached. (Note that there are some registry options that can enable mandatory kernel mode code signing even when the debugger is attached. Read more at MSDN.)

So now that I could reproduce the problem, I started troubleshooting the signing itself. Using “signtool.exe verify /kp”, I verified that the driver was signed, all certificates were valid, and the cross-certificate from Microsoft was also in the signing chain. Signtool claimed that everything was just fine, but the darn thing still wouldn’t load. I was getting the error ERROR_INVALID_IMAGE_HASH (0x80070241): “Windows cannot verify the digital signature for this file. A recent hardware or software change might have installed a file that is signed incorrectly or damaged, or that might be malicious software from an unknown source.” After searching the event logs for a bit, I found the Microsoft-Windows-CodeIntegrity log which contained an Event 3004 with basically the same message as the error code above.

After exhausting my Google-Fu skills with no results, we started brainstorming about what was different between the build yesterday that worked and the one today. Somebody mentioned encryption algorithms, and the light bulbs went on in my head. Adding the specific case of SHA-2 to my searching yielded a couple of pages: Practical Windows Code and Driver Signing and PiXCL: Signing Windows 8 Drivers.

Some of my own testing showed that I couldn’t get a driver built with Visual Studio and a SHA-2 certificate to load on both Windows 7 and Windows 8. Theoretically you could sign it twice with each algorithm (using the /fd switch of signtool), but Visual Studio with the WDK is not really set up to do that, I didn’t want to go monkeying with MSBuild stuff too much.

After calling Symantec support and describing the problem we were basically told, “oh yeah, the support for SHA-2 is not really there for a lot of systems.” You think? Maybe you could have mentioned this when we were renewing the certificate?

The moral of the story, dear reader, is stick with the SHA-1 hash algorithm for signing your kernel mode drivers, at least for the time being.

Debugging User-mode Processes from the Kernel-mode Debugger

In my day job, I almost exclusively use a kernel-mode debugger attached to a Hyper-V virtual machine as the target. Most of the serious issues I see are kernel-mode issues (naturally those are the worst because they tend to crash the machine, aka the Blue Screen of Death). But periodically I see something pop up in a user-mode component, and it’s honestly just a big pain in the butt to have to install debugging tools on the target machine, and then reproduce whatever is going on under that debugger. Especially since I am already in a debugger!

I hit this just the other day when my user-mode component was failing an assertion, and I wanted to see what it was without having to jump through all those hoops.

I recently figured out how to debug the user-mode process from my kernel mode debugger. It’s not perfect, but it works pretty well, and it’s certainly more efficient than the alternative. What you do is use the .process debugger command with a few parameters. So here are the series of commands that I used.

!process 0 0
.process /i /r /p ffffffff12345678
g

The first command is basic, and just lists all the processes on the machine. I used this just so I could get the process object pointer for use in the next command.

The second command is where the magic happens. The switches to the process command attach to the process invasively, which means that the process will actually become active. Another does some address translation to make it easier to see the user-mode process space. The third reloads the user-mode symbols for the process. In my experience, sometimes I have to still do this manually.

Finally, the go command is required because the target system actually has to activate the process. This will return almost immediately, and you will be running in the context of the user-mode process. You can now set breakpoints, step through code, etc. all without leaving the comfort of your kernel-mode debugger.

Happy debugging!

Open Source Project for Testing Native API System Calls

At work I have recently been working on some system software that integrates closely with the registry. This means I require some fairly in-depth knowledge about the Registry APIs, and I also need a way to test their functionality (both with and without my software involved).

I have done this in the past by writing custom code, which then gets lost or out-of-date. This time around I decided to make a more interactive and reusable tool to help me accomplish this, and my company – FSLogix, Inc. – decided to release the tool as open source on CodePlex. I also wrote a blog article with more in-depth information about the tool over at the FSLogix Blog.

IoGetCurrentIrpStackLocation in the debugger

Today I had occasion to debug a problem with some IRP handling code in my driver. In the particular debugger session I found myself in, I wanted to examine some of the Irp parameters, found in the current stack location. Unfortunately I had only a pointer to the Irp in this code, and therefore needed to figure out how to find the stack location pointer using the debugger. Fun!

The function that does this in code is IoGetCurrentIrpStackLocation, which I decided to disassemble. The first part of the function basically just checks the StackCount and CurrentLocation members of the Irp to make sure that everything is ok. (This fires an assert if it doesn’t check out.) Then near the bottom of the function, we find

drv!IoGetCurrentIrpStackLocation+0x42 [source language="removed"][/source]:
23318 9dc43ea2 8b4d08          mov     ecx,dword ptr [ebp+8]
23318 9dc43ea5 8b4160          mov     eax,dword ptr [ecx+60h]
23319 9dc43ea8 8be5            mov     esp,ebp
23319 9dc43eaa 5d              pop     ebp
23319 9dc43eab c20400          ret     4

So first this moves the Irp pointer into ecx, and then goes to offset 60 within that structure (which happens to be outside the range of the documented structure), and puts the pointer there into eax for return to the caller. So I try that in my debugger and compare with the output of the !irp command.

kd> !irp 9f104f68
Irp is active with 1 stacks 1 is current (= 0x9f104fd8)
 No Mdl: System buffer=9f08cbf0: Thread 88e05558:  Irp stack trace.  
     cmd  flg cl Device   File     Completion-Context
>[  e, 0]   5  1 88dcbd18 88e05ab8 00000000-00000000    pending
	       FileSystemFSLX
			Args: 0000040c 00000000 94000004 00000000

kd> dd 9f104f68+60 L1
9f104fc8  9f104fd8

kd> db 9f104fd8
9f104fd8  0e 00 05 01 0c 04 00 00-00 00 00 00 04 00 00 94  ................
9f104fe8  00 00 00 00 18 bd dc 88-b8 5a e0 88 00 00 00 00  .........Z......
9f104ff8  00 00 00 00 15 15 15 15-?? ?? ?? ?? ?? ?? ?? ??  ........????????

At this point, I now realized that this structure that I needed to look at really doesn’t provide anything that’s not already provided in the !irp output. The Args output of that command correspond to the members of the IO_STACK_LOCATION.Parameters union. In this case, I am looking at a device control Irp, so these parameters are OutputBufferLength (40c), InputBufferLength (0), IoControlCode (94000004), and Type3InputBuffer (0).

So I guess the bottom line of this post is that !irp is cool and does just what you need it to, I just had to poke around a little bit before I realized it.

More debugging-geekness

So yesterday I was troubleshooting some window creation issues, and had to fool around in the kernel side of window creation, down in win32k.sys. Specifically I was looking at window class registration, which happens when you call RegisterClassEx from your Windows app. Down in the kernel, some magic happens with creating Atoms as part of the window class registration. I traced through a bunch of win32k.sys routines to figure out where in memory they were storing this, and then I wanted to dump the table. After dumping about 4 of the entries manually, I got bored and wrote this little gem:

r $t0=poi(poi(win32k!UserAtomTableHandle)+c)
.for ( r $t1=0; @$t1 < @$t0; r $t1 = @$t1 + 1 ) { du poi(poi(win32k!UserAtomTableHandle)+10+( @$t1 * 4))+c }

Basically, this uses the symbol win32k!UserAtomTableHandle to find the length of the table, and then uses a for loop to go through, calculating the offset of each item, and them dumping its string value. On my Windows 7 system it produced something like this:

8c2a3d1c  "Native"
878b0c9c  "ObjectLink"
87e1e18c  "AeroWizardInternalFrameButtonCli"
87e1e1cc  "cked"
878cb314  "Static"
878cb104  "DDEMLUnicodeClient"
9620faec  "DataObject"
8c2affa4  "ACTIVATESHELLWINDOW"
8c2afe34  "FlashWState"
9620fa84  "SysCH"
8c2b2ce4  "PBrush"
8c3b8f24  "MSUIM.Msg.RpcSendReceive"
878bb7b4  "SysIC"
878cb1ec  "DDEMLEvent"
878bb784  "SHELLHOOK"
8c2b2e0c  "Custom Link Source"
9159dc84  "AltTab_KeyHookWnd"
91529084  "Search Box"
878bb6f4  "SysDT"
8c2b2dd4  "Link Source"
9620fb8c  "FileName"
87e35b0c  "GDI+ Accessibility"
878bb664  "SysWNDO"
878bb854  "DDEMLAnsiServer"
87e0c0bc  "SysLink"
9620fb24  "NetworkName"
8c2cde3c  "USER32"
8c2b2d14  "OleDraw"
9620fb5c  "FileNameW"
8c2b2bec  "MoreOlePrivateData"
8c282434  "Edit"
9620fbbc  "Binary"
878cb374  "OleClipboardPersistOnFlush"
8c2a3d4c  "OwnerLink"
878cb2e4  "ListBox"
8c2b2e54  "Embed Source"
878bb634  "SysIMEL"
878cb224  "ComboLBox"

Neato Debugging Trick

I had to debug an annoying little problem today that I thought might be worth writing about. I was interested in walking through some code that was failing, but the same code was getting called in a recursive loop, so there were literally hundreds of successful runs that I was not interested in prior to the single failure I did care about.

Now a normal usermode developer might just add some special code at the point of failure to detect the failure and recall the failing function. Nice and easy. But that’s really not any fun, and when you’re doing kernel debugging, writing some new code and getting it running on the machine is not quite as simple (it’s not hard, just more time consuming).

Enter this neato debugging trick…

bp address "j (dwo(status)!=0) 'r @rip=fffff880`02b5bd1f'; 'gc'"

Basically this executes a conditional test (the “j” command) each time the breakpoint is hit. If the DWORD value represented by the variable named ‘status’ is non-zero, then I know I’ve hit the failure condition. In that case, I just adjust the instruction pointer back up to before the failing function call, leaving me right where I am ready to trace into the function and see the failure. Otherwise, the breakpoint essentially just hits ‘Go’ to continue on to the next hit.

The syntax here is a bit rough, and would have to be modified if your program isn’t always at the same code location (since I hard-coded the rip register). It could be replaced with an offset from the current location to be a bit more elegant. But since I was working on a driver, it was always in memory and at the same place, so I was lazy. (A habit that always pays off immediately.)