I learned a new favorite kernel debugger trick tonight. I regularly have a kernel debugger attached while working on my driver, but tonight experienced a crash in my user mode service. Not wanting to set up a new debugger inside the vm, I googled around and came up with the following:

!gflag +soe

This windbg command makes all exceptions go to the kernel mode debugger first. Voila!

I ran across a bug in a project just the other day that I thought others could find interesting. In this project, I had a main thread that was listening for connections and then serving back some data. I also had a timer that would periodically trigger an update of the data that was being served. This was using a System.Threading.Timer and therefore was running on a secondary thread from the thread pool.

The problem was that the timer would run two or three times (in fifteen minute intervals) and then it would just magically stop running. I initially thought perhaps locking issues between the threads, so I went through and locked everything that was shared, all to no avail.

And to make the problem even more frustrating, I couldn’t reproduce it in a debugger. I initially thought that this was perhaps because I was not patient enough to wait 45 minutes for it to happen. But it turned out to be a release vs. debug kind of problem: the release build had the problem, while the debug build didn’t seem to.

For research purposes, take as an example the follow little program. This program should have a main thread that just sleeps the day away, and a timer that prints out a debug message every 5 seconds. If I run the debug build of this, it works great, but running the release build on my machine, the timer thread didn’t even run a single time! Waahh?!?!

class Program
{
  static long i = 0;

  static void TimerCallback(object state)
  {
    Debug.WriteLine("{0:D5}: TimerCallback", i);
  }

  static void Main(string[] args)
  {
    // Trigger the callback every 5 seconds
    System.Threading.Timer t = new System.Threading.Timer(TimerCallback, null, 0, 5000); 

    while (true)
    {
      Thread.Sleep(2500);
    }
  }
}

It turns out what is going on here is that the system is happily garbage collecting my Timer object. According to the system, that t variable never gets used after it is initialized, so it’s safe to just throw it away. If you look at the MSIL using the ILDASM tool, you see the following for the release build. Notice that it does a newobj to create the Timer object, and then rather than storing it in a local with something like stloc.0, it just pops it off the stack and doesn’t keep any reference on it.

IL_0013:  newobj     instance void [mscorlib]System.Threading.Timer::.ctor(class [mscorlib]System.Threading.TimerCallback,
                                                                             object,
                                                                             int32,
                                                                             int32)
IL_0018:  pop

The debug version of the same code like the following, and note that it declares a local object, and then stores the reference to the Timer object in that local object.

.locals init ([0] class [mscorlib]System.Threading.Timer t,
           [1] bool CS$4$0000)
...
IL_0014:  newobj     instance void [mscorlib]System.Threading.Timer::.ctor(class [mscorlib]System.Threading.TimerCallback,
                                                                             object,
                                                                             int32,
                                                                             int32)
IL_0019:  stloc.0

Now once I figured out what was going on, fixing it was trivial. A using statement around the disposable Timer object keeps it in scope, and deterministically cleans it up when appropriate. (Of course, this is how the code should have been written in the first place, but look at the cool problem I got to figure out as a result of my lazy coding.)

When I am kernel debugging, I use the !list command a LOT. It basically walks a doubly-linked list and dumps the memory of each list entry. You can also have it dump a more readable type, or have it run commands for each list entry. Today I ran across a list that was extremely large. I wasn’t really interested in the data so much as understanding just how large it was. So I had to figure out how to count the number of list entries and I figured I would share.

r $t0 = 0; !list " -x \"dd @$extret L0; r $t0 = @$t0 + 1\" 0xffffffff'0a1bcdef "; ? @$t0

Gibberish? Let’s dig in and analyze the statement. For starters, it’s a multi-part command, separated by semi-colons. The first statement sets a pseudo-register to 0, the next part uses the !list command to hit every list entry, and the last part just evaluates the pseudo-registry and dumps it’s value afterward.

r $t0 = 0
!list " -x \"dd @$extret L0; r $t0 = @$t0 + 1\" 0xffffffff'0a1bcdef "
? @$t0

Now the guts of this is in the !list command, which runs a multi-part command for each list entry. The first part of the command is just there to use up the @$extret parameter, so windbg doesn’t try to tack it onto the end of our second command, which is simply incrementing the value in the pseudo-register.

Voila! Now I know that my list has 21,000 or so entries in it, and I now understand why there just might be a performance problem. 😉

Happy debugging!

I hit a pretty frustrating problem the other day when I renewed my digital certificate for code signing and my drivers stopped working. When I renewed the certificate through VeriSign (Symantec), I was given the option of choosing a SHA-1 or a SHA-2 (with 256-bit digest) hashing algorithm for the certificate. After reading through the background information on these options, and understanding that the new and improved SHA-2 was supported on Vista and above, I determined that SHA-2 would meet my requirements and clicked the button.

I got the new cert, put it in my cert store, and reconfigured all my projects in Visual Studio to sign with the new cert. Everything was working perfectly. It installed and run fine on my test machine, but then one of my co-workers complained that the driver wouldn’t load on his machine. After a short while I realized that I had the same problem on my machine, I just wasn’t seeing it because I had a kernel debugger attached. (Note that there are some registry options that can enable mandatory kernel mode code signing even when the debugger is attached. Read more at MSDN.)

So now that I could reproduce the problem, I started troubleshooting the signing itself. Using “signtool.exe verify /kp”, I verified that the driver was signed, all certificates were valid, and the cross-certificate from Microsoft was also in the signing chain. Signtool claimed that everything was just fine, but the darn thing still wouldn’t load. I was getting the error ERROR_INVALID_IMAGE_HASH (0x80070241): “Windows cannot verify the digital signature for this file. A recent hardware or software change might have installed a file that is signed incorrectly or damaged, or that might be malicious software from an unknown source.” After searching the event logs for a bit, I found the Microsoft-Windows-CodeIntegrity log which contained an Event 3004 with basically the same message as the error code above.

After exhausting my Google-Fu skills with no results, we started brainstorming about what was different between the build yesterday that worked and the one today. Somebody mentioned encryption algorithms, and the light bulbs went on in my head. Adding the specific case of SHA-2 to my searching yielded a couple of pages: Practical Windows Code and Driver Signing and PiXCL: Signing Windows 8 Drivers.

Some of my own testing showed that I couldn’t get a driver built with Visual Studio and a SHA-2 certificate to load on both Windows 7 and Windows 8. Theoretically you could sign it twice with each algorithm (using the /fd switch of signtool), but Visual Studio with the WDK is not really set up to do that, I didn’t want to go monkeying with MSBuild stuff too much.

After calling Symantec support and describing the problem we were basically told, “oh yeah, the support for SHA-2 is not really there for a lot of systems.” You think? Maybe you could have mentioned this when we were renewing the certificate?

The moral of the story, dear reader, is stick with the SHA-1 hash algorithm for signing your kernel mode drivers, at least for the time being.

I ran into an interesting case while working on a registry filter driver the other day. Boiled down to its most basic form, my driver is blocking access to registry keys by name. I watch for a name that matches one in my list then fail the open/create. The interesting case is when a registry key is a symbolic link (I.e., was created with REG_CREATE_LINK and has a REG_LINK value named SymbolicLinkValue that points to a second key).

All my filtering work takes place in the Pre-Open/Create path, and because the name coming in doesn’t actually represent what really will be opened, I can’t exactly rely on the name. So after digging into how these operations get handled, I have found some very odd behavior. I see two different behaviors for an open/create under these conditions. I created a sample filter driver to test this and tested on Windows 8 x64 and Windows 7 x64.

  1. When I try to open the link source (the key that links to a second key), I see the following:
  1. RegNtPreOpenKeyEx with CompleteName containing the name of the key – I return STATUS_SUCCESS to allow processing to continue.
  2. RegNtPostOpenKeyEx with PreInformation->CompleteName containing the key name from the SymbolicLinkValue. Status contains STATUS_REPARSE.
  3. RegNtPreOpenKeyEx with Complete Name containing the key name of the link target (from step b).
  4. RegNtPostOpenKeyEx with Status equal to STATUS_SUCCESS.
  • After a small number of repetitions (usually 2) the above behavior stops, and is replaced with the following:
    1. RegNtPreOpenKeyEx with CompleteName containing the name of the key – I return STATUS_SUCCESS to allow processing to continue.
    2. RegNtPostOpenKeyEx with Status equal to STATUS_SUCCESS. PreInformation->CompleteName contains the original name from step a, but the Object contains a pointer to a registry object for the link target key.

    Additionally, the behaviors above seem to revert after some period of time. If I run a simple registry open command for the link source a hundred times, I see behavior #1 twice and then #2 the remainder of the time. But after doing that if I let the machine sit idle for 4-5 minutes and run it again, I will again see behavior #1 twice, followed by #2 for as many times as I care to repeat it.

    I e-mailed some contacts at Microsoft and they confirmed what I have been seeing. The registry maintains a key lookup cache that is intended to speed up open/create operations. One thing that the cache can do is remember symbolic links and avoid the need to reparse them every time they are opened. This is a bit frustrating for me, but it probably makes a significant difference in registry performance (which, incidentally is a heck of a lot faster than file system performance).

    Unfortunately it would seem that the way to solve this problem is to store some of the data I need internally to my driver based on watching the systems behavior. So, for example, I can watch for the STATUS_REPARSE result to come back, and do something with the CompleteName value. So I could add a context to the call in the pre-open, put the original complete name in the context, then in the post-open when I see STATUS_REPARSE, I can add an entry to a list that says “Name A links to Name B”, which I can thereafter use in my pre-open. I then also have to deal with cases where the link key gets removed (I can watch for a pre-open with OBJ_OPENLINK, place a context on the object, then watch for the delete, and clean up my internal cache). I also have to deal with the case where the link gets retargeted to something else (I can watch for the pre-open with OBJ_OPENLINK, context the object, watch for the pre-set-value for REG_LINK named SymbolicLinkValue and update my cache).

    I won’t show the solution code here, since its fairly complicated, and it is the intellectual property of my company, but the solution is fairly straight-forward once you understand what needs to be done. I will, however, show some of the code that I actually used to test this and grok the behavior.

    Continue reading

    A while back I wrote a tool called NativeTest and open-sourced it on CodePlex. You can view the original post about this at the FSLogix Blog. Today I was troubleshooting a problem with registry links that required me to look at what was really going on in the registry, without it redirecting me through symbolic links. So I turned to the NativeTest project, which was able to help out.

    For those who don’t already know, the registry can have symbolic links where a key actually is just a pointer to another key. So, for example, the HKLM\Software\CurrentControlSet key is a symbolic link created during system boot that points to HKLM\Software\CurrentControlSet001 (or some other numbered control set, depending on your hardware configuration).

    Viewing a registry symbolic link involves opening the key with a special flag (REG_OPTION_OPEN_LINK) and then querying the value named “SymbolicLinkValue”. The value’s data will be a fully qualified native path to another registry key. The only difficult part about doing this with NativeTest is figuring out that the numeric option for REG_OPTION_OPEN_LINK is equivalent to 8, which you can find in the Windows SDK header files.

    The NativeTest session for viewing the link at HKLM\System\ControlSet001\Control\Print\Printers is as follows:

    > openkeyex -name \registry\machine\system\controlset001\control\print\printers -access kr -options 8
    Status = 0x00000000
    Handle = 40 (AUTO-0)
    
    > enumvaluekey -handle AUTO-0 -index 0 -class full -bufferlen 256
    AUTO-HANDLE: 40
    Status = 0x00000000
    ResultLength = 210
    00 00 00 00 06 00 00 00     ........
    38 00 00 00 9a 00 00 00     8.......
    22 00 00 00 53 00 79 00     "...S.y.
    6d 00 62 00 6f 00 6c 00     m.b.o.l.
    69 00 63 00 4c 00 69 00     i.c.L.i.
    6e 00 6b 00 56 00 61 00     n.k.V.a.
    6c 00 75 00 65 00 46 00     l.u.e.F.
    5c 00 52 00 65 00 67 00     \.R.e.g.
    69 00 73 00 74 00 72 00     i.s.t.r.
    79 00 5c 00 4d 00 61 00     y.\.M.a.
    63 00 68 00 69 00 6e 00     c.h.i.n.
    65 00 5c 00 53 00 6f 00     e.\.S.o.
    66 00 74 00 77 00 61 00     f.t.w.a.
    72 00 65 00 5c 00 4d 00     r.e.\.M.
    69 00 63 00 72 00 6f 00     i.c.r.o.
    73 00 6f 00 66 00 74 00     s.o.f.t.
    5c 00 57 00 69 00 6e 00     \.W.i.n.
    64 00 6f 00 77 00 73 00     d.o.w.s.
    20 00 4e 00 54 00 5c 00      .N.T.\.
    43 00 75 00 72 00 72 00     C.u.r.r.
    65 00 6e 00 74 00 56 00     e.n.t.V.
    65 00 72 00 73 00 69 00     e.r.s.i.
    6f 00 6e 00 5c 00 50 00     o.n.\.P.
    72 00 69 00 6e 00 74 00     r.i.n.t.
    5c 00 50 00 72 00 69 00     \.P.r.i.
    6e 00 74 00 65 00 72 00     n.t.e.r.
    73 00                       s.
    
    > closekey -handle auto-0
    AUTO-HANDLE: 40
    Status = 0x00000000

    As you can see, this key is really just a pointer to HKLM\Software\Microsoft\Windows NT\CurrentVersion\Print\Printers.

    The next task is how can I actually create one of these links for my own testing, just so I don’t accidentally mess up one of the system keys. Creating a symbolic registry link is pretty easy. You just have to create they key with a certain option (REG_OPTION_CREATE_LINK), and then set the value named “SymbolicLinkValue”, which needs to be of type REG_LINK. Again, the hard part is figuring out that REG_OPTION_CREATE_LINK is the equivalent of 2.

    The session looks like this:

    > createkey -name \registry\machine\software\Test -access ka -options 2
    Status = 0x00000000
    Disposition = Created
    Handle = 48 (AUTO-0)
    
    > setvaluekey -handle auto-0 -name SymbolicLinkValue -type link -data \\Registry\\Machine\\Software\\Test2
    AUTO-HANDLE: 48
    Status = 0x00000000
    
    > queryvaluekey -handle auto-0 -name SymbolicLinkValue -class full -bufferlen 512
    AUTO-HANDLE: 48
    Status = 0x00000000
    ResultLength = 120
    00 00 00 00 06 00 00 00     ........
    38 00 00 00 44 00 00 00     8...D...
    22 00 00 00 53 00 79 00     "...S.y.
    6d 00 62 00 6f 00 6c 00     m.b.o.l.
    69 00 63 00 4c 00 69 00     i.c.L.i.
    6e 00 6b 00 56 00 61 00     n.k.V.a.
    6c 00 75 00 65 00 00 00     l.u.e...
    5c 00 52 00 65 00 67 00     \.R.e.g.
    69 00 73 00 74 00 72 00     i.s.t.r.
    79 00 5c 00 4d 00 61 00     y.\.M.a.
    63 00 68 00 69 00 6e 00     c.h.i.n.
    65 00 5c 00 53 00 6f 00     e.\.S.o.
    66 00 74 00 77 00 61 00     f.t.w.a.
    72 00 65 00 5c 00 46 00     r.e.\.T.
    53 00 4c 00 6f 00 67 00     e.s.t.2.
    
    > closekey -handle auto-0
    AUTO-HANDLE: 48
    Status = 0x00000000

    Once you have created it, it’s kind of fun to poke around at the key with regedit. If you, like me, decide to clean up your test key and delete it in regedit, you will find that deleting the link source actually follows the link and deletes the target. The link source remains and now displays an error about the system cannot find the file specified. In my testing, it seems to be impossible to delete the link source in regedit, so we need to look at one more thing in NativeTest: deleting the link source key. The only trick here is that you have to open the key with REG_OPTION_OPEN_LINK so that the system doesn’t try to follow the link (which, incidentally is what regedit is doing).

    The session looks like this:

    > openkeyex -name \registry\machine\software\test -access ka -options 8
    Status = 0x00000000
    Handle = 52 (AUTO-0)
    
    > deletekey -handle auto-0
    AUTO-HANDLE: 52
    Status = 0x00000000
    
    > closekey -handle auto-0
    AUTO-HANDLE: 52
    Status = 0x00000000

    Windows Installer stores information about components, products, etc. that are installed on the system in the registry. Most of these elements are identified by a GUID, but the Installer doesn’t always store the GUID directly in the registry. Instead it uses what I call SQUIDs, or Squished GUIDs. Basically they just reformat the GUID a bit and remove some of the extra characters that don’t add anything (hyphens and braces). You can read more about this here: A Brief Note on Installer GUIDs.

    Periodically when I am working on some installer code I can end up needing to search the registry for stuff that is related to a component I am working on. It is very helpful to be able to convert between SQUIDs and GUIDs. I usually just do this by hand, but today I decided I needed a bit more help doing this and wrote a tiny vbscript piece of code that does the conversion. On the chance that it might provide value for someone else, I have decided to post it here.

    Function SquidToGuid(s)
      SquidToGuid = _
        StrReverse(Mid(s, 1, 8)) & "-" & _
        StrReverse(Mid(s, 9, 4)) & "-" & _
        StrReverse(Mid(s, 13, 4)) & "-" & _
        StrReverse(Mid(s, 17, 2)) & StrReverse(Mid(s, 19, 2)) & "-" & _
        StrReverse(Mid(s, 21, 2)) & StrReverse(Mid(s, 23, 2)) & StrReverse(Mid(s, 25, 2)) & _
        StrReverse(Mid(s, 27, 2)) & StrReverse(Mid(s, 29, 2)) & StrReverse(Mid(s, 31, 2))
    End Function
    
    Function GuidToSquid(g)
      Replace g, "{", ""
      Replace g, "}", ""
      GuidToSquid = _
        StrReverse(Mid(g, 1, 8)) & _
        StrReverse(Mid(g, 10, 4)) & _
        StrReverse(Mid(g, 15, 4)) & _
        StrReverse(Mid(g, 20, 2)) & StrReverse(Mid(g, 22, 2)) & _
        StrReverse(Mid(g, 25, 2)) & StrReverse(Mid(g, 27, 2)) & StrReverse(Mid(g, 29, 2)) & _
        StrReverse(Mid(g, 31, 2)) & StrReverse(Mid(g, 33, 2)) & StrReverse(Mid(g, 35, 2))
    End Function

    In my day job, I almost exclusively use a kernel-mode debugger attached to a Hyper-V virtual machine as the target. Most of the serious issues I see are kernel-mode issues (naturally those are the worst because they tend to crash the machine, aka the Blue Screen of Death). But periodically I see something pop up in a user-mode component, and it’s honestly just a big pain in the butt to have to install debugging tools on the target machine, and then reproduce whatever is going on under that debugger. Especially since I am already in a debugger!

    I hit this just the other day when my user-mode component was failing an assertion, and I wanted to see what it was without having to jump through all those hoops.

    I recently figured out how to debug the user-mode process from my kernel mode debugger. It’s not perfect, but it works pretty well, and it’s certainly more efficient than the alternative. What you do is use the .process debugger command with a few parameters. So here are the series of commands that I used.

    !process 0 0
    .process /i /r /p ffffffff12345678
    g

    The first command is basic, and just lists all the processes on the machine. I used this just so I could get the process object pointer for use in the next command.

    The second command is where the magic happens. The switches to the process command attach to the process invasively, which means that the process will actually become active. Another does some address translation to make it easier to see the user-mode process space. The third reloads the user-mode symbols for the process. In my experience, sometimes I have to still do this manually.

    Finally, the go command is required because the target system actually has to activate the process. This will return almost immediately, and you will be running in the context of the user-mode process. You can now set breakpoints, step through code, etc. all without leaving the comfort of your kernel-mode debugger.

    Happy debugging!

    At work I have recently been working on some system software that integrates closely with the registry. This means I require some fairly in-depth knowledge about the Registry APIs, and I also need a way to test their functionality (both with and without my software involved).

    I have done this in the past by writing custom code, which then gets lost or out-of-date. This time around I decided to make a more interactive and reusable tool to help me accomplish this, and my company – FSLogix, Inc. – decided to release the tool as open source on CodePlex. I also wrote a blog article with more in-depth information about the tool over at the FSLogix Blog.

    If you have read my blog at all, you already know that I enjoy poking around at the assembly-code-level in a debugger. But I believe it’s really NOT just for the geek interested in the arcane. Understanding what your code looks like at the assembly level can help you figure out what’s really going on, and in some cases it can be a whole lot easier than the same task at a higher level language, such as C. (All you C# and Java guys are laughing at the idea that C is a high level language, but this idea can be applied to understanding the byte code generated from those languages as well.)

    As a first example, a good defensive programming technique is to use asserts to make sure that your assumptions about the code are true. In a case where your assumption turns out to be wrong, these asserts can make a debugging session a nightmare. In a complicated product, such as assert may make it impossible to step through code running on a different thread. Fortunately it is super-easy to disable such an assert right in the debugger by just changing an opcode. No re-compiling, re-deploying, getting the scenario set up to reproduce again. In the assembly code, you can find the test for your assertion, which will be followed by a jump instruction of some type. If you are testing the return value of a function for success, you may see something like:

    test eax, eax
    jne <address>

    If you look at the byte code for the jne (jump not equal) instruction it will likely begin with a 0x75 (assuming an x86 instruction set and a short jump). Disabling that assert is a simple as changing that 0x75 to an 0x74, which changes the jne instruction to a je (jump equal). Good bye assertion.

    Another very closely related example I ran into the other day. I forgot to put a logical negation in a statement. The if statement was testing for the exact opposite of what I intended. Well setting up the scenario to test it was about a 5 minute process. But using the same trick as with the assert, I was able to switch my code to do the right thing and continue debugging the remainder of the scenario.

    One more example that will be more applicable to you C# guys out there. In the latest version of the C# language and .NET Framework, Microsoft added a couple of new keywords: async and await. Well the descriptions of the functionality and the explanations I read didn’t quite add up. Being quite familiar with multi-threaded programming, I just didn’t feel like I was getting a full picture of what was really happening. (I actually asked about it at a developers conference, and the speaker said “I don’t know how they do it, it’s just magic.”) So I wrote a quick sample, fired up a debugger and looked at the generated byte code, to get the complete picture of what was happening. It works about as I expected it would, but without understanding assembly level code, I never would have been able to figure that out.

    So the next time you’re tempted to think that assembly language programming is a dead art, think again.