Writing a Debugger From Scratch - DbgRs Part 8 - Source and Symbols
(New to this series? Consider starting from part 1)
At the end of the last post, DbgRs gained the ability to disassemble the code that it was debugging. While disassembly is critical for many debugging tasks, using source code to step through the code line by line is usually more convenient when it’s available. In this post we’ll start looking at how to use symbols (PDB files on Windows) so that we can display the source for the code that’s being debugged.
The DbgRs code for this post is in the part8 branch on github. You can also view the changes from part7. If you spot any mistakes or have suggestions for improving the code, feel free to create issues on the GitHub repo or submit a PR.
What’s in a PDB?
A symbol file is the debugger’s link between the compiled code and the original source code. Not just the correlation between the lines of code and the assembly instructions, but also function signatures, type definitions, local variables, and more. In the Windows world, this file is called a “PDB File”, which stands for Program DataBase. (Previously, debug information was stored directly in the executable, or optionally in a .DBG file but this is a long-dead format). These days, there are a number of different symbol formats known as “PDB”, including the .NET “Portable PDB” format, as well as the “Fastlink” PDB format that relies on the debug information on the original .OBJ files. For the purpose of this blog post, however, we will ignore those and focus on the “original” flavor of PDB.
While the PDB format is not documented, Microsoft released the Microsoft-PDB repo which contains the source code to a number of key tools that were used for generating and parsing the PDB format. That’s been sufficient for a number of other projects including LLVM to create PDB files that are compatible with WinDbg and Visual Studio. It’s also been the basis for some open source projects that parse Windows PDB files including the pdb crate used by DbgRs.
I’m not going to cover the file format in depth, but I think it’s useful to understand the rough outline of what a PDB contains. At the top level you have the MSF (Multi-Stream File) container that allows a file to be broken up into a collection of numbered streams. It’s a sort of mini-filesystem where individual blocks of the file can be allocated to different streams. Several fixed stream indices are used (streams 0-4), and one of these streams (stream 1, the “PDB stream”) includes a name/index mapping to allow streams to have a name in addition to an index.
Some of the information, like the type description stream, is global for the entire binary and is stored in a single named or indexed stream. The rest of the information is stored on a per-module basis, where the “modules” are the individual compilation units (like OBJ and LIB files) that were statically linked together to create the binary. The list of modules is stored in the “DBI stream” (index 3) and each module has an MSF stream where the symbol information is stored. That stream of data is split into two important parts: The “Symbol Table” and the “Line Data”.
The Symbol Table
The Symbol Table is a sequence of variable-length records, each with a type field that indicates what sort of information is contained. For instance, the S_GPROC32 indicates the start of a global procedure. There are more than a hundred different types of records. The ordering of the records is important, as there are “start” and “end” records of various types. For instance, there can be various records annotating information about a procedure in between an S_GPROC32 record and the terminating S_END record. These child records can describe exception handlers, local variables, parameters, and more.
We’ll look more at the other record types later, but we can start by just looking at the S_GPROC32 and similar entries to get better name resolution in our expression evaluator. Our current name to address resolution only works for export functions, which is pretty inconvenient. We’ll start by factoring out the code for searching the exports, and adding a new function that looks at the symbol information:
pub fn resolve_function_in_module(module: &mut Module, func: &str) -> Option<u64> {
// We'll search exports first and private symbols next
let export_resolution = resolve_export_in_module(module, func);
if export_resolution.is_some() {
return export_resolution;
}
resolve_symbol_name_in_module(module, func).unwrap_or(None)
}
To resolve a name to an address, we’ll need to get the list of modules (compilation units) described by the DBI stream. We’ll also need the “address map” that allows mapping between a section index + offset in the original unoptimized binary to the final RVA as it exists in the binary we are debugging.
fn resolve_symbol_name_in_module(module: &mut Module, func: &str) -> Result<Option<u64>, anyhow::Error> {
let pdb = module.pdb.as_mut().ok_or(anyhow!("No PDB loaded"))?;
let dbi = pdb.debug_information()?;
let mut modules = dbi.modules()?;
let address_map = module.address_map.as_mut().ok_or(anyhow!("No address map available"))?;
Given a function name, we don’t know which compilation unit it came from, so we’ll search through each module in order, and for each module we’ll iterate over the contained symbol records.
while let Some(pdb_module) = modules.next()? {
let mi = pdb.module_info(&pdb_module)?.ok_or(anyhow!("Couldn't get module info"))?;
let mut symbols = mi.symbols()?;
while let Some(sym) = symbols.next()? {
The pdb crate we’re using has a nice way of representing symbol data so that various related records like S_GPROC32 and S_LPROC32 can all be represented as a single logical type. So we can look for anything that’s a “Procedure” and check if it has a name that matches the name we are looking for. If we find it, we’ll convert the section index + offset into an RVA, and then convert that to a virtual address by adding the module base.
if let Ok(parsed) = sym.parse() {
if let SymbolData::Procedure(proc_data) = parsed {
if proc_data.name.to_string() == func {
let rva = proc_data.offset.to_rva(address_map).ok_or(anyhow!("Couldn't convert procedure offset to RVA"))?;
let address = module.address + rva.0 as u64;
return Ok(Some(address));
}
}
}
}
}
Ok(None)
}
We already had the resolve_function_in_module function connected to the expression evaluator, so we should be able to use this with the “bp” command right away. Let’s try it.
Command line was: '"C:\git\HelloWorld\hello.exe" '
LoadDll: 7FF7AB070000 hello.exe
[87F0] 0x00007ffecfe2aa20
> bp hello!main
[87F0] 0x00007ffecfe2aa20
> g
LoadDll: 7FFECFDD0000 ntdll.dll
[87F0] ntdll.dll!RtlUserThreadStart
> g
LoadDll: 7FFECE9F0000 C:\Windows\System32\KERNEL32.DLL
[87F0] ntdll.dll!NtMapViewOfSection+0x14
> g
LoadDll: 7FFECD280000
[87F0] ntdll.dll!NtMapViewOfSection+0x14
> g
Exception code 80000003 (first chance)
[87F0] ntdll.dll!LdrInitShimEngineDynamic+0x345
> g
Breakpoint 0 hit
[87F0] hello.exe!main
>
With that out of the way, we can start looking at how to use the line data to map between code addresses and source lines.
Mapping Addresses to Line Numbers
The information about line number mapping is stored in the “Line Data” (aka line number debug info), which is in the MSF stream for the module, immediately following the symbol table. We can use this information to map a code address like the current instruction pointer to a file name and line number. Using that we can load the corresponding source file and display each line to the user as they step through the code. We’ll start by grabbing the pdb
object that gives us access to the contents of the symbol file.
pub fn resolve_address_to_source_line(address: u64, process: &mut Process) -> Result<(String, u32)> {
let module = process.get_containing_module_mut(address).ok_or(anyhow!("Module not found"))?;
let pdb = module.pdb.as_mut().ok_or(anyhow!("Symbols not available"))?;
The source line information, like most data in the PDB file, is stored in terms of the original section number and section offset, so we first need to convert from an address in the process to an RVA, and then from an RVA to the original section + offset address, which the PDB module calls an “internal offset”.
let address_map = module.address_map.as_mut().ok_or(anyhow!("Address map not found for module"))?;
let rva: u32 = (address - module.address).try_into()?;
let rva = Rva(rva);
let offset = rva.to_internal_offset(address_map).ok_or(anyhow!("Couldn't map address"))?;
Next, we’ll enumerate over all the compilation units/modules looking for one that contains a line mapping for the offset that we just found.
let dbi = pdb.debug_information()?;
let mut modules = dbi.modules()?;
while let Some(module) = modules.next()? {
if let Ok(Some(mi)) = pdb.module_info(&module) {
if let Ok(lp) = mi.line_program() {
The PDB stores line mappings in chunks that correspond to functions or parts of functions. We can use the pdb to get the group of line mappings that corresponds to a particular address, but we’ll still have to enumerate over the line mappings to find the right one. Source lines are not one-to-one mapped with code addresses. A single line of source typically compiles to multiple assembly instructions. Beyond that, optimizations can cause more ambiguity when resolving a source line to an address (more on that later). But mapping from a code address to a source line is mostly unambiguous, and we can simply find the first line entry that contains the code address we’re looking for.
let mut lines = lp.lines_for_symbol(offset);
while let Some(line) = lines.next()? {
if line.offset.offset <= offset.offset {
let diff = offset.offset - line.offset.offset;
if diff < line.length.unwrap_or(0) {
Once we find the right one, we can look up the information about the source file in a separate table, and then use information in that structure to get the actual name of the file from the “string table” (used by the PDB format to deduplicate common strings). And at this point, we have a source file and line number to return.
let file_info = lp.get_file_info(line.file_index)?;
let string_table = pdb.string_table()?;
let file_name = string_table.get(file_info.name)?;
return Ok((file_name.to_string().into(), line.line_start))
}
}
}
}
}
}
Err(anyhow!("Address not found"))
}
Using this we’ll hook up a simple “lsa” command.
CommandExpr::ListSource(_, expr) => {
if let Some(val) = eval_expr(expr) {
match resolve_address_to_source_line(val, &mut process) {
Ok((file_name, line_number)) => {
println!("LSA: {}:{}", file_name, line_number);
if let Ok(file) = File::open(&file_name) {
let reader = io::BufReader::new(file);
let lines: Vec<_> = reader.lines().map(|l| l.unwrap_or("".to_string())) .collect();
And to make sure there’s enough context, we’ll print a small window of source lines around the target line. One important thing to remember is that line numbers are not 0-indexed, since the first line of a file is line 1!
for print_line_num in (max(1, line_number - 2))..=(min(lines.len() as u32, line_number + 2)) {
if print_line_num == line_number {
println!(">{:4}: {}", print_line_num, lines[print_line_num as usize - 1]);
} else {
println!("{:5}: {}", print_line_num, lines[print_line_num as usize - 1]);
}
}
} else {
println!("Couldn't open file: {}", file_name);
}
},
Err(e) => {
println!("Couldn't look up source: {}", e);
}
}
}
}
Now we get back source files and line numbers!
> lsa hello!main
LSA: C:\git\HelloWorld\hello.cpp:35
33:
34: int main()
> 35: {
36: printf("Hello world, %d", func1());
37: printf("Hello world 2!");
The source file came back as an absolute path, so we were able to find the file that was just compiled locally. But some compilation settings will result in a relative path being stored, and binaries that were compiled on a different machine likely won’t have the right source path for the local machine even if it is an absolute path. If we try to list the source for one of the CRT functions, we get this.
> lsa hello!common_lseek_do_seek_nolock
LSA: minkernel\crts\ucrt\src\appcrt\lowio\lseek.cpp:22
Couldn't open file: minkernel\crts\ucrt\src\appcrt\lowio\lseek.cpp
To get this working, we’ll need to implement source search functionality.
Source search
The relative path found in the symbols for the UCRT source file can’t be opened directly, even though I do have the source file available in my SDK install folder. We can get these paths to match up by creating a source search algorithm that takes a set of search paths as potential “roots” to combine with the source path we are looking for. First, we’ll add a “.srcpath” command that can set a Vec of search paths.
CommandExpr::SrcPath(_, path) => {
source_search_paths.clear();
source_search_paths.extend(path.split(';').map(|s| s.to_string()));
}
Now given that list of search paths, we need to find the correct matching path. If the path is an absolute path and exists, we’ll just return that source file. Otherwise, we’ll start iterating through the list of search paths. For each one, we’ll start with the full starting path minus any drive specifier, e.g. “minkernel\crts\ucrt\src\appcrt\lowio\lseek.cpp” and append it to the search path. If the file isn’t found, we’ll start removing path elements from the front, one by one, looking for existing files on disk, until we’re left with just the filename appended to the search path. That way, a search path can point to different levels of the filesystem and still find the right source file as long as some trailing part of the original path matches a local file. So even if the original path was on “Z:\foo\bar\baz\file.cpp”, we can store the files at “C:\src\baz\file.cpp” and have a search path of “C:\src” and the file will be found.
pub fn find_source_file_match(file: &str, search_paths: &Vec<String>) -> Result<PathBuf> {
let file_path = Path::new(file);
// If the file path is absolute and exists, return it immediately.
if file_path.is_absolute() && file_path.exists() {
return Ok(file_path.to_path_buf());
}
// Get all subsets of the input path.
let components: Vec<&str> = file_path.components().map(|c| c.as_os_str().to_str().unwrap()).collect();
for search_path in search_paths {
let search_path = Path::new(search_path);
for i in 0..components.len() {
// Join the search path with the subset of the input path.
let test_path: PathBuf = search_path.join(components[i..].iter().collect::<PathBuf>());
if test_path.exists() {
return Ok(test_path.to_path_buf());
}
}
}
Err(anyhow!("File not found"))
}
If we go back to the case that failed earlier and add a search path pointing to the ucrt source in my local installation of the SDK, we’ll see that DbgRs can now find this file.
> .srcpath C:\Program Files (x86)\Windows Kits\10\Source\10.0.22000.0\ucrt
[8AC8] 0x00007ffecfe2aa20
> lsa hello!common_lseek_do_seek_nolock
LSA: minkernel\crts\ucrt\src\appcrt\lowio\lseek.cpp:22
Found matching file: C:\Program Files (x86)\Windows Kits\10\Source\10.0.22000.0\ucrt\lowio\lseek.cpp
20: // end up moving out of range of 32-bit values.
21: static long __cdecl common_lseek_do_seek_nolock(HANDLE const os_handle, long const offset, int const origin, __crt_cached_ptd_host& ptd) throw()
> 22: {
23: LARGE_INTEGER const origin_pos = { 0 };
24:
It works! Now there is one more problem here, and that is that the content of the file could be totally wrong. Maybe it’s the wrong commit of a repository, or the filename just happens to match and is an unrelated file. Fortunately, the PDB file includes a hash of the source file as it was compiled. There are a number of different hash algorithms that can be used, but new compilers will default to a SHA256 hash. So we can pass that hash along to the find_source_file_match function and it can use that to reject any source files without a matching hash. But for the purposes of DbgRs, I’ll leave that functionality out. (Feel free to make a PR if you’d like to contribute!)
Mapping line numbers to addresses
While we’re stepping through the code, we need to map each address to the corresponding source line, but when setting a source-level breakpoint, we need to do the opposite and map a source line to an address. To do that we’ll extend the expression evaluator to take a syntax that allows us to specify the module, source file name, and line number, and allow that to resolve back to an address that we can use for a breakpoint. This is going to be a bit of a verbose syntax to type on the command line, but if DbgRs gets integrated into a GUI like VsCode, all that information will be available in the UI layer to pass along to the debugging engline.
The syntax for specifying a source line in WinDbg involves using the @@masm
syntax, which is a bit arcane:
0:000> ? @@masm(`notepad__!C:\git\npp\PowerEditor\src\winmain.cpp:721`)
Evaluate expression: 140697038077034 = 00007ff6`94f9bc6a
We’ll simplify that a bit to get rid of the “masm” part but otherwise leave it mostly the same. We can add a new case for our EvalExpr enumeration, and add a regular expression that tries to parse something that looks roughly like “module!file:line”.
#[rust_sitter::language]
pub enum EvalExpr {
Number(#[rust_sitter::leaf(pattern = r"(\d+|0x[0-9a-fA-F]+)", transform = parse_int)] u64),
Symbol(#[rust_sitter::leaf(pattern = r"(([a-zA-Z0-9_@#.]+!)?[a-zA-Z0-9_@#.]+)", transform = parse_sym)] String),
SourceLine(#[rust_sitter::leaf(pattern = r"(`[^`!]+!(?:[a-zA-Z]+:)[^`:!]+:\d+`)", transform = parse_source_line)] (String, String, u32)),
We’ll also need to extend the expression evaluator to handle this case:
pub fn evaluate_expression(expr: EvalExpr, context: &mut EvalContext) -> Result<u64, anyhow::Error> {
match expr {
...
EvalExpr::SourceLine((src_module, src_file, src_line)) => {
resolve_source_line_to_address(&src_module, &src_file, src_line, context.process)
}
To implement resolve_source_line_to_address
, we’ll search through all compilation units to find any that reference the source file that we’re looking at.
pub fn resolve_source_line_to_address(module_name: &str, src_file: &str, src_line: u32, process: &mut Process) -> Result<u64> {
...
let mut modules = dbi.modules()?;
while let Some(module) = modules.next()? {
if let Ok(Some(mi)) = pdb.module_info(&module) {
if let Ok(line_program) = mi.line_program() {
if line_program_references_file(&line_program, src_file, &string_table)? {
The implementation of line_program_references_file is just a straight linear search through the list of files returned from the LineProgram (class provided by the pdb create).
fn line_program_references_file(line_program: &LineProgram, src_file: &str, string_table: &StringTable) -> Result<bool> {
let mut files = line_program.files();
while let Some(file) = files.next()? {
let cur_file_name = string_table.get(file.name)?.to_string();
if cur_file_name == src_file {
return Ok(true);
}
}
Ok(false)
}
For any compilation unit that references the source file in question, we’ll search through the line mappings to find one that matches. There are faster ways of doing this search, but I don’t see an easy way to do better than a linear search with the pdb crate. But it’s fast enough for our purposes and is fairly easy to read.
let mut lines = line_program.lines();
while let Some(line) = lines.next()? {
let cur_file_name_ref = line_program.get_file_info(line.file_index)?.name;
let cur_file_name = string_table.get(cur_file_name_ref)?.to_string();
if cur_file_name == src_file && line.line_start <= src_line && src_line <= line.line_end {
let rva = line.offset.to_rva(&address_map).ok_or(anyhow!("Could not map source entry to RVA"))?;
let address = process_module.address + rva.0 as u64;
return Ok(address);
}
}
It’s important to note again that the mapping from source line to address is not a one-to-one mapping. A single source line nearly always compiles to multiple assembly instructions. So in this case we’re returning the first instruction in that block of instructions (which is typically what you want for a breakpoint). The instructions that result from a single line of code being compiled are also not necessarily contiguous, so you could have a few assembly instructions for line 1, a few instructions for line 2, and then some more for line 1 again. And beyond that, there’s also cases where a function gets inlined, so a single source line can map to hundreds or thousands of addresses! For the sake of simplicity, we’ll ignore all of that and simply return the first match we find. In a real debugger, you could not ignore that problem or you would have very unhappy users!
Now that we can resolve source lines to addresses and vice versa, we can easily test that we can “round-trip” a source line to an address and back to a source line.
> lsa `hello!C:\git\HelloWorld\hello.cpp:35`
LSA: C:\git\HelloWorld\hello.cpp:35
Found matching file: C:\git\HelloWorld\hello.cpp
33:
34: int main()
> 35: {
36: printf("Hello world, %d", func1());
37: printf("Hello world 2!");
It works! We can even set breakpoints using this:
> bp `hello!C:\git\HelloWorld\hello.cpp:37`
[6DE8] 0x00007ffb0ebaaa20
> bl
0 0x00007ff7f8f27207 (hello.exe!main+0x17)
[6DE8] 0x00007ffb0ebaaa20
That’s a big step towards being useful for people that want to do more than read assembly code! But unfortunately, all of our data is still in registers and memory, so we’re still not a “source level” debugger yet. That will have to be the next project, which will use the symbolic information to show information about local variables, with names and type information.
Hope you enjoyed this post about DbgRs! Have a question or suggestion? Let me know! You can find me on Twitter, Mastodon, and Bluesky.