Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Better CFI Format #558

Open
Swatinem opened this issue May 3, 2022 · 13 comments
Open

RFC: Better CFI Format #558

Swatinem opened this issue May 3, 2022 · 13 comments

Comments

@Swatinem
Copy link
Member

Swatinem commented May 3, 2022

Status Quo

We have this thing called symbolic-mimidump::cfi::AsciiCfiWriter which outputs breakpad ASCII STACK records for all the various debug formats that we support. We use this as part of symbolicator, and afaik mozilla uses it as part of their dump_cfi utility.

Problem Statement

This ascii format has a couple of shortcomings:

  • It needs to be parsed, either ahead-of-time, or lazily.
  • Parsing it is super slow, and needs to happen every damn time.
  • It might even end up being larger than the actual unwind info.
  • It is low fidelity and a bad common denominator; for example it does not support some DWARF operations like "set return addr to 0; aka end of stack".
  • Did I mention its a text format that needs to be parsed?

Proposed Solution

So I was thinking for quite some time about an "indexed" format that I don’t need to parse from beginning to end every single time, but can quickly look up unwind info based on instruction offset.

Also, while working on #549 I thought that converting the unwind instructions into this bad intermediate text format is a bad fit, since it would be a lot nicer to just execute the unwind operations.

Long story short, how about we had a serialized, mmap-able format similar to SymCaches that have something like the following format, in pseudocode:

struct CfiCache {
  ranges: BTreeMap<usize, UnwindInfo> // instruction addr => unwind info
}

enum UnwindInfo {
  Breakpad(String), // same as now, just tiny indexed snippets so you don’t need to parse the whole file ahead of time
  WindowsX64(goblin::pe::exception::UnwindInfo), // well, a reference to a binary representation of the raw info
  Dwarf(gimli::read::CallFrameInstructionIter), // again, binary representation of the raw DWARF unwind info
  Compact(symbolic_debuginfo::macho::compact::CompactCfiOpIter), // again, same for apples format
  // etc, whatever other formats there are
}

The proposed CfiCache / UnwindInfo would implement minidump_processor::symbols::SymbolProvider (or at least walk_frame) to just execute the provided unwind info directly, without needing to go through that horrible intermediate format.

Open Questions

Since we are not the only users of this code, I would have some questions especially for external users: (hello @Gankra, @gabrielesvelto, etc)

  • Pls give feedback on the proposal ;-)
  • Would a "unwind info -> breakpad ASCII" converter still be useful in that scenario? Can we just remove that completely?
  • Would you expect to create such an unwinder directly from an object file, without needing to go through an intermediate format/struct?
  • How transparent / opaque should this format be? Is it sufficient to have "object file -> (opaque intermediate format) -> .unwind(caller frame) -> Option<callee frame>"; as in: having unwind being the only public API it has? Or would you expect to have access to the underlying raw unwind info (raw DWARF bytes; whatever)?
@gabrielesvelto
Copy link
Contributor

We'd definitely be interested in this. We aren't too fond of using native debug information for symbols & unwinding information because it's too large and because we want to retain the ability to extract symbols from things we don't build ourselves (or closed-source binaries). A format that would be better than Breakpad's symbol format would be most welcome. Especially now that the Breakpad format is being extended to support additional stuff in ways that seem even less optimal than the existing ones (see the INLINE directives and the size explosion they cause in .sym files).

The idea of being able to just mmap the files would be very, very nice however it would need careful versioning to prevent the code from trying to access a file that's too new or too old.

@gabrielesvelto
Copy link
Contributor

Would a "unwind info -> breakpad ASCII" converter still be useful in that scenario? Can we just remove that completely?

No, we'd get rid of it and just do debug -> binary symbol/unwinding files

Would you expect to create such an unwinder directly from an object file, without needing to go through an intermediate format/struct?

Yes, ideally we'd do what dump_syms does, but into a binary format instead of a Breakpad symbol file

How transparent / opaque should this format be? Is it sufficient to have "object file -> (opaque intermediate format) -> .unwind(caller frame) -> Option"; as in: having unwind being the only public API it has? Or would you expect to have access to the underlying raw unwind info (raw DWARF bytes; whatever)?

I wouldn't care about the underlying debug info. If we could encapsulate symbols & unwinding info in a binary file that can be mmap'd and used as a replacement for .sym file then that'd be more than enough. We can always use native debug info for use cases that aren't stack-walking & profiling.

@mstange
Copy link
Contributor

mstange commented May 3, 2022

Is this format intended as a long term storage format, or as a "caching" format similar to the symcache format, where the cache files need to be regenerated from the binary source file every time the cache format changes?

If it's the latter, then the original binaries will still need to be kept around and Mozilla would not be able to use it as a replacement for the sym format, unless we're ready to keep around all binaries, even binaries we don't build ourselves or which are closed source.

And since it's only intended to store unwind information and not symbol information, Mozilla would still need a format for long-term storage of symbol information if we want to stop using .sym - or, again, commit to storing the original binaries.

So really, I think Mozilla needs to pick one of these three solutions: 1. Keep .sym, 2. Create a replacement for .sym aimed at long term storage (actually two replacements, one for symbols and one for unwinding), or 3. Keep around all binaries, including ones we don't build ourselves or which are closed source.

@mitsuhiko
Copy link
Member

Our historic view has been that we want to retain the original source files for flexibility for future product direction. However I think at the same time having a stable symcache (or cficache) format is starting to look quite appealing for situations where we do not need the entire source debug file. For instance our iOS symbol harvesting at the moment keeps these reassembled macho containers with symbol tables in them. They are already so limited that we might as well only keep symcaches there but we have opted against this for now due to instability in the format.

That said, keeping the code around for older versions of the format has lasting impact on maintainability.

@gabrielesvelto
Copy link
Contributor

If it's the latter, then the original binaries will still need to be kept around and Mozilla would not be able to use it as a replacement for the sym format, unless we're ready to keep around all binaries, even binaries we don't build ourselves or which are closed source.

I'm assuming this format would be relatively stable over time, like .sym files, frequent churn would indeed be problematic for us.

And since it's only intended to store unwind information and not symbol information, Mozilla would still need a format for long-term storage of symbol information if we want to stop using .sym - or, again, commit to storing the original binaries.

Since the proposal is about replacing the .sym format I assumed that this would also contain symbols. Being able to look up symbols & file/line numbers directly would be beneficial. If it's only about unwinding information then it would be problematic as I wouldn't want to have to deal with two different containers.

So really, I think Mozilla needs to pick one of these three solutions: 1. Keep .sym, 2. Create a replacement for .sym aimed at long term storage (actually two replacements, one for symbols and one for unwinding), or 3. Keep around all binaries, including ones we don't build ourselves or which are closed source.

IIUC this would be going towards 2 because that would interest us. Sticking with 1 is not painless BTW, we still have to implement INLINE directives which is something I had scheduled for sometimes this year. If a better format is potentially on the horizon I'd hold my horses though. 3 is a huge hassle, especially for Apple binaries (they're large) and Linux ones (they're a lot).

@Swatinem
Copy link
Member Author

Swatinem commented May 4, 2022

We internally have a split between stack unwinding, and symbolication. Both as phases in code, as well as in intermediate cache formats we use. And those cache formats are being generated from split debug info anyway (unwinding uses exe / dylib; symbolication uses pdb / dsym; … and either combined or split dwarf for linux).

Those formats are not intended for long-term storage primarily for the reasonthat the info in them might be low-fidelity or plain wrong in case our converters are buggy, which they were a couple of times in the past. (#546 comes to mind immediately)

We are also working on a breaking change release that would drop support for older symcache formats (#550) as maintaining backwards compatibility is not worth the maintenance burden for us.

I would recommend to keep the original binaries around. @mstange recommended on discord to use some objcopy variations to strip away all unneeded sections to cut down on size. I don’t really have too much insight into what storage requirements we are talking about for our apple symbols or customer uploaded symbols. I know that our rolling cache that expires after 3 months just hit its peak at around ~40TiB. @mitsuhiko may enlighten us about the other long term storage that we keep for debug files.

@Gankra
Copy link
Contributor

Gankra commented May 4, 2022

Surely if you strip out everything but the parts of the binary(s) that don't contain unwind/line/symbol info, then the result would be on the order of (and probably smaller than) .sym files anyway?

I've never really understood what the actual historical motivation for .sym files was, in this regard (since dumpsyms "has to" understand the native formats anyway).

@Swatinem
Copy link
Member Author

Swatinem commented May 5, 2022

I think that the original motivation was rather that it depended on some Windows system libraries to read pdbs, on Windows only.

I’m not entirely sure about unwind info, but DWARF is notoriously redundant, as linkers don’t optimize it, but just patch up some offsets and concatenate the whole thing.

@gabrielesvelto
Copy link
Contributor

I've never really understood what the actual historical motivation for .sym files was, in this regard (since dumpsyms "has to" understand the native formats anyway).

IIRC there were several, including the inability to parse PDB files w/o relying on Microsoft libraries as @Swatinem said and the notorious buggyness of compilers' DWARF output. The .sym format is easy to validate, dealing with corner cases and bugs in DWARF was nightmarish in a pre-Rust world.

@Gankra
Copy link
Contributor

Gankra commented May 5, 2022

Hmm I don't understand the DWARF motivation. Surely dump_syms has to handle/suffer those corners/bugs as much as an unwinder based on that info...?

@gabrielesvelto
Copy link
Contributor

gabrielesvelto commented May 5, 2022

Yeah but dump_syms was much simpler back then. There was a dedicated one on each platform so whatever hack you had to make would only happen in one place. The stack walker on the other hand was platform-independent and relied on the .sym files alone for that. Plus storing .sym files gives you everything you need without having to store and serve binaries which in some cases is not possible (like with Apple libraries).

@luser
Copy link
Contributor

luser commented May 27, 2022

I wasn't involved in Breakpad from the very beginning, so I can't speak to the precise original motivation, but I know that the primary reason for translating native symbols to .sym files was to avoid having to parse debug info on the server-side, which at the time, specifically for PDB files, was a difficult affair. The original sym file design was primarily for symbolication + source line info, but the STACK WIN bits were added very early on, since x86 unwinding without frame pointers is effectively impossible otherwise. Jim Blandy designed and implemented the entire STACK CFI system as a way to represent DWARF CFI in order to handle x86-64 unwinding, and then I shoehorned things like Microsoft's x64 unwind tables and the ARM unwind tables into it because it was sufficiently expressive.

Keep in mind that all of this predates Rust being a viable implementation language, and the current state of affairs with robust Rust crates for parsing all of these formats really changes things.

If I were to redesign things from scratch right now, I would argue for using the native binaries/symbol files as much as possible, and layering things like symcache on top purely as an optimization. symcache makes sense given that DWARF debug info is not really optimized for lookups–even GDB builds its own index to speed those up. Most unwind tables are optimized for performance, however, since they're designed to be used at runtime for unwinding from exceptions and such. Given that, and the existing issues with fidelity when translating between native unwind tables and Breakpad's STACK CFI format, I think there's a strong argument to be made for simply using unwind tables in their original format as-is in the stackwalker. For a symcache equivalent, I could imagine a simple container format that stores the original data with a small header that notes the format of the unwind tables within.

You could certainly design a generic representation of unwind tables from scratch, but I suspect you'd ultimately wind up with something that looks a lot like DWARF CFI anyway, given that that's what it was designed for.

@gabrielesvelto
Copy link
Contributor

For us the two primary reasons for only keeping symbols around rather than the executables & debuginfo are size - binaries and native debuginfo can be very large - and distribution issues - we don't want to host things we haven't built ourselves.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants