Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix file-local symtab lookups #281

Merged
merged 8 commits into from
Feb 3, 2025
Merged

Fix file-local symtab lookups #281

merged 8 commits into from
Feb 3, 2025

Conversation

kquick
Copy link
Member

@kquick kquick commented Feb 3, 2025

An LLVM Bitcode file, it is comprised of nested Blocks of information, with
records available at each block. Different blocks hold different types of
information, and the nesting represents program scope and what is
defined/accessible at a particular point within the program.

The Parse and Finalize monads maintain a set of symbol tables to use for
lookups, as stored in the Env structure referenced by those monads:

  • valSymtab for value references

  • fnSymtab for function references

  • bbSymTab for basic-block references, either "named" by association with
    a symbol name (like the function's entry block) or "anonymous", where
    there is no direct symbol but there may be a block label (like for a goto
    statement or similar surface language construct.)

These lookups are necessary to "relabel" values that reference symbol or label
addresses in the code with the actual targets as resolved by processing the
entirety of the LLVM bitcode file. The llvm-pretty-bc-parser runs in two
phases: the initial phase where it processes the LLVM bitcode stream to create
"Partial" representations of all of the elements, followed by a "finalization"
phase where it performs all of the label references above (via the llvm-pretty
relabel operation), as well as other fixups to convert the "Partial" data
structures into the structures defined by the llvm-pretty AST.

However, the tables in the Env structure previously only held the direct
symbol tables (i.e. VALUE_SYMTAB_BLOCK, found immediately within the
module's top-level MODULE_BLOCK. Symbol resolution was performed by calling
bbEntryName or requireBbEntryName. This was sufficient to resolve global
symbol references.

When finalizing functions, the Function finalization defined an alternate
resolution type (BlockLookup) and function (lookupBlockName) to use with
relabel for handling the code within the function. The bbEntryName and
requireBbEntryName did not accept a symbol (identifying the function or
Define in LLVM IR parlance) so they could only reference the top-level
symbol tables, whereas lookupBlockName was used if the relabel indicated
there was an associated symbol: it would find the (partial) symbol table for
the function (Define) associated with that symbol and perform the bbSymtab
lookup from that block.

This duplication of functionality should have been a suggestion that something
was not right, particularly when the bbEntryName and requireBbEntryName
uses simply discarded any supplied symbol name. But it worked most of the
time, so this went unnoticed for quite some time (eleven years or so), but it
finally broke when the assumptions about the ability to discard symbol names
was proven to be a bad.

What was it that revealed the invalidity of this assumption?

Answer: Function-local label references.

Consider the following

  int has_local_static(int x) {
    static const void *const disptab[NUM] = { &&fn1, &&fn2, &&fn3 };
    int y;
  fn1:
    ...
    goto *disptab[y];
  fn2:
    ...
    goto *disptab[y];
  fn3:
    return y;
  }

In this code, disptab is a local static. To implement this, clang adds the
definition of disptab to the top-level VALUE_SYMTAB_BLOCK. However, the
values in that list are local references to has_local_static. The relabel
operation will supply the has_local_static symbol, but VALUE_SYMTAB_BLOCK
finalization (finalizeGlobal) uses requireBbEntryName and discards that
supplied symbol, thus failing to find the symbol in the top-level symbol block
and halting the parse with fail--or even worse, finding an unrelated global
basic-block reference and using that to create invalid references. Neither of
these is any kind of good.

To fix this, this PR enhances the reader environment for Finalizeto also maintain the Define symtabs entries obtained during parsing, and then enhancing bbEntryName (and
requireBbEntryName) to lookup any symbols and perform bbSymtab resolution in the symbol
table associated with that definition. This means the relabel-supplied symbol
names are no longer ignored, and also means that bbEntryName is capable of
performing the full resolution that lookupBlockName was concocted to handle,
so the latter is removed as obsolete. These changes allow us to successfully
parse the bitcode for the has_local_static function above (also added as a
test in disasm-test/tests/localstatic.c).

A test is included; running the cabal run disasm-test or simply cabal run llvm-disasm -- disasm_test/test/localstatic.bc at the commit where this was introduced shows the error; subsequent commits fix the issue. Unfortunately, the test still fails round-trip as described in the known_bugs file added; this is for a somewhat separate issue that I would prefer to fix independently.

All commits are atomic; this PR can be split up if desired.

@kquick kquick requested a review from RyanGlScott February 3, 2025 06:33
Copy link
Contributor

@RyanGlScott RyanGlScott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! In the future, it would be helpful to file an issue first describing the (rather lengthy and nuanced) problem before filing a PR. This would be especially helpful since (AFAIU) this PR doesn't completely fix the underlying problem due to the roundtrip errors—that way, we could associate one issue with multiple PRs.

A test is included; running the cabal run disasm-test or simply cabal run llvm-disasm -- disasm_test/test/localstatic.bc at the commit where this was introduced shows the error; subsequent commits fix the issue. Unfortunately, the test still fails round-trip as described in the known_bugs file added; this is for a somewhat separate issue that I would prefer to fix independently.

Yes, I would greatly appreciate if you could help with this. I strongly suspect that this is due to #260, an issue which is currently blocking support for LLVM 17 and later (see #261).

disasm-test/known_bugs/llvm_dbg_declare_inline Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/IR/Values.hs Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Outdated Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Show resolved Hide resolved
src/Data/LLVM/BitCode/Parse.hs Show resolved Hide resolved
@kquick kquick merged commit 965d6c9 into master Feb 3, 2025
26 checks passed
@kquick kquick deleted the dgb_1738563983-0 branch February 3, 2025 15:06
@sauclovian-g
Copy link

It's never too late to file an issue :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants