Extending llvm_jump_table_sizes

Hi @n-omer and others,

We would like to emit additional metadata for jump tables on aarch64[1]. Unfortunately, the format of llvm_jump_table_sizes added in [MC] Emit a jump table size section by omern1 · Pull Request #101962 · llvm/llvm-project · GitHub only works for simpler schemes like x86, and I do not see any room to extend the format (no version/format numbering or similar).

I am contemplating adding yet another section, llvm_jump_table_info, where each entry is prefixed with a byte specifying the format, to leave room for architecture-specific formats and future extensions. Unfortunately this adds yet another section type while llvm_jump_table_sizes already exists. Is there a better way? Would it be possible to still change the format of the existing llvm_jump_table_sizes section to include some format/version marker (what are the known users)?

[1] For context: We are trying to improve BOLT behavior on AArch64 for functions with jump tables. On aarch64 the jump tables store relative offsets to a known position in the basic block, using 1-, 2-, or 4-byte entries that are right-shifted by 2 because instructions are known to be 4-byte aligned. As far as I can tell this cannot be properly expressed/encoded in the current llvm_jump_table_sizes section. I am currently prototyping formats similar to this example:

	.section	.llvm_jump_table_info,"",@0x6fff4c0e
	.byte	2                               // format 2: entries are 1 byte, relative to known address; shifted by 2
	.xword	.LJTI0_0                        // pointer to jump table
	.xword	.LBB0_4                         // entries are relative to this address
	.xword	.Ltmp0                          // pointer to Load Instruction
	.xword	.Ltmp1                          // pointer to Branch Instruction
	.byte	67                              // Number of Entries

Hi @MatzeB,
It’s great that you’re working on this!
We are developing an optional extension to the AArch64 ELF ABI (abi-aa/aaelf64/aaelf64.rst at main · ARM-software/abi-aa · GitHub), to be proposed for discussion in Consider an ABI extension to define metadata for binary analysis · Issue #297 · ARM-software/abi-aa · GitHub, and plan to support it in GCC and LLVM. We have conducted a preliminary internal discussion at Arm on metadata for jump tables, and at the moment the draft format looks like this (extending your proposal further):

Offset (bytes) | Field Name   | Size (bytes) | Description
0              | BranchOffset | 4            | Offset to determine the address of the indirect branch instruction that uses the jump table.
4              | TableOffset  | 4            | Offset to determine the starting address of the jump table.
8              | BaseOffset   | 4            | Offset to determine the base address that jump table entries are relative to.
12             | EntryCount   | 3            | Number of entries in the jump table.
15             | EntrySize    | 1            | Size and type of jump table entries:

1 - 4 bytes, absolute address; the value is used as the address (base address = 0x0)
2 - 4 bytes, unsigned offset; the value is added to the base address
3 - 4 bytes, signed offset; the value is added to the base address
4 - 1 byte, unsigned offset; the value is shifted and added to the base address
5 - 2 bytes, unsigned offset; the value is shifted and added to the base address
6 - 1 byte, signed offset; the value is shifted and added to the base address
7 - 2 bytes, signed offset; the value is shifted and added to the base address
Other values are reserved for future use.
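
To make the decoding concrete, here is a minimal sketch of how a consumer might turn one entry into a branch target (my own illustration against the draft above, not part of the proposal; the shift amount for the shifted formats is assumed to be 2, matching AArch64's 4-byte instruction alignment):

#include <cstdint>

// Hypothetical helper: compute a branch target from one jump table entry
// under the draft EntrySize encodings above. `raw` holds the entry's bytes
// already loaded (zero-extended) from the table.
uint64_t decodeTarget(uint8_t entrySize, uint64_t base, uint64_t raw) {
  switch (entrySize) {
  case 1: return raw;                                    // 4-byte absolute address
  case 2: return base + (uint32_t)raw;                   // 4-byte unsigned offset
  case 3: return base + (uint64_t)(int64_t)(int32_t)raw; // 4-byte signed offset
  case 4: return base + ((uint64_t)(uint8_t)raw << 2);   // 1-byte unsigned, shifted
  case 5: return base + ((uint64_t)(uint16_t)raw << 2);  // 2-byte unsigned, shifted
  case 6: return base + (uint64_t)((int64_t)(int8_t)raw * 4);  // 1-byte signed, shifted
  case 7: return base + (uint64_t)((int64_t)(int16_t)raw * 4); // 2-byte signed, shifted
  default: return 0; // reserved
  }
}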

We would appreciate any comments, suggestions for changes, and improvements, and really want to make it suitable for users’ needs.
Thanks in advance,
Pavel

  • I think EntrySize should be called Format or similar and should go first. That way we can just invent new “formats” with different kinds and numbers of fields to support other architectures or possible future extensions. If nothing else, I would also like to develop a format for the x86-64 jump table variants while looking at this.
  • Other than that this looks like the right set of fields for AArch64 to me. One could debate whether 4-byte offsets may be limiting (the biggest binaries around here certainly cross the 4GB mark, especially when they include debug info). ULEB128/SLEB128 may be a viable option (are there worries about parsing complexity or about variable-sized entries because of ULEB/SLEB?)

Our plan here is to push something like this as an internal prototype to the C++ toolchain and BOLT so we can gather some implementation experience first before making anything “official”.


Thanks for the valuable feedback! Going first with EntrySize (maybe call it EntryFormat?) sounds reasonable to me. It would be great if we can make this extension generic and specify formats for other architectures there (we have plenty of space to encode more formats). We discussed the option of LEB128 compression, but the advantages did not seem worth the additional complexity. We can revisit this.

A BOLT reader for the .llvm.jump_table_info section as defined by Matthias below:

The pass that preserves the jump table format, to avoid modifying the jump table access instruction sequence, is WIP:

Note this doesn’t work with instrumentation, as it violates the assumption that a function does not increase in size (it is either split or left untouched).

  • I pushed our prototype LLVM changes here: GitHub - MatzeB/llvm-project at staging-matthiasb-jump_table_info
  • The format for an entry in the prototype looks like this:
    • 1 byte: format. Possible values:
      1: (intended for the default (think x86) format; not implemented in the prototype yet; format 1 would skip most of the following fields)
      2: aarch64-style format, with 1-byte per entry, relative offsets, right shifted by 2
      3: aarch64-style format, with 2-byte per entry, relative offsets, right shifted by 2
      4: aarch64-style format, with 4-byte per entry, relative offsets (no shift)
    • 8-byte pointer/relocation to jump table
    • 8-byte pointer/relocation to “base address” in the code segment. (Jump table entries are relative to this address.)
    • 8-byte pointer/relocation to load instruction
    • 8-byte pointer/relocation to indirect branch instruction
    • ULEB128 number of jump table entries
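
For reference, a minimal reader sketch for one entry in this prototype layout (the names JumpTableInfo/parseEntry/readULEB128 are mine, not from the prototype; assumes a little-endian producer and in-bounds data, with no error handling):

#include <cstdint>
#include <cstring>

// Sketch: parse one .llvm_jump_table_info entry in the prototype layout
// (format byte, four 8-byte pointers, ULEB128 entry count).
struct JumpTableInfo {
  uint8_t Format;      // 2/3/4: aarch64-style with 1/2/4 bytes per table entry
  uint64_t Table;      // address of the jump table
  uint64_t Base;       // table entries are relative to this code address
  uint64_t LoadInsn;   // address of the load instruction
  uint64_t BranchInsn; // address of the indirect branch
  uint64_t NumEntries; // decoded from ULEB128
};

static uint64_t readULEB128(const uint8_t *&p) {
  uint64_t value = 0;
  unsigned shift = 0;
  uint8_t byte;
  do {
    byte = *p++;
    value |= (uint64_t)(byte & 0x7f) << shift;
    shift += 7;
  } while (byte & 0x80);
  return value;
}

JumpTableInfo parseEntry(const uint8_t *&p) {
  auto read64 = [&p] {
    uint64_t v;
    std::memcpy(&v, p, 8); // 8-byte pointer field, little-endian
    p += 8;
    return v;
  };
  JumpTableInfo info;
  info.Format = *p++;
  info.Table = read64();
  info.Base = read64();
  info.LoadInsn = read64();
  info.BranchInsn = read64();
  info.NumEntries = readULEB128(p);
  return info;
}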

Notes:

  • Added separate entries to point to the load/branch. These could likely be deduced by analyzing the instruction stream instead of encoding them explicitly. However, that may be brittle and may trip up BOLT when there are unexpected patterns (imagine the compiler scheduling unrelated instructions into the pattern or spilling/reloading intermediate values), so we felt more comfortable recording them explicitly in the metadata.
  • I currently went with full 8-byte / xword pointers as I wasn’t sure whether/how much we needed relocations there for things to work correctly during linking. (It is possible that we can get away with a more relative/shorter encoding; I haven’t looked into that.)

I have pushed a prototype for GCC here:

On the whole, the approach suggested above by @MatzeB is suitable for GCC as well. There are a few changes that would be needed, which I’ve outlined below:

  • Format:
    • 5: aarch64-style format, with 4-byte per entry, relative offsets, right shifted by 2
    • 6: aarch64-style format, with 1-byte per entry, relative offsets, right shifted by 2, signed value
    • 7: aarch64-style format, with 2-byte per entry, relative offsets, right shifted by 2, signed value
    • 8: aarch64-style format, with 4-byte per entry, relative offsets, right shifted by 2, signed value

Notes:

  • The above format changes are needed because 4-byte entries are also right-shifted by 2 in GCC. Entries 6, 7, and 8 account for the future possibility of jump table entries that are signed as opposed to unsigned (i.e. a backwards jump). At present we do not do this for aarch64, but we do have examples of it on arm, so we would want it defined here to keep that option open.
  • We would be looking to standardise the section name, type, and flags. The section should not be loaded into the executable. I’ve proposed a name of .jump_table_info.
  • We have no issues with 8-byte pointers or ULEB128 for the number of entries. The section should not be loadable and the number of switch statements is generally low, so having the larger size for safety seems fine. We are not opposed to smaller sizes if needed.
  • I’ve included a reference to the load and branch instruction as per the LLVM prototype. We were unsure of the use case for these, but don’t have any complaint against including them.

Hi folks, I’ve implemented a PoC in llvm-bolt (BOLT) for AArch64 that handles jump tables produced by the gcc/clang/golang compilers.

Let me describe the details here, along with what could simplify JT support in BOLT for gcc/clang and golang.

First, why jump table support for aarch64 is an issue for BOLT:
a jump table is an instruction snippet plus an array of offsets.
If BOLT can’t match the snippet, the function layout can’t be changed and the function must be excluded from optimizations.

An example of the most common clang pattern in the tested binary:

section .rodata:
JumpTable:
.byte 0x00,0x02,0x04

section .text:
…
adrp    x9, JumpTable
add     x9, x9, :lo12:JumpTable
adr     x10, .LEntry0
ldrb    w11, [x9, x8]
add     x10, x10, x11, lsl #2
br      x10
.LEntry0:
…
.LEntry1:
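
For this pattern each table byte encodes (target - .LEntry0) >> 2, so recovering a target is just the following arithmetic (a sketch, not BOLT code):

#include <cstdint>

// Decoding implied by the clang snippet above: 1-byte unsigned entry,
// base address .LEntry0 (materialized by adr), shifted left by 2 because
// targets are 4-byte-aligned instructions.
uint64_t clangTarget(uint64_t entry0Addr, uint8_t tableByte) {
  return entry0Addr + ((uint64_t)tableByte << 2);
}

With the table bytes 0x00, 0x02, 0x04 above, the targets are .LEntry0+0, .LEntry0+8, and .LEntry0+16.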

BOLT builds the CFG by disassembling the binary. It needs to know the indirect branch targets to construct the CFG,
and to determine the targets of indirect branches it needs the CFG.

To handle jump tables correctly, BOLT processes indirect branches in two steps:

  1. make a guess about the nature of the indirect branch without knowing the CFG (jump table snippet, fixed indirect branch, veneer, etc.)
  2. build the CFG and verify the assumptions while processing unknown indirect branches, making a guess about the jump table entry size and jump table size.

The current PoC is based on heuristics over the JT snippet to guess the JT size and JT entry size.

To exclude heuristics from the process:

1. we need a hint about the nature of the indirect branch

gcc/clang don’t add any extra information about jump tables or where they are;
golang has special records with a .jumpXXX suffix in the symtab.

readelf -Ws main | grep runtime.typehash
0000000000403c50   656 FUNC    LOCAL  DEFAULT   15 runtime.typehash
0000000000719fe0   104 OBJECT  LOCAL  DEFAULT   18 runtime.typehash.jump14

This means that the runtime.typehash function has a jump table of 104 bytes (13 entries at golang’s fixed 8 bytes per entry).
BOLT can get this hint from the symtab to make sure the function has a jump table and the snippet needs to be matched.

2. we need to know all jump table snippet implementations

gcc/clang have almost the same jump table patterns; the differences are the instruction sequence and the operands.

Example for a PIE binary:

clang:
adrp x9, pageOff@Table
add  x9, x9, #lo12:Table
adr  x10, foo+0xbc
ldrb w11, [x9, x8]
add  x10, x10, x11, lsl #2
br   x10

gcc:
adrp x9, pageOff@Table
add  x9, x9, #lo12:Table
ldrh w0, [x9, w0, uxtw #1]
adr  x10, foo+0xbc
add  x10, x11, w0, sxth #2
br   x10

The order of instructions and the operands vary between compilers, which adds matching complexity.

Golang has a single snippet for the whole binary:

adrp x9, pageOff@Table
add  x9, x9, #lo12:Table
ldr  x27, [x9, x8, lsl #3]
br   x27
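
Here the entries are full 8-byte absolute addresses, so decoding is a plain load (sketch):

#include <cstdint>

// Golang pattern above: 8-byte entries holding absolute target addresses,
// so the branch target is simply the loaded value.
// Matches: ldr x27, [x9, x8, lsl #3]; br x27
uint64_t golangTarget(const uint64_t *table, uint64_t index) {
  return table[index];
}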

3. From the snippet we have to decode the entry size, the array size in .rodata, and the target addresses.

For gcc/clang the most common entry sizes are 1/2/4 bytes. BOLT takes the entry size from the load instruction.

For example:

clang:

ldrb    w11, [x9, x8] - 1-byte entries

gcc:

ldrh    w0, [x9, w0, uxtw #1] - 2-byte entries

golang:

ldr     x27, [x9, x8, lsl #3] - always 8-byte entries; the value is the target address inside the function

An additional complexity for gcc/clang is accounting for the zero/sign extension (uxtw, sxth) when decoding target addresses.
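
A sketch of the mapping involved (illustrative mnemonic-based dispatch; the real matcher works on decoded instructions, and in the gcc pattern the signedness actually comes from the sxth extend on the add, so both instructions must be inspected):

#include <string>

// Illustrative: derive entry width and signedness from the table-load
// instruction; mnemonic strings stand in for opcode checks.
struct EntryKind { unsigned WidthBytes; bool IsSigned; };

EntryKind classifyTableLoad(const std::string &mnemonic) {
  if (mnemonic == "ldrb")  return {1, false};
  if (mnemonic == "ldrsb") return {1, true};
  if (mnemonic == "ldrh")  return {2, false};
  if (mnemonic == "ldrsh") return {2, true};
  if (mnemonic == "ldr")   return {4, false}; // w-register form; x-register (golang) is 8 bytes
  return {0, false}; // unknown: fall back to heuristics
}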

4. Jump table size

gcc/clang don’t define the size directly; BOLT makes a guess about the JT size as the difference between the current and the next symbol in .rodata.

The alignment of the next symbol and of the next section need to be taken into account.
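
In code the guess is roughly the following (a sketch; real code must walk the symbol table and section boundaries):

#include <cstdint>

// Sketch: bound the jump table size by the distance to the next symbol in
// .rodata; alignment padding means this is only an upper bound.
uint64_t guessTableBytes(uint64_t jtAddr, uint64_t nextSymOrSectionAddr) {
  return nextSymOrSectionAddr - jtAddr;
}

// Derived bound on the entry count for a given entry size (1/2/4/8 bytes).
uint64_t maxEntries(uint64_t tableBytes, unsigned entrySizeBytes) {
  return tableBytes / entrySizeBytes;
}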

5. Rare case for C/C++

For a MySQL binary I ran into a corner case where the snippet is the last basic block of the function and the adr instruction is relative to the next function:

.text:

function foo
.LEntry0:
…
.LEntry1:
…
.BBwithSnippet:
  adrp    x9, JumpTable
  add     x9, x9, :lo12:JumpTable
  ldrh    w8, [x9, w0, uxtw #1]
  adr     x10, bar
  add     x10, x11, w0, sxth #2
  br      x10

function bar
…

After profile-based function reordering, foo is a hot function moved to the hot section, bar is moved to the cold section, and the adr instruction is out of its available range.
I skipped both functions from optimization because of bar.

6. In the case of 1/2/4-byte entry sizes we have to extend the array and change the snippet for instrumentation.
The BOLT instrumentation snippet for AArch64 takes ~10 instructions; we need to insert one before each branch and at the function beginning.
The function layout changes significantly and the ADR instruction (range ±1 MB) goes out of range. BOLT must apply ADR instruction relaxation (which isn’t always possible), or BOLT must extend the pattern and move the array of offsets to another section.

For golang, with 8-byte entries BOLT updates the addresses in place.

7. The gcc/clang jump table snippet can be split across several basic blocks, which adds complexity for snippet matching:

adrp    x9, JumpTable
add     x9, x9, :lo12:JumpTable
adr     x10, bar
ldrb    w11, [x9, x8]
add     x10, x10, x11, lsl #2
br      x10

…

ldrb    w11, [x9, x8]
add     x10, x10, x11, lsl #2
br      x10

Golang doesn’t split the snippet (very convenient).

In summary: to handle jump tables correctly with the minimum necessary and sufficient information, the compiler can add records to the symtab with the object size and function name (no need to read extra sections, and it is a platform-independent solution).

Additionally, it would be great for C/C++ to:

  1. Reduce the complexity and variability of patterns by reducing the number of templates and fixing the order of instructions.
    Maybe the compiler can provide an option for developers to pick the most suitable pattern (a trade-off between memory and CPU utilization).
  2. Avoid splitting the snippet between basic blocks.
  3. Use an 8-byte entry size to reduce the snippet size (under a compiler option).
  4. Remove the ADR instruction from the snippet.