Hi folks, I’ve implemented PoC for AArch64 to handle jump tables for gcc/clang/golang compilers in llvm-bolt (BOLT)
Let me describe the details here and what could simplify JT support in BOLT for gcc/clang and golang.
Firstly, why support jump tables for aarch64 is the issue for BOLT.
Jump table is an instructions snippet and an array of offsets,
If BOLT can’t match snippet, a function layout can’t be changed and the one must be excluded from optimizations.
The example of the most common clang pattern for tested binary
section .rodata:
JumpTable:
.byte 0x00,0x02,0x04
section .text:
…
adrp x9, JumpTable
add x9, x9, :lo12:JumpTable
adr x10, .LEntry0
ldrb w11, [x9, x8]
add x10, x10, x11, lsl #2
br x10
.LEntry0:
…
.LEntry1:
BOLT builds CFG by disassembling the binary and it needs to know the indirect branch targets to construct CFG,
and to determine the value for indirect branches bolt needs to know CFG.
To handle jump table correctly, bolt handles the indirect branches into 2 steps:
- make a guess about the indirect branch nature without knowing the CFG (jump table snippet or fixed indirect branch or veneer and etc.)
- build CFG and verify assumptions for processing unknown indirect branches, making a guess about the jump table entry size and jump table size.
Current PoC based on heuristics from JT snippet about the JT size and JT entry size.
To exclude heuristics from the process,
1. we need a hint about the nature indirect branch
gcc/clang don’t add any extra information about jump tables and where they are,
golnag has special records with .jumpXXX suffix in symtab.
readelf -Ws main | grep runtime.typehash
0000000000403c50 656 FUNC LOCAL DEFAULT 15 runtime.typehash
0000000000719fe0 104 OBJECT LOCAL DEFAULT 18 runtime.typehash.jump14
it means that runtime.typehash function has jump table with 104 bytes size.
bolt can get the hint from symtab to make sure that function has jump table and need to match snippet.
2. we need to know all jump tables snippets implementations
gcc/clang have almost the same jump tables patterns, difference is the instruction sequence and operands
Example for PIE binary,
| clang |
gcc |
| adrp x9, pageOff@Table |
adrp x9, pageOff@Table |
| add x9, x9, #lo12:Table |
add x9, x9, #lo12:Table |
| adr x10, foo+0xbc |
ldrh w0, [x9, w0, uxtw #1] |
| ldrb w11, [x9, x8] |
adr x10, foo+0xbc |
| add x10, x10, x11, lsl #2 |
add x10, x11, w0, sxth #2 |
| br x10 |
br x10 |
The order of instructions and operands varies, which adds complexity due to variability.
Golang has the single snippet for whole binary
adrp x9, pageOff@Table
add x9, x9, #lo12:Table
ldr x27, [x9, x8, lsl #3]
br x27
3. From snippet we have to decode entry size, array size in .rodata and target addresses.
for gcc/clang most common entry size is 1/2/4 bytes. Bolt takes the entry size from load instruction.
for example,
clang
ldrb w11, [x9, x8] - 1 byte entries
gcc
ldrh w0, [x9, w0, uxtw #1] - 2 bytes entries
golang
ldr x27, [x9, x8, lsl #3] - always 8 bytes, direct target address inside this function
Additional complexity for gcc/clang is an accounting for sign extension (uxtw, sxth) for decoding target addresses.
4. Jump table size
gcc/clang don’t define the size directly, bolt make a guess about a JT size as difference between current and next symbol in .rodata
need to take into account the next symbol and next section alignments
5. Rare case for C/C++
for MySQL binary I’ve faced with corner case where the snippet is last basic block for function and adr instruction is relative to next function
.text:
function foo
.LEntry0:
…
.LEntry1:
…
.BBwithSnippet
adrp x9, JumpTable
add x9, x9, :lo12:JumpTable
ldrh w8, [x9, w0, uxtw #1]
adr x10, bar
add x10, x11, w0, sxth #2
br x10
function bar
…
after function reordering based on profile, foo is hot function move to hot section, bar is moved to cold section and adr instruction is out of available range,
I skipped both function from optimization due to bar.
7. in case of 1/2/4 bytes entries size we have to extend the array and change the snippet for instrumentation
BOLT instrumentation snippet for AArch64 takes ~10 instructions, we need to insert the one before each branch and at the function beginning.
The function layout is changed significantly and the distance for ADR instruction out of range. Bolt must apply ADR instruction relaxation (it isn’t always possibly) or BOLT must extend the pattern and move the array offsets to another section.
for golang with 8 bytes entries BOLT update the addresses in-place.
8. gcc/clang jump table snippet can be split between several basic blocks, which adds complexity for snippet matching
adrp x9, JumpTable
add x9, x9, :lo12:JumpTable
adr x10, bar
ldrb w11, [x9, x8]
add x10, x10, x11, lsl #2
br x10
…
ldrb w11, [x9, x8]
add x10, x10, x11, lsl #2
br x10
golang don’t split snippet (very convenient)
As the summary to handle jump tables correctly with the minimum necessary and sufficient information compiler can add the records to symtab with object size and function name (no need to read extra sections and platform independent solution)
Additionally it will be great for C/C++ to
- Reduce the complexity and variability of patterns by reducing the number of templates and fixing the order of instructions.
Maybe compiler can provide the option for developer which one patterns most suitable (trade off between memory and cpu utilization)
- Exclude the splitting for snippet between basic block.
- Use 8 bytes entry size to reduce snippet size (under compiler option)
- Remove ADR instruction from snippet