Overview
We have developed a prototype that leverages Propeller to insert optimal code prefetches into binaries. This is particularly relevant now that new architectures from Intel (GNR) and AMD (Turin) support software-based code prefetching (PREFETCHIT0/1), a capability that Arm has supported for even longer (PRFM). Our preliminary results demonstrate a reduction in frontend stalls and an overall improvement for an internal workload running on Intel GNR.
Our current framework requires an extra round of hardware profiling on top of the optimized binary (we are currently investigating whether this can be incorporated into the Propeller profile collection stage). The profile is used to guide the selection of prefetch targets and injection sites. Prefetches must be inserted judiciously, as over-prefetching may increase the instruction working set. We have seen improvements from injecting ~10k prefetches. About 80% of prefetches are placed in .text.hot while the rest are in .text. Similarly, 90% of prefetches target .text.hot while the remaining 10% target code in .text.
Design Details
As a post-link optimizer, Propeller provides a strong foundation for accurate binary address mapping (SHT_LLVM_BB_ADDR_MAP). We emit prefetch directives into the Propeller profile, which are then read by the compiler to insert prefetches at the requested sites. Each directive specifies the injection site and the prefetch target. Each of these is specified by a triple <function_name, basic_block_id, callsite_index>. The basic block ID is the unique identifier of the basic block within the function. The callsite index identifies a callsite within the block; the address it maps to is the end of the call instruction, i.e. where the call returns to. A zero callsite index means the beginning of the block. Support for callsite-level precision was recently added to SHT_LLVM_BB_ADDR_MAP.
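To make the callsite index concrete, here is a minimal sketch of how a <basic_block_id, callsite_index> pair could be turned into an address. The `BlockLayout` type and its fields are hypothetical and do not reflect the actual SHT_LLVM_BB_ADDR_MAP encoding. The sketch also assumes that index k (for k >= 1) refers to the k-th call in the block, which matches the example where <foo,78,1> follows the first call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockLayout:
    """Hypothetical per-block layout; not the real SHT_LLVM_BB_ADDR_MAP format."""
    start: int                      # address of the first instruction in the block
    call_return_offsets: List[int]  # offset from `start` to the end of each call, in order


def site_address(block: BlockLayout, callsite_index: int) -> int:
    """Map a callsite index to an address inside the block.

    Index 0 is the beginning of the block. Index k >= 1 is assumed to be the
    return address of the k-th call in the block.
    """
    if callsite_index == 0:
        return block.start
    if not 1 <= callsite_index <= len(block.call_return_offsets):
        raise IndexError(f"block has no callsite {callsite_index}")
    return block.start + block.call_return_offsets[callsite_index - 1]
```

For a block at 0x4000 whose first call ends 7 bytes in, `site_address(block, 1)` yields 0x4007, the address the call returns to.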
The compiler will insert a special global symbol for every prefetch target to allow the linker to resolve it against the prefetch instruction. The symbol is named based on the triple associated with the target:
__llvm_prefetch_target_<function_name>_<basic_block_id>_<callsite_index>. This symbol must be global since the prefetch instruction and target could be in different object files. This breaks ODR when targets are in internal-linkage functions of the same name defined in different object files. Currently, we are using -funique-internal-linkage-names to circumvent this. This option appends a unique suffix to the function name based on the module ID. We could also use the GUID once it becomes available and integrated into the Propeller framework.
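The naming scheme can be sketched as follows. This is an illustration only: the `PrefetchSite` type and both helper functions are hypothetical, and the underscore-separated form is taken from the `__llvm_prefetch_target_foo_78_1` symbol shown in the example in this post.

```python
from typing import NamedTuple

PREFETCH_TARGET_PREFIX = "__llvm_prefetch_target_"


class PrefetchSite(NamedTuple):
    function_name: str
    basic_block_id: int
    callsite_index: int  # 0 means the beginning of the block


def target_symbol_name(site: PrefetchSite) -> str:
    """Build the global symbol name for a prefetch target triple."""
    return (f"{PREFETCH_TARGET_PREFIX}{site.function_name}"
            f"_{site.basic_block_id}_{site.callsite_index}")


def parse_target_symbol(name: str) -> PrefetchSite:
    """Recover the triple from a symbol name.

    Splits from the right, because the function name itself may contain
    underscores while the last two fields are always integers.
    """
    if not name.startswith(PREFETCH_TARGET_PREFIX):
        raise ValueError(f"not a prefetch target symbol: {name}")
    body = name[len(PREFETCH_TARGET_PREFIX):]
    function_name, bb_id, callsite = body.rsplit("_", 2)
    return PrefetchSite(function_name, int(bb_id), int(callsite))
```

Because the function name is embedded verbatim, two internal-linkage functions with the same name in different object files would produce the same global symbol, which is the collision that -funique-internal-linkage-names avoids.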
A new compiler pass is added to insert prefetch instructions at the requested positions. On both X86 and Arm, the prefetch instruction supports register-based addressing and not immediate addressing. X86 provides 4-byte relative addressing while Arm allows a 2-byte range. Our current prototype on X86 uses RIP-relative addressing via the target’s associated symbol to reference the prefetch target. A relocation is used when the target is not defined in the same object file.
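As a rough illustration of the RIP-relative form, the displacement stored in the instruction is the distance from the end of the prefetch instruction to the target, and it must fit in a signed 4-byte field. The function names below are hypothetical, and the instruction length is passed in as a parameter instead of being hard-coded.

```python
import struct


def rip_relative_disp32(insn_addr: int, insn_len: int, target_addr: int) -> int:
    """Displacement for a RIP-relative operand.

    RIP-relative addressing is computed from the address of the *next*
    instruction, i.e. insn_addr + insn_len.
    """
    disp = target_addr - (insn_addr + insn_len)
    if not -(1 << 31) <= disp < (1 << 31):
        raise ValueError("prefetch target is out of the signed 32-bit range")
    return disp


def encode_disp32(disp: int) -> bytes:
    """Little-endian signed 32-bit encoding of the displacement field."""
    return struct.pack("<i", disp)
```

A displacement of zero therefore refers to the instruction that immediately follows the prefetch.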
Here is an example of how the Propeller profile is used to insert prefetches.
Propeller profile:
f foo
t 78,1 # specifies a prefetch target at <foo,78,1>
t 71,0
f bar
h 12,0 foo,71,0 # prefetch hint to be placed at <bar,12,0> targeting <foo,71,0>
h 12,1 foo,78,1
Binary assembly after prefetch insertion:
<foo>:
...
<BB78>:
## <foo,78,0>
movl $0x4, %esi
callq *%rbx
## <foo,78,1>
__llvm_prefetch_target_foo_78_1: # inserted symbol for prefetch target
testq %r14, %r14
jne <BB78>
jmp <BB77>
<BB71>:
## <foo,71,0>
__llvm_prefetch_target_foo_71_0: # inserted symbol for prefetch target
cmpq $0x1, %rax
jne <BB73>
<BB72>:
...
...
<bar>:
...
<BB12>:
## <bar, 12, 0>
prefetchit1 0x1113(%rip) <__llvm_prefetch_target_foo_71_0> # prefetch instruction
movl $0x5, %edi
callq $0x1200
## <bar, 12, 1>
prefetchit1 0x1115(%rip) <__llvm_prefetch_target_foo_78_1> # prefetch instruction
testq %r14, %r14
je <BB15>
<BB13>:
...
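The directives in the profile above could be read with a small parser like the one below. This is a sketch under assumptions: it handles only the `f`, `t`, and `h` lines shown in the example, ignores any other directive, and treats everything after `#` as an annotation. The function name and return shape are hypothetical and are not part of Propeller.

```python
from typing import List, Set, Tuple

Site = Tuple[str, int, int]  # (function_name, basic_block_id, callsite_index)


def parse_prefetch_directives(text: str) -> Tuple[Set[Site], List[Tuple[Site, Site]]]:
    """Return (targets, hints), where each hint is an (injection_site, target) pair."""
    targets: Set[Site] = set()
    hints: List[Tuple[Site, Site]] = []
    current_function = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts an annotation
        if not line:
            continue
        fields = line.split()
        kind = fields[0]
        if kind == "f":
            current_function = fields[1]
        elif kind in ("t", "h"):
            if current_function is None:
                raise ValueError(f"directive before any 'f' line: {raw!r}")
            bb, cs = fields[1].split(",")
            site = (current_function, int(bb), int(cs))
            if kind == "t":
                targets.add(site)
            else:
                # The target is written as function,block,callsite.
                fn, tbb, tcs = fields[2].rsplit(",", 2)
                hints.append((site, (fn, int(tbb), int(tcs))))
        # Any other directive kind is outside the scope of this sketch.
    return targets, hints
```

With the parsed result, hints whose target has no matching `t` line can be listed with `[h for h in hints if h[1] not in targets]`, which is empty for the example profile.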
It is possible that the symbol referenced by a prefetch instruction is not defined (e.g., if the target function’s name, basic block ID, or callsites vary in the final build). Currently, we have implemented special handling in the linker to reject the relocation if the symbol is undefined (by checking the symbol name), effectively causing the prefetch to target 0x0(%rip), i.e. the next instruction. Prefetching the next instruction is harmless and can be tolerated since prefetches are nonblocking instructions.
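The fallback can be modeled as below. This is a simplified sketch, not the linker implementation: the symbol table is a plain dictionary, and the function and exception names are hypothetical. Only the name-based check and the zero displacement come from the description above.

```python
from typing import Dict

PREFETCH_TARGET_PREFIX = "__llvm_prefetch_target_"


class UndefinedSymbolError(Exception):
    """Raised for an undefined symbol that is not a prefetch target."""


def resolve_prefetch_displacement(symbol: str, next_insn_addr: int,
                                  symtab: Dict[str, int]) -> int:
    """Return the RIP-relative displacement to store in the instruction.

    If the symbol is defined, the displacement is the usual target minus
    next-instruction distance. If it is undefined but its name marks it as a
    prefetch target, the relocation is dropped and the displacement stays 0,
    so the prefetch points at the next instruction.
    """
    addr = symtab.get(symbol)
    if addr is not None:
        return addr - next_insn_addr
    if symbol.startswith(PREFETCH_TARGET_PREFIX):
        return 0
    raise UndefinedSymbolError(symbol)
```

Any other undefined symbol still results in an error, so the special case stays limited to prefetch targets.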
Issues
- Inserting symbols in the binary may confuse systems and tools (such as llvm-objdump --disassemble-symbols).
- Tolerating ODR violation when -funique-internal-linkage-names is not used.