RFC: Code Prefetch Insertion

Overview

We have developed a prototype that leverages Propeller to insert optimal code prefetches into binaries. This is particularly relevant now that new architectures from Intel (GNR) and AMD (Turin) support software-based code prefetching (PREFETCHIT0/1), a capability that Arm has supported for even longer (PRFM). Our preliminary results demonstrate a reduction in frontend stalls and an overall improvement for an internal workload running on Intel GNR.

Our current framework requires an extra round of hardware profiling on top of the optimized binary (we’re currently investigating the possibility of incorporating this into the Propeller profile collection stage). The profile is used to guide target and injection-site selection. Prefetches must be inserted judiciously, as over-prefetching may increase the instruction working set. We have seen improvements from injecting ~10k prefetches. About 80% of prefetches are placed in .text.hot, while the rest are in .text. Similarly, 90% of prefetches target .text.hot, while the remaining 10% target code in .text.

Design Details

As a post-link optimizer, Propeller provides a strong foundation for accurate binary address mapping (SHT_LLVM_BB_ADDR_MAP). We emit prefetch directives into the Propeller profile, which are then read by the compiler to insert prefetches at the requested sites. Each directive specifies an injection site and a prefetch target, each identified by a triple <function_name, basic_block_id, callsite_index>. The basic block ID is the unique identifier of the basic block within the function. The callsite index identifies the callsite within the block that the address maps to (the address maps to the end of the call instruction, i.e., where the call returns to). A callsite index of zero means the beginning of the block. Support for callsite-level precision was recently added to SHT_LLVM_BB_ADDR_MAP.

The compiler will insert a special global symbol for every prefetch target to allow the linker to resolve it against the prefetch instruction. The symbol is named based on the triple associated with the target:

__llvm_prefetch_target_<function_name>_<basic_block_id>_<callsite_index>. This symbol must be global since the prefetch instruction and its target could be in different object files. This can break ODR when targets are in internal-linkage functions of the same name defined in different object files. Currently, we use -funique-internal-linkage-names to circumvent this; this option adds a unique suffix to the function name based on the module ID. We could also use the GUID once it becomes available and is integrated into the Propeller framework.
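
For illustration, the symbol name can be derived mechanically from the triple. A minimal sketch (hypothetical helper name, not the in-tree implementation), matching the __llvm_prefetch_target_foo_78_1 style used in the example below:

#include <string>

// Build the global prefetch-target symbol name from its
// <function_name, basic_block_id, callsite_index> triple. With
// -funique-internal-linkage-names, FuncName already carries its unique suffix.
std::string makePrefetchTargetSymbol(const std::string &FuncName,
                                     unsigned BBID, unsigned CallsiteIdx) {
  return "__llvm_prefetch_target_" + FuncName + "_" + std::to_string(BBID) +
         "_" + std::to_string(CallsiteIdx);
}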

A new compiler pass is added to insert prefetch instructions at the requested positions. On both X86 and Arm, the prefetch instruction supports register-relative addressing rather than immediate addressing: X86 provides a 4-byte relative displacement, while Arm allows a 2-byte range. Our current prototype on X86 uses RIP-relative addressing via the target’s associated symbol to reference the prefetch target. A relocation is used when the target is not defined in the same object file.
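
For context, here is what such a prefetch looks like when issued by hand, as a hedged sketch (hot_target is a hypothetical symbol; it assumes a toolchain and CPU where the PREFETCHI feature is available, e.g. built with -mprefetchi). PREFETCHIT0/1 are documented to take effect only with RIP-relative operands, which is why the pass references the target via its symbol:

// Hypothetical prefetch target; extern "C" keeps the symbol name unmangled
// so the inline asm below can reference it directly.
extern "C" void hot_target();

inline void prefetchHotTarget() {
  // The assembler resolves hot_target through a RIP-relative relocation,
  // mirroring the instruction the insertion pass emits.
  __asm__ volatile("prefetchit1 hot_target(%rip)");
}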

Here is an example of how the Propeller profile is used to insert prefetches.

Propeller profile:

f foo
t 78,1 # specifies a prefetch target at <foo,78,1>
t 71,0
f bar
h 12,0 foo,71,0 # prefetch hint to be placed at <bar,12,0> targeting <foo,71,0>
h 12,1 foo,78,1

Binary assembly after prefetch insertion:

<foo>:
  ...
<BB78>:
  ## <foo,78,0>
  movl $0x4, %esi
  callq *%rbx
  ## <foo,78,1>
__llvm_prefetch_target_foo_78_1: # inserted symbol for prefetch target
  testq %r14, %r14
  jne <BB78>
  jmp <BB77>
<BB71>:
  ## <foo,71,0>
__llvm_prefetch_target_foo_71_0: # inserted symbol for prefetch target
  cmpq $0x1, %rax
  jne <BB73>
<BB72>:
  ...
  ...
<bar>:
  ...
<BB12>:
  ## <bar,12,0>
  prefetchit1 0x1113(%rip) <__llvm_prefetch_target_foo_71_0> # prefetch instruction
  movl $0x5, %edi
  callq 0x1200
  ## <bar,12,1>
  prefetchit1 0x1115(%rip) <__llvm_prefetch_target_foo_78_1> # prefetch instruction
  testq %r14, %r14
  je <BB15>
<BB13>:
  ...
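
For illustration, the directives in the profile above can be read with straightforward string handling. A minimal sketch follows (hypothetical helper names, not the in-tree implementation; it assumes exactly the syntax shown above and ignores trailing # comments):

#include <cstdio>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Triple { std::string Func; unsigned BBID = 0; unsigned CallsiteIdx = 0; };
struct PrefetchHint { Triple Site; Triple Target; };

// "78,1" interpreted within the current function.
static Triple parseSite(const std::string &Func, const std::string &S) {
  Triple T;
  T.Func = Func;
  std::sscanf(S.c_str(), "%u,%u", &T.BBID, &T.CallsiteIdx);
  return T;
}

// "foo,71,0" = <function_name, basic_block_id, callsite_index>.
static Triple parseTarget(const std::string &S) {
  Triple T;
  size_t First = S.find(','), Second = S.find(',', First + 1);
  T.Func = S.substr(0, First);
  T.BBID = std::stoul(S.substr(First + 1, Second - First - 1));
  T.CallsiteIdx = std::stoul(S.substr(Second + 1));
  return T;
}

std::vector<PrefetchHint> parsePrefetchDirectives(std::istream &In) {
  std::vector<PrefetchHint> Hints;
  std::string Line, CurFunc;
  while (std::getline(In, Line)) {
    std::istringstream LS(Line);
    std::string Kind, A, B;
    LS >> Kind >> A >> B;        // anything after the third token (e.g. # comments) is ignored
    if (Kind == "f")
      CurFunc = A;               // "f <name>": current function for subsequent directives
    else if (Kind == "h")        // "h <site> <target>": prefetch hint
      Hints.push_back({parseSite(CurFunc, A), parseTarget(B)});
    // "t <site>" (prefetch-target declarations) would be collected analogously.
  }
  return Hints;
}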

It is possible that the symbol referenced by a prefetch instruction is not defined (e.g., if the target function’s name, block IDs, or callsites differ in the final build). Currently, we have implemented special handling in the linker to reject the relocation if the symbol is undefined (by checking the symbol name), effectively causing the prefetch to target 0x0(%rip), i.e., the next instruction. Prefetching the next instruction is harmless and can be tolerated since prefetches are non-blocking instructions.
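
A minimal sketch of that linker-side check (hypothetical helper and symbol type, not actual LLD code):

#include <string>

// Stand-in for the linker's symbol representation (illustration only).
struct Symbol {
  std::string Name;
  bool IsDefined;
};

// Reject the relocation for undefined prefetch-target symbols so the encoded
// displacement stays zero and the prefetch harmlessly targets the next instruction.
bool shouldDropPrefetchRelocation(const Symbol &Sym) {
  return !Sym.IsDefined &&
         Sym.Name.rfind("__llvm_prefetch_target_", 0) == 0; // name-prefix check
}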

Issues

  • Inserting symbols in the binary may confuse systems and tools (such as llvm-objdump --disassemble-symbols).

  • Tolerating ODR violations when -funique-internal-linkage-names is not used.


Hi Rahman,

I’m trying to utilize PREFETCHIT0 on gnr2, and your work will be very helpful for what I’m doing.
Could you share how much improvement you have seen on which benchmark by injecting prefetches?

Thanks for your interest in this work.
We can’t specifically share the improvement just yet as this is a new architectural feature. However, we’d be happy to help you set up your workflow to utilize the feature once it goes upstream.

Thanks for the proposal. I have two pieces of feedback, one on motivation and the other on the design.

  • Motivation: While I can see the merit of code prefetch, it’d be very helpful to show data, especially end performance (hide details about your workload as necessary). All optimizations add complexity and have trade-offs, and we need to justify them with data to avoid a proliferation of complexity.

  • Design: At a high level, this is profile-guided code prefetching in the compiler. The optimization (prefetch insertion) is done in the compiler, not as a post-link binary optimization, so the coupling with Propeller seems unnecessary. The hardware profile can be collected with perf. We then need anchors to correlate profiles, and for that we should leverage what’s already available in the compiler (e.g., AutoFDO has line numbers, CSSPGO has probes; we could also build something else if needed). Design principles aside, for practical use you also don’t need to require Propeller for a compiler optimization.


Could you expand on the reasons for this extra round of profiling, i.e. why profiling a binary without code-layout optimizations is not sufficient?

The transformation is indeed done in the compiler. Propeller can be viewed as the compiler extension that does the profile correlation and makes global optimization decisions.

The transformation is indeed done in the compiler. Propeller can be viewed as the compiler extension that does the profile correlation and makes global optimization decisions.

If the optimization is in the compiler, it’s not different from other compiler PGO. Can we decouple it from Propeller? The compiler has a global view too, with LTO, for global decisions (whatever prebuilt libraries Propeller can see but the compiler can’t don’t matter, because the optimization still has to go through the compiler).

OTOH, you could design this as a pure binary optimization, in which case you don’t need the compiler. But if possible, it’s preferred for optimizations to be in the compiler still.

There will be some independent work to make the Propeller build flow more integrated with the compiler flow, similar to what AFDO has (in particular the offline step). The difference is that Propeller benefits more from a freshly collected profile that fully matches the binary being built.

Could you expand on the reasons for this extra round of profiling, i.e. why profiling a binary without code-layout optimizations is not sufficient?

Thank you. You’re right. It doesn’t require a Propeller-optimized binary. All we need is a round of profiling on top of the optimized binary (built with SHT_LLVM_BB_ADDR_MAP to enable address mapping). I will fix the description.

Just to be sure I understand, is the overall improvement in runtime (or transaction time, etc.) for this workload? (Or is some other improvement targeted?)

Can you give any context to this number? For example, are you able to say anything about how large the binary is or how severe the frontend issues are on GNR?

I have also explored code prefetching, but have had a hard time finding real applications which benefit. Clang itself is actually the most frontend latency-bound case I’ve been able to find, but I haven’t been able to show benefit there.

Can you please elaborate on what kind of profiles are collected and roughly how they are used to arrive at prefetches?

Sure. I will update with some data.

Could you please be more specific? Do you mean that the API for prefetch insertion directives should be flexible enough to allow using it with AFDO, CSSPGO, etc.? Currently, our mapping uses SHT_LLVM_BB_ADDR_MAP and the insertion pass uses precise directives as explained in this RFC, but it should be possible to adapt to other types of profiles if needed. For example, AFDO could use this by capturing the triplets for each prefetch target and injection site (which are presumably in line-number format, initially).

All of our profile analysis will be done outside of the compiler. We need only minimal assistance from the compiler:

  • Inserting prefetches at the requested places
  • Inserting symbols for the prefetch targets

As long as we have a way to give this information to the compiler, we can take Propeller out of the equation. However, the mappings must be translated accordingly into the triplet format described in the RFC.

OTOH, you could design this as a pure binary optimization, in which case you don’t need the compiler. But if possible, it’s preferred for optimizations to be in the compiler still.

When you say pure binary optimization, do you mean doing it in BOLT? Code prefetching is a classic post-link binary optimization, as we want code prefetches at precise points in a binary without affecting anything else. BOLT would actually be a great way to achieve this, but since we use Propeller internally, we are doing it here.

When you say pure binary optimization, do you mean doing it in BOLT? Code prefetching is a classic post-link binary optimization, as we want code prefetches at precise points in a binary without affecting anything else. BOLT would actually be a great way to achieve this, but since we use Propeller internally, we are doing it here.

Yeah, that’s what I was thinking: either go with a pure binary optimization (like BOLT), or do it purely with compiler FDO (without a dependency on Propeller). Given that you need the compiler to do the optimization anyway, you might as well just integrate it with compiler FDO.

Could you please be more specific? Do you mean that the API for prefetch insertion directives should be flexible enough to allow using it with AFDO, CSSPGO, etc.?

Basically, doing this as an extension to AFDO or CSSPGO: use create_llvm_prof or llvm-profgen to process the raw perf data, make decisions on prefetch placement, generate an LLVM-acceptable profile (it could be part of the AFDO/CSSPGO profile, or a separate profile), and feed it into LLVM to drive the same optimization you implemented.

There will be some independent work to make the Propeller build flow more integrated with the compiler flow, similar to what AFDO has (in particular the offline step). The difference is that Propeller benefits more from a freshly collected profile that fully matches the binary being built.

How is the Propeller approach helping with profile matching? As long as we have good correlation anchors (e.g., pseudo-probes), doing this with compiler FDO should be equally effective. If you are talking about zero-gap (no source drift) profile collection for the Propeller workflow, then it’s not an advantage of the Propeller approach but of the workflow itself, which can be used for compiler FDO too (optimize the same binary that is being profiled).

It is the zero-gap profile collection, but it is also an advantage of Propeller. 1) Propeller inserts anchors late in the pipeline and thus is not subject to issues with anchor maintenance (due to CFG changes). For iFDO, even when the source matches the profile, the profile seen (propagated) at late-stage optimization can be quite off (there is work on context sensitivity and flow sensitivity, but it is still not perfect). Propeller achieves the same level of profile fidelity as BOLT but leverages the compiler for the transformation. 2) Propeller can be thought of as another whole-program synchronization point, like ThinLTO’s thin link. It makes summary-based optimization decisions and reinvokes the compiler.

I can see the appeal of a prefetch insertion implementation that does not depend on Propeller, but they serve different needs (i.e., they differ in the quality of the insertion).

As noted by davidxl, effective code prefetching requires highly precise, context-sensitive profile data.

Integrating this precision into compiler PGO systems (iFDO, CSSPGO, AFDO) is difficult. This is because core optimizations, notably inlining, shift instruction-fetch bottlenecks, undermining the effectiveness of profile-guided prefetching. Additionally, AFDO’s reliance on source line numbers hinders precise placement.

The post-link-time nature of BOLT makes it the ideal, stable platform for this context-sensitive optimization, as it operates on the final, optimized binary.

However, adopting BOLT forces us to confront the longstanding architectural challenge: adapting monolithic binary rewriting and disassembly to our existing large-scale distributed build environments. This infrastructure hurdle remains a significant barrier.

Some performance data:

  • 6% decrease in critical L2 instruction cache miss rate.
  • 6% increase in critical L1 instruction cache miss rate.
  • 1% decrease in total L2 code misses (including speculative).
  • 6% increase in total L2 code reads (including speculative and prefetches).
  • 0.8% increase in effective IPC (computed as [#total instructions - #prefetches] / #cycles).
  • Executed 0.23 code prefetches per 1000 instructions.

The improvement for the particular workload we targeted is in throughput (QPS).

Clang is definitely frontend-bound, especially with regard to the L1 instruction cache and TLB, but it has a much lower L2 instruction cache miss rate compared to our internal workload.

Hi, this is amazing. Thank you for sharing.
I was wondering why the L1i performance decreased. It seems a bit counter-intuitive. Does the prefetchi* instruction load the target code into L2, rather than L1i? I couldn’t find a precise spec in the official document.
Also, I’d appreciate it if you could let me know what the prefetchi distance is in your environment.

For iFDO, even when the source matches the profile, the profile seen (propagated) at late-stage optimization can be quite off (there is work on context sensitivity and flow sensitivity, but it is still not perfect).

We could have a CS-IRPGO-style workflow that relies on the same profile/optimizations (including inlining) between the profiling builds and the optimization build to get around this issue. In that case, there would be a separate prefetch profile to guide prefetch insertion.

Such a workflow does make it look similar to the Propeller workflow. But in general, I think it’s desirable to make an optimization available in the compiler first, even though it may have lower effectiveness (relative to Propeller or BOLT) due to profile quality limitations. As an example, hypothetically, if we didn’t have any profile-guided block layout in the compiler today, we would probably implement it in the compiler first before doing it with Propeller or BOLT, even though the latter can yield better results.

That said, as long as the compiler part of this proposal is generic enough to be reused for compiler-PGO-style prefetching later, it should be a good addition. Along those lines, if the heuristics/decisions for prefetch placement can be implemented in the compiler, I think that would be even better for reuse.

I’m also curious how/what HW profile is collected, and how you synthesize prefetch placement from that?