[RFC] Supporting more early-exit loops

Following on from David Sherwood’s work over the last year to support basic early exit loops, we’d like to propose a basic design for the next phase of that work.

Prior Work

David extended the vectorizer (along with scalar evolution, LAA, etc.) to support autovectorization of loops with an early exit, subject to the following restrictions:

  • No stores in the loop
  • No loads after the one used for the early exit
  • No reductions in the loop
  • The loop must have a single counted exit (linear induction variable
    compared against loop-invariant value)
  • The loop must have a single uncounted exit (comparison of a value loaded from memory against a loop-invariant value)
  • All memory that may be accessed in the loop must be known at compile time to be dereferenceable, so that, assuming the uncounted early exit is never taken, the loop will not fault up to the maximum bound of the counted exit.

A sample loop that should vectorize today (when early-exit loopvec is enabled via a flag):

// Global array declared with known size, so that the compiler
// knows all 'N' iterations are safe to execute.
#define N (100)
int array[N];

int search_first_occurrence(int value) {
  int result = -1;
  for (int i = 0; i < N; i++)
    if (array[i] == value) {
      result = i;
      break;
    }
  return result;
}

While we hope that all of these restrictions will eventually be removed, the first two Arm intends to address are support for additional memory operations in the loop and the requirement for compile-time knowledge of memory bounds.

Supporting additional memory operations

At present, early exit loops featuring stores (or loads after the early exit) will be rejected by LV legality checking. We would like to remove that restriction.

Given a C loop like the following:

void ee_with_store(int * restrict array, int *pred, int N) {
  for (int i = 0; i < N; ++i) {
    array[i]++;
    if (pred[i] > 500)
      break;
  }
}

We have an early exit loop based on pred[i], with a load and store to array[i] occurring before the exit. We will see scalar IR similar to the following:

define void @ee_with_store(ptr dereferenceable(40) noalias %array, ptr dereferenceable(40) readonly %pred) {
entry:
  br label %for.body
for.body:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]
  %st.addr = getelementptr inbounds nuw i16, ptr %array, i64 %iv
  %data = load i16, ptr %st.addr, align 4
  %inc = add nsw i16 %data, 1
  store i16 %inc, ptr %st.addr, align 4
  %ee.addr = getelementptr inbounds nuw i16, ptr %pred, i64 %iv
  %ee.val = load i16, ptr %ee.addr, align 4
  %ee.cond = icmp sgt i16 %ee.val, 500
  br i1 %ee.cond, label %exit, label %for.inc
for.inc:
  %iv.next = add nuw nsw i64 %iv, 1
  %counted.cond = icmp eq i64 %iv.next, 20
  br i1 %counted.cond, label %exit, label %for.body
exit:
  ret void
}

(Dereferenceable attributes added to stay within known bounds)

In order to support the stores, we will need to either mask them based on the combined exit conditions, or keep them unconditional but bail out to the scalar loop in the event of an exit. Since the store occurs before the early exit, we need to determine whether an exit will occur in the next vector iteration, and either generate the mask or leave the vector loop based on that.

Doing so requires that we identify the IR responsible for the exit condition, and after checking to make sure it’s safe to do so, duplicate the vectorized code for it in the preheader, similar to the following:

vector.preheader:
  %ph.ee.load = load <8 x i16>, ptr %pred, align 4
  %ph.ee.cmp = icmp sgt <8 x i16> %ph.ee.load, splat(i16 500)
  ;; To decide whether to bail out before the first vector iteration, we only
  ;; need an 'any' (OR) reduction of the compare.
  %ph.ee.cond = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ph.ee.cmp)
  br i1 %ph.ee.cond, label %scalar.ph, label %vector.body
vector.body:
  %index = phi i64 [ 0, %vector.preheader ], [ %index.next, %vector.body ]
  ;; Normal vectorized loop body.
  %vaddr = getelementptr inbounds nuw i16, ptr %array, i64 %index
  %vload = load <8 x i16>, ptr %vaddr, align 4
  %vinc = add nsw <8 x i16> %vload, splat (i16 1)
  store <8 x i16> %vinc, ptr %vaddr, align 4
  %index.next = add i64 %index, 8
  ;; Find whether there's an exit in the next vector iteration, from either
  ;; the counted or uncounted conditions.
  ;;
  ;; The EE load in the loop must be masked to avoid reading beyond the known
  ;; dereferenceable bounds. In this case, with an upper bound of 20 and a VF
  ;; of 8, we would otherwise read 4 elements beyond what's known to be
  ;; dereferenceable.
  %active.lane.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i64(i64 %index.next, i64 20)
  %ee.next.addr = getelementptr inbounds nuw i16, ptr %pred, i64 %index.next
  %ee.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr nonnull %ee.next.addr, i32 4, <8 x i1> %active.lane.mask, <8 x i16> poison)
  %ee.cmp = icmp sgt <8 x i16> %ee.masked.load, splat(i16 500)
  %ee.masked = and <8 x i1> %ee.cmp, %active.lane.mask
  %ee.any = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ee.masked)
  ;; Combine EE with counted exit
  %counted = icmp eq i64 %index.next, 16
  %combined = or i1 %counted, %ee.any
  br i1 %combined, label %middle.block, label %vector.body

LV legality can identify the necessary IR for the early exits and determine whether duplication is safe; a VPlan transform can then duplicate the condition in the preheader, modify the recipes in the vector body to load the condition for the next vector iteration, and create any phi nodes required.

Sample VPlan (done by hand, so there are probably some bits missing; hopefully the intent is clear):

ir-bb<vector.ph>:
  IR   %2 = call i64 @llvm.vscale.i64()
  IR   %3 = mul i64 %2, 4
  IR   %n.mod.vf = urem i64 20, %3
  IR   %n.vec = sub i64 20, %n.mod.vf
  IR   %4 = call i64 @llvm.vscale.i64()
  IR   %5 = mul i64 %4, 4
  IR   %ph.ee.load = load <8 x i16>, ptr %pred, align 4
  IR   %ph.ee.cmp = icmp sgt <8 x i16> %ph.ee.load, splat(i16 500)
  IR   %ph.ee.cond = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ph.ee.cmp)
  IR   br i1 %ph.ee.cond, label %scalar.ph, label %vector.body
Successor(s): vector loop, ir-bb<scalar.ph>

<x1> vector loop: {
  vector.body:
    SCALAR-PHI vp<%0> = phi ir<0>, vp<%index.next>
    ir<%indvars.iv> = WIDEN-INDUCTION  ir<0>, ir<1>, ir<%5>
    vp<%1> = SCALAR-STEPS vp<%0>, ir<1>

    CLONE ir<%arrayidx> = getelementptr inbounds nuw ir<%array>, ir<0>, vp<%1>
    vp<%2> = vector-pointer ir<%arrayidx>
    WIDEN ir<%6> = load vp<%2>
    WIDEN ir<%inc> = add nsw ir<%6>, ir<1>
    vp<%3> = vector-pointer ir<%arrayidx>
    WIDEN store vp<%3>, ir<%inc>

    EMIT vp<%index.next> = add nuw vp<%0>, ir<%5>.1
    vp<%4> = SCALAR-STEPS vp<%index.next>, ir<1>
    CLONE ir<%arrayidx2> = getelementptr inbounds nuw ir<%pred>, ir<0>, vp<%4>
    vp<%5> = vector-pointer ir<%arrayidx2>
    WIDEN ir<%6> = load vp<%5>
    WIDEN ir<%cmp1> = icmp sgt ir<%6>, ir<500>
    EMIT vp<%6> = any-of ir<%cmp1>
    EMIT vp<%7> = icmp eq vp<%index.next>, ir<%n.vec>
    EMIT vp<%8> = or vp<%6>, vp<%7>

    EMIT branch-on-cond vp<%8>
  No successors
}
Successor(s): ir-bb<middle.block>

ir-bb<middle.block>:
Successor(s): ir-bb<scalar.ph>

ir-bb<scalar.ph>:
  EMIT vp<%bc.resume.val> = resume-phi ... // equivalent to [ 0, %vector.ph ], [ %index.next, %middle.block ]
  br label %for.body

Handling exits

Our initial proposal is to bail out to the scalar loop to handle exits, as the least intrusive option. It can be expensive to create masks in either the loop body or a vectorized tail, so it will likely be cheaper to handle the remaining iterations in scalar code. We will consider enabling tail-folding later, after the core feature has landed.
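
As a minimal sketch (block and value names assumed, matching the earlier example): whichever condition ended the vector loop, the remaining iterations, including the group containing any early exit, simply run in the scalar loop.

middle.block:
  ;; No masking or tail handling here; just hand off to the scalar loop.
  br label %scalar.ph

scalar.ph:
  ;; Resume at 0 if the preheader check found an exit in the first group,
  ;; otherwise at the first iteration not processed by the vector loop.
  %resume.iv = phi i64 [ 0, %vector.preheader ], [ %index.next, %middle.block ]
  br label %for.body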

Supporting unbounded memory access

The current early exit support will only vectorize when all memory accesses are known at compile time to be dereferenceable up to the maximum number of scalar loop iterations. This conflicts with common C usage, where a plain pointer is passed alongside an integer that is supposed to bound the number of elements which can safely be dereferenced, but nothing in the language enforces that.

double some_func(double *in, int *conditions, int size)

We would like to be able to vectorize at least some loops where the bounds are not guaranteed at compile time and must be determined during execution based on data loaded from memory.

Speculative Loads

SVE has a set of instructions known as ‘first-faulting loads’, which (as the name suggests) only take a fault on the first active lane. Subsequent lanes will not trigger a fault if their memory location is invalid; instead, the fact that the access was invalid is recorded in a dedicated predicate register. This register (‘FFR’) can then be combined with normal predicate registers to create a predicate which excludes faulting lanes. This allows us to safely perform the loads needed to determine whether an exit has occurred, and therefore mask the other memory operations in the loop to match what the scalar loop would do.

If the loop termination condition is not met before a fault is encountered, we would either fault on the first lane, or exit the vector loop and resume in the scalar loop, which would then trigger the fault; either way, the vector loop faults exactly when the scalar loop would.

There doesn’t seem to be any support for such speculative loads on x86, SX-Aurora (though it does have ‘dismissable’ scalar loads which discard faults?), or POWER. The RISC-V V extension does include first-faulting loads, but only for unit-stride accesses, whereas SVE also supports them for gathers. RVV first-faulting loads also report the number of valid elements by modifying the vector length (EVL) instead of setting a predicate.

The current proposal is to add a first-faulting version of the masked load intrinsic, returning a struct consisting of the loaded data and a predicate indicating which lanes are valid. The RISC-V folks could add an EVL-based variant if desired.
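
Purely as an illustrative sketch (the intrinsic name and mangling below are assumptions, not a settled design), such an intrinsic might mirror llvm.masked.load, taking the same operands but returning the loaded data together with a mask of the lanes that were actually read:

;; Hypothetical first-faulting counterpart to llvm.masked.load.
;; Operands: pointer, alignment, mask, passthru (as for the masked load).
;; Result: the loaded data plus a predicate of successfully loaded lanes.
declare { <8 x i16>, <8 x i1> } @llvm.masked.load.ff.v8i16.p0(ptr, i32, <8 x i1>, <8 x i16>)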

This work would mostly consist of replacing normal masked loads with the first-faulting variant for loads that feed the loop exit condition, then combining the resulting predicate with the other exit conditions to decide whether to leave the vector loop.
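
Continuing the earlier vector.body sketch with that hypothetical intrinsic (value names reused from that example), the exit-condition check might end up looking something like this:

  ;; First-faulting load of the next group's exit-condition data; no
  ;; dereferenceability guarantee is needed beyond the first active lane.
  %ff = call { <8 x i16>, <8 x i1> } @llvm.masked.load.ff.v8i16.p0(ptr %ee.next.addr, i32 4, <8 x i1> %active.lane.mask, <8 x i16> poison)
  %ee.ff.data = extractvalue { <8 x i16>, <8 x i1> } %ff, 0
  %ee.ff.valid = extractvalue { <8 x i16>, <8 x i1> } %ff, 1
  ;; Only lanes that were both requested and successfully loaded can
  ;; contribute to the early-exit condition.
  %ee.known = and <8 x i1> %ee.ff.valid, %active.lane.mask
  %ee.cmp = icmp sgt <8 x i16> %ee.ff.data, splat (i16 500)
  %ee.hit = and <8 x i1> %ee.cmp, %ee.known
  %ee.any = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ee.hit)
  ;; If any requested lane failed to load, also leave the vector loop so the
  ;; scalar loop can take (or legitimately avoid) the fault.
  %ee.missing = xor <8 x i1> %ee.known, %active.lane.mask
  %any.missing = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ee.missing)
  %combined.ee = or i1 %ee.any, %any.missing
  ;; %combined.ee is then combined with the counted exit as before.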

Comments welcome.

Nearly all architectures support “speculative” loads if they’re fully aligned, because the whole vector can only be in one page: either the first element faults because that page would fault, or the entire vector is loaded successfully.

Of course that’s just the assembly-level semantics; IIRC LLVM doesn’t have an instruction that does that, though it could be useful to add. I’d expect the semantics to be something like returning poison in all lanes that are outside of an allocation, while the load itself is only UB if it’s not completely inside the pages containing the allocation. So, e.g., a lame strlen using that kind of speculative load (I didn’t test it, but it probably works except for the TODO):

https://llvm.godbolt.org/z/vEfeW545f

define i64 @my_strlen(ptr %p) {
start:
    %addr = ptrtoint ptr %p to i64
    %align = and i64 %addr, 15
    %aligned = icmp eq i64 %align, 0
    br i1 %aligned, label %fast_path, label %not_aligned

fast_path:
    %old_len = phi i64 [0, %start], [%len, %fast_path]
    %fast_p = phi ptr [%p, %start], [%next_p, %fast_path]
    %bytes = call <16 x i8> @llvm_speculative_load_16xi8_align_16(ptr %fast_p)
    %frozen_bytes = freeze <16 x i8> %bytes
    %found_zeros = icmp eq <16 x i8> %frozen_bytes, zeroinitializer
    %found_zeros2 = bitcast <16 x i1> %found_zeros to i16
    %zero_index2 = call i16 @llvm.cttz.i16(i16 %found_zeros2, i1 false)
    %zero_index = zext i16 %zero_index2 to i64
    %len = add i64 %zero_index, %old_len
    %next_p = getelementptr inbounds i8, ptr %fast_p, i64 16
    %done = icmp ne i64 %zero_index, 16
    br i1 %done, label %finish, label %fast_path

finish:
    ret i64 %len

not_aligned:
    ; TODO
    ret i64 0
}

define private <16 x i8> @llvm_speculative_load_16xi8_align_16(ptr %p) {
    ; we don't have a speculative instruction, so just use load as a demo
    %retval = load <16 x i8>, ptr %p, align 16
    ret <16 x i8> %retval
}

If you can guarantee alignment of all loads required for determining the condition, yes. David Sherwood has a patch to implement loop versioning based on that: https://github.com/llvm/llvm-project/pull/120603

Peeling to align would be another option, one that could be done nicely on SVE via predication.
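
As a rough sketch of the predicate computation involved (assuming i16 elements, an 8-element vector, a pointer that is at least element-aligned, and reusing the %pred name from the earlier example):

  ;; Number of elements remaining before %pred reaches a 16-byte boundary.
  %addr = ptrtoint ptr %pred to i64
  %misalign.bytes = and i64 %addr, 15
  %misalign.elts = lshr i64 %misalign.bytes, 1   ;; bytes -> i16 elements
  %peel.elts = sub i64 8, %misalign.elts
  ;; Predicate covering only the peeled elements; after this first iteration,
  ;; vector loads of %pred are 16-byte aligned and stay within one page.
  %peel.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i64(i64 0, i64 %peel.elts)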

However, we have a couple of potential problems. One is having multiple loads required for determining exit conditions whose pointers don’t become aligned on the same iteration. This won’t be a problem immediately, since we don’t intend to increase the number of supported uncounted exits yet.

Another potential problem is one discussed in David’s PR: some architectures have sub-page-level access protection, so aligning to avoid crossing page boundaries in a single access may still cause faults with vector loads that cross protection-granule boundaries. AArch64 has the MTE (memory tagging) extension for this, and SPARC has ADI (though the LLVM SPARC backend seems to be missing support for a lot of features). Similar features have been requested for other architectures.

That could be handled by reducing the “page size” in the definition to 1 << floor(log2(minimum_protection_granule)).

For MTE, you can also temporarily disable it by setting and then restoring an MSR bit; I don’t know if that works from user mode or whether it requires running in the kernel though:

Tag checking can also be disabled for a user thread by setting the PSTATE.TCO bit with MSR TCO, #1.

The granularity of MTE is 16 bytes, so any processor implementing both MTE and SVE with a vector length greater than 128 bits wouldn’t be able to use a pseudo-page size, since a load would always touch multiple granules, some of which may have different tags.

I don’t think we should consider disabling security features as part of optimization.

So I think separate intrinsics are needed: one similar to what you describe, which can be used on many targets by ensuring alignment where possible, and a second for a more limited set of targets and loops where first-faulting behaviour is required (either because something like MTE is enabled, or because we need gathers to determine whether an exit has occurred).

The preference would be for loopvec to use the alignment-based approach if that’s suitable for the loop.

Is there a good way for frontends to convey this information down to LLVM?

It’s typical in Rust for this to be known thanks to all the reads happening inside a &[T], so it’d be great if we could take advantage of this awesome new vectorization work – it’s one of the biggest missed optimization sadnesses I see from people, who end up very surprised that the obvious loop doesn’t just auto-vectorize to something like memchr.

This should be doable via dereferenceable assume bundles. At the moment they already work for constant sizes; “[Loads] Support dereferenceable assumption with variable size” by fhahn (llvm/llvm-project#128436) should also add support for dynamic sizes.

Clang recently got a builtin which can be used to convey the required info in C/C++, and libraries like libc++ should ideally use it for data types where dereferenceability is known (e.g. std::vector).
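
For illustration, the IR-level form of such a hint is an assume with a “dereferenceable” operand bundle; with the variable-size support from the linked PR, the size operand can be computed from a runtime length (the %p and %n names below are just placeholders):

  ;; Assert that %n * 4 bytes starting at %p can be read without faulting.
  %bytes = shl nuw i64 %n, 2
  call void @llvm.assume(i1 true) [ "dereferenceable"(ptr %p, i64 %bytes) ]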

I spent some time looking at early exit vectorization a few years back. I’m no longer really looking at this, but my notes from the time might be helpful: public-notes/multiple-exit-vectorization.rst in preames/public-notes on GitHub.

As mentioned in the community vectorizer call, I have a draft PR for supporting stores in an early exit loop: “[LV] Initial support for stores in early exit loops” (llvm/llvm-project#137774).

The matching code in the VPlan transform is overly specific to my current loops, partly because the class hierarchy in VPlan doesn’t match IR and I don’t have a common base type to use when cloning recipes (e.g. VPRecipeBase is not a VPValue, whereas Instruction is a Value). I wonder whether it’s best to keep this as a VPlan transform, or whether the duplication should be moved earlier to work directly on the scalar IR. Any preferences or comments?

For now I’ll precommit the tests and split off some minor parts into separate PRs, then take another look at improving the transform.