Following on from David Sherwood’s work over the last year to support basic early exit loops, we’d like to propose a design for the next phase of that work.
Prior Work
David extended the vectorizer (along with scalar evolution, LAA, etc.) to support autovectorization of loops with an early exit, subject to the following restrictions:
- No stores in the loop
- No loads after the one used for the early exit
- No reductions in the loop
- The loop must have a single counted exit (a linear induction variable compared against a loop-invariant value)
- The loop must have a single uncounted exit (a comparison of a value loaded from memory against a loop-invariant value)
- All memory that may be accessed in the loop must be known to be dereferenceable at compile time, so that, assuming the uncounted early exit is never taken, the loop will not fault up to the maximum bound of the counted exit.
A sample loop that should vectorize today (when early-exit loopvec is enabled via a flag):
// Global array declared with known size, so that the compiler
// knows all 'N' iterations are safe to execute.
#define N (100)
int array[N];

int search_first_occurrence(int value) {
  int result = -1;
  for (int i = 0; i < N; i++)
    if (array[i] == value) {
      result = i;
      break;
    }
  return result;
}
While hopefully all of these restrictions will be removed eventually, the first two Arm intends to address are supporting extra memory operations in the loop and removing the requirement for compile-time knowledge of memory bounds.
Supporting additional memory operations
At present, early exit loops featuring stores (or loads after the early exit) will be rejected by LV legality checking. We would like to remove that restriction.
Given a C loop like the following:
void ee_with_store(int * restrict array, int *pred, int N) {
  for (int i = 0; i < N; ++i) {
    array[i]++;
    if (pred[i] > 500)
      break;
  }
}
We have an early exit loop based on pred[i], with a load and a store to array[i] occurring before the exit. We will see scalar IR similar to the following (shown here with i16 elements and a fixed trip count of 20):
define void @ee_with_store(ptr dereferenceable(40) noalias %array, ptr dereferenceable(40) readonly %pred) {
entry:
br label %for.body
for.body:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.inc ]
%st.addr = getelementptr inbounds nuw i16, ptr %array, i64 %iv
%data = load i16, ptr %st.addr, align 4
%inc = add nsw i16 %data, 1
store i16 %inc, ptr %st.addr, align 4
%ee.addr = getelementptr inbounds nuw i16, ptr %pred, i64 %iv
%ee.val = load i16, ptr %ee.addr, align 4
%ee.cond = icmp sgt i16 %ee.val, 500
br i1 %ee.cond, label %exit, label %for.inc
for.inc:
%iv.next = add nuw nsw i64 %iv, 1
%counted.cond = icmp eq i64 %iv.next, 20
br i1 %counted.cond, label %exit, label %for.body
exit:
ret void
}
(Dereferenceable attributes added to stay within known bounds)
In order to support the stores, we will need to either mask them based on the combined exit conditions, or keep them unconditional but bail out to the scalar loop in the event of an exit. Since the store occurs before the early exit condition is evaluated, by the time an exit is detected the stores for that vector iteration would already have run. We therefore need to determine, at the end of each vector iteration, whether an exit will occur in the next one, and either generate the mask for it or leave the vector loop based on that.
Doing so requires that we identify the IR responsible for the exit condition, and after checking to make sure it’s safe to do so, duplicate the vectorized code for it in the preheader, similar to the following:
vector.preheader:
%ph.ee.load = load <8 x i16>, ptr %pred, align 4
%ph.ee.cmp = icmp sgt <8 x i16> %ph.ee.load, splat(i16 500)
;; Bailing out immediately just needs an 'any' (or) reduction of the exit lanes.
%ph.ee.cond = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ph.ee.cmp)
br i1 %ph.ee.cond, label %scalar.ph, label %vector.body
vector.body:
%index = phi i64 [ 0, %vector.preheader ], [ %index.next, %vector.body ]
;; Normal vectorized loop body.
%vaddr = getelementptr inbounds nuw i16, ptr %array, i64 %index
%vload = load <8 x i16>, ptr %vaddr, align 4
%vinc = add nsw <8 x i16> %vload, splat (i16 1)
store <8 x i16> %vinc, ptr %vaddr, align 4
%index.next = add i64 %index, 8
;; Find whether there's an exit in the next vector iteration, from either
;; the counted or uncounted conditions.
;;
;; EE load in loop must be masked to avoid going beyond end of known loop.
;; In this case with an upper bound of 20 and a VF of 8, we would go
;; 4 elements beyond the end of what's known to be dereferenceable.
%active.lane.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i64(i64 %index.next, i64 20)
%ee.next.addr = getelementptr inbounds nuw i16, ptr %pred, i64 %index.next
%ee.masked.load = call <8 x i16> @llvm.masked.load.v8i16.p0(ptr nonnull %ee.next.addr, i32 4, <8 x i1> %active.lane.mask, <8 x i16> poison)
%ee.cmp = icmp sgt <8 x i16> %ee.masked.load, splat(i16 500)
%ee.masked = and <8 x i1> %ee.cmp, %active.lane.mask
%ee.any = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ee.masked)
;; Combine EE with counted exit
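;; (16 is the vector trip count: the scalar trip count of 20 rounded down to a
;;  multiple of the VF of 8.)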
%counted = icmp eq i64 %index.next, 16
%combined = or i1 %counted, %ee.any
br i1 %combined, label %middle.block, label %vector.body
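For comparison, the other option mentioned earlier, masking the stores, would instead compute the early-exit condition for the current vector iteration before the store, and predicate the store on all lanes up to and including the first exiting lane (the scalar loop still performs the store on the iteration that takes the exit). A rough sketch of how that mask could be built, using the existing @llvm.experimental.cttz.elts intrinsic; the new value names are purely illustrative:
;; Exit condition for the *current* vector iteration, computed before the store
;; (safe to load unconditionally here because the bounds are known).
%cur.ee.addr = getelementptr inbounds nuw i16, ptr %pred, i64 %index
%cur.ee.load = load <8 x i16>, ptr %cur.ee.addr, align 4
%cur.ee.cmp = icmp sgt <8 x i16> %cur.ee.load, splat(i16 500)
;; Index of the first exiting lane (8 if no lane exits).
%first.exit = call i64 @llvm.experimental.cttz.elts.i64.v8i1(<8 x i1> %cur.ee.cmp, i1 false)
;; Lanes 0 up to and including the first exiting lane still execute the store.
%store.bound = add i64 %first.exit, 1
%store.mask = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i64(i64 0, i64 %store.bound)
call void @llvm.masked.store.v8i16.p0(<8 x i16> %vinc, ptr %vaddr, i32 4, <8 x i1> %store.mask)
We don’t propose doing this initially; the extra per-iteration mask computation is part of why bailing out to the scalar loop looks cheaper (see ‘Handling exits’ below).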
LV legality can identify the IR responsible for the early exits and determine whether duplication is safe; a VPlan transform can then modify the plan to duplicate the condition in the preheader, change the recipes in the vector body to load the condition for the next vector iteration, and create any phi nodes required.
Sample VPlan (done by hand, so there are probably some bits missing; hopefully the intent is clear):
ir-bb<vector.ph>:
IR %2 = call i64 @llvm.vscale.i64()
IR %3 = mul i64 %2, 4
IR %n.mod.vf = urem i64 200, %3
IR %n.vec = sub i64 200, %n.mod.vf
IR %4 = call i64 @llvm.vscale.i64()
IR %5 = mul i64 %4, 4
IR %ph.ee.load = load <8 x i16>, ptr %pred, align 4
IR %ph.ee.cmp = icmp sgt <8 x i16> %ph.ee.load, splat(i16 500)
IR %ph.ee.cond = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ph.ee.cmp)
IR br i1 %ph.ee.cond, label %scalar.ph, label %vector.body
Successor(s): vector loop, ir-bb<scalar.ph>
<x1> vector loop: {
vector.body:
SCALAR-PHI vp<%0> = phi ir<0>, vp<%index.next>
ir<%indvars.iv> = WIDEN-INDUCTION ir<0>, ir<1>, ir<%5>
vp<%1> = SCALAR-STEPS vp<%0>, ir<1>
CLONE ir<%arrayidx> = getelementptr inbounds nuw ir<%data>, ir<0>, vp<%1>
vp<%2> = vector-pointer ir<%arrayidx>
WIDEN ir<%6> = load vp<%2>
WIDEN ir<%inc> = add nsw ir<%6>, ir<1>
vp<%3> = vector-pointer ir<%arrayidx>
WIDEN store vp<%3>, ir<%inc>
EMIT vp<%index.next> = add nuw vp<%0>, ir<%5>.1
vp<%4> = SCALAR-STEPS vp<%index.next>, ir<1>
CLONE ir<%arrayidx2> = getelementptr inbounds nuw ir<%pred>, ir<0>, vp<%4>
vp<%5> = vector-pointer ir<%arrayidx2>
WIDEN ir<%6> = load vp<%5>
WIDEN ir<%cmp1> = icmp sgt ir<%6>, ir<500>
EMIT vp<%6> = any-of ir<%cmp1>
EMIT vp<%7> = icmp eq vp<%index.next>, ir<%n.vec>
EMIT vp<%8> = or vp<%6>, vp<%7>
EMIT branch-on-cond vp<%8>
No successors
}
Successor(s): ir-bb<middle.block>
ir-bb<middle.block>:
Successor(s): ir-bb<scalar.ph>
ir-bb<scalar.ph>:
EMIT vp<%bc.resume.val> = resume-phi ... // equivalent to [ 0, %vector.ph ], [ %index.next, %middle.block ]
br label %for.body
Handling exits
Our initial proposal is to bail out to the scalar loop to handle exits, as the least intrusive option to start with. Creating masks in either the loop body or a vectorized tail can be expensive, so it will likely be cheaper to just handle exits via scalar code. We will consider enabling tail-folding later, once the core feature has landed.
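For the example above, that means the middle block simply forwards to the scalar preheader, with the resume value being the index at which the vector loop stopped; the scalar loop then runs any remaining iterations and takes the early exit itself. A rough sketch, reusing the block and value names from the earlier IR:
middle.block:
;; All iterations below %index.next have fully executed, whether we left
;; because of the counted exit or because an exit was detected for the next
;; vector iteration, so there is nothing to undo here.
br label %scalar.ph
scalar.ph:
%bc.resume.val = phi i64 [ 0, %vector.preheader ], [ %index.next, %middle.block ]
br label %for.body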
Supporting unbounded memory access
The current early exit feature will only vectorize when all memory accesses are known at compile time to be dereferenceable up to the maximum number of scalar loop iterations. This conflicts with common C usage patterns, where a plain pointer is passed alongside an integer that is supposed to give the number of elements (of the pointee type) that can safely be dereferenced, but nothing in the language enforces that.
double some_func(double *in, int *conditions, int size)
We would like to be able to vectorize at least some loops where the bounds are not guaranteed at compile time and must be determined during execution based on data loaded from memory.
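For instance, a hypothetical body for the signature above might look like the following; since size is only a runtime value, the compiler cannot prove that all size elements of in and conditions are dereferenceable, so the loop is rejected today:
double some_func(double *in, int *conditions, int size) {
  double result = 0.0;
  for (int i = 0; i < size; ++i) {
    if (conditions[i] > 500) {
      result = in[i];
      break;
    }
  }
  return result;
}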
Speculative Loads
SVE has a set of instructions known as ‘first-faulting loads’, which (as the name suggests) will only fault on the first lane. Any subsequent lane whose memory location is invalid will not trigger a fault; instead the access is recorded as invalid in an extra dedicated predicate register (‘FFR’). This register can then be combined with normal predicate registers to create a predicate which excludes faulting lanes. This will allow us to safely perform the loads needed to determine whether an exit has occurred, and therefore mask the other memory operations in the loop according to what the scalar loop would do.
In the event that the loop termination condition is not met before a fault is encountered, we would either fault on the first lane, or exit the vector loop, resume in the scalar loop, and trigger the fault there; so the vector loop would still fault whenever the scalar loop would.
There doesn’t seem to be any support for such speculative loads on X86, SX-Aurora (though it does have ‘dismissable’ scalar loads which discard faults?), or POWER. The RISC-V V extension does include first-faulting loads, but only for contiguous (unit-stride) loads, whereas SVE can perform them for gathers too. RVV first-faulting loads also modify the vector length (the EVL) instead of setting a predicate.
The current proposal is to add a first-faulting version of the masked load intrinsic, returning a struct consisting of the loaded data and a predicate indicating the valid lanes. The RISC-V folks could add an EVL-based variant if desired.
This work would mostly just replace normal masked loads with the first faulting variant for loads that are part of determining the loop exit condition, then combine the resulting predicate with other exit conditions to decide whether to exit the vector loop.
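To make that concrete, a rough sketch of what such an intrinsic and its use could look like; the name, signature, and fixed-width types here are placeholders for illustration, not a settled design:
;; Hypothetical first-faulting counterpart to llvm.masked.load: the first
;; active lane faults exactly as a scalar load would, while later lanes that
;; would fault are instead reported as invalid in the returned predicate.
declare { <8 x i16>, <8 x i1> } @llvm.masked.load.ff.v8i16.p0(ptr, i32, <8 x i1>, <8 x i16>)

%ff = call { <8 x i16>, <8 x i1> } @llvm.masked.load.ff.v8i16.p0(ptr %ee.addr, i32 4, <8 x i1> %mask, <8 x i16> poison)
%ee.data = extractvalue { <8 x i16>, <8 x i1> } %ff, 0
%ee.valid = extractvalue { <8 x i16>, <8 x i1> } %ff, 1
;; Only lanes that were actually loaded can contribute to the exit decision.
%ee.cmp = icmp sgt <8 x i16> %ee.data, splat(i16 500)
%ee.masked = and <8 x i1> %ee.cmp, %ee.valid
%ee.any = call i1 @llvm.vector.reduce.or.v8i1(<8 x i1> %ee.masked)
If no exit is found among the valid lanes but some lanes were invalid, the vector loop would also have to stop so that the scalar loop can make forward progress (and take the fault itself if it genuinely reaches the faulting element).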
Comments welcome.