[SLP] Enable 64-bit wide vectorization on AArch64
ARM NEON has native support for half-sized (64-bit) vector registers.  This
is beneficial, for example, for 2D and 3D graphics.  This patch adds the
option to lower MinVecRegSize from 128 via a TTI hook in the SLP vectorizer.
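To make the motivation concrete, here is a minimal C++ sketch (not taken from the patch; names are illustrative) of the kind of 2D-graphics scalar code this enables: two adjacent float adds that, with a 64-bit minimum vector register width, the SLP vectorizer can merge into a single <2 x float> fadd on AArch64.

```cpp
#include <cassert>

struct Vec2 { float x, y; };

// Two independent scalar fadds on adjacent fields: exactly the shape the
// SLP vectorizer can fuse into one <2 x float> fadd once the minimum
// vector register width is lowered to 64 bits.
Vec2 add2(Vec2 a, Vec2 b) {
  Vec2 r;
  r.x = a.x + b.x;
  r.y = a.y + b.y;
  return r;
}
```

Whether the fusion actually happens depends on the cost model and subtarget; the test added by this patch checks the equivalent pattern at the IR level.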

*** Performance Analysis

This change was motivated by some internal benchmarks, but it is also
beneficial on SPEC and the LLVM test suite.

The results are with -O3 and PGO.  A negative percentage is an improvement.
The testsuite was run with a sample size of 4.

** SPEC

* CFP2006/482.sphinx3  -3.34%

A pretty hot loop is SLP-vectorized, resulting in a nice reduction in
instruction count.  This used to be a +22% regression before rL299482.

* CFP2000/177.mesa     -3.34%
* CINT2000/256.bzip2   +6.97%

My current plan is to extend the fix in rL299482 to i16, which brings the
regression down to +2.5%.  There are also other problems with the codegen in
this loop, so there is further room for improvement.

** LLVM testsuite

* SingleSource/Benchmarks/Misc/ReedSolomon               -10.75%

There are multiple small SLP vectorizations outside the hot code.  It's a bit
surprising that they add up to 10%.  Some of this may be code-layout noise.

* MultiSource/Benchmarks/VersaBench/beamformer/beamformer -8.40%

The opt-viewer screenshot can be seen at F3218284.  We start at a colder store
but the tree leads us into the hottest loop.

* MultiSource/Applications/lambda-0.1.3/lambda            -2.68%
* MultiSource/Benchmarks/Bullet/bullet                    -2.18%

This is using 3D vectors.

* SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists +6.67%

Noise; the binary is unchanged.

* MultiSource/Benchmarks/Ptrdist/anagram/anagram          +4.90%

There is an additional SLP vectorization in the cold code.  The test runs for
~1 sec and prints over 2000 lines.  This is most likely noise.

* MultiSource/Applications/aha/aha                        +1.63%
* MultiSource/Applications/JM/lencod/lencod               +1.41%
* SingleSource/Benchmarks/Misc/richards_benchmark         +1.15%

Differential Revision: https://reviews.llvm.org/D31965

llvm-svn: 303116
anemet committed May 15, 2017
1 parent bd6e9e7 commit e29686e
Showing 8 changed files with 58 additions and 1 deletion.
7 changes: 7 additions & 0 deletions llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -537,6 +537,9 @@ class TargetTransformInfo {
   /// \return The width of the largest scalar or vector register type.
   unsigned getRegisterBitWidth(bool Vector) const;
 
+  /// \return The width of the smallest vector register type.
+  unsigned getMinVectorRegisterBitWidth() const;
+
   /// \return True if it should be considered for address type promotion.
   /// \p AllowPromotionWithoutCommonHeader Set true if promoting \p I is
   /// profitable without finding other extensions fed by the same input.
@@ -840,6 +843,7 @@ class TargetTransformInfo::Concept {
                              Type *Ty) = 0;
   virtual unsigned getNumberOfRegisters(bool Vector) = 0;
   virtual unsigned getRegisterBitWidth(bool Vector) = 0;
+  virtual unsigned getMinVectorRegisterBitWidth() = 0;
   virtual bool shouldConsiderAddressTypePromotion(
       const Instruction &I, bool &AllowPromotionWithoutCommonHeader) = 0;
   virtual unsigned getCacheLineSize() = 0;
@@ -1076,6 +1080,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
   unsigned getRegisterBitWidth(bool Vector) override {
     return Impl.getRegisterBitWidth(Vector);
   }
+  unsigned getMinVectorRegisterBitWidth() override {
+    return Impl.getMinVectorRegisterBitWidth();
+  }
   bool shouldConsiderAddressTypePromotion(
       const Instruction &I, bool &AllowPromotionWithoutCommonHeader) override {
     return Impl.shouldConsiderAddressTypePromotion(
2 changes: 2 additions & 0 deletions llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -311,6 +311,8 @@ class TargetTransformInfoImplBase {
 
   unsigned getRegisterBitWidth(bool Vector) { return 32; }
 
+  unsigned getMinVectorRegisterBitWidth() { return 128; }
+
   bool
   shouldConsiderAddressTypePromotion(const Instruction &I,
                                      bool &AllowPromotionWithoutCommonHeader) {
4 changes: 4 additions & 0 deletions llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -279,6 +279,10 @@ unsigned TargetTransformInfo::getRegisterBitWidth(bool Vector) const {
   return TTIImpl->getRegisterBitWidth(Vector);
 }
 
+unsigned TargetTransformInfo::getMinVectorRegisterBitWidth() const {
+  return TTIImpl->getMinVectorRegisterBitWidth();
+}
+
 bool TargetTransformInfo::shouldConsiderAddressTypePromotion(
     const Instruction &I, bool &AllowPromotionWithoutCommonHeader) const {
   return TTIImpl->shouldConsiderAddressTypePromotion(
8 changes: 8 additions & 0 deletions llvm/lib/Target/AArch64/AArch64Subtarget.cpp
@@ -91,6 +91,8 @@ void AArch64Subtarget::initializeProperties() {
   case Falkor:
     MaxInterleaveFactor = 4;
     VectorInsertExtractBaseCost = 2;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case Kryo:
     MaxInterleaveFactor = 4;
@@ -99,6 +101,8 @@ void AArch64Subtarget::initializeProperties() {
     PrefetchDistance = 740;
     MinPrefetchStride = 1024;
     MaxPrefetchIterationsAhead = 11;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case ThunderX2T99:
     CacheLineSize = 64;
@@ -108,6 +112,8 @@ void AArch64Subtarget::initializeProperties() {
     PrefetchDistance = 128;
     MinPrefetchStride = 1024;
     MaxPrefetchIterationsAhead = 4;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case ThunderX:
   case ThunderXT88:
@@ -116,6 +122,8 @@ void AArch64Subtarget::initializeProperties() {
     CacheLineSize = 128;
     PrefFunctionAlignment = 3;
     PrefLoopAlignment = 2;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case CortexA35: break;
   case CortexA53: break;
7 changes: 7 additions & 0 deletions llvm/lib/Target/AArch64/AArch64Subtarget.h
@@ -83,6 +83,9 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
   // NegativeImmediates - transform instructions with negative immediates
   bool NegativeImmediates = true;
 
+  // Enable 64-bit vectorization in SLP.
+  unsigned MinVectorRegisterBitWidth = 64;
+
   bool UseAA = false;
   bool PredictableSelectIsExpensive = false;
   bool BalanceFPOps = false;
@@ -191,6 +194,10 @@ class AArch64Subtarget final : public AArch64GenSubtargetInfo {
 
   bool isXRaySupported() const override { return true; }
 
+  unsigned getMinVectorRegisterBitWidth() const {
+    return MinVectorRegisterBitWidth;
+  }
+
   bool isX18Reserved() const { return ReserveX18; }
   bool hasFPARMv8() const { return HasFPARMv8; }
   bool hasNEON() const { return HasNEON; }
4 changes: 4 additions & 0 deletions llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -87,6 +87,10 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
     return 64;
   }
 
+  unsigned getMinVectorRegisterBitWidth() {
+    return ST->getMinVectorRegisterBitWidth();
+  }
+
   unsigned getMaxInterleaveFactor(unsigned VF);
 
   int getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
5 changes: 4 additions & 1 deletion llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -316,7 +316,10 @@ class BoUpSLP {
     else
       MaxVecRegSize = TTI->getRegisterBitWidth(true);
 
-    MinVecRegSize = MinVectorRegSizeOption;
+    if (MinVectorRegSizeOption.getNumOccurrences())
+      MinVecRegSize = MinVectorRegSizeOption;
+    else
+      MinVecRegSize = TTI->getMinVectorRegisterBitWidth();
   }
 
   /// \brief Vectorize the tree that starts with the elements in \p VL.
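The SLPVectorizer change follows a standard LLVM pattern: an explicit -slp-min-reg-size on the command line wins, otherwise the target's TTI default applies. A minimal standalone sketch of that selection logic, using a hypothetical stand-in for cl::opt (only the pieces needed here):

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::cl::opt<unsigned>: records whether the
// user actually passed the flag, mirroring cl::opt::getNumOccurrences().
struct FlagOption {
  unsigned Value = 0;
  unsigned Occurrences = 0;
  unsigned getNumOccurrences() const { return Occurrences; }
  operator unsigned() const { return Value; }
};

// Mirrors the BoUpSLP constructor: the command-line value wins when given;
// otherwise fall back to the target's reported minimum vector width.
unsigned pickMinVecRegSize(const FlagOption &Flag, unsigned TTIDefault) {
  return Flag.getNumOccurrences() ? unsigned(Flag) : TTIDefault;
}
```

With this shape, a subtarget that opts out (as Kryo and ThunderX do above) simply reports 128 as its default, and the user can still force either behavior from the command line.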
22 changes: 22 additions & 0 deletions llvm/test/Transforms/SLPVectorizer/AArch64/64-bit-vector.ll
@@ -0,0 +1,22 @@
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic < %s | FileCheck %s
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64-apple-ios -mcpu=cyclone < %s | FileCheck %s
+; Currently disabled for a few subtargets (e.g. Kryo):
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=kryo < %s | FileCheck --check-prefix=NO_SLP %s
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic -slp-min-reg-size=128 < %s | FileCheck --check-prefix=NO_SLP %s
+
+define void @f(float* %r, float* %w) {
+  %r0 = getelementptr inbounds float, float* %r, i64 0
+  %r1 = getelementptr inbounds float, float* %r, i64 1
+  %f0 = load float, float* %r0
+  %f1 = load float, float* %r1
+  %add0 = fadd float %f0, %f0
+; CHECK: fadd <2 x float>
+; NO_SLP: fadd float
+; NO_SLP: fadd float
+  %add1 = fadd float %f1, %f1
+  %w0 = getelementptr inbounds float, float* %w, i64 0
+  %w1 = getelementptr inbounds float, float* %w, i64 1
+  store float %add0, float* %w0
+  store float %add1, float* %w1
+  ret void
+}
