Halide icon indicating copy to clipboard operation
Halide copied to clipboard

[x86] Separate vector instruction selection and CodeGen passes

Open rootjalex opened this issue 3 years ago • 48 comments

Left as a draft for now because this work is unfinished, but I was hoping to get feedback.

This PR is intended to create a separate vector-instruction-selection pass for x86, akin to HexagonOptimize. That pass is defined in X86Optimize.(h | cpp), and as many rules as possible are written via the IRMatch.h templating. In order to successfully create this TRS, this PR also adds:

  • a VectorInstruction IR Node, which is not restricted to element-wise operations (i.e. it supports dot_products)
  • updates to IRMatch.h, i.e. supporting bitwise operations and type hints via a new typed() method

Work that still needs to be done (possibly):

  • [x] Implement VectorReduce logic similar to CodeGen_LLVM::codegen_vector_reduce to split up VectorReduce nodes
  • [x] Implement a base class for ^ and possibly for division + fixed-point intrinsics lowering (?)
  • [x] Add the SapphireRapids float32 + bfloat16 accumulating dot_product pattern, and the psadbw from #6872

This TRS addresses the issue that @abadams raised on #6878, namely that rewriting horizontal_widening_add directly into Halide IR that corresponds to the dot_product instructions is cleaner and more extensible than the implementation in that PR.

I am tagging people that might be interested, but would love feedback from anyone (especially on the points above, and anywhere in the code that says TODO / FIXME).

rootjalex avatar Jul 25 '22 22:07 rootjalex

@steven-johnson Thanks for the review! I believe I addressed all of your comments in ec2cd4e, other than fixing the weird IRVisitor / IRMutator behavior on VectorInstruction nodes - I'm still debugging a piece of that

rootjalex avatar Jul 28 '22 16:07 rootjalex

It appears that the HVX failures here are actually due to a bug in CSE that is removing the definition of hexagon_module_state before it's used - I am struggling to make a minimal example, so I will try to make a separate PR fixing that issue.

rootjalex avatar Jul 28 '22 17:07 rootjalex

Hexagon failures should be fixed by #6894. I don't know what the windows failures are though, and I don't have a windows machine to debug on yet.

rootjalex avatar Jul 28 '22 22:07 rootjalex

Monday Morning Review Ping -- where does this PR stand?

steven-johnson avatar Aug 22 '22 16:08 steven-johnson

I need to get access to a Windows machine to debug the winbot failures. I will try to get access to one sometime this week - other than debugging that, this PR should be good to review.

rootjalex avatar Aug 22 '22 17:08 rootjalex

Huh, the windows failures are indeed odd -- and the error messages unhelpful. Could it be an illegal instruction on those machines? WinBot1 and 2 are pretty old.

steven-johnson avatar Aug 22 '22 17:08 steven-johnson

Huh, the windows failures are indeed odd -- and the error messages unhelpful.

Indeed :')

Could it be an illegal instruction on those machines?

That is quite possible - I thought I had safely guarded all instructions, but it could be that I missed some.

WinBot1 and 2 are pretty old.

Do you know which features they should have enabled? Are they just SSE2?

rootjalex avatar Aug 22 '22 17:08 rootjalex

This is weird: if I run correctness_simd_op_check in cmd.exe, I get normal output and no failures. If, otoh, I run it in the "git bash" shell, I get no output at all... but I eventually get a segfault.

steven-johnson avatar Aug 22 '22 21:08 steven-johnson

.....That is super strange. I don't even know how to begin debugging that.

rootjalex avatar Aug 22 '22 21:08 rootjalex

I did a litting debugging with WinDbg and eventually got a failure mode for simd_op_check that said:

HEAP[correctness_simd_op_check.exe]: HEAP: Free Heap block 00000144984FAFE0 modified at 00000144984FB058 after it was freed
(2954.1e4): Break instruction exception - code 80000003 (first chance)

with a stacktrace of

[0x6]   ucrtbase!_malloc_base + 0x36   
[0x7]   Halide!operator new + 0x1f   
[0x8]   Halide!Halide::Internal::VectorInstruction::make + 0x87   
[0x9]   Halide!Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr>::make + 0xdd   
[0xa]   Halide!Halide::Internal::IRMatcher::TypeHint<Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr> >::make + 0xe4   
[0xb]   Halide!Halide::Internal::IRMatcher::CastOp<Halide::Internal::IRMatcher::TypeHint<Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr> > >::make + 0xe4   
[0xc]   Halide!Halide::Internal::IRMatcher::Rewriter<Halide::Internal::IRMatcher::VectorReduceOp<Halide::Internal::IRMatcher::SpecificExpr,Halide::Internal::IRMatcher::IntLiteral,0> >::build_replacement<Halide::Internal::IRMatcher::CastOp<Halide::Internal::IRMatcher::TypeHint<Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr> > > > + 0x104   
[0xd]   Halide!Halide::Internal::`anonymous namespace'::Optimize_X86::visit + 0x164f   
[0xe]   Halide!Halide::Internal::ExprNode<Halide::Internal::VectorReduce>::mutate_expr + 0x1b   
[0xf]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x10]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x11]   Halide!Halide::Internal::`anonymous namespace'::mutate_binary_operator<Halide::Internal::Add> + 0x29   
[0x12]   Halide!Halide::Internal::IRMutator::visit + 0x14   
[0x13]   Halide!Halide::Internal::`anonymous namespace'::Optimize_X86::visit + 0x1de9   
[0x14]   Halide!Halide::Internal::ExprNode<Halide::Internal::Add>::mutate_expr + 0x1b   
[0x15]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x16]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x17]   Halide!Halide::Internal::IRMutator::visit + 0x46   
[0x18]   Halide!Halide::Internal::StmtNode<Halide::Internal::Store>::mutate_stmt + 0x18   
[0x19]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x1a]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x1b]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x1c]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x1d]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x1e]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x1f]   Halide!Halide::Internal::IRMutator::visit + 0x1d   
[0x20]   Halide!Halide::Internal::StmtNode<Halide::Internal::ProducerConsumer>::mutate_stmt + 0x1b   
[0x21]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x22]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x23]   Halide!Halide::Internal::IRMutator::visit + 0x26   
[0x24]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x25]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x26]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x27]   Halide!Halide::Internal::IRMutator::visit + 0x47   
[0x28]   Halide!Halide::Internal::StmtNode<Halide::Internal::Allocate>::mutate_stmt + 0x18   
[0x29]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x2a]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x2b]   Halide!Halide::Internal::IRMutator::visit + 0x55   
[0x2c]   Halide!Halide::Internal::StmtNode<Halide::Internal::For>::mutate_stmt + 0x18   
[0x2d]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x2e]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x2f]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x30]   Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b   
[0x31]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x32]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x33]   Halide!Halide::Internal::IRMutator::visit + 0x55   
[0x34]   Halide!Halide::Internal::StmtNode<Halide::Internal::For>::mutate_stmt + 0x18   
[0x35]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x36]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x37]   Halide!Halide::Internal::IRMutator::visit + 0x1d   
[0x38]   Halide!Halide::Internal::StmtNode<Halide::Internal::ProducerConsumer>::mutate_stmt + 0x1b   
[0x39]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x3a]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x3b]   Halide!Halide::Internal::IRMutator::visit + 0x26   
[0x3c]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x3d]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x3e]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x3f]   Halide!Halide::Internal::IRMutator::visit + 0x47   
[0x40]   Halide!Halide::Internal::StmtNode<Halide::Internal::Allocate>::mutate_stmt + 0x18   
[0x41]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x42]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x43]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x44]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x45]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x46]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x47]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x48]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x49]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x4a]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x4b]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x4c]   Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b   
[0x4d]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x4e]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x4f]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x50]   Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b   
[0x51]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x52]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x53]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x54]   Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b   
[0x55]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x56]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x57]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x58]   Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b   
[0x59]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x5a]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x5b]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x5c]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x5d]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x5e]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x5f]   Halide!Halide::Internal::IRMutator::visit + 0x3c   
[0x60]   Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18   
[0x61]   Halide!Halide::Internal::IRMutator::mutate + 0x28   
[0x62]   Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f   
[0x63]   Halide!Halide::Internal::optimize_x86_instructions + 0x45   

steven-johnson avatar Aug 22 '22 22:08 steven-johnson

FYI: the crash above happens reliably on Windows, but I haven't been able to trigger anything similar on OSX or Linux builds. I'm gonna try pulling this into an experimental branch at Google to run with more sanitizers.

steven-johnson avatar Aug 22 '22 23:08 steven-johnson

So I'm at a bit of a loss to explain the apparent use-after-free on Windows; I pulled these changes into Google and ran this test on Linux-x64 under ASAN, TSAN, and MSAN, and none of them reported any issues. I suppose it's possible that this is a Windows-specific (or MSVC-specific) issue, of course.

(One thing that would still be worth trying that I haven't done would be to build a Debug version of libHalide + simd_op_check and see if the MSVC Debug Runtime can give us any more info; unfortunately, that apparently requires building a Debug version of LLVM too, because trying to mix the different runtimes refuses at link time... and building a Debug version of LLVM is, uh, big and slow and I'm literally unsure if any of our WinBots have enough RAM to do so, period.)

steven-johnson avatar Aug 23 '22 00:08 steven-johnson

Thanks for taking the time to try to debug this!

I suspect that this is an issue with SpecificExpr, which just grabs a reference. I think I know how to solve this, but I just got access to a (very old) Windows machine that has literally nothing useful set-up on it, so it will take me a while to confirm

rootjalex avatar Aug 23 '22 02:08 rootjalex

Actually, I am pretty confident I found the bug, I am just going to use the buildbots to confirm instead of waiting for LLVM to install on this old laptop

rootjalex avatar Aug 23 '22 02:08 rootjalex

The only failure appears unrelated, so I am marking this as ready for review - note that this PR relies on #6900 , so if we end up reverting that, this PR will need to change (it uses saturating_cast intrinsic pattern matchings).

rootjalex avatar Aug 23 '22 13:08 rootjalex

I suspect that this is an issue with SpecificExpr, which just grabs a reference

If it is taking a reference to an Expr-that-can-go-stale, that would indeed fit the pattern -- unfortunately the template code in IRMatch.h is too complex for me to immediately see how this could be happening. Did you trace it all the way thru or just suspect it?

I like your proposed fix, but if this is in fact what is happening, we need to upgrade SpecificExpr to use Expr instead of const Expr & -- if the reference can go stale from the callsites you fixed, I will bet you a dollar that there are other such sites as well, either now or in the future. (Yes, this will cost a bit more reference counting, but if the alternative is crashing, performance is irrelevant).

steven-johnson avatar Aug 23 '22 16:08 steven-johnson

note that this PR relies on https://github.com/halide/Halide/pull/6900

In that case we shouldn't land this yet -- if we can't track down the failure in 6900 in the next day or two, I think we need to (temporarily) revert it pending a fix, since we know it is breaking downstream code, and it's not good policy to keep known-broken code in the main branch.

steven-johnson avatar Aug 23 '22 16:08 steven-johnson

Did you trace it all the way thru or just suspect it?

I ran into this issue last year at some point, and it caused random crashes - once I saw the SpecificExpr in your debug script I immediately suspected it as the source of the issue.

we need to upgrade SpecificExpr to use Expr instead of const Expr & -- if the reference can go stale from the callsites you fixed, I will bet you a dollar that there are other such sites as well, either now or in the future. (Yes, this will cost a bit more reference counting, but if the alternative is crashing, performance is irrelevant).

I agree, these bugs are very annoying to hunt down. @abadams any thoughts? I'm happy to change SpecificExpr in this PR or another one.

In that case we shouldn't land this yet -- if we can't track down the failure in 6900 in the next day or two, I think we need to (temporarily) revert it pending a fix, since we know it is breaking downstream code, and it's not good policy to keep known-broken code in the main branch.

Absolutely. Any chance the changes I pushed to #6961 last night fixed the issues in Google?

rootjalex avatar Aug 23 '22 16:08 rootjalex

Any chance the changes I pushed to https://github.com/halide/Halide/pull/6961 last night fixed the issues in Google?

I'll try that now.

steven-johnson avatar Aug 23 '22 17:08 steven-johnson

I'll try that now.

Thank you!

rootjalex avatar Aug 23 '22 17:08 rootjalex

IIRC, having SpecificExpr hold an Expr instead of a BaseExprNode reference is catastrophic for performance and stack space usage, because now constructing a pattern requires atomic operations, and we have stack frames where thousands of patterns are constructed but not used.

So just changing it to an Expr, while appealing, is probably not a viable fix. The IRMatch code doesn't use Exprs for a good reason.

abadams avatar Aug 23 '22 19:08 abadams

I suggest changing it to Expr to see if that's the issue, and if it is, seeing if there's some way we can force everything that might ingest an Expr in IRMatch.h to only accept a "const Expr &" and not an "Expr" or an "Expr &&". Right now it's all universal references.

abadams avatar Aug 23 '22 19:08 abadams

Oh, we already guard against it:


template<typename T>
HALIDE_ALWAYS_INLINE void assert_is_lvalue_if_expr() {
    static_assert(!std::is_same<typename std::decay<T>::type, Expr>::value || std::is_lvalue_reference<T>::value,
                  "Exprs are captured by reference by IRMatcher objects and so must be lvalues");
}

abadams avatar Aug 23 '22 19:08 abadams

v_intrin and the Call intrinsic objects should now all statically fail if we use an Expr as an arg, and I confirmed that it does indeed catch the cases that were causing the Windows failures.

rootjalex avatar Aug 24 '22 16:08 rootjalex

Nice work finding and fixing the dangling-reference stuff -- it didn't occur to me that you could insert is_lvalue... checks as a means of verification.

having SpecificExpr hold an Expr instead of a BaseExprNode reference is catastrophic for performance and stack space usage

We should capture this in a code comment, so someone (e.g. me) won't make the same suggestion in the future...

steven-johnson avatar Aug 24 '22 16:08 steven-johnson

(LGTM pending green)

steven-johnson avatar Aug 24 '22 16:08 steven-johnson

Nice work finding and fixing the dangling-reference stuff

Thanks!

it didn't occur to me that you could insert is_lvalue... checks as a means of verification.

Didn't occur to me either, I'm glad Andrew pointed me to the right method :)

We should capture this in a code comment, so someone (e.g. me) won't make the same suggestion in the future...

Done!

rootjalex avatar Aug 24 '22 17:08 rootjalex

Monday morning review ping

rootjalex avatar Sep 12 '22 18:09 rootjalex

Pulling this into Google for testing, I get many failures of the form:

Internal Error at [third_party/halide/halide/src/IR.cpp:42] Condition failed: a.type() == b.type(): Add of mismatched types
*** SIGABRT received by PID 3966 (TID 3966) on cpu 210 from PID 3966; stack trace: ***
PC: @     0x7ff6fddcf347  (unknown)  gsignal
    @     0x560160be7b34        976  FailureSignalHandler()
    @     0x7ff6fdf2a1c0  (unknown)  (unknown)
    @     0x7ff6fddcf347        136  gsignal
    @     0x7ff6fddd0797        304  abort
    @     0x56015d5ac033        112  Halide::Internal::ErrorReport::~ErrorReport()
    @     0x56015d72eb05        352  Halide::Internal::Add::make()
    @     0x56015d7446df         96  Halide::Internal::IRMutator::visit()
    @     0x56015dd1bc33        432  Halide::Internal::(anonymous namespace)::Optimize_X86::visit()

I'll try to narrow it down...

steven-johnson avatar Sep 12 '22 22:09 steven-johnson

@steven-johnson I have a suspicion that it could be the result of an issue with the top-down type inference in IRMatch.h. Any chance you could try out the small change I made to this branch on rootjalex/x86-optimize-test (d5f507e) ? Sorry about this

rootjalex avatar Sep 12 '22 23:09 rootjalex