[x86] Separate vector instruction selection and CodeGen passes
Left as a draft for now because this work is unfinished, but I was hoping to get feedback.
This PR is intended to create a separate vector-instruction-selection pass for x86, akin to HexagonOptimize. That pass is defined in X86Optimize.(h | cpp), and as many rules as possible are written via the IRMatch.h templating. In order to successfully create this TRS, this PR also adds:
- a VectorInstruction IR Node, which is not restricted to element-wise operations (i.e. it supports
dot_products) - updates to IRMatch.h, i.e. supporting bitwise operations and type hints via a new
typed()method
Work that still needs to be done (possibly):
- [x] Implement VectorReduce logic similar to
CodeGen_LLVM::codegen_vector_reduceto split up VectorReduce nodes - [x] Implement a base class for ^ and possibly for division + fixed-point intrinsics lowering (?)
- [x] Add the SapphireRapids float32 + bfloat16 accumulating
dot_productpattern, and thepsadbwfrom #6872
This TRS addresses the issue that @abadams raised on #6878, namely that rewriting horizontal_widening_add directly into Halide IR that corresponds to the dot_product instructions is cleaner and more extensible than the implementation in that PR.
I am tagging people that might be interested, but would love feedback from anyone (especially on the points above, and anywhere in the code that says TODO / FIXME).
@steven-johnson Thanks for the review! I believe I addressed all of your comments in ec2cd4e, other than fixing the weird IRVisitor / IRMutator behavior on VectorInstruction nodes - I'm still debugging a piece of that
It appears that the HVX failures here are actually due to a bug in CSE that is removing the definition of hexagon_module_state before it's used - I am struggling to make a minimal example, so I will try to make a separate PR fixing that issue.
Hexagon failures should be fixed by #6894. I don't know what the windows failures are though, and I don't have a windows machine to debug on yet.
Monday Morning Review Ping -- where does this PR stand?
I need to get access to a Windows machine to debug the winbot failures. I will try to get access to one sometime this week - other than debugging that, this PR should be good to review.
Huh, the windows failures are indeed odd -- and the error messages unhelpful. Could it be an illegal instruction on those machines? WinBot1 and 2 are pretty old.
Huh, the windows failures are indeed odd -- and the error messages unhelpful.
Indeed :')
Could it be an illegal instruction on those machines?
That is quite possible - I thought I had safely guarded all instructions, but it could be that I missed some.
WinBot1 and 2 are pretty old.
Do you know which features they should have enabled? Are they just SSE2?
This is weird: if I run correctness_simd_op_check in cmd.exe, I get normal output and no failures. If, otoh, I run it in the "git bash" shell, I get no output at all... but I eventually get a segfault.
.....That is super strange. I don't even know how to begin debugging that.
I did a litting debugging with WinDbg and eventually got a failure mode for simd_op_check that said:
HEAP[correctness_simd_op_check.exe]: HEAP: Free Heap block 00000144984FAFE0 modified at 00000144984FB058 after it was freed
(2954.1e4): Break instruction exception - code 80000003 (first chance)
with a stacktrace of
[0x6] ucrtbase!_malloc_base + 0x36
[0x7] Halide!operator new + 0x1f
[0x8] Halide!Halide::Internal::VectorInstruction::make + 0x87
[0x9] Halide!Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr>::make + 0xdd
[0xa] Halide!Halide::Internal::IRMatcher::TypeHint<Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr> >::make + 0xe4
[0xb] Halide!Halide::Internal::IRMatcher::CastOp<Halide::Internal::IRMatcher::TypeHint<Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr> > >::make + 0xe4
[0xc] Halide!Halide::Internal::IRMatcher::Rewriter<Halide::Internal::IRMatcher::VectorReduceOp<Halide::Internal::IRMatcher::SpecificExpr,Halide::Internal::IRMatcher::IntLiteral,0> >::build_replacement<Halide::Internal::IRMatcher::CastOp<Halide::Internal::IRMatcher::TypeHint<Halide::Internal::IRMatcher::VectorInstructionOp<Halide::Internal::IRMatcher::Wild<0>,Halide::Internal::IRMatcher::SpecificExpr> > > > + 0x104
[0xd] Halide!Halide::Internal::`anonymous namespace'::Optimize_X86::visit + 0x164f
[0xe] Halide!Halide::Internal::ExprNode<Halide::Internal::VectorReduce>::mutate_expr + 0x1b
[0xf] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x10] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x11] Halide!Halide::Internal::`anonymous namespace'::mutate_binary_operator<Halide::Internal::Add> + 0x29
[0x12] Halide!Halide::Internal::IRMutator::visit + 0x14
[0x13] Halide!Halide::Internal::`anonymous namespace'::Optimize_X86::visit + 0x1de9
[0x14] Halide!Halide::Internal::ExprNode<Halide::Internal::Add>::mutate_expr + 0x1b
[0x15] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x16] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x17] Halide!Halide::Internal::IRMutator::visit + 0x46
[0x18] Halide!Halide::Internal::StmtNode<Halide::Internal::Store>::mutate_stmt + 0x18
[0x19] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x1a] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x1b] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x1c] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x1d] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x1e] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x1f] Halide!Halide::Internal::IRMutator::visit + 0x1d
[0x20] Halide!Halide::Internal::StmtNode<Halide::Internal::ProducerConsumer>::mutate_stmt + 0x1b
[0x21] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x22] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x23] Halide!Halide::Internal::IRMutator::visit + 0x26
[0x24] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x25] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x26] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x27] Halide!Halide::Internal::IRMutator::visit + 0x47
[0x28] Halide!Halide::Internal::StmtNode<Halide::Internal::Allocate>::mutate_stmt + 0x18
[0x29] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x2a] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x2b] Halide!Halide::Internal::IRMutator::visit + 0x55
[0x2c] Halide!Halide::Internal::StmtNode<Halide::Internal::For>::mutate_stmt + 0x18
[0x2d] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x2e] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x2f] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x30] Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b
[0x31] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x32] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x33] Halide!Halide::Internal::IRMutator::visit + 0x55
[0x34] Halide!Halide::Internal::StmtNode<Halide::Internal::For>::mutate_stmt + 0x18
[0x35] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x36] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x37] Halide!Halide::Internal::IRMutator::visit + 0x1d
[0x38] Halide!Halide::Internal::StmtNode<Halide::Internal::ProducerConsumer>::mutate_stmt + 0x1b
[0x39] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x3a] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x3b] Halide!Halide::Internal::IRMutator::visit + 0x26
[0x3c] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x3d] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x3e] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x3f] Halide!Halide::Internal::IRMutator::visit + 0x47
[0x40] Halide!Halide::Internal::StmtNode<Halide::Internal::Allocate>::mutate_stmt + 0x18
[0x41] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x42] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x43] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x44] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x45] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x46] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x47] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x48] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x49] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x4a] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x4b] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x4c] Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b
[0x4d] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x4e] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x4f] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x50] Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b
[0x51] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x52] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x53] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x54] Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b
[0x55] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x56] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x57] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x58] Halide!Halide::Internal::StmtNode<Halide::Internal::LetStmt>::mutate_stmt + 0x1b
[0x59] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x5a] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x5b] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x5c] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x5d] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x5e] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x5f] Halide!Halide::Internal::IRMutator::visit + 0x3c
[0x60] Halide!Halide::Internal::StmtNode<Halide::Internal::Block>::mutate_stmt + 0x18
[0x61] Halide!Halide::Internal::IRMutator::mutate + 0x28
[0x62] Halide!Halide::Internal::IRGraphMutator::mutate + 0x10f
[0x63] Halide!Halide::Internal::optimize_x86_instructions + 0x45
FYI: the crash above happens reliably on Windows, but I haven't been able to trigger anything similar on OSX or Linux builds. I'm gonna try pulling this into an experimental branch at Google to run with more sanitizers.
So I'm at a bit of a loss to explain the apparent use-after-free on Windows; I pulled these changes into Google and ran this test on Linux-x64 under ASAN, TSAN, and MSAN, and none of them reported any issues. I suppose it's possible that this is a Windows-specific (or MSVC-specific) issue, of course.
(One thing that would still be worth trying that I haven't done would be to build a Debug version of libHalide + simd_op_check and see if the MSVC Debug Runtime can give us any more info; unfortunately, that apparently requires building a Debug version of LLVM too, because trying to mix the different runtimes refuses at link time... and building a Debug version of LLVM is, uh, big and slow and I'm literally unsure if any of our WinBots have enough RAM to do so, period.)
Thanks for taking the time to try to debug this!
I suspect that this is an issue with SpecificExpr, which just grabs a reference. I think I know how to solve this, but I just got access to a (very old) Windows machine that has literally nothing useful set-up on it, so it will take me a while to confirm
Actually, I am pretty confident I found the bug, I am just going to use the buildbots to confirm instead of waiting for LLVM to install on this old laptop
The only failure appears unrelated, so I am marking this as ready for review - note that this PR relies on #6900 , so if we end up reverting that, this PR will need to change (it uses saturating_cast intrinsic pattern matchings).
I suspect that this is an issue with SpecificExpr, which just grabs a reference
If it is taking a reference to an Expr-that-can-go-stale, that would indeed fit the pattern -- unfortunately the template code in IRMatch.h is too complex for me to immediately see how this could be happening. Did you trace it all the way thru or just suspect it?
I like your proposed fix, but if this is in fact what is happening, we need to upgrade SpecificExpr to use Expr instead of const Expr & -- if the reference can go stale from the callsites you fixed, I will bet you a dollar that there are other such sites as well, either now or in the future. (Yes, this will cost a bit more reference counting, but if the alternative is crashing, performance is irrelevant).
note that this PR relies on https://github.com/halide/Halide/pull/6900
In that case we shouldn't land this yet -- if we can't track down the failure in 6900 in the next day or two, I think we need to (temporarily) revert it pending a fix, since we know it is breaking downstream code, and it's not good policy to keep known-broken code in the main branch.
Did you trace it all the way thru or just suspect it?
I ran into this issue last year at some point, and it caused random crashes - once I saw the SpecificExpr in your debug script I immediately suspected it as the source of the issue.
we need to upgrade SpecificExpr to use Expr instead of const Expr & -- if the reference can go stale from the callsites you fixed, I will bet you a dollar that there are other such sites as well, either now or in the future. (Yes, this will cost a bit more reference counting, but if the alternative is crashing, performance is irrelevant).
I agree, these bugs are very annoying to hunt down. @abadams any thoughts? I'm happy to change SpecificExpr in this PR or another one.
In that case we shouldn't land this yet -- if we can't track down the failure in 6900 in the next day or two, I think we need to (temporarily) revert it pending a fix, since we know it is breaking downstream code, and it's not good policy to keep known-broken code in the main branch.
Absolutely. Any chance the changes I pushed to #6961 last night fixed the issues in Google?
Any chance the changes I pushed to https://github.com/halide/Halide/pull/6961 last night fixed the issues in Google?
I'll try that now.
I'll try that now.
Thank you!
IIRC, having SpecificExpr hold an Expr instead of a BaseExprNode reference is catastrophic for performance and stack space usage, because now constructing a pattern requires atomic operations, and we have stack frames where thousands of patterns are constructed but not used.
So just changing it to an Expr, while appealing, is probably not a viable fix. The IRMatch code doesn't use Exprs for a good reason.
I suggest changing it to Expr to see if that's the issue, and if it is, seeing if there's some way we can force everything that might ingest an Expr in IRMatch.h to only accept a "const Expr &" and not an "Expr" or an "Expr &&". Right now it's all universal references.
Oh, we already guard against it:
template<typename T>
HALIDE_ALWAYS_INLINE void assert_is_lvalue_if_expr() {
static_assert(!std::is_same<typename std::decay<T>::type, Expr>::value || std::is_lvalue_reference<T>::value,
"Exprs are captured by reference by IRMatcher objects and so must be lvalues");
}
v_intrin and the Call intrinsic objects should now all statically fail if we use an Expr as an arg, and I confirmed that it does indeed catch the cases that were causing the Windows failures.
Nice work finding and fixing the dangling-reference stuff -- it didn't occur to me that you could insert is_lvalue... checks as a means of verification.
having SpecificExpr hold an Expr instead of a BaseExprNode reference is catastrophic for performance and stack space usage
We should capture this in a code comment, so someone (e.g. me) won't make the same suggestion in the future...
(LGTM pending green)
Nice work finding and fixing the dangling-reference stuff
Thanks!
it didn't occur to me that you could insert is_lvalue... checks as a means of verification.
Didn't occur to me either, I'm glad Andrew pointed me to the right method :)
We should capture this in a code comment, so someone (e.g. me) won't make the same suggestion in the future...
Done!
Monday morning review ping
Pulling this into Google for testing, I get many failures of the form:
Internal Error at [third_party/halide/halide/src/IR.cpp:42] Condition failed: a.type() == b.type(): Add of mismatched types
*** SIGABRT received by PID 3966 (TID 3966) on cpu 210 from PID 3966; stack trace: ***
PC: @ 0x7ff6fddcf347 (unknown) gsignal
@ 0x560160be7b34 976 FailureSignalHandler()
@ 0x7ff6fdf2a1c0 (unknown) (unknown)
@ 0x7ff6fddcf347 136 gsignal
@ 0x7ff6fddd0797 304 abort
@ 0x56015d5ac033 112 Halide::Internal::ErrorReport::~ErrorReport()
@ 0x56015d72eb05 352 Halide::Internal::Add::make()
@ 0x56015d7446df 96 Halide::Internal::IRMutator::visit()
@ 0x56015dd1bc33 432 Halide::Internal::(anonymous namespace)::Optimize_X86::visit()
I'll try to narrow it down...
@steven-johnson I have a suspicion that it could be the result of an issue with the top-down type inference in IRMatch.h. Any chance you could try out the small change I made to this branch on rootjalex/x86-optimize-test (d5f507e) ? Sorry about this