tvm [Bug][meta_schedule] Tutorial `e2e_opt_model.py` fails

Expected behavior

The tutorial script e2e_opt_model.py should run without errors.

Actual behavior

When TaskScheduler picks Task 3, "fused_conv2d9_subtract4_divide4_expand_dims3_multiply4_expand_dims3_add11_relu4", the run fails with the following error:

  File "/home/ysy/Documents/open_source/tvm/source/src/tir/transforms/inject_software_pipeline.cc", line 1143, in tvm::tir::software_pipeline::PipelineInjector::VisitStmt_(tvm::tir::ForNode const*)
InternalError: Check failed: pipeline_stages.size() == original_order.size() (3 vs. 4) : PrimFunc "main" has original order ["", "", "", ""], but pipeline annotation is [0, 0, 3] with different size

Environment

Ubuntu 22.04, Intel i7-13650HX, RTX 4060
commit: 2d964b4133aac2f92e4185b3f095df4eb3bf3a90 (0.21.dev0)

Steps to reproduce

Run e2e_opt_model.py.

🌟My analysis

Error point: The error occurs at inject_software_pipeline.cc:1133 during the VerifyGPUCode postprocessor.

auto pipeline_stages =
        Downcast<Array<Integer>>(op->annotations.at(attr::software_pipeline_stage));
CHECK_EQ(pipeline_stages.size(), original_order.size())

As the error message indicates, pipeline_stages.size() is 3 whereas original_order.size() is 4: there are 4 blocks, but the software_pipeline_stage annotation has only 3 elements.

Why? RewriteReduction runs before VerifyGPUCode and decomposes the reduction block conv2d_nchw into a conv2d_nchw_init block and a conv2d_nchw_update block, which adds one block and increases original_order.size() from 3 to 4. The pipeline_stages annotation, however, is not updated to account for the added block. This appears to cause the bug.
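The mismatch described above can be reproduced in isolation. The following is only a minimal sketch, not TVM code: `Blocks`, `DecomposeReduction`, and `StagesMatchBlocks` are hypothetical helpers invented for illustration, and the block names other than conv2d_nchw (as well as its position in the list) are assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a loop body: just the names of its child blocks.
using Blocks = std::vector<std::string>;

// Mimics what decomposing a reduction does to the block list: the block at
// `index` is replaced by an init block followed by an update block.
Blocks DecomposeReduction(Blocks blocks, std::size_t index) {
  const std::string name = blocks.at(index);
  blocks[index] = name + "_update";
  blocks.insert(blocks.begin() + static_cast<std::ptrdiff_t>(index),
                name + "_init");
  return blocks;
}

// Mirrors the size comparison performed by the CHECK_EQ quoted above:
// one pipeline stage entry is expected per block.
bool StagesMatchBlocks(const std::vector<int>& stages, const Blocks& blocks) {
  return stages.size() == blocks.size();
}
```

With three blocks and the annotation [0, 0, 3] from the error message, `StagesMatchBlocks` holds before the decomposition and fails afterwards, because the block list grows to four entries while the annotation keeps three.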

Potential solution: In my opinion, the CHECK_EQ only validates the expected state, namely that each block can be mapped to a pipeline stage, and the problem actually lies in RewriteReduction. Shouldn't RewriteReduction update the size of the pipeline-stage annotation after adding the block? I tried to make this modification, but I struggled with the complexity of the optimization algorithm. Could someone more familiar with this code take it on? I'd appreciate the help.

Triage

tune:meta_schedule

May 26 '25 15:05 vacu9708

Same here. After downgrading to Release v0.20.0, the bug no longer occurs.

Jun 10 '25 16:06 w1049

I came back to this issue, and I now believe the problem is caused by the error-handling mechanism. The issue was introduced by commit 95d1268 (FFI refactoring). Before this commit, CHECK_EQ would also report the error, but it would not terminate the entire program.

Jul 01 '25 08:07 w1049

In verify_gpu_code.cc, such errors do not terminate the program; they are caught and the function returns false:

try { /* ... */ } catch (const dmlc::Error& e) {
    return false;
}

Before commit 95d1268, InternalError inherited from ::dmlc::Error:

/*! 
 * \brief Base error type for TVM. Wraps a string message. 
 */
class Error : public ::dmlc::Error {  // for backwards compatibility
 public:
  /*!
   * \brief Construct an error.
   * \param s The message to be displayed with the error.
   */
  explicit Error(const std::string& s) : ::dmlc::Error(s) {}
};

/*!
 * \brief Error type for errors from CHECK, ICHECK, and LOG(FATAL). 
 * Contains a backtrace of where it occurred.
 */
class InternalError : public Error {
  // ...
};

After the commit, the error type changed to ffi::Error, which doesn't inherit from dmlc::Error:

using ffi::EnvErrorAlreadySet;
using ffi::Error;

/*!
 * \brief Error type for errors from CHECK, ICHECK, and LOG(FATAL).
 * Contains a backtrace of where it occurred.
 */
class InternalError : public Error {
  // ...
};

/*!
 * \brief Managed reference to ErrorObj
 * \sa Error Object
 */
class Error : public ObjectRef, public std::exception {
  // ...
};

I believe this is the source of the bug: the error handling in verify_gpu_code.cc expects dmlc::Error but now receives ffi::Error. Modifying the catch block at src/meta_schedule/postproc/verify_gpu_code.cc:191 will fix this issue. I have also found other parts of the codebase that still depend on dmlc::Error.
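The effect of the changed base class can be shown with stand-in types. This is a sketch under stated assumptions, not the actual TVM or dmlc code: `DmlcError`, `OldInternalError`, `FfiError`, `Outcome`, and both `Verify...` functions are hypothetical names, and catching `std::exception` is only one possible way to adjust the handler, chosen here because the quoted `ffi::Error` declaration derives from `std::exception`.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for ::dmlc::Error.
struct DmlcError : public std::runtime_error {
  explicit DmlcError(const std::string& s) : std::runtime_error(s) {}
};

// Hypothetical stand-in for the old InternalError, derived from DmlcError.
struct OldInternalError : public DmlcError {
  explicit OldInternalError(const std::string& s) : DmlcError(s) {}
};

// Hypothetical stand-in for ffi::Error: a std::exception that is
// unrelated to DmlcError.
class FfiError : public std::exception {
 public:
  explicit FfiError(std::string s) : msg_(std::move(s)) {}
  const char* what() const noexcept override { return msg_.c_str(); }

 private:
  std::string msg_;
};

enum class Outcome { kRejected, kEscaped };

// Handler pattern quoted from verify_gpu_code.cc: only the dmlc-style error
// is caught. The outer catch (...) stands for the caller that would
// otherwise see the exception propagate.
template <typename Thrown>
Outcome VerifyCatchingDmlcOnly() {
  Outcome outcome = Outcome::kEscaped;
  try {
    try {
      throw Thrown("pipeline annotation has different size");
    } catch (const DmlcError&) {
      outcome = Outcome::kRejected;  // candidate rejected, tuning continues
    }
  } catch (...) {
    outcome = Outcome::kEscaped;  // handler did not match the thrown type
  }
  return outcome;
}

// One possible adjustment (an assumption): catch the common base class.
template <typename Thrown>
Outcome VerifyCatchingStdException() {
  Outcome outcome = Outcome::kEscaped;
  try {
    throw Thrown("pipeline annotation has different size");
  } catch (const std::exception&) {
    outcome = Outcome::kRejected;
  }
  return outcome;
}
```

In this sketch, `VerifyCatchingDmlcOnly<OldInternalError>()` yields `kRejected`, while `VerifyCatchingDmlcOnly<FfiError>()` yields `kEscaped`, which corresponds to the program terminating instead of the candidate being rejected. `VerifyCatchingStdException` rejects both error types.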

Jul 01 '25 09:07 w1049
