[DSIP][Task Definition] Optimize Task Definition Query by Adding Redundant workflow_definition_code Field
Search before asking
- [x] I had searched in the issues and found no similar feature requirement.
Description
🐞 Background & Problem
In the current DolphinScheduler implementation, querying all task definitions under a workflow definition requires joining three tables:
-
t_ds_workflow_definition -
t_ds_workflow_task_relation -
t_ds_task_definition
This multi-table join approach introduces significant performance bottlenecks when dealing with large numbers of workflows and tasks, especially in high-frequency query scenarios such as workflow tree view rendering and status statistics.
🚀 Optimization Goal
By adding a redundant workflow_definition_code field to the t_ds_task_definition table, we can simplify the original three-table join into a single-table query, greatly improving read performance.
This optimization applies to several core APIs, including:
-
GET /{code}/view-tree: Generate workflow tree structure -
POST /query-task-definition-list: Query task definition list - Workflow release validation logic (
/{code}/release) - Task instance state statistics (filtered by workflowCode)
💡 Solution
1. Database Schema Changes
-- Add workflow_definition_code to task definition table
ALTER TABLE t_ds_task_definition
ADD COLUMN workflow_definition_code BIGINT(20) NOT NULL DEFAULT 0 COMMENT 'Workflow definition code' AFTER project_code;
-- Add fields to log table
ALTER TABLE t_ds_task_definition_log
ADD COLUMN workflow_definition_code BIGINT(20) NOT NULL DEFAULT 0 COMMENT 'Workflow definition code' AFTER project_code,
ADD COLUMN workflow_definition_version INT(11) NOT NULL DEFAULT 0 COMMENT 'Workflow definition version' AFTER workflow_definition_code;
2. Data Migration Scripts
-- Populate task definition table from relation table UPDATE t_ds_task_definition td JOIN t_ds_workflow_task_relation wtr ON td.code = wtr.post_task_code AND td.version = wtr.post_task_version SET td.workflow_definition_code = wtr.workflow_definition_code WHERE td.workflow_definition_code = 0;
-- Populate log table UPDATE t_ds_task_definition_log tdl JOIN t_ds_workflow_task_relation_log wtrl ON tdl.code = wtrl.post_task_code AND tdl.version = wtrl.post_task_version SET tdl.workflow_definition_code = wtrl.workflow_definition_code, tdl.workflow_definition_version = wtrl.workflow_definition_version WHERE tdl.workflow_definition_code = 0;
3. Code-Level Changes
(1) Entity Updates // TaskDefinition.java private long workflowDefinitionCode;
// TaskDefinitionLog.java
private long workflowDefinitionCode;
private int workflowDefinitionVersion;
(2) DAO Query Optimization (Key Change)
List<WorkflowTaskRelationLog> relations = relationMapper.queryByWorkflowCodeAndVersion(...);
Set<TaskDefinition> codes = relations.stream().map(...).collect(...);
return taskDefinitionLogMapper.queryByTaskDefinitions(codes);
After (single-table query):
List<TaskDefinition> list = taskDefinitionMapper.queryByWorkflowDefinitionCode(workflowCode);
Corresponding SQL:
<select id="queryByWorkflowDefinitionCode" resultType="TaskDefinition">
SELECT
| Module | API / Operation | Impacted |
|---|---|---|
| Workflow Save/Update | Create/Update Task Definitions | ✅ Need to populate field |
| Release Validation | /release check sub-workflows | ✅ Query performance improved |
| View Rendering | /{code}/view-tree | ✅ Faster query |
| Task Query | POST /query-task-definition-list | ✅ Accelerated |
| Statistics API | /v2/statistics/tasks/states/count | ✅ Optimizable |
| Python Gateway | queryByName | ✅ Optimizable |
4. Business Logic Adaptation
The following operations must ensure workflow_definition_code is set when saving or updating task definitions:
Workflow creation & update Import/export Version rollback Copy functionality
📈 Expected Impact Significantly reduce database query load Improve frontend page loading speed (especially for large workflows) Enhance user experience and support larger-scale scheduling scenarios
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
This multi-table join approach introduces significant performance bottlenecks when dealing with large numbers of workflows and tasks, especially in high-frequency query scenarios such as workflow tree view rendering and status statistics.
I think we don't need this. There are thousands of task in workflow at most so workflow tree does not encounter performance bottlenecks.