[Feature][Task instance] Automatic cleaning of database task instance table
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
After long-term use of DolphinScheduler, the task instance table accumulates a large amount of data, which can lead to slow queries. It would be beneficial to have a feature that automatically cleans up this historical data, with a precise whitelist/blacklist mechanism at the project/ProcessDefinition level; after simple configuration, it would run periodically (e.g. every few months). I would like to contribute this feature. One question: would it be more appropriate to integrate this as a system-level function of DS, or to configure it as a ProcessDefinition?
Use case
No response
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I think you need to give a more detailed design. Configuring at the system level and at the workflow definition level both have pros and cons; the question is too general for reviewers to answer without a detailed design.
BTW, you need to submit a DSIP for any potential modifications to the DS core.
https://github.com/apache/dolphinscheduler/issues/14102
I recommend using the MySQL Event Scheduler to automatically delete task instances.
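For reference, a minimal sketch of such an event, assuming a MySQL metadata database with the Event Scheduler turned on (the event name and retention window below are illustrative):
-- requires SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS clean_t_ds_task_instance
ON SCHEDULE EVERY 1 MONTH
DO
  DELETE FROM t_ds_task_instance
  WHERE start_time < NOW() - INTERVAL 3 MONTH;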
good idea, please design a proposal first
Plan One: Configure Workflow:
# black list
DELETE FROM t_ds_task_instance
WHERE (project_code IN ($project_code)
       OR process_instance_id IN (SELECT id
                                  FROM t_ds_process_instance
                                  WHERE process_definition_code IN ($process_definition_code)))
  AND start_time < NOW() - INTERVAL 3 MONTH;
# white list
DELETE FROM t_ds_task_instance
WHERE (project_code NOT IN ($project_code)
       OR process_instance_id NOT IN (SELECT id
                                      FROM t_ds_process_instance
                                      WHERE process_definition_code IN ($process_definition_code)))
  AND start_time < NOW() - INTERVAL 3 MONTH;
Choose either the blacklist or the whitelist variant; both retain the last three months of data (adjust as needed). Fill in the corresponding values for $project_code and $process_definition_code, separated by commas, and remove any condition that is not needed. Set up an SQL node to run the statement (the metadata database must be registered in the datasource center), schedule the workflow to run on the 1st of every month, and export it as a JSON file for download and reuse. User documentation needs to be written for the manual configuration: there are many manual steps, improper use may cause unexpected results, and there is a slight threshold to use it, but the development cycle is short (it is basically developed).
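For illustration, the blacklist variant with the parameters filled in; the codes below are hypothetical placeholders, comma-separated as described:
# black list: two projects and one workflow definition (hypothetical codes)
DELETE FROM t_ds_task_instance
WHERE (project_code IN (10000000000001, 10000000000002)
       OR process_instance_id IN (SELECT id
                                  FROM t_ds_process_instance
                                  WHERE process_definition_code IN (20000000000001)))
  AND start_time < NOW() - INTERVAL 3 MONTH;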
Plan Two: Integrate at the system level: add a "Log Cleaning Policy" item to the menu and implement the above functionality in the UI. All users could use it directly, but the development cycle is longer.
How about this plan? An optimization of Plan One:
After the workflow is developed, add it to the initialization script of the DS metadata database, along with detailed documentation, so that all users can use it easily with simple configuration.
I used a Python script to connect to the metadata database and delete data from the t_ds_task_instance table, which caused the workflow's scheduler to become invalid and the workflow could not be taken offline. Have you tested this? My version: 3.0.1.
Simply deleting from these tables is not enough. There are already APIs for deleting workflow definitions and workflow instances. We could create a sample workflow that selects the data to be deleted according to certain filtering criteria and calls the deletion API to delete it. I think this is good practice with respect to both user intrusion and iterative updates.
I understand, but some data operations are involved; for example, the whitelist needs to query all workflow definition codes, and DS does not provide a relevant interface. Should we add a dedicated interface or query via SQL directly? Both options seem a bit troublesome; which method is better? I feel that implementing the above functionality with a Python script would be better than using a shell script.
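For reference, a minimal sketch of the direct-SQL option, assuming the DS 3.x metadata schema (verify table and column names against your version):
-- list all workflow definition codes per project, e.g. to build the whitelist
SELECT project_code,
       code AS process_definition_code,
       name
FROM t_ds_process_definition;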
Why do you think a Python script is better? DolphinScheduler is a Java project; using a unified language to implement functionality is a basic requirement.
I have the same problem. I wonder if a scheduled task could be started in the code to perform the cleanup. In addition, two configuration parameters would be added: the first indicates whether the automatic deletion function is enabled, and the second specifies the retention time (data generated more than one month ago counts as expired). The pseudo-code is as follows:
// Pseudo-code sketch: cleanupEnabled and the retention window would come from
// the two new configuration items described above (names here are illustrative).
public void cleanupExpiredInstances() {
    if (!cleanupEnabled) {
        return;
    }
    List<Instance> expired = instanceMapper.queryAll().stream()
            .filter(this::isInstanceExpired)
            .collect(Collectors.toList());
    if (!expired.isEmpty()) {
        instanceMapper.batchDelete(expired);
    }
}

private boolean isInstanceExpired(Instance instance) {
    // retention window from configuration; one month (30 days) as described above
    long expireThreshold = 30L * 24 * 3600 * 1000;
    long currentTime = System.currentTimeMillis();
    long createTime = instance.getStartTime().getTime();
    return currentTime - createTime > expireThreshold;
}
This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.
This issue has been closed because it has not received response for too long time. You could reopen it if you encountered similar problems in the future.