
[Feature][Task instance] Automatic cleaning of database task instance table

pinkfloyds opened this issue 1 year ago • 12 comments

Search before asking

  • [X] I have searched the issues and found no similar feature request.

Description

After long-term use of DolphinScheduler, the task instance table accumulates a large amount of data, which can lead to slow queries. It would be useful to have a feature that automatically cleans up these records, with a precise whitelist/blacklist mechanism at the project/ProcessDefinition level. After a simple configuration, it could run periodically (for example, every few months). I would like to contribute this feature. One question: would it be more appropriate to integrate this as a system-level function of DS, or to configure it as a ProcessDefinition?

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

pinkfloyds avatar Jun 13 '24 07:06 pinkfloyds

I think you need to give a more detailed design. Configuring at the system level and at the workflow definition level each has pros and cons. The question is too general for reviewers to answer without a detailed design.

EricGao888 avatar Jun 13 '24 08:06 EricGao888

BTW, you need to submit a DSIP for any potential modifications to the DS core.

https://github.com/apache/dolphinscheduler/issues/14102

EricGao888 avatar Jun 13 '24 08:06 EricGao888

I recommend using the MySQL Event Scheduler to automatically delete task instances.
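For illustration, a minimal sketch of such an event, assuming the metadata database is MySQL and the account has the EVENT privilege; the table name and three-month retention window follow the SQL discussed in this thread:

```sql
-- enable the scheduler once (requires SUPER or SYSTEM_VARIABLES_ADMIN)
SET GLOBAL event_scheduler = ON;

-- run monthly and drop task instances older than 3 months
CREATE EVENT IF NOT EXISTS clean_task_instance
ON SCHEDULE EVERY 1 MONTH
DO
  DELETE FROM t_ds_task_instance
  WHERE start_time < NOW() - INTERVAL 3 MONTH;
```

Note that this bypasses DS entirely, so any whitelist/blacklist logic would have to be encoded directly in the WHERE clause.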

zhuxt2015 avatar Jun 13 '24 08:06 zhuxt2015

Good idea, please write up a design proposal first.

davidzollo avatar Jun 13 '24 09:06 davidzollo

Plan One: Configure Workflow:

# black list
delete from t_ds_task_instance
where (project_code in ($project_code)
       or process_instance_id in (select id
                                  from t_ds_process_instance
                                  where process_definition_code in ($process_definition_code)))
  and start_time < NOW() - INTERVAL 3 MONTH;

# white list: note `and` rather than `or`, so a row is deleted only when it is
# protected by neither the project list nor the definition list
delete from t_ds_task_instance
where (project_code not in ($project_code)
       and process_instance_id not in (select id
                                       from t_ds_process_instance
                                       where process_definition_code in ($process_definition_code)))
  and start_time < NOW() - INTERVAL 3 MONTH;

Choose either the blacklist or the whitelist; the example retains data from the past three months, and the interval can be adjusted as needed. Fill in the corresponding parameters for $project_code and $process_definition_code (comma-separated), and remove the corresponding condition if it is not needed. Set up an SQL node to implement the above functionality; the metadata database needs to be registered in the datasource center. Configure the workflow to run on the 1st of every month, then export it as a JSON file for download and reuse. User documentation needs to be written for the manual configuration: there are many manual steps, improper use may cause unexpected results, and there is a slight learning curve, but the development cycle is short (basically developed already).
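For illustration, a blacklist deletion with hypothetical parameter values filled in (the codes below are made up) might look like:

```sql
delete from t_ds_task_instance
where (project_code in (1234567890123)
       or process_instance_id in (select id
                                  from t_ds_process_instance
                                  where process_definition_code in (2345678901234, 3456789012345)))
  and start_time < NOW() - INTERVAL 3 MONTH;
```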

Plan Two: Integration at the System Level: Add a "Log Cleaning Policy" item to the menu and implement the above functionality in the UI. All users could use it directly, but the development cycle is longer.

pinkfloyds avatar Jun 15 '24 13:06 pinkfloyds

How about this optimization of Plan One?

After the workflow development is completed, it is added to DS's database initialization script, along with detailed documentation, so that all users can easily use it with a simple configuration.

pinkfloyds avatar Jun 20 '24 02:06 pinkfloyds

I used a Python script to connect to the metadata database and delete data from the t_ds_task_instance table; this broke the workflow's scheduler and the workflow could no longer be taken offline. Have you tested this? My version: 3.0.1.

eelnomel avatar Jun 20 '24 06:06 eelnomel

> Plan One: Configure Workflow: blacklist/whitelist SQL with a three-month retention window, run as a monthly SQL-node workflow. Plan Two: Integration at the System Level: a "Log Cleaning Policy" menu item with the same functionality in the UI. (full plan quoted from the comment above)

Simply deleting rows from these tables is not enough. There are already APIs for deleting workflow definitions and workflow instances. We could create a sample workflow that selects the data to be deleted according to certain filtering criteria and calls the deletion API. I think this is better practice for both user intrusiveness and iterative updates.

SbloodyS avatar Jun 25 '24 12:06 SbloodyS

> Plan One: Configure Workflow: blacklist/whitelist SQL with a three-month retention window, run as a monthly SQL-node workflow. Plan Two: Integration at the System Level: a "Log Cleaning Policy" menu item with the same functionality in the UI. (full plan quoted from the comment above)

> Simply deleting these tables is not enough. There are already workflow definition and workflow instance APIs for deleting; a sample workflow could call the deletion API according to filtering criteria. (quoted from the comment above)

I understand, but some data operations are involved, such as querying all workflow definition IDs for the whitelist, and DS does not provide a relevant interface. Should we add a dedicated interface or query directly with SQL? Both options seem a bit troublesome; which is better? Also, I feel that implementing the above functionality with a Python script would be better than using a shell script.

pinkfloyds avatar Jun 26 '24 05:06 pinkfloyds

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Aug 01 '24 00:08 github-actions[bot]

> I understand, but some data operations are involved … Should we add a dedicated interface or query directly with SQL? … I feel that implementing the above functionality with a Python script would be better than using a shell script. (quoted from the comment above)

Why do you think a Python script is better? DolphinScheduler is a Java project; implementing functionality in a unified language is a basic requirement.

SbloodyS avatar Aug 01 '24 01:08 SbloodyS

I have the same problem. I wonder if we could start a scheduled task in the code to perform the cleanup. In addition, two configuration parameters could be added: the first indicates whether the automatic deletion feature is enabled, and the second sets the expiry time (e.g., data generated more than one month ago is considered expired). Pseudo-code:

public void cleanupExpiredInstances() {
    // first config parameter: skip everything if auto-deletion is disabled
    if (!cleanupEnabled) {
        return;
    }
    for (Instance instance : instanceMapper.queryAll()) {
        if (isInstanceExpired(instance)) {
            instanceMapper.batchDelete(instance);
        }
    }
}

private boolean isInstanceExpired(Instance instance) {
    // second config parameter: retention period (e.g. 30 days for "one month ago")
    long expireThreshold = 30L * 24 * 3600 * 1000;
    long currentTime = System.currentTimeMillis();
    long createTime = instance.getStartTime().getTime();
    return currentTime - createTime > expireThreshold;
}
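A self-contained, runnable sketch of the expiry check described above; the class, method names, and the 30-day retention value are illustrative, not part of DS:

```java
import java.util.concurrent.TimeUnit;

public class InstanceCleanup {

    // hypothetical config value: retention period in days
    static final long RETENTION_DAYS = 30;

    // true if the instance started longer ago than the retention period
    static boolean isExpired(long startTimeMillis, long nowMillis) {
        long threshold = TimeUnit.DAYS.toMillis(RETENTION_DAYS);
        return nowMillis - startTimeMillis > threshold;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // an instance started 31 days ago is expired; one started yesterday is not
        System.out.println(isExpired(now - TimeUnit.DAYS.toMillis(31), now)); // true
        System.out.println(isExpired(now - TimeUnit.DAYS.toMillis(1), now));  // false
    }
}
```

Using `TimeUnit.DAYS.toMillis` avoids the easy-to-miss integer overflow of hand-multiplied millisecond constants.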

q4q5q6qw avatar Aug 14 '24 13:08 q4q5q6qw

This issue has been automatically marked as stale because it has not had recent activity for 30 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Sep 14 '24 00:09 github-actions[bot]

This issue has been closed because it has not received a response for a long time. You can reopen it if you encounter similar problems in the future.

github-actions[bot] avatar Sep 21 '24 00:09 github-actions[bot]