
[Bug][Jenkins] Multi-branch jobs incrementally get slower as data is recollected

Open gustavobini opened this issue 1 year ago • 5 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

Every execution is slower than the previous one because the extractApiBuilds task extracts data from the entire raw table.


What do you expect to happen

Only new or changed data is extracted.

How to reproduce

This can be reproduced in any project with Jenkins multi-branch pipeline jobs.

Anything else

No response

Version

v1.0.0-beta9

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

gustavobini avatar May 27 '24 06:05 gustavobini

@gustavobini Hello! I'm here to help you with any bugs, questions, or contributions. I'll assist you while waiting for a human maintainer.

To address the issue of the extractApiBuilds task in Jenkins multi-branch jobs getting incrementally slower due to reprocessing the entire raw table, you can modify the task to only extract new or changed data during each execution. This can be achieved by leveraging the StatefulApiCollector mechanism provided by the DevLake framework.

Here's how you can modify the collectMultiBranchJobApiBuilds function:

  1. Use the StatefulApiCollector to manage the state and track the last processed build.
  2. Modify the collectMultiBranchJobApiBuilds function to use this collector.

Here is the modified code:

func collectMultiBranchJobApiBuilds(taskCtx plugin.SubTaskContext) errors.Error {
	db := taskCtx.GetDal()
	data := taskCtx.GetData().(*JenkinsTaskData)
	logger := taskCtx.GetLogger()

	clauses := []dal.Clause{
		dal.Select("j.full_name,j.name,j.path,j.class,j.url"),
		dal.From("_tool_jenkins_jobs as j"),
		dal.Where(`j.connection_id = ? and j.class = ? and j._raw_data_table = ?`,
			data.Options.ConnectionId, WORKFLOW_JOB, fmt.Sprintf("_raw_%s", RAW_JOB_TABLE)),
	}
	cursor, err := db.Cursor(clauses...)
	if err != nil {
		return err
	}
	defer cursor.Close()

	iterator, err := helper.NewDalCursorIterator(db, cursor, reflect.TypeOf(SimpleJob{}))
	if err != nil {
		return err
	}

	collectorWithState, err := helper.NewStatefulApiCollector(helper.RawDataSubTaskArgs{
		Params: JenkinsApiParams{
			ConnectionId: data.Options.ConnectionId,
			FullName:     data.Options.JobFullName,
		},
		Ctx:   taskCtx,
		Table: RAW_BUILD_TABLE,
	})
	if err != nil {
		return err
	}

	logger.Debug("About to call collectorWithState.InitCollector")

	err = collectorWithState.InitCollector(helper.ApiCollectorArgs{
		ApiClient:   data.ApiClient,
		Input:       iterator,
		UrlTemplate: "{{ .Input.Path }}api/json",
		Query: func(reqData *helper.RequestData) (url.Values, errors.Error) {
			query := url.Values{}
			treeValue := "allBuilds[timestamp,number,duration,building,estimatedDuration,fullDisplayName,result,actions[lastBuiltRevision[SHA1,branch[name]],remoteUrls,mercurialRevisionNumber,causes[*]],changeSet[kind,revisions[revision]]]"
			query.Set("tree", treeValue)

			logger.Debug("Query: %v", query)

			return query, nil
		},
		ResponseParser: func(res *http.Response) ([]json.RawMessage, errors.Error) {
			var data struct {
				Builds []json.RawMessage `json:"allBuilds"`
			}
			err := helper.UnmarshalResponse(res, &data)
			if err != nil {
				return nil, err
			}

			builds := make([]json.RawMessage, 0, len(data.Builds))
			for _, build := range data.Builds {
				var buildObj map[string]interface{}
				err := json.Unmarshal(build, &buildObj)
				if err != nil {
					return nil, errors.Convert(err)
				}
				if buildObj["result"] != nil {
					builds = append(builds, build)
				}
			}

			logger.Debug("Returning this number of builds: %v", len(builds))
			return builds, nil
		},
		AfterResponse: ignoreHTTPStatus404,
	})

	if err != nil {
		return err
	}

	return collectorWithState.Execute()
}

This code uses the StatefulApiCollector to persist collection state between runs. Note that the ResponseParser shown here does not itself deduplicate: it only drops builds whose result is still null, i.e. builds that are still running; skipping already-collected data is the job of the collector's persisted state. The AfterResponse hook (ignoreHTTPStatus404) tolerates jobs whose build endpoint returns 404. This approach leverages the existing DevLake framework to avoid reprocessing the entire raw table [1][2].
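To make the ResponseParser's filter concrete, here is a stdlib-only Go demo of the same check: builds whose `result` field is null (still running) are dropped. `finishedBuilds` is an illustrative name, not a DevLake API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finishedBuilds mirrors the parser's filter: keep only builds whose
// "result" field is non-null, i.e. builds that have finished running.
func finishedBuilds(payload []byte) ([]json.RawMessage, error) {
	var data struct {
		Builds []json.RawMessage `json:"allBuilds"`
	}
	if err := json.Unmarshal(payload, &data); err != nil {
		return nil, err
	}
	var out []json.RawMessage
	for _, b := range data.Builds {
		var obj map[string]interface{}
		if err := json.Unmarshal(b, &obj); err != nil {
			return nil, err
		}
		if obj["result"] != nil {
			out = append(out, b)
		}
	}
	return out, nil
}

func main() {
	payload := []byte(`{"allBuilds":[
		{"number":10,"result":"SUCCESS"},
		{"number":11,"result":null,"building":true},
		{"number":12,"result":"FAILURE"}
	]}`)
	done, err := finishedBuilds(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(done)) // 2: build 11 is still running and gets dropped
}
```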

About Dosu: This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


dosubot[bot] avatar May 27 '24 06:05 dosubot[bot]

https://github.com/apache/incubator-devlake/blob/6dca288d5e33bb55a8b98995df97ead71b76a082/backend/plugins/jenkins/tasks/build_collector.go#L143 does not collect data incrementally. It could use NewStatefulApiCollectorForFinalizableEntity but that does not support iterators. What about changing collectMultiBranchJobApiBuilds to call https://github.com/apache/incubator-devlake/blob/6dca288d5e33bb55a8b98995df97ead71b76a082/backend/plugins/jenkins/tasks/build_collector.go#L72 for each entry in the iterator?

This could help, but https://github.com/apache/incubator-devlake/blob/3ae23bfdda0537e3cbe5eed17340e4fb115af1ce/backend/plugins/jenkins/tasks/build_extractor.go#L40 will still select all data from the raw table, including processed data. I don't know if raw tables are expected to have that data forever, or if they should be cleaned every now and then.

gustavobini avatar May 27 '24 07:05 gustavobini

To address the issue of incremental data collection in the collectMultiBranchJobApiBuilds function, you can indeed modify it to call the relevant section of the code for each entry in the iterator. This approach can help ensure that only new or changed data is processed.

Modifying collectMultiBranchJobApiBuilds:

NewStatefulApiCollectorForFinalizableEntity would manage the incremental state for you, but since it does not support iterators directly, an alternative is to drive the iteration yourself: loop over the jobs and create a separate stateful collector (via NewStatefulApiCollector) for each one, so each job's collection state is tracked independently.

Here's a modified version of the collectMultiBranchJobApiBuilds function:

func collectMultiBranchJobApiBuilds(taskCtx plugin.SubTaskContext) errors.Error {
	db := taskCtx.GetDal()
	data := taskCtx.GetData().(*JenkinsTaskData)
	logger := taskCtx.GetLogger()

	clauses := []dal.Clause{
		dal.Select("j.full_name,j.name,j.path,j.class,j.url"),
		dal.From("_tool_jenkins_jobs as j"),
		dal.Where(`j.connection_id = ? and j.class = ? and j._raw_data_table = ?`,
			data.Options.ConnectionId, WORKFLOW_JOB, fmt.Sprintf("_raw_%s", RAW_JOB_TABLE)),
	}
	cursor, err := db.Cursor(clauses...)
	if err != nil {
		return err
	}
	defer cursor.Close()

	iterator, err := helper.NewDalCursorIterator(db, cursor, reflect.TypeOf(SimpleJob{}))
	if err != nil {
		return err
	}

	for iterator.HasNext() {
		// fetch the next job here; note that the iterator must NOT also be
		// passed as the collector's Input, or records would be skipped
		record, err := iterator.Fetch()
		if err != nil {
			return err
		}
		job := record.(*SimpleJob)
		collectorWithState, err := helper.NewStatefulApiCollector(helper.RawDataSubTaskArgs{
			Params: JenkinsApiParams{
				ConnectionId: data.Options.ConnectionId,
				FullName:     job.FullName,
			},
			Ctx:   taskCtx,
			Table: RAW_BUILD_TABLE,
		})
		if err != nil {
			return err
		}

		err = collectorWithState.InitCollector(helper.ApiCollectorArgs{
			ApiClient: data.ApiClient,
			// each collector targets exactly one job, so build its URL directly
			UrlTemplate: fmt.Sprintf("%sapi/json", job.Path),
			Query: func(reqData *helper.RequestData) (url.Values, errors.Error) {
				query := url.Values{}
				treeValue := "allBuilds[timestamp,number,duration,building,estimatedDuration,fullDisplayName,result,actions[lastBuiltRevision[SHA1,branch[name]],remoteUrls,mercurialRevisionNumber,causes[*]],changeSet[kind,revisions[revision]]]"
				query.Set("tree", treeValue)

				logger.Debug("Query: %v", query)
				return query, nil
			},
			ResponseParser: func(res *http.Response) ([]json.RawMessage, errors.Error) {
				var data struct {
					Builds []json.RawMessage `json:"allBuilds"`
				}
				err := helper.UnmarshalResponse(res, &data)
				if err != nil {
					return nil, err
				}

				builds := make([]json.RawMessage, 0, len(data.Builds))
				for _, build := range data.Builds {
					var buildObj map[string]interface{}
					err := json.Unmarshal(build, &buildObj)
					if err != nil {
						return nil, errors.Convert(err)
					}
					if buildObj["result"] != nil {
						builds = append(builds, build)
					}
				}

				logger.Debug("Returning this number of builds: %v", len(builds))
				return builds, nil
			},
			AfterResponse: ignoreHTTPStatus404,
		})

		if err != nil {
			return err
		}

		err = collectorWithState.Execute()
		if err != nil {
			return err
		}
	}

	return nil
}

Cleaning or Archiving Raw Tables:

Regarding the cleaning or archiving of raw tables, there is no specific guidance in the provided context. In general, though, it is good practice to periodically clean or archive raw tables so they do not grow indefinitely and degrade performance; a scheduled task could archive or delete old data according to your retention policy.

Summary:

  • Modify collectMultiBranchJobApiBuilds to call the relevant section of the code for each entry in the iterator.
  • Use NewStatefulApiCollector to manage state and track the last processed build.
  • Consider implementing a scheduled task to clean or archive raw tables periodically.

This approach should make collection incremental and help keep run times from growing with each execution [1][2].


dosubot[bot] avatar May 27 '24 07:05 dosubot[bot]


I realized this also happens with other job types, but since the scale is much smaller, it's barely noticeable.

gustavobini avatar May 29 '24 06:05 gustavobini

@gustavobini I recently finished a PoC of Increment Mode support for JIRA issue changelog extractor and converter. It might be helpful to your aim.

https://github.com/apache/incubator-devlake/pull/7394

klesh avatar May 30 '24 01:05 klesh

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jul 30 '24 00:07 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter the similar problem in the future.

github-actions[bot] avatar Aug 07 '24 00:08 github-actions[bot]

We are experiencing the same issue. Some of our DevLake projects now take ~24 hours (vs. ~8 hours initially). Is there any planned improvement to introduce incremental updates for the Jenkins data source? Thanks

vchalyi avatar Jan 24 '25 12:01 vchalyi