
[Bug][Jenkins] Multi-branch jobs incrementally get slower as data is recollected

Open gustavobini opened this issue 1 year ago • 5 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

What happened

Every execution is slower than the previous one because the extractApiBuilds task extracts data from the entire raw table.


What do you expect to happen

Only new or changed data is extracted.

How to reproduce

This can be reproduced in any project with Jenkins multi-branch pipeline jobs.

Anything else

No response

Version

v1.0.0-beta9

Are you willing to submit PR?

  • [X] Yes I am willing to submit a PR!

Code of Conduct

gustavobini avatar May 27 '24 06:05 gustavobini

@gustavobini Hello! I'm here to help you with any bugs, questions, or contributions. I'll assist you while waiting for a human maintainer.

To address the issue of the extractApiBuilds task in Jenkins multi-branch jobs getting incrementally slower due to reprocessing the entire raw table, you can modify the task to only extract new or changed data during each execution. This can be achieved by leveraging the StatefulApiCollector mechanism provided by the DevLake framework.

Here's how you can modify the collectMultiBranchJobApiBuilds function:

  1. Use the StatefulApiCollector to manage the state and track the last processed build.
  2. Modify the collectMultiBranchJobApiBuilds function to use this collector.

Here is the modified code:

func collectMultiBranchJobApiBuilds(taskCtx plugin.SubTaskContext) errors.Error {
	db := taskCtx.GetDal()
	data := taskCtx.GetData().(*JenkinsTaskData)
	logger := taskCtx.GetLogger()

	clauses := []dal.Clause{
		dal.Select("j.full_name,j.name,j.path,j.class,j.url"),
		dal.From("_tool_jenkins_jobs as j"),
		dal.Where(`j.connection_id = ? and j.class = ? and j._raw_data_table = ?`,
			data.Options.ConnectionId, WORKFLOW_JOB, fmt.Sprintf("_raw_%s", RAW_JOB_TABLE)),
	}
	cursor, err := db.Cursor(clauses...)
	if err != nil {
		return err
	}
	defer cursor.Close()

	iterator, err := helper.NewDalCursorIterator(db, cursor, reflect.TypeOf(SimpleJob{}))
	if err != nil {
		return err
	}

	collectorWithState, err := helper.NewStatefulApiCollector(helper.RawDataSubTaskArgs{
		Params: JenkinsApiParams{
			ConnectionId: data.Options.ConnectionId,
			FullName:     data.Options.JobFullName,
		},
		Ctx:   taskCtx,
		Table: RAW_BUILD_TABLE,
	})
	if err != nil {
		return err
	}

	logger.Debug("About to call collectorWithState.InitCollector")

	err = collectorWithState.InitCollector(helper.ApiCollectorArgs{
		ApiClient:   data.ApiClient,
		Input:       iterator,
		UrlTemplate: "{{ .Input.Path }}api/json",
		Query: func(reqData *helper.RequestData) (url.Values, errors.Error) {
			query := url.Values{}
			treeValue := "allBuilds[timestamp,number,duration,building,estimatedDuration,fullDisplayName,result,actions[lastBuiltRevision[SHA1,branch[name]],remoteUrls,mercurialRevisionNumber,causes[*]],changeSet[kind,revisions[revision]]]"
			query.Set("tree", treeValue)

			logger.Debug("Query: %v", query)

			return query, nil
		},
		ResponseParser: func(res *http.Response) ([]json.RawMessage, errors.Error) {
			var data struct {
				Builds []json.RawMessage `json:"allBuilds"`
			}
			err := helper.UnmarshalResponse(res, &data)
			if err != nil {
				return nil, err
			}

			builds := make([]json.RawMessage, 0, len(data.Builds))
			for _, build := range data.Builds {
				var buildObj map[string]interface{}
				err := json.Unmarshal(build, &buildObj)
				if err != nil {
					return nil, errors.Convert(err)
				}
				if buildObj["result"] != nil {
					builds = append(builds, build)
				}
			}

			logger.Debug("Returning this number of builds: %v", len(builds))
			return builds, nil
		},
		AfterResponse: ignoreHTTPStatus404,
	})

	if err != nil {
		return err
	}

	return collectorWithState.Execute()
}

This code uses the StatefulApiCollector to persist collection state between runs. Note that the ResponseParser shown here does not itself deduplicate: it only drops builds whose result is still null, i.e. builds that are still running; skipping already-collected data is the job of the collector's persisted state. The AfterResponse hook (ignoreHTTPStatus404) tolerates jobs whose build endpoint returns 404. This approach leverages the existing DevLake framework to avoid reprocessing the entire raw table [1][2].
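To make the ResponseParser's filter concrete, here is a stdlib-only Go demo of the same check: builds whose `result` field is null (still running) are dropped. `finishedBuilds` is an illustrative name, not a DevLake API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finishedBuilds mirrors the parser's filter: keep only builds whose
// "result" field is non-null, i.e. builds that have finished running.
func finishedBuilds(payload []byte) ([]json.RawMessage, error) {
	var data struct {
		Builds []json.RawMessage `json:"allBuilds"`
	}
	if err := json.Unmarshal(payload, &data); err != nil {
		return nil, err
	}
	var out []json.RawMessage
	for _, b := range data.Builds {
		var obj map[string]interface{}
		if err := json.Unmarshal(b, &obj); err != nil {
			return nil, err
		}
		if obj["result"] != nil {
			out = append(out, b)
		}
	}
	return out, nil
}

func main() {
	payload := []byte(`{"allBuilds":[
		{"number":10,"result":"SUCCESS"},
		{"number":11,"result":null,"building":true},
		{"number":12,"result":"FAILURE"}
	]}`)
	done, err := finishedBuilds(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(done)) // 2: build 11 is still running and gets dropped
}
```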

About Dosu: This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


dosubot[bot] avatar May 27 '24 06:05 dosubot[bot]

https://github.com/apache/incubator-devlake/blob/6dca288d5e33bb55a8b98995df97ead71b76a082/backend/plugins/jenkins/tasks/build_collector.go#L143 does not collect data incrementally. It could use NewStatefulApiCollectorForFinalizableEntity but that does not support iterators. What about changing collectMultiBranchJobApiBuilds to call https://github.com/apache/incubator-devlake/blob/6dca288d5e33bb55a8b98995df97ead71b76a082/backend/plugins/jenkins/tasks/build_collector.go#L72 for each entry in the iterator?

This could help, but https://github.com/apache/incubator-devlake/blob/3ae23bfdda0537e3cbe5eed17340e4fb115af1ce/backend/plugins/jenkins/tasks/build_extractor.go#L40 will still select all data from the raw table, including processed data. I don't know if raw tables are expected to have that data forever, or if they should be cleaned every now and then.

gustavobini avatar May 27 '24 07:05 gustavobini

To address the issue of incremental data collection in the collectMultiBranchJobApiBuilds function, you can indeed modify it to call the relevant section of the code for each entry in the iterator. This approach can help ensure that only new or changed data is processed.

Modifying collectMultiBranchJobApiBuilds:

NewStatefulApiCollectorForFinalizableEntity would manage the incremental state for you, but since it does not support iterators directly, an alternative is to drive the iteration yourself: loop over the jobs and create a separate stateful collector (via NewStatefulApiCollector) for each one, so each job's collection state is tracked independently.

Here's a modified version of the collectMultiBranchJobApiBuilds function:

func collectMultiBranchJobApiBuilds(taskCtx plugin.SubTaskContext) errors.Error {
	db := taskCtx.GetDal()
	data := taskCtx.GetData().(*JenkinsTaskData)
	logger := taskCtx.GetLogger()

	clauses := []dal.Clause{
		dal.Select("j.full_name,j.name,j.path,j.class,j.url"),
		dal.From("_tool_jenkins_jobs as j"),
		dal.Where(`j.connection_id = ? and j.class = ? and j._raw_data_table = ?`,
			data.Options.ConnectionId, WORKFLOW_JOB, fmt.Sprintf("_raw_%s", RAW_JOB_TABLE)),
	}
	cursor, err := db.Cursor(clauses...)
	if err != nil {
		return err
	}
	defer cursor.Close()

	iterator, err := helper.NewDalCursorIterator(db, cursor, reflect.TypeOf(SimpleJob{}))
	if err != nil {
		return err
	}

	for iterator.HasNext() {
		// fetch the next job here; note that the iterator must NOT also be
		// passed as the collector's Input, or records would be skipped
		record, err := iterator.Fetch()
		if err != nil {
			return err
		}
		job := record.(*SimpleJob)
		collectorWithState, err := helper.NewStatefulApiCollector(helper.RawDataSubTaskArgs{
			Params: JenkinsApiParams{
				ConnectionId: data.Options.ConnectionId,
				FullName:     job.FullName,
			},
			Ctx:   taskCtx,
			Table: RAW_BUILD_TABLE,
		})
		if err != nil {
			return err
		}

		err = collectorWithState.InitCollector(helper.ApiCollectorArgs{
			ApiClient: data.ApiClient,
			// each collector targets exactly one job, so build its URL directly
			UrlTemplate: fmt.Sprintf("%sapi/json", job.Path),
			Query: func(reqData *helper.RequestData) (url.Values, errors.Error) {
				query := url.Values{}
				treeValue := "allBuilds[timestamp,number,duration,building,estimatedDuration,fullDisplayName,result,actions[lastBuiltRevision[SHA1,branch[name]],remoteUrls,mercurialRevisionNumber,causes[*]],changeSet[kind,revisions[revision]]]"
				query.Set("tree", treeValue)

				logger.Debug("Query: %v", query)
				return query, nil
			},
			ResponseParser: func(res *http.Response) ([]json.RawMessage, errors.Error) {
				var data struct {
					Builds []json.RawMessage `json:"allBuilds"`
				}
				err := helper.UnmarshalResponse(res, &data)
				if err != nil {
					return nil, err
				}

				builds := make([]json.RawMessage, 0, len(data.Builds))
				for _, build := range data.Builds {
					var buildObj map[string]interface{}
					err := json.Unmarshal(build, &buildObj)
					if err != nil {
						return nil, errors.Convert(err)
					}
					if buildObj["result"] != nil {
						builds = append(builds, build)
					}
				}

				logger.Debug("Returning this number of builds: %v", len(builds))
				return builds, nil
			},
			AfterResponse: ignoreHTTPStatus404,
		})

		if err != nil {
			return err
		}

		err = collectorWithState.Execute()
		if err != nil {
			return err
		}
	}

	return nil
}

Cleaning or Archiving Raw Tables:

Regarding the cleaning or archiving of raw tables, there is no specific guidance in the provided context. In general, though, it is good practice to periodically clean or archive raw tables so they do not grow indefinitely and degrade performance; a scheduled task could archive or delete old data according to your retention policy.

Summary:

  • Modify collectMultiBranchJobApiBuilds to call the relevant section of the code for each entry in the iterator.
  • Use NewStatefulApiCollector to manage state and track the last processed build.
  • Consider implementing a scheduled task to clean or archive raw tables periodically.

This approach should make collection incremental and help keep run times from growing with each execution [1][2].


dosubot[bot] avatar May 27 '24 07:05 dosubot[bot]


I realized this also happens with other job types, but since the scale is much smaller, it's barely noticeable.

gustavobini avatar May 29 '24 06:05 gustavobini

@gustavobini I recently finished a PoC of Increment Mode support for JIRA issue changelog extractor and converter. It might be helpful to your aim.

https://github.com/apache/incubator-devlake/pull/7394

klesh avatar May 30 '24 01:05 klesh

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in next 7 days if no further activity occurs.

github-actions[bot] avatar Jul 30 '24 00:07 github-actions[bot]

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter the similar problem in the future.

github-actions[bot] avatar Aug 07 '24 00:08 github-actions[bot]

We are experiencing the same issue. Some of our DevLake projects now take ~24 hours (vs. ~8 hours initially). Is there any planned improvement to introduce incremental updates for the Jenkins data source? Thanks

vchalyi avatar Jan 24 '25 12:01 vchalyi