james-project icon indicating copy to clipboard operation
james-project copied to clipboard

[BUILD] Try to fasten the build on CI

Open quantranhong1999 opened this issue 3 years ago • 10 comments

quantranhong1999 avatar Jul 05 '22 03:07 quantranhong1999

Please document time gains (before/after). Also keep in mind that build time gains shall not decrease build stability...

chibenwa avatar Jul 05 '22 12:07 chibenwa

Not sure how it's supposed to fasten the build either? And I agree, our stability is already not great...

Arsnael avatar Jul 06 '22 02:07 Arsnael

The idea here is: you can use parallel=methods: classes are executed in forkCount concurrent processes, each of the processes can then use threadCount threads to execute the methods of one class in parallel.

I also agree that if this experiment adds more instability to the build, we should not add it.

quantranhong1999 avatar Jul 06 '22 03:07 quantranhong1999

Yeah I'm really not sure doing more things in parallel make the build more stable^^' Can test though

Arsnael avatar Jul 06 '22 03:07 Arsnael

threadCount= 2:

  • before: 4h 48m
  • after: 4h 31m

Not much diff, could just be because of the ci workload state. Will try with another value.

quantranhong1999 avatar Jul 06 '22 08:07 quantranhong1999

The build already uses T1C meaning that it attempts to parallelize the build at module level trying to build one module per hyperthreaded core. for each module build there is already parallelization of the tests on the class level (though I don't remember the exact threadcount) it is hard to tell without more precise metrics (not sure we could get them on such an infra) but trying to parallelize further at the method level is unlikely to yield massive improvments :

  • many of james tests are CPU bound not IO bound (io bound tests would be integration tests that start databases)
  • we already have massive parallelization in place (possibly even too much)

adding more parallelization could result in more context switching resulting in slower test execution for CPU bound tests

unfortunately I am not aware of a way to better optimize parallelization with maven. As far as i know, maven and surefire don't communicate at all. with gradle builds you can define an overall limit to workers and the scheduling is handled by the gradle daemon so you can maximize parallelization of both tests and compile tasks. But james doesn't build with gradle :)

A large issue for james is that the build is very naive and rebuilds a lot of useless stuff for every commit very few commits actually require rebuilding the whole codebase ... not sure if we can improve that

there may be ideas to find in https://eng.uber.com/how-we-halved-go-monorepo-ci-build-time/ they use a different build tool but the overall problem is the same

jeantil avatar Jul 06 '22 09:07 jeantil

Hi Jean, thank you for your suggestion.

The build already uses T1C meaning that it attempts to parallelize the build at module level trying to build one module per hyperthreaded core. for each module build there is already parallelization of the tests on the class level (though I don't remember the exact threadcount)

I checked our Jenkinsfile. We just use T1C for the compilation, not for the test phase. Maybe adding T1C to the maven test command could improve the testing time, but I guess the rationale why we did not use that before could have been because of resource fairness between the build...

quantranhong1999 avatar Jul 06 '22 09:07 quantranhong1999

02:56:59.515 [ERROR] 🐳 [cassandra_3_11_10-597500a2-330a-4e34-a007-dd1c815acc50] - failed to get digest sha256:07aee99ea52e391d5a13faf4f24a5053814c1b6bad8e01fd29c5ceeb5e285080: open /var/lib/docker/image/overlay2/imagedb/content/sha256/07aee99ea52e391d5a13faf4f24a5053814c1b6bad8e01fd29c5ceeb5e285080: no such file or directory

Keep having this recently. This is likely a problem with docker image cache storage. Not sure if it's related to the new concurrency or not.

  • Solution 1: clean the docker image storage (a bit dangerous cause we don't know what are being used on CI worker)
  • Solution 2: build image with --no-cache option. However could not find it in test container.

quantranhong1999 avatar Jul 14 '22 03:07 quantranhong1999

This work looks stale.

@quantranhong1999 can you please summarize your appoach (parallel execution of maven projects instead of relying on surefire fork counts, if I followed well) and its achievment both in term of build time and stability.

This is important as we can use that to see if it is a direction we should keep investigating...

chibenwa avatar Aug 01 '22 03:08 chibenwa

Can you please summarize your appoach (parallel execution of maven projects instead of relying on surefire fork counts, if I followed well) and its achievment both in term of build time and stability.

My idea is to increase the parallelism by adding parallel executions in Maven parallel builds (this works with surefire fork counts, not replace it). Maven surefire plugin doc (https://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html#forked-test-execution):

You can use the place holder ${surefire.forkNumber} within argLine, or within the system properties (both those specified via mvn test -D... and via systemPropertyVariables). Before executing the tests, the surefire plugin replaces that place holder by the number of the actually executing process, counting from 1 to the effective value of forkCount times the maximum number of parallel executions in Maven parallel builds, i.e. the effective value of the -T command line argument of Maven core.

So this idea will actually create more forks. E.g before: forkCount = 1 => 1 fork after: mvn test -T 8 ... with fork count = 1 => 8 forks.

Achievement until now:

  • Build time: decrease much (stable test before 4h -> after ~ 1.5h)
  • Stability: sadly decreased. It creates more unstable tests, especially some docker tests (LDAP as I observed).

Our current test suite is likely not sustainable that much concurrency. I did try to reduce concurrency by using fewer CI worker CPU cores but still did not archive a green build. Specifying threadCount = 1 on docker tests module may help, but I am not sure it will works.

quantranhong1999 avatar Aug 01 '22 04:08 quantranhong1999

Any plan to carry on this work?

Should we close this, @quantranhong1999 ?

chibenwa avatar Sep 22 '22 06:09 chibenwa

Any plan to carry on this work?

Likely no, I think adding more parallelism to the CI is not the right way but brings more instability...

Should we close this, @quantranhong1999 ?

Yes, I will close this.

quantranhong1999 avatar Sep 22 '22 06:09 quantranhong1999