
First cut at restructuring the integration tests

paul-rogers opened this issue 3 years ago · 16 comments

Description

Issue #12359 proposes an approach to simplify and streamline integration tests, especially around the developer experience, but also for Travis. See that issue for the background.

This PR is big, but most of that comes from creating revised versions of existing files. Unfortunately, there is no good way in GitHub to compare two copies of the same file. For the most part, these are config files, and you can assume that the new versions work (because, when they didn't, the cluster stubbornly refused to start or stay up).

Developer Experience

With this framework, it is possible to:

  • Do a normal distribution build.
  • Build the Docker image in less than a minute. (Most of that is Maven determining what not to do. After the first build, you can use a script to rebuild in a few seconds, depending on what Docker must rebuild.)
  • Launch the cluster in a few seconds.
  • Debug an integration test as a JUnit test in your favorite IDE.
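
To make the last bullet concrete, here is a minimal sketch, assuming hypothetical ClusterConfig and ClusterClient helper classes (the real APIs live in the base test sub-project), of an integration test that can be debugged like any other JUnit test:

// Sketch only: ClusterConfig, ClusterClient, and their methods are
// hypothetical stand-ins for the framework's actual classes.
import static org.junit.Assert.assertTrue;

import org.junit.Test;

public class ExampleHighAvailabilityIT
{
  @Test
  public void testClusterIsHealthy() throws Exception
  {
    // Hypothetical: load the file-based test configuration for this group.
    ClusterConfig config = ClusterConfig.loadFromFile("docker.yaml");
    ClusterClient client = new ClusterClient(config);

    // Because this is plain JUnit, breakpoints and stepping work in any IDE.
    assertTrue(client.coordinatorIsHealthy());
  }
}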

The result is that integration tests go from being a nightmare to being an efficient way to develop and test code changes. This author used the framework to create tests for PR #12222, and the process was quick and easy. Not as efficient as just using unit tests (we still want the single-process server), but still pretty good. (By contrast, the new tests were also ported to the existing framework, and that was still difficult, for the reasons we're trying to address here.)

One huge win is that, with this approach, one can start a Docker cluster and leave it up indefinitely to try out APIs, to create or refactor tests, etc. Though there are many details to get right to use Docker and Docker Compose, once those are addressed, using the cluster becomes quite simple and productive.

Contents of this First Cut

This PR is a first draft of the approach which provides:

  • A new top-level project, docker-tests that holds the new integration test structure. (For now, the existing integration-tests is left unchanged.)
  • Sub-project testing-tools to hold code placed into the Docker image.
  • Sub-project test-image to build the Druid-only test image from the tarball produced in distribution. (Dependencies live in their "official" image.)
  • Sub-project base-test that holds the test code common to the revised integration tests, including file-based test configuration, test-specific clients, test initialization, and updated versions of some of the common test support classes.
  • Sub-project high-availability, which is a port of the integration test high-availability group to demonstrate and exercise the new structure.

The integration test setup is primarily a huge mass of details. This approach refactors many of those details: from how the image is built and configured, to how the Docker Compose scripts are structured, to how tests are configured. An extensive set of "readme" files explains those details. Rather than repeat that material here, please consult those files for explanations.

Limitations

This version is very much a first cut. Everything works for the one converted test group. The new framework is intended to exist parallel to the current one so we can get started. The new framework is ignored unless you select the Maven profiles which enable it. (See the docs for details.)

There are many other test groups not yet touched. A good approach is to use this framework for new integration tests, and to convert old ones when someone needs to modify them. The cost of converting to this framework is low, and the productivity gain is large.

Other limitations include:

  • The original tests appear to run not only in Docker, but also against a local QuickStart cluster and against Kubernetes. Neither of these other two modes has been tested in the new framework. (Though it is now so easy to start and use a Docker cluster that it may be easier to use Docker than the QuickStart cluster.)
  • The original tests always have security enabled. While it is important to test security, having security enabled makes debugging far harder (by design). So, this draft has security disabled, and the various scripts and configs are pulled aside. The thought is to enable security as an option when needed, and to run without it when debugging anything other than the security mechanism.
  • The supporting classes have the basics, but have been used for only the one integration test group.
  • This framework is not yet integrated into Travis. A test that exists only in the new framework won't run in the Travis build. We hope to address that limitation after this PR is merged.

Next Steps

This PR itself will continue to evolve as some of the final details are sorted out. However, it is at the stage where it will benefit from others taking a look and making suggestions.

The thought is that this PR is large enough already: let's get it reviewed, then tackle the additional issues listed above step by step, as the opportunity arises.

Alternatives

The approach in this PR is based on the existing approach, but rearranges the parts. Since the integration tests are pretty much "nothing but details", there are many approaches that could be taken. Here are a few that were considered.

  • Run the tests as-is in an AWS instance. Because the tests are very difficult to run on a developer machine, many folks set up an AWS instance to run them. While this can work, it is slow: one has to shuffle code from the laptop to the instance and back. Or, just do development on the instance. The tests are not really set up for debugging, so even on the instance, it is still tedious to make and debug test changes.
  • Run the tests in Travis as part of a PR. This is the default approach. However, it is akin to the development process of old: submit the changes to a batch run, wait many hours for the answers, plow through the logs, find issues, fix them, and repeat. That process was not efficient in the era of punch cards, and is still not very efficient today. A turnaround of a minute or less is the target, which the Travis approach cannot provide.
  • Modify the existing integration tests. This is the obvious approach. But, the set of existing ITs is so large that attempting to change everything in one go becomes overwhelming. The chosen approach allows incremental test-by-test conversion without breaking the great mass of existing tests.
  • Status quo. I'm working on a project that requires many integration tests. It is faster to fix the test framework once, and then write the tests quickly, than to fight with the framework for each of the required tests.

That said, this PR is all about details. Your thoughts, suggestions and corrections are encouraged to ensure we've got our bases covered.

Detailed Changes

A number of specific changes that do not appear in the docs are worth calling out.

  • A number of config classes are modified to allow the test code to create an instance directly, without the "JSON Config" mechanism.
  • The MySQL, ZK and Kafka extensions are slightly modified to expose methods that allow us to use the classes in test-specific clients.
  • The Guice Initialization class is modified to allow tests to define a client-oriented set of modules. (The standard configuration assumes a server environment, and thus has far more dependencies than we need for tests which act as clients.)
  • Similarly, Lifecycle is slightly modified to allow its use in a test client. (A sketch of the client-side setup follows this list.)
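
To make the last two bullets concrete, here is a rough sketch, not the framework's literal API, of how a test client might assemble a reduced Guice configuration and drive Lifecycle explicitly; the three modules shown are hypothetical:

// Sketch only: the Client*Module classes are hypothetical; the real entry
// point is the Initializer class described in this PR.
import com.google.inject.Guice;
import com.google.inject.Injector;
import org.apache.druid.java.util.common.lifecycle.Lifecycle;

public class ClientSetupSketch
{
  public static void main(String[] args) throws Exception
  {
    // Bind only what a client needs (JSON mapper, HTTP client, test config),
    // rather than the full server module set.
    Injector injector = Guice.createInjector(
        new ClientJsonModule(),   // hypothetical
        new ClientHttpModule(),   // hypothetical
        new TestConfigModule()    // hypothetical
    );

    // With the Lifecycle tweak, a test client can start and stop it directly.
    Lifecycle lifecycle = injector.getInstance(Lifecycle.class);
    lifecycle.start();
    try {
      // ... issue client calls against the cluster under test ...
    }
    finally {
      lifecycle.stop();
    }
  }
}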

This PR has:

  • [X] been self-reviewed.
  • [X] added documentation for new or modified features or behaviors.
  • [X] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • [X] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • [X] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • [X] added integration tests.
  • [X] been tested in a test Druid cluster (in the sense that this PR is for running such a cluster in Docker.)

paul-rogers avatar Mar 25 '22 21:03 paul-rogers

This is a big PR. It is all-in-one so folks can see the whole picture. If it helps, it can be broken into smaller chunks, at the loss of overall context.

Here's a quick summary of what's new vs. what's refactored:

  • docker-tests/docs is all new and is meant to capture all the learnings from this exercise, along with information needed to move forward. This is the main resource for understanding this PR.
  • docker-tests project is new so it does not conflict with the existing integration-tests project: both can exist.
  • docker-tests/base-test is mostly new. It contains the revised test config code and a new cluster client.
    • ClusterConfig is the YAML-based config mechanism. (A loading sketch appears after this list.)
    • Initializer has a bunch of complexity to force the server-minded Guice config to work in a client. Loads the DB. Etc.
  • docker-tests/test-image is a greatly refactored set of Docker build scripts. The Dockerfile is heavily refactored to remove the third-party dependencies and rearrange how Druid is laid out (unpacked from the distribution tarball). DB setup is removed.
  • docker-tests/testing-tools is mostly a copy/paste of extensions-core/testing-tools with the custom node role from integration-tests added.
  • docker-tests/high-availability is a refactor of one test from integration-tests. The Docker Compose script is specific to this one test, refactored from those in integration-tests. The idea is that this test contains just the files for this "group". Other groups will follow this pattern.
  • Other files are mostly clean-up uncovered while debugging. In some cases, code was refactored so the test "clients" could use code that was previously tightly coupled with the server.
  • YAML files: a refactor of the Docker Compose scripts using the new test config.
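
As a concrete illustration of the ClusterConfig item above, here is a minimal sketch of YAML-based config loading with Jackson. The field names are assumptions for illustration only; the real ClusterConfig lives in docker-tests/base-test:

// Sketch only: the ClusterConfig fields below are assumed for illustration.
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import java.io.File;

public class ConfigLoadSketch
{
  public static class ClusterConfig
  {
    public String proxyHost;      // assumption: host where Docker ports are published
    public int coordinatorPort;   // assumption
    public int routerPort;        // assumption
  }

  public static ClusterConfig load(File yamlFile) throws Exception
  {
    // jackson-dataformat-yaml reuses the ordinary ObjectMapper data binding.
    ObjectMapper mapper = new ObjectMapper(new YAMLFactory());
    return mapper.readValue(yamlFile, ClusterConfig.class);
  }
}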

Also, as noted before, this PR moves authorization aside into separate files. Authorization is not yet enabled.

paul-rogers avatar Mar 28 '22 17:03 paul-rogers

Continuing to whack build details. Who knew that each task does a pre-build before the build proper, and that the pre-build builds everything except distribution? This causes test-image to fail when looking for the non-existent distribution dependency. Using a bit of profile magic to hide this dependency from the pre-build.

paul-rogers avatar Apr 04 '22 22:04 paul-rogers

Sorry, had to do some major surgery to the Maven module structure, which required renaming the modules and their directories. See the description in maven.md.

Other than that, only minor tweaks as I try to run the gauntlet of the zillions of checks run on the code.

paul-rogers avatar Apr 09 '22 02:04 paul-rogers

The new IT task passed, hooray! Whacked a few more static checking issues.

There is one I don't understand. It appears that we've got JS problems, but I didn't change anything in JS:

added 235 packages from 867 contributors and audited 235 packages in 12.562s
found 4 vulnerabilities (2 moderate, 1 high, 1 critical)
  run `npm audit fix` to fix them, or `npm audit` for details
events.js:183
      throw er; // Unhandled 'error' event
      ^
TypeError: Cannot read property 'forEach' of undefined
    at unpackage (/home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:27:14)
    at /home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:114:22
    at Parser.<anonymous> (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:304:18)
    at emitOne (events.js:116:13)
    at Parser.emit (events.js:211:7)
    at SAXParser.onclosetag (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:262:26)
    at emit (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:624:35)
    at emitNode (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:629:5)
    at closeTag (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:889:7)
    at SAXParser.write (/home/travis/build/apache/druid/node_modules/sax/lib/sax.js:1436:13)
    at Parser.exports.Parser.Parser.parseString (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:323:31)
    at Parser.parseString (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:5:59)
    at exports.parseString (/home/travis/build/apache/druid/node_modules/xml2js/lib/parser.js:369:19)
    at Object.parse.parseContent (/home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:107:5)
    at /home/travis/build/apache/druid/node_modules/jacoco-parse/source/index.js:129:15
    at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:511:3)
****FAILED****

Is this saying that the build itself has broken code? If so, maybe it will go away on the next build?

paul-rogers avatar Apr 11 '22 22:04 paul-rogers

Rebased on latest master to try to fix the prior issue. Unfortunately, the issue didn't resolve.

Now getting a different unrelated failure:

[ERROR] org.apache.druid.query.groupby.epinephelinae.BufferHashGrouperTest.testGrowingOverflowingInteger  Time elapsed: 0.003 s  <<< ERROR!
java.lang.OutOfMemoryError
	at sun.misc.Unsafe.allocateMemory(Native Method)
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
	at org.apache.druid.query.groupby.epinephelinae.BufferHashGrouperTest.makeGrouper(BufferHashGrouperTest.java:187)

paul-rogers avatar Apr 13 '22 16:04 paul-rogers

New commit. We have to exclude the new test code from Jacoco since it is not itself unit tested. That was painful because the test classes were in "generic" Druid packages. Moved the test code into a dedicated package so we can just exclude that one package.

Migrated the remainder of the batch index tests. This showed some redundancy in the required test code, so I created a test runner to hide that boilerplate. Test conversion is now very easy -- at least for the sample of tests converted thus far.
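
As an illustration of the idea (the runner and class names here are placeholders, not necessarily what the framework uses), a converted test can shrink to little more than its test logic:

// Sketch only: DruidTestRunner is a placeholder name for a runner that
// loads the config, builds the clients, and waits for the cluster.
import org.junit.Test;
import org.junit.runner.RunWith;

@RunWith(DruidTestRunner.class)  // hypothetical runner hiding the boilerplate
public class ITBatchIndexExample
{
  @Test
  public void testIngestAndQuery() throws Exception
  {
    // With setup handled by the runner, the test body holds only test logic.
  }
}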

Also includes other minor doc changes and build issue fixes.

paul-rogers avatar Apr 14 '22 05:04 paul-rogers

@kfaraz, thank you for your thorough review, and for trying out the new setup. Always great to know it runs on a machine other than my own!

You mentioned flaky tests and how to retry them. Two thoughts on that.

First, we should not have flaky tests. IMHO, such tests:

  • Are flaky because they start running before the cluster is stable,
  • Are not telling us anything if the tests themselves are flaky (because they depend on timing, or on behavior which is inherently non-deterministic, such as the ordering of events from different services), or
  • Are pointing out actual issues with Druid: namely, that clients would have to retry operations. We should either a) fix that issue, or b) document it. Either way, the tests should be prepared for whatever race or non-deterministic condition is in question.

The new framework eliminates the first issue: it ensures that services are ready before launching tests. This means that either the test or Druid is flaky. Either way, we should fix the issue: remove the test if it is not useful, else fix it or fix Druid (perhaps adding a way to synchronize when needed for testing).
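
For illustration, here is a minimal sketch of the kind of "wait until ready" check involved, polling Druid's /status/health endpoint; the port, timeouts, and class name are examples, not the framework's actual code:

// Sketch only: timing values and the hard-coded Coordinator URL are examples.
import java.net.HttpURLConnection;
import java.net.URL;

public class ReadinessSketch
{
  public static void waitForService(String healthUrl, long timeoutMillis) throws Exception
  {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (System.currentTimeMillis() < deadline) {
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(healthUrl).openConnection();
        conn.setConnectTimeout(2_000);
        if (conn.getResponseCode() == 200) {
          return; // service is up; safe to start tests
        }
      }
      catch (Exception e) {
        // Not up yet: fall through and retry.
      }
      Thread.sleep(1_000);
    }
    throw new IllegalStateException("Service not ready: " + healthUrl);
  }

  public static void main(String[] args) throws Exception
  {
    // Each Druid service exposes /status/health; 8081 is the default Coordinator port.
    waitForService("http://localhost:8081/status/health", 120_000);
  }
}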

paul-rogers avatar May 06 '22 21:05 paul-rogers

All that said, there is the second issue: rerunning specific tests. This is a harder issue than one would think.

The reason to combine tests is that, in this new system, the bulk of the time for each "test group" is consumed with building Druid. If we keep the tests split up, we end up rebuilding Druid over and over and over. Allowing retries means retaining our current extravagant abuse of Travis resources.

The obvious solution to the redundancy issue is to build Druid and the image once, then run all the test groups that use that particular configuration. Since we have multiple configurations, the various configurations would still run in parallel, but the test "groups" would run in series within each configuration.

Of course, if we retain flaky tests, then we want to play "whack-a-mole" in builds: keep rerunning only those tests that failed until we get lucky and they pass. By combining tests, we decrease the probability of getting lucky. As mentioned above, the obvious answer is to fix the flaky tests, which we are starting to do.

Another constraint is how Travis seems to work. We can only rerun jobs within Travis's build matrix. It does not seem we can parameterize the job to say, "just run the ITs, with only these two projects." To be able to rerun one test "group" we have to let each group (for each configuration) build all of Druid, which gets us back to the redundancy issue.

Short term, I'm thinking of doing an experiment in which each test "group" is triggered by a separate Maven profile. We can then also have an "all-its" profile that enables all the groups. Until we resolve flaky tests, we can opt to waste resources and build profile-by-profile (that is, group-by-group) as we do today. Later, when tests are fixed (or if we identify groups which are not flaky), we can combine them via profiles.

I'll try that in a separate commit so I can easily back it out if it does not work out.

paul-rogers avatar May 06 '22 22:05 paul-rogers

@kfaraz, good point on the docs. Yes, the docs started as my attempt to remember the many details of how the original tests worked, and what I changed as I created this new version. Per your suggestion, I created a quickstart.md page with usage info. We can expand that as we figure out what additional information is most often needed. I added links into the more detailed docs for when someone needs more information.

The idea on conversion is to try out a few groups here, then convert the others over time. I was perhaps lucky: the groups I converted so far mostly "just worked." I've encountered no flakiness in those tests, in my limited runs, after I made sure the cluster was up before running the tests.

We'll have to see, as we convert more, if the others are as easy, or if there will be tricky bits. If there are tests that are still flaky, we'll have to sort them out on a case-by-case basis, as suggested above.

Let's also remember that there is a big chunk of work not addressed in this PR: running a secured cluster. There is code in the old ITs to create certs, configure security, etc. Tests run that way will be very difficult to debug, by definition. That whole area is left as an open issue in this PR, in part because this one is already large enough.

paul-rogers avatar May 06 '22 22:05 paul-rogers

This branch has been open long enough that it drifted out of sync with master. Rebasing ran into the usual issues when a zillion files change, so I squashed commits so the rebase would succeed. Fortunately, the squashed commits are those that have already been reviewed; no additional changes were made before squashing. New changes show up as new commits on top of the squash. In this latest commit, I updated the project version from 0.23.0 to 0.24.0 so that the builds will work.

paul-rogers avatar May 16 '22 22:05 paul-rogers

Getting an odd failure:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (process-resource-bundles) 
on project it-high-availability: 
Error resolving project artifact: 
Could not transfer artifact io.confluent:kafka-schema-registry-client:pom:5.5.1 
from/to sigar (https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/): 
Transfer failed for https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom 
for project io.confluent:kafka-schema-registry-client:jar:5.5.1: 
peer not authenticated -> [Help 1]

First, the project on which this fails does not include the artifact. Second, the project that does use it already built, so the artifact should be cached locally. Third, why is the peer not authenticated?

paul-rogers avatar May 19 '22 20:05 paul-rogers

@kfaraz, thanks for the review. It's been a long slog to resolve the many Maven issues with all our many static checks.

You asked about the "experimental docker tests" task in this PR. Yes, it is experimental: I'll remove (or disable) it before we commit. For now, I envision we won't run the tests in the maven build since they duplicate existing tests. Instead, a good next step would be to migrate each IT one by one: convert it to the new framework, replace the current IT tasks with a new version (based on the "experimental" one), and verify the results.

The plan is to get a clean build. Once that is done, I'll remove the experimental step and we can commit this monster.

As we move ahead, the new framework will run in phase 2, in place of the existing items. During the interim, we can mix-and-match mechanisms: the Travis builds are all independent of one another. That is a problem in general (we redo the same work over and over) but turns out to be a help during the transition.

paul-rogers avatar May 23 '22 20:05 paul-rogers

Currently trying to track down a mysterious error. In `it-high-availability`, Maven is unable to find a particular jar file. Looks like it works one time (in Java 8), but fails another time (in Java 11):

[INFO] Downloading from google-maven-central: https://maven-central.storage-download.googleapis.com/maven2/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from sonatype: https://oss.sonatype.org/content/repositories/releases/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from sonatype-apache: https://repository.apache.org/content/repositories/releases/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from apache.snapshots: https://repository.apache.org/snapshots/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[INFO] Downloading from sigar: https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (process-resource-bundles) on project it-high-availability: 
Error resolving project artifact: 
Could not transfer artifact io.confluent:kafka-schema-registry-client:pom:5.5.1 from/to sigar (https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/): 
Transfer failed for https://repository.jboss.org/nexus/content/repositories/thirdparty-uploads/io/confluent/kafka-schema-registry-client/5.5.1/kafka-schema-registry-client-5.5.1.pom for project io.confluent:kafka-schema-registry-client:jar:5.5.1: 
peer not authenticated -> [Help 1]

This is pushing against the edge of my Maven knowledge. I'm hoping it is just something silly I'm doing that shows up as the above mysterious error. Has anyone seen something similar? For example, did the first attempt above succeed? Or is the error a failure in that attempt, reported later? Or is Maven trying to get the jar twice, where the first attempt worked but the second failed?

Seems the 5.5.1 release is still available, so that isn't a problem. (It is old, and has vulnerabilities, so we should probably upgrade.)

This seems to be a transitive dependency brought in from druid-integration-tests. Yet, that module seems to compile. Still scratching my head...

paul-rogers avatar May 23 '22 21:05 paul-rogers

Rebased on latest master to try to overcome the perpetual "confluent" jar errors. Let's see if this lets us get a clean build.

paul-rogers avatar Jun 01 '22 23:06 paul-rogers

Sad. This latest run should have worked, but it seems there are issues on Travis with finding the JRE. Sigh. We'll have to wait for those issues to be resolved; then, could some committer please restart the build?

paul-rogers avatar Jun 03 '22 17:06 paul-rogers

Getting a clean build is proving quite difficult. Out of desperation, we'll pull two groups of changes out of this PR into separate PRs so the build issues are easier to debug. In particular, it is hard at present to separate actual errors in the "old" ITs from the flaky ITs. Let's get those other two PRs done, then we'll rebase this on the updated master so that only the new IT code remains. That way, if an old IT fails, we'll have some confidence that it is just flaky, not broken.

paul-rogers avatar Jun 17 '22 18:06 paul-rogers

Revised to prepare for merge:

  • Parameterized tests
  • Test creation guide
  • Main IT script: it.sh
  • Enhanced configuration options: env vars, etc.
  • Test runner supports parameterized tests
  • Test runner allows test-specific code configuration (add Guice modules, etc.)
  • Other cleanup, bug fixes

paul-rogers avatar Aug 20 '22 00:08 paul-rogers

The latest update to this PR converted two ITs to use the new framework. Both pass in Travis. (And there was much rejoicing.)

However, two of the old ITs fail for obscure reasons. A security test fails with an auth failure, but the same test was clean in a prior build. Another IT can't find its input file, though this PR changes none of the input files or paths in the old tests. These are not the usual flaky-IT suspects, so we probably have to sort out what's what. Once we do, this should be good to go.

paul-rogers avatar Aug 22 '22 02:08 paul-rogers

@kfaraz, thanks for your approval of this PR. The final changes are in for this PR and the build is clean. Please take a quick final look, and merge the PR at your convenience.

paul-rogers avatar Aug 24 '22 00:08 paul-rogers

Thanks for the update, @paul-rogers ! I will take another look today and merge this.

kfaraz avatar Aug 24 '22 05:08 kfaraz