kyuubi icon indicating copy to clipboard operation
kyuubi copied to clipboard

[Improvement] Switch to UUID v1 to avoid collision

Open pan3793 opened this issue 3 years ago • 5 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

What would you like to be improved?

Currently, Kyuubi uses java.util.UUID#randomUUID() to generate UUID, collision is more likely than UUID v1.

    /**
     * Static factory to retrieve a type 4 (pseudo randomly generated) UUID.
     *
     * The {@code UUID} is generated using a cryptographically strong pseudo
     * random number generator.
     *
     * @return  A randomly generated {@code UUID}
     */
    public static UUID randomUUID() {
        SecureRandom ng = Holder.numberGenerator;

        byte[] randomBytes = new byte[16];
        ng.nextBytes(randomBytes);
        randomBytes[6]  &= 0x0f;  /* clear version        */
        randomBytes[6]  |= 0x40;  /* set to version 4     */
        randomBytes[8]  &= 0x3f;  /* clear variant        */
        randomBytes[8]  |= 0x80;  /* set to IETF variant  */
        return new UUID(randomBytes);
    }

Ref: https://www.sohamkamani.com/uuid-versions-explained/

How should we improve?

Switch to UUID v1 algorithm.

A UUID v1 implementation under Apache License https://github.com/datastax/java-driver/blob/4.14.1/core/src/main/java/com/datastax/oss/driver/api/core/uuid/Uuids.java

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

pan3793 avatar Aug 25 '22 16:08 pan3793

Seems we could directly use UuidUtil.getTimeBasedUuid() from org.apache.logging.log4j.core.util.UuidUtil. https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/util/UuidUtil.html The doc says that The generated UUID will be unique for approximately 8,925 years so long as less than 10,000 IDs are generated per millisecond on the same device (as identified by its MAC address).

lightning-L avatar Aug 30 '22 06:08 lightning-L

Great, but let's make a copy of it to kyuubi-common since the log4j2 classes may not available in Spark 3.2 and early versions.

pan3793 avatar Aug 30 '22 06:08 pan3793

I have tested the implementations of those UUID generation method locally. The original usage ensures the most randomness, and the uuid v1 only ensures the uniqueness, and the randomness is not good, especially when you generate a lot of uuid in a short time. @pan3793 do we still need to modify this? If so, we could add a new UuidUtils to kyuubi-common and as datastax implementation for a reference.

original usage (UUID v4): test code:

  test("v4") {
    val uuids = new Array[UUID](200)
    for (i <- 0 to 199) {
      uuids(i) = UUID.randomUUID
      // scalastyle:off println
      println(uuids(i).toString)
    }
  }

output:

5abe7d45-0fbd-41a7-b5a7-249900254acd
9ea6109f-5e70-44cd-901e-b34dcaa84d53
f14681ac-ac9e-4039-b102-d885d29b61f2
ded46277-2c03-4c1b-a8ef-8e7075ac9d7a
31de13aa-ea38-4c71-a319-ec17c4c823c5
273c27de-b48d-46ad-a2ea-6acb70d8cb9b
05178dfd-3986-42fe-920f-c0c8cdf43655
8cc382bb-3ef0-4845-8272-74d1bc7f4a42
ad7ed26e-19df-4412-a97e-9e779bc9905c
ddf58d25-79dc-4d99-8695-74ea62fdae6d
1eafefd9-b452-4c5a-b1af-15f09ff49b8c
82e8dea6-4e12-4896-9154-5339b4cc394a
0569b4d1-0eee-4efb-8fd1-8cf127230004
e6e78e41-2180-482d-8446-e717fc80e21b
......

log4j2 implementation (UUID v1) test code:

  test("v1-log4j2 implementation") {
    val uuids = new Array[UUID](COUNT)
    for (i <- 0 to COUNT - 1) {
      uuids(i) = UuidUtil.getTimeBasedUuid()
      // scalastyle:off println
      println(uuids(i).toString)
    }
  }

output:

f4a6eeb1-291b-11ed-b233-ea76583b50ac
f4a6eeb2-291b-11ed-b233-ea76583b50ac
f4a6eeb3-291b-11ed-b233-ea76583b50ac
f4a6eeb4-291b-11ed-b233-ea76583b50ac
f4a6eeb5-291b-11ed-b233-ea76583b50ac
f4a6eeb6-291b-11ed-b233-ea76583b50ac
f4a6eeb7-291b-11ed-b233-ea76583b50ac
f4a715c8-291b-11ed-b233-ea76583b50ac
f4a715c9-291b-11ed-b233-ea76583b50ac
f4a715ca-291b-11ed-b233-ea76583b50ac
f4a715cb-291b-11ed-b233-ea76583b50ac
f4a715cc-291b-11ed-b233-ea76583b50ac
f4a715cd-291b-11ed-b233-ea76583b50ac
f4a715ce-291b-11ed-b233-ea76583b50ac
......

run it again:

13625d31-291c-11ed-8c2c-ea76583b50ac
13625d32-291c-11ed-8c2c-ea76583b50ac
13625d33-291c-11ed-8c2c-ea76583b50ac
13625d34-291c-11ed-8c2c-ea76583b50ac
13625d35-291c-11ed-8c2c-ea76583b50ac
13625d36-291c-11ed-8c2c-ea76583b50ac
13625d37-291c-11ed-8c2c-ea76583b50ac
13625d38-291c-11ed-8c2c-ea76583b50ac
13625d39-291c-11ed-8c2c-ea76583b50ac
13625d3a-291c-11ed-8c2c-ea76583b50ac
13625d3b-291c-11ed-8c2c-ea76583b50ac
......

datastax implementation(UUID v1): test code

  test("v1-datastax implementation") {
    val uuids = new Array[UUID](COUNT)
    for (i <- 0 to COUNT - 1) {
      uuids(i) = Uuids.timeBased
      // scalastyle:off println
      println(uuids(i).toString)
    }
  }

output:

3701ee90-291c-11ed-b9c9-afece8e12126
378af4b0-291c-11ed-b9c9-afece8e12126
378af4b1-291c-11ed-b9c9-afece8e12126
378af4b2-291c-11ed-b9c9-afece8e12126
378af4b3-291c-11ed-b9c9-afece8e12126
378af4b4-291c-11ed-b9c9-afece8e12126
378af4b5-291c-11ed-b9c9-afece8e12126
378af4b6-291c-11ed-b9c9-afece8e12126
378af4b7-291c-11ed-b9c9-afece8e12126
378af4b8-291c-11ed-b9c9-afece8e12126
378af4b9-291c-11ed-b9c9-afece8e12126
......

run it again:

491eece0-291c-11ed-8083-853955f207b8
49779430-291c-11ed-8083-853955f207b8
4977bb40-291c-11ed-8083-853955f207b8
4977bb41-291c-11ed-8083-853955f207b8
4977bb42-291c-11ed-8083-853955f207b8
4977bb43-291c-11ed-8083-853955f207b8
4977bb44-291c-11ed-8083-853955f207b8
4977bb45-291c-11ed-8083-853955f207b8
4977bb46-291c-11ed-8083-853955f207b8
4977bb47-291c-11ed-8083-853955f207b8
4977bb48-291c-11ed-8083-853955f207b8
4977bb49-291c-11ed-8083-853955f207b8
......

This means, when server restarts, you will get different node part for the UUID(last 12 digits). I checked the code, it will do a MD5 hash to the local address. Seems better security than log4j2 implementation.

lightning-L avatar Aug 31 '22 11:08 lightning-L

Thanks @lightning-L for the experiment.

Technically, uniqueness should be the most important thing, but given the result is hard to distinguish by humans, I think we can introduce a configuration to allow users to switch to v1 in heavy workloads environments but keep the current behavior in default. WDYT @turboFei ?

pan3793 avatar Aug 31 '22 11:08 pan3793

uniqueness should be the most important thing

+1, the uniqueness is important for us, because we rely it to kill the engine.

introduce a configuration to allow users to switch to v1

Sounds good.

turboFei avatar Aug 31 '22 15:08 turboFei