HudiStreamer takes too much time when sanitization is enabled

Open • Amar1404 opened this issue 2 years ago • 6 comments

Describe the problem you faced

I have enabled SANITIZE_SCHEMA_FIELD_NAMES, and Hudi DeltaStreamer gets stuck after reading the CSV. I think the code can be refactored in a better way: instead of using withColumnRenamed, the transformation could look something like this:

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Builds one selectExpr projection so that every column is renamed in a single pass.
def transformSchemaBeginEndCharReplace(spark: SparkSession, final_stream: Dataset[Row], pii_masking_col: Seq[Any]): Dataset[Row] = {
  val sql_select = new StringBuilder
  val schema = final_stream.schema
  for (i <- schema) {
    if (i.dataType.isInstanceOf[StructType] || i.dataType.isInstanceOf[ArrayType]) {
      // Nested columns are serialized to a JSON string.
      sql_select.append(s"cast(to_json(`${i.name}`) as String)")
      sql_select.append(" as ")
      sql_select.append(avroSchemaNameConversionBeginEndCharReplace(i.name) + " , ")
    }
    else if (pii_masking_col.contains(i.name)) {
      // PII columns are hashed with SHA-1.
      sql_select.append(s"sha1(`${i.name}`)")
      sql_select.append(" as ")
      sql_select.append(avroSchemaNameConversionBeginEndCharReplace(i.name) + " , ")
    }
    else {
      // All other columns are only renamed.
      sql_select.append(s"`${i.name}`")
      sql_select.append(" as ")
      sql_select.append(avroSchemaNameConversionBeginEndCharReplace(i.name) + " , ")
    }
  }
  // Splitting on "," assumes that no source column name contains a comma.
  val final_sql = sql_select.toString().stripSuffix(" , ").split(",")
  final_stream.selectExpr(final_sql: _*)
}

// Removes every character outside [A-Za-z0-9_]; a leading digit is kept unchanged.
def avroSchemaNameConversionBeginEndCharReplace(name: String): String = {
  val regexPattern = "(^[0-9])|(^[^a-zA-Z_])|(([^A-Za-z0-9_])$)|([^A-Za-z0-9_])".r
  val outputString = regexPattern.replaceAllIn(name, m => {
    if (m.group(1) != null) {
      s"${m.group(1)}"
    } else if (m.group(2) != null || m.group(3) != null) {
      ""
    } else {
      ""
    }
  })
  outputString
}

This works faster in my local transformation, and we can adjust it as needed.
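The idea described here, one projection for all columns instead of one rename per column, can be sketched without Spark by generating the expression list first. This is a minimal sketch under stated assumptions, not Hudi's implementation: `sanitize`, `quote`, `renameExprs`, and `collisions` are hypothetical helpers, and the rule of replacing invalid characters with an underscore is chosen only for illustration. Keeping the expressions in a `Seq` means a source column name that itself contains a comma stays in one piece, and the collision check flags cases where two source names map to the same sanitized name.

```scala
object ProjectionSketch {
  // Hypothetical sanitizer (an assumption, not Hudi's rule): replaces every
  // character outside [A-Za-z0-9_] with "_" and prefixes "_" to a leading digit.
  def sanitize(name: String): String = {
    val replaced = name.replaceAll("[^A-Za-z0-9_]", "_")
    if (replaced.nonEmpty && replaced.head.isDigit) "_" + replaced else replaced
  }

  // Quotes a source column name for Spark SQL, doubling embedded backticks.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // One "source as alias" expression per column, kept as separate elements.
  def renameExprs(columns: Seq[String]): Seq[String] =
    columns.map(c => s"${quote(c)} as ${sanitize(c)}")

  // Sanitized names that more than one source column maps to.
  def collisions(columns: Seq[String]): Set[String] =
    columns.groupBy(sanitize).collect { case (alias, srcs) if srcs.size > 1 => alias }.toSet
}
```

With a Spark Dataset `df` in scope, the result could be applied as `df.selectExpr(ProjectionSketch.renameExprs(df.columns.toSeq): _*)`, after checking that `ProjectionSketch.collisions(df.columns.toSeq)` is empty.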

Environment Description

  • Hudi version : 0.14.1

  • Spark version : 3.3

  • Hive version :

  • Hadoop version :

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) :

Amar1404 • Jan 09 '24 13:01

@Amar1404 Sorry for the delay; I missed this one. I will look into it and get back to you.

ad1happy2go • Jan 26 '24 16:01

@ad1happy2go - Any update on this?

Amar1404 • Mar 06 '24 07:03

@Amar1404 As discussed, ideally withColumnRenamed should not be slow. I will benchmark the numbers with and without sanitisation and get back to you.

ad1happy2go • Mar 06 '24 11:03

@ad1happy2go - Any update on this?

Amar1404 • Apr 05 '24 04:04

@Amar1404 I never got a chance to try this out. I will prioritise it next week. Thanks.

ad1happy2go • Apr 11 '24 16:04

Hi @ad1happy2go - Any updates on this?

Amar1404 • Apr 19 '24 12:04