If sanitization is enabled in HudiStreamer, it takes too much time
Describe the problem you faced
I have enabled SANITIZE_SCHEMA_FIELD_NAMES, and hudiDeltaStreamer gets stuck after reading the CSV. I think the code could be refactored in a better way. Instead of using withColumnRenamed, the transformation should be something like this:
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StructType}

def transformSchemaBeginEndCharReplace(spark: SparkSession, final_stream: Dataset[Row], pii_masking_col: Seq[Any]): Dataset[Row] = {
  // Build one select expression per column so that every rename happens in a single projection.
  val selectExprs = final_stream.schema.fields.toSeq.map { field =>
    val alias = avroSchemaNameConversionBeginEndCharReplace(field.name)
    val expr = field.dataType match {
      // Nested columns are serialized to a JSON string.
      case _: StructType | _: ArrayType => s"cast(to_json(`${field.name}`) as string)"
      // PII columns are hashed with SHA-1.
      case _ if pii_masking_col.contains(field.name) => s"sha1(`${field.name}`)"
      // Everything else is passed through unchanged.
      case _ => s"`${field.name}`"
    }
    s"$expr as `$alias`"
  }
  final_stream.selectExpr(selectExprs: _*)
}

def avroSchemaNameConversionBeginEndCharReplace(name: String): String = {
  val regexPattern = "(^[0-9])|(^[^a-zA-Z_])|(([^A-Za-z0-9_])$)|([^A-Za-z0-9_])".r
  // Keep a leading digit (group 1); drop every other character outside [A-Za-z0-9_].
  regexPattern.replaceAllIn(name, m => if (m.group(1) != null) m.group(1) else "")
}
This approach runs faster in my local transformation, and it could be adapted for Hudi.
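The difference between the two approaches can be checked locally with a rough sketch like the one below. It is only an illustration: `RenameComparison`, `sanitize`, the 500-column synthetic frame, and the `timeMs` helper are all assumptions made for this example, and `sanitize` is a simplified stand-in rather than Hudi's actual sanitization logic. The timing covers plan construction and analysis only, not data processing.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object RenameComparison {
  // Hypothetical stand-in for a sanitizer: replaces every character outside [A-Za-z0-9_] with "_".
  def sanitize(name: String): String = name.replaceAll("[^A-Za-z0-9_]", "_")

  // One withColumnRenamed call per column: each call builds and analyzes a new plan.
  def renameOneByOne(df: DataFrame): DataFrame =
    df.columns.foldLeft(df)((acc, c) => acc.withColumnRenamed(c, sanitize(c)))

  // All renames in a single select: one projection regardless of column count.
  def renameInOneSelect(df: DataFrame): DataFrame =
    df.select(df.columns.map(c => col(s"`$c`").as(sanitize(c))): _*)

  // Wall-clock time (ms) to build the renamed DataFrame and read its schema.
  def timeMs(body: => DataFrame): Long = {
    val start = System.nanoTime()
    body.schema
    (System.nanoTime() - start) / 1000000
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-comparison")
      .getOrCreate()

    // Synthetic wide frame whose column names need sanitizing: "col-0", "col-1", ...
    val wide = spark.range(1).select((0 until 500).map(i => col("id").as(s"col-$i")): _*)

    println(s"withColumnRenamed loop: ${timeMs(renameOneByOne(wide))} ms")
    println(s"single select:          ${timeMs(renameInOneSelect(wide))} ms")

    spark.stop()
  }
}
```

Increasing the column count should make any gap between the two timings easier to see, since the loop re-analyzes the plan once per column while the single select does it once.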
Environment Description
- Hudi version : 0.14.1
- Spark version : 3.3
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) :
@Amar1404 Sorry for the delay here; I missed this one. I will look into this and get back to you.
@ad1happy2go - Any update on this?
@Amar1404 As discussed, ideally withColumnRenamed should not be slow. I will benchmark the numbers with and without sanitisation and get back to you.
@ad1happy2go - Any update on this?
@Amar1404 I never got a chance to try this out. I will prioritise it next week. Thanks.
Hi @ad1happy2go - Any updates on this?