Bibhu Pala
After line 55 of train.py, add the following code. It will produce a voca.txt file containing each word and its id.

```
vocab_dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(vocab_dict.items(),...
```
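The snippet above is truncated; here is a minimal sketch of the idea it describes, using a plain dict in place of `vocab_processor.vocabulary_._mapping` (the sort key and the tab-separated file layout are assumptions, not the original code):

```python
# Stand-in for vocab_processor.vocabulary_._mapping: maps word -> integer id.
vocab_dict = {"the": 0, "cat": 2, "sat": 1}

# Sort entries by id so the file lists words in index order.
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

# Write one "word<TAB>id" pair per line to voca.txt.
with open("voca.txt", "w") as f:
    for word, idx in sorted_vocab:
        f.write(f"{word}\t{idx}\n")
```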
Delete all your test data and keep just a single line.
@Nadedic Yes! We don't test with the same data; that won't give a true measure of the model's accuracy. If you have trained, then you will have a text file containing...
(209, 20000) is the shape of the matrix, so the feature-vector length is 20000. The larger the dimension, the more time it will take; try to reduce the dimensionality.
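One common way to reduce the feature dimension is a truncated SVD projection; the comment above doesn't name a method, so this NumPy sketch is only an illustrative assumption (a smaller 2000-column matrix stands in for the 20000-column one to keep the demo fast):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((209, 2000))  # stand-in for the (209, 20000) feature matrix

# Truncated SVD: project each row onto the top-k right singular vectors,
# shrinking the feature dimension from 2000 down to k.
k = 50
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T

print(X_reduced.shape)  # (209, 50)
```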
Hi, I am facing this error using Spark 3.3.1 and Scala 2.12.15. Has anyone fixed it yet?
Hi, you can try the record-level index (https://hudi.apache.org/blog/2023/11/01/record-level-index/#metadata-table); it stores the record keys in the metadata table. But I am not sure whether this indexing can be applied to COW tables.
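For reference, the record-level index linked above is enabled through write options on the metadata table; a hedged sketch of the relevant configs (check the linked blog for your Hudi version, as option names may differ across releases):

```python
# Options that enable Hudi's record-level index (RLI); the metadata table
# itself must be enabled for the record index to be built.
hudi_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
    "hoodie.index.type": "RECORD_INDEX",
}

# These would be passed to a DataFrame writer, e.g. df.write.format("hudi").options(**hudi_options)...
print(hudi_options["hoodie.index.type"])
```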
Thanks for providing your suggestions @ad1happy2go.
1. Even right now we are doing a groupBy and collect_list; this fails when the array size is more than 2GB.
2. As you...
@ad1happy2go Thanks for the suggestion. This makes sense; I was thinking in the same direction of using two different tables for it.
@danny0405 Are you planning to create a JIRA ticket for this? We started using RLI, but we will need support for creating TTL policies for RLI.
Why do we need to set [hoodie.upsert.shuffle.parallelism](https://hudi.apache.org/docs/configurations/#hoodieupsertshuffleparallelism)? From 0.13.0 onwards, Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is...
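When you do want to override the Spark-deduced value, the config is passed as a write option; a hedged PySpark sketch, not runnable without a Spark/Hudi environment (the table path, parallelism value, and `df` are hypothetical):

```python
# Hypothetical PySpark upsert; assumes `df` is a DataFrame and the Hudi
# bundle is on the classpath. Setting the option pins the shuffle
# parallelism instead of letting Hudi/Spark deduce it from the source data.
(df.write.format("hudi")
   .option("hoodie.datasource.write.operation", "upsert")
   .option("hoodie.upsert.shuffle.parallelism", "200")  # explicit override
   .mode("append")
   .save("/tmp/hudi/my_table"))
```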