HIVE-28377: Add support for hive.output.file.extension to HCatStorer

Open VenkatSNarayanan opened this issue 1 year ago • 5 comments

What changes were proposed in this pull request?

HCatStorer now respects hive.output.file.extension for the output files it writes.
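As a rough illustration of the behaviour described above (not the code in this patch), the sketch below appends the configured extension to an output file name when the property is set and leaves the name unchanged otherwise. The class `OutputExtensionSketch`, the helper `outputFileName`, and the file name `part-m-00000` are hypothetical, and `java.util.Properties` stands in for the Hadoop job configuration so the example stays self-contained.

```java
import java.util.Properties;

final class OutputExtensionSketch {
    // Property name as given in the PR title.
    static final String OUTPUT_FILE_EXTENSION = "hive.output.file.extension";

    // Hypothetical helper: append the configured extension, if any, to a base file name.
    static String outputFileName(String baseName, Properties conf) {
        String extension = conf.getProperty(OUTPUT_FILE_EXTENSION);
        if (extension == null || extension.isEmpty()) {
            return baseName; // property unset or empty: the name is left unchanged
        }
        return baseName + extension;
    }
}
```

With the property set to `.parquet`, `outputFileName("part-m-00000", conf)` returns `part-m-00000.parquet` in this sketch; with the property unset it returns `part-m-00000`.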

Why are the changes needed?

Brings HCatStorer's feature set more in line with Hive's.

Does this PR introduce any user-facing change?

Adds support for a property to HCatStorer that would previously have been ignored.

Is the change a dependency upgrade?

No

How was this patch tested?

TestHCatExtension was added.

VenkatSNarayanan, Jul 17 '24 21:07

Can the file extension be detected instead of the user setting it explicitly? Setting it explicitly may not be a good idea if a single job writes to different file formats. For example, the user sets hive.output.file.extension to '.parquet', but then the same data is loaded into both Parquet and ORC tables.

yigress, Aug 07 '24 17:08

@yigress So for my implementation I just tried to match what Hive does in its existing implementation. If a sequential job wants to have different extensions for different tables, it can simply adjust the setting between HCat queries. Does Hive do something other than that to support that case that I missed?

VenkatSNarayanan, Sep 17 '24 22:09

> @yigress So for my implementation I just tried to match what Hive does in its existing implementation. If a sequential job wants to have different extensions for different tables, it can simply adjust the setting between HCat queries. Does Hive do something other than that to support that case that I missed?

In Hive, the property is only used for the text file format: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L916

+1 for this useful feature. If a single job only loads into one table and the table format is known, the user can switch the value of the property between jobs. I am not an expert in Pig, but I have the impression that one Pig job can load into multiple tables simultaneously, in which case the user needs to be careful about setting this property.
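The concern can be made concrete with a small sketch, assuming a single job-wide extension is applied to every table the job writes to. Everything here is hypothetical (the class `SharedExtensionSketch`, the table names, the file name `part-m-00000`), and it models neither Pig nor HCatStorer; it only shows how one shared value can mislabel the files of one of the tables.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SharedExtensionSketch {
    // Hypothetical: name one output file per target table using a single job-wide extension.
    static Map<String, String> nameOutputs(Map<String, String> tableFormats, String jobWideExtension) {
        Map<String, String> names = new LinkedHashMap<>();
        for (String table : tableFormats.keySet()) {
            // The same extension is used regardless of the table's storage format.
            names.put(table, "part-m-00000" + jobWideExtension);
        }
        return names;
    }

    // True when the file name ends with "." followed by the table's format name.
    static boolean matchesFormat(String fileName, String format) {
        return fileName.endsWith("." + format);
    }
}
```

For tables `events_parquet` (format `parquet`) and `events_orc` (format `orc`) with the extension `.parquet`, both files in this sketch are named `part-m-00000.parquet`, so `matchesFormat` is true for the first table and false for the second. Running the two loads as separate jobs and changing the property in between avoids the mismatch.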

yigress, Sep 17 '24 23:09