datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Feat: Implement hf:// / "hugging face" integration in datafusion-cli

Open xinlifoobar opened this issue 1 year ago • 13 comments

Which issue does this PR close?

Closes #10720

Rationale for this change

What changes are included in this PR?

Are these changes tested?

# xinli @ arch-dev in ~/source/repos/datafusion/datafusion-cli on git:dev/xinli/hfstore o [12:16:09] 
$ ./target/debug/datafusion-cli
DataFusion CLI v38.0.0
> SELECT count(*) AS count
FROM 'hf://datasets/cais/mmlu/astronomy/dev-00000-of-00001.parquet';
+-------+
| count |
+-------+
| 5     |
+-------+
1 row(s) fetched. 
Elapsed 2.469 seconds.

> create external table test stored as parquet location "hf://datasets/cais/mmlu/astronomy/";
0 row(s) fetched. 
Elapsed 1.398 seconds.

> select count(*) from test;
+----------+
| COUNT(*) |
+----------+
| 173      |
+----------+
1 row(s) fetched. 
Elapsed 1.199 seconds.

> select * from test limit 2
;
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| question                                                                             | subject   | choices                                                                                                                                                                                                                                                                                                                          | answer |
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
| You cool a blackbody to half its original temperature. How does its spectrum change? | astronomy | [Power emitted is 1/16 times as high; peak emission wavelength is 1/2 as long., Power emitted is 1/4 times as high; peak emission wavelength is 2 times longer., Power emitted is 1/4 times as high; peak emission wavelength is 1/2 as long., Power emitted is 1/16 times as high; peak emission wavelength is 2 times longer.] | 3      |
| What drives differentiation?                                                         | astronomy | [Spontaneous emission from radioactive atoms., The minimization of gravitational potential energy., Thermally induced collisions., Plate tectonics.]                                                                                                                                                                             | 1      |
+--------------------------------------------------------------------------------------+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------+
2 row(s) fetched. 
Elapsed 1.098 seconds.

Are there any user-facing changes?

xinlifoobar avatar Jun 04 '24 15:06 xinlifoobar

Hi @alamb. I am still working on this for completing UTs, and E2Es and fixing bad code styles... Could you please help do a pre-quick review of the ideas behind... I did not find a simple way to implement this other than creating a wrapper ObjectStore impl on top of HttpStore.

xinlifoobar avatar Jun 07 '24 04:06 xinlifoobar

Hi @alamb. I am still working on this for completing UTs, and E2Es and fixing bad code styles... Could you please help do a pre-quick review of the ideas behind... I did not find a simple way to implement this other than creating a wrapper ObjectStore impl on top of HttpStore.

This is now able to be reviewed now. I completed most of the refining of the initial code.

xinlifoobar avatar Jun 10 '24 09:06 xinlifoobar

I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)

alamb avatar Jun 11 '24 01:06 alamb

I am sorry @xinlifoobar for the delayed review -- I am traveling this week (actually presenting about DataFusion at SIGMOD: https://2024.sigmod.org/industrial-list.shtml)

Ya, the first time I found this on my timeline on LinkedIn, and am glad to be part of this awesome project.

I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations.

xinlifoobar avatar Jun 11 '24 03:06 xinlifoobar

I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations.

If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )

alamb avatar Jun 15 '24 16:06 alamb

I would pause updating on this PR since it is extremely large IMO and difficult for reviewers. Let me know your thoughts on it and I could do an update in the following iterations.

If it is too complicated, maybe we should just stop working on it (or maybe we should put the code into a datafusion-contrib repo 🤔 )

Ya. re-implementing the datastore and associated facilities is code-consuming. Do you think the objectstore solution is the right way to go? If so, I could split part of the code into datafusion-contrib repo.

xinlifoobar avatar Jun 17 '24 10:06 xinlifoobar

Sorry for the delay -- I paln to review this tomorrow

alamb avatar Jun 28 '24 00:06 alamb

Sorry for the delay -- I paln to review this tomorrow

Sorry, I lost Github connections for a couple of days and just returned. Also Thanks. please take your time.

xinlifoobar avatar Jul 02 '24 13:07 xinlifoobar

This is still on my list, but I am behind in my reviews

alamb avatar Jul 07 '24 11:07 alamb

Hi @xinlifoobar

I am sorry for the delay in responding to this PR. This is an amazing piece of software engineering. Very nice 🎩 👌

As you have noted, the challege here is that the hs_store is non trivial and yet somewhat specialized for HuggingFace. It is a really neat feature but somewhat hard to justify adding to the datafusion-cli in the DataFusion repo.

I feel like there is a tension between making datafusion-cli easy to use with many built in integrations (e.g. hugging face, delta-rs, etc) and keeping the dependencies manageable

What would you think about somehow moving this hugging face integration into another repo and making some version of datafusion-cli that had a bunch of pre-defined integrations?

For example, maybe put it in https://github.com/datafusion-contrib/datafusion-cli-plus or something

That could be like the power user version of datafusion-cli that we could use all the fun table providers (like what is in the connector libraries, etc), delta-rust, iceberg-rust, etc

If you think this is a reasonable idea, I will file a ticket for larger discussion

Hey @alamb, thanks for taking the time to review this PR. Great honor to me on this.

I am glad to have a repo like datafusion-cli-plus but think of a broader project for a new repo.

As you may have already observed, the datafusion-cli currently supports the Datafusion SQL interface. Are you considering expanding its capabilities to encompass protocols such as MySQL client, Arrow Flight SQL, and others? This expansion would entail naming it act to a full-fledged server. I've observed similar approaches being implemented in downstream projects like InfluxDB.

xinlifoobar avatar Jul 21 '24 03:07 xinlifoobar

Apologies for missing this PR. I wanted to share that OpenDAL has native support for Huggingface. I'm considering whether integrating with OpenDAL would be beneficial, making it easier, more extensible, and maintainable to cool features like this one (as a response to https://github.com/apache/datafusion/issues/12357).

Xuanwo avatar Sep 06 '24 16:09 Xuanwo

I'm considering whether integrating with OpenDAL would be beneficial, making it easier, more extensible

I think using the connectivity offered by openDAL in a tool such as https://github.com/apache/datafusion/issues/11979 would be very interesting.

Basically I think it would be valuable to distinguish between the query processing engine in DataFusion and the larger integrated applications that are built with DataFusion (and other components)

alamb avatar Sep 06 '24 18:09 alamb

marking as draft as I don't think this is waiting on review -- more like waiting on porting to another repo (like dft)

alamb avatar Oct 18 '24 20:10 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Dec 18 '24 02:12 github-actions[bot]

I plan to look into integrating hugging face into dft in the coming weeks. First I will check if it can be done with opendal and if not will try to leverage the implementation in this pr.

matthewmturner avatar Dec 18 '24 02:12 matthewmturner

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Feb 18 '25 01:02 github-actions[bot]