[Proposal] Ensure consistency between code and documentation.
Is your feature request related to a problem or challenge?
There are currently two documentation hosting targets for DataFusion: docs.rs (several crates) and arrow.apache.org.
- docs.rs - All crates published to crates.io are automatically documented on docs.rs. Doc tests can also serve as working examples and ensure the correctness of documentation.
- User Guide - The documentation hosted on arrow.apache.org is generated by Sphinx from manually maintained sources. It's a great resource for getting an overview of the project and understanding how DataFusion works and how it is generally used.
Parts of the documentation could be shared between the two. For example, built-in functions should also be listed in the Expression API section of the user guide. However, each target maintains its own documentation source separately, and the two are not fully consistent with each other.
Describe the solution you'd like
Merge the relevant parts of the Sphinx source into Rust doc comments.
Then extract the doc comments from the JSON output of rustdoc (rfcs#2963 - nightly toolchain required) and generate markdown files. Finally, include these files in Sphinx by doctree.
A utility to generate markdown files from doc comments is required. Building it should not take much effort with rustdoc-json and rustdoc-types.
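To make the "doc comments to markdown" step more concrete, here is a minimal sketch of what the selection and rendering part of such a utility might look like. It assumes the items have already been read from the rustdoc JSON output (for example with rustdoc-types, which is not shown here). The `DocItem` struct, the `select` and `render_markdown` functions, the heading layout, and the sample doc strings are all hypothetical and only for illustration; they are not the real rustdoc JSON schema or an existing tool.

```rust
/// A documented item as it might look after being read from rustdoc's JSON output.
/// Hypothetical type for illustration; this is NOT the schema from `rustdoc-types`.
#[derive(Debug, Clone, PartialEq)]
struct DocItem {
    /// Full path of the item, e.g. "datafusion_expr::expr_fn::col".
    path: String,
    /// Item kind, e.g. "function" or "struct".
    kind: String,
    /// Doc comment text, if the item has one.
    docs: Option<String>,
}

/// Keep only items of `kind` that live directly under `module_path`,
/// sorted by path so the generated file is stable between runs.
fn select<'a>(items: &'a [DocItem], module_path: &str, kind: &str) -> Vec<&'a DocItem> {
    let prefix = format!("{module_path}::");
    let mut selected: Vec<&DocItem> = items
        .iter()
        .filter(|item| item.kind == kind)
        .filter(|item| {
            // Direct children only: nothing after the prefix may contain "::".
            item.path
                .strip_prefix(&prefix)
                .map_or(false, |rest| !rest.contains("::"))
        })
        .collect();
    selected.sort_by(|a, b| a.path.cmp(&b.path));
    selected
}

/// Render the selected items as a markdown fragment, one heading per item.
/// The heading layout is an assumption, not a format prescribed by Sphinx.
fn render_markdown(items: &[&DocItem]) -> String {
    let mut out = String::new();
    for item in items {
        // Use the last path segment as the heading text.
        let name = item.path.rsplit("::").next().unwrap_or(item.path.as_str());
        out.push_str(&format!("### `{name}`\n\n"));
        match item.docs.as_deref().map(str::trim) {
            Some(docs) if !docs.is_empty() => out.push_str(docs),
            _ => out.push_str("_No documentation._"),
        }
        out.push_str("\n\n");
    }
    out
}

fn main() {
    // Made-up sample data standing in for items parsed from rustdoc JSON.
    let items = vec![
        DocItem {
            path: "datafusion_expr::expr_fn::col".to_string(),
            kind: "function".to_string(),
            docs: Some("Sample doc comment for `col`.".to_string()),
        },
        DocItem {
            path: "datafusion_expr::expr_fn::lit".to_string(),
            kind: "function".to_string(),
            docs: None,
        },
    ];
    let selected = select(&items, "datafusion_expr::expr_fn", "function");
    print!("{}", render_markdown(&selected));
}
```

Sorting by path and emitting a placeholder for undocumented items keeps the generated files deterministic and makes missing doc comments easy to spot in the rendered guide.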
Describe alternatives you've considered
Create a shared doc folder between the Rust and Sphinx sources, and merge the relevant parts into one. Then include the external files by doc attribute in Rust, or by doctree in Sphinx.
It is really annoying to have to find the right file when writing docs during development, and I don't want to do that.🤪
Additional context
No response
Thank you @ongchi -- I think this is a really great idea.
Here is a similar issue from @andygrove https://github.com/apache/arrow-datafusion/issues/7951 that also has a PR https://github.com/apache/arrow-datafusion/pull/7956
As a pragmatic matter, here is my suggestion for incrementally implementing this without having to make a massive PR that will be a conflict magnet.
First, we accept that there will be a period of time where we have a split set of documentation (one auto-generated, and one static).
Then, build the new automatic documentation system you describe based on the functions in https://github.com/apache/arrow-datafusion/tree/main/datafusion/functions (it is a subset at the moment). That way we can get the tooling and pattern sorted out. Then, as we migrate the rest of the functions over, we can migrate the documentation as well.
Here is a quick (and dirty) proof of concept of the doc comments extractor.
# Under arrow-datafusion project folder
comment-extract \
--package "datafusion-expr" \
--module-path "datafusion_expr::expr_fn" \
--kind function
The output looks like this: https://gist.github.com/ongchi/ad5b256ddcf0dc5560e910e765e2c225
I took a quick look and https://gist.github.com/ongchi/ad5b256ddcf0dc5560e910e765e2c225 looks 👌 very nice -- thanks @ongchi
I believe we now achieve this goal using the doctest! macro -- for example
https://github.com/apache/datafusion/blob/37e54ee874e9027a329ba4f6c1e0e6359d63a33c/datafusion/core/src/lib.rs#L618-L643