pytask icon indicating copy to clipboard operation
pytask copied to clipboard

DOC: Details for data catalog

Open hmgaudecker opened this issue 1 year ago • 8 comments

Location of the documentation

API ref Tutorial How-To Guide

Documentation problem

I tried to start from the basic tutorial, setting up a DataCatalog and then adding another via .add() to get a nested structure. Took me forever to realize that I can't do that, but need to specify the structure right with the initial entries.

Would be nice to add an example and to add a warning that DataCatalogs can't be added via DataCatalog.add()

hmgaudecker avatar Mar 11 '24 14:03 hmgaudecker

Hi @hmgaudecker, I know nesting data catalogs was part of an early version, but I decided to remove it again since it seems like an odd case, which is better handled by having multiple data catalogs next to each other.

Maybe we can raise an error. Let us see.

tobiasraabe avatar Mar 11 '24 23:03 tobiasraabe

Oh, I quite liked it.

What seems definitely necessary in large, complex projects is some sort of nesting structure. E.g. I would loop over model specifications, sample restrictions, etc -- seems much clearer to nest that rather than having very long names at one level of the hierarchy, which would need to be parsed all the time.

E.g., after converting only the data cleaning of a larger project so far, my data catalog directory looks as follows:

.
|____baseline_linear
| |____pooled
| | |____figures
| | |____analysis
| | |____tables
|____clean_data
| |____21186aab5c349047df58b3459bc81927a6beefbce940c09856357dabe82a0df9.pkl
| |____21186aab5c349047df58b3459bc81927a6beefbce940c09856357dabe82a0df9-node.pkl
| |____2bca653468d28a857ee3a3dd70e02f32a9c473666a7c65008097ea4c58992550.pkl
| |____a8c0fb6507c0e87f768015609c4c8d9d67b411ff8cef519359732dd10e23676b-node.pkl
| |____c36bfa1e446cf2f67ea0f82b6e991cae2078d07ad24238d1ab9e24819ad5cf65-node.pkl
| |____a8c0fb6507c0e87f768015609c4c8d9d67b411ff8cef519359732dd10e23676b.pkl
| |____2bca653468d28a857ee3a3dd70e02f32a9c473666a7c65008097ea4c58992550-node.pkl
| |____c36bfa1e446cf2f67ea0f82b6e991cae2078d07ad24238d1ab9e24819ad5cf65.pkl

I really wouldn't want to have to be using

.
|____baseline_linear-pooled-figures
|____baseline_linear-pooled-analysis
|____baseline_linear-pooled-tables
|____clean_data

instead.

(of course it does not matter much what the directory looks like, the important bit is accessing via data_catalog["baseline_linear"]["pooled"]["figures"] etc.)

FWIW, it still seems allowed in the latest version at least according to API docs. Also works in the latest released version.

hmgaudecker avatar Mar 12 '24 04:03 hmgaudecker

Sorry for thinking out loud a bit. So would the preferred way to achieve the above be to just specify a dictionary of data catalogs?

If so, should DataCatalog(name="baseline_linear/pooled/analysis") work or better specify the path? (No Windows support necessary)

hmgaudecker avatar Mar 12 '24 04:03 hmgaudecker

On a related note and not sure whether docs or bug: If defining an item of a data catalog implicitly, e.g.,

def task_silly(
        logs: Annotated[Path, Product] = data["logs"]
):

the default node type is used. When adding to the catalog upfront:

data.add("logs", Path(f"{logs}.db"))

the above code leads to a PathNode being used.

If that is the expected behaviour, it would be good to add it to the docs.

hmgaudecker avatar Mar 12 '24 10:03 hmgaudecker

This is also crucial for the section Developing with the data catalog. That is, the nodes that don't have the default type need to have this changed type in the version that I import. Makes perfect sense, of course, but cost me a bit of time until I realised it...

hmgaudecker avatar Mar 12 '24 11:03 hmgaudecker

(of course it does not matter much what the directory looks like, the important bit is accessing via data_catalog["baseline_linear"]["pooled"]["figures"] etc.)

FWIW, it still seems allowed in the latest version at least according to API docs. Also works in the latest released version.

What is documented is still a leftover. It should not be there. If it works, it works by accident.

Since users need to create each data catalog anyways, I thought users can do

data_catalogs = {
    "baseline_linear": {
        "pooled": {
            "analysis": DataCatalog()
            "figures": DataCatalog()
            "tables": DataCatalog()
        }
    }
}

It has the same effect and I do not need to support it in a special way.

If that is the expected behaviour, it would be good to add it to the docs.

It is expected behavior. I will update the docs and make it more clear.

tobiasraabe avatar Mar 12 '24 21:03 tobiasraabe

Hope it's okay to add thoughts here as I am gaining experience.

A few things that could become clearer IMO:

  • more links to DataCatalog API (e.g. I could not find any on the How-to Guide page)
  • When to explicitly add things (either via add method or via entries initially) and when to use implicit definition
  • Relatedly: How to deal with different types of nodes. Still not fully clear to me, maybe because I am still too deep in the old model of everything being a PathNode.
  • How to deal with debugging stuff (node with name being a hash not found in the data catalog directory -- I am baffled as to how to proceed)

hmgaudecker avatar Mar 13 '24 18:03 hmgaudecker

Not quite related, but the example on hashing PythonNodes is broken because value is keyword-only.

hmgaudecker avatar Mar 14 '24 18:03 hmgaudecker