Come up with a recommended way to render/store tutorials
I think that wherever it is computationally feasible, tutorials should be run as a CI check, and the documentation should use the version rendered by CI, to avoid failing or outdated tutorials.
With the current template it is already possible to have tutorials built by nbsphinx on readthedocs.org, but it would be nice to have something that scales better. It also needs to be documented somewhere how to enable this behavior.
- [ ] Recommendations where to store tutorials (submodule for executed .ipynb files when not using CI builds?)
- [ ] Documentation how to enable CI builds
- [ ] Ideally, each notebook is built in a separate worker (-> parallelization) (-> individual checkmark for each notebook instead of digging into sphinx logs)
- [ ] GPU support
See also #19.
I'd want something like

```mermaid
flowchart TD
    A[push to PR] --> B[run tests]
    A --> C[build tutorials on GH actions]
    C --> D[trigger RTD build]
```
I believe @michalk8 has something like this running for squidpy.
@adamgayoso and I have also talked about this a few times.
I think one point here is that some tutorials should opt out of being built. We often have tutorials involving large computations or data and realistically don't want to spend the CI resources or time on every commit running them.
@michalk8, can you elaborate how the squidpy setup works? And maybe point me to a PR where the tutorials were actually built, I only find PRs where that particular action was skipped.
Inspired by the pre-commit CI, the following approach could work:
- trigger CI in PR by push
- rerun notebooks. If the output changes compared to the previous version, add a commit with the rendered ipynb files (either in main repo or submodule)
- the automatic commit triggers RTD again. Other CI checks can be skipped.
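The "add a commit only when the output changes" step above could be sketched like this; a minimal, stdlib-only comparison of two notebook versions (an `.ipynb` file is just JSON). The function name is illustrative, not from an existing tool:

```python
# Sketch: decide whether a re-executed notebook's outputs differ from the
# committed version, so CI knows whether to push an update commit.
import json


def outputs_changed(old_nb_json: str, new_nb_json: str) -> bool:
    """Compare only the cell outputs of two notebook versions."""

    def extract_outputs(nb_json: str) -> list:
        nb = json.loads(nb_json)
        return [cell.get("outputs", []) for cell in nb.get("cells", [])]

    return extract_outputs(old_nb_json) != extract_outputs(new_nb_json)
```

Comparing only `outputs` (not `metadata` or `execution_count`) avoids spurious commits when re-execution changes nothing visible.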
I think I'll give this approach a try in one of my repos.
MyST-NB might be a good alternative to nbsphinx.
One benefit is that it supports caching through jupyter-cache. It could be nice to restore that cache when running notebooks through GitHub Actions.
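For reference, enabling the jupyter-cache-backed execution might look like this in a `docs/conf.py` fragment (assuming current MyST-NB option names; check the MyST-NB docs for the exact spelling in the version you pin):

```python
# docs/conf.py sketch: MyST-NB with jupyter-cache execution.
extensions = ["myst_nb"]

# "cache": only re-execute notebooks whose source changed since the last build.
nb_execution_mode = "cache"

# Directory a GitHub Actions job could save/restore between runs.
nb_execution_cache_path = "docs/_build/.jupyter_cache"
```

Restoring `nb_execution_cache_path` via `actions/cache` would then skip re-running unchanged notebooks on CI.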
My idea was along the lines of:
- store the notebooks in the main repo in text format (e.g. python percent script or some other jupytext format)
- build notebooks by CI
- upload artifacts (.ipynb) to e.g. an S3 bucket with a unique hash
- update link to notebook in documentation
- sphinx plugin that pulls ipynb from S3 and includes it in documentation
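The "unique hash" step above could work as a content-addressed key, sketched here with the stdlib only (bucket layout and function name are made up for illustration; the actual upload, e.g. via boto3, is omitted):

```python
# Sketch: derive a content-addressed storage key for an executed notebook,
# so each documentation version can link to the exact artifact it was built with.
import hashlib
from pathlib import Path


def artifact_key(ipynb_path: Path, prefix: str = "tutorials") -> str:
    """Return e.g. 'tutorials/<sha256-prefix>/<name>.ipynb' for upload."""
    digest = hashlib.sha256(ipynb_path.read_bytes()).hexdigest()[:16]
    return f"{prefix}/{digest}/{ipynb_path.name}"
```

Because the key is derived from the file contents, re-uploading an unchanged notebook is a no-op, and old documentation builds keep pointing at the artifact they were rendered against.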
Isaac wants submodules back in scanpy, see https://github.com/scverse/scanpy/issues/2636
I think we should plan the best approach here first (or approaches: Might be that different projects have different needs).
> store the notebooks in the main repo in text format

No, unless you have some excellent arguments for it that you haven’t shared yet
Bespoke markdown dialects are a bad bad thing. The same applies to not storing notebooks in an .ipynb format. All the advantage of having tooling goes out of the window in both cases.
- “Markdown” with role and directive syntax isn’t supported by standard viewers and authoring tools. Formatting destroys it. Why Markdown at all when you can’t use any tool that makes using Markdown pleasant? Sure, if you’re in closed environment like Obsidian or some forum software, that’s fine, but for storing in .md files it’s a bad idea.*
- Notebook authoring, conversion, cleaning, and executing are all supported and nice using the constantly developed frontends and libraries that are everywhere. Storing them in some ad-hoc text format that isn’t supported by more than the most rudimentary subset of tools has no advantage. GitHub just gained notebook diff support. There’s no remaining argument that I know of that suggests storing notebooks as any other format might be a good idea.
I’ll need some excellent arguments to be convinced that storing notebooks in any other format than .ipynb might be a good idea in some niche case. I’m very happy that I don’t do R anymore, since Rmd instead of .ipynb cost the whole ecosystem so much potential synergy.
*Y’all like MyST, I’m not going to convince you to go back on it, some “I told you so”s notwithstanding. Maybe in a few years. Maybe by then some actual extensible widely supported Markdown dialect will have emerged and MyST will migrate to it. Then it’d actually be a quarter-step up from rST instead of a downgrade.
So your plan, but taking my veto into account:
- notebooks as .ipynb, but use CI to make sure they’re cleaned of output and execution numbers before merging
- …
- upload artifacts (.ipynb with outputs) to some storage
- …
- …
I like it. It supports linking to specific versions of notebooks from specific versions of the documentation while making sure that CI works.
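The "cleaned of output and execution numbers before merging" check in step 1 could be sketched as follows, with the stdlib only (this is essentially what nbstripout does; the function name is illustrative):

```python
# Sketch: strip outputs and execution counts from an .ipynb so that only
# inputs and markdown cells are committed; CI fails if this changes the file.
import json


def strip_notebook(nb_json: str) -> str:
    """Return the notebook JSON with code-cell outputs/execution counts cleared."""
    nb = json.loads(nb_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)
```

A CI job (or pre-commit hook) would run this over every changed notebook and fail if the stripped version differs from what was committed.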
> I think we should plan the best approach here first (or approaches: Might be that different projects have different needs).

Agreed. We might need a temporary solution in scanpy, which could be reverting to the submodule until we have something everyone is happy with.
> No, unless you have some excellent arguments for it that you haven’t shared yet

The main argument for me is that
- text formats are easy to review on github
- it is easy to fix merge conflicts if there are any.
I think there are two formats that are particularly interesting and have excellent editor support. I also don't think jupytext with >6k stars is a niche tool.
- plain python script (python percent format): Out-of-the-box support in Spyder, vscode, pycharm, vim+slime. It's just a python script, i.e. all python tooling works on it (no additional tooling required to apply ruff, black etc.)
- Quarto markdown: The evolution of Rmd, but language-agnostic. Rstudio/posit fully supports python with it. Native support in rstudio and vscode (via official extension).
But it's not the hill I am going to die on; I can live with an .ipynb with stripped-out outputs. The main selling point is that tutorials without output (in whatever format) can be kept in the same repo as the main documentation.
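As a concrete illustration, here is what a tiny tutorial looks like in the python percent format (illustrative content; jupytext round-trips such a file to/from .ipynb):

```python
# A notebook in "python percent" format: a plain .py file where
# `# %%` comments delimit cells.

# %% [markdown]
# # A tiny tutorial
# Markdown cells are comments under a `# %% [markdown]` marker.

# %%
import statistics

values = [1, 2, 3, 4]
print(statistics.mean(values))  # -> 2.5
```

Because it is an ordinary python script, ruff/black can be applied to it directly, and diffs on GitHub are readable without any extension.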
> Might need a temporary solution in scanpy that could be reverting to the submodule until we have something everyone is happy with.

As said there, if someone wants that, they can do it, I don’t think it has a significant enough advantage to spend effort on it personally 😄
I’m not convinced about text formats.
I think my main problem with this can be boiled down to
- There’s a standard. Lobby for more things supporting it instead of fracturing the landscape. Rmd was a mistake, jupytext is too.
- Serialization formats (JSON, YAML, …) are always superior in representing structured data (like notebook cells) than magic text delimiters. One mistake and you have a garbled mess whereas in JSON, you have a syntax error.
The authors of .ipynb had valid arguments why they chose JSON instead of text, which are still relevant.
> text formats are easy to review on github

Good argument. There are multiple extensions to do it (e.g.), but of course they aren’t seamlessly integrated into GitHub. The question is how much it matters that things aren’t perfectly seamless here.
> it is easy to fix merge conflicts if there are any.

I never had problems with this, nbdime is great at this. Granted, you need to always do it locally instead of being able to click the GitHub UI button most of the time.
- No markdown preview while authoring, and a confusing layout while authoring. That’s not great, since you can accidentally break the magic comment syntax and the corruption can be overlooked while reviewing the notebook. Standard problems when you pretend that text is a data structure. (That’s why I don’t use bash/zsh anymore)
- No standalone 3rd-party tooling without converting them to .ipynb first. I assume that’s not great if you want to use some standard tooling to modify/lint/whatever the file. Maybe I’m wrong and things can be seamlessly converted back and forth without loss of information. The JupyterLab plugin makes this a bit less painful, but it’s kinda dumb that .py files get the Jupyter icon and get opened as notebooks.
> Quarto markdown

Still one language-specific editor’s proprietary format that hasn’t garnered tool support outside of the R world. Nope.
I predicted then that Rmd would become big enough in the R world that it would prevent the scientific landscape from growing together and creating interoperability with tools that work on the same format. It made me sad. I was right.
Happy to go with stripped ipynb if you prefer. Technically it would be easy to support both at some point, as it's probably just a jupytext call in step 1.
@flying-sheep, maybe an even easier solution than writing a sphinx plugin is to use DVC. It's basically "git for data", similar to git lfs, but with a bunch of supported backends (e.g. S3 bucket). I tried it at work for some data analysis projects and it's pretty neat.
Basically
- you can `dvc add` any file in the repository. It will then be gitignored, and instead a `*.dvc` file is checked into git that contains the file hash with which it can be retrieved from the storage backend
- there's a `dvc repro` command that executes commands to "reproduce the analysis" that are stored in a yaml file. This could be used to render notebooks on the CI.
And on readthedocs, one would just add `dvc pull` as a pre-build command, and it would retrieve the rendered .ipynb files from wherever they are stored.
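A `dvc.yaml` driving the `dvc repro` step might look roughly like this (stage and file names are made up for illustration; see the DVC docs for the exact schema):

```yaml
# dvc.yaml sketch: `dvc repro` re-runs `cmd` only when the deps changed;
# `dvc push` / `dvc pull` move the rendered output to/from the remote storage.
stages:
  render-tutorial:
    cmd: >-
      jupyter nbconvert --to notebook --execute
      docs/tutorial.ipynb --output-dir rendered/
    deps:
      - docs/tutorial.ipynb
    outs:
      - rendered/tutorial.ipynb
```

CI would run `dvc repro && dvc push`, and the readthedocs pre-build `dvc pull` would fetch `rendered/tutorial.ipynb` by its recorded hash.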
What would that look like from a workflow perspective? I think the ideal workflow (without going a completely different route than GitHub) would be:
- check out repo
- open a notebook, all outputs are there and you see how things are supposed to look
- change/add notebook, commit changes (only inputs and markdown cells)
- CI builds only notebooks that changed in the PR
- After a merge, the state of the repo contains output generated by CI
Of course, adding 1-2 simple steps for that tool wouldn't be the end of the world, but do we need to? Updating the generated PNGs from time to time isn't going to explode the repo size, so that use case of DVC isn't really what we need.
Or in other words, what’s missing from what we do in https://github.com/scverse/scverse-tutorials that this enables?
If committing the notebooks to the main scanpy repo is an option, I'm all for it, but I thought Isaac didn't want that. Having this separate scanpy-tutorials repo that is integrated into the main repo as a submodule is a weird construct:
- you have this separate website with duplicated content: https://scanpy-tutorials.readthedocs.io/en/latest/index.html
- you need to make two PRs to update a tutorial, one to scanpy-tutorials, then a second one to scanpy that bumps the submodule
- you can't as easily run the tutorials as part of the scanpy CI (e.g. check that the tutorials still work before merging a PR). You could set up synced PRs like in cookiecutter-scverse-instance, but that's annoying.
With DVC (or a sphinx-plugin as described above), you could have the tutorials versioned in the scanpy repo directly, without inflating its size.
I’m not a fan of the submodule approach either.
I think they should just be moved back into the scanpy repo. With `git clone --filter=blob:none ...`, nobody needs to download big amounts of data, and it’s less work to do that once than wrangle the submodule.
@ivirshup you said it was decided to go that route and seemed unwilling to go back on that decision. Can you link to where that’s discussed? I’d like to see what submodules bring to the table.