trainer <--> topic mapping for testing/updating

Open martenson opened this issue 3 months ago • 11 comments

Please add your name to the issues you commit to testing and updating.

As discussed, we want one person to go through each topic, make sure it runs fine, and update or at least flag any issues.

  • [x] Intro to Ansible link @mira-miracoli
  • [x] Quick and dirty manual Galaxy setup @mvdbeek
  • [x] Galaxy Installation with Ansible link (@mira-miracoli )
  • [x] Connecting to a compute cluster link @martenson
  • [x] Mapping Jobs to Destinations (TPV) link @pauldg
  • [x] Performant Uploads with TUS link @kysrpex
  • [x] Celery and Flower link @mira-miracoli
  • [ ] Pulsar slides
  • [x] Singularity and Apptainer here (@martindemko )
  • [x] Data Managers and CVMFS here and here @pauldg
  • [x] User, Groups, and Quotas management here @pauldg
  • [x] Storage management - Object store here @pauldg
  • [ ] Sentry slides
  • [x] Monitoring and gxadmin slides @martenson
  • [ ] Troubleshooting here
  • [x] Maintenance and backup here @mira-miracoli
  • [x] Tool management and Ephemeris here @martenson
  • [x] Interactive Tools here @kysrpex

martenson avatar Oct 23 '25 09:10 martenson

Hi, here are my thoughts after testing the "Performant Uploads with TUS" tutorial.

  • I found no issues and changed nothing. Everything is still done the same way and works, and there are no new releases of the TUSd role.
  • The recording from 2022 is outdated because it sets up a systemd unit via galaxyproject.tusd rather than via Gravity, but since it is otherwise good, creating a new recording is overkill imo.
  • An interesting insight from the recording that is not in the exercise is that TUSd provides metrics under /metrics (e.g. amount of data uploaded).
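
For anyone who wants to show the /metrics output during the session, here is a minimal sketch of reading such a dump. It assumes the endpoint returns the Prometheus plain-text format with one `name value` pair per sample line. The metric name and value in `SAMPLE` are illustrative only and are not copied from a real TUSd instance.

```python
def parse_metrics(text):
    """Parse simple 'name value' sample lines, skipping comments and blanks.

    Assumption: no trailing timestamps on the sample lines.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comment lines and empty lines carry no sample
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Illustrative payload; the metric name is an assumption, not verified output.
SAMPLE = """\
# HELP tusd_bytes_received Number of bytes received for uploads.
# TYPE tusd_bytes_received counter
tusd_bytes_received 1048576
"""

if __name__ == "__main__":
    for name, value in parse_metrics(SAMPLE).items():
        print(f"{name}: {value / 1024 ** 2:.1f} MiB")
```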

If you have more time than strictly necessary to spend on this topic, from the "pedagogical" point of view I think it is worth mentioning a few things (which are not in the exercise):

  • Briefly show the TUS protocol documentation. Highlight the steps: creation, determining the upload offset (if needed), and repeated PATCH requests to submit data (that's precisely what makes it "resumable"). The browser's dev tools are really illustrative: with them it's even possible to start uploading a large file and then break the connection, which makes it obvious how the protocol is designed to survive this situation.
  • Explain hooks; after all, they are the mechanism that lets Galaxy know an upload has finished.
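
The protocol steps listed above (creation, offset lookup, repeated PATCH) can also be sketched without a network. This is only a toy model: `FakeTusServer` is a hypothetical in-memory stand-in, not TUSd, and it covers only the `Upload-Length` and `Upload-Offset` headers. It shows why a dropped connection does not lose the bytes already sent.

```python
class FakeTusServer:
    """Hypothetical in-memory stand-in for a TUS server (not TUSd itself)."""

    def __init__(self):
        self.uploads = {}  # upload id -> {"length": int, "data": bytearray}

    def _upload(self, location):
        return self.uploads[location.rsplit("/", 1)[-1]]

    def post(self, headers):
        # Creation: the client declares the total size and gets a Location back.
        upload_id = str(len(self.uploads) + 1)
        self.uploads[upload_id] = {
            "length": int(headers["Upload-Length"]),
            "data": bytearray(),
        }
        return 201, {"Location": f"/files/{upload_id}"}

    def head(self, location):
        # Offset lookup: how many bytes does the server already hold?
        return 200, {"Upload-Offset": str(len(self._upload(location)["data"]))}

    def patch(self, location, headers, body):
        # Data submission: accepted only if the client's offset matches ours.
        upload = self._upload(location)
        if int(headers["Upload-Offset"]) != len(upload["data"]):
            return 409, {}
        upload["data"].extend(body)
        return 204, {"Upload-Offset": str(len(upload["data"]))}


def upload(server, location, payload, chunk_size, fail_after=None):
    """Send the payload starting from the server's current offset.

    fail_after simulates a network drop after that many chunks were sent.
    """
    _, headers = server.head(location)
    offset = int(headers["Upload-Offset"])
    sent = 0
    while offset < len(payload):
        if fail_after is not None and sent == fail_after:
            raise ConnectionError("simulated network drop")
        chunk = payload[offset:offset + chunk_size]
        status, headers = server.patch(location, {"Upload-Offset": str(offset)}, chunk)
        if status != 204:
            raise RuntimeError(f"unexpected status {status}")
        offset = int(headers["Upload-Offset"])
        sent += 1
    return offset


def demo():
    server = FakeTusServer()
    payload = b"x" * 10
    _, headers = server.post({"Upload-Length": str(len(payload))})
    location = headers["Location"]
    try:
        upload(server, location, payload, chunk_size=4, fail_after=1)
    except ConnectionError:
        pass  # the connection "broke" after the first chunk
    _, head = server.head(location)
    offset_after_drop = int(head["Upload-Offset"])
    final_offset = upload(server, location, payload, chunk_size=4)  # resume
    complete = bytes(server.uploads["1"]["data"]) == payload
    return offset_after_drop, final_offset, complete


if __name__ == "__main__":
    print(demo())
```

The second `upload` call never resends the first chunk, because it starts from the offset the server reports.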

kysrpex avatar Nov 17 '25 15:11 kysrpex

@kysrpex cool, thanks! Please PR your suggestions 🎉.

martenson avatar Nov 19 '25 11:11 martenson

Hi again, I reviewed "Galaxy Interactive Tools", updated some things and found some issues.

Changes https://github.com/galaxyproject/training-material/pull/6495:

  • Set up the GxIT Proxy using Gravity.
  • Define the Galaxy job conf via group vars (match the style of all other tutorials).
  • Rename GIE Proxy to GxIT Proxy in Galaxy Interactive Tools Training.
  • Mention GIEs only in the history slide of the Interactive Tools slides.

Issues:

  • The wildcard DNS records were missing (they now exist, see https://github.com/usegalaxy-eu/infrastructure/pull/256).
  • Issuing wildcard certificates for the GAT VMs: as EU admin I was able to run https://training.galaxyproject.org/training-material/topics/admin/tutorials/interactive-tools/tutorial.html#getting-a-wildcard-ssl-certificate (Option 2: route53), but this is not feasible in the training as written because it requires AWS credentials. I don't know what the best approach would be; does anyone remember how it was done in 2023?
  • When embedded Pulsar is enabled, Galaxy never gets the URL to the interactive tool and also cannot stop it. I am trying to find out why the communication channel is broken (that's why the GTN PR https://github.com/galaxyproject/training-material/pull/6495 is a draft).

kysrpex avatar Nov 20 '25 15:11 kysrpex

@kysrpex note that the wildcard DNS records will need to cover more than 10 machines; we'll likely have ~30 after full setup (ping @mira-miracoli). That said, I have no idea how GxIT was set up in past trainings, I wasn't there. Maybe via ns-training.galaxyproject.org, @natefoo?

martenson avatar Nov 20 '25 17:11 martenson

Hi, I tested the apptainer part and didn't find any problems there. However, I noticed that the order of the training steps using Ansible does not fully correspond to the prerequisites of each step. Examples:

  • Celery (step 6) assumes RabbitMQ is already installed, which is actually a sub-part of Pulsar (step 7).
  • Apptainer (step 8) assumes the existing configuration of CVMFS (step 9) (only for testing, though).
  • Both Slurm (step 3) and TPV (step 4) refer to singularity/apptainer (step 8) in their configuration (not viable, but might be confusing).

This may not be the full list. I didn't get beyond step 9 yet.
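
The ordering problems described in this comment can be checked mechanically. A rough sketch follows: the step numbers and the "assumes" pairs are taken from the comment itself, while the helper name `out_of_order` is made up for illustration.

```python
# Step numbers as quoted in the comment.
steps = {"Slurm": 3, "TPV": 4, "Celery": 6, "Pulsar": 7, "Apptainer": 8, "CVMFS": 9}

# (tutorial, tutorial it assumes) pairs, as described in the comment.
assumes = [
    ("Celery", "Pulsar"),     # RabbitMQ is installed as part of Pulsar
    ("Apptainer", "CVMFS"),   # only needed for testing
    ("Slurm", "Apptainer"),   # referenced in configuration
    ("TPV", "Apptainer"),     # referenced in configuration
]


def out_of_order(steps, assumes):
    """Return the pairs whose assumed tutorial is scheduled after the dependent one."""
    return [(tut, dep) for tut, dep in assumes if steps[dep] > steps[tut]]


if __name__ == "__main__":
    for tut, dep in out_of_order(steps, assumes):
        print(f"{tut} (step {steps[tut]}) assumes {dep} (step {steps[dep]})")
```

Swapping the step numbers of Celery and Pulsar in `steps` removes the first pair from the result and leaves the other three.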

martindemko avatar Nov 21 '25 00:11 martindemko

Two small issues I found:

  • For the connecting to a cluster training: https://github.com/galaxyproject/training-material/blob/e822f5fe70007fb085e8f1382833c79ea02d14a1/topics/admin/tutorials/connect-to-compute-cluster/tutorial.md?plain=1#L165 cons_res is legacy and is replaced by cons_tres.
  • For the TPV training: https://github.com/galaxyproject/training-material/blob/e822f5fe70007fb085e8f1382833c79ea02d14a1/topics/admin/tutorials/job-destinations/tutorial.md?plain=1#L732 params['walltime'] should be entity.params.get('walltime').
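
To illustrate the TPV fix in plain Python: `Entity` below is a hypothetical stand-in for the object that a rule expression sees as `entity`, not the real TPV class. The sketch shows only the general difference between indexing a dict and calling `.get()`. Whether that difference is the motivation for the tutorial change is an assumption.

```python
class Entity:
    """Hypothetical stand-in for the `entity` object available in rule expressions."""

    def __init__(self, params):
        self.params = params


def walltime_or_default(entity, default=None):
    # .get() returns the default instead of raising when the key is absent.
    return entity.params.get("walltime", default)


with_walltime = Entity(params={"walltime": 8})
without_walltime = Entity(params={})

if __name__ == "__main__":
    print(walltime_or_default(with_walltime))
    print(walltime_or_default(without_walltime))
    try:
        without_walltime.params["walltime"]  # plain indexing raises KeyError
    except KeyError as exc:
        print("KeyError:", exc)
```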

Shall I try doing the knitting and such, or is this so small that we fix it on the fly during the training, @martenson?

pauldg avatar Nov 21 '25 09:11 pauldg

@pauldg small fixes on the fly tend to turn into huge pains that take hours when teaching 20 people, so please fix everything you find.

@martindemko that sounds breaking; we need to reorder then.

martenson avatar Nov 21 '25 10:11 martenson

The real problem is Celery and Pulsar. At a minimum, we should swap those two. The rest should not break anything.

martindemko avatar Nov 21 '25 12:11 martindemko

Here's the PR https://github.com/galaxyproject/training-material/pull/6502

pauldg avatar Nov 21 '25 15:11 pauldg

I made progress on the wildcard certificates topic: it requires PRs https://github.com/usegalaxy-eu/infrastructure/pull/260 and https://github.com/usegalaxy-eu/infrastructure/pull/261. The idea is to obtain them via AWS Route53 (option 2 from the tutorial).

I'll have a look now at the communication problem between Galaxy and Pulsar.

EDIT: got it, the container monitor was failing because the Let's Encrypt Staging Root CAs need to be trusted. They weren't on my machine, but @mira-miracoli confirmed the certificate is present on the trainees' VMs (ls -lah /usr/local/share/ca-certificates/fakeleroot-x1.crt), so you should not hit this bug.

kysrpex avatar Nov 25 '25 08:11 kysrpex

Everything is ready now for interactive tools. If you want to teach interactive tools, make sure https://github.com/galaxyproject/training-material/pull/6495, https://github.com/usegalaxy-eu/infrastructure/pull/260 and https://github.com/usegalaxy-eu/infrastructure/pull/261 are merged, and during the hands-on, obtain wildcard SSL certs using AWS Route53 (option 2 from the tutorial).

kysrpex avatar Nov 25 '25 12:11 kysrpex