Docker-compose improvements

Open sandertan opened this issue 3 years ago • 0 comments

⚠️ This currently breaks docker-compose files other than docker-compose-dev.yml

Hi @tomolopolis, thanks a lot for all the recent updates to MedCAT Trainer. Especially the addition of Solr for the concept lookup looks nice!

We're starting a new annotation project for HPO concepts in Dutch text, and I had to make some changes to make the deployment fit our use case. It would be nice to have our changes, which add configuration options, in the master branch; this'll make it easier for us to update MedCAT trainer in the future.

Included in this PR

There are a number of docker-compose-files in this repository, some of which don't work anymore with current master I think. The only compose file that builds a new Dockerfile is docker-compose-dev.yml, so I made my changes only there. What is your plan with the other compose files? Having multiple compose files is difficult to maintain. I can add these changes to other compose files if you want.
I added an .env-example-file and documentation on how this should be used with Docker Compose. This makes it relatively easy to pass on environment variables to docker-compose, similar to what I added for CogStack-Nifi. For maintainers it would be nice to have the least amount of places where configuration can be changed, but I think this file is needed to configure variables at compose-level.
I made the spaCy model configurable, similar to what I did for MedCAT-Service. This is especially useful for non-English languages. I've set the default to the current en_core_web_md. Alternatively, we can set the default to all 3 (sm, md & lg) English language models.
By first copying requirements.txt and then copying the other backend files, Docker can use the cache during building. When nothing is changed to the requirements file, the cached step is used, which greatly reduces the time it tames to build the image.
I made the exposed Solr port configurable, similar to MCTRAINER_PORT.
I also cleaned up some unnecessary components:
- Remove expose: , as this does not do anything anymore (see last edit at https://stackoverflow.com/a/40801773/4141535)
- Removed the container_name for Solr. Within the docker compose network, the containers can find each other using the service name.

Not included in this PR

I did not update the other compose files to work with these changes. I didn't want to start updating those before discussing these changes with you.

Nice to have in future

It would be nice to have an option to disable loading the example data (https://github.com/CogStack/MedCATtrainer/blob/296d4d79f2dbc7f6db0f8a0eb90ffa6e0f521c57/webapp/run.sh#L19) . When using a different concept universe (for example SNOMED or HPO, or a different UMLS language), I think it's preferred to not have the example data in the instance.
It would also be nice to clean up the how the environment variables for layers lower than the docker-compose are used. Currently they are passed on both using envs/env-file and environment: in the compose. There's also some discrepancy in how different compose files do this. We could move the environment variables from envs/env to the proposed .env-example-file and load them in the container with something like:

environment:
  - MEDCAT_CONFIG_FILE=${MEDCAT_CONFIG_FILE}

It would also be nice to make it more clear how the config for MedCAT is used (configs/base.txt). With the recent versions of MedCAT containing the config within the CDB file, do you agree it would be easier to use this as the default? This is important for using non-English models, because it can set the spaCy model.
The location of the Solr instance is currently configurable at docker-compose level. Is this really necessary? If it's always running along with the other 2 components of this Docker compose, it's always running at solr:8983 within the Docker network.
I think the example project is currently not working. I commented it out in my local fork to get the Dutch instance up, but when I tested this repo with current master branch, the example load process reported some errors.
Perhaps building the Docker image can be faster by using Docker's multi-stage build functionality (https://docs.docker.com/build/building/multi-stage/) with a node.js image.

Let me know what you like/dislike. I can make separate GitHub issues of the nice to have-items if you prefer.

Oct 28 '22 13:10 sandertan