Changes needed to be able to use doctr on AWS Lambda
I would like to use doctr on AWS Lambda, but there is a restriction: it gives a function write access only under /tmp folder.
I have found two places where doctr claims write access outside this folder:
- While caching downloaded models.
- By using
ThreadPoolfrom python'smultiprocessingpackage: it uses/dev/shmfolder for shared memory which is not present onAWS Lambda.
That's why I think it would be nice to add an option for user to decide where models should be cached and whether to use multiprocessing or not.
Hi @mtvch :wave:, that's a really interesting use case thanks for opening :+1:
Some points:
- we should add a short test that it works as expected:
https://github.com/mindee/doctr/blob/main/tests/common/test_utils_multithreading.py create test_utils_data.py in tests/common folder and add a short test
- we need to document this for other users i would suggest to add a page to the documentation in using_doctr maybe something like 'Run on AWS' with a subpoint AWS Lambda
About the casting to list in multithread_exec : #931
Thank you all for responses!
Can I open another PR about changing return type of multithread_exec function to generator?
As far as I have checked every place where it's called expects a list as a result, so this change looks easy to me
Thank you all for responses! Can I open another PR about changing return type of
multithread_execfunction to generator? As far as I have checked every place where it's called expects a list as a result, so this change looks easy to me
Hi @mtvch :wave: , there is currently an open PR to do this #931 I have asked if he want to continue otherwise yes of course feel free to open another PR to solve this :+1:
Hello, @felixdittrich92!
I have updated this pr and opened another pr #1019 about multithread_exec return type. Could you, please, review them?
Codecov Report
Merging #1017 (2d236ac) into main (61d0f1c) will increase coverage by
0.00%. The diff coverage is100.00%.
@@ Coverage Diff @@
## main #1017 +/- ##
=======================================
Coverage 94.93% 94.94%
=======================================
Files 135 135
Lines 5590 5599 +9
=======================================
+ Hits 5307 5316 +9
Misses 283 283
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 94.94% <100.00%> (+<0.01%) |
:arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Impacted Files | Coverage Δ | |
|---|---|---|
| doctr/datasets/datasets/base.py | 98.00% <100.00%> (ø) |
|
| doctr/utils/data.py | 90.00% <100.00%> (+1.32%) |
:arrow_up: |
| doctr/utils/multithreading.py | 100.00% <100.00%> (ø) |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
Hi @mtvch 👋 , I think you missed some suggestions from above would be great if you can add the missing ones :) 👍
Hello, @felixdittrich92! Excuse me, but, honestly, I don't know what I missed(
There are two things I didn't do as you suggested:
- I didn't use
os.accessfunction, because it doesn't work if directory does not exist (returningFalseeven if you can create directory). - I didn't write test for exactly downloading files, instead I used mocks and covered every line of this pr with tests.
Did you mean these two? If not, please, could you tell me what suggestions I missed and I will implement them
I have found one more thing I couldn't implement: using os.environ.get with default value in a second argument.
I decided to stick with is None check, because mypy was failing.
It seems that os.environ.get has Optional[str] specification even if I provide default value in a second argument.

@mtvch thanks for applying all the requested changes i have added some last minor points then we are good to go hugs
@felixdittrich92 Thank you for your patience and another review!