doctr icon indicating copy to clipboard operation
doctr copied to clipboard

Changes needed to be able to use doctr on AWS Lambda

Open mtvch opened this issue 3 years ago • 3 comments

I would like to use doctr on AWS Lambda, but there is a restriction: it gives a function write access only under /tmp folder. I have found two places where doctr claims write access outside this folder:

  1. While caching downloaded models.
  2. By using ThreadPool from python's multiprocessing package: it uses /dev/shm folder for shared memory which is not present on AWS Lambda.

That's why I think it would be nice to add an option for user to decide where models should be cached and whether to use multiprocessing or not.

mtvch avatar Aug 10 '22 08:08 mtvch

Hi @mtvch :wave:, that's a really interesting use case thanks for opening :+1:

Some points:

  • we should add a short test that it works as expected:

https://github.com/mindee/doctr/blob/main/tests/common/test_utils_multithreading.py create test_utils_data.py in tests/common folder and add a short test

  • we need to document this for other users i would suggest to add a page to the documentation in using_doctr maybe something like 'Run on AWS' with a subpoint AWS Lambda

About the casting to list in multithread_exec : #931

felixdittrich92 avatar Aug 11 '22 09:08 felixdittrich92

Thank you all for responses! Can I open another PR about changing return type of multithread_exec function to generator? As far as I have checked every place where it's called expects a list as a result, so this change looks easy to me

mtvch avatar Aug 11 '22 14:08 mtvch

Thank you all for responses! Can I open another PR about changing return type of multithread_exec function to generator? As far as I have checked every place where it's called expects a list as a result, so this change looks easy to me

Hi @mtvch :wave: , there is currently an open PR to do this #931 I have asked if he want to continue otherwise yes of course feel free to open another PR to solve this :+1:

felixdittrich92 avatar Aug 12 '22 09:08 felixdittrich92

Hello, @felixdittrich92! I have updated this pr and opened another pr #1019 about multithread_exec return type. Could you, please, review them?

mtvch avatar Aug 15 '22 05:08 mtvch

Codecov Report

Merging #1017 (2d236ac) into main (61d0f1c) will increase coverage by 0.00%. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1017   +/-   ##
=======================================
  Coverage   94.93%   94.94%           
=======================================
  Files         135      135           
  Lines        5590     5599    +9     
=======================================
+ Hits         5307     5316    +9     
  Misses        283      283           
Flag Coverage Δ
unittests 94.94% <100.00%> (+<0.01%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/datasets/datasets/base.py 98.00% <100.00%> (ø)
doctr/utils/data.py 90.00% <100.00%> (+1.32%) :arrow_up:
doctr/utils/multithreading.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Aug 15 '22 05:08 codecov[bot]

Hi @mtvch 👋 , I think you missed some suggestions from above would be great if you can add the missing ones :) 👍

felixdittrich92 avatar Aug 30 '22 10:08 felixdittrich92

Hello, @felixdittrich92! Excuse me, but, honestly, I don't know what I missed(

There are two things I didn't do as you suggested:

  1. I didn't use os.access function, because it doesn't work if directory does not exist (returning False even if you can create directory).
  2. I didn't write test for exactly downloading files, instead I used mocks and covered every line of this pr with tests.

Did you mean these two? If not, please, could you tell me what suggestions I missed and I will implement them

mtvch avatar Aug 30 '22 13:08 mtvch

I have found one more thing I couldn't implement: using os.environ.get with default value in a second argument. I decided to stick with is None check, because mypy was failing. It seems that os.environ.get has Optional[str] specification even if I provide default value in a second argument. image

mtvch avatar Aug 31 '22 02:08 mtvch

@mtvch thanks for applying all the requested changes i have added some last minor points then we are good to go hugs

@felixdittrich92 Thank you for your patience and another review!

mtvch avatar Sep 01 '22 09:09 mtvch