pythainlp icon indicating copy to clipboard operation
pythainlp copied to clipboard

Improve PyThaiNLP performance

Open wannaphong opened this issue 3 years ago • 10 comments

PyThaiNLP wants you help us to improve the performance. You can fork this git, coding new code to improve the performance, and send your pull request to PyThaiNLP.

These are some lists for you.

  • [x] Delete unused code
  • [ ] Port model to onnx #639
  • [x] Reduce the use of external packages that are not standard Python packages.
  • [ ] Use Cython for any code but still keep pure python for does not installed cython or cannot install cython

wannaphong avatar Jul 03 '22 06:07 wannaphong

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

BLKSerene avatar Aug 21 '22 18:08 BLKSerene

I propose to remove tinydb as an external dependency (as mentioned in #506). It seems to me that using a 3rd-party database library is an overkill for just querying a JSON file which is by no means large in size.

tinydb is only used twice, once in init.py for database initialization, which could be simply replaced by creating an empty JSON file, and once in core.py for querying the local database, which could be done using json in the Python standard library instead.

And it is confusing that while tinydb is used to query the local JSON database, it is not used when parsing JSON data returned from https://pythainlp.github.io/pythainlp-corpus/db.json (see core.py).

The performance improvement gained by using tinydb or other 3rd-party database libraries is, I suppose, to be negligible, but to have one more unnecessary external dependency is not a good idea, I think, for future development and maintenance of the project. Instead, JSON querying could be easily done using json in the standard library.

What is your opinion? If you agree, I would be happy to work on this and send an PR.

I am agree.

wannaphong avatar Aug 22 '22 09:08 wannaphong

When working on #691, I found a problem relating to the version checking behavior of the corpus downloader.

The downloader checks both the name and the version here, so later if the corpus is found in the local database and a re-download is not forced, the versions should always match.

I'm not sure what the expected behavior is here. If a corpus with the same name but different version is found in local database (and force = True is not specified), should the downloader suggest the user to use force = True or just silently re-download and rewrite the latest version of the corpus?

BLKSerene avatar Aug 23 '22 04:08 BLKSerene

Yes, It is not force download that you can see https://github.com/PyThaiNLP/pythainlp/runs/7957249794?check_suite_focus=true#step:5:5219 and https://github.com/PyThaiNLP/pythainlp/blob/dev/tests/test_corpus.py#L86.

wannaphong avatar Aug 23 '22 04:08 wannaphong

@wannaphong I mean that since both the name and the version of the corpus is checked, the else block would never be reached, so the user will never be notified that a newer version of the requested corpus is available.

That is, in cases that the corpus is found in the local database, but the version do not match, the latest version of the corpus would be silently downloaded and rewritten. If this is indeed the expected behavior, the else block would be redundant. If the user should be notified in these cases, the checking logic should be modified.

BLKSerene avatar Aug 23 '22 05:08 BLKSerene

OK. I'm agree.

wannaphong avatar Aug 23 '22 06:08 wannaphong

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

BLKSerene avatar Aug 23 '22 06:08 BLKSerene

@wannaphong So what is the expected behavior? If the user should be notified to use force = True when newer versions of the corpus are available, I could work on a patch to fix it.

I'm agree. The user should get notified when newer versions are available.

wannaphong avatar Aug 23 '22 13:08 wannaphong

@wannaphong Should be fixed in #692, please review the PR.

BLKSerene avatar Aug 25 '22 08:08 BLKSerene

I'm doing reduce import time. #719

wannaphong avatar Oct 08 '22 15:10 wannaphong