mtdata
[WIP] 0.3.8 development
Change log
- CLI arg `--log-level` with default set to `WARNING`
- progressbar can be disabled from CLI with `--no-pbar`; default is enabled (`--pbar`)
python -m mtdata -h
usage: __main__.py [-h] [-vv] [-v] [-ll {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [-ri] [-pb | -no-pb] {list,get,report,list-recipe,get-recipe,stats} ...
positional arguments:
{list,get,report,list-recipe,get-recipe,stats}
optional arguments:
-h, --help show this help message and exit
-vv, --verbose verbose mode (default: False)
-v, --version show program's version number and exit
-ll {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set log level (default: WARNING)
-ri, --reindex Invalidate index of entries and recreate it. This deletes /Users/tg/.mtdata/mtdata.index.0.3.8-dev.pkl only and not the downloaded files. Use this if you're using in
developer mode and modifying mtdata index. (default: False)
-pb, --pbar Show progressbar (default: True)
-no-pb, --no-pbar Do not show progressbar (default: False)
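
The usage line above suggests that `--pbar` and `--no-pbar` form a mutually exclusive pair and that `--log-level` is restricted to a fixed set of choices. A minimal `argparse` sketch of that layout follows. It is not taken from mtdata's source; `build_parser`, the `demo` program name, and the `progressbar` destination are hypothetical names used only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mimics the log-level and progress-bar flags (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="demo",  # hypothetical program name
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "-ll", "--log-level",
        default="WARNING",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Set log level",
    )
    # Both flags write to the same destination; the group rejects passing both at once.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-pb", "--pbar", dest="progressbar",
                       action="store_true", default=True, help="Show progressbar")
    group.add_argument("-no-pb", "--no-pbar", dest="progressbar",
                       action="store_false", help="Do not show progressbar")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--no-pbar", "-ll", "INFO"])
    print(args.log_level, args.progressbar)
```

Sharing one destination between the two flags keeps the rest of the program reading a single boolean, whichever flag the user passed.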
- `mtdata stats --quick` does HTTP HEAD and shows content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
{"id": "Statmt-commoncrawl-wmt19-fra-deu", "total_bytes": 65880180, "total_size": "65.88 MB",
"urls": {"https://data.statmt.org/wmt19/translation-task/fr-de/bitexts/commoncrawl.fr.gz": 32607032, "https://data.statmt.org/wmt19/translation-task/fr-de/bitexts/commoncrawl.de.gz": 33273148}}
- `stats` without `--quick` shows `segs`, `toks`, `chars`, `bytes`, and `total_size`
mtdata -no-pb stats Statmt-commoncrawl-wmt19-fra-deu
{"id": "Statmt-commoncrawl-wmt19-fra-deu", "segs": 622288, "segs_err": 0, "segs_noise": 0, "deu_toks": 12217694, "fra_toks": 13992149, "deu_chars": 85747775, "fra_chars": 87337364, "deu_bytes": 87746644, "fra_bytes": 90752783, "total_bytes": 178499427, "total_size": "178.5 MB"}
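
The two modes above can be approximated as follows. This is a rough sketch, not mtdata's implementation: `head_content_length`, `human_size`, `quick_stats`, and `corpus_stats` are hypothetical helpers, whitespace tokenization is an assumption, and the decimal-megabyte formatting is only inferred from the sample outputs. The `segs_err` and `segs_noise` fields are left out because their definitions are not given here.

```python
from typing import Callable, Dict, Iterable, Tuple
from urllib.request import Request, urlopen


def head_content_length(url: str) -> int:
    """Send an HTTP HEAD request and return Content-Length (fails if the header is absent)."""
    with urlopen(Request(url, method="HEAD"), timeout=30) as resp:
        return int(resp.headers["Content-Length"])


def human_size(n_bytes: int) -> str:
    """Format as decimal megabytes (1 MB = 10**6 bytes); assumed from the sample output."""
    return f"{round(n_bytes / 10**6, 2)} MB"


def quick_stats(did: str, urls: Iterable[str],
                fetch: Callable[[str], int] = head_content_length) -> Dict:
    """Quick mode: sum the reported size of each URL without downloading it."""
    sizes = {url: fetch(url) for url in urls}
    total = sum(sizes.values())
    return {"id": did, "total_bytes": total,
            "total_size": human_size(total), "urls": sizes}


def corpus_stats(pairs: Iterable[Tuple[str, str]], lang1: str, lang2: str) -> Dict:
    """Full mode: count segments plus tokens, characters, and UTF-8 bytes per language."""
    stats = {"segs": 0}
    for lang in (lang1, lang2):
        for unit in ("toks", "chars", "bytes"):
            stats[f"{lang}_{unit}"] = 0
    for seg1, seg2 in pairs:
        stats["segs"] += 1
        for lang, seg in ((lang1, seg1), (lang2, seg2)):
            stats[f"{lang}_toks"] += len(seg.split())  # whitespace tokens (assumption)
            stats[f"{lang}_chars"] += len(seg)
            stats[f"{lang}_bytes"] += len(seg.encode("utf-8"))
    stats["total_bytes"] = stats[f"{lang1}_bytes"] + stats[f"{lang2}_bytes"]
    stats["total_size"] = human_size(stats["total_bytes"])
    return stats
```

Passing `fetch` as a parameter lets the size lookup be replaced with a stub, so the summing logic can be checked without network access.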
- `mtdata.scripts.recipe_stats` to read stats from output directory
python -m mtdata.scripts.recipe_stats -h
usage: recipe_stats.py [-h] [-d] [-t TOK_EXT] [-lc] DIR [DIR ...]
positional arguments:
DIR Recipe or experiment directory
optional arguments:
-h, --help show this help message and exit
-d, --debug Enable debug logs (default: False)
-t TOK_EXT, --tok-ext TOK_EXT
Tokenized files extension. Example .tok (default: None)
-lc, --caseless Case insesitive when counting number of types (i.e. unique words) (default: False)
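
The `--caseless` option above concerns how types (unique words) are counted. A small sketch of the idea, assuming whitespace tokenization; `count_types` is a hypothetical helper and not the script's actual code:

```python
from collections import Counter
from typing import Iterable


def count_types(lines: Iterable[str], caseless: bool = False) -> int:
    """Count unique whitespace-separated words, optionally ignoring case."""
    vocab = Counter()
    for line in lines:
        if caseless:
            line = line.lower()  # fold case so "The" and "the" become one type
        vocab.update(line.split())
    return len(vocab)


if __name__ == "__main__":
    sample = ["The cat", "the Cat sat"]
    print(count_types(sample), count_types(sample, caseless=True))
```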