Replace use of tar file with zip file
The test suite uses the tar.gz files AlkEthOH_tripos.tar.gz and FreeSolv.tar.gz .
Test configuration requires reading those files to figure out what names are present. Test runs require unpacking the contents to a temporary file. Together this add over a second to each test run, simply during import.
This can be reduced by replacing those tar.gz files with a zip file. The zip file includes an index, so filename discovery is fast, and the entries can be extracted directly, without going through a temporary unpack.
The test discovery process replaces:
with tarfile.open(alkethoh_tar_file_path, "r:gz") as tar:
# Collect all the files discarding the duplicates in the test_filt1 folder.
slow_test_cases = {
extract_id(m.name)
for m in tar.getmembers()
if "crd" in m.name and "test_filt1" not in m.name
}
with something like:
with zipfile.ZipFile(alkethoh_zip_file_path) as z:
# Collect all the files discarding the duplicates in the test_filt1 folder.
slow_test_cases = {
extract_id(m)
for m in z.namelist()
if "crd" in m and "test_filt1" not in m
}
There is some space cost to having a zip file; the files are larger:
-rw-r--r-- 1 dalke admin 5042284 May 26 16:00 AlkEthOH_tripos.tar.gz
-rw-r--r-- 1 dalke admin 8474005 May 27 14:58 AlkEthOH_tripos.zip
-rw-r--r-- 1 dalke admin 1484592 May 26 16:01 FreeSolv.tar.gz
-rw-r--r-- 1 dalke admin 2406597 May 27 15:21 FreeSolv.zip
The tests.utils functions like get_alkethoh_file_path() will also need to be updated to use the zip file.
If accepted, this should reduce test time, and simplify test configuration.
I like these runtime improvements, but the our git file history has become embarrassingly large, and making this change would make it even larger. This could be a good improvement to tackle alongside #43.