Thread safety
Calling Creator.add_item from multiple threads results in the libzim worker threads crashing silently. After the main thread terminates, an exception is raised indicating that the total number of bytes returned by the provider does not equal the size returned by get_size(). It is unclear whether this error is the cause of the crash or a consequence of it.
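For reference, a stripped-down sketch of the pattern that triggers this for me (the item class, paths and content are placeholders, and metadata handling is omitted):

```python
import threading
from libzim.writer import Creator, Hint, Item, StringProvider

class SimpleItem(Item):
    """Minimal in-memory item (placeholder content only)."""
    def __init__(self, path, content):
        super().__init__()
        self.path = path
        self.content = content

    def get_path(self):
        return self.path

    def get_title(self):
        return self.path

    def get_mimetype(self):
        return "text/plain"

    def get_contentprovider(self):
        return StringProvider(self.content)

    def get_hints(self):
        return {Hint.COMPRESS: True}

def add_many(creator, start):
    # several threads calling add_item concurrently is what triggers the crash
    for i in range(start, start + 1000):
        creator.add_item(SimpleItem(f"item_{i}", f"content {i}"))

with Creator("out.zim") as creator:
    threads = [threading.Thread(target=add_many, args=(creator, n * 1000)) for n in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```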
It would be nice to have the intended thread safety behavior documented in the README. Obviously, multithreading support or thread safety via locks would probably be great too, but would likely offer little benefit.
Are we speaking of python-libzim or libzim?
Thrown exceptions are documented in https://libzim.readthedocs.io/en/latest/api/classzim_1_1writer_1_1Creator.html#exhale-class-classzim-1-1writer-1-1creator
Thread safety behavior is described in the libzim introduction: https://libzim.readthedocs.io/en/latest/usage.html#introduction
> [...] The reading part of the libzim is most of the time thread safe. Searching and creating part are not. You have to serialize access to the class yourself.
I meant it would be nice to have the thread safety behavior documented in python-libzim's README. The python bindings could potentially have their own locking mechanism and behave somewhat differently from libzim, which is why I believe an explicit mention in python-libzim would be good.
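To illustrate: serializing access on the caller side currently looks something like this for me (the names are just illustrative); the bindings could in principle do the same thing internally:

```python
import threading

# one lock guarding the single (non thread-safe) Creator instance
creator_lock = threading.Lock()

def add_item_locked(creator, item):
    # serialize all access to the creator, as the libzim docs recommend
    with creator_lock:
        creator.add_item(item)
```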
How do you do multithreading in python?
Multithreading has always been poorly supported in python (or at least discouraged). The CPython implementation has the GIL (Global Interpreter Lock), which prevents two python threads from running in parallel.
> Multithreading has always been poorly supported in python (or at least discouraged). The CPython implementation has the GIL (Global Interpreter Lock), which prevents two python threads from running in parallel.
That's a common misconception ;) While CPython's GIL prevents the execution of python code in parallel, there are many situations in which the GIL is released and multiple threads can run at the same time. For example, I/O operations and many C libraries release the GIL, so multiple threads can work in parallel - it's only the parts of a thread that execute python-level code that cannot be parallelised (at least in CPython). I think subprocesses may also release the GIL, but I am not sure.
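A toy illustration (not from my project; the file names are placeholders): plain file reads release the GIL while blocked on the disk, so a thread pool can overlap them:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_file(path):
    # read_bytes() releases the GIL while waiting on disk I/O,
    # so several of these can be in flight at the same time
    return path, Path(path).read_bytes()

paths = ["a.bin", "b.bin", "c.bin"]  # placeholder file names
with ThreadPoolExecutor(max_workers=4) as pool:
    for name, data in pool.map(read_file, paths):
        print(name, len(data))
```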
In my case, I use separate threads to read files and keep a queue of data to add to the ZIM filled, ensuring optimal I/O usage by reading files even while the creator is busy adding an item. Unfortunately, it seems like I've reached a point at which the python wrapper itself may be the performance bottleneck. At least, that's what I'm assuming, seeing that only about half of the CPU cores are busy with compression, several thousand strings (for StringProviders) are sitting in RAM waiting to be added, disk write capacity is still available, and Creator.add_item is the blocking part... I am currently evaluating whether it may be possible to achieve better performance by rewriting the StringProvider to handle the data feeding at the C level, but that's probably a bit too hard for a first Cython project.
Increase the number of workers for libzim itself (defaults to 4) if you think that's the bottleneck, but you'll most likely hit your disk I/O limits first.
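If I remember the bindings correctly, that's the config_nb_workers setting on the Creator, roughly as below (please double-check the method name against the current Creator API):

```python
from libzim.writer import Creator

# bump the libzim worker thread count from the default of 4
with Creator("out.zim").config_nb_workers(8) as creator:
    ...  # add items as usual
```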
@IMayBeABitShy any feedback on your experiments? Is https://github.com/openzim/python-libzim?tab=readme-ov-file#thread-safety enough?
Please reopen if you think it's not.
> any feedback on your experiments?
It's been a while, but I managed to improve the performance of the program by utilizing threads. In case anyone is interested:
I've changed the code in my project to use one thread for the creator and several I/O threads. The I/O threads read media files and put the binary data (as well as the required metadata) into a queue.Queue. This queue is size limited (IIRC I used a max size of 512), thus creating a buffer of preloaded files. The creator thread takes elements from this queue, creates the libzim items and adds them to the creator.
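Stripped down, the structure looked roughly like this (item class, mimetype handling and paths are simplified placeholders, and I'm glossing over whether your StringProvider version accepts bytes as well as str):

```python
import queue
import threading
from pathlib import Path
from libzim.writer import Creator, Hint, Item, StringProvider

class MediaItem(Item):
    """Item whose content was preloaded by an I/O thread."""
    def __init__(self, path, mimetype, data):
        super().__init__()
        self.path = path
        self.mimetype = mimetype
        self.data = data

    def get_path(self):
        return self.path

    def get_title(self):
        return self.path

    def get_mimetype(self):
        return self.mimetype

    def get_contentprovider(self):
        return StringProvider(self.data)

    def get_hints(self):
        return {Hint.COMPRESS: True}

SENTINEL = object()
preloaded = queue.Queue(maxsize=512)  # buffer of files already read from disk

def io_worker(file_paths):
    # producer: read files and queue their content; put() blocks while the buffer is full
    for fp in file_paths:
        preloaded.put((fp.name, fp.read_bytes()))
    preloaded.put(SENTINEL)

def creator_worker(creator, n_producers):
    # consumer: the only thread that ever touches the Creator
    finished = 0
    while finished < n_producers:
        entry = preloaded.get()
        if entry is SENTINEL:
            finished += 1
            continue
        zim_path, data = entry
        creator.add_item(MediaItem(zim_path, "application/octet-stream", data))

all_files = list(Path("media").iterdir())     # placeholder source directory
chunks = [all_files[i::4] for i in range(4)]  # split the work over 4 I/O threads

with Creator("media.zim") as creator:
    producers = [threading.Thread(target=io_worker, args=(chunk,)) for chunk in chunks]
    for p in producers:
        p.start()
    creator_worker(creator, len(producers))
    for p in producers:
        p.join()
```

The sentinel counting is just the simplest shutdown I could show here; the real code also carried the metadata (title, mimetype) through the queue alongside the data.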
During these experiments, I observed that the number of preloaded elements would quickly (in less than a second) drop from 512 to as low as 120, then take 1-2 seconds to fill back up. This means that the creator could consume the items faster than they were being read from disk, albeit with periods of slower consumption between those bursts. I believe this may have been caused by one or more workers being busy compressing larger files. Either way, using multiple threads meant that the creator could consume files as fast as possible.
Still, the effectiveness will likely depend on the setup. In this case, I used two 4 TB HDDs; SSDs or NVMe drives may see smaller or larger gains due to their better random I/O performance.
I've also used a different queue-based approach in my current project (rendering a ZIM from several fanfiction dumps someone shared on reddit) to implement a multicore HTML rendering system using multiprocessing.
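In broad strokes it looks like this (render_page is just a stand-in for the actual rendering code):

```python
import multiprocessing

def render_page(entry):
    # CPU-bound HTML rendering happens in a worker process, so it is not limited by the GIL
    path, raw_story = entry
    html = "<html>...</html>"  # placeholder for the real rendering
    return path, html

def rendered_pages(entries, workers=4):
    # yields (path, html) back to the single thread that owns the Creator;
    # on platforms using "spawn", run this from under an `if __name__ == "__main__":` guard
    with multiprocessing.Pool(workers) as pool:
        yield from pool.imap(render_page, entries)
```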
> Is https://github.com/openzim/python-libzim?tab=readme-ov-file#thread-safety enough?
I think that should be enough.
Thanks for this useful feedback!