Version 2.0 Goals
Now that puremagic is picking up some outside traction and is used in places like MongoDB, I want to lay out clear future plans.
- Stay backwards compatible. Anything changed or added has to be behind a feature flag.
- #71 Faster. (Is a JSON file the best way to store the data? Switch to a tree lookup instead of loop iteration?)
- Higher accuracy. Some ideas in #12
- Even better test coverage. All platforms, all current Python versions, both success and failure cases. (Started in #67)
- Documentation improvements
- Better sub variation names #69
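
To make the "tree lookup instead of loop iteration" idea from the speed goal concrete, here is a minimal sketch of a byte-prefix trie next to a plain loop. The `SIGNATURES` table and every function name are made up for illustration and are not puremagic's actual data layout or API. It also only covers signatures at offset 0.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical signature table: (magic bytes at offset 0, extension).
SIGNATURES: Sequence[Tuple[bytes, str]] = (
    (b"%PDF", ".pdf"),
    (b"PK\x03\x04", ".zip"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
)


def linear_lookup(header: bytes) -> List[str]:
    """Loop iteration: compare the header against every signature."""
    return [ext for magic, ext in SIGNATURES if header.startswith(magic)]


class TrieNode:
    """One node per byte value; extensions are stored where a signature ends."""

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.extensions: List[str] = []


def build_trie(signatures: Sequence[Tuple[bytes, str]]) -> TrieNode:
    root = TrieNode()
    for magic, ext in signatures:
        node = root
        for byte in magic:
            node = node.children.setdefault(byte, TrieNode())
        node.extensions.append(ext)
    return root


def trie_lookup(root: TrieNode, header: bytes) -> List[str]:
    """Tree lookup: walk the header once, collecting every signature passed."""
    matches: List[str] = []
    node = root
    for byte in header:
        next_node = node.children.get(byte)
        if next_node is None:
            break
        node = next_node
        matches.extend(node.extensions)
    return matches


ROOT = build_trie(SIGNATURES)
```

The trie walk costs at most one dictionary lookup per header byte regardless of how many signatures are loaded, while the loop grows with the table. Whether that wins in practice would need measuring against the real data.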
Please keep comments on this page limited to overall goals, any specific conversations about any goal should be their own issue and will be updated here.
Could #69 be a new feature for 2.0? Compatibility-wise, the new field should not break anything (that I'm aware of).
Hi, I found your project via "Explore repositories" on the github.com homepage feed. I have a kinda similar project: https://github.com/CatKasha/yet-another-filetype-checker. Idk if it will be helpful (my project is very simple), but I hope it will give you some ideas for improvements.
I just found this: https://mark0.net/soft-trid-e.html
Not sure how well known it is, but it contains "over 17k file types". The file signatures do not have an explicit data license attached to them, but at the very least they might be useful to compare against.
maybe related:
- https://github.com/MarcoPon/TrID2bt/blob/master/trid2bt.py
- https://www.sweetscape.com/010editor/repository/templates/
TrID is one of the oldest filetype sites/software out there. That site has looked near enough the same for decades.
Their database is pretty solid and very extensive, but they cannot generate a confidence score or process more complicated searches. For example, .SBK Creative Soundfont is only handled as an extension, whereas we can look at the file in two places to generate a match.
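
As a rough illustration of matching in two places and attaching a confidence, here is a small sketch. The rule table, the `sfbk` form type at offset 8 for SoundFont banks, and the confidence formula (matched bytes divided by an arbitrary cap of 10) are all assumptions made for the example, not puremagic's real rules or scoring.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    checks: Tuple[Tuple[int, bytes], ...]  # (offset, expected bytes)


# Hypothetical rules; offsets and tags are illustrative assumptions.
RULES: Tuple[Rule, ...] = (
    Rule("RIFF container (generic)", ((0, b"RIFF"),)),
    Rule("SoundFont bank", ((0, b"RIFF"), (8, b"sfbk"))),
    Rule("WAVE audio", ((0, b"RIFF"), (8, b"WAVE"))),
)


def match(data: bytes) -> List[Tuple[str, float]]:
    """Return (name, confidence) pairs, best match first."""
    results = []
    for rule in RULES:
        if all(data[off:off + len(exp)] == exp for off, exp in rule.checks):
            matched = sum(len(exp) for _, exp in rule.checks)
            # Arbitrary scoring: more matched bytes means more confidence.
            results.append((rule.name, min(1.0, matched / 10)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

With a header of `b"RIFF" + 4 size bytes + b"sfbk"`, both the generic rule and the two-place rule match, but the two-place rule ranks first because it matched more bytes.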
Starting work on adding more advanced scanners. Rough right now, but have detection for unusual PDFs https://github.com/cdgriffith/puremagic/issues/94 and better ZIP type format detection https://github.com/cdgriffith/puremagic/issues/102 (MS Office, Open Office, JAR, APK, etc...)
https://github.com/cdgriffith/puremagic/tree/deep-scan/puremagic/scanners
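
For anyone wondering what ZIP type detection can look like, here is a minimal standalone sketch using only the standard library `zipfile` module. It is not the code in the deep-scan branch. The function names are hypothetical, and the member-name checks are a simplified heuristic (for instance, a real JAR or Office file can be laid out in ways this misses).

```python
import io
import zipfile
from typing import Dict, Optional

OOXML = "application/vnd.openxmlformats-officedocument."


def sniff_zip(data: bytes) -> Optional[str]:
    """Guess a MIME type for ZIP-based formats from their member names."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        # OpenDocument files carry their MIME type in a "mimetype" member.
        if "mimetype" in names:
            declared = zf.read("mimetype").decode("ascii", "replace").strip()
            if declared.startswith("application/vnd.oasis.opendocument."):
                return declared
    # MS Office (OOXML) packages have a [Content_Types].xml part.
    if "[Content_Types].xml" in names:
        if any(n.startswith("word/") for n in names):
            return OOXML + "wordprocessingml.document"
        if any(n.startswith("xl/") for n in names):
            return OOXML + "spreadsheetml.sheet"
        if any(n.startswith("ppt/") for n in names):
            return OOXML + "presentationml.presentation"
    # Check APK before JAR: APKs usually contain a JAR-style manifest too.
    if "AndroidManifest.xml" in names:
        return "application/vnd.android.package-archive"
    if "META-INF/MANIFEST.MF" in names:
        return "application/java-archive"
    return "application/zip"


def make_zip(members: Dict[str, str]) -> bytes:
    """Demo helper: build an in-memory ZIP from {member name: content}."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buffer.getvalue()
```

Usage: `sniff_zip(make_zip({"AndroidManifest.xml": "", "META-INF/MANIFEST.MF": ""}))` returns the APK type, because the APK check runs before the JAR check.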
Before release want to add scanners for:
- [x] ASCII Text
- [ ] Encoded text (min UTF-8, UTF-16 and Windows standard)
- [x] Generic Binary File
- [x] PDF
- [x] ZIP, Word, Open Office
- [x] Python files (and other languages, eventually)
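
For the still-open encoded text item, one possible approach is a BOM check followed by strict trial decoding. This is only a sketch under assumptions: the function name is hypothetical, "Windows standard" is taken to mean cp1252, and BOM-less UTF-16 is deliberately not handled (anything containing a NUL byte without a BOM is treated as not text).

```python
import codecs
from typing import Optional

# UTF-32 BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_text_encoding(data: bytes) -> Optional[str]:
    """Return a likely encoding name, or None if the data is not text-like."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if b"\x00" in data:
        # Probably binary, or UTF-16/32 without a BOM (not handled here).
        return None
    try:
        data.decode("utf-8")
        return "ascii" if data.isascii() else "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        # cp1252 leaves a few byte values undefined, so this can still fail.
        data.decode("cp1252")
        return "cp1252"
    except UnicodeDecodeError:
        return None
```

Single-byte code pages are very permissive, so a cp1252 result here is a weak guess and would deserve a low confidence in any real scanner.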
Still need to do:
- [x] Tests, Tests, Tests
- [ ] Code simplification
- [ ] Documentation updates
- [ ] Support for streams, not just files
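
On the streams item, the main thing a scanner needs is to read a header (and possibly a footer) without disturbing the caller's position. Below is a minimal sketch assuming a seekable binary stream. The helper names are hypothetical and this is not how puremagic handles streams.

```python
import io
from typing import BinaryIO


def peek_header(stream: BinaryIO, size: int = 16) -> bytes:
    """Read up to `size` bytes from the start, then restore the position."""
    if not stream.seekable():
        raise ValueError("stream must be seekable to peek without consuming")
    position = stream.tell()
    try:
        stream.seek(0)
        return stream.read(size)
    finally:
        stream.seek(position)


def peek_footer(stream: BinaryIO, size: int = 16) -> bytes:
    """Read up to `size` bytes from the end, then restore the position."""
    if not stream.seekable():
        raise ValueError("stream must be seekable to peek without consuming")
    position = stream.tell()
    try:
        end = stream.seek(0, io.SEEK_END)
        stream.seek(max(0, end - size))
        return stream.read(size)
    finally:
        stream.seek(position)
```

Non-seekable streams (sockets, pipes) would need buffering instead, which this sketch leaves out.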
Won't be able to work on more myself for at least two weeks, hence this in-progress documentation. The biggest help would be a testing framework for scanners, if anyone wants to contribute to a part of this!
Just had a quick skim through the code and this is awesome stuff. The zip method is way better coded than I can manage, but I can see it works as I sort of thought it would in my head. If I want to help fill in some of the .zip what's the best way? I'm guessing I need to fork the dev branch?
Looking at the two examples I can see the rough idea of how to improve some of the more complex formats I've mentioned in my PRs. For example, we could heavily reduce the size of the .json by moving all the .mp3 related stuff I added into a dedicated scanner. The scanner itself would likely be smaller than the .json entries, as we would not need to repeat everything so heavily.
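
To show why a dedicated scanner could replace many repeated .json entries, here is a rough sketch of an MP3 check that tests the frame header bit fields instead of listing every possible byte combination. The function name is hypothetical, and treating any `ID3` prefix as MP3 is a simplifying assumption (ID3v2 tags can appear on other audio formats too).

```python
def looks_like_mp3(header: bytes) -> bool:
    """Heuristic: ID3v2 tag, or an MPEG audio Layer III frame header."""
    if header[:3] == b"ID3":
        # Assumption: an ID3v2 tag at offset 0 is treated as MP3.
        return True
    if len(header) < 2:
        return False
    first, second = header[0], header[1]
    # Frame sync is 11 set bits: all of byte 0 plus the top 3 bits of byte 1.
    has_sync = first == 0xFF and (second & 0xE0) == 0xE0
    version = (second >> 3) & 0x03  # 0b01 is reserved
    layer = (second >> 1) & 0x03    # 0b01 means Layer III
    return has_sync and version != 0x01 and layer == 0x01
```

One bit-field test like this covers every valid version, protection, and bitrate combination that would otherwise each need their own signature entry.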
Decided that if I try for perfection, it would never get done. 2.0 beta is out now! https://github.com/cdgriffith/puremagic/releases/tag/2.0.0b1
@NebularNerd if you want to work on any scanners like mp3, you can make any PRs against the develop branch 😄
Exciting stuff, I shall have a look and play when I have some free time. 🙂