Version 2.0 Goals

Open cdgriffith opened this issue 1 year ago • 6 comments

Now that puremagic is picking up some outside traction and is used in places like MongoDB, I want to lay out clear future plans.

  • Stay backwards compatible. Anything changed or added has to be behind a feature flag.
  • Faster (#71). Is a JSON file the best way to store the data? Should we switch to a tree lookup instead of loop iteration?
  • Higher accuracy. Some ideas in #12
  • Even better test coverage: all platforms, all current Python versions, both success and failure cases. (Started in #67)
  • Documentation improvements
  • Better sub variation names #69
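
To make the "tree lookup instead of loop iteration" idea concrete, here is a minimal sketch of a byte trie over file signatures. It is only an illustration, not how puremagic stores or searches its data: the `TrieNode`, `build_trie` and `lookup` names are hypothetical, the table is a small hand-picked sample, and it only handles signatures that start at offset 0.

```python
from typing import Dict, List


class TrieNode:
    """One byte position in the signature trie."""

    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.names: List[str] = []  # formats whose signature ends at this node


def build_trie(signatures: Dict[bytes, str]) -> TrieNode:
    """Insert every signature byte by byte so shared prefixes share nodes."""
    root = TrieNode()
    for magic, name in signatures.items():
        node = root
        for byte in magic:
            node = node.children.setdefault(byte, TrieNode())
        node.names.append(name)
    return root


def lookup(root: TrieNode, header: bytes) -> List[str]:
    """Return every format whose signature is a prefix of header, longest first."""
    matches: List[str] = []
    node = root
    for byte in header:
        node = node.children.get(byte)
        if node is None:
            break  # no known signature continues with this byte
        matches.extend(node.names)
    return matches[::-1]


# Illustrative sample only; not puremagic's real data or layout.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

TRIE = build_trie(SIGNATURES)
```

With this layout a lookup walks at most as many nodes as the longest signature has bytes, however many signatures are loaded, whereas a loop compares the header against every entry. Signatures at non-zero offsets or in file footers would need separate handling.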

Please keep comments on this page limited to overall goals. Any specific conversation about a goal should be its own issue, and this page will be updated to reflect it.

cdgriffith avatar May 12 '24 21:05 cdgriffith

Could #69 be a new feature for 2.0? Compatibility-wise, the new field would/should not break anything (that I'm aware of).

NebularNerd avatar May 12 '24 22:05 NebularNerd

Hi, I found your project via the "Explore repositories" feed on the github.com homepage. I have a somewhat similar project (https://github.com/CatKasha/yet-another-filetype-checker). I don't know if it will be helpful (my project is very simple), but I hope it gives you some ideas for improvements.

CatKasha avatar May 13 '24 21:05 CatKasha

I just found this: https://mark0.net/soft-trid-e.html

Not sure how well known it is, but it contains "over 17k file types". The file signatures do not have an explicit data license attached, but at the very least they might be useful to compare against.

maybe related:

  • https://github.com/MarcoPon/TrID2bt/blob/master/trid2bt.py
  • https://www.sweetscape.com/010editor/repository/templates/

chapmanjacobd avatar May 26 '24 04:05 chapmanjacobd

TrID is one of the oldest filetype sites/software out there. That site has looked near enough the same for decades.

Their database is pretty solid and very extensive, but they cannot generate a confidence score or process more complicated searches. For example, .SBK (Creative SoundFont) is only handled by extension, whereas we can look at the file in two places to generate a match.
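
As a rough illustration of what matching "in two places" can look like, the sketch below checks several (offset, bytes) pairs per rule and turns the number of matched bytes into a score. It uses the well-known RIFF container family rather than .SBK, and the `Rule` class, the `match` function and the scoring formula are all assumptions made up for this example, not puremagic's actual rules or confidence calculation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    parts: Tuple[Tuple[int, bytes], ...]  # (offset, expected bytes) pairs


# RIFF files start with b"RIFF", a 4-byte size, then a 4-byte form type.
RULES = [
    Rule("riff", ((0, b"RIFF"),)),
    Rule("wav", ((0, b"RIFF"), (8, b"WAVE"))),
    Rule("avi", ((0, b"RIFF"), (8, b"AVI "))),
    Rule("webp", ((0, b"RIFF"), (8, b"WEBP"))),
]


def match(data: bytes) -> List[Tuple[str, float]]:
    """Return (name, score) for every rule whose parts all match, best first."""
    results = []
    for rule in RULES:
        if all(data[off:off + len(sig)] == sig for off, sig in rule.parts):
            matched = sum(len(sig) for _, sig in rule.parts)
            # Hypothetical scoring: more matched bytes means higher confidence.
            results.append((rule.name, min(1.0, matched / 10)))
    return sorted(results, key=lambda item: item[1], reverse=True)


header = b"RIFF" + b"\x24\x00\x00\x00" + b"WAVEfmt "
print(match(header))  # the two-part "wav" rule outranks the generic "riff" rule
```

A database that only knows the first four bytes would stop at "RIFF"; the second check at offset 8 is what separates the specific formats and justifies a higher score.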

NebularNerd avatar May 26 '24 08:05 NebularNerd

Starting work on adding more advanced scanners. Rough right now, but have detection for unusual PDFs https://github.com/cdgriffith/puremagic/issues/94 and better ZIP type format detection https://github.com/cdgriffith/puremagic/issues/102 (MS Office, Open Office, JAR, APK, etc...)

https://github.com/cdgriffith/puremagic/tree/deep-scan/puremagic/scanners
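
For readers unfamiliar with why ZIP needs a deeper scan: all of these formats share the same ZIP signature, so they can only be told apart by what is inside the archive. The sketch below is not the code from the deep-scan branch; `classify_zip` and `make_zip` are hypothetical helpers that use only the standard library, and the checks rely on conventional member names, so unusual files may be misclassified.

```python
import io
import zipfile
from typing import Dict


def classify_zip(data: bytes) -> str:
    """Guess a more specific type for ZIP-based data from its member names."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = set(archive.namelist())
        # OpenDocument files carry their MIME type in a member named "mimetype".
        if "mimetype" in names:
            mime = archive.read("mimetype").decode("ascii", "replace").strip()
            if mime.startswith("application/vnd.oasis.opendocument."):
                return mime
    # MS Office (OOXML) packages have [Content_Types].xml plus a main part.
    office_parts = {
        "word/document.xml": "docx",
        "xl/workbook.xml": "xlsx",
        "ppt/presentation.xml": "pptx",
    }
    if "[Content_Types].xml" in names:
        for member, label in office_parts.items():
            if member in names:
                return label
    # Check APK before JAR: APKs usually contain META-INF/MANIFEST.MF too.
    if "AndroidManifest.xml" in names:
        return "apk"
    if "META-INF/MANIFEST.MF" in names:
        return "jar"
    return "zip"


def make_zip(members: Dict[str, bytes]) -> bytes:
    """Build a small in-memory archive, handy for trying the function out."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


sample = make_zip({"[Content_Types].xml": b"<Types/>", "word/document.xml": b"<w/>"})
print(classify_zip(sample))
```

The order of the checks matters, since a more specific format often contains the markers of a more generic one.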

Before release want to add scanners for:

  • [x] ASCII Text
  • [ ] Encoded text (min UTF-8, UTF-16 and Windows standard)
  • [x] Generic Binary File
  • [x] PDF
  • [x] ZIP, Word, Open Office
  • [x] Python files (and other languages, eventually)

Still need to do:

  • [x] Tests, Tests, Tests
  • [ ] Code simplification
  • [ ] Documentation updates
  • [ ] Support for streams, not just files

Won't be able to work on more myself for at least two weeks, hence this in-progress documentation. The biggest help would be a testing framework for scanners, if anyone wants to contribute to a part of this!

cdgriffith avatar Sep 28 '24 23:09 cdgriffith

Just had a quick skim through the code and this is awesome stuff. The zip method is way better coded than I can manage, but I can see it works as I sort of thought it would in my head. If I want to help fill in some of the .zip, what's the best way? I'm guessing I need to fork the dev branch?

Looking at the two examples, I can see the rough ideas of how to improve some of the more complex formats I've mentioned in my PRs. For example, we could heavily reduce the size of the .json by shoving all the .mp3 related stuff I added into a dedicated scanner. That scanner would likely be smaller than the equivalent .json entries, as we would not need to repeat everything so heavily.

NebularNerd avatar Sep 29 '24 08:09 NebularNerd

Decided that if I try for perfection, it would never get done. 2.0 beta is out now! https://github.com/cdgriffith/puremagic/releases/tag/2.0.0b1

cdgriffith avatar May 04 '25 22:05 cdgriffith

@NebularNerd if you want to work on any scanners like mp3, you can make any PRs against the develop branch 😄

cdgriffith avatar May 05 '25 01:05 cdgriffith

Exciting stuff, I shall have a look and play when I have some free time. 🙂

NebularNerd avatar May 05 '25 10:05 NebularNerd