
Some common filetypes are not detected

Open victordomingos opened this issue 7 years ago • 10 comments

puremagic seems to fail to detect some very common file types, such as text files (.py, .txt, .md).

$ file changelog.txt
changelog.txt: ASCII English text

$ python3.6 -m puremagic ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

$ python3.6 -m puremagic -m ./changelog.txt
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py:125: 
RuntimeWarning: 'puremagic.__main__' found in sys.modules after import of package 
'puremagic', but prior to execution of 'puremagic.__main__'; this may result in 
unpredictable behaviour
  warn(RuntimeWarning(msg))
'./changelog.txt' : could not be Identified

victordomingos avatar May 08 '18 23:05 victordomingos

You are correct: it cannot detect these because those file types have no magic numbers to match on. They require additional analysis to produce a best guess, and I have not written that yet.

For example, it does support Python files whose first line is '#!/usr/bin/env python', but it would be better to upgrade this module to do some looser matching or some analysis to give more and better results. (I already tried to capture this idea in https://github.com/cdgriffith/puremagic/issues/3, but it is better spelled out with your example.)

I don't have the time to work on it currently, but I do remember how I planned to implement it, so I will capture that in this issue:

  • Create a subdirectory where 'detectors' live
  • If detectors are enabled (probably by default), load all files from that directory
  • Each detector has a standard format / entry point that will be called against each file
  • Each detector is for a specific file type and will return its confidence and filetype information
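
To make the plan above concrete, here is a rough sketch of how it could look. None of this exists in puremagic: the names load_detectors, detect and best_guess, the dict keys, and the confidence heuristic are all hypothetical, chosen only to illustrate the idea.

```python
import importlib.util
from pathlib import Path


def load_detectors(directory):
    """Import each .py file in `directory` and collect its `detect` callable.

    Hypothetical convention: every detector module exposes `detect(data)`.
    """
    detectors = []
    for path in sorted(Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        detect = getattr(module, "detect", None)
        if callable(detect):
            detectors.append(detect)
    return detectors


def detect_plain_text(data):
    """Example detector: guess 'plain text' if the bytes decode as UTF-8.

    Returns a dict with file type information and a confidence between
    0.0 and 1.0, or None if this detector does not recognise the data.
    """
    if not data:
        return None
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Assumed heuristic: the share of printable characters is the confidence.
    printable = sum(ch.isprintable() or ch in "\r\n\t" for ch in text)
    return {
        "extension": ".txt",
        "name": "Plain text",
        "confidence": printable / len(text),
    }


def best_guess(data, detectors):
    """Run every detector against the data and keep the most confident result."""
    results = [r for r in (detect(data) for detect in detectors) if r is not None]
    return max(results, key=lambda r: r["confidence"], default=None)


if __name__ == "__main__":
    print(best_guess(b"Version 1.0\n- initial release\n", [detect_plain_text]))
```

Returning a confidence rather than a yes/no answer lets several detectors claim the same file, with the caller picking the strongest match or listing all of them.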

cdgriffith avatar May 08 '18 23:05 cdgriffith