Find file by content type
Is there a way to find video files that do not have an extension?
No, that's currently not possible. FSearch doesn't look at the content of a file during scanning or searching, so there's no way to determine the file type from the content.
In the future maybe this could be possible with a certain search function (e.g. contenttype:), but that's a low priority for me at the moment.
That could be done using ffmpeg, right?
Yes, but there are more generic solutions out there, which also work for non-media files. For example, the GLib library has g_content_type_guess(), which might be exactly what we're looking for.
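To illustrate what guessing a type from content means, here is a minimal Python sketch. It is not how GLib or FSearch do it: the signature table and the function names are hypothetical, and real detectors rely on a much larger database. It only shows the idea of comparing the first bytes of a file against known magic numbers.

```python
# Hypothetical, deliberately tiny signature table: (offset, magic bytes, MIME type).
# Simplification: "ftyp" at offset 4 is shared by several ISO media formats,
# and the EBML header is also used by WebM; both are mapped to one type here.
MAGIC_SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (4, b"ftyp", "video/mp4"),
    (0, b"\x1a\x45\xdf\xa3", "video/x-matroska"),
]


def sniff_content_type(header: bytes) -> str:
    """Guess a MIME type from the leading bytes of a file."""
    for offset, magic, mime in MAGIC_SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return mime
    # Nothing matched: fall back to the generic binary type.
    return "application/octet-stream"


def sniff_file(path: str) -> str:
    """Read only the first few bytes of the file and guess its type."""
    with open(path, "rb") as f:
        return sniff_content_type(f.read(16))
```

Because only the header is inspected, the file name plays no role, which is why a video without an extension can still be recognised. Each file still has to be opened and read, though.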
If this works reasonably well I might actually implement this quite soon, as it's not that much work.
I thought the same, it is probably not that much work. Also, you would be caching the content type together with the database structure that currently exists, right? It would only add an extra verification while the files are indexed, I guess.
I just quickly hacked together a demo and it works as expected. For example, contenttype:video also detects video files without an extension. However, as I suspected, it is quite slow: it currently takes about 10 seconds for every 100,000 files. So the user needs to build queries efficiently to speed up the process (i.e. narrow down the files before the content type is queried). For example, the query path:/home/user/downloads contenttype:video will be much faster than contenttype:video path:/home/user/downloads, since the first one only queries the content type of files within the downloads folder, whereas the second one first queries the content type of all files and only then checks whether they're within the downloads folder. In such simple cases this could probably be fixed by automatically sorting the individual queries by their "weight", but for more complex queries this won't work.
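The effect of query order can be shown with a small Python sketch. This is only an illustration under assumptions, not FSearch's actual query engine: the cost weights, the helper names and the fake is_video check are all made up. It counts how often the expensive check runs with and without sorting the filters by weight.

```python
from typing import Callable, Iterable, List, Tuple

# (relative cost, predicate); the cost numbers are invented weights.
Filter = Tuple[int, Callable[[str], bool]]


def run_query(paths: Iterable[str], filters: List[Filter], reorder: bool = True) -> List[str]:
    """Return the paths that pass every filter, optionally trying cheap filters first."""
    ordered = sorted(filters, key=lambda f: f[0]) if reorder else filters
    # all() short-circuits, so later filters only run on paths that passed earlier ones.
    return [p for p in paths if all(pred(p) for _, pred in ordered)]


content_checks = 0  # counts how often the expensive check is executed


def in_downloads(path: str) -> bool:
    return path.startswith("/home/user/downloads/")


def is_video(path: str) -> bool:
    global content_checks
    content_checks += 1
    return path.endswith("clip")  # stand-in for real content sniffing


files = ["/home/user/downloads/clip", "/home/user/docs/clip", "/home/user/docs/notes"]
filters = [(100, is_video), (1, in_downloads)]  # expensive filter written first

run_query(files, filters, reorder=False)
unsorted_checks = content_checks  # content check ran on every file

content_checks = 0
run_query(files, filters, reorder=True)
sorted_checks = content_checks  # content check ran only inside downloads
```

Both runs return the same result, but with sorting the content check runs only on the single file that already passed the path filter. This works for a flat list of AND-ed filters; as noted above, more complex queries don't reduce to a simple sort.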
Also, you would be caching the content type together with the database structure that currently exists, right? It would only add an extra verification while the files are indexed, I guess.
Yes, in theory that's possible. However, it would also significantly slow down the indexing process and it would require additional RAM. So if this is going to be added, it will be made optional and disabled by default.
For now, I think what you did already solves my current issue, for sure.
Anyway, memory and CPU are not an issue for me; I currently have 64 GB and 64 threads available.
Can you make a release in the PPA, so I can test that?
Unfortunately, content scanning, like any other filesystem access, doesn't scale well with multiple threads. In fact, it can even be much slower than doing it single-threaded.
Hence, before I provide an official build, I also need to make the search fall back to single-threaded mode when it contains a content-type query. Maybe I'll find some time this weekend to work on that. I'll keep you updated.
Yes, I know, mechanical hard drives have lower IOPS than SSDs and RAM, which is why some software does multi-level (hierarchical) caching.
I tried something like that using the package "preload" and a couple of other optimizations on my Ubuntu system, but I am not sure it comes anywhere near what I saw available for Windows (AMD, Intel, and some other companies offer multi-level caching systems to avoid the bottleneck caused by mechanical hard drives).
@flickleafy the new contenttype function just landed in master. Within the next 24 hours the PPA for the development builds should have the updated builds.
To use it you can type contenttype:video or contenttype:image/png. The content type is in the format of MIME types (image/png, image/jpeg, ...) and you can also use wildcards or regular expressions for them. So something like this works as well: path:/home/user/downloads regex:contenttype:image$ to detect types like ISO images instead of pictures.
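To make the difference between these pattern styles concrete, here is a small Python sketch. It does not use FSearch's matcher: it assumes that a plain term behaves like a substring match (as the contenttype:video example suggests), and it uses application/x-cd-image as an example of a type name for ISO images. The point is only to show why anchoring the regular expression with $ excludes pictures.

```python
import fnmatch
import re

types = ["image/png", "image/jpeg", "video/mp4", "application/x-cd-image"]

# Plain term, assumed here to behave like a substring match.
plain = [t for t in types if "image" in t]

# Regular expression anchored at the end: only types that END in "image".
anchored = [t for t in types if re.search(r"image$", t)]

# Shell-style wildcard: only types in the image/ family.
wild = [t for t in types if fnmatch.fnmatchcase(t, "image/*")]
```

Here plain matches the two pictures and the disc image, anchored matches only the disc image, and wild matches only the two pictures.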
I gave it a try and, as I predicted, it found files that had no extension (some files were recovered from a damaged drive and lost their extensions). But in the file listing, shouldn't the "Type" column tell me the internal encoding of the file? For example: MPEG video, MPEG-4 video, ...
Currently, the "Type" column only shows "unknown".
I have MediaInfo installed for Nautilus, and when I open the file properties, it shows the format properly, as expected.
@flickleafy that's to be expected. The Type column only tries to guess the type from the name. I'll probably add an additional column for the Content Type, which will be more accurate but more resource intensive than the current Type column.
That makes sense.
You could add another setting with three options:
Disabled - does not check the real content in any case.
Enabled all - checks the real content of all files.
Enabled for unknown - checks the real content only for files with no extension.
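The three options could be sketched roughly like this in Python. This is just an illustration of the proposal, not an existing FSearch setting: the enum, its values and the helper function are hypothetical names, and "unknown" is simplified to "the file name has no extension".

```python
import os
from enum import Enum


class ContentCheck(Enum):
    """Hypothetical setting with the three proposed modes."""
    DISABLED = "disabled"
    ALL = "all"
    UNKNOWN_ONLY = "unknown"


def needs_content_check(filename: str, mode: ContentCheck) -> bool:
    """Decide whether the real content of a file should be inspected."""
    if mode is ContentCheck.DISABLED:
        return False
    if mode is ContentCheck.ALL:
        return True
    # UNKNOWN_ONLY: simplification, a file counts as unknown if it has no extension.
    return os.path.splitext(filename)[1] == ""
```

With UNKNOWN_ONLY, a file named recovered_0001 would be inspected while movie.mp4 would be skipped, so the expensive check stays limited to files the name-based guess can't handle.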