
Find file by content type

Open flickleafy opened this issue 3 years ago • 12 comments

Is there a way to find video files that do not have an extension?

flickleafy avatar May 11 '22 20:05 flickleafy

No, that's currently not possible. FSearch doesn't look at the content of a file during scanning or searching, so there's no way to determine the file type from the content.

In the future maybe this could be possible with a certain search function (e.g. contenttype:), but that's a low priority for me at the moment.

cboxdoerfer avatar May 12 '22 04:05 cboxdoerfer

That could be done using ffmpeg, right?

flickleafy avatar May 12 '22 15:05 flickleafy

Yes, but there are more generic solutions out there, which also work for non-media files. For example the glib library has g_content_type_guess(), which might be exactly what we're looking for.

If this works reasonably well I might actually implement this quite soon, as it's not that much work.

cboxdoerfer avatar May 12 '22 16:05 cboxdoerfer
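
To illustrate the idea of detecting a type from content rather than from the file name, here is a minimal sketch. It is not how g_content_type_guess() works internally (GLib relies on the much larger shared MIME database); the function names `sniff_content_type` and `sniff_file` and the tiny signature table are assumptions made up for this example.

```python
from typing import Optional


def sniff_content_type(header: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes of a file, ignoring its name.

    Hypothetical, deliberately tiny signature table for illustration only.
    """
    if header.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if header.startswith(b"\x1a\x45\xdf\xa3"):
        # EBML header, used by Matroska (and WebM) containers
        return "video/x-matroska"
    if len(header) >= 12 and header[4:8] == b"ftyp":
        # ISO base media file; treated as MP4 video here for simplicity
        return "video/mp4"
    if header.startswith(b"RIFF") and header[8:12] == b"AVI ":
        return "video/x-msvideo"
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first 16 bytes of the file and sniff them."""
    with open(path, "rb") as f:
        return sniff_content_type(f.read(16))
```

Even this small version has to open and read every file it classifies, which is where the cost of content-based detection comes from.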

I thought the same, it is probably not that much work. Also, you would be caching the content type together with the database structure that currently exists, right? It would only add an extra verification while the files are indexed, I guess.

flickleafy avatar May 12 '22 17:05 flickleafy

I just quickly hacked together a demo and it works as expected. For example, contenttype:video also detects video files without an extension. However, as I expected, it is quite slow: it currently takes about 10 seconds for every 100,000 files. So the user needs to build queries efficiently to speed up the process, i.e. narrow down the files before the content type is queried. For example, the query path:/home/user/downloads contenttype:video will be much faster than contenttype:video path:/home/user/downloads, since the first one only queries the content type of files within the downloads folder, whereas the second one first queries the content type of all files and then checks whether they're within the downloads folder. In such simple cases this could probably be fixed by automatically sorting the individual queries by their "weight", but for more complex queries this won't work.

Also, you would be caching the content type together with the database structure that currently exists, right? It would only add an extra verification while the files are indexed, I guess.

Yes in theory that's possible, however, it would also significantly slow down the indexing process and it would require additional RAM. So if this is going to be added, it will be made optional and disabled by default.

cboxdoerfer avatar May 12 '22 18:05 cboxdoerfer
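
The "sort by weight" idea from the comment above can be sketched as follows for the simple case where all terms are ANDed together. This is not FSearch's implementation; the `Filter` type, the `cost` values and `run_query` are hypothetical names chosen for the example.

```python
from typing import Callable, List, NamedTuple


class Filter(NamedTuple):
    name: str
    cost: int                       # hypothetical relative weight; lower is cheaper
    matches: Callable[[str], bool]


def run_query(paths: List[str], filters: List[Filter]) -> List[str]:
    """AND all filters together, evaluating the cheapest ones first."""
    ordered = sorted(filters, key=lambda f: f.cost)
    # all() short-circuits, so an expensive filter only runs on entries
    # that already passed every cheaper filter.
    return [p for p in paths if all(f.matches(p) for f in ordered)]


# Usage: record which files the (fake) expensive content check touches.
checked: List[str] = []


def fake_content_check(path: str) -> bool:
    checked.append(path)
    return path.endswith("clip")


filters = [
    Filter("contenttype", 100, fake_content_check),
    Filter("path", 1, lambda p: p.startswith("/home/user/downloads/")),
]
paths = ["/home/user/downloads/clip", "/home/user/downloads/notes", "/etc/clip"]
result = run_query(paths, filters)
```

Although the content-type filter is listed first, it is only evaluated for the two files inside the downloads folder. With OR terms or nested groups a single global sort like this is no longer sufficient, which matches the caveat about more complex queries.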

For now, I think what you did already solves my current issue, for sure.

Anyway, memory and CPU are not an issue for me; I currently have 64 GB of RAM and 64 threads available.

Can you make a release in the PPA, so I can test that?

flickleafy avatar May 12 '22 19:05 flickleafy

Unfortunately content scanning, like any other filesystem access, doesn't scale well with multiple threads. In fact it can even be much slower than doing it single threaded.

Hence I also need to make the search fall back to single-threaded mode when it contains a content-type query, before I provide an official build. Maybe I'll find some time this weekend to work on that. I'll keep you updated.

cboxdoerfer avatar May 12 '22 20:05 cboxdoerfer

Yes, I know, mechanical hard drives have lower IOPS than SSDs and RAM, which is why some software does multi-level caching.

I tried something like that using the package "preload" and a couple of other optimizations on my Ubuntu, but I am not sure this is anywhere near what I saw available for Windows (AMD, Intel, and some other companies offer multi-level caching systems to avoid the bottleneck of mechanical hard drives).

flickleafy avatar May 12 '22 20:05 flickleafy

@flickleafy the new contenttype function just landed in master. Within the next 24 hours the PPA for the development builds should have the updated builds.

To use it you can type contenttype:video or contenttype:image/png. The content type is in the format of mime types (image/png, image/jpeg, ...) and you can also use wildcards or regular expressions for them. So something like this works as well: path:/home/user/downloads regex:contenttype:image$ to detect types like ISO images instead of pictures.

cboxdoerfer avatar May 15 '22 11:05 cboxdoerfer
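
As a rough illustration of how plain, wildcard and regex patterns can behave against MIME type strings, here is a small sketch. It assumes that a plain term is a case-insensitive substring match; that assumption, the function name `contenttype_matches` and the sample MIME strings are made up for this example and are not taken from FSearch's code.

```python
import fnmatch
import re


def contenttype_matches(pattern: str, mime: str, use_regex: bool = False) -> bool:
    """Match a MIME type string against a plain, wildcard or regex pattern."""
    if use_regex:
        return re.search(pattern, mime) is not None
    if "*" in pattern or "?" in pattern:
        # Shell-style wildcard, matched against the whole MIME string
        return fnmatch.fnmatchcase(mime.lower(), pattern.lower())
    # Assumed default: case-insensitive substring match
    return pattern.lower() in mime.lower()


# Example MIME strings (illustrative only)
mp4 = "video/mp4"
png = "image/png"
iso = "application/x-cd-image"

plain_hit = contenttype_matches("video", mp4)
wildcard_hit = contenttype_matches("image/*", png)
regex_iso = contenttype_matches("image$", iso, use_regex=True)
regex_png = contenttype_matches("image$", png, use_regex=True)
```

The anchored regex `image$` only matches types that end in "image", so it selects the disc-image type but not `image/png`, which is the distinction the comment above describes.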

I gave it a try, and it found files that had no extension, as I expected (some files were recovered from a damaged drive and lost their extension). But in the file listing, shouldn't the "Type" column tell me the internal encoding of the file? For example: MPEG video, MPEG-4 video, ...

Currently, the "Type" column only shows "unknown".

I have MediaInfo installed in my nautilus, and when I open the file properties, it shows the format properly, as expected.

flickleafy avatar May 17 '22 16:05 flickleafy

@flickleafy that's to be expected. The Type column only tries to guess the type from the name. I'll probably add an additional column for the Content Type, which will be more accurate but more resource intensive than the current Type column.

cboxdoerfer avatar May 17 '22 17:05 cboxdoerfer

That makes sense.

You could make another configuration with 3 options:

Disabled - does not check the real content in any case.
Enabled all - check real content on all files.
Enabled for unknown - that would check only for files with no extension.

flickleafy avatar May 17 '22 22:05 flickleafy
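
The three proposed options could be modelled roughly like this. It is only a sketch of the suggestion above, not an existing FSearch setting; `ContentCheck` and `should_check_content` are hypothetical names.

```python
import os
from enum import Enum


class ContentCheck(Enum):
    DISABLED = "disabled"        # never look at the file content
    ALL = "all"                  # look at the content of every file
    UNKNOWN_ONLY = "unknown"     # look only at files without an extension


def should_check_content(name: str, mode: ContentCheck) -> bool:
    """Decide whether a file's content should be inspected under the given mode."""
    if mode is ContentCheck.DISABLED:
        return False
    if mode is ContentCheck.ALL:
        return True
    # UNKNOWN_ONLY: the name alone gives no hint, so fall back to the content.
    return os.path.splitext(name)[1] == ""
```

With `UNKNOWN_ONLY`, a file named `holiday` would be inspected while `holiday.mp4` would not, which keeps the expensive check limited to the files where the name-based guess has nothing to work with.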