magika Set pebin target_label in content_types

FONNX is an open source Flutter library for running ONNX models cross-platform.

It has tests mirroring the sample files in google/magika.

I added code to parse content_types_config.json into a Dart enum.

Two tests failed after switching to this new model code: mitra/pe32.exe and mitra/pe64.exe.

Investigation revealed the cause: no target_label was set for pebin, so the new model code was using null.

This seems to be an error, since the model does output pebin results.
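A minimal client-side sketch of the failure and one possible workaround, assuming each config entry is keyed by its content type name and carries a `target_label` that may be null. The sample data and the `effective_label` helper are hypothetical, not part of Magika or FONNX:

```python
from typing import Dict, Optional

# Hypothetical, heavily simplified excerpt; real config entries have more fields.
SAMPLE_CONFIG: Dict[str, Dict[str, Optional[str]]] = {
    "exe": {"target_label": "pebin"},
    "dll": {"target_label": "pebin"},
    "pebin": {"target_label": None},  # the unset field described above
}


def effective_label(name: str, config: Dict[str, Dict[str, Optional[str]]]) -> str:
    """Return the entry's target_label, or the entry's own name when it is unset."""
    target = config[name].get("target_label")
    return target if target is not None else name


print(effective_label("exe", SAMPLE_CONFIG))
print(effective_label("pebin", SAMPLE_CONFIG))
```

With this fallback, a model output of pebin resolves to a usable label even though the field is unset in the config.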

Feb 22 '24 06:02 jpohhhh

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Feb 22 '24 06:02 google-cla[bot]

Thanks for reporting! Interesting issue. Your patch fixes a part of the JSON that is not really used by the current codebase, and that I was not expecting external clients to use.

Can you elaborate how you are using that json?

To give you more context, that target_label is used by the training pipeline to know which label to predict for a given content type... but we actually don't have a pebin dataset; we have exe, dll, sys, etc. datasets, which the model is trained to predict as the more generic pebin. That target_label field is not set for pebin because it's a target_label itself, not really a content type we trained for.
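The relationship described here can be sketched as a grouping of content types by their `target_label`. This is only an illustration under the assumption that entries look like the simplified sample below; the data and the `group_by_target` helper are hypothetical and do not reflect the actual training pipeline:

```python
from typing import Dict, List, Optional

# Hypothetical, simplified entries: several trained content types share one target.
SAMPLE_CONFIG: Dict[str, Dict[str, Optional[str]]] = {
    "exe": {"target_label": "pebin"},
    "dll": {"target_label": "pebin"},
    "sys": {"target_label": "pebin"},
    "pebin": {"target_label": None},  # a target itself, with no dataset of its own
}


def group_by_target(config: Dict[str, Dict[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Map each target label to the content types that are trained to predict it."""
    groups: Dict[str, List[str]] = {}
    for name, entry in config.items():
        target = entry.get("target_label")
        if target is not None:
            groups.setdefault(target, []).append(name)
    return groups


groups = group_by_target(SAMPLE_CONFIG)
print(groups)
```

In this sample, pebin appears only as a key of the result, never as a member of a group, which matches the description of it as a target rather than a trained content type.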

But these are likely too many details and it should not be important for external clients. In general, I'm of course happy to fix things, but I fear that what you are doing is not what I was planning for people to do, and there could be more bugs down the road. Getting a bit more context would help!

Feb 22 '24 17:02 reyammer

Good question, it's sort of involved. TL;DR: it's a good source of info for building UIs around Magika and, in my case, for identifying the types with a "text" tag.

I make an LLM wrapper app, and it includes the ability for the user to upload files. At least on macOS and iOS, and I imagine on other operating systems, there's no definitive way to tell if a file is plain text and thus compatible with an LLM. All you get is bytes.

For example, when I tried a Dart code file, I found macOS/iOS provide nothing other than the local file URI, and the ability to read the file bytes.

Magika is really handy because now I can open up the file picker to all file types.

To decide if something is text-y, I can rely on whether the Magika type has a text tag.
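A rough sketch of that check, assuming each config entry exposes a `tags` list and that text-like types carry the tag "text". The sample entries and the `is_texty` helper are hypothetical and only illustrate the idea:

```python
from typing import Dict, List

# Hypothetical, simplified entries; the tag values here are illustrative.
SAMPLE_CONFIG: Dict[str, Dict[str, List[str]]] = {
    "txt": {"tags": ["text"]},
    "pdf": {"tags": []},
}


def is_texty(content_type: str, config: Dict[str, Dict[str, List[str]]]) -> bool:
    """Return True when the content type is known and carries the "text" tag."""
    entry = config.get(content_type)
    return entry is not None and "text" in entry.get("tags", [])


print(is_texty("txt", SAMPLE_CONFIG))
print(is_texty("pdf", SAMPLE_CONFIG))
```

Unknown content types are treated as not text here, which seems the safer default before passing file contents to an LLM.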

Originally, I hand-translated supported_content_types_list.md into an enum. But I realized it didn't have the tags, and there was a significant amount of information useful for building UIs in the content_types_config.json file.

So I wrote a quick parser for that, with a few statements figuring out bounds on the data, here. It generates the model class file here.

Feb 23 '24 07:02 jpohhhh

Thanks for the context! So it seems all you need is for Magika, when it returns the output, to tell you not only the content type and MIME type but also an "is_text" boolean? What if we add an "is_text" field to MagikaOutputFields (https://github.com/google/magika/blob/main/python/magika/types.py#L53)? Would this work? Or, if you are using the JS npm package, we could add that info to the object returned by the JS implementation? (/cc @invernizzi)

In general, I would refrain from putting too much emphasis on the config JSON, as we consider it for internal use and it may (likely!) change in the future. And here it seems what you need is some additional metadata for a given content type -- which we already have!
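To make the proposal concrete, here is a hedged sketch of what an output record with an "is_text" boolean could look like. The `OutputWithIsText` class, the `make_output` helper, the label set, and the MIME values are all hypothetical; this is not the actual MagikaOutputFields definition:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class OutputWithIsText:
    """Hypothetical output record: content type, MIME type, and a text flag."""
    ct_label: str
    mime_type: str
    is_text: bool


# Illustrative set of labels treated as text; a real list would come from the library.
TEXT_LABELS: FrozenSet[str] = frozenset({"txt", "python"})


def make_output(ct_label: str, mime_type: str,
                text_labels: FrozenSet[str] = TEXT_LABELS) -> OutputWithIsText:
    """Build an output record, deriving is_text from the label set."""
    return OutputWithIsText(ct_label, mime_type, ct_label in text_labels)


result = make_output("txt", "text/plain")
print(result.is_text)
```

A client would then branch on `result.is_text` instead of parsing the config JSON itself.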

Feb 23 '24 08:02 reyammer

+1 to what Yanick said. A viable solution is to add an is_text field to Magika's output. We have that info, so it's just a matter of exposing it. Let us know if this will work for you.

Feb 28 '24 11:02 invernizzi

As there is no new info on this, I've opened #294 to track this feature request, and I'm closing this PR. Thanks for bringing this up!

Mar 07 '24 16:03 reyammer

Set pebin target_label in content_types_config.json