Add DataURL validator
Closes #347
As discussed in the linked issue, here’s a first stab at the DataURL validator. Locally I added py39 and py310 to
https://github.com/Pylons/colander/blob/048fb24eeb6c3df21831413943dbf89d7b5776e4/tox.ini#L5
and perhaps that should also be added to the GitHub Actions workflow (see PR https://github.com/Pylons/colander/pull/349)? I’ve not added translations to the locales.
Question: should we pass three msg arguments to the constructor, one for each of the three different errors, or raise just one error message for all cases?
Shame on me. @jenstroeger I forgot to ask you to sign CONTRIBUTORS.txt in the other PR, but we can do that in this one.
A rebase on master or merging master into this branch should fix the CI.
This LGTM. I've requested a review from any maintainer so they can cut a release after approval and merge. Thank you!
@mcdonc @ericof @miohtama @mmerickel @tseaver
Thank you @stevepiercy. Any thoughts on the msg parameters for the DataURL class? And should I add the locales, or not bother for now?
Regarding msg, I think when more than one part of the data URL is not valid, each invalid part should return its own error message, and each message should be customizable. Otherwise the user might need to submit their data up to three times, each attempt returning a different error message.
For locales, see my comment on the linked issue. I am not sure what to use for an unknown language's translation string. I assume that if the translation has an empty string or None, then the default is used, but don't hold me to that. Someone more familiar with the package translationstring would know.
Ok on msg and I’ll adjust the code accordingly, similar to how All handles multiple failures in a single raised exception:
https://github.com/Pylons/colander/blob/a1c525c1f172dc43bd82ba62bcb5521f853a760c/src/colander/__init__.py#L260-L263
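For reference, here is a stdlib-only sketch of that shape; the colander wiring (raising `colander.Invalid` with the collected messages) is omitted, the regex is a simplified stand-in for the PR’s `DATA_URL_REGEX`, and the message texts are placeholders:

```python
import base64
import mimetypes
import re

# Simplified stand-in for the PR's DATA_URL_REGEX, not the exact pattern.
DATA_URL = re.compile(r'^data:([^;,]*)?(;base64)?,(.*)$')

def data_url_errors(value):
    """Check each part of a data URL and collect one message per invalid part."""
    match = DATA_URL.match(value)
    if match is None:
        return ['Not a data URL']
    mime, token, data = match.groups()
    errors = []
    if mime and mime not in mimetypes.types_map.values():
        errors.append('MIME type not valid')
    if token:
        try:
            base64.standard_b64decode(data)
        except ValueError:  # binascii.Error is a ValueError subclass
            errors.append('Data not valid Base64')
    return errors
```

A validator could then raise a single exception carrying all collected messages, much like `All` does.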
Looks like translationstring wraps around gettext which uses the translation catalogues in locale/. In that case, missing translation strings simply default (in this case to English) but I’ll take a closer look before touching that part…
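To illustrate that fallback behaviour with the stdlib directly (the domain name and paths below are just placeholders):

```python
import gettext

# With fallback=True a missing catalogue yields NullTranslations,
# whose gettext() returns the message id unchanged, i.e. the English default.
t = gettext.translation(
    'colander',                  # domain (placeholder)
    localedir='does-not-exist',  # no catalogues here
    languages=['xx'],            # unknown language
    fallback=True,
)
print(t.gettext('Invalid data URL'))  # prints: Invalid data URL
```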
Actually, I think the current implementation is somewhat cumbersome because both MIME type and Base64 data will be handled twice — first by the validator, and then again by the user who unpacks the URL string.
With that in mind it probably makes more sense to implement DataURL as a Type (maybe sub-classing String) such that
data_url = colander.SchemaNode(
    colander.DataURL(),
)
validates the data_url and assigns the data to it. Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
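A stdlib-only sketch of that one-pass idea (in colander this would live in the type’s `deserialize()`, raising `colander.Invalid` instead of `ValueError`; the key names and the simplified regex are assumptions, not the PR’s API):

```python
import base64
import mimetypes
import re
from urllib.parse import unquote_to_bytes

# Simplified stand-in for the PR's DATA_URL_REGEX.
DATA_URL = re.compile(r'^data:([^;,]*)?(;base64)?,(.*)$')

def deserialize_data_url(cstruct):
    """Validate a data URL and unpack its parts in a single pass."""
    match = DATA_URL.match(cstruct)
    if match is None:
        raise ValueError('Not a data URL')
    mime, token, data = match.groups()
    if mime and mime not in mimetypes.types_map.values():
        raise ValueError('MIME type not valid')
    # Base64-encoded payloads are decoded; plain payloads are percent-decoded.
    raw = base64.standard_b64decode(data) if token else unquote_to_bytes(data)
    return {
        'data_url': cstruct,  # original submitted value
        'data_url_mimetype': mime,
        'data_url_data': raw,
    }
```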
Actually, I think the current implementation is somewhat cumbersome because both MIME type and Base64 data will be handled twice — first by the validator, and then again by the user who unpacks the URL string.
With that in mind it probably makes more sense to implement DataURL as a Type (maybe sub-classing String) such that data_url = colander.SchemaNode(colander.DataURL()) validates the data_url and assigns the data to it. Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
Can you explain why the user needs to unpack the data URL string?
If valid, I would store the data URL as submitted. If not valid, I don't really need to present the part's value that the user entered as an error message, e.g., "Bad MIME type: 'foo/bar'". I would display a single concatenated error message. "Data URL not valid. MIME type not valid. Data not valid."
Can you explain why the user needs to unpack the data URL string?
If the data URL is valid then the user will take that URL and unpack the data from it. If the data URL is something like
data:image/png;base64,abcde...xyz==
then we have two potentially expensive passes over that one URL string:
- the validator
  - looks up all MIME types: `mime not in mimetypes.types_map.values()`
  - decodes the data to make sure it’s valid Base64: `base64.standard_b64decode(data)`
- the user then does the same again to unpack the data from the URL.
If we handle the data URL as a Type rather than using a validator I think we might be able to manage both, validation and data unpacking in one go.
I still don't follow the reason to do what you describe. I understand what you propose, but I do not understand why it is needed the second time. What does the user need to do with unpacked data? Maybe I don't understand what you mean by "unpack the data from the URL". Do you mean break the data URL into its 2 to 4 parts?
- protocol: `data:`
- optional mediatype: `image/png`
- optional token: `;base64`
- data: `,abcde...xyz==`
...and then store each of these parts as key/value pairs, an instance of a type, or in separate columns in a relational database?
I assume that it would be fine to store the whole thing data:image/png;base64,abcde...xyz== as a string.
Maybe a bit of context on how I use Colander. Suppose I define a schema
class SomeRequestSchema(colander.MappingSchema):
    data_url = colander.SchemaNode(
        colander.String(),
        validator=colander.DataURL(),
    )
then the request.validated (managed by Cornice’s Colander body validator) would contain the data URL string. The validation went through the above steps, i.e. extracted and checked the MIME type and extracted and decoded the Base64 data — and discarded the results.
Now when the handler processes the request, it again goes through these steps in order to get to the data URL’s MIME type and data. That duplication is what I’d like to avoid.
Or am I missing something?
I'm not familiar with Cornice, so I don't know why it would do two passes on the same data. But from what I can tell on this line:
https://github.com/Cornices/cornice/blob/15dfca94b7cc31e285f9290579399d5a57c6e07a/cornice/validators/_colander.py#L120
...it looks like it only does one pass and returns the deserialized data. What that looks like, I don't know, but I assume it is a dict or multidict.
In Deform you can try the demo to see what it does, and look at its source code:
https://deformdemo.pylonsproject.org/textinput/
Enter:
data:image/png;base64,abcde...xyz==
The captured submission returns:
{'text': 'data:image/png;base64,abcde...xyz=='}
This is generated by:
https://github.com/Pylons/deformdemo/blob/f124a2bc846c9b66510255065f2e07fb7a87a676/deformdemo/__init__.py#L105-L106
There is only one pass on validation. The submitted string is returned in a dict, where the schema name is the key and the submitted string is the value.
In Deform you can try the demo to see what it does, and look at its source code:
https://deformdemo.pylonsproject.org/textinput/
Enter:
data:image/png;base64,abcde...xyz==
The captured submission returns:
{'text': 'data:image/png;base64,abcde...xyz=='}
Yup 🤓 And now I’d like to get the data out of the validated URL:
data_url = validated["text"]
mime, is_b64, data = colander.DataURL.match(data_url).groups()  # This is safe because the URL is already validated.
if is_b64:
    data = base64.standard_b64decode(data)  # Decode a second time.
# Use `mime` and `data` from here on…
So I had to match the regex again and then decode the data again which is what I mean above with “That duplication is what I’d like to avoid.”
Yup 🤓 And now I’d like to get the data out of the validated URL:
Yes, but why? Once you get it then decode it to bytes as in your example, what would you do with it?
I mean, the whole point of data URLs is to keep things inline to reduce network requests. Do you intend to use the bytes for something other than inline HTML?
Do you intend to use the bytes for something other than inline HTML?
Oh… this has nothing to do with HTML! I actually have users who want to send a data URL in a REST request’s JSON body where a URL is required—a way of transferring small files instead of uploading them someplace. So instead of fetching data from the URL, I need to decode the data URL to use the file.
Ah, now I see. That makes sense.
Is there any other type in Colander that returns the parts? I did not see one, so this would be a new feature.
My only concern is that the capture should contain the original submitted value and be easily consumable, similar to regular URLs and emails.
Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
I would prefer this way. Maintainers might have different opinions.
Would the data_token be needed as well?
My only concern is that the capture should contain the original submitted value and be easily consumable, similar to regular URLs and emails.
Agreed.
Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
I would prefer this way. Maintainers might have different opinions.
Let’s wait and see if one or more of them respond to this conversation.
Would the data_token be needed as well?
Which one is the “token” here?
Which one is the “token” here?
From https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs#syntax
[;base64]
Ah, yes, that’s needed to know whether the data itself is plain or Base64-encoded. It’s the second group in the DATA_URL_REGEX:
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs
DATA_URL_REGEX = (
    # data: (required)
    r'^data:'
    # optional mime type
    r'([^;]*)?'
    # optional base64 identifier
    r'(;base64)?'
    # actual data follows the comma
    r',(.*)$'
)
which captures either the `;base64` string itself or, when the token is absent, None (also falsey).
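Concretely, with the fragments joined into one pattern, the second group is the `;base64` string when the token is present and a falsey value when it is absent:

```python
import re

# The fragments above, concatenated into a single pattern.
DATA_URL_REGEX = r'^data:([^;]*)?(;base64)?,(.*)$'

with_token = re.match(DATA_URL_REGEX, 'data:image/png;base64,abcd')
print(with_token.group(2))  # prints: ;base64

without_token = re.match(DATA_URL_REGEX, 'data:text/plain,hello')
print(without_token.group(2))  # prints: None
```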