Add DataURL validator
Closes #347
As discussed in the linked issue, here’s a first stab at the DataURL validator. Locally I added py39 and py310 to
https://github.com/Pylons/colander/blob/048fb24eeb6c3df21831413943dbf89d7b5776e4/tox.ini#L5
and perhaps that should also be added to the GitHub Actions workflow (see PR https://github.com/Pylons/colander/pull/349)? I’ve not added translations to the locales.
Question: should we pass three msg arguments to the constructor, one for each of the three different errors, or raise just one error message for all cases?
Shame on me. @jenstroeger I forgot to ask you to sign CONTRIBUTORS.txt in the other PR, but we can do that in this one.
A rebase on master or merging master into this branch should fix the CI.
This LGTM. I've requested a review from any maintainer so they can cut a release after approval and merge. Thank you!
@mcdonc @ericof @miohtama @mmerickel @tseaver
Thank you @stevepiercy. Any thoughts on the msg parameters for the DataURL class? And should I add the locales, or not bother for now?
Regarding msg, I think when more than one part of the data URL is not valid, each invalid part should return its own error message, and each message should be customizable. Otherwise the user might need to submit their data up to three times, each attempt returning a different error message.
For locales, see my comment on the linked issue. I am not sure what to use for an unknown language's translation string. I assume that if the translation has an empty string or None, then the default is used, but don't hold me to that. Someone more familiar with the package translationstring would know.
Ok on msg and I’ll adjust the code accordingly, similar to how All handles multiple failures in a single raised exception:
https://github.com/Pylons/colander/blob/a1c525c1f172dc43bd82ba62bcb5521f853a760c/src/colander/__init__.py#L260-L263
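For reference, here is a stdlib-only sketch of that shape; the colander wiring (raising `colander.Invalid` with the collected messages) is omitted, the regex is a simplified stand-in for the PR’s `DATA_URL_REGEX`, and the message texts are placeholders:

```python
import base64
import mimetypes
import re

# Simplified stand-in for the PR's DATA_URL_REGEX, not the exact pattern.
DATA_URL = re.compile(r'^data:([^;,]*)?(;base64)?,(.*)$')

def data_url_errors(value):
    """Check each part of a data URL and collect one message per invalid part."""
    match = DATA_URL.match(value)
    if match is None:
        return ['Not a data URL']
    mime, token, data = match.groups()
    errors = []
    if mime and mime not in mimetypes.types_map.values():
        errors.append('MIME type not valid')
    if token:
        try:
            base64.standard_b64decode(data)
        except ValueError:  # binascii.Error is a ValueError subclass
            errors.append('Data not valid Base64')
    return errors
```

A validator could then raise a single exception carrying all collected messages, much like `All` does.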
Looks like translationstring wraps around gettext which uses the translation catalogues in locale/. In that case, missing translation strings simply default (in this case to English) but I’ll take a closer look before touching that part…
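To illustrate that fallback behaviour with the stdlib directly (the domain name and paths below are just placeholders):

```python
import gettext

# With fallback=True a missing catalogue yields NullTranslations,
# whose gettext() returns the message id unchanged, i.e. the English default.
t = gettext.translation(
    'colander',                  # domain (placeholder)
    localedir='does-not-exist',  # no catalogues here
    languages=['xx'],            # unknown language
    fallback=True,
)
print(t.gettext('Invalid data URL'))  # prints: Invalid data URL
```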
Actually, I think the current implementation is somewhat cumbersome because both MIME type and Base64 data will be handled twice — first by the validator, and then again by the user who unpacks the URL string.
With that in mind it probably makes more sense to implement DataURL as a Type (maybe sub-classing String) such that
data_url = colander.SchemaNode(
    colander.DataURL(),
)
validates the data_url and assigns the data to it. Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
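A stdlib-only sketch of that one-pass idea (in colander this would live in the type’s `deserialize()`, raising `colander.Invalid` instead of `ValueError`; the key names and the simplified regex are assumptions, not the PR’s API):

```python
import base64
import mimetypes
import re
from urllib.parse import unquote_to_bytes

# Simplified stand-in for the PR's DATA_URL_REGEX.
DATA_URL = re.compile(r'^data:([^;,]*)?(;base64)?,(.*)$')

def deserialize_data_url(cstruct):
    """Validate a data URL and unpack its parts in a single pass."""
    match = DATA_URL.match(cstruct)
    if match is None:
        raise ValueError('Not a data URL')
    mime, token, data = match.groups()
    if mime and mime not in mimetypes.types_map.values():
        raise ValueError('MIME type not valid')
    # Base64-encoded payloads are decoded; plain payloads are percent-decoded.
    raw = base64.standard_b64decode(data) if token else unquote_to_bytes(data)
    return {
        'data_url': cstruct,  # original submitted value
        'data_url_mimetype': mime,
        'data_url_data': raw,
    }
```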
Actually, I think the current implementation is somewhat cumbersome because both MIME type and Base64 data will be handled twice — first by the validator, and then again by the user who unpacks the URL string.
With that in mind it probably makes more sense to implement DataURL as a Type (maybe sub-classing String) such that data_url = colander.SchemaNode(colander.DataURL()) validates the data_url and assigns the data to it. Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
Can you explain why the user needs to unpack the data URL string?
If valid, I would store the data URL as submitted. If not valid, I don't really need to present the part's value that the user entered as an error message, e.g., "Bad MIME type: 'foo/bar'". I would display a single concatenated error message. "Data URL not valid. MIME type not valid. Data not valid."
Can you explain why the user needs to unpack the data URL string?
If the data URL is valid then the user will take that URL and unpack the data from it. If the data URL is something like
data:image/png;base64,abcde...xyz==
then we have two potentially expensive passes over that one URL string:
- the validator
  - looks up all MIME types: `mime not in mimetypes.types_map.values()`
  - decodes the data to make sure it’s valid Base64: `base64.standard_b64decode(data)`
- the user then does the same again to unpack the data from the URL.
If we handle the data URL as a Type rather than using a validator I think we might be able to manage both, validation and data unpacking in one go.
I still don't follow the reason to do what you describe. I understand what you propose, but I do not understand why it is needed the second time. What does the user need to do with unpacked data? Maybe I don't understand what you mean by "unpack the data from the URL". Do you mean break the data URL into its 2 to 4 parts?
- protocol: `data:`
- optional mediatype: `image/png`
- optional token: `;base64`
- data: `,abcde...xyz==`
...and then store each of these parts as key/value pairs, an instance of a type, or in separate columns in a relational database?
I assume that it would be fine to store the whole thing data:image/png;base64,abcde...xyz== as a string.
Maybe a bit of context on how I use Colander. Suppose I define a schema
class SomeRequestSchema(colander.MappingSchema):
    data_url = colander.SchemaNode(
        colander.String(),
        validator=colander.DataURL(),
    )
then the request.validated (managed by Cornice’s Colander body validator) would contain the data URL string. The validation went through the above steps, i.e. extracted and checked the MIME type and extracted and decoded the Base64 data — and discarded the results.
Now when the handler processes the request, it again goes through these steps in order to get to the data URL’s MIME type and data. That duplication is what I’d like to avoid.
Or am I missing something?
I'm not familiar with Cornice, so I don't know why it would do two passes on the same data. But from what I can tell on this line:
https://github.com/Cornices/cornice/blob/15dfca94b7cc31e285f9290579399d5a57c6e07a/cornice/validators/_colander.py#L120
...it looks like it only does one pass and returns the deserialized data. What that looks like, I don't know, but I assume it is a dict or multidict.
In Deform you can try the demo to see what it does, and look at its source code:
https://deformdemo.pylonsproject.org/textinput/
Enter:
data:image/png;base64,abcde...xyz==
The captured submission returns:
{'text': 'data:image/png;base64,abcde...xyz=='}
This is generated by:
https://github.com/Pylons/deformdemo/blob/f124a2bc846c9b66510255065f2e07fb7a87a676/deformdemo/__init__.py#L105-L106
There is only one pass on validation. The submitted string is returned in a dict, where the schema name is the key and the submitted string is the value.
In Deform you can try the demo to see what it does, and look at its source code:
https://deformdemo.pylonsproject.org/textinput/
Enter:
data:image/png;base64,abcde...xyz==
The captured submission returns:
{'text': 'data:image/png;base64,abcde...xyz=='}
Yup 🤓 And now I’d like to get the data out of the validated URL:
data_url = validated["text"]
mime, is_b64, data = colander.DataURL.match(data_url).groups()  # This is safe because the URL is already validated.
if is_b64:
    data = base64.standard_b64decode(data)  # Decode a second time.
# Use `mime` and `data` from here on…
So I had to match the regex again and then decode the data again which is what I mean above with “That duplication is what I’d like to avoid.”
Yup 🤓 And now I’d like to get the data out of the validated URL:
Yes, but why? Once you get it then decode it to bytes as in your example, what would you do with it?
I mean, the whole point of data URLs is to keep things inline to reduce network requests. Do you intend to use the bytes for something other than inline HTML?
Do you intend to use the bytes for something other than inline HTML?
Oh… this has nothing to do with HTML! I actually have users who want to send a data URL in a REST request’s JSON body where a URL is required—a way of transferring small files instead of uploading them someplace. So instead of fetching data from the URL, I need to decode the data URL to use the file.
Ah, now I see. That makes sense.
Is there any other type in Colander that returns the parts? I did not see one, so this would be a new feature.
My only concern is that the capture should contain the original submitted value and be easily consumable, similar to regular URLs and emails.
Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
I would prefer this way. Maintainers might have different opinions.
Would the data_token be needed as well?
My only concern is that the capture should contain the original submitted value and be easily consumable, similar to regular URLs and emails.
Agreed.
Or we leave data_url alone and the type adds data_url_mimetype and data_url_data to the node as part of its deserialization?
I would prefer this way. Maintainers might have different opinions.
Let’s wait and see if one or more of them respond to this conversation.
Would the data_token be needed as well?
Which one is the “token” here?
Which one is the “token” here?
From https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs#syntax
[;base64]
Ah, yes, that’s needed to know whether the data itself is plain or Base64-encoded. It’s the second group in the DATA_URL_REGEX:
# https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URLs
DATA_URL_REGEX = (
    # data: (required)
    r'^data:'
    # optional mime type
    r'([^;]*)?'
    # optional base64 identifier
    r'(;base64)?'
    # actual data follows the comma
    r',(.*)$'
)
which captures either the `;base64` string itself or, when the token is absent, None (also falsey).
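Concretely, with the fragments joined into one pattern, the second group is the `;base64` string when the token is present and a falsey value when it is absent:

```python
import re

# The fragments above, concatenated into a single pattern.
DATA_URL_REGEX = r'^data:([^;]*)?(;base64)?,(.*)$'

with_token = re.match(DATA_URL_REGEX, 'data:image/png;base64,abcd')
print(with_token.group(2))  # prints: ;base64

without_token = re.match(DATA_URL_REGEX, 'data:text/plain,hello')
print(without_token.group(2))  # prints: None
```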