parse_multipart_form_data for stream_request_body
Hi!
We have a nice feature for streaming the HTTP body (the stream_request_body decorator). With stream_request_body we receive chunks of the HTTP body in the data_received method, but this data is raw. It would be great to have a higher-level tool for working with the streaming body (in the case of multipart/form-data).
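To illustrate, here is a minimal sketch of what we have today with the public API: the handler only ever sees raw bytes, and any multipart parsing is left to the application.

```python
import tornado.web


@tornado.web.stream_request_body
class RawUploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Called after the headers are read, before any body chunks arrive.
        self.bytes_read = 0

    def data_received(self, chunk):
        # chunk is a raw slice of the request body; multipart boundaries and
        # part headers are not parsed for us.
        self.bytes_read += len(chunk)

    def post(self):
        self.write("received %d bytes" % self.bytes_read)
```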
If this hasn't been solved yet I can work on it, but I'm not sure about the architecture of the solution.
I want to implement something like a StreamParser, which takes the boundary and acts as a generator yielding file-like objects. We could instantiate this parser in the prepare method and then simply pass chunks to it in the data_received method. The parser would write to these file-like objects and use RAM efficiently. But this solution has drawbacks. At the very least, it isn't a universal solution if we want to handle the data with some custom logic. We also can't return any metadata from these file-like objects, and there can be important information such as an ObjectId when using, for example, MongoDB GridFS.
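A rough sketch of this first idea (StreamParser, extract_boundary, and file_factory are hypothetical names, just to show the shape):

```python
import tempfile

import tornado.web


@tornado.web.stream_request_body
class StreamParserHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Hypothetical helper that pulls the boundary out of the Content-Type
        # header; not an existing Tornado function.
        boundary = extract_boundary(self.request.headers["Content-Type"])
        # Hypothetical StreamParser: it creates a file-like object per part
        # (via file_factory) and writes the decoded payload data into it.
        self.parser = StreamParser(boundary, file_factory=tempfile.TemporaryFile)

    def data_received(self, chunk):
        self.parser.data_received(chunk)
```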
Another solution is to implement a subclass of RequestHandler with the stream_request_body decorator and overridden prepare and data_received methods. We could add some extra abstract methods such as open_file, close_file, and new_data, and pass them information about the current state along with the chunks, so we can write to a file and do whatever we want without thinking about boundaries or headers.
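Roughly like this (the hook names are just a sketch, and the multipart parsing itself is elided):

```python
import tornado.web


@tornado.web.stream_request_body
class StreamingMultipartHandler(tornado.web.RequestHandler):
    """Sketch of a base class; the boundary/header parsing itself is elided."""

    def prepare(self):
        # Read the boundary from Content-Type and set up parser state here.
        ...

    def data_received(self, chunk):
        # Parse boundaries and part headers, then call the hooks below with
        # decoded payload data only.
        ...

    # Hooks for subclasses, which never have to think about boundaries.
    def open_file(self, name, filename, headers):
        pass

    def new_data(self, data):
        pass

    def close_file(self):
        pass
```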
What do you think about it?
This seems like a useful idea; I believe you can implement it with Tornado's public API, so you could publish it as a separate package that provides a RequestHandler subclass for those users who need it. But perhaps it's close enough to Tornado's core feature set that it should be part of Tornado itself.
Ben, curious what you think?
I think this belongs in Tornado itself, although maybe it should first be introduced as a separate package so we can be sure we have the API right. Whatever we come up with should be extensible to handle other kinds of parsing too, so it can't just be a subclass of RequestHandler (although we could provide one for convenience if that's the only parser you need).
My thought on the API is to define a StreamingFormDataParser class, whose constructor takes a StreamingFormDataParserDelegate (and maybe a request object or a headers dict). The delegate interface defines methods like start_file, file_data_received, finish_file. The parser has a data_received and finish method, which parses a chunk of data and calls methods on the delegate. You could use this on its own, but the most common use case would be to create a parser in prepare using self as the delegate, and forward data_received and finish calls to the parser. Or there could be pre-made delegates that could store the files directly in s3/gridfs/etc.
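Roughly like this (none of these classes exist in Tornado today; this is just a sketch of the proposed shape):

```python
import tornado.web


class StreamingFormDataParserDelegate:
    # Proposed delegate interface; the parser calls these back as it works
    # through the body.
    def start_file(self, name, filename, headers):
        pass

    def file_data_received(self, data):
        pass

    def finish_file(self):
        pass


@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler, StreamingFormDataParserDelegate):
    def prepare(self):
        # Proposed parser class (does not exist yet); the handler passes
        # itself as the delegate.
        self.parser = StreamingFormDataParser(self, self.request.headers)

    def data_received(self, chunk):
        self.parser.data_received(chunk)

    def post(self):
        self.parser.finish()
        self.write("upload complete")
```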
We ran into the same problem recently, so I ended up writing a small module for this purpose. I started with the API suggested in Ben's comment, but ended up changing it slightly as I made progress. Although if I think about it, it's really not all that different.
Here is the source, and it's pip-installable under the name streaming-form-data. It uses Cython (and the generated .c file is included), so it'll end up building a C extension on installation.
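Roughly, you register a target per form field and then forward the chunks from data_received; something like this (check the README for the exact, current API):

```python
import tornado.web

from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget, ValueTarget


@tornado.web.stream_request_body
class UploadHandler(tornado.web.RequestHandler):
    def prepare(self):
        self.parser = StreamingFormDataParser(headers=self.request.headers)
        # The field names here ("name", "file") and the output path are just
        # examples; each expected form field gets its own target.
        self.parser.register("name", ValueTarget())
        self.parser.register("file", FileTarget("/tmp/upload.bin"))

    def data_received(self, chunk):
        self.parser.data_received(chunk)

    def post(self):
        self.write("ok")
```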
So far, it seems to work fine. To be extra sure about correctness, I also imported the test cases for tornado.httputil.parse_multipart_form_data, in addition to writing some tests of my own.
Anyway, I hope this is helpful, and I'm curious to hear your thoughts!
I really need this feature too. Is there an appropriate way to process a file upload stream now?
@sizeoftank I'll publish my library soon.
@sizeoftank https://github.com/Lookyan/tornado-stream-request
Is it still possible to add streaming multipart/form-data parsing to Tornado itself, or has Tornado deferred to the third-party library posted above? Lookyan's library works great and is simple enough, and it may be mature enough to add to the Tornado mainline (for the purposes of support).
I can also create a PR and so forth if needed.
I still think it makes sense to try and get something for this into Tornado directly. We have two implementations mentioned in this thread, one by @Lookyan and one by @siddhantgoel. @Lookyan 's implementation appears to match the interface I described in my last comment and looks nice and simple, although it's missing a license declaration so we'd need to make sure that we have permission to reuse that code. @siddhantgoel 's implementation is a bit more complicated, but perhaps in good ways (it recognizes that not everything in a multipart/form-data is a file; some of it is just regular form fields). However, it uses Cython and I don't think we want to accept that as a part of the Tornado build process.
So if you'd like to help out, I think the best thing to do would be to first confirm with @Lookyan that they are willing to license this code so that it can be used in Tornado, and then prepare a PR.
I actually had my own implementation here: https://github.com/eulersIDcrisis/torncoder/blob/main/torncoder/file_util/_parser.py
The implementation is my own work, and I wrote a "thorough" test case that slices the input at every possible position (to exercise all of the possible corner cases). I can clean this up and create a PR here. I'd be willing to license it however necessary.
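To show what I mean by slicing the input: each test feeds the same multipart body to a fresh parser, split at every possible offset, so boundaries and part headers get cut in every place (make_parser and collect_result here stand in for whatever entry points end up in the PR):

```python
def iter_two_chunk_splits(body: bytes):
    # Yield the body as two chunks, split at every possible offset, so that
    # boundaries and part headers get cut at each position.
    for i in range(len(body) + 1):
        yield body[:i], body[i:]


def check_all_split_points(body: bytes, expected):
    for first, second in iter_two_chunk_splits(body):
        parser = make_parser()          # hypothetical: build a fresh parser
        parser.data_received(first)
        parser.data_received(second)
        parser.finish()
        assert collect_result(parser) == expected  # hypothetical accessor
```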
That link is broken for me, but yes, if you've got something that is your own work feel free to use it instead of one of the other implementations linked in this thread.
Sorry, I guess my repo there is private; I use it with various things when dealing with tornado. I guess I could make it public. Anyway, I've started a PR here: https://github.com/tornadoweb/tornado/pull/3228
I know I need to clean up the test cases, run linting, etc. I also think there is some (really small) optimization/check to be had by limiting the size of the internal buffer in a few more cases, but it is a start.