large file download fails with OverflowError
On my 32-bit linux machine, files over 2GB fail to download. Memory usage while my test script runs gets very high, suggesting the entire download is being cached in memory; I think the download should be streamed to disk instead.
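By "streamed to disk" I mean the usual chunked pattern sketched below (illustration only, using requests rather than PyDrive's httplib2; the URL and path are placeholders):

import requests  # illustration only; PyDrive itself goes through httplib2

def stream_to_disk(url, path, chunk_size=1024 * 1024):
    # Write the body to disk one chunk at a time, so only about chunk_size
    # bytes are ever held in memory, regardless of the total file size.
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    with open(path, 'wb') as out:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            if chunk:
                out.write(chunk)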
To use the script, upload a large file called bigvid.avi to google drive and put client_secrets.json in the working directory.
$ ./test.py
bigvid.avi
Traceback (most recent call last):
  File "./test.py", line 17, in <module>
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 167, in GetContentFile
    self.FetchContent(mimetype)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 36, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 198, in FetchContent
    self.content = io.BytesIO(self._DownloadFromUrl(download_url))
  File "/usr/local/lib/python2.7/dist-packages/pydrive/auth.py", line 54, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 313, in _DownloadFromUrl
    resp, content = self.auth.service._http.request(url)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 135, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 547, in new_request
    redirections, connection_type)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1593, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1335, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1318, in _conn_request
    content = response.read()
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 624, in _read_chunked
    return ''.join(value)
OverflowError: join() result is too long for a Python string
$ cat test.py
#!/usr/bin/env python
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
if not gauth.LoadCredentialsFile("auth.txt"):
    gauth.CommandLineAuth()
    gauth.SaveCredentialsFile("auth.txt")
drive = GoogleDrive(gauth)
filelist = drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in filelist:
    print f['title']
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')
$ ls -l /tmp/big*
ls: cannot access /tmp/big*: No such file or directory
This uses the google-api-python-client under the hood, and that is where the bug is. I am really sorry about this, though - it is appalling forethought to dump the entire thing into memory without streaming.
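To spell out what I mean (a small sketch of the buffering pattern, with a placeholder URL; not the actual library source):

import httplib2  # the same library the traceback above goes through

# http.request() returns the complete body as one string, so a >2GB download
# needs >2GB of memory, and overflows the Python 2 string limit on 32-bit builds.
http = httplib2.Http()
resp, content = http.request('https://example.com/bigvid.avi')  # placeholder URL
with open('/tmp/bigvid.avi', 'wb') as out:
    out.write(content)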
You might want to have a look at https://github.com/googledrive/PyDrive/issues/27
I don't have a 32-bit system handy for testing, but could you report whether replacing
filelist=drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in filelist:
    print f['title']
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')
with
local_file = io.FileIO('/tmp/bigvid-from-pydrive.avi', mode='wb')
file_list = drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in file_list:
    print f['title']
    id = f.metadata.get('id')
    request = drive.auth.service.files().get_media(fileId=id)
    downloader = MediaIoBaseDownload(local_file, request, chunksize=2048*1024)
    done = False
    while done is False:
        status, done = downloader.next_chunk()
local_file.close()
works (you'll probably need a from apiclient.http import MediaIoBaseDownload somewhere)?
Inasmuch as it seems to download a 4 GB file of random data, without any serious memory use, on my machine, I posit the dreaded "works on my machine", but that is a 64-bit one.
If it does work, I think I can cook up a way to let PyDrive decide to do this for files over a certain size. I shall then open a feature request to solicit responses on what that limit should be, and on whether the limit should also serve as the chunk size.
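To make the size-based decision concrete, it could look roughly like the helper below; get_content_file and CHUNKED_THRESHOLD are made-up names for illustration, not existing PyDrive API:

import io
from apiclient.http import MediaIoBaseDownload

CHUNKED_THRESHOLD = 100 * 1024 * 1024  # hypothetical cutoff; the right value is what the feature request would discuss

def get_content_file(drive, f, filename, chunksize=2048 * 1024):
    # Small files keep the current in-memory path; large files stream to disk
    # in chunks via MediaIoBaseDownload.
    size = int(f.metadata.get('fileSize', 0))
    if size <= CHUNKED_THRESHOLD:
        f.GetContentFile(filename)  # existing behaviour
        return
    request = drive.auth.service.files().get_media(fileId=f.metadata.get('id'))
    with io.FileIO(filename, mode='wb') as out:
        downloader = MediaIoBaseDownload(out, request, chunksize=chunksize)
        done = False
        while not done:
            status, done = downloader.next_chunk()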
Thanks, this works!
(I upped the chunk size by a factor of 10 to save time. Otherwise it was rather slow.)
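Concretely, that just means changing the chunksize argument in the snippet above:

downloader = MediaIoBaseDownload(local_file, request, chunksize=20480*1024)  # 20 MB chunks instead of 2 MB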
@rupertlevene This should be resolved now. Post here if you are still encountering this issue.
Reopening: as @Fjodor42 points out, and as #62 references, there is no verification that this has been resolved.
Thank you for the solution. Side note: the imports needed are:
import io
from googleapiclient.http import MediaIoBaseDownload
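For completeness, here is the whole workaround stitched together from the pieces in this thread (a sketch, untested as a single unit; same placeholder file name and paths as the original script):

#!/usr/bin/env python
import io
from googleapiclient.http import MediaIoBaseDownload
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
if not gauth.LoadCredentialsFile("auth.txt"):
    gauth.CommandLineAuth()
    gauth.SaveCredentialsFile("auth.txt")
drive = GoogleDrive(gauth)

local_file = io.FileIO('/tmp/bigvid-from-pydrive.avi', mode='wb')
for f in drive.ListFile({'q': "title='bigvid.avi'"}).GetList():
    print f['title']
    request = drive.auth.service.files().get_media(fileId=f.metadata.get('id'))
    downloader = MediaIoBaseDownload(local_file, request, chunksize=20480*1024)
    done = False
    while not done:
        status, done = downloader.next_chunk()
local_file.close()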
@smichaud btw, GetContentFile has been rewritten (among other fixes and improvements) in iterative/PyDrive2 - a maintained fork. It uses MediaIoBaseDownload internally and should work out of the box ... here is an example of how it is used in DVC:
https://github.com/iterative/dvc/blob/b57077af11ae287941b4d2939071fda2ad01f483/dvc/remote/gdrive.py#L376
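For reference, downloading the same file with the PyDrive2 fork would look roughly like this (a minimal sketch; assumes the pydrive2 package is installed and credentials are set up):

from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive

gauth = GoogleAuth()
gauth.CommandLineAuth()
drive = GoogleDrive(gauth)
for f in drive.ListFile({'q': "title='bigvid.avi'"}).GetList():
    # In PyDrive2, GetContentFile streams via MediaIoBaseDownload internally
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')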