
large file download fails with OverflowError

rupertlevene opened this issue on Feb 11 '15 · 8 comments

On my 32-bit linux machine, files over 2GB fail to download. Memory usage while my test script runs gets very high, suggesting the entire download is being cached in memory; I think the download should be streamed to disk instead.

To use the script, upload a large file called bigvid.avi to google drive and put client_secrets.json in the working directory.

$ ./test.py 
bigvid.avi
Traceback (most recent call last):
  File "./test.py", line 17, in <module>
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 167, in GetContentFile
    self.FetchContent(mimetype)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 36, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 198, in FetchContent
    self.content = io.BytesIO(self._DownloadFromUrl(download_url))
  File "/usr/local/lib/python2.7/dist-packages/pydrive/auth.py", line 54, in _decorated
    return decoratee(self, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pydrive/files.py", line 313, in _DownloadFromUrl
    resp, content = self.auth.service._http.request(url)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/util.py", line 135, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/client.py", line 547, in new_request
    redirections, connection_type)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1593, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1335, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1318, in _conn_request
    content = response.read()
  File "/usr/lib/python2.7/httplib.py", line 541, in read
    return self._read_chunked(amt)
  File "/usr/lib/python2.7/httplib.py", line 624, in _read_chunked
    return ''.join(value)
OverflowError: join() result is too long for a Python string
$ cat test.py
#!/usr/bin/env python

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
if not gauth.LoadCredentialsFile("auth.txt"):
    gauth.CommandLineAuth()
    gauth.SaveCredentialsFile("auth.txt")

drive = GoogleDrive(gauth)

filelist = drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in filelist:
    print f['title']
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')

$ ls -l /tmp/big*
ls: cannot access /tmp/big*: No such file or directory

rupertlevene avatar Feb 11 '15 14:02 rupertlevene

This uses the google-api-python-client under the hood, and that is where the bug is. Still, I am really sorry about this: dumping the entire download into memory without streaming shows an appalling lack of forethought.

aliafshar avatar Feb 11 '15 17:02 aliafshar
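
For context, the traceback above shows the shape of the failure: _DownloadFromUrl asks httplib2 for the whole response in a single request() call, and PyDrive then wraps the result in a BytesIO (files.py line 198). A simplified sketch of that pattern, not PyDrive's exact code:

import io

import httplib2

def download_all_at_once(url):
    # httplib2's request() returns the entire response body as one
    # string, so the whole file must fit in memory before anything can
    # reach disk. On 32-bit Python 2, joining more than 2 GB of chunks
    # into a single str is what raises the OverflowError above.
    http = httplib2.Http()
    resp, content = http.request(url)
    return io.BytesIO(content)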

You might want to have a look at https://github.com/googledrive/PyDrive/issues/27

Fjodor42 avatar Feb 03 '16 09:02 Fjodor42

I don't have a 32-bit system handy for testing, but could you report whether replacing

filelist=drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
for f in filelist:
    print f['title'];
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')

with

local_file = io.FileIO('/tmp/bigvid-from-pydrive.avi', mode='wb')
for f in filelist:
    print f['title']
    file_id = f.metadata.get('id')
    request = drive.auth.service.files().get_media(fileId=file_id)
    downloader = MediaIoBaseDownload(local_file, request, chunksize=2048*1024)

    done = False

    while done is False:
        status, done = downloader.next_chunk()
local_file.close()

works (you'll probably need an import io and a from apiclient.http import MediaIoBaseDownload somewhere)?

Inasmuch as it seems to download a 4 GB file of random data on my machine without any serious memory use, I posit the dreaded "works on my machine", but that machine is a 64-bit one.

If it does work, I think I can cook up a way for PyDrive to decide to do this for files over a certain size. I would then want to open a feature request to solicit opinions on what that size limit should be, and on whether the limit should also serve as the chunk size.

Fjodor42 avatar Feb 17 '16 02:02 Fjodor42
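
For anyone who wants to try the suggestion above as-is, here is a self-contained sketch. It reuses the auth flow and file names from the original test.py and fills in the imports mentioned above; on newer installs the first import may need to come from googleapiclient.http instead:

#!/usr/bin/env python
# Self-contained sketch of the streaming workaround suggested above.
import io

from apiclient.http import MediaIoBaseDownload
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

gauth = GoogleAuth()
if not gauth.LoadCredentialsFile("auth.txt"):
    gauth.CommandLineAuth()
    gauth.SaveCredentialsFile("auth.txt")
drive = GoogleDrive(gauth)

filelist = drive.ListFile({'q': "title='bigvid.avi'"}).GetList()
local_file = io.FileIO('/tmp/bigvid-from-pydrive.avi', mode='wb')
for f in filelist:
    print(f['title'])
    # Bypass GetContentFile: fetch the media directly and stream it to
    # local_file in 2 MB chunks instead of buffering it all in memory.
    file_id = f.metadata.get('id')
    request = drive.auth.service.files().get_media(fileId=file_id)
    downloader = MediaIoBaseDownload(local_file, request, chunksize=2048 * 1024)
    done = False
    while not done:
        status, done = downloader.next_chunk()
local_file.close()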

Thanks, this works!

(I upped the chunk size by a factor of 10 to save time. Otherwise it was rather slow.)

rupertlevene avatar May 13 '16 16:05 rupertlevene
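
The factor-of-10 change mentioned above presumably amounts to nothing more than a larger chunksize argument:

# Presumed change: 2 MB chunks upped to 20 MB to reduce per-chunk request overhead.
downloader = MediaIoBaseDownload(local_file, request, chunksize=10 * 2048 * 1024)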

@rupertlevene This should be resolved now. Post here if you are still encountering this issue.

RNabel avatar Jun 08 '16 03:06 RNabel

Reopening: as @Fjodor42 points out and #62 references, there is no verification that this is resolved.

RNabel avatar Jun 09 '16 14:06 RNabel

Thank you @Fjodor42 for the solution above. Side note: the imports are now

import io
from googleapiclient.http import MediaIoBaseDownload

smichaud avatar Jun 09 '20 14:06 smichaud

@smichaud btw, GetContentFile has been rewritten (among other fixes and improvements) in iterative/PyDrive2, a maintained fork. It uses MediaIoBaseDownload internally and should work out of the box. Here is an example of how it is used in DVC:

https://github.com/iterative/dvc/blob/b57077af11ae287941b4d2939071fda2ad01f483/dvc/remote/gdrive.py#L376

shcheklein avatar Jun 09 '20 14:06 shcheklein
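
Assuming PyDrive2's drop-in compatibility, the original test.py would presumably only need its imports changed, e.g.:

# Sketch assuming PyDrive2 is a drop-in replacement: only the imports
# differ from the original test.py, and GetContentFile now streams the
# download via MediaIoBaseDownload internally.
from pydrive2.auth import GoogleAuth
from pydrive2.drive import GoogleDrive

gauth = GoogleAuth()
if not gauth.LoadCredentialsFile("auth.txt"):
    gauth.CommandLineAuth()
    gauth.SaveCredentialsFile("auth.txt")
drive = GoogleDrive(gauth)

for f in drive.ListFile({'q': "title='bigvid.avi'"}).GetList():
    print(f['title'])
    f.GetContentFile('/tmp/bigvid-from-pydrive.avi')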