`find` returns files that then raise `FileNotFoundError` when opened, due to a stale directory cache
It looks like `find` does not update the dircache, so files created since a previous listing are returned by `find` but then raise `FileNotFoundError` when you try to open them.
You can reproduce this by running two separate Python processes and executing the commands below in order (using the BUCKET environment variable, and assuming AWS access is configured):
Process 1:
import s3fs
import os
fs = s3fs.S3FileSystem()
bucket = os.environ['BUCKET']
with fs.open(f's3://{bucket}/my_new_folder/file1.txt', 'w') as f: f.write('good')
Process 2:
import s3fs
import os
fs = s3fs.S3FileSystem()
bucket = os.environ['BUCKET']
files = fs.find(f's3://{bucket}/my_new_folder')
print(files) # expect to see file1.txt
with fs.open(f's3://{bucket}/my_new_folder/file1.txt') as f: print(f.read())
Process 1:
with fs.open(f's3://{bucket}/my_new_folder/file2.txt', 'w') as f: f.write('bad')
Process 2:
files = fs.find(f's3://{bucket}/my_new_folder')
print(files) # expect to see file1.txt and file2.txt
with fs.open(f's3://{bucket}/my_new_folder/file2.txt') as f: print(f.read()) # this will throw FileNotFound
Example Stacktrace:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/spec.py", line 1034, in open
f = self._open(
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/s3fs/core.py", line 605, in _open
return S3File(
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/s3fs/core.py", line 1919, in __init__
super().__init__(
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/spec.py", line 1382, in __init__
self.size = self.details["size"]
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/spec.py", line 1395, in details
self._details = self.fs.info(self.path)
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/asyn.py", line 111, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/asyn.py", line 96, in sync
raise return_result
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
result[0] = await coro
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/s3fs/core.py", line 1132, in _info
if not refresh and self._ls_from_cache(path) is not None:
File "/Users/kosborne/.pyenv/versions/3.9.6/lib/python3.9/site-packages/fsspec/spec.py", line 359, in _ls_from_cache
raise FileNotFoundError(path)
FileNotFoundError: ascend-io-dev-kyle/my_new_folder/file2.txt
Thank you for the report. There was a plan to have find() populate the directory cache, and I cannot remember why it doesn't. In the meantime, of course you can clear the normal file cache to make sure that this doesn't happen to you. Probably that's not a satisfactory workaround.
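For reference, a minimal sketch of that cache-clearing workaround using the existing invalidate_cache() method (the bucket and key names are just placeholders matching the repro above):

import s3fs

fs = s3fs.S3FileSystem()
print(fs.find('my-bucket/my_new_folder'))        # lists S3 directly, so it sees the new file
fs.invalidate_cache('my-bucket/my_new_folder')   # drop the stale cached listing for this prefix
with fs.open('my-bucket/my_new_folder/file2.txt') as f:  # info() now goes to S3 instead of the cache
    print(f.read())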
Can you please check whether this is still an issue?
I upgraded to s3fs==2022.11.0 and can still reproduce the issue.
Here is a script that can be run independently to reproduce the issue (writes files to folder-to-repro-s3fs-inc-657 in your bucket)
usage
$ BUCKET=<your_aws_bucket> python3 <this_file>
import os
import s3fs
import subprocess
import tempfile

if 'BUCKET' not in os.environ:
    raise Exception('please set the "BUCKET" env var to your AWS Bucket')

prefix = f's3://{os.environ["BUCKET"]}/my_new_folder'

fs = s3fs.S3FileSystem()
if fs.isdir(prefix):
    fs.rm(prefix, recursive=True)

def write_file_in_subprocess(name):
    with tempfile.NamedTemporaryFile() as f:
        f.write(f"""
import s3fs
fs = s3fs.S3FileSystem()
with fs.open('{prefix}/{name}', 'w') as f: f.write('data')
""".encode('utf-8'))
        f.flush()
        subprocess.run(args=['python3', f.name])

write_file_in_subprocess('file1.txt')
print(fs.find(prefix))
with fs.open(f'{prefix}/file1.txt') as f: print(f.read())

write_file_in_subprocess('file2.txt')
fs = s3fs.S3FileSystem()
print(fs.find(prefix))
# workaround:
# fs.ls(prefix, refresh=True)
with fs.open(f'{prefix}/file2.txt') as f: print(f.read())
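For completeness, the commented-out workaround above just forces a re-listing of the prefix before opening, which repopulates the directory cache. A sketch of the pattern in isolation (using the same prefix variable as the script):

fs.ls(prefix, refresh=True)   # re-list the prefix so the dircache picks up file2.txt
with fs.open(f'{prefix}/file2.txt') as f:
    print(f.read())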
We are also running into the same issue.
Directories are only returned the first time find(..., withdirs=True) is called. This can cause other libraries (e.g. pyarrow) not to create directories before trying to download files, resulting in FileNotFoundError.
Here is a simple reproduction script (I'm using moto_server here):
import s3fs

fs = s3fs.S3FileSystem(
    client_kwargs={
        "endpoint_url": "http://127.0.0.1:5000",
        "aws_access_key_id": "testing",
        "aws_secret_access_key": "testing",
    }
)
fs.mkdir("bucket/test")
fs.mkdir("bucket/test/sub")
fs.write_text("bucket/test/file.txt", "some_text")
fs.write_text("bucket/test/sub/file.txt", "some_text")

print(fs.find("bucket", withdirs=True))
print(fs.find("bucket", withdirs=True))
Output:
['bucket/test', 'bucket/test/file.txt', 'bucket/test/sub', 'bucket/test/sub/file.txt']
['bucket/test/file.txt', 'bucket/test/sub/file.txt']
The workaround is to pass use_listings_cache=False to S3FileSystem (or to PyArrow's storage_kwargs).
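A minimal sketch of that workaround (the moto endpoint and bucket name are the same placeholders as above; use_listings_cache=False is a standard fsspec constructor option that disables the listings cache, at the cost of re-listing on every call):

import s3fs

# With the listings cache disabled, every find()/ls()/info() call re-lists S3.
fs = s3fs.S3FileSystem(
    use_listings_cache=False,
    client_kwargs={"endpoint_url": "http://127.0.0.1:5000"},
)
print(fs.find("bucket", withdirs=True))
print(fs.find("bucket", withdirs=True))  # directories now appear on both calls

The same option can be forwarded through whatever storage-options mechanism the calling library exposes (e.g. the PyArrow storage_kwargs mentioned above).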
@krfricke, what version do you have? I have a feeling this is fixed on main, perhaps not yet released. Prepared to be shown otherwise :)
This was on the latest release, but I've just installed from master (pip install -U git+https://github.com/fsspec/s3fs.git) and the error still comes up for me.
OK, thanks, I'll look into it.