multiple users on same system encounter permissions errors
Environment
- OS: [Ubuntu 22.04]
- Hardware (GPU, or instance type): no gpu
To reproduce
Steps to reproduce the behavior:
- Create a StreamingDataset
- Log in as a different user
- Create a StreamingDataset
- Encounter a permissions error on trying to write to /tmp/streaming
A similar issue related to shared memory that only happens if your process crashes:
- Create a StreamingDataset
- Kill the process in an unclean way so that python doesn't have a chance to clean up shared memory (e.g. kill -9)
- Delete /tmp/streaming to prevent the first issue.
- Log in as a different user
- Create a StreamingDataset
- Encounter a permissions error on a shared memory resource (e.g. PermissionError: [Errno 13] Permission denied: '/000000_locals')
You can verify this with the following script:
from streaming import MDSWriter
from streaming import StreamingDataset
import tempfile
### generate a sample dataset...
# directory for a sample dataset
out_root = tempfile.TemporaryDirectory().name
local_root = tempfile.TemporaryDirectory().name
print("out_root: ",out_root)
# A dictionary of input fields to an Encoder/Decoder type
columns = {
    'anumber': 'int'
}
# some sample data
samples = [
    {
        'anumber': 123123
    }
    for _ in range(2)
]
# write out the sample data
print("writing data...")
with MDSWriter(out=out_root, columns=columns) as out:
    for sample in samples:
        out.write(sample)
### finished generating the sample dataset
print("creating dataset object...")
remote_dir = out_root
local_dir = local_root
dataset = StreamingDataset(remote=remote_dir, local=local_dir, split=None)
for x in enumerate(dataset):
    print(x)
input("wait for input (so you have a chance to kill the process in an unclean way)")
Expected behavior
No errors.
Additional context
StreamingDataset initialization creates the directory /tmp/streaming if it does not exist yet, so the first user owns that directory.
Subsequent users on the same system are then locked out unless the first user manually chmods the directory or the system cleans up /tmp/streaming.
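To illustrate the ownership problem without touching the real /tmp/streaming, here is a minimal sketch using only the standard library. The helper name `describe_dir` and the stand-in directory are assumptions made for this example and are not part of streaming. It creates a directory with os.makedirs under a umask of 0o022 (a common default) and reports whether other users could write to it.

```python
import os
import stat
import tempfile


def describe_dir(path):
    """Return (owner_uid, permission_bits, world_writable) for a directory."""
    info = os.stat(path)
    mode = stat.S_IMODE(info.st_mode)
    return info.st_uid, mode, bool(mode & stat.S_IWOTH)


# Simulate the first user creating the directory under a umask of 0o022.
base = tempfile.mkdtemp()
demo_dir = os.path.join(base, "streaming")  # stand-in for /tmp/streaming
old_umask = os.umask(0o022)
try:
    os.makedirs(demo_dir, exist_ok=True)
finally:
    os.umask(old_umask)  # restore the caller's umask

uid, mode, world_writable = describe_dir(demo_dir)
print(f"owner uid={uid} mode={oct(mode)} world_writable={world_writable}")
```

With that umask the directory ends up with mode 0o755, so only the creating user can add files to it, which matches the lockout described above.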
A similar issue can happen with the SharedMemory objects in /dev/shm.
Using the clean_stale_shared_memory function doesn't fix this because it encounters the same permissions error.
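To see which shared memory entries are affected on a given node, something like the following sketch may help. The function name `list_blocked_entries` is hypothetical, and the sketch assumes a Linux system where POSIX shared memory is exposed under /dev/shm. It only reports entries; it does not delete or modify anything.

```python
import os


def list_blocked_entries(directory="/dev/shm"):
    """Return names in `directory` that the current user cannot read and write."""
    blocked = []
    for entry in os.scandir(directory):
        # os.access checks permissions for the real uid/gid of this process.
        if not os.access(entry.path, os.R_OK | os.W_OK):
            blocked.append(entry.name)
    return sorted(blocked)


if os.path.isdir("/dev/shm"):
    print("inaccessible entries:", list_blocked_entries())
```

Any name printed here that looks like 000000_locals was most likely left behind by another user's process, and attaching to it would raise the PermissionError shown above.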
Hey! So we looked into this and weren't able to reproduce the first behavior, but we were able to reproduce the second (PermissionError: [Errno 13] Permission denied: '/000000_locals'). This happens because we need to access each existing SharedMemory file to check for potential collisions between the local directory names of different StreamingDatasets. Without this check, multiple StreamingDatasets could point to the same local directories, messing up the samples.
For this case, we would recommend making sure each user creating StreamingDatasets has the same permissions, or updating user permissions to make sure that /tmp/streaming and the SharedMemory files are accessible. Thanks for identifying the issue and submitting this PR!
Thanks for looking into it!
I'm confused how the first issue didn't replicate - my understanding of the issue was that this os.makedirs call made it so that /tmp/streaming is only writeable by the creating user. Was it perhaps the case for you that /tmp/streaming/ was already present and globally writeable rather than being created for the first time? This would explain the difference since the shared memory is cleaned up when the process exits normally and so would actually have been created by the first process. In my scenario /tmp/streaming did not exist yet.
For my system (an academic cluster), it's not easy to ensure that /tmp/streaming is accessible since I don't have any special privileges. Right now I've told my students to just chmod /tmp/streaming as part of every job they launch in case they are the first to be scheduled on some node, but if any other group starts using streaming then we'll have problems again.
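For reference, the chmod workaround could be scripted roughly as below. This is only a sketch: `make_shared` is a hypothetical helper, and the mode 0o1777 (world-writable with the sticky bit, the usual permissions of /tmp itself) is my assumption about what "accessible" should mean here. It only works when the caller owns the directory or created it; otherwise os.chmod raises PermissionError, which is exactly the case that cannot be fixed without the first user's help.

```python
import os
import stat
import tempfile


def make_shared(path):
    """Create `path` if needed and open it to all users, with the sticky bit set."""
    os.makedirs(path, exist_ok=True)
    # The sticky bit stops users from deleting or renaming each other's files.
    os.chmod(path, 0o1777)
    return stat.S_IMODE(os.stat(path).st_mode)


# Demonstrate on a throwaway directory instead of the real /tmp/streaming.
demo_dir = os.path.join(tempfile.mkdtemp(), "streaming")
print(oct(make_shared(demo_dir)))
```

In a job script this would be called with /tmp/streaming before the first StreamingDataset is created.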
I just double-checked, and the /tmp/streaming directory was not present when creating the first streaming dataset. Even when creating equally permissioned users, and not killing the process in an unclean way (it's still running when the second streaming dataset is created), I can only reproduce the Permission denied: '/000000_locals' error. That's due to the local directory check that's needed before initializing every streaming dataset. If possible, could you send over the stack trace with the error for /tmp/streaming? Would love to take a look.
Closing the issue due to inactivity. Please feel free to re-open if you think this is still an issue.
@karan6181 Reopening this issue because I ran into the same problem on my university cluster. There is also an easy solution: instead of hardcoding /tmp/streaming, can we have it respect the TMPDIR env var so each user can set a custom location?
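A minimal sketch of what I mean, using only the standard library. `streaming_root` is a hypothetical name, and this is not necessarily how streaming resolves its directory. tempfile.gettempdir() consults TMPDIR (then TEMP and TMP) before falling back to platform defaults such as /tmp.

```python
import os
import tempfile


def streaming_root():
    """Hypothetical helper: base directory that honours TMPDIR instead of /tmp."""
    return os.path.join(tempfile.gettempdir(), "streaming")


# Point TMPDIR at a per-user location (a throwaway directory in this demo).
custom = tempfile.mkdtemp()
os.environ["TMPDIR"] = custom
tempfile.tempdir = None  # clear tempfile's cached value so TMPDIR is re-read
print(streaming_root())
```

Each user could then export TMPDIR to a directory they own before launching a job, and no two users would share the same streaming directory.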
https://github.com/mosaicml/streaming/pull/570 is merged.
@knighton is it already in a release? I still have this problem (PermissionError: [Errno 13] Permission denied: '/000000_locals')
@Oktai15 are you still seeing this with the latest version of streaming?