mpifileutils

How to detect hardlinks?

Open timeu opened this issue 5 years ago • 9 comments

We would like to use dfind in combination with dwalk to find hardlinks, but dfind does not currently provide such an option. How difficult would it be to add this feature? As far as I can tell, dwalk uses stat to retrieve information about each file or directory, and some of that information feeds the filters (gid, atime, etc.). Is the entire stat struct stored in the binary output of dwalk? (The text output does not contain it.)

timeu, May 20 '20 14:05

Unfortunately, we don't record the hard link information from stat in the binary file from dwalk either. In fact, we don't even have an interface for hard links in our mfu_flist, which is a fundamental data structure that all of the tools are built around. That means adding full and proper support would take time and require a new binary file format.

Can you describe in more detail the operation you'd like to do?

adammoody, May 20 '20 17:05

Thanks for the quick reply. We are dealing with a storage migration where the source filesystem may contain symlinks and hardlinks. dsync syncs the symlinks properly, but hardlinks are not preserved: each one ends up as a separate regular file on the target filesystem. Coreutils cp has the --preserve=links option (implied by -a) to deal with this. My assumption was that even if mpifileutils cannot deal with hardlinks, we could at least filter the hardlinked files out of the source list and then copy them over using regular coreutils cp or rsync.

timeu, May 20 '20 17:05

Thinking about this more. It would take a lot of work (time) to add full support for hard links into the library, but one might be able to create a one-off filtering tool relatively quickly. It would require stat'ing each file twice. One could do a full mfu_flist_walk to get an initial list, then each process could stat each of its items within its local list. The process checks the hard link count of each item, and either copies the item into an output list or drops the item. Finally, the resulting filtered list is written out.
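
As a rough illustration of that filtering step, here is a minimal single-process sketch in plain Python. It does not use mfu_flist or MPI at all; the function name `split_by_link_count` is hypothetical, and the list of paths is assumed to come from an earlier walk. Each item gets a second lstat, and regular files with a link count above one go into one list while everything else goes into the other.

```python
import os
import stat


def split_by_link_count(paths):
    """Split paths into (hardlinked, others) using one extra lstat per item.

    Hypothetical helper for illustration; not part of mpifileutils.
    """
    hardlinked, others = [], []
    for path in paths:
        st = os.lstat(path)  # lstat: inspect the item itself, do not follow symlinks
        # Only regular files count here; directories usually have nlink >= 2 anyway.
        if stat.S_ISREG(st.st_mode) and st.st_nlink > 1:
            hardlinked.append(path)
        else:
            others.append(path)
    return hardlinked, others
```

In a parallel tool each process would presumably run the same check over its local portion of the list before the filtered list is written out.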

adammoody, May 21 '20 16:05

@adammoody That would actually be a good workaround. So basically one would write a new CLI tool that takes the output of dwalk, runs stat on every file again, and then outputs either the hardlinks or the list without hardlinks. I haven't written any C code for a long time, but I will have a look. It probably makes sense to take the code from dfind as a base and adapt it?

timeu, May 25 '20 10:05

I can likely hack up something to help get you started. Will do that in a bit.

adammoody, May 25 '20 23:05

Here is a branch with a one-off tool: https://github.com/hpc/mpifileutils/tree/hardlink

You can grab the source here and see the buildme script for the compile line: https://github.com/hpc/mpifileutils/tree/hardlink/src/dhardlink

This finds all regular files with a hard link count greater than one and adds those to a subset list. It prints a summary of hardlinked and non-hardlinked files. You can also have it output the list to a file.

This could be extended in a few ways. For example, there are sort and remap functions that could help if you wanted to list the set of files that all refer to the same inode. Given that, you could do a full copy of a directory (which would make multiple copies of hardlinked files), then send extras to drm to be deleted, and finally write another tool to hardlink those files again.
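
To make that idea concrete, here is a minimal sketch in plain Python, again without the mfu sort and remap functions. Both helper names (`group_by_inode`, `relink_copies`) are hypothetical. It assumes the destination tree already holds a full copy in which every hardlinked file was duplicated, and that all source paths live under `src_root`.

```python
import os
from collections import defaultdict


def group_by_inode(paths):
    """Map (st_dev, st_ino) -> sorted list of paths that share that inode.

    Hypothetical helper for illustration; not part of mpifileutils.
    """
    groups = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        groups[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes reachable through more than one of the given paths.
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


def relink_copies(groups, src_root, dst_root):
    """In dst_root, replace duplicate copies with hard links to one kept copy."""
    for names in groups.values():
        rel = [os.path.relpath(p, src_root) for p in names]
        keep = os.path.join(dst_root, rel[0])
        for extra in rel[1:]:
            target = os.path.join(dst_root, extra)
            os.remove(target)      # delete the extra copy (the role drm would play)
            os.link(keep, target)  # recreate the hard link in the destination
```

Note that removing before linking leaves a window in which the target is missing if the link call fails, so a real tool would want to handle that error.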

Feel free to submit PRs to this branch or I can help iterate on it with you.

adammoody, May 26 '20 00:05

@adammoody Wow, thanks a lot! That's awesome. I would maybe first extend the CLI parameters so that one can output both the list of hardlinks and the list of non-hardlinked files in the same run, since the code already maintains both lists. This way we could, for example, copy the non-hardlinked files with dsync and then copy the hardlinked ones using coreutils cp with the preserve-links option (as a workaround). I will also have a look at the sort and remap functions. I guess they are in the mfu library directory? Also, could I use the function at https://github.com/hpc/mpifileutils/blob/master/src/common/mfu_io.c#L233 to re-create the hardlinks?

timeu, May 26 '20 07:05

@adammoody I wonder if this can move forward? The fact that dsync (and dcp) copies each hardlinked file as a separate regular file is still a big limitation today: the result is not an exact sync between the source and destination trees.

sihara, Aug 25 '21 07:08

FYI, note that Lustre has a mechanism to fetch the hard links to a file in O(1). The "trusted.link" xattr contains the parent directory FID and the filename for hard links to a file (subject to space limitations in the xattr, up to 160 links AFAIR).

The llapi_fid2path() function can iterate over all of the links in the "trusted.link" xattr and return the various pathnames to the file.

adilger, Aug 25 '21 07:08