RegEx support of Datasets packages

Open JohnMrziglod opened this issue 8 years ago • 1 comments

[from @gerritholl]

If I understand generate_filename correctly, the typhon.spareice.datasets approach assumes that the filename can be calculated using only the placeholders in the template. This is not the case for most real datasets. For example, many filenames contain orbit numbers or the string of a downlink station. That means it is necessary to include a regular expression. I'm not sure this is possible with the typhon.spareice.datasets approach but if it isn't, that would be a major limitation.

You are right. So far generate_filename only uses temporal placeholders. I thought about implementing user-defined placeholders but I have not had the time to do it. What do you need them for? Do you want to keep the information from the original filenames and create new filenames with it? A kind of filename conversion? Could you give me a more detailed example? How do you use typhon.Datasets for this?

Jan 03 '18 00:01 JohnMrziglod

As a regex example: a typical HIRS filename is 'NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz'. I describe that with the regex r"(L?\d*\.)?NSS.HIR[XS].(?P<satcode>.{2})\.D(?P<year>\d{2})(?P<doy>\d{3})\.S(?P<hour>\d{2})(?P<minute>\d{2})\.E(?P<hour_end>\d{2})(?P<minute_end>\d{2})\.B(?P<B>\d{7})\.(?P<station>.{2})\.gz". Of those parts, B and station are present in the filename but cannot be predicted from the starting time. In the case of FCDR_HIRS, I both read and write data, so I have both a regex-based approach and a template-based approach:

stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
               "{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
               "{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
               "{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")
write_subdir = "{fcdr_type:s}/{satname:s}/{year:04d}/{month:02d}/{day:02d}"
stored_re = (r"FIDUCEO_FCDR_L1C_HIRS(?P<version>[2-4])_"
             r"(?P<satname>.{6})_"
             r"(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})"
             r"(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})_"
             r"(?P<year_end>\d{4})(?P<month_end>\d{2})(?P<day_end>\d{2})"
             r"(?P<hour_end>\d{2})(?P<minute_end>\d{2})(?P<second_end>\d{2})_"
             r"(?P<fcdr_type>[a-zA-Z]*)_"
             r"v(?P<data_version>.+)_"
             r"fv(?P<format_version>.+)\.nc")

My file-finder uses the regular expression, but the writing part uses the template. This duplicates the same information; ideally a single definition should be enough.
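One way to avoid this duplication is to derive the regular expression from the template. The following is a minimal sketch using only the Python standard library, not how typhon does it. The helper names (`spec_to_pattern`, `template_to_regex`, `parse_filename`) and the example filename are made up for illustration, and only the `d`, `0Nd` and `s` format specs used in the template above are handled. It also assumes zero-padded integer fields always have exactly the stated width.

```python
import re
from string import Formatter


def spec_to_pattern(spec):
    """Translate a small subset of format specs (d, 0Nd, s) into a regex fragment."""
    m = re.fullmatch(r"0?(\d*)([ds])", spec)
    if m is None:
        raise ValueError(f"unsupported format spec: {spec!r}")
    width, kind = m.groups()
    if kind == "d":
        # Assumption: a padded integer field has exactly `width` digits.
        return rf"\d{{{width}}}" if width else r"\d+"
    return r".+?"  # strings: match lazily up to the next literal text


def template_to_regex(template):
    """Return (compiled regex with named groups, names of integer fields)."""
    parts, int_fields, seen = [], set(), set()
    for literal, field, spec, _conversion in Formatter().parse(template):
        parts.append(re.escape(literal))
        if field is None:  # trailing literal text without a placeholder
            continue
        if field in seen:
            parts.append(f"(?P={field})")  # repeated field must match the same text
        else:
            seen.add(field)
            parts.append(f"(?P<{field}>{spec_to_pattern(spec)})")
        if spec.endswith("d"):
            int_fields.add(field)
    return re.compile("".join(parts)), int_fields


def parse_filename(template, filename):
    """Extract the template fields from a filename; integer fields become int."""
    regex, int_fields = template_to_regex(template)
    m = regex.fullmatch(filename)
    if m is None:
        raise ValueError(f"{filename!r} does not match the template")
    return {k: int(v) if k in int_fields else v for k, v in m.groupdict().items()}


stored_name = ("FIDUCEO_FCDR_L1C_HIRS{version:d}_{satname:s}_"
               "{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}{second:02d}_"
               "{year_end:04d}{month_end:02d}{day_end:02d}{hour_end:02d}{minute_end:02d}{second_end:02d}_"
               "{fcdr_type:s}_v{data_version:s}_fv{format_version:s}.nc")

# Made-up filename, for illustration only.
example = ("FIDUCEO_FCDR_L1C_HIRS3_metopa_20100101000000_20100101013000_"
           "easy_v0.8rc1_fv2.0.0.nc")

fields = parse_filename(stored_name, example)
# The same template writes the name back, so reading and writing share one definition.
assert stored_name.format(**fields) == example
```

The derived pattern is looser than the hand-written stored_re (for example, satname is not restricted to six characters), so a real implementation would probably let the user override the pattern of individual fields.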

@gerritholl spareice.datasets now partly supports this feature. A user can define regular expressions and use them as placeholders in filenames (currently only in the basename, not in the directory name). Try this example (you need a file named NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz):

from typhon.spareice.datasets import Dataset
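# User-defined placeholders: each value is a regex with one capturing group.
# Raw strings keep \d from being treated as a string escape.
placeholder = {
    "satcode": r"(.{2})",
    "B": r"(\d{7})",
    "station": r"(.{2})"
}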
dataset = Dataset(
    "NSS.HIR[XS].{satcode}.D{year2}{doy}.S{hour}{minute}.E{end_hour}{end_minute}.B{B}.{station}.gz",
    placeholder=placeholder,
)
file_info = dataset.find_file("1999-05-08")
print(file_info)

This prints:

.../NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz
  Start: 1999-05-07 06:32:00
  End: 1999-05-07 08:20:00
  Attributes:
    satcode: NJ
    B: 2241718
    station: WI

file_info holds information about the file; you can access the parsed placeholders via file_info.attr. You can use them to generate filenames for other datasets:

other_dataset = Dataset("dummy_file_{year}{month}{day}_{satcode}_B{B}_{station}.dat")
other_dataset.generate_filename("1999-05-08", fill=file_info.attr)
#  '.../dummy_file_19990508_NJ_B2241718_WI.dat'
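For readers without typhon at hand, the same parse-then-fill idea can be sketched with the standard library alone. This is only an illustration of the concept, not typhon's implementation: `parse_hirs_name` and `fill_template` are hypothetical helpers, and the pattern is the HIRS regex quoted earlier in this thread with its literal dots escaped.

```python
import re
from datetime import datetime

# HIRS pattern from earlier in the thread, with the literal dots escaped.
HIRS_RE = re.compile(
    r"(L?\d*\.)?NSS\.HIR[XS]\.(?P<satcode>.{2})"
    r"\.D(?P<year>\d{2})(?P<doy>\d{3})"
    r"\.S(?P<hour>\d{2})(?P<minute>\d{2})"
    r"\.E(?P<hour_end>\d{2})(?P<minute_end>\d{2})"
    r"\.B(?P<B>\d{7})\.(?P<station>.{2})\.gz"
)


def parse_hirs_name(filename):
    """Return (start time, non-temporal attributes) parsed from a HIRS filename."""
    m = HIRS_RE.fullmatch(filename)
    if m is None:
        raise ValueError(f"not a HIRS filename: {filename!r}")
    g = m.groupdict()
    # Two-digit year plus day of year, then hour and minute.
    start = datetime.strptime(
        f"{g['year']} {g['doy']} {g['hour']} {g['minute']}", "%y %j %H %M"
    )
    attrs = {key: g[key] for key in ("satcode", "B", "station")}
    return start, attrs


def fill_template(template, when, attrs):
    """Fill temporal placeholders from `when` and the remaining ones from `attrs`."""
    return template.format(
        year=f"{when:%Y}", month=f"{when:%m}", day=f"{when:%d}", **attrs
    )


start, attrs = parse_hirs_name("NSS.HIRX.NJ.D99127.S0632.E0820.B2241718.WI.gz")
new_name = fill_template(
    "dummy_file_{year}{month}{day}_{satcode}_B{B}_{station}.dat",
    datetime(1999, 5, 8),
    attrs,
)
print(start)     # 1999-05-07 06:32:00
print(new_name)  # dummy_file_19990508_NJ_B2241718_WI.dat
```

The attributes that cannot be derived from the time (satcode, B, station) travel from the parsed name into the new template, which is the role file_info.attr plays in the typhon example above.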

Jan 10 '18 01:01 JohnMrziglod