Resource `name` should no longer be required (and need not be slug-friendly)

Open rufuspollock opened this issue 5 years ago • 9 comments

The one thing that really matters is existence of path or data. Everything else should be optional 😄

See http://tech.datopian.com/notebook/#package-json-and-npm-were-a-mistake - this makes a strong case that all the "meta" stuff in package.json was a mistake. All you need to identify a resource is a url or relative path! And that's true for us with datapackage.json.

Original logic for adding name as required was that it gave you an identifier for a resource (and that it could be used in urls). However:

  • Why can't we use path as an identifier?
  • It's one more thing for publishers to think about and generate
  • name is sharply constrained to be slug friendly which means work must be done to process something
  • in e.g. CKAN name on Resource need not be unique amongst resources so work has to be done ...

rufuspollock avatar Jun 12 '20 05:06 rufuspollock

> See http://tech.datopian.com/notebook/#package-json-and-npm-were-a-mistake - this makes a strong case that all the "meta" stuff in package.json was a mistake. All you need to identify a resource is a url or relative path! And that's true for us with datapackage.json.

Just to mention another opinion - his current ideas are highly controversial and have yet to prove themselves, while the Node way already has 😄

I think resource names, at least, don't need to be slugs, just as field names don't.

roll avatar Jun 12 '20 06:06 roll

It's not about his credibility: rather, those arguments were very persuasive to me. Reducing "meta" noise here is useful.

Aside from that, any objections to a) making this optional, b) making it non-sluggy?

rufuspollock avatar Jun 12 '20 11:06 rufuspollock

> b) making it non-sluggy

I don't think I have any objections.

> a) making this optional

I would say that's a really big change. I haven't analyzed it yet regarding our libs, but off the top of my head, resource names are at the core of the whole dataflows framework - https://github.com/datahq/dataflows/blob/master/PROCESSORS.md (cc @akariv)

roll avatar Jun 12 '20 12:06 roll

Resource name has been the chosen method by the fd specs for referencing a resource in a data package.

One example is the foreign key specification: https://specs.frictionlessdata.io/table-schema/#foreign-keys But I'm sure there are other uses.
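
To make that dependency concrete, here is a minimal Python sketch of how a consumer might resolve a foreign key by resource name. The descriptor, the resource names (`states`, `cities`) and the helper `foreign_key_targets` are hypothetical; only the `foreignKeys` / `reference.resource` layout is taken from the linked Table Schema section.

```python
# Hypothetical descriptor: "cities" points at "states" by resource name.
package = {
    "resources": [
        {
            "name": "states",
            "path": "states.csv",
            "schema": {"fields": [{"name": "code"}]},
        },
        {
            "name": "cities",
            "path": "cities.csv",
            "schema": {
                "fields": [{"name": "city"}, {"name": "state_code"}],
                "foreignKeys": [
                    {
                        "fields": "state_code",
                        "reference": {"resource": "states", "fields": "code"},
                    }
                ],
            },
        },
    ]
}


def foreign_key_targets(package):
    """Map each resource name to the paths its foreign keys point at."""
    by_name = {r["name"]: r for r in package["resources"]}
    targets = {}
    for resource in package["resources"]:
        foreign_keys = resource.get("schema", {}).get("foreignKeys", [])
        for fk in foreign_keys:
            ref_name = fk["reference"]["resource"]
            # This lookup relies on every resource having a unique name.
            target_path = by_name[ref_name]["path"]
            targets.setdefault(resource["name"], []).append(target_path)
    return targets
```

If `name` were optional, the `by_name` index above could not be built without first inventing names for the unnamed resources.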

I'm pretty sure that if we suddenly make this property non-mandatory or non-sluggish, some implementations that assume these properties will break (and there's no reason for these implementations not to assume that, as it's in the spec).

One way to overcome this is to allow it to be optional, and recommend for implementations to auto-generate a name in case it's missing (e.g. by using the path or some other ordinal).
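
A rough sketch of what such auto-generation could look like, in Python. The function name `fallback_name`, the slug rule (lowercase alphanumerics plus `.`, `-`, `_`) and the `resource-N` ordinal format are assumptions made for illustration, not anything the specs prescribe.

```python
import posixpath
import re
from urllib.parse import urlparse


def fallback_name(resource, index):
    """Return the resource's own name, or derive one from path or position."""
    if resource.get("name"):
        return resource["name"]
    path = resource.get("path")
    if isinstance(path, list):
        # path may be an array of files; use the first one (an assumption)
        path = path[0] if path else None
    if isinstance(path, str) and path:
        # Last path segment without extension; works for URLs and relative paths
        stem = posixpath.splitext(posixpath.basename(urlparse(path).path))[0]
        slug = re.sub(r"[^a-z0-9._-]+", "-", stem.lower()).strip("-")
        if slug:
            return slug
    # Inline data or an unusable path: fall back to the ordinal position
    return f"resource-{index + 1}"
```

Note that names derived this way are not guaranteed to be unique within a package, so an implementation would still need a de-duplication step.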

Anyway, while on the subject I would make any of these properties optional before touching the resource name:

  • Package resources - allow a package without resources
  • Resource path/data - allow a resource without data (or data that cannot be described with a URI)

akariv avatar Jun 12 '20 18:06 akariv

Adding to what has already been said, I can understand the consideration of whether making the name property optional is a good idea. I guess this is a typical scenario where we are trying to balance the data packager role (‘packager’) with the data package consumer role (‘consumer’). From a packager's perspective, the advantage appears to be that it's one less property to worry about. The impact from the consumer’s perspective is that it increases the complexity of referencing or selectively processing any given resource.

Without the name property being mandatory, there would be no consistent way of referencing or selecting a resource irrespective of its ordinal position or path. Two examples of when this is a significant issue are when the path always points to the latest snapshot of a resource, or when the location of a resource is changed but the name and content remain unaltered. The mandatory name property allows the consumer to make assumptions about how to iterate over the list of resources and only get the data for a specific resource or a set of resources. If only the path were mandatory, the consumer would be forced to use some kind of pattern matching on the path, and that would only work consistently across all data packages if the specs required the path for each resource to follow a consistent naming convention across all versions (i.e. over time) of the data package. In other words, making the name property optional would break an existing, and likely quite common, processing pattern.

The fundamental issue is this: the name property acts as a logical identifier for a resource, whereas the path is an identifier for an instance of the logical resource. Although they are similar and there is a one-to-one relationship between these in a single version of a data package, this mapping can become one-to-many across different versions of the same data package.
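
The distinction can be illustrated with a small Python sketch. The two package versions, their paths and the helper `get_resource` are hypothetical; the point is only that a lookup by name survives both a path change and a reordering of the resources array.

```python
def get_resource(package, name):
    """Return the resource with the given name, wherever it sits in the list."""
    for resource in package.get("resources", []):
        if resource.get("name") == name:
            return resource
    raise KeyError(name)


# Two versions of the same (hypothetical) data package.
v1 = {
    "resources": [
        {"name": "cities", "path": "data/cities-2019.csv"},
        {"name": "countries", "path": "data/countries.csv"},
    ]
}
v2 = {
    "resources": [
        {"name": "countries", "path": "data/countries.csv"},
        {"name": "cities", "path": "snapshots/cities-2020.csv"},
    ]
}

# Same logical resource, different instance (path) and position per version.
old_path = get_resource(v1, "cities")["path"]
new_path = get_resource(v2, "cities")["path"]
```

A reference by path (`data/cities-2019.csv`) or by position (`resources[0]`) would resolve correctly in `v1` but not in `v2`.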

With regard to the name being a slug, I think having a consistent naming convention is a good idea; it's one less decision that the packager has to make and the consumer knows upfront how they can process this, if needed. Whether the naming convention is slugs, underscores, or some other convention is less significant, but since slugs have already been specified, is it really a good idea to change this now? The impact could be an unquantifiable number of broken implementations that expect this to be slugged. Do the benefits of relaxing this requirement outweigh the unquantified risk to an unknown number of consumers?

michaelamadi avatar Jun 21 '20 07:06 michaelamadi

I think name generation should be handled by software, as it already is (human side), but for the interoperability standard it's important for the name to be present (machine side).

roll avatar Jun 21 '20 07:06 roll

@michaelamadi I guess my point here was that you could already have path be unique - the annoyance I see here is that I'm already adding a file/url by path (which is unique), so why do I then need to generate some other unique identifier for it?

@roll yes, software could do this, but it's another thing for the software to do (and this is a very real issue I've encountered recently with ckan => f11s conversion, where name is not unique on the ckan side ...)

rufuspollock avatar Jun 22 '20 10:06 rufuspollock

> @michaelamadi I guess my point here was that you could already have path be unique - the annoyance I see here is that I'm already adding a file/url by path (which is unique), so why do I then need to generate some other unique identifier for it?

@rufuspollock That makes perfect sense and generating the name is an additional step that ideally wouldn't be required. I agree with @akariv that making the name property optional but recommended is a reasonable approach.

My primary concern is that when a resource has a path and no name, all external references to the resource become inherently fragile. Something as simple as a file rename or directory move will break a reference by path, and a shift or shuffle in the resources array can break a reference by ordinal position.

Do the benefits of making the most durable resource identifier (i.e. name) optional outweigh the referencing/lookup issues it could introduce?

michaelamadi avatar Jun 22 '20 13:06 michaelamadi

As someone just writing a data package generator, I'd be very grateful if I didn't have to worry about names.

You see, my data packages can contain lots of files the names of which I cannot control. It's totally possible that I will have to deal with

  • data/1.fits
  • Data/1.fits
  • data/32+29.fits
  • data/32-29.fits

all in one package. Sure, it's not terribly hard to make package-unique names for them (I will probably just enumerate all such files and call them data-1 through data-n) – but it's still a minor annoyance, and I cannot see how anyone would profit from such a thing.
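
A short Python sketch of the situation described above. The slug rule in `naive_slug` (lowercase the file stem and replace anything outside `a-z0-9._-` with `-`) is an assumption for illustration; with it, the four paths collapse to two names, which is why plain enumeration ends up being the simpler choice.

```python
import posixpath
import re

paths = ["data/1.fits", "Data/1.fits", "data/32+29.fits", "data/32-29.fits"]


def naive_slug(path):
    """Slugify the file stem: lowercase, non-slug characters become '-'."""
    stem = posixpath.splitext(posixpath.basename(path))[0]
    return re.sub(r"[^a-z0-9._-]+", "-", stem.lower()).strip("-")


# Slugs derived from the paths collide pairwise.
naive_names = [naive_slug(p) for p in paths]

# Enumeration is unique by construction but carries no meaning.
enumerated_names = [f"data-{i}" for i, _ in enumerate(paths, start=1)]
```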

msdemlei avatar Sep 01 '20 13:09 msdemlei

As someone who's maintaining a data package reader (and writer), I would prefer resource name to be kept mandatory. It is a very useful identifier to refer to a resource. Deriving one from path or data when reading can be complex (because you can have path or data, path can be an array and it doesn't seem straightforward to generate a name from data). This will result in implementations using different mechanisms to refer to a resource, i.e. different names. And the names will likely not be meaningful to the user.

I am fine with relaxing the slug-type nature of name and would advise the same as for field names in Table Schema.

peterdesmet avatar Jun 27 '23 13:06 peterdesmet

After quite a long and comprehensive discussion the Data Package Working Group decided to keep resource.name required and unique but agreed to remove sluggishness requirements:

  • https://github.com/frictionlessdata/datapackage/pull/21
  • https://github.com/frictionlessdata/datapackage/pull/27
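
As a rough illustration of that outcome, a consumer-side check might look like the Python sketch below. The helper `name_errors` and its message format are hypothetical and not part of any Frictionless library; it only checks that each resource has a non-empty, package-unique `name`, and deliberately applies no slug rule.

```python
def name_errors(package):
    """List problems with resource names: missing or duplicated."""
    errors = []
    seen = set()
    for index, resource in enumerate(package.get("resources", [])):
        name = resource.get("name")
        if not isinstance(name, str) or not name:
            errors.append(f"resource {index}: missing name")
        elif name in seen:
            errors.append(f"resource {index}: duplicate name {name!r}")
        else:
            seen.add(name)
    return errors
```

Under this check a name such as `"Cities (2020)"` passes, while a resource with only a `path` does not.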

roll avatar Feb 20 '24 10:02 roll