concepticon-data Handling relations between concept lists in metadata

This is a follow-up issue on #144. The main question is how to handle obvious relations between lists, as in:

Cases where splitting was actively done (I put the information in a "sublist" column):

https://github.com/clld/concepticon-data/blob/master/concepticondata/conceptlists/Yakhontov-1991-100.tsv

and where authors reference other lists (with list-name as column header, not really a nice solution):

https://github.com/clld/concepticon-data/blob/master/concepticondata/conceptlists/Syrjaenen-2013-226.tsv

The goal would be to find a principled way to reference these relations in the metadata accompanying our concept lists (be it based on tags, on metadata for a concept list identifier, etc.).

Jun 01 '16 06:06 LinguList

At least for these two cases, it seems as if this kind of metadata is redundant - or even a bit incorrect. E.g. looking at the Yakhontov case:

$ concepticon intersection Yakhontov-1991-100 Yakhontov-1991-65 | wc -l
65
$ concepticon intersection Yakhontov-1991-100 Yakhontov-1991-35 | wc -l
35
$ concepticon union Yakhontov-1991-65 Yakhontov-1991-35 | wc -l
99
$ concepticon intersection Yakhontov-1991-65 Yakhontov-1991-35 | wc -l
1

And I think in the Syrjaenen case all information is already contained in the concepticon mapping:

$ concepticon intersection Syrjaenen-2013-226 Comrie-1977-207 | wc -l
205

Do we want to allow for this information to get out of sync - i.e. do we want to keep a record of where a list maps things differently than we do? I'm leaning towards a no here. I'd rather see the data getting out of sync as an indication of a bug or error somewhere.
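
To make "out of sync" concrete, here is a minimal sketch of such a check. It assumes a concept list with an author-supplied SUBLIST column and compares that flag with the membership implied by our own mapping. The rows, the placeholder IDs and the function names are made up for illustration; none of this is part of pyconcepticon.

```python
import csv
import io

# Hypothetical excerpt of a 100-item list; the IDs are placeholders,
# not real Concepticon identifiers.
FULL_LIST_TSV = (
    "ID\tENGLISH\tCONCEPTICON_ID\tSUBLIST\n"
    "Demo-1\tear\tc-ear\t35\n"
    "Demo-2\teye\tc-eye\t35\n"
    "Demo-3\thear\tc-hear\t65\n"
)

# Hypothetical mapping of the separately stored 35-item list, in which
# "ear" was linked to a broader concept set.
SUBLIST_35_IDS = {"c-ear-or-hear", "c-eye"}


def read_rows(tsv_text):
    """Parse a tab-separated concept list into a list of dicts."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def out_of_sync(rows, sublist_ids, flag):
    """Return the glosses where the author flag and our mapping disagree."""
    mismatches = []
    for row in rows:
        flagged = row["SUBLIST"] == flag               # what the list says
        mapped = row["CONCEPTICON_ID"] in sublist_ids  # what our mapping says
        if flagged != mapped:
            mismatches.append(row["ENGLISH"])
    return mismatches


print(out_of_sync(read_rows(FULL_LIST_TSV), SUBLIST_35_IDS, "35"))  # ['ear']
```

Under this reading, a non-empty result points at a linking error rather than at information worth storing separately.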

Nov 07 '16 08:11 xrotwang

Phew, unfortunately, Yakhontov seems to be a real bug in the union and intersection code (mislinked item already identified). The question is a theoretical one: is it possible that authors will make links that we cannot reproduce with intersection/union? I think the answer is "yes": in large collections of largely relational list comparisons, as in Mann-2004-500, the authors could either make an error or use a broadness of conceptual mapping that we would not want to cover with the union/intersection commands.

We have another kind of "flagging" in the data that is regularly encoded: an asterisk in front of a word, or sometimes a cross or some other character, indicates that the word "belongs to Swadesh 100". This is common practice in many lists (with more to come). We have usually just reflected this practice, sometimes replacing italic script with an asterisk, or putting the information into an extra column in those cases where there were too many different font changes. In the SIL list, for example, I added a "sublist" column to store this info, as they had strange symbols:

http://concepticon.clld.org/values/SIL-2002-436-1

Actually, Yakhontov and also Syrjaenen are very good test cases for the suitability of our mapping, right? In Yakhontov, I'll just have to check the error.

But this actually shows that the problem is not necessarily handled consistently at the moment: Yakhontov is split into three lists, as the 35-item list is the most cited, although his 100-item list just had a sublist column. So in cases like Yakhontov, the inter-links MUST be identical and synced, since it was me who created them, but I could as well leave it, as Yakhontov never published a list like that. In the case of Syrjaenen, the linking just quotes what they say, and we could as well use their original table headers, like Leipzig-Jakarta, etc., as we just quote a source.

Nov 07 '16 09:11 LinguList

I just realized: the Yakhontov thing is exactly a problem of our union strictly following the linkings between other lists. We have the concept set "ear or hear", which is broader than "ear" and "hear", and Yakhontov distinguishes both (as do Swadesh and all others) in the list of 100 items. Since the union algorithm seeks to find the broadest possible match between two lists, this is the consistent behaviour, I'd say, and no bug in the algorithm. The correct linking is revealed when using all three lists:

$ concepticon union Yakhontov-1991-35 Yakhontov-1991-65 Yakhontov-1991-100 | wc -l
100

Nov 07 '16 09:11 LinguList

Regarding Syrjaenen: Yes, I think in such a case leaving the data (including field names) as is would make most sense.

Regarding relations between concepts in general, I think an approach similar to the one we have for concept sets should work. Unfortunately we named the file storing relations between concept sets conceptrelations.tsv :) It's still not clear to me what we want to achieve with this. If we only want to store mapping assessments by concept list authors, we could just put the original data (e.g. the label prefixed with an asterisk) into another column of the concept list. Translating these assessments into relations would be either

redundant (in case these relations coincide with our mappings) or
somewhat opaque information - what should I do with a relation which says "author claims concept maps to Swadesh 100 but we don't think so"? I can't see any gain in making the human-readable information from a label "*HAND" machine readable, if the semantics is in conflict with other information we provide.

Nov 07 '16 09:11 xrotwang

Regarding the union algorithm: I think this behaviour should be configurable, i.e. there should be a flag either to opt-in to or opt-out of strict comparison (without following relations). If the semantics of "union" and "intersection" doesn't really match set-theoretic intuitions, we shouldn't use these names.
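
To illustrate the two behaviours, here is a toy model for the two-list case only. It is a sketch under assumptions, not the code behind the concepticon command: the `strict` parameter, the function names and the tiny `BROADER` table are hypothetical, and merging is modelled simply by replacing each concept set with its broadest relative.

```python
def broadest(concept, broader):
    """Follow 'broader' links upward until no broader concept set is left."""
    seen = set()
    while concept in broader and concept not in seen:
        seen.add(concept)  # guard against accidental cycles in the relations
        concept = broader[concept]
    return concept


def union(list_a, list_b, broader, strict=False):
    """Union of two concept lists given as sets of concept set glosses.

    strict=True:  plain set-theoretic union, relations are ignored.
    strict=False: concept sets that share a broader concept set are merged.
    """
    combined = set(list_a) | set(list_b)
    if strict:
        return combined
    return {broadest(concept, broader) for concept in combined}


# Hypothetical relations: both "EAR" and "HEAR" are narrower than "EAR OR HEAR".
BROADER = {"EAR": "EAR OR HEAR", "HEAR": "EAR OR HEAR"}
small = {"EAR", "EYE"}
large = {"HEAR", "NOSE"}

print(len(union(small, large, BROADER, strict=True)))  # 4
print(len(union(small, large, BROADER)))               # 3
```

In this model the relation-following union comes out one item short of the strict one, which mirrors the 99 versus 100 effect seen for the two Yakhontov sublists.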

Nov 07 '16 09:11 xrotwang

Okay, so I guess the cases like Syrjaenen all are sources, and there's no problem with that, as sources contain relations, but we don't need to touch them further, and will just leave original column names.

Relations like the one I introduced in Yakhontov can simply be removed, as they are indeed redundant. The same holds for Chen-1996-200 and its sublists, although I'd keep an A/B flag in a sublist column, as this is how we sometimes use those lists: presenting Yakhontov and saying that this item is in the big list and that one is in the small list. And "sublist" is a rather regular column name with expectable behaviour.

As to the union algorithm: the behaviour is indeed tricky, but it is quite consistent with human practice, as we get exactly what people have been claiming to be the union of Swadesh-1952-200 and Swadesh-1955-100, although this is obviously NOT the case, with "burn" being transitive vs. intransitive, etc. Having a flag that switches between the clean union as defined by Concepticon and the broader one including concept relations seems to be the best solution. I don't insist on the names "union" and "intersection", yet they reflect the practice of people in the past, and I consider it a very nice service that we can actually handle these things (with all caveats, as human practice was not necessarily algorithmic and consistent).

Nov 07 '16 10:11 LinguList

Yes, I agree with all this (and maybe I shouldn't have claimed names like "union" and "intersection" exclusively for mathematics :) ). So leaving the current behaviour as the default but providing a --strict flag would be my preferred solution.

Nov 07 '16 10:11 xrotwang

Yep, I think this is best. And actually: for the calculation of similarities between concept lists, we have so far used the "--strict" behaviour, right?

The human "unions/intersections", by the way, will be a nice service for those who want to create questionnaires for field-work, as I have been already asked by colleagues, especially with the rudimentary component of internationalization, given that we have quite a few lists translated into different languages now.

Nov 07 '16 10:11 LinguList

Ok, but if we use "union" and "intersection" (maybe including options to select target languages) as a way to seed new concept lists, then the output should look like a concept list - or at least should be written to a conceptlist-like file upon request, right? Maybe we should put requirements for these commands into a separate issue.
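
As a starting point for those requirements, here is a rough sketch of what "conceptlist-like" output could mean. The function name, the generated ID scheme and the choice of columns (ID, NUMBER, CONCEPTICON_ID, CONCEPTICON_GLOSS) are assumptions for illustration, and the concept IDs below are placeholders.

```python
import csv
import io


def write_conceptlist(concepts, list_id, handle):
    """Write (concepticon_id, gloss) pairs as a tab-separated concept list."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "NUMBER", "CONCEPTICON_ID", "CONCEPTICON_GLOSS"])
    # Sort by gloss so that repeated runs produce identical files.
    ordered = sorted(concepts, key=lambda concept: concept[1])
    for number, (concept_id, gloss) in enumerate(ordered, start=1):
        writer.writerow([f"{list_id}-{number}", number, concept_id, gloss])


# Hypothetical result of a union/intersection call.
result = [("c-eye", "EYE"), ("c-ear", "EAR")]
buffer = io.StringIO()
write_conceptlist(result, "Demo-2016-2", buffer)
print(buffer.getvalue())
```

Writing to any file-like handle keeps the same function usable for standard output and for a file requested by the user.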

Nov 07 '16 10:11 xrotwang

yes, I just opened #245 to further discuss this.

Nov 07 '16 11:11 LinguList