Add ability to have synonyms for author names and keywords
People seem to be very fond of adding free-form keywords and authors when they upload validphys reports. This makes actually searching for something more difficult than it should be. Some things tried for this that didn't work so well are:
- Chasing people around to write the correct metadata. This has high costs and limited effectiveness.
- Replacing keywords manually with a one-off script that modifies the required meta.yaml files in the uploaded report. This has potential for mistakes, even when done with a script, and has side effects such as modifying the date of the index unless that is also taken care of. It is also rather annoying to do.
So we need a better solution. One possibility is to allow for defining synonyms somewhere, so that even if people decide to get creative with their names or keywords, they appear indexed uniformly. This can be done by adding to:
https://github.com/NNPDF/nnpdf/blob/master/validphys2/serverscripts/index-reports.py
the ability to read a yaml file with something like:
author_synonyms:
- "Emmanuele Nocera": [ERN, E_Nocera, "Emanuele R. Nocera"]
keyword_synonyms:
- "theory uncertainties": ["th uncertainties", 'th unc']
And then substituting by the canonical name.
An easy way of doing that is to build a reverse dictionary like
In [8]: {ele: canonical for spec in d['author_synonyms'] for canonical, syns in spec.items() for ele in syns}
Out[8]:
{'ERN': 'Emmanuele Nocera',
'E_Nocera': 'Emmanuele Nocera',
'Emanuele R. Nocera': 'Emmanuele Nocera'}
and then, with that dictionary stored as `revmap`, e.g. `author = revmap.get(author, author)`.
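Putting the pieces together, a minimal sketch of the substitution step could look like the following. It assumes the synonyms file has already been parsed into a dict with the shape shown above (loading it, for example with PyYAML's `yaml.safe_load`, is left out), and that keywords are stored as a list. The helper names `build_reverse_map` and `canonicalize` are hypothetical and not part of `index-reports.py`.

```python
def build_reverse_map(specs):
    """Map every synonym to its canonical name.

    `specs` is a list of single-key dicts, i.e. the shape that the
    `author_synonyms` / `keyword_synonyms` entries parse to.
    """
    revmap = {}
    for spec in specs:
        for canonical, synonyms in spec.items():
            for synonym in synonyms:
                revmap[synonym] = canonical
    return revmap


def canonicalize(values, revmap):
    """Replace known synonyms, keep unknown values, drop duplicates in order."""
    result = []
    for value in values:
        canonical = revmap.get(value, value)
        if canonical not in result:
            result.append(canonical)
    return result


# Hypothetical parsed content of the synonyms file
synonyms = {
    "author_synonyms": [
        {"Emmanuele Nocera": ["ERN", "E_Nocera", "Emanuele R. Nocera"]},
    ],
    "keyword_synonyms": [
        {"theory uncertainties": ["th uncertainties", "th unc"]},
    ],
}

author_map = build_reverse_map(synonyms["author_synonyms"])
keyword_map = build_reverse_map(synonyms["keyword_synonyms"])

# Both spellings collapse onto the canonical keyword; "fit" is left untouched
keywords = canonicalize(["th unc", "fit", "theory uncertainties"], keyword_map)
```

Unknown names fall through unchanged, so the indexer would keep working for authors and keywords that have no synonym entry yet.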
See also:
https://github.com/NNPDF/nnpdf/issues/223
Ok some rough ideas I thought I should write down: (before starting I want to admit that I am a culprit of probably all of these)
- Come up with some more rigid rules about what keywords should be. I have some rough formatting ideas (all words should be lower case, all acronyms should be CAPS, "N-point" should be hyphenated etc.). Create a synonym mapping for existing ones. Where keywords can't be mapped to sensible keys, the author should be contacted and given an ultimatum to provide a suitable fixup within the new guidelines or the report gets deleted. (For example, I think the reports with the dataset name in the keyword are incorrect: keywords here should be something like 'dataset' and 'comparison' and the dataset name should feature in the title, although I guess the rules will need to be properly agreed on.)
- Future uploads should conform to the new set of keys. New keys will be requested somehow for new projects (maybe via a script?). Any uploads which do not conform have some kind of grace window in which the author can request the new keyword, after which they get deleted. A warning should probably be issued upon upload if this is the case.
- The keyword `test` implies several things at the moment: perhaps lack of creativity, since almost all of the vp reports are tests of some description; maybe laziness, since it's a keyword that appears in some examples and clearly wasn't deleted; or (and imo most sensibly) that this report is to be sent via email, maybe be discussed at a phone conference, but isn't a permanent result. In this case I think the `test` keyword should indicate to the indexer that the report should be deleted after say 2-3 weeks. Perhaps there should be a different keyword if the report is support material for a PR (I guess we want that to exist forever?).
I think this would help shrink the reports and also help the keywords be more useful. My ideas on keywords rules are as follows:
- Keywords should be, as much as possible, single words. Wider contexts should be built up out of several keywords: `Fits w. th. unc.` should be `theory uncertainties` (the project) and `fit` (or maybe vp-comparefit to indicate the type of report). Then on the report index page one should be able to select more than one keyword, so one would select `theory covariance` and then `fit` and the reports would approximately be the fits in the "Fits with theoretical uncertainties" page on the Wiki.
- Keywords should have a well-defined format so that the indexer only has to compete with typos, not creative flair. As previously suggested, I think all words lowercase and all acronyms upper would be a good start; there are a lot of duplicate keys because of plurals and switching special characters. I think generally we should avoid using special characters unless they really look nicer, are really necessary, or help two words become a single keyword like "9-point".
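As an illustration of the formatting rules suggested here (words lowercase, acronyms uppercase, "9-point" hyphenated), a normalisation pass might be sketched as below. The `ACRONYMS` set and the function name are hypothetical; the real list of acronyms and the exact rules would need to be agreed on first. Plurals are deliberately not handled in this sketch.

```python
import re

# Hypothetical set of acronyms; the actual list would need to be agreed on
ACRONYMS = {"PDF", "QCD", "NNLO", "DIS"}


def normalize_keyword(keyword, acronyms=ACRONYMS):
    """Apply the proposed formatting rules to a single keyword."""
    normalized = []
    # split() with no arguments also collapses repeated whitespace
    for word in keyword.split():
        if word.upper() in acronyms:
            normalized.append(word.upper())
        else:
            normalized.append(word.lower())
    result = " ".join(normalized)
    # "9 point", "9point" and "9_point" all become "9-point"
    return re.sub(r"\b(\d+)[\s_]*point\b", r"\1-point", result)
```

Running the existing keywords through a function like this before building the synonym mapping would probably remove many of the duplicates caused by capitalisation and special characters.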
As I said before, I think the interactive meta should be included in vp-upload. I think one of the big problems people have is that they're filling in the meta in a text editor and can't quite remember the key they used. Something like `vp-upload -i`.
With regards to deletion, it seems harsh but I think it will keep things under control. Also most reports are quick to reproduce (which is the point of vp anyway) and so it's not a complete disaster if things get deleted. Perhaps in the first instance we don't delete the old reports which we can't make conform to new rules and instead archive them.
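For the `test` keyword idea, the indexer would first need a way to find the candidates. A rough sketch, assuming each report's metadata is available as a dict with a `keywords` list and an `uploaded` datetime (both field names are hypothetical here), could be:

```python
from datetime import datetime, timedelta


def stale_test_reports(reports, now, max_age=timedelta(weeks=3)):
    """Return the names of reports tagged `test` that are older than `max_age`."""
    stale = []
    for name, meta in reports.items():
        is_test = "test" in meta.get("keywords", [])
        if is_test and now - meta["uploaded"] > max_age:
            stale.append(name)
    return sorted(stale)


# Made-up reports, for illustration only
reports = {
    "report-a": {"keywords": ["test"], "uploaded": datetime(2018, 1, 1)},
    "report-b": {"keywords": ["test"], "uploaded": datetime(2018, 1, 25)},
    "report-c": {"keywords": ["fit"], "uploaded": datetime(2017, 1, 1)},
}

candidates = stale_test_reports(reports, now=datetime(2018, 2, 1))
```

The function only lists candidates; whether they are then deleted, archived or just flagged is a separate policy decision.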
Ok some rough ideas I thought I should write down: (before starting I want to admit that I am a culprit of probably all of these)
* Come up with some more rigid rules about what keywords should be. I have some rough formatting ideas (all words should be lower case, all acronyms should be CAPS, "N-point" should be hyphenated etc.). Create a synonym mapping for existing ones. Where keywords can't be mapped to sensible keys, the author should be contacted and given an ultimatum to provide a suitable fixup within the new guidelines or the report gets deleted. (For example, I think the reports with the dataset name in the keyword are incorrect: keywords here should be something like 'dataset' and 'comparison' and the dataset name should feature in the title, although I guess the rules will need to be properly agreed on.)
It is probably a good idea to have guidelines on these things written down somewhere. I don't think we want to delete old reports, but maybe make their metadata match the new guidelines where possible, and where not, they were "correct" at the time in the sense that we didn't have any policy. Note that the historical development was something like plots actions, report action, index page, separate metadata, with the last two coming more or less together as far as I recall.
* Future uploads should conform to the new set of keys. New keys will be requested somehow for new projects (maybe via a script?). Any uploads which do not conform have some kind of grace window in which the author can request the new keyword, after which they get deleted. A warning should probably be issued upon upload if this is the case.
Maybe from now on we can require that the metadata is explicitly set, that things like authors and keywords have to be created in a separate step, and that a report can only contain the existing ones. This would become interesting if we were to automatically generate the indexes that are now in the wiki and had more features such as #224 (which I think would be a killer feature for many things, btw). I think this is as good as we can enforce automatically without too much trouble, and I don't think we should ever delete things that are uploaded successfully.
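A check of this kind at upload time might look roughly like the sketch below. It assumes the metadata has `title`, `author` and `keywords` fields and that the registered authors and keywords are available as sets; the function name and the way the registry is stored are hypothetical.

```python
def check_metadata(meta, known_authors, known_keywords):
    """Return a list of problems; an empty list means the metadata is accepted."""
    problems = []
    # The metadata must be set explicitly
    for field in ("title", "author", "keywords"):
        if not meta.get(field):
            problems.append(f"missing field: {field}")
    # Authors and keywords must have been registered beforehand
    author = meta.get("author")
    if author and author not in known_authors:
        problems.append(f"unregistered author: {author}")
    for keyword in meta.get("keywords") or []:
        if keyword not in known_keywords:
            problems.append(f"unregistered keyword: {keyword}")
    return problems


# Hypothetical registry of already created authors and keywords
known_authors = {"Emmanuele Nocera"}
known_keywords = {"theory uncertainties", "fit"}

problems = check_metadata(
    {"title": "A fit", "author": "ERN", "keywords": ["fit", "th unc"]},
    known_authors,
    known_keywords,
)
```

Returning the problems instead of raising leaves it to the caller to decide whether to refuse the upload or only print a warning.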
Also there is the philosophical point that the infrastructure should allow for doing quick and dirty things without too much bureaucracy.
* The keyword `test` implies several things at the moment: perhaps lack of creativity, since almost all of the vp reports are tests of some description; maybe laziness, since it's a keyword that appears in some examples and clearly wasn't deleted; or (and imo most sensibly) that this report is to be sent via email, maybe be discussed at a phone conference, but isn't a permanent result. In this case I think the `test` keyword should indicate to the indexer that the report should be deleted after say 2-3 weeks. Perhaps there should be a different keyword if the report is support material for a PR (I guess we want that to exist forever?).
It is probably not the most descriptive keyword, but I tend to use it when I am testing some code feature and I don't particularly care about the physics in the report. I find myself looking at old things marked with test to e.g. see what the runcard looks like. And again I like to think that once I upload something it is potentially available forever (and backed up and whatnot).
I think this would help shrink the reports and also help the keywords be more useful. My ideas on keywords rules are as follows:
* Keywords should be, as much as possible, single words. Wider contexts should be built up out of several keywords: `Fits w. th. unc.` should be `theory uncertainties` (the project) and `fit` (or maybe vp-comparefit to indicate the type of report). Then on the report index page one should be able to select more than one keyword, so one would select `theory covariance` and then `fit` and the reports would approximately be the fits in the "Fits with theoretical uncertainties" page on the Wiki.
* Keywords should have a well-defined format so that the indexer only has to compete with typos, not creative flair. As previously suggested, I think all words lowercase and all acronyms upper would be a good start; there are a lot of duplicate keys because of plurals and switching special characters. I think generally we should avoid using special characters unless they really look nicer, are really necessary, or help two words become a single keyword like "9-point".
These seem reasonable.
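Selecting more than one keyword on the index page amounts to keeping the reports that carry all of the selected keywords. A minimal sketch, assuming the index is available as a mapping from report name to its keyword list (the names and data below are made up):

```python
def select_reports(reports, required_keywords):
    """Return the names of reports that carry every keyword in `required_keywords`."""
    required = set(required_keywords)
    return sorted(
        name for name, keywords in reports.items()
        if required <= set(keywords)  # subset test: all required keywords present
    )


# Made-up index, for illustration only
reports = {
    "report-a": ["theory uncertainties", "fit"],
    "report-b": ["theory uncertainties", "dataset", "comparison"],
    "report-c": ["fit"],
}

selected = select_reports(reports, ["theory uncertainties", "fit"])
```

With single-word keywords combined this way, a page such as the "Fits with theoretical uncertainties" one could in principle be generated from a keyword selection.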
As I said before, I think the interactive meta should be included in vp-upload. I think one of the big problems people have is that they're filling in the meta in a text editor and can't quite remember the key they used. Something like `vp-upload -i`.
That also seems reasonable.
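One piece an interactive mode could use is suggesting existing keys when the entered one is not known. The sketch below uses `difflib.get_close_matches` from the standard library; the function name `suggest_keywords` and the list of existing keywords are hypothetical, and nothing here reflects how vp-upload currently works.

```python
import difflib


def suggest_keywords(entered, existing, max_suggestions=3, cutoff=0.6):
    """Return `entered` itself if it already exists, else close matches from `existing`."""
    if entered in existing:
        return [entered]
    # Candidates are ranked by similarity; anything scoring below `cutoff` is dropped
    return difflib.get_close_matches(
        entered, existing, n=max_suggestions, cutoff=cutoff
    )


# Hypothetical list of keywords already present in the index
existing = ["theory uncertainties", "fit", "dataset", "comparison"]

suggestions = suggest_keywords("theory uncertainty", existing)
```

An empty result would then be the cue to ask whether a new keyword should be created.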
With regards to deletion, it seems harsh but I think it will keep things under control. Also most reports are quick to reproduce (which is the point of vp anyway) and so it's not a complete disaster if things get deleted. Perhaps in the first instance we don't delete the old reports which we can't make conform to new rules and instead archive them.
See above.
cc @voisey @RosalynLP @enocera @lucarottoli
@wilsonmr have you made a branch for this mate?
Nope, I just had a look at what already exists in the keywords index. Feel free to if you want to work on it.