Scalable Reading of Structured Data (PH/JISC/TNA)

Open tiagosousagarcia opened this issue 4 years ago • 65 comments

The Programming Historian has received the following tutorial on 'Scalable Reading of Structured Data' by @maxodsbjerg, Helle Strandgaard Jensen, Josephine Møller Jensen, Alexander Ulrich Thygensen. This lesson is now under review and can be read at:

http://programminghistorian.github.io/ph-submissions/en/drafts/originals/scalable-reading-of-structured-data

Please feel free to use the line numbers provided on the preview if that helps with anchoring your comments, although you can structure your review as you see fit.

I will act as interim editor for the review process, until a permanent editor is assigned. The role of the editor is to solicit two reviews from the community and to manage the discussions, which should be held here on this forum.

Members of the wider community are also invited to offer constructive feedback, which they should post to this message thread, but they are asked to first read our Reviewer Guidelines (http://programminghistorian.org/reviewer-guidelines) and to adhere to our anti-harassment policy (below). We ask that all reviews stop after the second formal review has been submitted so that the author can focus on any revisions. I will make an announcement on this thread when that has occurred.

I will endeavor to keep the conversation open here on Github. If anyone feels the need to discuss anything privately, you are welcome to email me.

Our dedicated Ombudsperson is (Ian Milligan - http://programminghistorian.org/en/project-team). Please feel free to contact him at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudsperson will have no impact on the outcome of any peer review.

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

The Programming Historian is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or to request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. Thank you for helping us to create a safe space.

tiagosousagarcia avatar Nov 02 '21 15:11 tiagosousagarcia

@maxodsbjerg, could I ask you to post the following on this thread, when you get a chance?

I the author|translator hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

tiagosousagarcia avatar Dec 14 '21 15:12 tiagosousagarcia

Yes of course!

I the author hereby grant a non-exclusive license to ProgHist Ltd to allow The Programming Historian English|en français|en español to publish the tutorial in this ticket (including abstract, tables, figures, data, and supplemental material) under a CC-BY license.

maxodsbjerg avatar Dec 15 '21 10:12 maxodsbjerg

Hi @tiagosousagarcia! Thanks again for setting up the ticket. I noticed that the preview seems a bit off—the lesson should be displaying like this one. Let me know if you would like any help troubleshooting!

svmelton avatar Dec 16 '21 20:12 svmelton

Thanks @svmelton for the note and @jenniferisasi for fixing!

drjwbaker avatar Dec 17 '21 11:12 drjwbaker

I've made some edits https://github.com/programminghistorian/ph-submissions/commit/e4b2b4eace8d2a947a4f7682f518afdc89e9bfba#diff-83b53202f002a488c2e8a75ab1de9e95e1871fc1a19b3c70287ef848974fbac7 on l1-l117 with comments below:

  • l40/156: what is meant by 'timely dimensions'? Is this about chronology?
  • l50: is 'solved' the correct word choice here?
  • l58: is gender a binary?
  • l65: you should note here that if the reader doesn't have a dataset to hand, you will explain how to acquire one in the next section
  • l67: what does 'the free Essential' mean here?
  • l67-8: please use the full citations to these articles (at the bottom of the page) as you would with any other journal.
  • l70-l117: users should be guided in using R and installing packages, for example by reference to other articles on PH.

I'll pick up on the rest of the article later.

I note that you are writing in English as a second language, which will be taken into account during peer review. If the article passes through peer review, copyediting will focus on ensuring the article meets our Write for a Global Audience guidelines.

drjwbaker avatar Dec 17 '21 12:12 drjwbaker

Finished now! Next set of edits/comments:

  • throughout: removed lots of extraneous dashes (usually) after code blocks.
  • png files: these are not showing correctly. @tiagosousagarcia: do we have these and/or are they in the correct folder?
  • l108: capitalise 'Twitter'. This should be consistent throughout.
  • l113: you assume that the reader knows what a 'dataframe' is. Either: briefly explain what this is, link to explanatory information, or signal that this isn't an introductory article (because introductory articles do not assume knowledge of particular programming architectures)
  • l119: light word choice edits
  • l126: this intro to what R packages are and how to install them should come around l70-l117.
  • l134-139: light word choice edits
  • l142: typo
  • l149: mention of regex should point to guidance on regular expressions.
  • l199: what does 'our needs' refer to here? What are 'our needs' at this point in the article?
  • l211: light word choice edit
  • l215: adapt note on dates
  • l236: again, recommend removing example that suggests gender is a binary.
  • l238: above, I've suggested not saying 'top 20 liked tweets' and replacing that with something like '20 most commonly liked tweets'. I suggest this revision is made throughout, as 'top' could be considered colloquial.
  • l336: added quote marks to "favorite_count". General point: please check the use of quote marks and code blocks for consistency as they appear to be missing or used differently at different points in the article to refer to code, variables, packages, or datasets.
  • l43: what does 'all' mean in the heading here?
  • l394: what do you mean by 'Global Environment' here?

drjwbaker avatar Dec 20 '21 17:12 drjwbaker

To summarise, there is a kernel of a good article here, but it needs to hold the hand of the reader a little more, especially as a) the tutorial is intended as introductory and b) the tutorial attempts to allow the reader to follow multiple pathways.

So, firstly, these pathways need to be made clearer.

And secondly - and perhaps more importantly - the article needs to assume less knowledge, either by pointing the reader to things that explain new terms/concepts, or being more explicit about what the reader should do: the latter is particularly acute in the 'Data and Prerequisites' section, at which point, as it stands, I can see a reader not knowing what they are being asked to do, suddenly confronted as they are with descriptions of R packages (I know there is a note in the aims section, but the reader needs more help here).

In addition to that, there are some inconsistencies in styling for in-paragraph mentions of code, variables, packages, and datasets that need attention.

@tiagosousagarcia: anything to add from your read-through?

drjwbaker avatar Dec 20 '21 17:12 drjwbaker

@drjwbaker we have the images and they should be in the correct place in the repo, but they are not referenced in the .md -- I'll add them in the commit below where I think they should go (they still need captions though);

Otherwise, just a few extra notes:

  • p 18 - if the user is trying to get twitter data using the rtweet package, there should be a note warning that the progress bar refers to the total number of requested tweets, rather than the progress of the operation. That is, if the reader requests 18000 tweets, but only 2600 are available, the progress bar will be stuck at about 15%, which might confuse people (it certainly confused me).

  • p 28 - fig. 1 and the code that creates it should probably include a scale on the y axis

  • p 40 - the heading 'Interaction count dispersed on verified status' seems a bit confusing to me

  • p 61 - perhaps a few lines explaining why we are exporting to JSON specifically (as opposed to, say, csv)

tiagosousagarcia avatar Dec 21 '21 08:12 tiagosousagarcia

@maxodsbjerg Just to add, I appreciate these are a lot of changes to get to. Please don't feel there is a hurry here, as I know many people are already starting their festive leave period. Let's check in again in the new year, and should you have any queries, please ask me and/or @tiagosousagarcia.

drjwbaker avatar Dec 21 '21 08:12 drjwbaker

Thank you all for the edits/comments! I'll look into them in the new year.

maxodsbjerg avatar Dec 21 '21 10:12 maxodsbjerg

@tiagosousagarcia I note these images still aren't rendering in the preview. To be honest I'm not sure how to fix as I've an issue with another article at the moment https://github.com/programminghistorian/ph-submissions/issues/436#issuecomment-1004843172

This one that @amsichani is working on - code here - works perfectly if that is any help!

drjwbaker avatar Jan 04 '22 14:01 drjwbaker

@drjwbaker I've noticed it pre-Christmas, but was hoping it was a case of delayed updating. I'll try to find where the bug is, but might need some help from the @programminghistorian/technical-team on this one

tiagosousagarcia avatar Jan 04 '22 14:01 tiagosousagarcia

A bit more info on the issue -- essentially, it seems that the image location is not being correctly replaced by the slug. The generated preview has https://programminghistorian.github.io/ph-submissions/images/LEAVE%20BLANK/scalable-reading-of-structured-data-1.png as the address for the first figure, for example, even though the slug is indicated correctly in the .md file. In commit 00df8f6 I've removed all 'LEAVE BLANK' fields to see if that nudges it in the right direction.

tiagosousagarcia avatar Jan 04 '22 14:01 tiagosousagarcia

solved with commit 5c06b46

tiagosousagarcia avatar Jan 04 '22 14:01 tiagosousagarcia

@drjwbaker @tiagosousagarcia Thanks again for your comments! We had a meeting yesterday in our group and look forward to addressing the comments. We divided the comments amongst us and plan on addressing them in the next couple of weeks.

How would you prefer that we work with the comments? Should we fork the .md file that you have been doing the light word editing on and ping you when we're done?

maxodsbjerg avatar Jan 11 '22 13:01 maxodsbjerg

@maxodsbjerg Thanks for your note. I think a fork will work. So if you are happy with that approach, please proceed.

drjwbaker avatar Jan 11 '22 13:01 drjwbaker

Hello all,

Please note that this lesson's .md file has been moved to a new location within our Submissions Repository. It is now found here: https://github.com/programminghistorian/ph-submissions/tree/gh-pages/en/drafts/originals

A consequence is that this lesson's preview link has changed. It is now: http://programminghistorian.github.io/ph-submissions/en/drafts/originals/scalable-reading-of-structured-data

Please let me know if you encounter any difficulties or have any questions.

Very best, Anisa

anisa-hawes avatar Feb 06 '22 19:02 anisa-hawes

@maxodsbjerg Just checking in to see how you are getting along with the pre peer-review edits.

drjwbaker avatar Feb 07 '22 08:02 drjwbaker

@drjwbaker It is all going very well. We have just a few edits left and I plan on finishing them this week.

maxodsbjerg avatar Feb 07 '22 09:02 maxodsbjerg

Fab. Thanks for the update.

drjwbaker avatar Feb 07 '22 10:02 drjwbaker

@drjwbaker @tiagosousagarcia We have finished the editing now. You'll find the updated markdown here: https://github.com/maxodsbjerg/ScalableReadingOfStructuredData/blob/main/20220117_PHedits_scalable-reading-of-structured-data.md

We've also collected your comments in a markdown-file and described what we did (the text in italic following your comment). You'll find it here: https://github.com/maxodsbjerg/ScalableReadingOfStructuredData/blob/main/20220210_PH-lesson_Scalable_Reading_edits.md

maxodsbjerg avatar Feb 10 '22 13:02 maxodsbjerg

Thanks so much for this @maxodsbjerg. I'm going to replace our version of the article with this one. Then we'll send it out for peer review. Note that there may be a slight delay here as @tiagosousagarcia is on leave.

drjwbaker avatar Feb 14 '22 14:02 drjwbaker

(and thanks so much for the commentary on our suggestions: the article is much tighter now. Great job!)

drjwbaker avatar Feb 14 '22 14:02 drjwbaker

@inactinique and @martinmueller39 have kindly agreed to review this article. We can expect their reviews on the 1st and 15th of April respectively. If there are any questions, please feel free to post them on this ticket, or email me or @drjwbaker. Many thanks to our reviewers

tiagosousagarcia avatar Mar 07 '22 15:03 tiagosousagarcia

Dear authors,

Thank you for this very comprehensive tutorial.

General comments and suggestions

The code works fine (I tested it in an RStudio notebook). The prerequisites (software, experience) are well explained, though I think you did not specify that a Twitter account is necessary to get the data with rtweet. The learning objectives are clearly defined, as is the workflow you suggest following. The lesson is also well structured overall. Another strong point of your contribution is that you make the links with other lessons explicit and, in your last paragraph, highlight the differences with the Beginner’s Guide to Twitter Data and explain how to overcome them.

I would suggest stating more clearly which versions of R and the R libraries you wrote the code with, as a change of version can slightly change the syntax. That said, on my machine (macOS) it worked with the latest versions of R and the packages you are using.

The code is easy to reproduce, even for Python-oriented and R-reluctant researchers like me :-). A few words on why R rather than Python would seem useful to me, but are not mandatory. More interesting would be a few words of explanation on how much data your R code can handle, and whether there are strategies to adopt if the dataset is too big (which won't happen with the way you are collecting tweets, but can easily happen with other ways of collecting data).

I would also suggest a ‘further reading’ section at the end -- that would make your contribution a bit stronger and more interesting to researchers who are using your lesson as a starting point.

Particular comments

Some typos:

  • Paragraph 33 ‘aquireing’, ‘registerede’
  • Paragraph 41: ‘dispite’

There might be others; may I advise some proofreading?

In paragraphs 53, 59, 66 and 68 (I might have missed one), I would remove “(Output removed because of privacy reasons)” from the code cell, because it’s not code. Of course, it should still be stated that you removed the output for (very obvious) privacy reasons, but it should be in a text cell, not in a code cell.

You use ggplot2 once, but ggplot elsewhere. I would settle on one of the two (or explain why you use both).

I really enjoyed reading this lesson. Thank you again to the authors.

Statement

I know at least one of the authors. We are both members of the board of the Journal of Digital History.

inactinique avatar Apr 01 '22 09:04 inactinique

@inactinique Thank you very much for your comments and sorry for the late reply.

I will get back to my colleagues and incorporate your very good comments and suggestions. Thanks again!

maxodsbjerg avatar Apr 08 '22 11:04 maxodsbjerg

@maxodsbjerg Just to note that you don't need to respond until both reviews have come in and I've had a chance to summarise them. But thanks for taking a look nevertheless.

drjwbaker avatar Apr 08 '22 11:04 drjwbaker

@drjwbaker Thanks for the clarification!

maxodsbjerg avatar Apr 08 '22 12:04 maxodsbjerg

@martinmueller39 Do you need some extra time to complete this review?

drjwbaker avatar Apr 26 '22 12:04 drjwbaker

Dear @maxodsbjerg and authors,

Thank you for your patience through the peer-review process. Unfortunately, our second reviewer had to drop out at the last moment. Instead of delaying the process further, exceptionally, we decided to continue the process with some further editorial support. What follows, then, is a mix between a peer- and an editorial review.

General comments and suggestions

Thank you for writing a clear and well-defined tutorial that will, I believe, be of interest to many PH readers. A good manual on the kinds of work to be done with twitter data (and not just on how to do it) is valuable to many disciplines and researchers in the humanities, and I am sure it will be greatly appreciated.

The tutorial is well structured and is easy to follow along (both in terms of code and ease of reading). I've found a couple of typos and less clear points that I noted below in detail.

The clear definition of the workflow and the use of scalable reading more generally are a high point of the tutorial for me. There are myriad ways of technically doing scalable reading (the R method here being just one of them), but the why and wherefore of this method remain unchanged, and I think you did a stellar job putting that across.

There is still, I think, some space to improve the tutorial even further. I hope my high-level suggestions below will be of help in that regard.

The multi-author conundrum

To some extent, with collaborative papers, there is no way of escaping this, as different voices will express themselves differently. In the tutorial, however, I think there is a marked shift between sections that is, sometimes, a little distracting to the reader. This sometimes shows in ways that are less visible to the final reader of the tutorial (for example, in the .md file some paragraphs are written on a single line while others have line breaks), which is trivial, but at other times there is a considerable shift in register and tone from section to section. Some consolidation work needs to be done here, I think.

Code before explanation

This is more crucial for longer, or more complex pieces of code (I've noted them down below) -- I think it would be a benefit for the reader to see the code block before the explanation of its steps, so that there is an anchor to refer back to. Otherwise, the reader might be a little lost as to what exactly the explanation is referring to.

Conclusion/Next Steps/Further Reading

I get a sense that the tutorial ends quite abruptly and openly; I would prefer to have a short, one-paragraph conclusion recapping the work that has been done and pointing the reader to the next steps in the scalable reading method. In other words, we have the distant reading aspect, but not the close reading one. I'm not suggesting, of course, that you need to include a close reading example, but as it stands, the reader is left with the impression that there are three completely independent distant reading approaches which bear no relationship to each other. You've done some of that work throughout the tutorial, but I think a final (very short) section that recaps those points of connection and points the reader to where to take the research next would be very positive.

Line edits

(it's a long list, but most of these are quite small!)

  • l. 34 -- introduces the concept of scalable reading without explanation (only needs a small definition here, something like: ...scalable reading of structured data, a combination of close interpretation of individual texts and statistical analysis of the corpus)

  • #Lesson Aims before #Lesson Structure?

  • l. 46 -- 'The reproducible way of selecting...' good that you're introducing examples of disciplines that could use the method, but maybe also add something about its extendibility to others in the humanities.

  • l. 52 -- 'This step suggests a chronological exploration of a dataset.' -- delete, it just repeats the header.

  • l. 52 -- 'Had we worked on data from the National Gallery' -- rephrase, it implies a contradiction with what you said above (that it forms part of the discussion). I.e, 'In the case of the National Gallery data...'

  • l. 54 -- 'Had we worked on data from the National Gallery' -- rephrase

  • l. 56 -- 'Had we worked on data from the National Gallery' -- rephrase

  • #Data and Prerequisites: could not a third option be to offer a base dataset that users can get from PH to get started without having to follow other lessons?

  • #Data and Prerequisites: add another option, 'Using the rtweet package and your own twitter account, as described below'

  • l. 66 -- add a small, inline explanation of what 'packages' are.

  • l. 88 -- 'The package in from the same group' -> 'The package [comes|is|was created by] from the same group...' [typo]

  • l. 120 -- add a comment to the code, explaining what the function parameters are

  • l. 124 -- 'according to different periods in art history to which are represented the most or the least' -- a little confusing, rephrase. Perhaps: '...according to different periods in art history, in order to establish which periods are more or less represented in the National Gallery dataset'

  • l. 129 (and in the section more generally, and elsewhere in the tutorial) -- the subject changes midway through the sentence: you/we

  • ll. 164-167 -- the code should appear before the explanation, so that readers know what the explanation refers to. Somewhere between ll.142-143.

  • l. 195 -- 'beware' -> 'be aware'

  • l. 195 -- 'where we collected' -> 'when we collected'

  • l. 205 -- 'thus creating two lines for;' -> 'which creates two lines in the visualisation, one for...'

  • l. 209 -- in-line explanation of what 'aesthetics' are in this context

  • l.209 -- 'tells R, what the' -> 'tells R what the'

  • ll. 217-228 -- code should be before its explanation

  • ll. 245-150 -- explanation of the pipe operator should really come in the first coding section, particularly as you note that 'once you get a hold of this idea the remainder of the data processing will be easier to read and understand'

  • l. 255 -- 'verfied' -> 'verified'

  • ll. 367-276 -- more detailed explanation of plot construction would be good

  • l. 382 -- 'two different kinds of distant readings' -> 'two different kinds of distant reading'

  • l. 384 -- 'reading individual tweet' -> 'reading individual tweets'

  • l. 397 -- First mention of R Markdown -- needs an explanation and a reasoning

  • l. 414 -- 'are variables that changes' -> 'are variables that change'

  • ll 454-ff -- Why are we exporting the new dataset to a JSON file? i.e., why are we exporting it in the first place, and why specifically using JSON rather than a tabular format (csv, for example)? I still don't see a clear reasoning for it, though I might have missed it of course.

  • l. 496 -- 'how many likes you top-20 lies above' -> I don't really understand this sentence, there's a typo or something missing somewhere.

  • l.535 -- maybe add a note that fetching the twitter text for each url can also be automated using the API, though it is not covered in this tutorial

  • l. 540 -- 'the date of tweets is shown in a way, which is' -> 'the date of tweets is shown in a way which is'

tiagosousagarcia avatar May 04 '22 10:05 tiagosousagarcia