
Analyzing Multilingual French and Russian Text using NLTK, spaCy, and Stanza

Open hawc2 opened this issue 2 years ago • 48 comments

Programming Historian in English has received a proposal for a lesson, 'Introduction to Text Analysis for Non-English and Multilingual Texts' by @ian-nai.

I have circulated this proposal for feedback within the English team. We have considered this proposal for:

  • Openness: we advocate for use of open source software, open programming languages and open datasets
  • Global access: we serve a readership working with different operating systems and varying computational resources
  • Multilingualism: we celebrate methodologies and tools that can be applied or adapted for use in multilingual research contexts
  • Sustainability: we're committed to publishing learning resources that can remain useful beyond present-day graphical user interfaces and current software versions

We are pleased to have invited @ian-nai to develop this Proposal into a Submission under the guidance of @lachapot as editor.

The Submission package should include:

  • Lesson text (written in Markdown)
  • Figures: images / plots / graphs (if using)
  • Data assets: codebooks, sample dataset (if using)

We ask @ian-nai to share their Submission package with our Publishing team by email, copying in @lachapot.

We've agreed a submission date of mid-to-late April. We ask @ian-nai to contact us if they need to revise this deadline.

When the Submission package is received, our Publishing team will process the new lesson materials, and prepare a Preview of the initial draft. They will post a comment in this Issue to provide the locations of all key files, as well as a link to the Preview where contributors can read the lesson as the draft progresses.

If we have not received the Submission package by late April, @lachapot will attempt to contact @ian-nai. If we do not receive any update, this Issue will be closed.

Our dedicated Ombudspersons are Ian Milligan (English), Silvia Gutiérrez De la Torre (español), Hélène Huet (français), and Luis Ferla (português). Please feel free to contact them at any time if you have concerns that you would like addressed by an impartial observer. Contacting the ombudspersons will have no impact on the outcome of any peer review.

hawc2 avatar Mar 23 '24 20:03 hawc2

Hello @lachapot and @ian-nai,

You can find the key files here:

You can review a preview of the lesson here:


There are a couple of small things I noticed while processing this lesson, which I will outline below:

  • [x] I believe the lesson file is missing the 'core' section of the lesson, Sample Code and Exercises (and its subheadings). I can see that this section is indeed provided in the associated Google Colab notebook, but we would want to see it in the main text.

    We’ve developed some guidelines for authors who choose to integrate codebooks in their lessons. Our aim is to support maintenance, future translatability, and flexible usability. The guidelines are based on a key understanding that we want our readers to be able to make the choice to work in Google Colab, work in their preferred alternative cloud-based development environment, or opt to run the code locally. If authors provide codebooks to accompany their lesson, we ask that:

    • Codebooks consist of the code + line comments only
    • Headings and subheadings mirror those of the lesson to support readers' navigation
    • Codebooks do not extend or replicate commentary from the lesson

    @ian-nai, when you make changes to the notebook, please share the new version with me (publishing.assistant[@]programminghistorian.org): we'll want to save a new copy of it in the lesson's /assets folder.

  • [x] Is it necessary that readers download the full corpus from Wikipedia? If so, we could consider hosting this asset directly in the lesson's assets folder. (Alternatively, if this code is specifically written to download / scrape data assets from a webpage, it is fine that the data remains outside our repository as long as the data is open access.) If not, it may be helpful to make it clearer that downloading the full corpus is optional.

charlottejmc avatar Apr 19 '24 11:04 charlottejmc

Hello Ian @ian-nai,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 2: Initial Edit. In this Phase, your editor Laura @lachapot will read your lesson, and provide some initial feedback.

(I see that Charlotte has raised a couple of queries above: (1) a missing core section in the Markdown file, and (2) whether the sample data would be useful to host on our repository, or whether downloading that dataset from the web is intended to be part of the learning actions. I imagine that Laura will have thoughts on these, and you can take the conversation forwards together).

Laura will post feedback and suggestions as a comment in this Issue, so that you can revise your draft in the following Phase 3: Revision 1.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 1 <br> Submission
Who worked on this? : Publishing Assistant (@charlottejmc) 
All  Phase 1 tasks completed? : Yes
Section Phase 2 <br> Initial Edit
Who's working on this? : Laura (@lachapot)  
Expected completion date? : May 19
Section Phase 3 <br> Revision 1
Who's responsible? : Author (@ian-nai) 
Expected timeframe? : ~30 days after feedback is received

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

anisa-hawes avatar Apr 19 '24 21:04 anisa-hawes

Hi everyone,

Thank you very much for your lesson, @ian-nai. And thank you @charlottejmc and @anisa-hawes for setting it all up. I will provide some initial feedback in about two weeks' time. Looking forward to working with you on this!

lachapot avatar Apr 19 '24 21:04 lachapot

Hi @lachapot,

Just an update to say that @ian-nai has very helpfully worked on solving the two points I flagged in my comment above.

  • [x] Ian has opted to embed the code into the Markdown file directly rather than include a separate codebook. I've now updated the lesson file with the new copy of his lesson, which he shared over email, and deleted the codebook from the assets folder.

  • [x] We confirmed that readers don't need to download the whole corpus from Wikipedia (only the excerpt from the text file which we host in the assets folder), and this is now reflected in the lesson's wording.

charlottejmc avatar Apr 24 '24 14:04 charlottejmc

Great, thank you very much for this update @charlottejmc and @ian-nai!

lachapot avatar Apr 24 '24 16:04 lachapot

Thank you very much for your lesson submission, @ian-nai! This is an exciting contribution that addresses important gaps in the field of text analysis/NLP. Here are some suggestions for revisions as we start preparing the lesson for external peer-review.

Overall, I’d suggest that there are three broad areas of revision we could focus on at this point:

  • The first is to define more specifically the scope of the lesson: what will readers learn from the lesson, and what foundational generalizable skills does it provide for them to apply to their own research projects? In essence, this is to help more clearly and directly “advertise” the lesson so that readers can more immediately recognize what the lesson is about and whether it’s relevant to them.

  • Currently, the lesson leaves quite a lot unexplained and could flesh out both the broader context and the specific methodological procedures it presents. This means delving more deeply into the broader political and technical stakes of doing text analysis in different languages, and into the core steps and methods of text analysis (e.g. digitization/OCR, preprocessing procedures, text analysis methods) and where this lesson fits within them (e.g. some preprocessing procedures are problematic when working with different languages; how do we address this?). Without this background, the lesson is a little hard to follow, and there isn't an obvious methodological narrative to guide the reader: it's difficult to understand the rationale for each step (language recognition, POS tagging, lemmatization), how the steps fit together, and how they might be useful in a text analysis workflow. Once a clearer scope is defined, some restructuring and expansion of parts of the lesson could give it more flow, laying out the different steps more fully along with the rationale for each, i.e. why these steps are useful to know about and how they fit into text analysis workflows more generally.

  • Somewhat relatedly, there is also a question about the level of difficulty of the lesson: is this an introductory lesson aimed at beginners, or an intermediate lesson that assumes some prior knowledge? Currently, it seems aimed at beginners, but background information and specialist concepts are not consistently explained in lay terms, and complex procedures are not always broken down into simple, fully explained steps (e.g. the introduction to the packages mentions pre-trained models, pipelines, and processing times, which is not necessarily beginner-friendly terminology and could either be explained in simpler, lay terms or be accompanied by links to external sources for more information). I'm assuming this lesson is aimed at beginners, so the suggestions I make below are for a beginner-level lesson.

Here are some more specific suggestions for revisions for each section of the lesson to address the areas outlined above (when I mention paragraph numbers I’m referring to the lesson preview):

Lesson Goals section

  • [ ] As mentioned above, the “Lesson Goals” could be more specific. It seems to me that what’s particularly exciting and unique about the lesson is the focus on multilingual text (specifically, text that includes both Russian and French) and this lesson shows how you can perform two fundamental preprocessing steps that are widely used in text analysis (POS tagging and lemmatization) for multilingual text. Rather than simply presenting this lesson as an introduction to text analysis, this section could be clearer about the specific goals and methods the lesson covers so that readers immediately know what specific skills are covered and why.

  • [ ] Similarly, I might suggest tweaking the title of the lesson to be a bit more descriptive and specific. For example, it might be useful to name the specific tools used, and perhaps also follow the Bender rule and explicitly name the specific languages that are addressed in the lesson.

  • [ ] Don’t forget to flag up any prerequisites here (perhaps in its own separate section), i.e. give some indication of the level of difficulty and of what users need (in terms of skills/knowledge, tools/packages, and data) to follow this lesson. Cf. for example the “Preparation” section in this lesson or this lesson. It might also make sense to move all initial installation information to this section.

Basics of Text Analysis and Working with Non-English and Multilingual Text sections

I’d suggest restructuring these two sections — perhaps merging parts of both sections and potentially also breaking them down into subsections to expand on particular points and examples — in order to set up the focus of the lesson more clearly and get more quickly to the heart of the lesson, i.e. working with multilingual text. As I understand it, this section should be an introduction to text analysis and why text analysis is a useful skill (providing examples of how people have applied it and what it could be used for), but it should also introduce this in context of multilingual text analysis and issues of language diversity in text analysis/NLP so that the reader can understand the broader issues and how the methods presented in this lesson address these issues.

  • [ ] Paragraphs 3 and 4 of “Basics of Text Analysis” could be condensed and could also offer more concrete examples of projects and applications — you could point to other Programming Historian lessons (or provide examples of other projects) that illustrate the many applications of text analysis for further reference for readers.

  • [ ] Then I’d suggest adding more general context and discussion of linguistic bias in computational text analysis, and of the challenges and considerations to take into account when working with different languages or with multilingual texts specifically. You don’t necessarily have to cover everything in detail: you can sketch out the main points and link to further reading/resources for those who want more information, but providing this crucial background knowledge for people unfamiliar with issues of language diversity in text analysis will help strengthen the narrative flow of the lesson and clarify the lesson’s own stance in relation to these issues. The information currently in bullet points in “Working with Non-English and Multilingual Text” could be rewritten as flowing prose, with some of it integrated into these contextual discussions of linguistic bias in NLP. The examples you provide (encoding, right-to-left scripts, logographic languages, etc.) could also be expanded (potentially with further reading/resources, images, and more specific examples) to illustrate more concretely the challenges people might encounter when working with different languages.

  • [ ] I’d also suggest adding here a section that introduces key steps and concepts of text analysis relevant to the lesson (e.g. laying out how parts of speech tagging and lemmatization are fundamental steps in text analysis amongst others, but that these can be difficult to realize with multilingual texts because of issues outlined above). This can be an occasion to introduce and explain any specialist vocabulary or fundamental concepts, and make clear what specific methodologies are presented in this lesson and how they might fit into a broader text analysis workflow.

Tools We’ll Cover section

  • [ ] When discussing the different packages, perhaps try to consistently add links to documentation and information on the different languages these tools work with (perhaps also note whether these packages have documentation in languages other than English if possible). Also make sure to explain in beginner terms the general features of the packages and how that might be relevant to the user’s considerations.

  • [ ] It might also help with the structure and flow of the lesson to have a few introductory sentences to this section that link back to the previous discussion and clarify why we’re comparing these packages and what the payoff is of comparing these different packages (e.g. that different libraries exist for NLP, these are widely used NLP packages, they cover different languages, they might be more or less difficult to use etc.).

Sample Code and Exercises section

  • [ ] Add the original Russian title for War and Peace with English title in brackets after the Russian in paragraph 7.

  • [ ] The sample data is only a few lines of text, so it might be clearer to say “We will take a few lines of text from…” rather than “a corpus of text”?

  • [ ] For clarity, I’d suggest using the comments in code snippets strictly for explaining the code, and moving any more contextual comments out of the code blocks and into the main body of the lesson (e.g. in the code snippet at paragraph 9, the comment “we are using minimally preprocessed excerpt..” could be moved below the snippet rather than kept inside it). Some of this more contextual information could also be explained more fully and linked back to discussions introduced in the previous sections (e.g. summarize the key points of the article you reference: what constitutes typical preprocessing steps, and how is this sometimes problematic in relation to questions of language diversity specifically?).

  • [ ] Consider splitting longer blocks of code where appropriate and, as mentioned above, moving code comments to markdown text where appropriate (e.g. code block at paragraph 24 could be split and unpacked further, similarly paragraph 28 could be split into two or three blocks, etc.).

  • [ ] Make sure that code comments describe as explicitly as possible what the code is actually doing. For example, rather than just “Russian only” in the code at paragraph 14, specify that this stores in the variable rus_sent the sentence at index 5 (the sixth sentence in the list). Similarly, when setting up spaCy, add clarifying comments breaking down the procedure: downloading the relevant model, loading it, and creating a spaCy document that contains rich linguistic information we can use for further analysis/processing (e.g. POS tagging), then link to the spaCy website and note that readers can choose the model relevant to their research. In general, keep in mind how readers might want to generalize to their own projects, and provide pointers, where possible, to help them do so.
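For instance, a more descriptive comment might look like this (a sketch with invented placeholder sentences; only the variable name rus_sent follows the lesson's code):

```python
# `sentences` stands in for the list of tokenized sentences produced
# earlier in the lesson (the values here are invented placeholders).
sentences = [
    "Il est parti.",
    "Elle est restée.",
    "Ils ont parlé.",
    "Nous avons lu.",
    "Vous avez vu.",
    "Он уехал в Москву.",
]

# Store in rus_sent the sentence at index 5 (the sixth sentence in
# the list), which is the Russian sentence we process separately.
rus_sent = sentences[5]
print(rus_sent)
```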

  • [ ] Similarly, make sure to clarify any specialist concepts and terminology by providing explanations in the lesson itself and/or by linking to external resources (e.g. tokenization, regex, etc).

  • [ ] Perhaps add in more explanation for some of the outputs (e.g. what do the lemmatization outputs show specifically?). It might also be useful for readers to have some brief discussion or at least flagging of limitations they might want to consider (e.g. how well is lemmatization performing for each language?) and perhaps pointers to how lemmatization or POS can be used in further analysis (by linking to other Programming Historian lessons for example). 

  • [ ] Consider perhaps renaming or adding in titles for your sections to bring out more explicitly the methodological narrative you’re presenting here. E.g. “Identifying Languages” could be something like “How to automatically detect different languages and scripts”… It also seems that perhaps a section title could be added, after loading and tokenizing the text, to indicate the part of the lesson that demonstrates or compares how well the different packages detect languages and the limitations when working with multilingual text.

  • [ ] One small problem with the code at paragraph 30: the output contains only the Russian POS tags (you probably need to iterate over each of the processed docs?).
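In outline, the looping fix might look like this (a library-agnostic sketch: the hypothetical process() stands in for the lesson's pipeline call and simply returns its input unchanged here):

```python
texts = ["Он уехал в Москву.", "Il est parti."]

def process(text):
    # Stand-in for an NLP pipeline call such as nlp(text);
    # returns the text unchanged, purely for illustration.
    return text

# Process each text and iterate over *each* resulting doc, so the
# output covers both languages rather than only the last one processed.
docs = [process(t) for t in texts]
for doc in docs:
    print(doc)
```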

  • [ ] For the spaCy code at paragraph 32, I get the error 'Document' object is not iterable (perhaps check the naming of your variables: you define fre_nlp but then use nlp when creating the doc…)?

Otherwise the code runs smoothly!

I hope this is helpful! Let me know if there’s anything you’d like to discuss or if you have any questions. Ideally, this first round of revisions would happen within 30 days so we can move swiftly on to the next phase, but let us know if there are any adjustments you need to make on the timeline.

Thanks again for this exciting contribution and looking forward to working on this with you!

Laura

lachapot avatar May 06 '24 23:05 lachapot

What's happening now?

Hello Ian @ian-nai. Your lesson has been moved to the next phase of our workflow which is Phase 3: Revision 1.

This Phase is an opportunity for you to revise your draft in response to @lachapot's initial feedback.

I've sent you an invitation to join us as an Outside Collaborator here on GitHub. This will give you the Write access you'll need to edit your lesson directly.

We ask authors to work on their own files with direct commits: we prefer you don't fork our repo, or use the Pull Request system to edit in ph-submissions. You can make direct commits to your file here: /en/drafts/originals/non-english-and-multilingual-text-analysis.md. @charlottejmc and I can help if you encounter any practical problems!

When you and Laura @lachapot are both happy with the revised draft, we will move forward to Phase 4: Open Peer Review.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 2 <br> Initial Edit
Who worked on this? : Editor (@lachapot) 
All  Phase 2 tasks completed? : Yes
Section Phase 3 <br> Revision 1
Who's working on this? : Author (@ian-nai)  
Expected completion date? : June 8
Section Phase 4 <br> Open Peer Review
Who's responsible? : Reviewers (TBC) 
Expected timeframe? : ~60 days after request is accepted

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

anisa-hawes avatar May 08 '24 10:05 anisa-hawes

Thank you, @lachapot and @anisa-hawes! I have made my initial revisions as a commit to the Markdown file.

ian-nai avatar May 28 '24 15:05 ian-nai

Thank you, @ian-nai!

One small practical note is that the snippet at line 23 {% include toc.html %} generates the table of contents for us, drawing upon the ## Sections and ### Subsections you've defined within the text. So we don't need to list the structure out as you have done between lines 27-40.

More of a house-style/typographical note is that we use ## Header 2 as the largest section header, which (in our style-sheet) also keeps all headings left-aligned. @charlottejmc can help us by reviewing the Preview and will make these adjustments ☺️

@lachapot will review your revisions and advise if we are ready to move onwards to the next Phase of the workflow (which will be Phase 4 Open Peer Review).

anisa-hawes avatar May 28 '24 18:05 anisa-hawes

Thanks @ian-nai and @anisa-hawes, I've made that small adjustment now.

charlottejmc avatar May 29 '24 08:05 charlottejmc

Hi @ian-nai,

Thank you for your revisions! The lesson looks great. I think you’ve addressed the major points of revision and the lesson has a clearer focus, is easier to follow, and explains key points and concepts more clearly.

There are just a few small points and typos I noticed that we might want to address before peer-review:

  • [x] typo in paragraph 1: “We will also look at ways to detect the languages…”

  • [x] typo in paragraph 6: space missing (“An excerpt of”)

  • [x] In “Basics of Text Analysis”: perhaps add a transition sentence between the introductory paragraph and the “Key Steps and Concepts” section (i.e. between paragraphs 7 and 8) that explains how the steps and concepts the lesson covers relate to text analysis more generally: e.g. simply explain that text analysis methods require certain preprocessing steps to prepare the text for analysis, and that these are especially important and/or complicated when working with multilingual text.

  • [x] perhaps edit the section title “Multilingual Text” to something more specific e.g. “Challenges Facing Multilingual Text Analysis” or similar?

  • [x] paragraph 10 line 3: I’d suggest changing “for processing other languages” to “for processing a variety of languages”

  • [x] Might it be possible to add in some citations/further reading or even images/specific examples in “Multilingual Text” section for some of the points you discuss?

  • [x] paragraph 15, last line: wouldn’t it be more accurate to say “some simple preprocessing methods on the text” than “some simple analysis methods on the text” to align more with what was stated in lesson goals?

  • [x] delete the sentence at paragraph 17 since it is repeated in the first line of the following paragraph

  • [x] paragraph 19, perhaps edit 3rd line to something like: “since we will primarily be focusing here on our specific pre-processing methods of tokenization, language detection, POS tagging, and lemmatization.”

  • [x] add a “Tokenization” section title before paragraph 22?

  • [x] typo: second bracket missing in paragraph 26 after “this documentation ).”

  • [x] In the “Conclusion” perhaps add a few sentences that reiterate more specifically what the lesson has covered and how it might fit into a text analysis workflow e.g. we covered how to tokenize text, automatically detect languages, and identify parts of speech and lemmatize text in different languages. These can be preprocessing steps to prepare for further text analyses (cf. Suggested Reading below) or may already provide some results for analyses… (or something similar)

Once these have been addressed, I’d say we’re ready to move on to the next phase of external peer-review. Anisa (@anisa-hawes) will provide details about this next phase, and I’ll be back in touch once reviewers have been confirmed.

Thanks again for all your work on this lesson! Looking forward to next steps. Laura

lachapot avatar Jun 11 '24 20:06 lachapot

Thank you, @lachapot.

Hello @ian-nai.

Let us know if you have difficulties with implementing these revisions. Charlotte and I are on-hand if you need anything 🙂

anisa-hawes avatar Jun 12 '24 17:06 anisa-hawes

Thank you, @lachapot and @anisa-hawes! I made the additional revisions in my most recent commit.

ian-nai avatar Jun 13 '24 15:06 ian-nai

Great, thank you @ian-nai! It looks good, we're ready to move on to the next stage now. Anisa (@anisa-hawes) will provide some details about the peer-review phase, and I'll reach out once reviewers have been confirmed. Thank you!

lachapot avatar Jun 13 '24 15:06 lachapot

Hello Ian @ian-nai,

Thank you for your work on the revisions ✨

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 4: Open Peer Review.

This phase will be an opportunity for you to hear feedback from peers in the community.

Laura @lachapot will invite two reviewers to read your lesson, test your code, and provide constructive feedback. In the spirit of openness, reviews will be posted as comments in this issue (unless you specifically request a closed review).

After both reviews, Laura will summarise the suggestions to clarify your priorities in Phase 5: Revision 2.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 3 <br> Revision 1
Who worked on this? : Author (@ian-nai)
All  Phase 3 tasks completed? : Yes
Section Phase 4 <br> Open Peer Review
Who's working on this? : Reviewers (@wjbmattingly + TBC)
Expected completion date? : ~60 days after request is accepted
Section Phase 5 <br> Revision 2
Who's responsible? : Author (@ian-nai)
Expected timeframe? : ~30 days after editor's summary

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

anisa-hawes avatar Jun 13 '24 17:06 anisa-hawes

Open Peer Review

During Phases 2 and 3, I provided initial feedback on this lesson, and worked with @ian-nai to complete a first round of revisions.

In Phase 4 Open Peer Review, we invite feedback from others in our community.

Welcome William Mattingly @wjbmattingly and Merve Tekgürler @mervetekgurler. By participating in this peer review process, you are contributing to the creation of a useful and sustainable technical resource for the whole community. Thank you.

Please read the lesson, test the code, and post your review as a comment in this issue by August 20.

Reviewer Guidelines:

A preview of the lesson:

Notes:

  • All participants in this discussion are advised to read and be guided by our shared Code of Conduct.
  • Members of the wider community may also choose to contribute reviews.
  • All participants must adhere to our anti-harassment policy:

Anti-Harassment Policy

This is a statement of the Programming Historian's principles and sets expectations for the tone and style of all correspondence between reviewers, authors, editors, and contributors to our public forums.

Programming Historian in English is dedicated to providing an open scholarly environment that offers community participants the freedom to thoroughly scrutinize ideas, to ask questions, make suggestions, or request clarification, but also provides a harassment-free space for all contributors to the project, regardless of gender, gender identity and expression, sexual orientation, disability, physical appearance, body size, race, age or religion, or technical experience. We do not tolerate harassment or ad hominem attacks of community participants in any form. Participants violating these rules may be expelled from the community at the discretion of the editorial board. If anyone witnesses or feels they have been the victim of the above described activity, please contact our ombudsperson Dr Ian Milligan. Thank you for helping us to create a safe space.

lachapot avatar Jul 18 '24 20:07 lachapot

Thank you @ian-nai for your amazing work on this very important topic and increasing the visibility of non-English and multilingual approaches in DH. I am grateful for the opportunity to review and support this effort.

  • [ ] paragraph 7: "Harnessing computational methods allows you to quickly perform tasks that are far more difficult to do without computational methods." -> There is a jump here from tasks to example lessons. It would improve the flow to add some examples of tasks at this point and save the relevant lesson examples for later in the text. A reader with no previous experience would find it easier to first get a sense of what text analysis involves.

  • [ ] paragraph 9: "steps" -> I would not necessarily describe these as steps but as tasks. But if we are going step by step, the list should start with tokenization. I would also add a concrete example of each here. The lesson is currently very text-heavy, and it would be easier to follow with a quick example such as Sentence 1 -> Word A, Word B, ... for tokenization, or a fully POS-tagged sentence.
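A minimal illustration of the kind of example the reviewer suggests (a library-free sketch using a naive regex split, not the lesson's NLTK/spaCy/Stanza tokenizers, which handle punctuation and clitics more carefully):

```python
import re

sentence = "Le chat dort. Кошка спит."

# Naive tokenization: pull out runs of word characters. This works
# across Latin and Cyrillic scripts because \w is Unicode-aware
# when applied to str patterns in Python 3.
tokens = re.findall(r"\w+", sentence)
print(tokens)  # ['Le', 'chat', 'dort', 'Кошка', 'спит']
```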

  • [ ] paragraph 9: "root form" -> Lemmatization reduces a word to its dictionary form. The root form of a word is not always its dictionary form: e.g. the root form of "coder" is "code", but we want to keep "coder" as is.
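The distinction could be shown with a toy lookup (a sketch with a hand-made mini-dictionary, purely illustrative; in the lesson's libraries, NLTK's WordNetLemmatizer or spaCy's token.lemma_ do this with full dictionaries):

```python
# Hand-made mini-dictionary mapping inflected forms to lemmas.
# Note that "coders" maps to "coder", its dictionary form, even
# though its morphological root is "code".
lemmas = {
    "coders": "coder",
    "coder": "coder",
    "studies": "study",
    "ran": "run",
}

def lemmatize(word):
    """Return the dictionary form of a word, or the word unchanged."""
    return lemmas.get(word, word)

print(lemmatize("coders"))   # coder (not reduced to the root "code")
print(lemmatize("studies"))  # study
```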

  • [ ] paragraph 9: "tokenization" -> Since this lesson uses Stanza, it would make sense to discuss the differences between tokenization in the regular sense described here and tokenization for LLMs, or for deep learning more broadly.

  • [ ] paragraphs 10-14: "challenges ..." -> For this section more broadly, I would suggest a reordering. First, multilingual and non-English do not mean the same thing: working with multilingual texts poses challenges, such as language detection, that working with a monolingual non-English text does not, and this does not come across very clearly at this point. Second, non-English texts are not monolithic. I would strongly urge the author to give concrete examples about Russian and French (the languages of this lesson) and then make comments about language and technology. Third, it is important to add that English is not monolithic either: dominant forms of English are well represented, but there are many Englishes in the world that do not work with the existing infrastructures for English. I would suggest looking into the work of Setsuko Yokoyama: https://hass.sutd.edu.sg/faculty/setsuko-yokoyama/
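On the language-detection point, one concrete Russian/French example could be a crude script check using Unicode character names (a stdlib sketch; real language identification goes beyond script alone, since e.g. French and English share the Latin alphabet):

```python
import unicodedata

def dominant_script(text):
    """Count Cyrillic vs. Latin letters as a crude script detector."""
    counts = {"CYRILLIC": 0, "LATIN": 0}
    for ch in text:
        if ch.isalpha():
            # Unicode names encode the script, e.g.
            # 'CYRILLIC SMALL LETTER BE', 'LATIN SMALL LETTER G'.
            name = unicodedata.name(ch, "")
            for script in counts:
                if name.startswith(script):
                    counts[script] += 1
    return max(counts, key=counts.get)

print(dominant_script("Война и мир"))     # CYRILLIC
print(dominant_script("Guerre et paix"))  # LATIN
```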

  • [ ] paragraph 11: "Support from already existing tools for non-English languages is often lacking, but is improving in quality with the introduction of a greater quantity of high quality models for processing a variety of languages" -> This sentence is vague about what support means and about how the quality and quantity of tools have changed. It leaves the impression that non-English languages are a monolith. I would suggest giving examples from the research languages, Russian and French, with clearer descriptions of how support for these two languages has evolved over time.

  • [ ] paragraph 11: "many of them specific to the script your language(s) are written in" -> I do not necessarily agree with this statement. Although script is very important, syntax also matters, especially for tasks like lemmatization. i would reformulate this paragraph or rather this section to emphasize script, syntax, and sufficient examples in training data among other issues that low-resourced languages face.

  • [ ] paragraph 11: "these languages" -> which languages? not all alphabet based languages are represented in spaCy, ie turkish

  • [ ] paragraph 15: the main difference between these three tools that must be mentioned here is that Stanza is a neural NLP pipeline. spaCy 3.0 is also supposed to have transformer-based pipelines but as far as i know NLTK is still running on predominantly statistical approaches.

  • [ ] paragraph 15: "stopwords" -> link to a definition would be useful here

  • [ ] paragraph 15: "pretrained neural models" -> it is mentioned here but the differences between a neural model and a non-neural one are not explained. Also a link to the Stanza paper could be useful: https://aclanthology.org/2020.acl-demos.14/

  • [ ] paragraph 17: there is some repetition here. the link to the source was already given above (paragraph 3-6). It would make sense to keep all the information together in paragraph 17 and highlight it as data and remove the code, description of the work and link to data in the intro.

  • [ ] paragraph 19: On a similar note, I am following the whole course sentence by sentence and i already installed the libraries in paragraph 4. I wonder if that part should also be down here. it would be better for those who are not very used to coding to have all the coding steps in one place. also it would make sense for the installation of the libraries to follow their description.

  • [ ] also consider creating a Colab notebook to accompany this piece. It would be easier than asking folx to download Python and you can have a direct way of loading the text file instead of asking the readers to download and upload

  • [ ] paragraph 20: "clean out" -> Maybe say remove instead of clean out?

  • [ ] paragraph 20: "this article" -> this is an amazing resource that contains descriptions of many relevant concepts including for example stopwords and unicode. i would suggest linking this further up where the author is describing unicode

  • [ ] paragraph 20: "throughly cleaning the text" -> what does that mean and how is it different? the text as far as i can tell is clean, as in does not contain any OCR errors or mistakes in line splitting, etc. do you mean pre-processing the text to remove punctuation?

  • [ ] paragraph 21: i would strongly suggest that the author defines what cleaning and pre-processing a text means. i would consider all of the tasks mentioned here (maybe except for tokenization) as regular tasks since this tutorial is not doing anything other than these tasks. If this was a tutorial on sentiment analysis of course these would be pre-processing tasks. But this tutorial is about syntax related tasks and semantic analysis.

  • [ ] paragraph 22: why did we remove the newline characters and not split the text by newline into a list of lines? I think it is very important to read the way that the author approached this step. The next step involves finding sentences in a continuous body of text and that is why it makes sense to create one long string from the entire text file. however this is not a requirement for all text analysis approaches. since this is an introductory tutorial, showing how the sausage is made would be very useful
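
The two approaches contrasted above, side by side on a tiny invented sample (not the lesson's actual text file):

```python
# Invented three-line sample standing in for the lesson's text file.
raw = "First line of text\nSecond line of text\nThird line"

# Approach used in the lesson: flatten into one long string, which suits
# sentence tokenization over a continuous body of text.
one_string = raw.replace("\n", " ")
print(one_string)  # "First line of text Second line of text Third line"

# Alternative: keep the line structure, which suits line-oriented analysis
# (e.g. poetry, or files where each line is a record).
lines = raw.split("\n")
print(lines)  # ['First line of text', 'Second line of text', 'Third line']
```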

  • [ ] paragraph 24: this is great but i would add some further details about the differences between splitting by punctuation and sentence tokenization.

  • [ ] paragraph 25: we can print these because we know the data. however this is not generally true. a new random text or even playing with the entire novel would mean that we cannot print the 6th sentence and get a russian sentence. it might seem like a small detail but we need a signpost here that says that this code is only applicable to this specific text and that is only because we know what sentences are which. Moreover, it would be even better to print in a for loop: for sentence in nltk_sent_tokenized: print(sentence, "\n") for better legibility and then point out which one is which. it would be more like how a user would encounter this task in the wild
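
A runnable sketch of the loop suggested above, with invented placeholder sentences standing in for the lesson's `nltk_sent_tokenized` list; `enumerate()` additionally lets you point readers at a specific sentence without hard-coding an index that only works for this one text.

```python
# Placeholder sentences (invented for illustration, not from the lesson).
nltk_sent_tokenized = [
    "Eh bien, mon prince.",         # French
    "Так говорила Анна Павловна.",  # Russian (placeholder)
    "It was in July, 1805.",        # English
]

# The reviewer's suggested printing loop, for legibility:
for sentence in nltk_sent_tokenized:
    print(sentence, "\n")

# With enumerate(), each sentence gets an index, so readers can see which
# position holds which language without relying on prior knowledge:
for i, sentence in enumerate(nltk_sent_tokenized):
    print(i, sentence)
```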

  • [ ] paragraph 25: "one entirely in French" -> this is not true. the sentence contains a Russian phrase: Non, je vous préviens, que si vous ne me dites pas, que nous avons la guerre, si vous vous permettez encore de pallier toutes les infamies, toutes les atrocités de cet Antichrist (ma parole, j’y crois) — je ne vous connais plus, vous n’êtes plus mon ami, vous n’êtes plus мой верный раб, comme vous dites.

  • [ ] paragraph 26: why does this difference matter? also can we amend the print statement to be a for loop for legibility?

  • [ ] paragraph 27: "concatenating..." -> are we actually concatenating or creating a more legible print statement? I do not see where in this code we actually add Russian to the variable spacy_rus_sent. Printing this only prints the sentence itself: print(spacy_rus_sent) i would urge against calling this concatenation because that refers to a very specific string operation which is not the case here

  • [ ] paragraph 27: why did we not change all the spacy tokens in the list spacy_sentences into strings using list comprehension? for the printing that is happening below, we rely on previous knowledge of the text and how this specific model tokenized it. that is not reproducible outside of this tutorial and again hides some of the decisions that went into this piece of code
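
A sketch of both points above: converting everything to strings up front with a list comprehension, and labelling output with an f-string rather than "concatenation". Plain strings stand in for spaCy Span objects here so the example runs without a model (with real spaCy objects, `[sent.text for sent in doc.sents]` is the idiomatic equivalent).

```python
# Placeholder sentences standing in for spaCy's doc.sents output.
spacy_sentences = ["Eh bien, mon prince.", "Так говорила Анна Павловна."]

# Convert everything once, instead of calling str() on individually known
# indices; this works for any text, not just this one.
sentence_strings = [str(sent) for sent in spacy_sentences]

for sent in sentence_strings:
    # A labelled print statement -- not string concatenation.
    print(f"Sentence: {sent}")
```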

  • [ ] paragraph 29: "tokens" -> here for example, it would have been super useful to know more about tokenization in DL sense

  • [ ] paragraph 31-32 -> the readers would benefit from a description of how this task works computationally. how does the language of a sentence get detected? why this particular algorithm and not another one?
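
For intuition about what such a description could cover: real detectors (e.g. the langdetect library underlying spacy-langdetect) compare character n-gram frequencies against per-language profiles and return probabilities. The stdlib-only sketch below is a deliberately crude stand-in, checking only the simplest signal, the writing system; it is not the lesson's algorithm.

```python
import re

# Cyrillic Unicode block (toy heuristic; real detectors use n-gram statistics).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def guess_script(sentence):
    """Classify a sentence as Cyrillic- or Latin-script by character ratio."""
    letters = [ch for ch in sentence if ch.isalpha()]
    if not letters:
        return "unknown"
    cyr = sum(bool(CYRILLIC.match(ch)) for ch in letters)
    return "cyrillic" if cyr / len(letters) > 0.5 else "latin"

print(guess_script("Так говорила Анна Павловна"))  # cyrillic
print(guess_script("Eh bien, mon prince"))         # latin
```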

  • [ ] paragraph 32 -> is there a way to print probabilities assuming that there are probs assigned to each classification?

  • [ ] code below paragraph 32 -> having a hard time running this code as is:

```
<ipython-input-13-f88542156ff6> in <cell line: 14>()
     12 # setting up our pipeline
     13 Language.factory("language_detector", func=get_lang_detector)
     14 nlp.add_pipe('language_detector', last=True)
     15
     16 # running the language detection on each sentence and printing the results
AttributeError: 'MultilingualPipeline' object has no attribute 'add_pipe'
```

when I run the same cell again, I get this:

```
ValueError                                Traceback (most recent call last)
<ipython-input-14-f88542156ff6> in <cell line: 13>()
     11
     12 # setting up our pipeline
     13 Language.factory("language_detector", func=get_lang_detector)
     14 nlp.add_pipe('language_detector', last=True)
frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in add_factory(factory_func)
    516                         name=name, func=existing_func, new_func=factory_func
    517                     )
    518                 raise ValueError(err)
    519
    520         arg_names = util.get_arg_names(factory_func)
ValueError: [E004] Can't set up pipeline component: a factory for 'language_detector' already exists. Existing factory: <function get_lang_detector at 0x7f885a95f1c0>. New factory: <function get_lang_detector at 0x7f885a95eef0>
```

  • [ ] code above paragraph 39: I think this loop is a bit confusing because it checks for punctuation after cyrillic. maybe amend it like so:

```python
for word in tokenized_sent:
    if word not in punctuation:
        if regex.search(r'\p{IsCyrillic}', word):
            cyrillic_words.append(word)
        else:
            latin_words.append(word)
```

  • [ ] paragraph 46: "preserve the punctuation" -> why? i am not sure if it is worth keeping the punctuation. the loop is very complex and punctuation adds very little to readers' learning at the current introductory level.

  • [ ] code below paragraph 46 uses the same variable names as earlier with cyrillic_words and latin_words

  • [ ] paragraph 51: "the word's lemma" -> here it is correctly defined but above it says root form

  • [ ] paragraph 56: "further text analyses" -> what are some examples?

Overall, I think this is an amazing effort but there are some major issues that need to be addressed before publication.

First of all, I would like to suggest a new framing. This text is not an introduction to text analysis for non-English and multilingual texts but something more specific: it does not introduce text analysis as a broad category of tasks and methods, but rather focuses on 3 very specific tasks and 3 tools and how they apply to 2 European languages. As such, I firmly believe that the scope and the title need to reflect the actual contents.

Second, the text needs to be reordered slightly. The introduction should include definitions of non-English and multilingual. It should also place these 3 tasks within the broader toolkits of NLP. One really important distinction is between pre-processing, cleaning, and tasks. There is a conflation between these terms throughout the text. Cleaning a text means fixing OCR mistakes and other issues related to the structure of the text, such as stray spaces. Lemmatization on its own is not a pre-processing task, but it can be one. Lemmatization is a task within the broader domain of syntactic analysis. It can be a task in and of itself (for example, the SIGMORPHON Challenge 2019) but it can very well be a pre-processing step for semantic analysis tasks. This difference needs to be clarified.

Thirdly, we need more examples of text analysis and what can be done with the results of these approaches. As it stands, it is unclear where these methods might lead and what one can do having learnt these tasks.

Ultimately it is a very valuable effort and it is clear that the author invested a lot of time and effort in putting together a very useful guide. Thank you for all your work!

mervetekgurler avatar Aug 26 '24 20:08 mervetekgurler

About publishing the lesson as a notebook, I am just wondering if there could be an accompanying Jupyter notebook posted on GitHub if the author and/or PH does not want to use Colab. It could be better for long term sustainability and the author can still set it up such that people can download and run that notebook on Colab or on their own machines. This way the code can be kept in the lesson as is and there is also an option to run it more smoothly in a notebook. I am thinking of this as an example of what I am trying to describe: https://maartengr.github.io/BERTopic/getting_started/best_practices/best_practices.html They do it with a Colab notebook here but the same can easily be achieved with a Jupyter notebook posted on GitHub.

mervetekgurler avatar Aug 26 '24 20:08 mervetekgurler

Thank you very much for your review @mervetekgurler! We really appreciate the time and effort you put into this. @ian-nai, could I please ask you to wait until we receive a review from our second reviewer before making any changes to the lesson so that we can ensure both reviewers are reviewing the same lesson. In the meantime, perhaps Anisa @anisa-hawes or Charlotte @charlottejmc might have some recommendations on what might be best practice re including an accompanying notebook for the lesson? Thank you!

lachapot avatar Aug 26 '24 22:08 lachapot

Thank you for letting me review this submission. Below you will find my comments:

  • Paragraph 5: Consider adding ! to the pip commands since many may be following along in a notebook. You do this later in Paragraph 26.
  • Paragraph 2: Fix the closing ) with spaCy version.
    • While prior knowledge of Python is not necessarily required, it will be helpful in understanding the basic structure of the code. Code for this tutorial is written in Python 3.10 and uses the NLTK (v3.8.1), spaCy(v3.7.4, and Stanza (v1.8.2) libraries to perform its text processing. This
  • Paragraph 21: Consider adding a replace for \n- to "". This can help handle instances where a word is split in the middle and separated by lines.
    • cleaned_war_and_peace = war_and_peace.replace("\n", " ") print(cleaned_war_and_peace)
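
A hedged sketch of this suggestion on an invented sample. Note the order of operations: rejoin hyphenated words before replacing the remaining newlines with spaces, otherwise the hyphen/newline pair has already been altered. This assumes words are broken as `fam-\nily` (hyphen before the newline); check your own file to see on which side of the newline the hyphen falls.

```python
# Invented sample standing in for the lesson's text file, with one word
# hyphenated across a line break.
war_and_peace = "Well, Prince, so Genoa and Lucca are now just fam-\nily estates of the Buonapartes."

cleaned = war_and_peace.replace("-\n", "")  # rejoin words split across lines
cleaned = cleaned.replace("\n", " ")        # then flatten remaining newlines
print(cleaned)
# Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.
```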
  • Paragraph 27: should be concatenating
  • Paragraph 27: Instead of converting the sentence to a string by using str(), consider using sent.text.
  • Paragraph 33: You need to add ! to pip install
  • Paragraph 33-39: I like the clever approach for detecting the language based on script. This may not always work, though. What about transliterated texts? What about languages that share the same script?
  • Paragraph 43: # downloading our Russian corpus from spaCy => Russian model, not corpus
  • Paragraph 45: It will be a challenge for students new to Python to parse what is happening in this code block, even with the comment. Consider adding some explanatory text.
  • Paragraph 52-54: Keep the order of languages consistent.
  • Paragraph 56: beneficial instead of binificial

I think this could make for a great contribution to the Programming Historian. I would recommend thinking a bit more about how this will fit within the current offerings of the website, though. This work appears to try and do two things at once: be a multilingual text analysis notebook (according to the title) and a case study on a specific multilingual text. In trying to do both, I think the author does themselves a disservice by spreading the article too thin. They clearly have a lot to offer in each area. I would recommend considering this to be two separate lessons. I would recommend maybe adding extra authors to a work on multilingual NLP broadly where those who work with different language families come together to provide some key methodological considerations when working with multilingual documents or corpora. We see some of these considerations very clearly demonstrated early in the notebook, such as the discussion of UTF-8 and tokenization in Chinese. After these quick notes, though, the notebook changes away from lessons on multilingual NLP broadly to a narrow case-study. This isn't reflected well in the current title. The majority of the notebook focuses on a specific case study and I think that these types of contributions are very helpful. If the author wants to make this the sole focus of the work, then I'd recommend changing the title a bit to reflect that.

The structure itself is well thought out. I particularly like how the author has framed it as an analysis of three different major NLP libraries: NLTK, spaCy, and Stanza. My one concern here is that this structure reflects a pre-2018 NLP landscape. I was surprised to not see transformers used for some of these tasks. Transformers are particularly suited to solving some of the key challenges presented in the article. Instead, the author has opted to use the small multilingual model from spaCy, for example. Another key thing left out is LLMs. While I'm cautious in my use of LLMs (mainly because I work with sensitive material), there are some clear use-cases here worth exploring. Even if the author determines that neither transformers nor LLMs fit this particular problem space, that would be a great contribution. Explaining why would help others.

It would also be helpful to have the outputs visible so that one can follow along without having to run the code for themselves.

Since the notebook does not need the entire text, perhaps the author could pull out a paragraph (or two) from the text and paste it inside the notebook. This would allow students to follow the material more easily without having to execute the code itself.

Since one of the purposes of this notebook is to compare the different approaches to the problem with different libraries, consider displaying the outputs from each as a markdown table with each side-by-side.

Thanks for your work on this article! I really enjoyed reading it and thank you for letting me review it. I think with some adjustments to the structure, fixing some of the typos, and making the code more consistent, this would be a great contribution to the Programming Historian.

wjbmattingly avatar Aug 27 '24 14:08 wjbmattingly

Thank you very much to both our reviewers for your thoughtful and detailed comments!

@ian-nai, I’ve collated here the comments from our reviewers so that everything is all in one place and hopefully that will make it easier for you to do your edits.

Both reviewers are enthusiastic about your lesson and, from my reading, the main points they raise revolve around:

  • specifying upfront the scope of the lesson and streamlining the lesson around that focus (i.e. a case study demonstrating how to tokenize, lemmatize, tag Parts Of Speech, and automatically detect languages in multilingual French and Russian text using NLTK; spaCy and Stanza).

  • refining the introductory sections, especially by clarifying terminology and adding in more specific examples and illustrations that relate to the focus of the lesson;

  • clarifying parts of the Exercise sections in order to make them more transparent and accessible to readers and help readers transfer the methods they are learning to their own projects (especially by adding more details on the rationale behind the code, and providing more examples of further analyses in the concluding section).

Title

  • [x] Edit the title of the lesson to more specifically reflect its focus (i.e. naming the specific methods and languages covered in the lesson).

Lesson Goals

  • [x] Consider adding a few more details in the “Lesson Goals” section about what is covered in the lesson specifically (e.g. moving some details from paragraph 6 to Lesson Goals). This might help, along with revising the title, to specify the scope of the lesson upfront.

Preparation

  • [x] Paragraph 2: Typo: Fix the closing ) with spaCy version. “While prior knowledge of Python is not necessarily required, it will be helpful in understanding the basic structure of the code. Code for this tutorial is written in Python 3.10 and uses the NLTK (v3.8.1), spaCy(v3.7.4, and Stanza (v1.8.2) libraries to perform its text processing. This”

  • [x] Paragraph 5: Consider adding ! to the pip commands since many may be following along in a notebook. You do this later in Paragraph 26.

Basics of Text Analysis and Working with Non-English and Multilingual Text

  • [x] Paragraph 7: Add specific examples of text analysis tasks (e.g. from the example lessons cited later on): “Harnessing computational methods allows you to quickly perform tasks that are far more difficult to do without computational methods." -> there is a jump from tasks to example lessons. It would be better for the flow to add some examples of tasks and keep the relevant lesson examples to a later point in text. It would be easier for the reader with no previous experience to first get a sense of text analysis.

Key Steps and Concepts of Text Analysis Relevant to the Lesson

  • [x] Paragraph 9: Terminology and restructuring: "steps" -> I would not necessarily describe these as steps but tasks. But if we are going step by step, it should start with tokenization.

  • [x] Paragraph 9: I would also add some concrete examples of each task listed here. The lesson is currently very text heavy and it would be easier to follow if there was a random example like: Sentence 1 -> Word A, Word B,... for tokenization or a fully pos-tagged sentence.

  • [x] Paragraph 9: Clarification of terminology: ”root form" -> Lemmatization reduces the word into its dictionary form. Root form of a word is not always its dictionary form, ie root form of coder is also code but we want to keep coder as is

  • [x] Paragraph 9: Clarification of terminology: "tokenization" -> since this lesson uses stanza, it would make sense to talk about the differences between tokenization in the regular sense as described here and tokenization for LLMs or deep learning more broadly.

  • [x] This might also be a good place to clarify some terminology used throughout the lesson (especially distinguishing between cleaning and preprocessing) as well as more clearly situating how the tasks covered in the lesson are situated in the broader toolkits of NLP. This might help clarify early on terminology that is used later e.g. in paragraph 21: I would strongly suggest that the author defines what cleaning and pre-processing a text means. One really important distinction is between pre-processing, cleaning, and tasks. Cleaning a text means cleaning the OCR mistakes and other issues related to the structure of the text such as weird spaces. Lemmatization on its own is not a pre-processing task but it can be. Lemmatization is a task within the broader domain of syntactic analysis. It can be a task in and of itself (for example SIG MORPHON Challenge 2019) but it can very well be a pre-processing step for semantic analysis tasks. This difference needs to be clarified. I would consider all of the tasks mentioned here (maybe except for tokenization) as regular tasks since this tutorial is not doing anything other than these tasks. If this was a tutorial on sentiment analysis of course these would be pre-processing tasks. But this tutorial is about syntax related tasks and semantic analysis.

Challenges Facing Multilingual Text Analysis

  • [x] Paragraphs 10-14: "challenges ..." Clarification of terminology and restructuring: Perhaps start by distinguishing more precisely what is meant by multilingual and non-English and their related challenges in computational text analysis/NLP: multilingual and non-English do not mean the same thing. Working with multilingual texts have some challenges like language detection but working with a monolingual non-English text does not. It does not come across very clearly at this point.

  • [x] Then, consider adding concrete examples related to the lesson (ie related to Russian and French) to highlight how different languages may face different challenges in NLP, and what those challenges might be specifically: non-English texts are not monolithic. I would strongly urge the author to give concrete examples about Russian and French (the languages of this course) and then make comments about language and technology. For example:  paragraph 11: "Support from already existing tools for non-English languages is often lacking, but is improving in quality with the introduction of a greater quantity of high quality models for processing a variety of languages" this sentence is vague about what support is and how the quality and quantity of tools changed. It leaves the impression that non-English languages are a monolith. I would suggest giving examples from the research languages, Russian and French, with more clear descriptions of how support for these two languages evolved over time.

  • [x] Finally, the final paragraphs could expand to broader reflections on the current state of the field and/or specific considerations related to questions of computational analysis of different languages, including:

    • paragraph 11: "many of them specific to the script your language(s) are written in" -> I do not necessarily agree with this statement. Although script is very important, syntax also matters, especially for tasks like lemmatization. i would reformulate this paragraph or rather this section to emphasize script, syntax, and sufficient examples in training data among other issues that low-resourced languages face.
    • paragraph 11: "these languages" -> which languages? not all alphabet based languages are represented in spaCy, ie turkish
    • paragraph 20: "this article" -> this is an amazing resource that contains descriptions of many relevant concepts including for example stopwords and unicode. i would suggest linking this further up where the author is describing unicode (Can link to this article here as well as mention it later on.)
    • It is important to also add that English is not monolithic either. Dominant forms of English are well-represented but there are many Englishes in the world that do not work with the existing infrastructures for English. I would suggest looking into Setsuko Yokoyama: https://hass.sutd.edu.sg/faculty/setsuko-yokoyama/

Tools We’ll Cover

  • [x] Paragraph 15: the main difference between these three tools that must be mentioned here is that Stanza is a neural NLP pipeline. spaCy 3.0 is also supposed to have transformer-based pipelines but as far as i know NLTK is still running on predominantly statistical approaches.

  • [x] Paragraph 15: "stopwords" -> link to a definition would be useful here

  • [ ] Paragraph 15: "pretrained neural models" -> it is mentioned here but the differences between a neural model and a non-neural one are not explained. Also a link to the Stanza paper could be useful: https://aclanthology.org/2020.acl-demos.14/

Sample Code and Exercises

  • [x] Paragraph 17: Restructuring: consider gathering together here information on data that will be used in this lesson from “Lesson Text File” in paragraph 6 to here: The link to the source was already given above (paragraph 3-6). It would make sense to keep all the information together in paragraph 17 and highlight it as data and remove the code, description of the work and link to data in the intro.

  • [x] Since the notebook does not need the entire text, perhaps the author could pull out a paragraph (or two) from the text and pasted it inside the notebook. This would allow students to follow the material more easily without having to execute the code itself.

Loading and Preparing the Text

  • [x] Paragraph 20: "clean out" -> Maybe say remove instead of clean out?

  • [x] Paragraph 20-21: check whether using terminology consistently and clarify terms used: "throughly cleaning the text" -> what does that mean and how is it different? the text as far as i can tell is clean, as in does not contain any OCR errors or mistakes in line splitting, etc. do you mean pre-processing the text to remove punctuation?

  • [x] Paragraph 21: Consider adding a replace for \n- to "". This can help handle instances where a word is split in the middle and separated by lines.

  • [x] Paragraph 22: Add in details to guide reader through the code: why did we remove the newline characters and not split the text by newline into a list of lines? I think it is very important to read the way that the author approached this step. The next step involves finding sentences in a continues body of text and that is why it makes sense to create one long string from the entire text file. however this is not a requirement for all text analysis approaches. since this is an introductory tutorial, showing how the sausage is made would be very useful

Tokenization

  • [x] Paragraph 24: this is great but i would add some further details about the differences between splitting by punctuation and sentence tokenization.

  • [x] Paragraph 25: we can print these because we know the data. however this is not generally true. a new random text or even playing with the entire novel would mean that we cannot print the 6th sentence and get a russian sentence. it might seem like a small detail but we need a signpost here that says that this code is only applicable to this specific text and that is only because we know what sentences are which. Moreover, it would be even better to print in a for loop: for sentence in nltk_sent_tokenized: print(sentence, "\n") for better legibility and then point out which one is which. it would be more like how a user would encounter this task in the wild

  • [x] Paragraph 25: "one entirely in French" -> this is not true. the sentence contains a Russian phrase: Non, je vous préviens, que si vous ne me dites pas, que nous avons la guerre, si vous vous permettez encore de pallier toutes les infamies, toutes les atrocités de cet Antichrist (ma parole, j’y crois) — je ne vous connais plus, vous n’êtes plus mon ami, vous n’êtes plus мой верный раб, comme vous dites.

  • [x] Paragraph 26: clarify why does this difference matter? also can we amend the print statement to be a for loop for legibility? 

  • [x] Paragraph 27: typo: it should be concatenating in the final comment in code block (“# concenating the French and Russian sentence and its label”)

  • [x] Paragraph 27: Instead of converting the sentence to a string by using str(), consider using sent.text.

  • [x] Paragraph 27: "concatenating..." -> are we actually concatenating or creating a more legible print statement? I do not see where in this code we actually add Russian to the variable spacy_rus_sent. Printing this only prints the sentence itself: print(spacy_rus_sent) i would urge against calling this concatenation because that refers to a very specific string operation which is not the case here

  • [x] Paragraph 27: Add in details to guide reader through the code: why did we not change all the spacy tokens in the list spacy_sentences into strings using list comprehension? for the printing that is happening below, we rely on previous knowledge of the text and how this specific model tokenized it. that is not reproducible outside of this tutorial and hides some of the decisions that went into this piece of code

  • [x] Paragraph 29: "tokens" -> here for example, it would have been super useful to know more about tokenization in DL sense

Automatically Detecting Different Languages

  • [x] Paragraph 31-32 -> the readers would benefit from a description of how this task works computationally. how does the language of a sentence get detected? why this particular algorithm and not another one?

  • [x] Paragraph 32 -> is there a way to print probabilities assuming that there are probs assigned to each classification?

  • [x] Paragraph 33-39: Provide further details or caveats so that readers can be aware of how the code is working and how they might need to modify or change it in different contexts: I like the clever approach for detecting the language based on script. This may not always work, though. What about transliterated texts? What about languages that share the same script?

  • [x] Paragraph 33: You need to add ! to pip install

  • [x] code below Paragraph 32 -> having a hard time running this code as is:

```
<ipython-input-13-f88542156ff6> in <cell line: 14>()
     12 # setting up our pipeline
     13 Language.factory("language_detector", func=get_lang_detector)
     14 nlp.add_pipe('language_detector', last=True)
     15
     16 # running the language detection on each sentence and printing the results
AttributeError: 'MultilingualPipeline' object has no attribute 'add_pipe'
```

when I run the same cell again, I get this:

```
ValueError                                Traceback (most recent call last)
<ipython-input-14-f88542156ff6> in <cell line: 13>()
     11
     12 # setting up our pipeline
     13 Language.factory("language_detector", func=get_lang_detector)
     14 nlp.add_pipe('language_detector', last=True)
frames
/usr/local/lib/python3.10/dist-packages/spacy/language.py in add_factory(factory_func)
    516                         name=name, func=existing_func, new_func=factory_func
    517                     )
    518                 raise ValueError(err)
    519
    520         arg_names = util.get_arg_names(factory_func)
ValueError: [E004] Can't set up pipeline component: a factory for 'language_detector' already exists. Existing factory: <function get_lang_detector at 0x7f885a95f1c0>. New factory: <function get_lang_detector at 0x7f885a95eef0>
```
  • [x] code above Paragraph 39: I think this loop is a bit confusing because it checks for punctuation after cyrillic. maybe amend it like so:

```python
for word in tokenized_sent:
    if word not in punctuation:
        if regex.search(r'\p{IsCyrillic}', word):
            cyrillic_words.append(word)
        else:
            latin_words.append(word)
```

Part-of-Speech Tagging

  • [x] Paragraph 43: typo in code cell: “# downloading our Russian corpus from spaCy” => Russian model

  • [x] Paragraph 45: It will be challenging for students new to Python to be able to parse what is happening in this code block, even with the comment. Consider adding some explanatory text.

  • [x] Paragraph 46: "preserve the punctuation" -> why? i am not sure if it is worth keeping the punctuation. the loop is very complex and punctuation adds very little to readers' learning at the current introductory level.

  • [x] Code below Paragraph 46 uses the same variable names as earlier with cyrillic_words and latin_words

Lemmatization

  • [x] Paragraph 51: "the word's lemma" -> here it is correctly defined but above it says root form

  • [x] Paragraph 52-54: Keep the order of languages consistent.

General Comments for Sample Code and Exercise Sections

  • [x] It would also be helpful to have the outputs visible so that readers can follow along without having to run the code themselves.

  • [x] Since one of the purposes of this notebook is to compare the different approaches to the problem with different libraries, consider displaying the outputs from each as a markdown table with each side-by-side.

Conclusion

  • [x] paragraph 56: "further text analyses" -> what are some examples? Consider providing more examples of what can be done with the results of these approaches. What might readers be able to do having learnt these tasks?

  • [x] Paragraph 56: Typo: beneficial instead of binificial

Thank you very much @ian-nai for your continued work on this lesson. Please do reach out if you have any comments or if you want to discuss or clarify anything. Looking forward to working on the revised version with you! Laura

lachapot avatar Aug 28 '24 01:08 lachapot

Hello @lachapot, @mervetekgurler and @ian-nai,

A number of our lessons do include an optional notebook to be used by readers who find it simpler or who aren't able to run the lesson's code locally. We could certainly do so here as well, by adding a .ipynb file to the lesson's assets folder and linking readers to it within the text.

Notebooks associated with lessons are hosted within our infrastructure so that we can manage them as assets and ensure their sustainability. To support this, we’ve developed some guidelines for authors who choose to integrate notebooks in their lessons. Our aim is to ensure effective maintenance, future translatability, and flexible usability.

These guidelines are based on a key understanding that we want our readers to be able to make the choice to work either in Google Colab, in their preferred alternative cloud-based development environment, or opt to run the code locally.

We ask that:

  • Notebooks consist of the code + line comments only
  • Headings and subheadings mirror those of the lesson, to support readers' navigation
  • Notebooks do not replicate nor extend commentary from the lesson

If you are interested in this option, I can help put together such a notebook by copying the headings and code blocks from the lesson text into an .ipynb file.

Do let me know, and thank you for all your work so far! ✨

charlottejmc avatar Aug 29 '24 03:08 charlottejmc

Hi @charlottejmc,

Thank you very much! I would be happy to have a notebook included with the lesson. I will be working to make changes to the Markdown file, so perhaps we should wait to copy the code blocks over into the .ipynb file until I have made all of the suggested changes.

ian-nai avatar Aug 29 '24 14:08 ian-nai

Hello Ian @ian-nai,

What's happening now?

Your lesson has been moved to the next phase of our workflow which is Phase 5: Revision 2.

This phase is an opportunity for you to revise your draft in response to the peer reviewers' feedback.

Laura @lachapot has summarised their suggestions, but feel free to ask questions if you are unsure. I'd encourage you to check off the tasks/suggestions listed in Laura's comment as you work.

Please make revisions via direct commits to your file: /en/drafts/originals/non-english-and-multilingual-text-analysis.md. @charlottejmc and I are here to help if you encounter any difficulties.

When you and Laura are both happy with the revised draft, the Managing Editor @hawc2 will read it through before we move forward to Phase 6: Sustainability + Accessibility.

%%{init: { 'logLevel': 'debug', 'theme': 'dark', 'themeVariables': {
              'cScale0': '#444444', 'cScaleLabel0': '#ffffff',
              'cScale1': '#882b4f', 'cScaleLabel1': '#ffffff',
              'cScale2': '#444444', 'cScaleLabel2': '#ffffff'
       } } }%%
timeline
Section Phase 4 <br> Open Peer Review
Who worked on this? : Reviewers (@wjbmattingly + @mervetekgurler)
All  Phase 4 tasks completed? : Yes
Section Phase 5 <br> Revision 2
Who's working on this? : Author (@ian-nai)
Expected completion date? : Oct 4
Section Phase 6 <br> Sustainability + Accessibility
Who's responsible? : Publishing Team
Expected timeframe? : 7~21 days

Note: The Mermaid diagram above may not render on GitHub mobile. Please check in via desktop when you have a moment.

anisa-hawes avatar Sep 04 '24 12:09 anisa-hawes

Hi @lachapot, @charlottejmc and @anisa-hawes,

Just letting you know that I made the additional revisions in my most recent commit. I believe I covered all of the feedback from the reviewers, but please let me know if there’s anything else you’d like me to change!

ian-nai avatar Sep 10 '24 17:09 ian-nai

Hello @ian-nai,

Thank you for making these changes. @lachapot will now take a look and confirm whether the lesson is ready for its final read-over by our Managing Editor, before moving to Phase 6: Sustainability & Accessibility.

In the meantime, I will work on preparing the associated notebook, by replicating the code and headings you've included in the lesson body.

charlottejmc avatar Sep 11 '24 06:09 charlottejmc

Hi again @ian-nai,

I've now uploaded the associated notebook to your lesson's assets folder. Could you take a look and make sure everything works as expected?

I'll leave it up to you to link the notebook from within the lesson body where you see fit – perhaps in ## Sample Code and Exercises?

I have another query about your assets, however: I've noticed you don't link to war-and-peace-excerpt.txt in the lesson, but rather to a Google Drive file (https://drive.google.com/file/d/1K5kmgqbNUFRDGD5it5foVHBgjJavdg5w/view?usp=sharing). Can I switch the link so that it points to the excerpt on our repository instead?

Thank you very much!

charlottejmc avatar Sep 12 '24 01:09 charlottejmc

Hi @ian-nai

Thank you for making your revisions so promptly! The lesson is much improved and nearly ready, but I feel there are still some points raised in the reviews that have not yet been fully addressed, especially with regard to streamlining the introductory sections around the particular focus and scope of the lesson, rather than presenting it as a general introduction to multilingual text analysis. Here are a few points I think should be considered a bit further before passing on to the Managing Editor:

Lesson Title:

  • [x] I feel like the revised title is a little too long, and it might be best to remove altogether the “Introduction to Text Analysis…” part since this can be somewhat misleading about the actual scope and content of the lesson. I’d suggest something like: “Analyzing multilingual French and Russian text using NLTK, spaCy, and Stanza” or “Analyzing multilingual text using NLTK, spaCy, and Stanza”. It’s probably best to name the specific languages covered in the lesson, but it does make it sound more awkward.

Lesson Goals

Paragraph 1: Some small language change suggestions:

  • [x] delete corpus (since we’re only analyzing one text): “begin analyzing non-English and/or multilingual text”

  • [x] change “preprocessing steps” to “preprocessing tasks” for consistency throughout the lesson

  • [x] Perhaps add “to automatically detect which languages are present” (or computationally detect if you prefer that word)?

  • [x] Paragraph 2: Perhaps move the last sentence to the start of the paragraph to make a smoother transition? “We will show how to perform the preprocessing tasks described above using three commonly used packages for text analysis and natural language processing (NLP): the Natural Language Toolkit (NLTK), spaCy, and Stanza. In doing so, we’ll go over the fundamentals of these packages, and review and compare their core features so you can become familiar with how they work and learn how to discern which tool is right for your specific use case and coding style.”

Basics of Text Analysis and Working with Non-English and Multilingual Text

  • [x] Paragraph 6: I’d suggest adding an example of text analysis using a method covered in this lesson first (and also that gives a sense of how people are using these methods in their projects): For example something like: “Harnessing computational methods allows you to quickly perform tasks that are far more difficult to do without computational methods. For example, in this lesson we cover part-of-speech tagging. This method can be used to quickly identify all verbs and their associated subjects and objects across a corpus of texts which could then be used to develop analyses of agency and subjectivity in the corpus of text (as Dennis Tenen does in his article “Distributed Agency in the Novel,” for example). In addition to the methods we cover in this lesson, other commonly-performed….”

  • [x] Paragraph 6: Can you think of examples of specific research/articles that use sentiment analysis and Named Entity Recognition? It might be helpful to link to a concrete example if readers want to look this up, and quickly gloss in the lesson how researchers have used these methods in that example.

  • [x] Paragraph 7: To avoid repetition of “perform” change to “Using text analysis methods first requires….”

Key Concepts of Text Analysis Relevant to the Lesson

I feel like the flow of this section and subsections could still be improved in order to streamline them a bit more around the specific focus of the lesson. Also a discussion of the distinctions between cleaning and processing and how this fits into text analysis pipelines more broadly is still missing. To address this, I’d suggest reorganizing parts of these sections and adding more examples specific to the lesson. Feel free to adapt as you wish, but I’d suggest something like this:

  • [x] Starting at Paragraph 7:


“Using text analysis methods first requires that we perform certain tasks which are necessary to prepare the text for computational analysis. These tasks can be especially important (and sometimes particularly challenging) when working with multilingual text.

For example, you might first need to make your text machine-readable, using methods such as Optical Character Recognition (OCR) to transform scanned images into machine-readable text. OCR can work very well for certain languages, but less so for others or for different types of text (such as handwritten texts). This means that, depending on the languages and texts you work with and the quality of the OCR method, you might need to “clean” your text – i.e. correct the errors made by OCR – in order to use it in your analyses. For an introduction to OCR and cleaning, see these Programming Historian lessons: “OCR with Google Vision API and Tesseract", “OCR and Machine Translation", and “Cleaning OCR’d text with Regular Expressions".

Once you have a clean text that is machine-readable, further tasks will be necessary in order to prepare the text for analysis. These are usually referred to as preprocessing tasks. These tasks can sometimes be implemented in their own right – for example, lemmatizing a text and analyzing the results can provide useful information – or they can be implemented as preliminary tasks for further analysis – for example, tokenizing a text prior to performing sentiment analysis. Once again, however, preprocessing tasks can often involve particular challenges and considerations depending on the types of languages and texts you are working with.

Key Concepts of Text Analysis Relevant to the Lesson

In this lesson, we focus on three key processing tasks — part-of-speech tagging, lemmatization, and tokenization — and show how these tasks can be applied to a multilingual and non-English language text.”

And then move on to the definition of key concepts (but delete paragraph 8 since it has been moved elsewhere).

For the POS paragraph:

  • [x] Perhaps add in parentheses examples of what POS are: “POS tagging involves marking each word of a text with its corresponding part-of-speech (e.g. nouns, verbs, adjectives, etc.).”

  • [x] Typo in “Part-of-Speech” paragraph: change “probabalistic” to “probabilistic”

  • [x] Could you add a screenshot image of the example from the NLTK book rather than just the tagged text? A visual illustration would be good and would help break up the flow of text a bit.

 

Challenges Facing Non-English and Multilingual Text Analysis

As mentioned above, this section could be streamlined more to dovetail with previous sections and focus in on the specific languages and methods covered in this lesson. In my suggestion here, I’ve opted to focus the section around explaining why you’re comparing different NLP libraries and why/how they handle different tasks differently. Again, feel free to modify as you see fit, but I’d suggest revising along these lines (cf. below), although some issues might still need to be addressed:

  • [x] the reviewers recommended using examples specific to the lesson, in the first paragraph I kept the Chinese example, but maybe this could be somehow replaced with an example from Russian (I don’t know Russian so unsure if this would work)

  • [x] and the title of this subsection could also be modified slightly

  • [x] Starting at Paragraph 9: 


“These concepts are presented in this lesson as practical examples of how NLTK, spaCy, and Stanza handle these fundamental processing tasks differently. The way that text analysis packages implement certain tasks can vary depending on a number of criteria (the choice of algorithm, the choice of models and training data they rely on, etc.). How well a package performs for any specific language therefore depends on the quality and availability of all these components, and packages may reproduce assumptions that align with features of the English language and that do not always transfer well to other languages. For example, default tokenizing procedures could assume that words are series of characters separated by a space. This might work well for English and other alphabet-based languages such as French, but character-based languages, such as Chinese, handle word boundaries very differently. Tokenizing a text in Chinese may therefore involve artificially inserting spaces between words, a process known as segmentation (see Melanie Walsh’s “Text Pre-Processing for Chinese” for an introduction).

As it stands, therefore, many of the resources available for learning computational methods of text analysis privilege English-language texts and corpora, often omit information necessary to begin working with non-English source material, and might not always make clear how to use or adapt these tools when working with a variety of different languages. And yet, whilst support from existing tools for non-English languages is often lacking, it is improving with the introduction of a greater number of high-quality models for processing a variety of languages. For the languages focused on in this tutorial–Russian and French–support for performing tasks such as part-of-speech tagging has been expanded with the introduction and refinement of models from spaCy and Stanza. Still, many tutorials and tools you encounter will default to or emphasize English-language compatibility in their approaches. It is also worth noting that the forms of English represented in these tools and tutorials tend to be limited to Standard English, and that other forms of the language are likewise underrepresented.

While this focus on English text can pose a challenge for working with non-English texts written in a single language, multilingual texts–or texts written in more than one language–can present their own challenges, such as detecting which language is present at a given point in the text or working with different text encodings. If methods often privilege English language assumptions, they are also often conceived to work with monolingual texts and do not perform well with texts that contain many different languages. For example, as discussed later in this lesson, the commonly recommended sentence tokenizer for NLTK (PunktSentenceTokenizer) is trained to work with only one language at a time and therefore won’t be the best option when working with multilingual text. This lesson will show how models can be applied to target specific languages within a text to maximize accuracy and avoid improperly or inaccurately parsing text. 

In this lesson, we compare NLTK, spaCy, and Stanza since they all contain models that can navigate and parse the properties of many languages. However, you may still have to adjust your approach and workflow to suit the individual needs of the language(s) and texts you are working with. There are a number of things to consider when working with computational analysis of non-English text, many of them specific to the language(s) your texts are written in. Factors such as a text’s script, syntax, and the presence of suitable algorithms for performing a task, as well as relevant and sufficient examples in training data, can all affect the results of computational methods applied to a text. In your own work, it’s always best to thoroughly think through an approach and look into the assumptions of particular methods (for example, by reading a method’s documentation) before applying them to your texts, in order to assess how that approach suits your personal research or project-based needs. Being flexible and open to changing your workflow as you go is also helpful.

For further reading on these topics, please consult the brief bibliography below:”

Tools We’ll Cover:

  • [x] Paragraph 15: typo: change “diffrent” to “different”

  • [x] Paragraph 16: In the second point on NLTK, perhaps delete parentheses: “It supports different numbers of languages for different tasks: it contains lists of stopwords for 23 languages, for example, but only has built-in support for word tokenization in 18 languages.”

Sample Code and Exercises:

  • [x] Paragraph 18: delete this paragraph since it repeats paragraph 17

  • [x] Paragraph 22-23: There’s still terminological confusion in this paragraph (with the use of preprocessing, since we’re referring to the tasks we’re covering in the lesson as preprocessing tasks), and it’s also still a little unclear why we’re removing newlines. Perhaps merge both paragraphs and edit a bit further for clarity, for example: “Now, let’s remove the newline characters. We will replace all newlines (represented as “\n” in the code) with a space (represented as “ “), assign the cleaned text to a new variable named “cleaned_war_and_peace”, and print it to check what we’ve done. Replacing the newline characters with a space will combine the text into a continuous string. We’re doing this because in the next step we want to tokenize our text into sentences so that we can assess the language of each sentence. Removing newlines will homogenize the text and ensure that the tokenizer is not misled into creating sentence splits where there shouldn’t be any. This is the only modification to the text that we will be doing for the purposes of this lesson, but for a good introduction to the different steps you can take to prepare your text for multilingual analysis, please consult this article."
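As a quick illustration of the replacement described in this suggested paragraph (the excerpt below is invented for illustration, not the lesson's actual file):

```python
# An invented multilingual excerpt standing in for the lesson's text file.
war_and_peace = ("Eh bien, mon prince.\n"
                 "Gênes et Lucques ne sont plus que des apanages.\n"
                 "Ну, что, князь?")

# Replace every newline with a space to get one continuous string,
# so the sentence tokenizer isn't misled by hard line breaks.
cleaned_war_and_peace = war_and_peace.replace("\n", " ")

print(cleaned_war_and_peace)
```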

  • [x] Paragraph 25: Maybe add “and prepared it” or similar so it flows more with previous text: “Now that we’ve read the file and prepared our text, let’s begin to process it. “

  • [x] Paragraph 26: missing word: “Now that the libraries are imported”

  • [x] Paragraph 27: typo: “pecularities” > “peculiarities”

  • [x] Paragraph 28-31: I find the flow confusing here. Perhaps rearrange and move “Let’s print three sentences we’ll be working with: one entirely in Russian, one entirely in French, and one that is in both languages. The language of the sentences will become important as we apply different methods to the sentences later in the lesson.” and related code block and output after the explanation in 30 and 31. So at Paragraph 28 it would read: “The entire text is now accessible as a list of sentences within the variable nltk_sent_tokenized. We can easily figure out which sentences ….”.

  • [x] Paragraph 29: typo in code block: “in our list of sentences”

  • [x] Paragraph 49: I’d suggest editing the language slightly for clarity, e.g. “For our particular case study here, a workaround would be to detect non-Roman script and split the string into its component languages that way.”

Otherwise I think the additional detail and including the outputs in the Exercises section make it a lot clearer and easier to follow. Thanks again for all your work on this lesson @ian-nai. Let me know if there’s anything you want to discuss or clarify further. I think we’re nearly there and once these final changes have been addressed we should be able to move the lesson on to the next step.

Thank you @charlottejmc for creating the notebook! The "Sample Code and Exercises" or even in the "Preparation" sections would be good places to link to the notebook. Laura

lachapot avatar Sep 12 '24 02:09 lachapot

Thank you very much, @charlottejmc! This looks great to me; I'll link to the notebook in ## Sample Code and Exercises. Thank you for offering to change the link to the text file; that sounds good to me, too.

ian-nai avatar Sep 12 '24 14:09 ian-nai

Thanks, @lachapot! I made these additional changes in the updated Markdown file.

ian-nai avatar Sep 13 '24 15:09 ian-nai