Keyphrase in text parser

Open manuelaugustin opened this issue 6 years ago • 0 comments

Goals

We need to build new researches that provide data for all assessments that require keyphrase matching in the tree. This excludes assessments that match the keyphrase in metadata (e.g., the title, slug, or meta description), since these won't operate on the new tree structure.
The user-facing behavior of all assessments should be identical to the pre-tree behavior. For information on the current functionality, see the SEO scoring overview.
The following assessments need custom researches:
- Keyphrase in introduction
- Keyphrase density
- Keyphrase in subheading
- Keyphrase in image alt attributes
- Keyphrase distribution
- Text competing link assessment

Base `keyphraseResearch` providing keyphrase matches in sentences

This base keyphrase research can serve as the data source for all other keyphrase-dependent researches. It runs on given leaf nodes (e.g., paragraphs, headings) and returns sentences with found keyphrases within these leaf nodes.

Specifically, it provides the following information:

- A reference to the sentence object
  - This includes indices (required for markings).
- References to the matched words
  - We need references to the individual words because in some cases we need to aggregate over sentences (e.g., for the keyphrase distribution research).
- The percentage of keyphrase words matched in the sentence
  - Required to determine whether enough words of the keyphrase were used in the sentence to constitute a match.

Example output:

Keyphrase: apple and banana

Text: "An apple an apple and a banana."

[
  {
    sentence: { Sentence Object "An apple an apple and a banana." },
    matchesKeyphrase: {
      apple: [
        { Word Object "apple" 1st instance },
        { Word Object "apple" 2nd instance },
      ],
      banana: [
        { Word Object "banana" }
      ],
      pear: []
    },
    matchesSynonyms: [
      {
        orange: [
          { Word Object "orange" 1st instance },
          { Word Object "orange" 2nd instance },
        ],
        mango: [
          { Word Object "mango" }
        ]
      },
    ],
    percentWordMatchesKeyphrase: 100 (?)
    percentWordMatchesSynonyms: [ 100, ... ] (?)
  },
  ...
]

- The matching mechanism can stay the same as in the current implementation.
- mergeChildrenResults can use the default strategy.
- See findKeywordFormsInString.js for inspiration, e.g., on how to calculate percentWordMatches.
- The research needs access to morphological forms.
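
The per-sentence matching described above could be sketched roughly as follows. This is only an illustration under assumptions: `matchSentence`, the `{ text, index }` word shape, and the `keyphraseForms` map (keyphrase content word to its morphological forms) are hypothetical names, not the existing API, and function words are assumed to be filtered out of the keyphrase beforehand.

```javascript
// Hypothetical sketch: words are plain { text, index } objects, not the real Word objects.
function matchSentence(sentenceWords, keyphraseForms) {
  const matches = {};
  const keyphraseWords = Object.keys(keyphraseForms);
  let matchedCount = 0;

  for (const keyphraseWord of keyphraseWords) {
    const forms = new Set(keyphraseForms[keyphraseWord].map((form) => form.toLowerCase()));
    // Keep references to the word objects themselves so markings can be built later.
    matches[keyphraseWord] = sentenceWords.filter((word) => forms.has(word.text.toLowerCase()));
    if (matches[keyphraseWord].length > 0) {
      matchedCount += 1;
    }
  }

  // Share of keyphrase words that have at least one match in this sentence.
  const percentWordMatches =
    keyphraseWords.length === 0 ? 0 : Math.round((matchedCount / keyphraseWords.length) * 100);
  return { matches, percentWordMatches };
}

// "An apple an apple and a banana." for the keyphrase "apple and banana".
const exampleWords = ["An", "apple", "an", "apple", "and", "a", "banana"].map(
  (text, index) => ({ text, index })
);
const exampleResult = matchSentence(exampleWords, {
  apple: ["apple", "apples"],
  banana: ["banana", "bananas"],
});
console.log(exampleResult.percentWordMatches); // 100
console.log(exampleResult.matches.apple.map((word) => word.index)); // [ 1, 3 ]
```

Keeping the word objects (rather than just counts) in `matches` is what would let later researches aggregate across sentences.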

Researches for individual assessments operating on leaf nodes

Keyphrase in introduction

Steps
- Get base research results for 1st paragraph
- Check whether there is at least one sentence with percentWordMatches: 100.
- If there is no such sentence, merge the matches of all sentence objects per keyphrase word.
- Check whether all keyphrase words have at least one match.
Needs to use aggregated keyphrase + synonym data.
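
The steps above can be sketched as follows. The input shape (one `{ matches, percentWordMatches }` entry per sentence of the first paragraph, with `matches` keyed by keyphrase word) and the returned flags are assumptions made for illustration, not a decided interface.

```javascript
// Hypothetical input: [{ matches: { keyphraseWord: [wordObjects] }, percentWordMatches }, ...]
function keyphraseInIntroduction(sentenceResults) {
  // A single sentence containing the full keyphrase is enough.
  if (sentenceResults.some((result) => result.percentWordMatches === 100)) {
    return { foundInOneSentence: true, foundInParagraph: true };
  }

  // Otherwise merge the matches of all sentences per keyphrase word.
  const merged = {};
  for (const result of sentenceResults) {
    for (const [keyphraseWord, words] of Object.entries(result.matches)) {
      merged[keyphraseWord] = (merged[keyphraseWord] || []).concat(words);
    }
  }

  // Every keyphrase word needs at least one match somewhere in the paragraph.
  const keyphraseWords = Object.keys(merged);
  const foundInParagraph =
    keyphraseWords.length > 0 && keyphraseWords.every((word) => merged[word].length > 0);
  return { foundInOneSentence: false, foundInParagraph };
}

const introResults = [
  { matches: { apple: [{ text: "apple" }], banana: [] }, percentWordMatches: 50 },
  { matches: { apple: [], banana: [{ text: "banana" }] }, percentWordMatches: 50 },
];
console.log(keyphraseInIntroduction(introResults));
// { foundInOneSentence: false, foundInParagraph: true }
```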

Keyphrase density

Steps
- Get base research result for whole text
- For each sentence, calculate how many full keyphrase occurrences there are.
  - An occurrence is counted when all words of the keyphrase are contained within the sentence.
  - A sentence can contain multiple occurrences (e.g., "The apple potato is an apple and a potato." has two occurrences of the keyphrase "apple potato").
- Return the number of occurrences.
See keywordCount.js for inspiration.
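
One possible reading of the counting rule is that the least frequent keyphrase word in a sentence limits the number of full occurrences. The sketch below assumes that reading and the same hypothetical `{ matches }` shape keyed by keyphrase word; it is not taken from keywordCount.js.

```javascript
// Hypothetical input: [{ matches: { keyphraseWord: [wordObjects] } }, ...], one entry per sentence.
function countKeyphraseOccurrences(sentenceResults) {
  let total = 0;
  for (const result of sentenceResults) {
    const counts = Object.values(result.matches).map((words) => words.length);
    // A full occurrence needs every keyphrase word once, so the rarest word limits the count.
    total += counts.length === 0 ? 0 : Math.min(...counts);
  }
  return total;
}

// "The apple potato is an apple and a potato." for the keyphrase "apple potato".
const densityResults = [
  {
    matches: {
      apple: [{ text: "apple" }, { text: "apple" }],
      potato: [{ text: "potato" }, { text: "potato" }],
    },
  },
  // A sentence with only one of the two keyphrase words does not count.
  { matches: { apple: [{ text: "apple" }], potato: [] } },
];
console.log(countKeyphraseOccurrences(densityResults)); // 2
```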

Keyphrase in subheading

Steps
- Run the base research for each subheading.
- Merge results for subheadings containing multiple sentences.
- Calculate percentWordMatches for each subheading.
- If it's a language with function word support, return the number of subheadings with 100% matches; if it's a language without function word support, return the number of subheadings with >50% matches.
Needs to use aggregated keyphrase + synonym data.
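
A rough sketch of these steps, assuming each subheading is represented by the list of base research results for its sentences. The function name, the input shape, and the `hasFunctionWordSupport` flag are hypothetical.

```javascript
// Hypothetical input: one array of sentence results per subheading,
// each sentence result shaped as { matches: { keyphraseWord: [wordObjects] } }.
function countSubheadingsWithKeyphrase(subheadings, hasFunctionWordSupport) {
  let count = 0;
  for (const sentenceResults of subheadings) {
    const allWords = new Set();
    const matchedWords = new Set();
    // Merge the results of all sentences within the subheading.
    for (const result of sentenceResults) {
      for (const [keyphraseWord, words] of Object.entries(result.matches)) {
        allWords.add(keyphraseWord);
        if (words.length > 0) {
          matchedWords.add(keyphraseWord);
        }
      }
    }
    if (allWords.size === 0) {
      continue;
    }
    const percentWordMatches = (matchedWords.size / allWords.size) * 100;
    // 100% is required with function word support, more than 50% without it.
    const isMatch = hasFunctionWordSupport ? percentWordMatches === 100 : percentWordMatches > 50;
    if (isMatch) {
      count += 1;
    }
  }
  return count;
}

const word = { text: "placeholder" };
const subheadingResults = [
  [{ matches: { apple: [word], banana: [word], pie: [word] } }], // 3 of 3 words
  [{ matches: { apple: [word], banana: [], pie: [word] } }], // 2 of 3 words
  [{ matches: { apple: [], banana: [word], pie: [] } }], // 1 of 3 words
];
console.log(countSubheadingsWithKeyphrase(subheadingResults, true)); // 1
console.log(countSubheadingsWithKeyphrase(subheadingResults, false)); // 2
```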

Keyphrase distribution

Adapt the functionality in keyphraseDistribution.js to run it on the data returned by the new keyphrase base research. This includes:
- computing a per-sentence score based on percentWordMatches
- determining continuous stretches of sentences with low per-sentence scores
Needs to use aggregated keyphrase + synonym data.
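
To illustrate the second point only, here is a simplified sketch that finds the longest run of consecutive low-scoring sentences. It uses percentWordMatches directly as the per-sentence score and takes the threshold as a parameter; both are assumptions, and the actual scoring in keyphraseDistribution.js may work differently.

```javascript
// Hypothetical input: [{ percentWordMatches: number }, ...] in document order.
function longestLowScoreStretch(sentenceResults, minimumScore) {
  let longest = { start: -1, length: 0 };
  let currentStart = -1;

  sentenceResults.forEach((result, index) => {
    if (result.percentWordMatches < minimumScore) {
      // Extend the current stretch, or open a new one.
      if (currentStart === -1) {
        currentStart = index;
      }
      const length = index - currentStart + 1;
      if (length > longest.length) {
        longest = { start: currentStart, length };
      }
    } else {
      // A sufficiently matching sentence ends the stretch.
      currentStart = -1;
    }
  });

  return longest;
}

const distributionScores = [100, 0, 0, 50, 0, 0, 0, 100].map((percentWordMatches) => ({
  percentWordMatches,
}));
console.log(longestLowScoreStretch(distributionScores, 50)); // { start: 4, length: 3 }
```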

Researches for individual assessments operating on formatting elements

The assessments below require similar functionality:

- Get a certain type of formatting element.
- Check whether there is at least 1 formatting element containing all the content words from the keyphrase.
The base research outlined above assumes that we always split text into sentences and that we check the keyphrase matches per sentence. For the assessments operating on non-leaf nodes, a pragmatic solution is to create a paragraph node with the contents of the formatting element and run the base research on this paragraph node.
- Note: we need to save a reference to the newly created paragraph node on the original formatting element, or the other way around. This link between the original and the converted data is necessary because the keyphrase research operates on the converted data, but we want to return references to the original formatting elements in the results.
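
A minimal sketch of the conversion with a back-reference, so that results can be mapped to the original formatting elements again. The node shape, the `sourceElement` property, and both function names are hypothetical; the real tree nodes will look different.

```javascript
// Hypothetical node shape for illustration only.
function toParagraphNodes(formattingElements, getText) {
  return formattingElements.map((element) => ({
    type: "Paragraph",
    text: getText(element),
    // Back-reference so results can point at the original element.
    sourceElement: element,
  }));
}

// Given one percentWordMatches value per paragraph node, return the original
// formatting elements whose converted paragraph fully matches the keyphrase.
function elementsWithFullMatch(paragraphNodes, percentPerParagraph) {
  return paragraphNodes
    .filter((node, index) => percentPerParagraph[index] === 100)
    .map((node) => node.sourceElement);
}

const images = [
  { src: "a.png", alt: "An apple and a banana" },
  { src: "b.png", alt: "A pear" },
];
const altParagraphs = toParagraphNodes(images, (image) => image.alt);
// Pretend the base research returned these percentages for the two paragraphs.
const fullyMatching = elementsWithFullMatch(altParagraphs, [100, 0]);
console.log(fullyMatching[0] === images[0]); // true
```

Storing the reference on the converted node keeps the original elements untouched; storing it the other way around would work equally well.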

Keyphrase in image alt attributes

Steps
- Get all images from image research. (to-do: make issue for image research)
- Convert the alt attributes of all images into paragraph nodes (see above).
- Run keyphrase base research on the newly created paragraph nodes.
- Return all images that have an alt attribute with percentWordMatches = 100.
Needs to use aggregated keyphrase + synonym data.

Text competing link assessment

Steps
- Get all links from link statistics research.
- Convert the link text into paragraph nodes (see above).
- Run keyphrase base research on the newly created paragraph nodes.
- Return all links with percentWordMatches = 100.
Needs to use aggregated keyphrase + synonym data.

Keyphrase-synonym aggregator

We run the base research separately for the keyphrase and the associated synonyms. For the individual researches that also require synonyms, we need to aggregate this data. It's not necessary to know whether a match was a keyphrase or a synonym match, since we don't make a distinction between these two kinds of matches in the assessment results.
- matches can be the combination of all keyphrase and synonym matches.
- For percentWordMatches, the highest value can be used.

Example output:

Keyphrase: cat and dog

Synonym: canine and feline

Text: "Here's a cat and another cat and a dog and a canine."

Output base research keyphrase:

[
  {
    sentence: { Sentence Object "Here's a cat and another cat and a dog and a canine." },
    matches: [
      { Word Object "cat" 1st instance },
      { Word Object "cat" 2nd instance },
      { Word Object "dog" }
    ],
    percentWordMatches: 100
  }
]

Output base research synonym:

[
  {
    sentence: { Sentence Object "Here's a cat and another cat and a dog and a canine." },
    matches: [
      { Word Object "canine" }
    ],
    percentWordMatches: 50
  }
]

Aggregated keyphrase & synonym data:

[
  {
    sentence: { Sentence Object "Here's a cat and another cat and a dog and a canine." },
    matches: [
      { Word Object "cat" 1st instance },
      { Word Object "cat" 2nd instance },
      { Word Object "dog" },
      { Word Object "canine" }
    ],
    percentWordMatches: 100
  }
]
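
The aggregation shown in this example could be sketched like this. It treats matches as a flat list of word objects and assumes that the keyphrase results and every synonym's results are aligned by sentence index; the function name and both assumptions are illustrative only.

```javascript
// Hypothetical inputs:
//   keyphraseResults:   [{ sentence, matches: [wordObjects], percentWordMatches }, ...]
//   synonymResultsList: one such array per synonym, aligned with keyphraseResults by index.
function aggregateKeyphraseAndSynonyms(keyphraseResults, synonymResultsList) {
  return keyphraseResults.map((keyphraseResult, index) => {
    const synonymResults = synonymResultsList.map((results) => results[index]);
    return {
      sentence: keyphraseResult.sentence,
      // Combine all keyphrase and synonym matches for this sentence.
      matches: keyphraseResult.matches.concat(...synonymResults.map((result) => result.matches)),
      // Use the highest percentage among the keyphrase and its synonyms.
      percentWordMatches: Math.max(
        keyphraseResult.percentWordMatches,
        ...synonymResults.map((result) => result.percentWordMatches)
      ),
    };
  });
}

const sentence = "Here's a cat and another cat and a dog and a canine.";
const catDogResults = [
  { sentence, matches: [{ text: "cat" }, { text: "cat" }, { text: "dog" }], percentWordMatches: 100 },
];
const canineFelineResults = [{ sentence, matches: [{ text: "canine" }], percentWordMatches: 50 }];

const aggregated = aggregateKeyphraseAndSynonyms(catDogResults, [canineFelineResults]);
console.log(aggregated[0].matches.map((match) => match.text)); // [ 'cat', 'cat', 'dog', 'canine' ]
console.log(aggregated[0].percentWordMatches); // 100
```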

Dec 20 '19 15:12 manuelaugustin
