Keyphrase in text parser
Goals
- We need to build new researches that provide data for all assessments that require keyphrase matching in the tree. This excludes assessments that require keyphrase matching in metadata (e.g., the title, slug, or meta description), since these won't operate on the new tree structure.
- The user-facing behavior of all assessments should be identical to the pre-tree behavior. For information on the current functionality, see the SEO scoring overview.
- This is the list of assessments for which we need custom researches:
- Keyphrase in introduction
- Keyphrase density
- Keyphrase in subheading
- Keyphrase in image alt attributes
- Keyphrase distribution
- Text competing link assessment
Base keyphraseResearch providing keyphrase matches in sentences
This base keyphrase research can serve as the data source for all other keyphrase-dependent researches. It runs on given leaf nodes (e.g., paragraphs, headings) and returns sentences with found keyphrases within these leaf nodes.
Specifically, it provides the following information:
- References to the sentence object
- this includes indices (required for markings)
- References to the words matched
- we need references to the individual words because in some cases we need to aggregate over sentences (e.g., for the keyphrase distribution research)
- Percentage of the keyphrase matched in the sentence
- required to determine whether enough words of the keyphrase were used in the sentence to constitute a match
Example output:
Keyphrase: apple and banana
Text: "An apple an apple and a banana."
[
{
sentence: { Sentence Object "An apple an apple and a banana." },
matchesKeyphrase: {
apple: [
{ Word Object "apple" 1st instance },
{ Word Object "apple" 2nd instance },
],
banana:[
{ Word Object "banana" }
],
pear: []
},
matchesSynonyms: [ {
orange: [
{ Word Object "orange" 1st instance },
{ Word Object "orange" 2nd instance },
],
mango:[
{ Word Object "mango" }
]
},
],
percentWordMatchesKeyphrase: 100 (?)
percentWordMatchesSynonyms: [ 100, ... ] (?)
}
,
...
]
- The matching mechanism can stay the same as the current implementation.
- mergeChildrenResults can use the default strategy.
- See findKeywordFormsInString.js for inspiration, e.g., for how to calculate percentWordMatches.
- Needs access to morphological forms.
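A minimal sketch of how one per-sentence result could be assembled, assuming that sentences are already tokenized into word objects and that morphological forms arrive as a plain map from each keyphrase content word to its accepted forms. The names matchSentence, words, text, and keyphraseForms are hypothetical, and the field names follow the shorter matches / percentWordMatches naming used in the steps below; the real matching should follow the current implementation.

```javascript
// Hypothetical sketch: match one sentence against the forms of each keyphrase word.
// `sentence.words` is assumed to be an array of word objects with a `text` property.
// `keyphraseForms` is assumed to map each keyphrase content word to its morphological forms.
function matchSentence( sentence, keyphraseForms ) {
    const matches = {};
    let matchedKeyphraseWords = 0;

    for ( const [ keyphraseWord, forms ] of Object.entries( keyphraseForms ) ) {
        const normalizedForms = forms.map( ( form ) => form.toLowerCase() );
        // Keep references to the word objects so callers can aggregate over sentences.
        matches[ keyphraseWord ] = sentence.words.filter(
            ( word ) => normalizedForms.includes( word.text.toLowerCase() )
        );
        if ( matches[ keyphraseWord ].length > 0 ) {
            matchedKeyphraseWords++;
        }
    }

    // Share of keyphrase words that were found at least once in this sentence.
    const total = Object.keys( keyphraseForms ).length;
    const percentWordMatches = total === 0 ? 0 : Math.round( ( matchedKeyphraseWords / total ) * 100 );

    return { sentence, matches, percentWordMatches };
}

// Usage with the keyphrase "apple and banana" ("and" is treated as a function word here).
const exampleSentence = {
    text: "An apple an apple and a banana.",
    words: [ "An", "apple", "an", "apple", "and", "a", "banana" ].map(
        ( text, index ) => ( { text, index } )
    ),
};
const exampleResult = matchSentence( exampleSentence, {
    apple: [ "apple", "apples" ],
    banana: [ "banana", "bananas" ],
} );
```

Because the result keeps the word objects themselves rather than copies, later researches can merge matches across sentences without losing the indices needed for marking.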
Researches for individual assessments operating on leaf nodes
Keyphrase in introduction
- Steps
- Get base research results for 1st paragraph
- Check whether there is at least 1 sentence with percentWordMatches: 100.
- If there is no sentence with percentWordMatches: 100, merge the matches of all sentence objects per keyphrase word.
- Check if all keyphrase words have at least one match.
- Needs to use aggregated keyphrase + synonym data.
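The steps above could be sketched as follows, assuming each base research result carries a matches map keyed by keyphrase word and a percentWordMatches value. The function name and the returned field names are hypothetical.

```javascript
// Hypothetical sketch: `results` is the base research output for the first paragraph,
// one entry per sentence, each with `matches` (keyphrase word -> matched word objects)
// and `percentWordMatches`.
function keyphraseInIntroduction( results ) {
    // A single sentence containing every keyphrase word counts as a full match.
    const foundInOneSentence = results.some( ( result ) => result.percentWordMatches === 100 );
    if ( foundInOneSentence ) {
        return { foundInOneSentence: true, foundInParagraph: true };
    }

    // Otherwise merge the matches of all sentences per keyphrase word.
    const merged = {};
    for ( const result of results ) {
        for ( const [ keyphraseWord, words ] of Object.entries( result.matches ) ) {
            merged[ keyphraseWord ] = ( merged[ keyphraseWord ] || [] ).concat( words );
        }
    }

    // Every keyphrase word needs at least one match somewhere in the paragraph.
    const keyphraseWords = Object.keys( merged );
    const foundInParagraph = keyphraseWords.length > 0 &&
        keyphraseWords.every( ( keyphraseWord ) => merged[ keyphraseWord ].length > 0 );

    return { foundInOneSentence: false, foundInParagraph };
}
```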
Keyphrase density
- Steps
- Get base research result for whole text
- For each sentence, calculate how many full keyphrase occurrences there are.
- An occurrence is counted when all words of the keyphrase are contained within the sentence.
- A sentence can contain multiple occurrences (e.g., "The apple potato is an apple and a potato." has two occurrences of the keyphrase "apple potato").
- Return the number of occurrences.
- See keywordCount.js for inspiration.
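One way to read the occurrence rule above is that the least frequent keyphrase word in a sentence limits the number of full occurrences. The sketch below takes that reading as an assumption; countKeyphraseOccurrences is a hypothetical name, and the input is assumed to be base research results with a matches map keyed by keyphrase word.

```javascript
// Hypothetical sketch: count full keyphrase occurrences over all sentences.
function countKeyphraseOccurrences( results ) {
    let count = 0;
    for ( const result of results ) {
        const matchCounts = Object.values( result.matches ).map( ( words ) => words.length );
        if ( matchCounts.length === 0 ) {
            continue;
        }
        // A full occurrence needs every keyphrase word once, so the rarest word is the limit.
        count += Math.min( ...matchCounts );
    }
    return count;
}

// "The apple potato is an apple and a potato." with the keyphrase "apple potato".
const densityExample = countKeyphraseOccurrences( [
    {
        matches: {
            apple: [ { text: "apple" }, { text: "apple" } ],
            potato: [ { text: "potato" }, { text: "potato" } ],
        },
    },
] );
```

With two matches for each of the two keyphrase words, the example sentence yields two occurrences, in line with the example in the steps.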
Keyphrase in subheading
- Steps
- Run the base research for each subheading.
- Merge results for subheadings containing multiple sentences.
- Calculate percentWordMatches for each subheading.
- If it's a language with function word support, return the number of subheadings with 100% matches; if it's a language without function word support, return the number of subheadings with >50% matches.
- Needs to use aggregated keyphrase + synonym data.
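The threshold rule in the last step could look like the sketch below. It assumes the per-sentence results of each subheading have already been merged into one entry with a single percentWordMatches value; countSubheadingsWithKeyphrase and hasFunctionWordSupport are hypothetical names.

```javascript
// Hypothetical sketch: count the subheadings that reflect the keyphrase.
// `subheadingResults` holds one merged result per subheading.
function countSubheadingsWithKeyphrase( subheadingResults, hasFunctionWordSupport ) {
    // With function word support only content words are compared, so a full match is required.
    // Without it, more than half of the keyphrase words must be found.
    const isMatch = hasFunctionWordSupport
        ? ( percent ) => percent === 100
        : ( percent ) => percent > 50;

    return subheadingResults.filter( ( result ) => isMatch( result.percentWordMatches ) ).length;
}
```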
Keyphrase distribution
- Adapt the functionality in keyphraseDistribution.js to run it on the data returned by the new keyphrase base research. This includes:
- computing a per-sentence score based on percentWordMatches
- determining continuous stretches of sentences with low per-sentence scores
- Needs to use aggregated keyphrase + synonym data.
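To illustrate only the stretch detection, the sketch below scans per-sentence results and collects continuous runs of sentences whose score falls below a threshold. The function name, the threshold default, and the use of percentWordMatches directly as the score are assumptions; keyphraseDistribution.js may compute and smooth scores differently.

```javascript
// Hypothetical sketch: find continuous stretches of low-scoring sentences.
// Returns inclusive start and end sentence indices for each stretch.
function findLowScoreStretches( results, threshold = 100 ) {
    const stretches = [];
    let start = -1;

    results.forEach( ( result, index ) => {
        const isLow = result.percentWordMatches < threshold;
        if ( isLow && start === -1 ) {
            // A new stretch begins at this sentence.
            start = index;
        }
        if ( ! isLow && start !== -1 ) {
            // The stretch ended at the previous sentence.
            stretches.push( { start, end: index - 1 } );
            start = -1;
        }
    } );

    // Close a stretch that runs until the end of the text.
    if ( start !== -1 ) {
        stretches.push( { start, end: results.length - 1 } );
    }
    return stretches;
}
```

The longest stretch can then be derived from the returned index pairs, and the sentence references in the base research results can be used for marking.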
Researches for individual assessments operating on formatting elements
The assessments below require similar functionality:
- Get a certain type of formatting element.
- Check whether there is at least 1 formatting element containing all the content words from the keyphrase.
- The base class outlined above assumes that we always split text into sentences and that we check the keyphrase matches per sentence. For the assessments operating on non-leaf nodes, a pragmatic solution is to create a paragraph node with the contents of the formatting elements and use the base research on this paragraph node.
- Note: we either need to save a reference to the newly created paragraph data on the original formatting elements, or the other way around. It's necessary to maintain a reference between the original and the converted data, because the keyphrase research operates on the converted data, but we want to return references to the original formatting elements in the results.
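A minimal sketch of the conversion and the reference back to the original elements, assuming a Map is an acceptable place to keep that reference. The paragraph node shape, getText, and runBaseResearch are hypothetical stand-ins for the real tree node class, text extraction, and base research.

```javascript
// Hypothetical sketch: wrap the text of formatting elements in paragraph nodes
// and remember which original element each paragraph was created from.
function toParagraphNodes( formattingElements, getText ) {
    const originals = new Map();
    const paragraphs = formattingElements.map( ( element ) => {
        const paragraph = { type: "Paragraph", text: getText( element ) };
        originals.set( paragraph, element );
        return paragraph;
    } );
    return { paragraphs, originals };
}

// Run the base research on the converted data, but return the original elements.
// Assumption: an element counts as a match when one of its sentences matches fully.
function elementsWithFullMatch( formattingElements, getText, runBaseResearch ) {
    const { paragraphs, originals } = toParagraphNodes( formattingElements, getText );
    return paragraphs
        .filter( ( paragraph ) =>
            runBaseResearch( paragraph ).some( ( result ) => result.percentWordMatches === 100 )
        )
        .map( ( paragraph ) => originals.get( paragraph ) );
}
```

Keeping the mapping outside the nodes avoids mutating either the formatting elements or the temporary paragraphs; storing a property on one of them, as the note suggests, would work equally well.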
Keyphrase in image alt attributes
- Steps
- Get all images from image research. (to-do: make issue for image research)
- Convert the alt attributes of all images into paragraph nodes (see above).
- Run keyphrase base research on the newly created paragraph nodes.
- Return all images that have an alt attribute with percentWordMatches = 100.
- Needs to use aggregated keyphrase + synonym data.
Text competing link assessment
- Steps
- Get all links from link statistics research.
- Convert the link text into paragraph nodes (see above).
- Run keyphrase base research on the newly created paragraph nodes.
- Return all links with percentWordMatches = 100.
- Needs to use aggregated keyphrase + synonym data.
Keyphrase-synonym aggregator
- We run the base research separately for the keyphrase and the associated synonyms. For the individual researches that also require synonyms, we need to aggregate this data. It's not necessary to know whether a match was a keyphrase or a synonym match, since we don't make a distinction between these two kinds of matches in the assessment results.
- matches can be the combination of all keyphrase and synonym matches.
- For percentWordMatches, the highest value can be used.
Example output:
Keyphrase: cat and dog
Synonym: canine and feline
Text: Here's a cat and another cat and a dog and a canine.
Output base research keyphrase:
[
{
sentence: { Sentence Object "Here's a cat and another cat and a dog and a canine." },
matches: {
{ Word Object "cat" 1st instance },
{ Word Object "cat" 2nd instance },
{ Word Object "dog" }
},
percentWordMatches: 100
}
]
Output base research synonym:
[
{
sentence: { Sentence Object "Here's a cat and another cat and a dog and a canine." },
matches: {
{ Word Object "canine" }
},
percentWordMatches: 50
}
]
Aggregated keyphrase & synonym data:
[
{
sentence: { Sentence Object "Here's a cat and another cat and a dog and a canine." },
matches: {
{ Word Object "cat" 1st instance },
{ Word Object "cat" 2nd instance },
{ Word Object "dog" },
{ Word Object "canine" }
},
percentWordMatches: 100
}
]
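The aggregation rule described in this section could be sketched as follows. It assumes that the keyphrase run and every synonym run return one result per sentence in the same order, and that matches is a flat list of word objects as in the example above; aggregateResults is a hypothetical name.

```javascript
// Hypothetical sketch: combine keyphrase results with the results of any number of synonyms.
function aggregateResults( keyphraseResults, ...synonymResultSets ) {
    return keyphraseResults.map( ( keyphraseResult, index ) => {
        // All results that belong to the same sentence.
        const perSentence = [
            keyphraseResult,
            ...synonymResultSets.map( ( resultSet ) => resultSet[ index ] ),
        ];
        return {
            sentence: keyphraseResult.sentence,
            // Keyphrase and synonym matches are not distinguished after aggregation.
            matches: perSentence.flatMap( ( result ) => result.matches ),
            // The best-matching keyphrase or synonym determines the percentage.
            percentWordMatches: Math.max( ...perSentence.map( ( result ) => result.percentWordMatches ) ),
        };
    } );
}

// The example above: keyphrase "cat and dog", synonym "canine and feline".
const aggregatedSentence = { text: "Here's a cat and another cat and a dog and a canine." };
const aggregated = aggregateResults(
    [ { sentence: aggregatedSentence, matches: [ { text: "cat" }, { text: "cat" }, { text: "dog" } ], percentWordMatches: 100 } ],
    [ { sentence: aggregatedSentence, matches: [ { text: "canine" } ], percentWordMatches: 50 } ]
);
```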