Text-Statistics SMOG calculation discrepancies

Hi,

Text: "June 23rd, 2015 How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate using the Blue Cross and Blue Shield brand, they are generally limited to business in a specific state or region as part of a licensing agreement with their trade group, the Blue Cross and Blue Shield Association. So when Anthem (ANTM), a major operator of Blue Cross plans, made its $184-a-share offer for Cigna (CI) to grow both health insurance businesses, it created potential hurdles when it comes to Anthem’s valuable Blue Cross brands expanding."

On readability-score.com I'm getting a value of 15.2 for SMOG, but $textStatistics->smogIndex($input) returns only 9.4. That is a big difference. Am I doing something wrong?

Jun 24 '15 14:06 srdjan-stojkovic

https://travis-ci.org/DaveChild/Text-Statistics

It appears SMOG is broken. Can anyone confirm this?

Mar 05 '16 19:03 gburtini

Yes. There seem to be many issues here.

The SMOG value here is always "normalized" (i.e. clamped) to the range [0, 12]. With that enabled you can never get 15.2, so I turned it off:

public $normalise = false;
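
A minimal sketch of what clamping to [0, 12] does to a score. The helper name clampScore is hypothetical and is not the library's own method; it only illustrates why 15.2 cannot come out while normalisation is on.

```php
<?php
// Hypothetical helper (not part of Text-Statistics): clamp a score to [min, max].
function clampScore(float $score, float $min, float $max): float
{
    return max($min, min($max, $score));
}

printf("%.1f\n", clampScore(15.2, 0, 12)); // prints 12.0
printf("%.1f\n", clampScore(9.4, 0, 12));  // prints 9.4
```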

The SMOG formula is implemented incorrectly. It takes the square root of the sum and multiplies last, but the order should be: square root, then multiplication, then addition. With the order corrected:

            Maths::bcCalc(
                Maths::bcCalc(
                    Maths::bcCalc(
                        Syllables::wordsWithThreeSyllables($strText, true, $this->strEncoding),
                        '*',
                        Maths::bcCalc(
                            30,
                            '/',
                            Text::sentenceCount($strText, $this->strEncoding)
                        )
                    ),
                    'sqrt',
                    0
                ),
                '*',
                1.043
            ),
            '+',
            3.1291
        );
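
The difference between the two orders can be sketched with plain floats instead of Maths::bcCalc. The function names are hypothetical, and smogFaulty is only an assumed reconstruction of the faulty order from the description above (add, then square root, then multiply). The inputs are the polysyllabic-word and sentence counts from the comparison tables further down.

```php
<?php
// Order described above as correct: square root, then multiply, then add.
function smogCorrected(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences) + 3.1291;
}

// Assumed reconstruction of the faulty order: add, then square root, then multiply.
function smogFaulty(int $polysyllables, int $sentences): float
{
    return 1.043 * sqrt($polysyllables * 30 / $sentences + 3.1291);
}

// Short text: 13 polysyllabic words, 2 real sentences (5 as miscounted).
printf("%.1f\n", smogCorrected(13, 2));   // prints 17.7
printf("%.1f\n", smogFaulty(13, 5));      // prints 9.4

// Longer text: 108 polysyllabic words, 32 sentences.
printf("%.1f\n", smogCorrected(108, 32)); // prints 13.6
printf("%.1f\n", smogFaulty(108, 32));    // prints 10.7
```

Under these assumptions the sketch gives the "Improved TS" and "Current TS" SMOG values in both tables.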

When the input text is cleaned, it is passed through utf8_decode, which converts UTF-8 to ISO-8859-1. Characters that have no ISO-8859-1 equivalent, such as the curly quotes (’ and “) in your text, get converted to "?" signs, and those are then interpreted as sentence terminators. So your example text has 2 sentences, but the script finds 5. I commented the call out:

//$strText = utf8_decode($strText);
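
A rough sketch of the effect on sentence counting. It does not call utf8_decode (deprecated since PHP 8.2). It imitates the lossy step by replacing every code point above U+00FF with "?". Both function names are hypothetical, and the counter is a simplified stand-in for the library's sentence logic that just counts runs of terminator characters. The sample is an abridged version of the text from the question.

```php
<?php
// Imitates the lossy conversion: code points ISO-8859-1 cannot hold become "?".
function simulateLatin1Loss(string $utf8Text): string
{
    return preg_replace('/[^\x{00}-\x{FF}]/u', '?', $utf8Text);
}

// Simplified stand-in for sentence counting: count runs of ".", "!" or "?".
function countTerminators(string $text): int
{
    return (int) preg_match_all('/[.!?]+/', $text);
}

// Abridged sample containing two curly-quote characters.
$sample = 'How Cigna deal limits Anthem’s Blue Cross brand “When health plans operate. It created potential hurdles.';

echo countTerminators($sample), "\n";                     // prints 2
echo countTerminators(simulateLatin1Loss($sample)), "\n"; // prints 4
```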

I'm not sure about this one, but I also removed all the words that contain digits. It didn't make sense to me to count "23rd" or "$184-a-share" as words.

$strText = preg_replace('/([^\.\s]*[0-9][^\.\s]*)/', '', $strText); // Remove words with numbers
$strText = preg_replace('/\'/', '', $strText); // Remove ' symbol, dunno if helps.
$strText = preg_replace('/ {2,}/', ' ', $strText); // Collapse runs of spaces, including the 3+ left after removing adjacent words (because for some reason you calculate words based on number of spaces)
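
The cleanup steps above can be tried on a short sample. This is only a sketch: the function name stripNumericWords is hypothetical, and it collapses spaces with / {2,}/ because removing two adjacent tokens ("23rd," and "2015") leaves three spaces in a row, which a single pass that replaces two spaces with one does not fully collapse.

```php
<?php
// Hypothetical wrapper around the cleanup steps discussed above.
function stripNumericWords(string $text): string
{
    $text = preg_replace('/[^.\s]*[0-9][^.\s]*/', '', $text); // drop tokens containing a digit
    $text = str_replace("'", '', $text);                      // drop apostrophes
    return trim(preg_replace('/ {2,}/', ' ', $text));         // collapse runs of spaces
}

$sample  = "June 23rd, 2015 Anthem made its \$184-a-share offer for Cigna's plans.";
$cleaned = stripNumericWords($sample);

echo $cleaned, "\n";                      // prints: June Anthem made its offer for Cignas plans.
echo count(explode(' ', $cleaned)), "\n"; // prints 8 (space-separated words)
```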

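I don't have an account on readability-score.com, so I compared against other online calculators instead. "Current TS" is Text-Statistics as released and "Improved TS" is Text-Statistics with the changes above: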

Metric               Online-Utility  LearningAndWork  StoryToolz  Current TS  Improved TS
Characters           437             -                436         427         425
Words                94              92               94          94          92
Polysyllabic words   -               14               -           13          13
Sentences            2               2                2           5           2
Syllables per word   1.48            -                1.38        1.44        1.46
ARI                  23.97           -                23.9        9.4         23.3
Gunning Fog          23.06           -                23.9        12.6        23.6
Flesch-Kincaid       20.19           -                19.1        8.7         19.5
Coleman-Liau         10.94           -                10.8        10.9        11.4
SMOG                 16.96           23.2             16.4        9.4         17.7

I also tried with my own test text, which is a bit longer.

Metric               Online-Utility  LearningAndWork  StoryToolz  Current TS  Improved TS
Characters           2919            -                2924        2899        2890
Words                604             604              592         604         585
Polysyllabic words   -               90               -           108         108
Sentences            32              32               32          32          32
Syllables per word   1.65            -                1.58        1.64        1.68
ARI                  10.77           -                11.1        10.6        11
Gunning Fog          12.32           -                13.4        14          13.9
Flesch-Kincaid       11.2            -                10.3        11.2        11.4
Coleman-Liau         11.08           -                11.6        12          13.3
SMOG                 12.8            17.7             12.1        10.7        13.6

But, yeah, there still seem to be problems. For example, the Coleman-Liau index is now higher than in the other calculators (13.3 versus 11.08 and 11.6 for the longer text).

Jul 05 '18 15:07 jee7