python-goose

top_node algorithm? (test case included)

Open ThiemNguyen opened this issue 10 years ago • 2 comments

Hi everyone, I've been working with Goose for a couple of weeks, trying to use it in my project. I've been digging into the source code and trying out some improvements. The top_node property of an extracted article (the node that holds the main content) seems to be calculated in ContentExtractor::calculate_best_node. AFAIK, it searches for p, pre, and td elements, rejects those with too little text or a high link density, and then walks through nodes_with_text to find the best top node, using a helper method called is_boostable. The problem is that I can't understand these lines of code (lines 90-130):

        nodes_number = len(nodes_with_text)
        negative_scoring = 0
        bottom_negativescore_nodes = float(nodes_number) * 0.25

        for node in nodes_with_text:
            boost_score = float(0)
            # boost
            if(self.is_boostable(node)):
                if cnt >= 0:
                    boost_score = float((1.0 / starting_boost) * 50)
                    starting_boost += 1
            # nodes_number
            if nodes_number > 15:
                if (nodes_number - i) <= bottom_negativescore_nodes:
                    booster = float(bottom_negativescore_nodes - (nodes_number - i))
                    boost_score = float(-pow(booster, float(2)))
                    negscore = abs(boost_score) + negative_scoring
                    if negscore > 40:
                        boost_score = float(5)

            text_node = self.parser.getText(node)
            word_stats = self.stopwords_class(language=self.get_language()).get_stopword_count(text_node)
            upscore = int(word_stats.get_stopword_count() + boost_score)

            # parent node
            parent_node = self.parser.getParent(node)
            self.update_score(parent_node, upscore)
            self.update_node_count(parent_node, 1)

            if parent_node not in parent_nodes:
                parent_nodes.append(parent_node)

            # parentparent node
            parent_parent_node = self.parser.getParent(parent_node)
            if parent_parent_node is not None:
                self.update_node_count(parent_parent_node, 1)
                self.update_score(parent_parent_node, upscore / 2)
                if parent_parent_node not in parent_nodes:
                    parent_nodes.append(parent_parent_node)
            cnt += 1
            i += 1

They are not documented yet. I searched the other source files, the repo issues, and even the original Goose repo, but I still haven't figured out how it works. I also found a case where the extractor failed to detect the top_node (it returned nothing): http://trendsread.com/articles/24
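
One way to read the position-dependent part of the loop is to pull it out into a standalone function. The sketch below is not Goose code: `position_boosts` is a hypothetical name, the list of booleans stands in for the result of `is_boostable` on each node, and the starting values (`starting_boost = 1.0`, `i = 0`) are assumptions, because the excerpt does not show where they are initialised. The `cnt >= 0` check is dropped on the assumption that `cnt` starts at 0 and only grows.

```python
def position_boosts(boostable_flags):
    """Replay the boost_score arithmetic from the quoted loop.

    boostable_flags: one bool per node in nodes_with_text (hypothetical
    stand-in for self.is_boostable(node)).
    """
    nodes_number = len(boostable_flags)
    negative_scoring = 0  # never updated anywhere in the quoted excerpt
    bottom_negativescore_nodes = float(nodes_number) * 0.25
    starting_boost = 1.0  # assumed initial value, not shown in the excerpt

    scores = []
    for i, boostable in enumerate(boostable_flags):
        boost_score = 0.0
        # Boostable nodes get 50, then 25, then 50/3, ... in document order.
        if boostable:
            boost_score = (1.0 / starting_boost) * 50
            starting_boost += 1
        # With more than 15 nodes, the last quarter is penalised instead,
        # and the penalty overwrites any boost assigned above.
        if nodes_number > 15 and (nodes_number - i) <= bottom_negativescore_nodes:
            booster = bottom_negativescore_nodes - (nodes_number - i)
            boost_score = -pow(booster, 2)
            negscore = abs(boost_score) + negative_scoring
            # Very large penalties are replaced by a small positive score.
            if negscore > 40:
                boost_score = 5.0
        scores.append(boost_score)
    return scores


print(position_boosts([True, True, False]))
print(position_boosts([False] * 20)[15:])
```

Under these assumptions, early boostable nodes receive a decaying bonus, while nodes in the final quarter of a long document receive a quadratic penalty. In the excerpt this value is then added to the node's stopword count to form `upscore`, which goes to the parent in full and to the grandparent halved.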

Any ideas? Thanks!

ThiemNguyen avatar Apr 09 '15 09:04 ThiemNguyen

I don't think it's a good algorithm. It fails on a page as simple as this:

<html>
<head>
    <title>Some title</title>
    <link rel="canonical" href="http://example.org">
    <meta property="og:image" content="http://example.org/thumbnail.png">
</head>
<body>
    <div class="container">
        <div class="content">

        <div itemscope itemprop="http://schema.org/Article">
            <h1 itemprop="name">Some title</h1>
            <div itemprop="datePublished" datetime="2012-01-01T12:34:00">2012-01-01 12:34:00</div>
            <p itemprop="articleBody">
                Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque vitae justo nec tortor tincidunt dictum in in libero. Maecenas tempus, leo in vulputate tempus, ipsum libero imperdiet lectus, a congue mauris ante sed nisl. Sed sit amet ultricies orci. Curabitur sed orci libero. In viverra mi non lacus accumsan venenatis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vestibulum sit amet porttitor nulla, vel placerat tortor. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Interdum et malesuada fames ac ante ipsum primis in faucibus. Pellentesque maximus eu justo eu tincidunt. Fusce euismod, mauris vitae fringilla rutrum, dui nisl dictum est, egestas faucibus sapien ipsum vitae justo. Maecenas ac aliquet tellus. Vivamus libero neque, volutpat quis tempor vitae, auctor vitae sapien. Mauris ultricies semper lorem, eu cursus metus dignissim non. Vivamus bibendum sem sed iaculis maximus.
            </p>
        </div>

        </div>
    </div>
</body>
</html>

After extraction: article.cleaned_text == "" :(

vetal4444 avatar Apr 21 '15 10:04 vetal4444

Did you configure the Latin language and stopwords for the example above? :)
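
To illustrate why the language setting could matter: in the excerpt quoted in the issue, a node's score is `int(word_stats.get_stopword_count() + boost_score)`, so it depends on how many stopwords of the configured language the text contains. The sketch below is a simplified stand-in, not Goose's implementation. `count_stopwords` and both word sets are hypothetical, and the sets are tiny illustrative lists rather than Goose's stopword files.

```python
import re

# Hypothetical, deliberately tiny stopword sets for illustration only.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "is"}
LATIN_STOPWORDS = {"sit", "et", "ad", "per", "ac"}


def count_stopwords(text, stopwords):
    """Count how many words of `text` appear in the given stopword set."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in stopwords)


sample = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(count_stopwords(sample, ENGLISH_STOPWORDS))
print(count_stopwords(sample, LATIN_STOPWORDS))
```

The same text scores differently depending on which set is used. Whether this explains the empty `cleaned_text` in the test case above would need to be checked against Goose's real stopword lists.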

muggot avatar Jul 10 '15 08:07 muggot