boilerpipe issues

How to debug the result?

I am just interested to know: if a block has been removed, what is the reason? As I see in the source code, each block is labelled for different conditions. How...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Different results when using the web API and the source API?

The result for the same page differs from the one returned by the web API. For example, consider the following link: http://boilerpipe-web.appspot.com/extract?url=http%3A%2F%2F1tajrobeh.blog.ir%2F&extractor=ArticleExtractor&output=html&extractImages= I used ArticleExtractor in version 1.2.0 but the result...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Unsupported content type: null

1

Hi, I am new to this extractor. While trying to run a simple extractor using only boilerpipe-1.2.1.jar, I am getting an unsupported content type error....

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Boilerpipe is conflicting with CyberNeko library

1

What steps will reproduce the problem? 1. If boilerpipe is at a higher precedence than the CyberNeko library, it will cause parsing issues on user input with unbalanced tags...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Performance issues with UnicodeTokenizer

What steps will reproduce the problem? 1. Call ArticleExtractor.getInstance().getText() on the example data (Stability.html) What is the expected output? What do you see instead? The extraction takes a very...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Missing ImageExtractor in downloadable 1.2 jar file

What steps will reproduce the problem? 1. Missing de.l3s.boilerpipe.sax.ImageExtractor What is the expected output? What do you see instead? Rebuilding the jar from source includes the missing de.l3s.boilerpipe.sax.ImageExtractor class file....

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

IllegalArgumentException for many web pages

With boilerpipe-1.2.0.jar, ArticleExtractor.INSTANCE.getText(new java.net.URL("http://t.co/3RplOLjc")) produces: ERROR java.lang.IllegalArgumentException: protocol = http host = null at de.l3s.boilerpipe.sax.HTMLFetcher.fetch (HTMLFetcher.java:33) at de.l3s.boilerpipe.extractors.ExtractorBase.getText (ExtractorBase.java:87). This happens for many other URLs, e.g. http://t.co/5vuYimwn http://t.co/Dy5yolLs http://t.co/ShWhtFjP...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Fails to extract main content on some pages, gets footnote instead

What steps will reproduce the problem? 1. Extract content from the page (in Chinese) with ArticleExtractor: http://www.ccgp.gov.cn/cggg/zybx/zbgg/201407/t20140731_3655909.shtml What is the expected output? What do you see instead? Footnote is...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

Incomplete extraction of article

What steps will reproduce the problem? 1. Give the URL as: http://www.newyorker.com/news/amy-davidson/shattered-school-gaza-2 2. Keep the extractor strategy as article extractor 3. Extract What is the expected output? What do you see...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated

It's not working for a news site

1

What steps will reproduce the problem? 1. String content = CommonExtractors.DEFAULT_EXTRACTOR.getText(new URL("http://www.nytimes.com/2014/06/06/business/gm-ignition-switch-internal-recall-investigation-report.html?hp")); 2. System.out.println(content); 3. It prints nothing. When I run with the above URL, it's not extracting anything. I have...

GoogleCodeExporter

Type-Defect

Priority-Medium

auto-migrated
