Extraction Issue

Open • vinylrichie opened this issue 5 years ago • 1 comment

Hello @kohlschuetter ,

First off, I have to say, Boilerpipe is AMAZING! Thank you for your work on this.

In a few cases, I am running into an extraction issue. With the code from GitHub, there are some articles where the extraction starts late. For example, on https://en.wikipedia.org/wiki/New_York_City the output starts at "Further information: Police surveillance in New York City and Crime in New York City". However, when I check that same article on https://boilerpipe-web.appspot.com/, the web API always returns the full text. I've been banging my head against the wall trying to figure out what I was doing wrong, and figured I should message the inventor. The only two explanations I could think of are: 1) I am totally missing something, or 2) the web API might be running a slightly different version. Do you know what might be going on here?

Hope you are having a great weekend!

Best, Kevin

Jan 31 '21 01:01 vinylrichie

I'm facing some issues with the ArticleExtractor producing completely different results for two pages that have really similar HTML:

https://www.posb.com.sg/personal/deposits/savings-accounts/emysavings-account
https://www.dbs.com.sg/personal/deposits/savings-accounts/mysavings-account

When I use the DefaultExtractor, the outputs for the two pages are 96% similar. With ArticleExtractor, however, the outputs are completely different. Any ideas why?
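
For anyone trying to reproduce this: the comment does not say how the 96% figure was computed. One possible way to compare two extractor outputs is a Jaccard similarity over their word sets. The following is only a minimal sketch, assuming the two extracted texts are already available as plain strings; the class and method names (`ExtractionSimilarity`, `tokens`, `jaccard`) are hypothetical and not part of Boilerpipe.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Boilerpipe: compares two extracted texts.
public class ExtractionSimilarity {

    // Lowercase the text and split it into a set of distinct word tokens.
    public static Set<String> tokens(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toSet());
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|, a value between 0.0 and 1.0.
    public static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0; // two empty texts are treated as identical
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Placeholder strings; in practice these would be the texts
        // returned by the extractor for each of the two pages.
        String pageA = "Open a savings account and earn interest every month";
        String pageB = "Open a savings account and earn bonus interest every month";
        System.out.printf("similarity = %.2f%n", jaccard(pageA, pageB));
    }
}
```

A word-set measure ignores word order and repetition, so it only gives a rough signal. It is still enough to tell "nearly identical" apart from "completely different" when comparing the DefaultExtractor and ArticleExtractor outputs for the two pages.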

Dec 11 '23 18:12 RenanMoreiraDK