soupscraper
Parser error - unexpected character in JSON
Hi, I don't know whether this is a problem with soupscraper or with the skyscraper framework. Maybe this information helps to improve the framework.
When scraping starwars.soup.io I get the following error.
console output:
fetchedException in thread "main" clojure.lang.ExceptionInfo: Handler threw an error {:since "568380106", :date #inst "2015-04-16T00:00:00.000-00:00", :content "LIVE NOW! \n<a href=\"https://www.youtube.com/watch?v=4UY64GfyovE\">https://www.youtube.com/watch?v=4UY64GfyovE</a>", :date-from-header "2015-04-16", :type :video, :pages-only nil, :skyscraper.core/response {:body #object["[B" 0x1789f152 "[B@1789f152"], :headers {"Vary" "Accept-Encoding", "Link" "<https://www.soup.io/wp-json/>; rel=\"https://api.w.org/\", <https://www.soup.io/>; rel=shortlink", "CF-Cache-Status" "MISS", "Transfer-Encoding" "chunked", "Date" "Fri, 24 Jul 2020 10:17:45 GMT", "cf-request-id" "0421ed3db40000dfcf2fbb0200000001", "Expect-CT" "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"", "Cache-Control" "max-age=2678400", "Server" "cloudflare", "Content-Type" "text/html; charset=UTF-8", "Connection" "keep-alive", "X-Powered-By" ["PHP/7.4.8" "PleskLin"], "CF-RAY" "5b7ce4a92aa0dfcf-FRA"}}, :earliest nil, :id "568367933", :skyscraper.core/cache-key "soup/starwars/tv/568367933", :skyscraper.core/stage skyscraper.core/process-handler, :skyscraper.core/current-processor {:name :tv, :process-fn #object[soupscraper.core$fn__18660 0x613e4596 "soupscraper.core$fn__18660@613e4596"], :parse-fn #object[soupscraper.core$parse_json 0x252748de "soupscraper.core$parse_json@252748de"], :cache-template "soup/:soup/tv/:id"}, :url "https://starwars.soup.io/tv/show?id=568367933", :skyscraper.traverse/call-protocol :sync, :post #object[org.jsoup.nodes.Element 0x6db9be4c "<div id=\"post568367933\" class=\"post post_video author-member source-local f_nsfw f_nsfw f_post_nsfw f_blog_nsfw\" onmouseover=\"SOUP.Public.post_mouseover($(this), event);\" onmouseout=\"SOUP.Public.post_mouseout($(this), event);\"> \n <div class=\"meta\"> \n <div class=\"icons\"> \n <div class=\"icon type\">\n <a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\" 
title=\"LIVE NOW! https://www.youtube.com/watch?v=4UY64GfyovE\"></a>\n </div> \n <div class=\"icon author\"> \n <span class=\"user_container user890628\" onmouseover=\"if(window.SOUP) SOUP.Public.bubble(this, { 'classname': 'user' })\"><a class=\"url\" href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\"><img src=\"https://asset.soup.io/asset/2968/0814_0de7_32-square.jpeg\" alt=\"Dennkost\" title=\"Dennkost\" class=\"photo fn\" width=\"32\" height=\"32\"></a>\n <!--shared _user_bubble.html --> \n <div class=\"hidden bubble\"> \n <h4><a href=\"https://Dennkost.soup.io\">Dennkost</a></h4> \n <div class=\"attribution\">\n <a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\">over 5 years ago</a>\n </div> \n </div> </span> \n </div> \n </div> \n </div> \n <div class=\"content-container\"> \n <!--soup _post_content.html --> \n <!--soup _post_full.html --> \n <div class=\"content \"> \n <div class=\"embed\"> \n </div> \n <a class=\"tv_promo\" href=\"/tv#568367933/LIVE-NOW-https-www-youtube-com-watch\">Play fullscreen</a> \n <div class=\"body\">\n LIVE NOW! \n <a href=\"https://www.youtube.com/watch?v=4UY64GfyovE\">https://www.youtube.com/watch?v=4UY64GfyovE</a> \n </div> \n </div> \n <!--soup _post_actions.html --> \n <ul class=\"actionbar\"> \n <li class=\"first permalink\"><a href=\"https://starwars.soup.io/post/568367933/LIVE-NOW-https-www-youtube-com-watch\" title=\"Permalink\">#</a></li> \n <li class=\"repost\"><span class=\"inner\"> </span></li> \n <li class=\"last react\"><a href=\"#nojs\" onclick=\"SOUP.Public.open_reaction($(this), 'https://www.soup.io/remote/reaction/frame?parent_id=568367933&origin_host=' + location.host); return false\">React</a></li> \n </ul> \n </div> \n</div>"], :processor :tv, :reactions [], :soup "starwars", :reposts [], :skyscraper.traverse/handler skyscraper.core/sync-handler, :num-on-page -3}
at skyscraper.traverse$throw_handler_error_BANG_.invokeStatic(traverse.clj:250)
at skyscraper.traverse$throw_handler_error_BANG_.invoke(traverse.clj:247)
at skyscraper.traverse$wait_BANG_.invokeStatic(traverse.clj:258)
at skyscraper.traverse$wait_BANG_.invoke(traverse.clj:254)
at skyscraper.traverse$traverse_BANG_.invokeStatic(traverse.clj:275)
at skyscraper.traverse$traverse_BANG_.invoke(traverse.clj:270)
at skyscraper.core$scrape_BANG_.invokeStatic(core.clj:574)
at skyscraper.core$scrape_BANG_.doInvoke(core.clj:566)
at clojure.lang.RestFn.applyTo(RestFn.java:139)
at clojure.core$apply.invokeStatic(core.clj:665)
at clojure.core$apply.invoke(core.clj:660)
at soupscraper.core$scrape_BANG_.invokeStatic(core.clj:240)
at soupscraper.core$scrape_BANG_.invoke(core.clj:239)
at soupscraper.core$download_soup.invokeStatic(core.clj:336)
at soupscraper.core$download_soup.invoke(core.clj:328)
at soupscraper.core$_main.invokeStatic(core.clj:356)
at soupscraper.core$_main.doInvoke(core.clj:353)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at soupscraper.core.main(Unknown Source)
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (StringReader); line: 1, column: 2]
at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1840)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:712)
at com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar(ParserMinimalBase.java:637)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1917)
at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:773)
at cheshire.parse$parse.invokeStatic(parse.clj:90)
at cheshire.parse$parse.invoke(parse.clj:88)
at cheshire.core$parse_string.invokeStatic(core.clj:208)
at cheshire.core$parse_string.invoke(core.clj:194)
at cheshire.core$parse_string.invokeStatic(core.clj:205)
at cheshire.core$parse_string.invoke(core.clj:194)
at soupscraper.core$parse_json.invokeStatic(core.clj:155)
at soupscraper.core$parse_json.invoke(core.clj:154)
at skyscraper.core$process_handler.invokeStatic(core.clj:434)
at skyscraper.core$process_handler.invoke(core.clj:428)
at clojure.lang.Var.invoke(Var.java:388)
at skyscraper.core$sync_handler.invokeStatic(core.clj:461)
at skyscraper.core$sync_handler.invoke(core.clj:457)
at clojure.lang.Var.invoke(Var.java:388)
at skyscraper.traverse$worker$fn__18380$fn__18389.invoke(traverse.clj:201)
at skyscraper.traverse$worker$fn__18380.invoke(traverse.clj:201)
at clojure.core.async$thread_call$fn__6604.invoke(async.clj:484)
at clojure.lang.AFn.run(AFn.java:22)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
In the log it looks like this:
20-07-24 10:48:37 edward-teach INFO [skyscraper.core:389] - [download] Downloading http://asset.soup.io/asset/14416/2980_43ba.jpeg
20-07-24 10:48:39 edward-teach INFO [skyscraper.core:389] - [download] Downloading http://asset.soup.io/asset/11463/7069_735e.png
20-07-24 10:48:39 edward-teach ERROR [skyscraper.traverse:168] - [worker 0] Handler threw an error
java.lang.Thread.run Thread.java: 834
java.util.concurrent.ThreadPoolExecutor$Worker.run ThreadPoolExecutor.java: 628
java.util.concurrent.ThreadPoolExecutor.runWorker ThreadPoolExecutor.java: 1128
...
clojure.core.async/thread-call/fn async.clj: 484
skyscraper.traverse/worker/fn traverse.clj: 201
skyscraper.traverse/worker/fn/fn traverse.clj: 201
...
skyscraper.core/sync-handler core.clj: 461
...
skyscraper.core/process-handler core.clj: 434
soupscraper.core/parse-json core.clj: 155
cheshire.core/parse-string core.clj: 205
cheshire.core/parse-string core.clj: 208
cheshire.parse/parse parse.clj: 90
com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken ReaderBasedJsonParser.java: 773
com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue ReaderBasedJsonParser.java: 1917
com.fasterxml.jackson.core.base.ParserMinimalBase._reportUnexpectedChar ParserMinimalBase.java: 637
com.fasterxml.jackson.core.base.ParserMinimalBase._reportError ParserMinimalBase.java: 712
com.fasterxml.jackson.core.JsonParser._constructError JsonParser.java: 1840
com.fasterxml.jackson.core.JsonParseException: Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
at [Source: (StringReader); line: 1, column: 2]
location: #object[com.fasterxml.jackson.core.JsonLocation 0x4a00395 "[Source: (StringReader); line: 1, column: 2]"]
originalMessage: "Unexpected character ('<' (code 60)): expected a valid value (JSON String, Number, Array, Object or token 'null', 'true' or 'false')"
processor: #object[com.fasterxml.jackson.core.json.ReaderBasedJsonParser 0x5f97278d "com.fasterxml.jackson.core.json.ReaderBasedJsonParser@5f97278d"]
And here is the file that causes the trouble: 568380106.txt (I had to add .txt for the upload to GitHub).
Can confirm the same issue when trying to back up soup.gaf.io.
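For anyone hitting the same trace: the Jackson message (unexpected `<` at line 1, column 2) together with the `Content-Type: text/html` header in the response suggests that the `/tv/show` request returned HTML rather than JSON. Below is a minimal sketch of a guard that skips such bodies instead of letting the parser throw. The names `looks-like-json?` and `parse-or-skip` are hypothetical helpers, not part of soupscraper or skyscraper, and the sketch assumes the body is already available as a string.

```clojure
(require '[clojure.string :as str])

(defn looks-like-json?
  "True when the first non-whitespace character could start a JSON
  object or array. Hypothetical helper for illustration only."
  [s]
  (let [trimmed (str/triml (or s ""))]
    (boolean (or (str/starts-with? trimmed "{")
                 (str/starts-with? trimmed "[")))))

(defn parse-or-skip
  "Calls parse-fn on body only when body looks like JSON; otherwise
  returns nil so the caller can log the item and move on."
  [parse-fn body]
  (when (looks-like-json? body)
    (parse-fn body)))
```

With Cheshire on the classpath, this could be called as `(parse-or-skip #(cheshire.core/parse-string % true) body)`. Whether a `nil` result should be skipped or retried is a decision for the scraper; this only avoids the crash.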