stuck on all (?) assets fetched

Open Ragiss opened this issue 5 years ago • 8 comments

I'm on Windows, and the scraper is stuck at '13342/13342 assets fetched'. I've tried restarting it dozens of times already, both in Administrator mode and without, and nothing helps. The last few lines of skyscraper.log are:

20-07-21 20:39:58 DESKTOP-QDTFUSR INFO [skyscraper.core:389] - [download] Downloading https://ragis.soup.io/tv/show?id=103707932

20-07-21 20:39:58 DESKTOP-QDTFUSR WARN [soupscraper.core:206] - [download] Unexpected error clojure.lang.ExceptionInfo: clj-http: status 503 {:cached nil, :request-time 768, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 0}, :streaming? true, :http-client #object[org.apache.http.impl.nio.client.InternalHttpAsyncClient 0x6b9ade24 "org.apache.http.impl.nio.client.InternalHttpAsyncClient@6b9ade24"], :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase "Server Error", :headers {"cache-control" "no-cache", "content-type" "text/html"}, :orig-content-encoding nil, :status 503, :length -1, :body #object["[B" 0x7ab7988f "[B@7ab7988f"], :trace-redirects []}, retrying

20-07-21 20:39:58 DESKTOP-QDTFUSR WARN [soupscraper.core:206] - [download] Unexpected error clojure.lang.ExceptionInfo: clj-http: status 503 {:cached nil, :request-time 751, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 0}, :streaming? true, :http-client #object[org.apache.http.impl.nio.client.InternalHttpAsyncClient 0x5a1ab189 "org.apache.http.impl.nio.client.InternalHttpAsyncClient@5a1ab189"], :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase "Server Error", :headers {"cache-control" "no-cache", "content-type" "text/html"}, :orig-content-encoding nil, :status 503, :length -1, :body #object["[B" 0x6457482e "[B@6457482e"], :trace-redirects []}, retrying

20-07-21 20:39:58 DESKTOP-QDTFUSR INFO [skyscraper.core:389] - [download] Downloading https://ragis.soup.io/tv/show?id=103697413

Any idea what's wrong? Is it simply stuck on the very last file for some reason?

Ragiss avatar Jul 22 '20 22:07 Ragiss

Hey @Ragiss, thanks for the report!

The progress reporting is somewhat lacking here. As part of the "downloading assets" phase, soupscraper is also trying to reach out to the /tv/show endpoint to fetch YouTube IDs – it's not counting them as assets though.

You seem to have fallen into a loop of 503s, which suggests your IP might have been temporarily blacklisted. Try to abort and retry in a few minutes for now; I'll try to release a new version that copes better with this scenario later on during the week.

nathell avatar Jul 23 '20 11:07 nathell

Thanks for the answer! I tried running it again this morning, so after a break of a few hours, and nothing has changed.

Also, I'm a bit confused about the log files. There is one in the /log folder next to the .jar file, but there is also another one in the C:\Users\<user>\log\ folder (this one is bigger and more up to date, but its last entry is from yesterday, nothing from today). I assume the bigger one is the correct one. ;)

Waiting patiently for the update. :)

Ragiss avatar Jul 23 '20 13:07 Ragiss

I have the same issue. I've been stuck at 35432/35434 for several days now, so retrying didn't help this time (it did help before: it was stuck earlier, and retrying made step-by-step progress). In the logs I get the same as above: 503 for tv/show?id=... Funnily enough, it throws these on many more than the 2 missing assets.

Is there a way to create a working soup out of what I have already? There are gigs of assets in the directory, but I can't see a way to access them.

fasel avatar Jul 27 '20 19:07 fasel

Same here on Debian stretch. Stuck on "18855/18884 assets fetched" since yesterday. Even restarting and waiting for an IP refresh from the provider did not help. I agree with @fasel: a sub-command to extract the already fetched assets to the output directory would be great!

miri64 avatar Aug 07 '20 07:08 miri64

For the record, what worked for me was to just handle all errors as a 404 case (no clue of Clojure so this is just a bodge admittedly):

diff --git a/src/soupscraper/core.clj b/src/soupscraper/core.clj
index 712679c..5d5d396 100644
--- a/src/soupscraper/core.clj
+++ b/src/soupscraper/core.clj
@@ -198,27 +198,10 @@
   [error options context]
   (let [{:keys [status]} (ex-data error)
         retry? (or (nil? status) (>= status 500) (= status 429))]
-    (cond
-      (= status 404)
-      (do
         (warnf "[download] %s 404'd, dumping in empty file" (:url context))
         (core/respond-with {:headers {"content-type" "text/plain"}
                             :body (byte-array 0)}
-                           options context))
-
-      retry?
-      (do
-        (if (= status 429)
-          (do
-            (warnf "[download] Unexpected error %s, retrying after a nap" error)
-            (Thread/sleep 5000))
-          (warnf "[download] Unexpected error %s, retrying" error))
-        [context])
-
-      :otherwise
-      (do
-        (warnf "[download] Unexpected error %s, giving up" error)
-        (core/signal-error error context)))))
+                           options context)))
 
 (defn seed [{:keys [soup earliest pages-only]}]
   [{:url (format "https://%s.soup.io" soup),

miri64 avatar Aug 12 '20 19:08 miri64

@miri64 - Thanks, that helped me to finally get a browsable html page out of the assets.

MartinKei avatar Aug 19 '20 13:08 MartinKei

@miri64 - I have no idea what to do with this code :D Could you upload your .jar or explain how to do it (on Windows)? Soup is dead for real so I guess nothing more will happen in here.

Ragiss avatar Sep 20 '20 10:09 Ragiss

Sorry no :-( I already removed the code from my machine and I am not really used to working with JVM code, so not really that familiar with creating a JAR file. However, you can just copy-paste the patch I posted into a patch.txt and then use

git apply patch.txt

in a command line (assuming you cloned soupscraper with Git) and Git should do all the magic for you.
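
A patch copied from a web page can pick up whitespace changes, so it may be worth checking it before applying. The following is only a rough sketch of that idea: `apply_patch` is a made-up helper name (not part of soupscraper or Git), and it assumes the repository is already cloned and the patch file is given as an absolute path.

```shell
#!/bin/sh
# Sketch: verify a saved patch applies cleanly, then apply it.
# apply_patch is a hypothetical helper, not part of soupscraper or Git.
apply_patch() {
    repo_dir=$1    # path to the cloned repository (assumed to exist)
    patch_file=$2  # absolute path to the saved patch, e.g. /tmp/patch.txt
                   # (git -C changes directory first, so relative paths
                   # would be resolved against repo_dir)

    # --check tests whether the patch applies without modifying any files.
    if git -C "$repo_dir" apply --check "$patch_file"; then
        git -C "$repo_dir" apply "$patch_file"
        echo "patch applied"
    else
        echo "patch does not apply cleanly; nothing was changed" >&2
        return 1
    fi
}
```

It would be called as `apply_patch /path/to/soupscraper /path/to/patch.txt`. If the check fails, the working tree is left untouched, which makes it easier to tell a mangled copy-paste from a real conflict.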

miri64 avatar Sep 20 '20 14:09 miri64