stuck on all (?) assets fetched
I'm on Windows - the scraper is stuck at '13342/13342 assets fetched'. I've tried restarting it dozens of times already, both in Administrator mode and without, and nothing helps. The last few lines of skyscraper.log are:
20-07-21 20:39:58 DESKTOP-QDTFUSR INFO [skyscraper.core:389] - [download] Downloading https://ragis.soup.io/tv/show?id=103707932
20-07-21 20:39:58 DESKTOP-QDTFUSR WARN [soupscraper.core:206] - [download] Unexpected error clojure.lang.ExceptionInfo: clj-http: status 503 {:cached nil, :request-time 768, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 0}, :streaming? true, :http-client #object[org.apache.http.impl.nio.client.InternalHttpAsyncClient 0x6b9ade24 "org.apache.http.impl.nio.client.InternalHttpAsyncClient@6b9ade24"], :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase "Server Error", :headers {"cache-control" "no-cache", "content-type" "text/html"}, :orig-content-encoding nil, :status 503, :length -1, :body #object["[B" 0x7ab7988f "[B@7ab7988f"], :trace-redirects []}, retrying
20-07-21 20:39:58 DESKTOP-QDTFUSR WARN [soupscraper.core:206] - [download] Unexpected error clojure.lang.ExceptionInfo: clj-http: status 503 {:cached nil, :request-time 751, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 0}, :streaming? true, :http-client #object[org.apache.http.impl.nio.client.InternalHttpAsyncClient 0x5a1ab189 "org.apache.http.impl.nio.client.InternalHttpAsyncClient@5a1ab189"], :chunked? false, :type :clj-http.client/unexceptional-status, :reason-phrase "Server Error", :headers {"cache-control" "no-cache", "content-type" "text/html"}, :orig-content-encoding nil, :status 503, :length -1, :body #object["[B" 0x6457482e "[B@6457482e"], :trace-redirects []}, retrying
20-07-21 20:39:58 DESKTOP-QDTFUSR INFO [skyscraper.core:389] - [download] Downloading https://ragis.soup.io/tv/show?id=103697413
Any idea what's wrong? Is it simply stuck on the very last file for some reason?
Hey @Ragiss, thanks for the report!
The progress reporting is somewhat lacking here. As part of the "downloading assets" phase, soupscraper is also trying to reach out to the /tv/show endpoint to fetch YouTube IDs – it's not counting them as assets though.
You seem to have fallen into a loop of 503s, which suggests your IP might have been temporarily blacklisted. Try to abort and retry in a few minutes for now; I'll try to release a new version that copes better with this scenario later on during the week.
Thanks for the answer! I've tried running it again this morning - so after a few hours of break - and still nothing changed.
Also, I'm a bit confused about the log files - it looks like there is one in the /log folder next to the .jar file, but there is also another one in the C:\Users\<user>\log\ folder (this one is bigger and more up to date, but its last entry is from yesterday, nothing from today). But I assume the bigger one is the correct one. ;)
Waiting patiently for the update. :)
I have the same issue. I've been stuck at 35432/35434 for several days now, so retrying didn't help (it did help before: it was stuck earlier, and retrying made step-by-step progress). In the logs I get the same as above: 503 for tv/show?id=... Funny enough, it throws these on many more than the 2 missing assets.
Is there a way to create a working soup out of what I have already? There are gigs of assets in the directory, but I can't see a way to access them.
Same here on Debian stretch. Stuck on "18855/18884 assets fetched" since yesterday now. Even restarting and waiting for an IP refresh from the provider did not help. I agree with @fasel, a sub-command to extract the already fetched assets to the output directory would already be great!
For the record, what worked for me was to just handle all errors as a 404 case (I have no clue about Clojure, so this is admittedly just a bodge):
diff --git a/src/soupscraper/core.clj b/src/soupscraper/core.clj
index 712679c..5d5d396 100644
--- a/src/soupscraper/core.clj
+++ b/src/soupscraper/core.clj
@@ -198,27 +198,10 @@
[error options context]
(let [{:keys [status]} (ex-data error)
retry? (or (nil? status) (>= status 500) (= status 429))]
- (cond
- (= status 404)
- (do
(warnf "[download] %s 404'd, dumping in empty file" (:url context))
(core/respond-with {:headers {"content-type" "text/plain"}
:body (byte-array 0)}
- options context))
-
- retry?
- (do
- (if (= status 429)
- (do
- (warnf "[download] Unexpected error %s, retrying after a nap" error)
- (Thread/sleep 5000))
- (warnf "[download] Unexpected error %s, retrying" error))
- [context])
-
- :otherwise
- (do
- (warnf "[download] Unexpected error %s, giving up" error)
- (core/signal-error error context)))))
+ options context)))
(defn seed [{:keys [soup earliest pages-only]}]
[{:url (format "https://%s.soup.io" soup),
@miri64 - Thanks, that helped me to finally get a browsable html page out of the assets.
@miri64 - I have no idea what to do with this code :D Could you upload your .jar or explain how to do it (on Windows)? Soup is dead for real so I guess nothing more will happen in here.
Sorry, no :-( I already removed the code from my machine, and I am not really used to working with JVM code, so I'm not that familiar with creating a JAR file. However, you can just copy-paste the patch I posted into a patch.txt and then use
git apply patch.txt
in a command line (assuming you cloned soupscraper with Git) and Git should do all the magic for you.
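If you want to be a bit more careful, here is a minimal sketch of the same step with a dry run first. The helper name apply_patch is made up for illustration; the only real commands used are git apply and its --check option, which tests whether a patch fits without changing any files. A patch copied from a web page can lose whitespace along the way, in which case the check will most likely fail rather than leave you with a half-patched file.

```shell
# Hypothetical helper (not part of soupscraper): dry-run a patch,
# and only apply it if it fits the current checkout cleanly.
apply_patch() {
    patch_file="$1"
    # --check only tests whether the patch would apply; no files are modified
    if git apply --check "$patch_file"; then
        git apply "$patch_file"
        echo "applied $patch_file"
    else
        echo "patch does not apply cleanly; check the revision and the patch's whitespace" >&2
        return 1
    fi
}
```

Run it from the root of the cloned repository as apply_patch patch.txt. Building a .jar from the patched sources is a separate step that this sketch does not cover.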