twitter-archive-parser

Base feature: Download tweets which are missing or incomplete

Open lenaschimmel opened this issue 3 years ago • 4 comments

We have several issues that can only be solved if we download some additional tweets:

  • #72
  • #73
  • #39
  • #20
  • #22

I already started working on an implementation for this on Nov 22nd. I noticed that we still didn't have an issue for it until now; an issue might be helpful to keep track of the progress, especially if multiple PRs are created / updated / closed / merged to implement this feature.

What does this do?

These are the features that are (mostly) finished in the branch downloadtweets, but not yet available on main:

  • scan tweets in the archive for references to other tweets (see the sketch after this list):
    • retweets
    • quoted tweets
    • tweets that the user replied to
  • scan for tweets which already are in the archive, but need to be downloaded again
    • tweets with media, because the archive does not contain the alt-text for images
  • download needed tweets, if the user consents
  • merge downloaded tweets with tweets from the archive and save to known_tweets.json
  • when the script is executed again, load these tweets so they will not be downloaded again (and again...)
  • don't scan tweets which were not part of the original archive
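
To make the reference scan and the caching from that list more concrete, here is a strongly simplified sketch of the idea (field names as in the Twitter API / archive JSON; the actual code in the branch handles many more format variants):

```python
def collect_referenced_ids(tweet):
    """Collect the IDs of other tweets that this tweet references.
    Field names follow the Twitter API / archive JSON; the real code
    in the branch handles more format variants than this sketch."""
    ids = set()
    # Retweets embed the original tweet (API format).
    if 'retweeted_status' in tweet:
        ids.add(str(tweet['retweeted_status']['id_str']))
    # Quoted tweets and replies are referenced by ID only.
    for key in ('quoted_status_id_str', 'in_reply_to_status_id_str'):
        if tweet.get(key):
            ids.add(str(tweet[key]))
    return ids

def ids_to_download(archive_tweets, known_tweets):
    """Return referenced IDs that are neither in the archive nor in
    the known_tweets.json cache from a previous run, so that re-runs
    do not download the same tweets again (and again...)."""
    wanted = set()
    for tweet in archive_tweets:
        wanted |= collect_referenced_ids(tweet)
    have = {str(t['id_str']) for t in archive_tweets} | set(known_tweets)
    return wanted - have
```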

What's still missing

  • the last two items of the list above are not 100% finished - about 14% of the tweets are downloaded again each time
  • not sure whether any of the liked tweets in like.js need re-downloading for some reason; they are ignored right now
  • media from the additional tweets is not downloaded, so it is reported as missing when those tweets are written to the output md and html
  • additional tweets are included in the output md and html as if they were part of the archive
  • the downloaded tweets are not really used right now to enhance the output

Where is the progress?

There is PR #97, which merged my first set of commits into the branch downloadtweets in this repo, and PR #122, which tracks my current work on it in downloadtweets in my fork. None of this is currently merged into main.

The PR is already quite huge, and looks even bigger due to the many merge commits which just bring it up to date with main.

For several days, the online diff for #122 was broken, but now it works again.

Why is the PR so huge?

I underestimated the complexity of the tweet JSON format. Many properties are quite similar and contain redundant data. There are slight differences between the format from the API and the format from the archive. Many numeric properties are sometimes encoded as numbers and sometimes as strings, which makes equality checks and merging difficult.
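
To give one concrete example of the number-vs-string problem: every comparison or dict lookup first has to normalize the IDs, roughly like this (a minimal sketch, not the actual code in the branch):

```python
def normalize_id(value):
    """Tweet IDs arrive as int from some sources and as str from
    others; compare them as strings so that 1234 and '1234' are
    recognized as the same tweet. Sketch only."""
    return None if value is None else str(value)

assert normalize_id(1234) == normalize_id('1234')
```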

Also, throwing all tweets (from the archive, and those downloaded for several different reasons) into a single dict / json file has some downsides. But since every tweet can be referenced in multiple different ways and also be part of the original archive, keeping them separate is not trivial either, and maybe impossible.
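
One way to live with the single dict is to record, per tweet, why it is in there. A hypothetical sketch (the 'sources' field is invented for this example and is not part of the tweet JSON):

```python
def add_tweet(known_tweets, tweet, source):
    """Store a tweet in the single known_tweets dict, keyed by id_str,
    and remember why it is there. 'sources' is an invented
    bookkeeping field, not part of the tweet JSON."""
    entry = known_tweets.setdefault(str(tweet['id_str']), tweet)
    entry.setdefault('sources', []).append(source)

known_tweets = {}
add_tweet(known_tweets, {'id_str': '42', 'full_text': 'hello'}, 'archive')
add_tweet(known_tweets, {'id_str': '42', 'full_text': 'hello'}, 'api:quoted')
assert known_tweets['42']['sources'] == ['archive', 'api:quoted']
```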

How will this continue?

I think we could merge this really soon. The remaining problems are not as big as they might seem:

  • re-downloading some tweets over and over again only takes a few seconds, and the user can simply decline to do it
  • downloading additional media for those tweets should IMHO be a separate issue / PR anyway
  • using the downloaded tweets belongs in separate issues (listed at the very beginning) and separate PRs. The script already informs the user about this before downloading, as it prints:

"Please note that the downloaded tweets will not be included in the generated output yet. Anyway, we recommend to download the tweets now, just in case Twitter (or its API which we use), won't be available forever. A future version of this script will be able to include the downloaded tweets into the output, even if Twitter should not be available then."

What do you think, @timhutton? What should I do before we can merge #122 into downloadtweets? And what should be done before that result can be merged into main?

lenaschimmel avatar Nov 28 '22 11:11 lenaschimmel

I just tried your fork and got this error:

Retrying the ones that failed, with a longer sleep. 4 tries remaining.
[…]
Traceback (most recent call last):
lenaschimmel-parser.py", line 768, in download_larger_media
    for index, (local_media_path, media_url) in enumerate(media_sources.items()):
AttributeError: 'list' object has no attribute 'items'

weiweihuanghuang avatar Nov 29 '22 00:11 weiweihuanghuang

Ah, sorry. This error was recently found and fixed, but not in the correct branch. I just pushed this commit into downloadtweets. That should fix it.
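
Judging from the traceback, media_sources can be a list in this branch instead of a dict. A loop that tolerates both shapes would look roughly like this (a sketch only; the actual commit may differ):

```python
def iter_media_sources(media_sources):
    """Yield (local_media_path, media_url) pairs whether media_sources
    is a dict (as in upstream main) or a list of pairs (as the
    traceback above suggests). Sketch only."""
    if isinstance(media_sources, dict):
        yield from media_sources.items()
    else:
        yield from media_sources

# Both shapes now behave the same:
as_dict = {'media/1.jpg': 'https://example.com/1.jpg'}
as_list = [('media/1.jpg', 'https://example.com/1.jpg')]
assert list(iter_media_sources(as_dict)) == list(iter_media_sources(as_list))
```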

lenaschimmel avatar Nov 29 '22 00:11 lenaschimmel

Also just wondering if the function for retrieving the full text of any retweets in your archive is in this PR?

weiweihuanghuang avatar Nov 29 '22 00:11 weiweihuanghuang

> not sure whether any of the liked tweets in like.js need re-downloading for some reason; they are ignored right now

I think the liked tweets need re-downloading, as the only fields in the archive are tweetId, fullText, and expandedUrl. We can tackle that in issue #22 after this gets merged though.
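
For reference, reading those fields out of like.js could look roughly like this (a minimal sketch, assuming the usual window.YTD wrapper of the archive's .js data files):

```python
import json

def read_liked_tweet_ids(path='data/like.js'):
    """Read like.js from the archive and return the liked tweet IDs.
    Assumes the usual 'window.YTD.like.part0 = [...]' wrapper that
    the archive's .js data files use. Sketch only."""
    with open(path, encoding='utf8') as f:
        data = f.read()
    # Strip the JavaScript assignment prefix to get plain JSON.
    payload = json.loads(data[data.index('['):])
    # Each entry only carries tweetId, fullText and expandedUrl;
    # everything else about the tweet would have to be downloaded.
    return [entry['like']['tweetId'] for entry in payload]
```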

press-rouch avatar Nov 30 '22 22:11 press-rouch