
More work on `downloadtweets`

Open lenaschimmel opened this issue 3 years ago • 5 comments

Here's some work that I did 3 days ago but had not turned into a PR until now.

Yesterday I incorporated the newest stuff from timhutton/twitter-archive-parser/main into both timhutton/twitter-archive-parser/downloadtweets and lenaschimmel/twitter-archive-parser/downloadtweets because I thought that would make it easier to keep everything up to date and focus on the actual (non-merge) commits.

What's actually included

This is just more WIP on the download tweets feature:

  • Error handling in `get_tweets`: if it fails, return the tweets we already downloaded, plus the ids of the tweets that are still missing.
  • Add a `merge` method, which can merge basically all kinds of Python values, but contains some special treatment for dicts representing tweets. This should make sure that if a tweet is contained in the archive, and we download that same tweet via the API, we can merge them without losing any information.
  • Add a helper method `has_path`, which simplifies those chained checks that we have all over the code, like: `if 'entities' in tweet and 'user_mentions' in tweet['entities'] and tweet['entities']['user_mentions'] is not None`
  • Save downloaded tweets into `known_tweets.json`, and load them on the next script execution, so that we don't reload the same tweets over and over again.
  • Annotate downloaded tweets with attributes like `from_api`, `download_with_user`, `download_with_alt_text`, so that `collect_tweet_references` can make better decisions on what to re-download, and whether to follow references or not.
  • Improved merging and comparing of tweet JSON objects. Those are some really nasty data structures!
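
To make the `has_path` and `merge` items above a bit more concrete, here is a minimal sketch of the idea. It is not the code from the branch: the signatures and the merge rules (dicts merged key by key, lists combined without duplicates, first non-None scalar wins) are simplifying assumptions for illustration, and the special treatment of tweet dicts is left out.

```python
from typing import Any


def has_path(obj: Any, *keys: str) -> bool:
    """Return True if every key can be followed in turn and the final value
    is not None. Hypothetical signature, assumed for this sketch."""
    current = obj
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current is not None


def merge(a: Any, b: Any) -> Any:
    """Merge two JSON-like values without dropping information.
    Assumed rules: dicts merge key by key, lists keep the items of both
    sides without duplicates, and for scalars the first non-None value wins."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            merged[key] = merge(a[key], value) if key in a else value
        return merged
    if isinstance(a, list) and isinstance(b, list):
        return a + [item for item in b if item not in a]
    return a  # conflicting scalars: keeping the first one is an arbitrary choice


# Made-up example data: the same tweet from the archive and from the API.
archived = {"id_str": "1", "entities": {"user_mentions": None}}
downloaded = {
    "id_str": "1",
    "entities": {"user_mentions": [{"screen_name": "example"}]},
    "from_api": True,
}
combined = merge(archived, downloaded)
```

With this sketch, the chained check from the list above would shrink to `has_path(tweet, 'entities', 'user_mentions')`, and `combined` keeps the mentions and the `from_api` annotation that only the downloaded copy had.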

lenaschimmel, Nov 25 '22 15:11

The online diff view is no longer broken 🎉

lenaschimmel, Nov 27 '22 22:11

Hi, any chance all this nice stuff gets integrated? Thanks!

slorquet, Feb 02 '23 16:02

@slorquet I think this is still work in progress so can't be merged yet?

@lenaschimmel are you still working on this? Can you rebase it, so it's easier to review? I'd like to get as much out of the sinking blue ship as possible :-)

Sjors, Feb 18 '23 15:02

@Sjors I haven't really been working on that branch, or on twitter-archive-parser, since mid-December '22. I've since deleted my own twitter* account, so I don't have much motivation to continue, and it's harder to test. And even if I really wanted to, I currently don't have any time to continue working on it.

Up to that point, I focused on the weirdly-named branch lenaschimmel:archivemode, which already includes all commits from downloadtweets. I'd say the archivemode branch is in rather good shape now, and can be used and/or merged. It is also up-to-date with timhutton:main, so it could even be fast-forwarded. But I don't know if @timhutton wants to go in that direction.


* To be precise, I deleted my main twitter account. I still have some secondary ones lying around, but with much less content, and thus fewer edge cases that are interesting for testing new code.

lenaschimmel, Feb 21 '23 14:02

@lenaschimmel Would you like to document the fact that the other branch is more worthy of being merged by closing/drafting this one, and opening a new PR?

This little step could also make it easier for third parties, like me, to discover your work and rely on it instead, in case @timhutton is not available for further deliberation.

almereyda, Oct 18 '23 22:10