twitter-archive-parser More work on `downloadtweets`

Here's some work which I already did 3 days ago, but had not made into a PR before.

Yesterday I incorporated the newest stuff from timhutton/twitter-archive-parser/main into both timhutton/twitter-archive-parser/downloadtweets and lenaschimmel/twitter-archive-parser/downloadtweets because I thought that would make it easier to keep everything up to date and focus on the actual (non-merge) commits.

What's actually included

This is just more WIP on the download tweets feature:

Error handling in get_tweets - if it fails, return the tweets we already downloaded, plus the ids of tweets that are still missing
Add merge method, which can basically merge all kinds of python values, but contains some special treatments for dicts representing tweets. This should make sure that if a tweet is contained in the archive, and we download that same tweet via the API, we can merge them without losing any information.
add helper method has_path which simplifies those chained checks that we have all over the code, like: if 'entities' in tweet and 'user_mentions' in tweet['entities'] and tweet['entities']['user_mentions'] is not None
save downloaded tweets into known_tweets.json, and load them on next script execution, so that we don't reload the same tweets over and over again.
annotate downloaded tweets with attributes like from_api, download_with_user, download_with_alt_text so that collect_tweet_references can make better decisions on what do re-download, and whether to follow references or not
improved merging and comparing of tweet JSON objects. Those are some really nasty data structures!

Nov 25 '22 15:11 lenaschimmel

The online diff view is no longer broken 🎉

Nov 27 '22 22:11 lenaschimmel

Hi, any chance all this nice stuff gets integrated? Thanks!

Feb 02 '23 16:02 slorquet

@slorquet I think this is still work in progress so can't be merged yet?

@lenaschimmel are you still working on this? Can you rebase it, so it's easier to review? I'd like to get as much out of the sinking blue ship as possible :-)

Feb 18 '23 15:02 Sjors

@Sjors I'm not really working on that branch, or twitter-archive-parser, any more since mid December 22. I've since deleted my own twitter* account, so I don't have much motivation to continue, and it's harder to test. And even if I really wanted to, I currently don't have any time to continue working on it.

Until then, I focused on the weirdly-named branch lenaschimmel:archivemode, which already includes all commits from downloadtweets. I'd say the archivemode branch is in rather good shape now, and can be used and/or merged. It is also up-to-date with timhutton:main so it could even be fast-forwarded. But I don't know if @timhutton wants to go into that direction.

* To be precise, I deleted my main twitter account. I still have some secondary ones lying around, but with much less content, thus less edge cases that are interesting for testing new code

Feb 21 '23 14:02 lenaschimmel

@lenaschimmel Would you like to document the fact that the other branch is more worthy of being merged by closing/drafting this one, and opening a new PR?

This little step could also make it easier for third-parties, like me, to discover your work and rely on it instead, in case @timhutton would not be available for further deliberation.

Oct 18 '23 22:10 almereyda