More work on `downloadtweets`
Here's some work I did 3 days ago but had not yet turned into a PR.
Yesterday I merged the latest changes from timhutton/twitter-archive-parser/main into both timhutton/twitter-archive-parser/downloadtweets and lenaschimmel/twitter-archive-parser/downloadtweets, because I thought that would make it easier to keep everything up to date and to focus on the actual (non-merge) commits.
What's actually included
This is just more WIP on the download tweets feature:
- Error handling in `get_tweets`: if it fails, return the tweets we already downloaded, plus the ids of tweets that are still missing
- Add `merge` method, which can basically merge all kinds of Python values, but contains some special treatment for dicts representing tweets. This should make sure that if a tweet is contained in the archive, and we download that same tweet via the API, we can merge them without losing any information.
- Add helper method `has_path`, which simplifies those chained checks that we have all over the code, like: `if 'entities' in tweet and 'user_mentions' in tweet['entities'] and tweet['entities']['user_mentions'] is not None`
- Save downloaded tweets into `known_tweets.json`, and load them on the next script execution, so that we don't reload the same tweets over and over again.
- Annotate downloaded tweets with attributes like `from_api`, `download_with_user`, `download_with_alt_text` so that `collect_tweet_references` can make better decisions on what to re-download, and whether to follow references or not
- Improved merging and comparing of tweet JSON objects. Those are some really nasty data structures!
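
To illustrate the idea behind `has_path` and `merge`, here is a minimal sketch. It is not the code from this branch: the signatures are assumptions, and the tweet-specific special cases that the real `merge` contains are left out. It only shows the general technique of walking nested dicts safely and merging two JSON-like values without dropping data.

```python
def has_path(obj, *keys):
    """Return True if every key in `keys` can be followed through nested
    dicts and the value at the end is not None.
    (Assumed signature, for illustration only.)"""
    current = obj
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return False
        current = current[key]
    return current is not None


def merge(a, b):
    """Recursively merge two JSON-like values, preferring data from `a`
    when both sides hold conflicting scalars.
    (Generic sketch; no tweet-specific handling.)"""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, b_value in b.items():
            merged[key] = merge(a[key], b_value) if key in a else b_value
        return merged
    if isinstance(a, list) and isinstance(b, list):
        # Keep everything from `a`, then append items only present in `b`.
        return a + [item for item in b if item not in a]
    # Conflicting scalars (or mismatched types): keep the value from `a`.
    return a


# Hypothetical example data, not taken from a real archive.
archived = {'id_str': '1', 'entities': {'user_mentions': None}}
downloaded = {'id_str': '1', 'full_text': 'hi',
              'entities': {'user_mentions': [{'screen_name': 'someone'}]}}

combined = merge(archived, downloaded)
if has_path(combined, 'entities', 'user_mentions'):
    print(len(combined['entities']['user_mentions']), 'mention(s)')
```

With a helper like this, the chained check quoted above collapses to `has_path(tweet, 'entities', 'user_mentions')`.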
The online diff view is no longer broken 🎉
Hi, any chance all this nice stuff gets integrated? Thanks!
@slorquet I think this is still work in progress so can't be merged yet?
@lenaschimmel are you still working on this? Can you rebase it, so it's easier to review? I'd like to get as much out of the sinking blue ship as possible :-)
@Sjors I haven't really been working on that branch, or on twitter-archive-parser, since mid-December 2022. I've since deleted my own twitter* account, so I don't have much motivation to continue, and it's harder to test. And even if I really wanted to, I currently don't have any time to continue working on it.
Until then, I focused on the weirdly-named branch lenaschimmel:archivemode, which already includes all commits from downloadtweets. I'd say the archivemode branch is in rather good shape now, and can be used and/or merged. It is also up to date with timhutton:main, so it could even be fast-forwarded. But I don't know if @timhutton wants to go in that direction.
* To be precise, I deleted my main twitter account. I still have some secondary ones lying around, but with much less content, and thus fewer edge cases that are interesting for testing new code.
@lenaschimmel Would you like to document the fact that the other branch is more worthy of being merged by closing/drafting this one, and opening a new PR?
This little step could also make it easier for third parties, like me, to discover your work and rely on it instead, in case @timhutton is not available for further deliberation.