FakeNewsNet

problem with retweet files

Open alijrhi opened this issue 6 years ago • 8 comments

Hi, the contents of almost all the retweet files are {"retweets": []}, without any data. What is the problem? Is this normal?

alijrhi avatar Sep 05 '19 19:09 alijrhi

Look into your data_collection.out log file. This might be an issue with your Twitter keys. At least that was the cause when it happened to me. I had to restart the collection process, as all my data was empty as well.

SaschaStenger avatar Sep 06 '19 14:09 SaschaStenger

@SaschaStenger Hi! I just want to compare this with my problem and see if you are having the same one. What is happening in my case is that, due to errors, most retweets aren't being collected, but a few are (as described in this issue). Are you also having the same trouble with retweets, or have you solved it?

rlleshi avatar Nov 15 '19 05:11 rlleshi

Hi. The issue you are describing is not unusual. I asked more or less the same question on the Twitter developer forum. The reason is that lots of tweets decay over time, especially in a context like fake news (decay meaning that they are either deleted or hidden by the user), which leaves them unavailable. The same goes for cases where the original tweet has been deleted or hidden; this can also lead to errors thrown by the Twitter API. Lastly, only a fraction of all tweets get retweeted at all, which may be why your resulting .json data is empty. Check the corresponding tweet .json and look under the key retweet_count. This will tell you whether the download code has missed any retweets due to errors or anything similar.

SaschaStenger avatar Nov 15 '19 07:11 SaschaStenger

Thanks a lot! I will take a look into it.

rlleshi avatar Nov 15 '19 13:11 rlleshi

@rlleshi Have you checked whether it worked? Because I'm having the same problem as in your last issue. Thanks.

kosty4 avatar Nov 21 '19 15:11 kosty4

@Dahabium As @SaschaStenger mentioned, this is due to the fact that this code also tries to crawl tweets that have decayed (been deleted or hidden), so this is normal. However, she seems to have created a version of the repository which skips these tweets, and it crawls the whole dataset much faster.

rlleshi avatar Nov 21 '19 17:11 rlleshi

@SaschaStenger, what num_processes do you set? (Does it depend on the number of tokens you are using?) By the way, thanks for the mods in your repo!

kosty4 avatar Nov 22 '19 14:11 kosty4

I'm using one process fewer than I have keys. I don't know if it makes any difference, but my thought was that if every process uses one key, there is a backup key available when the others are on timeout. That is really your own preference, though. Personally, I wouldn't go with more processes than keys, since the keys are the bottleneck rather than the number of processes you are running, except perhaps when your machine is really slow.

SaschaStenger avatar Nov 22 '19 14:11 SaschaStenger
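The rule of thumb above can be written down as a tiny helper. This is an illustrative sketch, not code from the repository; the function name and parameters are invented to capture the "one process fewer than keys, keys are the bottleneck" reasoning.

```python
# Hypothetical sketch of the heuristic discussed above: run one process
# fewer than the number of API key sets, so a spare key is free when
# the others hit a rate-limit timeout. Never go below one process, and
# optionally cap by what the machine can handle.
def pick_num_processes(num_keys, machine_max=None):
    """Choose a worker count from the number of available API keys."""
    n = max(1, num_keys - 1)  # keep one key in reserve for timeouts
    if machine_max is not None:
        n = min(n, machine_max)  # a slow machine may be the real limit
    return n
```

With 4 keys this yields 3 processes; with a single key it still runs 1 process, since the spare-key trick only applies when there is more than one key.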