Wrong content-length header for datasets
Describe the bug
When downloading the datasets from https://x.com/i/communitynotes/download-data using wget, the download hangs and stops receiving data, because the Content-Length header (566M) is much larger than the file actually being served (185M).
To Reproduce
wget https://ton.twimg.com/birdwatch-public-data/2024/09/07/notes/notes-00000.tsv
--2024-09-07 13:43:04-- https://ton.twimg.com/birdwatch-public-data/2024/09/07/notes/notes-00000.tsv
Resolving ton.twimg.com (ton.twimg.com)... 2606:2800:233:7ee2:97c:ab4c:6c70:be36, 152.199.21.140
Connecting to ton.twimg.com (ton.twimg.com)|2606:2800:233:7ee2:97c:ab4c:6c70:be36|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 593120688 (566M) [text/tab-separated-values]
Saving to: ‘notes-00000.tsv’
notes-00000.tsv 32%[=======> ] 185.19M --.-KB/s eta 2m 58s
Expected behavior
The content-length header should be set to the file size.
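A minimal sketch of the check this expectation implies: read the body to the end, count the bytes, and compare the count with the declared length. The helper names `count_body_bytes` and `length_matches` are hypothetical, and an in-memory stream stands in for the HTTP response so the example runs without network access.

```python
import io
from typing import BinaryIO


def count_body_bytes(stream: BinaryIO, chunk_size: int = 64 * 1024) -> int:
    """Read a stream to EOF in chunks and return the number of bytes seen."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty read means EOF
            break
        total += len(chunk)
    return total


def length_matches(declared: int, stream: BinaryIO) -> bool:
    """True when the body is exactly as long as the declared Content-Length."""
    return count_body_bytes(stream) == declared


# Stand-in for a response body: 185 bytes served against a declared 566.
# These are illustrative numbers scaled down from the report, not real traffic.
body = io.BytesIO(b"x" * 185)
print(length_matches(566, body))  # False
```

Against a real server the stream would be the response object, and a read timeout would likely be needed so that a stalled connection raises an error instead of blocking.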
Interesting, I can repro this. Thanks
Roughly the same thing happens today, in both Safari and wget, with https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
% wget https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
--2024-09-21 19:39:33-- https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
Resolving ton.twimg.com (ton.twimg.com)... 152.199.24.184
Connecting to ton.twimg.com (ton.twimg.com)|152.199.24.184|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 611776258 (583M) [text/tab-separated-values]
Saving to: ‘notes-00000.tsv’
notes-00000.tsv 32%[===================================> ] 190.74M --.-KB/s eta 3m 51s
(BUT, at least wget does sort-of-work/fail gracefully, eventually:
2024-09-21 19:43:08 (912 KB/s) - Connection closed at byte 200002828. Retrying.
--2024-09-21 19:43:09-- (try: 2) https://ton.twimg.com/birdwatch-public-data/2024/09/21/notes/notes-00000.tsv
Connecting to ton.twimg.com (ton.twimg.com)|152.199.24.184|:443... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.)
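The numbers in the log above are enough to quantify the mismatch. A small sketch, with `shortfall` as a hypothetical helper name, computes how many bytes the server promised but never sent, and what fraction of the declared length actually arrived:

```python
from typing import Tuple


def shortfall(declared: int, received: int) -> Tuple[int, int]:
    """Return (missing_bytes, percent_received) for a truncated transfer."""
    if declared <= 0:
        raise ValueError("declared length must be positive")
    missing = declared - received
    percent = received * 100 // declared  # integer percentage, rounded down
    return missing, percent


# Values copied from the wget log above: the declared Length and the byte
# offset at which the connection was closed.
missing, percent = shortfall(declared=611776258, received=200002828)
print(missing, percent)  # 411773430 32
```

The result is consistent with the 32% at which wget's progress bar stalled. The 416 response on the retry suggests the server has nothing beyond byte 200002828, which would mean the download is complete and only the header is wrong.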