Fix date format of the timestamp field for the flights.json.gz test dataset
Problem:
I recently used the flights.json.gz dataset to test Elasticsearch queries in another project. While indexing the data into Elasticsearch, I noticed that the timestamp field causes parsing errors. The date format for this field, specified in tests/__init__.py, is strict_date_hour_minute_second, which requires a full date and time (yyyy-MM-dd'T'HH:mm:ss). In the dataset, however, the timestamp is sometimes a date-only value such as "2018-02-10", which this format rejects.
I wrote this short bash "script" to list every timestamp value in the dataset that does not contain the "T" separator. It's not very efficient, but it proves the point.
gunzip --stdout flights.json.gz | while read -r line; do
  # Quote the variable so the JSON line is passed to jq unmodified
  printf '%s\n' "$line" | jq '.timestamp' | grep -v "T"
done
The output is:
"2018-01-02"
"2018-01-03"
"2018-01-04"
"2018-01-05"
"2018-01-06"
"2018-01-07"
"2018-01-08"
"2018-01-09"
"2018-01-10"
"2018-01-11"
"2018-01-12"
"2018-01-12"
"2018-01-12"
"2018-01-13"
"2018-01-14"
"2018-01-15"
"2018-01-16"
"2018-01-17"
"2018-01-18"
"2018-01-19"
"2018-01-20"
"2018-01-21"
"2018-01-22"
"2018-01-23"
"2018-01-24"
"2018-01-25"
"2018-01-26"
"2018-01-27"
"2018-01-28"
"2018-01-29"
"2018-01-30"
"2018-01-31"
"2018-02-01"
"2018-02-02"
"2018-02-03"
"2018-02-04"
"2018-02-05"
"2018-02-06"
"2018-02-07"
"2018-02-08"
"2018-02-09"
"2018-02-09"
"2018-02-09"
"2018-02-10"
"2018-02-11"
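For anyone who prefers to reproduce the check without jq, here is a minimal Python sketch of the same scan. It assumes, as the bash loop does, that the file holds one JSON document per line. The function name date_only_timestamps and the two sample records are made up for illustration and are not part of the repository.

```python
import gzip
import io
import json


def date_only_timestamps(lines):
    """Yield timestamp values that lack the "T" date/time separator.

    `lines` is any iterable of JSON documents, one per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        timestamp = json.loads(line).get("timestamp")
        if isinstance(timestamp, str) and "T" not in timestamp:
            yield timestamp


# Small in-memory stand-in for flights.json.gz (sample records are made up).
sample = b'{"timestamp": "2018-01-01T00:00:00"}\n{"timestamp": "2018-02-10"}\n'
compressed = gzip.compress(sample)

with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as fh:
    found = list(date_only_timestamps(fh))

print(found)
```

To run it against the real dataset, pass the handle returned by gzip.open("flights.json.gz", "rt", encoding="utf-8") to date_only_timestamps instead of the in-memory sample.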
BTW, these "invalid" timestamps occur only in the flights.json.gz dataset, not in flights_small.json.gz.
Solution:
To support these timestamps, I changed the date format to strict_date_optional_time, which requires the date but makes the time part (after the "T" separator) optional.
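As a rough illustration of the change, the timestamp entry of the index mapping might look like the sketch below. This assumes the mapping is a plain Python dict. The variable name flights_mapping_sketch and the surrounding structure are hypothetical, and the real mapping in tests/__init__.py contains more fields.

```python
# Hypothetical excerpt of an index mapping; only the timestamp field is shown.
flights_mapping_sketch = {
    "mappings": {
        "properties": {
            "timestamp": {
                "type": "date",
                # Before this change: "strict_date_hour_minute_second",
                # which rejects date-only values such as "2018-02-10".
                "format": "strict_date_optional_time",
            }
        }
    }
}

timestamp_format = flights_mapping_sketch["mappings"]["properties"]["timestamp"]["format"]
print(timestamp_format)
```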
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?
jenkins test this please
buildkite test this please