Question: Does the validator rely on a specific charset?
Describe the bug
I truncate a GTFS feed with multiple agencies
head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-dir/agency.txt
"agency_id","agency_name","agency_url","agency_timezone","agency_lang","agency_phone"
66,"Stadtwerke Verkehrsbetriebe Wilhelmshaven GmbH","https://swwv.de/","Europe/Berlin","de","+49 4421 291257"
81,"Stadtwerke Osnabrück AG - Verkehrsbetriebe","http://www.stadtwerke-osnabrueck.de","Europe/Berlin","de","+49 541 20022211"
106,"Verkehr und Wasser GmbH (VWG)","http://www.vwg.de/","Europe/Berlin","de","+49 441 93660"
121,"Delmenhorst-Harpstedter Eisenbahn GmbH","http://www.dhe-reisen.de/","Europe/Berlin","de","+49 4244 93550"
to a GTFS feed with a single agency
head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555
using this
java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg
instruction.
At the end I am calling the validator on the truncated GTFS feed like this
java -jar ~/Downloads/gtfs-validator-4.2.0-cli.jar -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip -o ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg-output
and get a reply like this.
Mar 13, 2024 8:54:10 AM org.mobilitydata.gtfsvalidator.runner.ValidationRunner printSummary
INFO: Validation took 0.557 seconds
Mar 13, 2024 8:54:10 AM org.mobilitydata.gtfsvalidator.runner.ValidationRunner printSummary
INFO: agency.txt INVALID_HEADERS
calendar.txt INVALID_HEADERS
calendar_dates.txt INVALID_HEADERS
routes.txt INVALID_HEADERS
shapes.txt INVALID_HEADERS
stop_times.txt INVALID_HEADERS
stops.txt INVALID_HEADERS
trips.txt INVALID_HEADERS
System errors:
cat mblthk-cnnct-dhid-gtfs-bsvg-output/system_errors.json
{"notices":[]}
JSON report:
[{"code":"csv_parsing_failed","severity":"ERROR","totalNotices":8,"sampleNotices":[{"filename":"calendar.txt","charIndex":0,"columnIndex":0,"lineIndex":0,"message":"java.lang.NullPointerException - Cannot invoke \"java.io.InputStream.read()\" because \"this.in\" is null\nParser Configuration: CsvParserSettings:\n\tAuto configuration enabled\u003dtrue\n\tAuto-closing enabled\u003dtrue\n\tAutodetect column delimiter\u003dfalse\n\tAutodetect quotes\u003dfalse\n\tColumn reordering enabled\u003dtrue\n\tDelimiters for detection\u003dnull\n\tEmpty value\u003dnull\n\tEscape unquoted values\u003dfalse\n\tHeader extraction enabled\u003dtrue\n\tHeaders\u003dnull\n\tIgnore leading whitespaces\u003dtrue\n\tIgnore leading whitespaces in quotes\u003dfalse\n\tIgnore trailing whitespaces\u003dtrue\n\tIgnore trailing whitespaces in quotes\u003dfalse\n\tInput buffer size\u003d1048576\n\tInput reading on separate thread\u003dtrue\n\tKeep escape sequences\u003dfalse\n\tKeep quotes\u003dfalse\n\tLength of content displayed on error\u003d-1\n\tLine separator detection enabled\u003dfalse\n\tMaximum number of characters per column\u003d4096\n\tMaximum number of columns\u003d512\n\tNormalize escaped line separators\u003dtrue\n\tNull value\u003dnull\n\tNumber of records to read\u003dall\n\tProcessor\u003dnone\n\tRestricting data in exceptions\u003dfalse\n\tRowProcessor error handler\u003dnull\n\tSelected fields\u003dnone\n\tSkip bits as whitespace\u003dtrue\n\tSkip empty lines\u003dtrue\n\tUnescaped quote handling\u003dnullFormat configuration:\n\tCsvFormat:\n\t\tComment character\u003d#\n\t\tField delimiter\u003d,\n\t\tLine separator (normalized)\u003d\\n\n\t\tLine separator sequence\u003d\\n\n\t\tQuote character\u003d\"\n\t\tQuote escape character\u003d\"\n\t\tQuote escape escape character\u003dnull\nInternal state when error was thrown: line\u003d0, column\u003d0, record\u003d0","parsedContent":""},
...
I am wondering, does the validator rely on a specific charset?
The original GTFS feed can be validated with utf-8 as charset.
file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-dir/agency.txt
./agency.txt: text/csv; charset=utf-8
The truncated GTFS feed, which is reported as us-ascii, cannot be validated.
file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt
./agency.txt: text/csv; charset=us-ascii
Steps/Code to Reproduce
java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg
java -jar ~/Downloads/gtfs-validator-4.2.0-cli.jar -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip -o ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg-output
Expected Results
I expected a validation report for the truncated GTFS feed the same way I got a validation report for the original GTFS feed.
Actual Results
System errors:
cat mblthk-cnnct-dhid-gtfs-bsvg-output/system_errors.json
{"notices":[]}
JSON report:
[{"code":"csv_parsing_failed","severity":"ERROR","totalNotices":8,"sampleNotices":[{"filename":"calendar.txt","charIndex":0,"columnIndex":0,"lineIndex":0,"message":"java.lang.NullPointerException - Cannot invoke \"java.io.InputStream.read()\" because \"this.in\" is null\nParser Configuration: CsvParserSettings:\n\tAuto configuration enabled\u003dtrue\n\tAuto-closing enabled\u003dtrue\n\tAutodetect column delimiter\u003dfalse\n\tAutodetect quotes\u003dfalse\n\tColumn reordering enabled\u003dtrue\n\tDelimiters for detection\u003dnull\n\tEmpty value\u003dnull\n\tEscape unquoted values\u003dfalse\n\tHeader extraction enabled\u003dtrue\n\tHeaders\u003dnull\n\tIgnore leading whitespaces\u003dtrue\n\tIgnore leading whitespaces in quotes\u003dfalse\n\tIgnore trailing whitespaces\u003dtrue\n\tIgnore trailing whitespaces in quotes\u003dfalse\n\tInput buffer size\u003d1048576\n\tInput reading on separate thread\u003dtrue\n\tKeep escape sequences\u003dfalse\n\tKeep quotes\u003dfalse\n\tLength of content displayed on error\u003d-1\n\tLine separator detection enabled\u003dfalse\n\tMaximum number of characters per column\u003d4096\n\tMaximum number of columns\u003d512\n\tNormalize escaped line separators\u003dtrue\n\tNull value\u003dnull\n\tNumber of records to read\u003dall\n\tProcessor\u003dnone\n\tRestricting data in exceptions\u003dfalse\n\tRowProcessor error handler\u003dnull\n\tSelected fields\u003dnone\n\tSkip bits as whitespace\u003dtrue\n\tSkip empty lines\u003dtrue\n\tUnescaped quote handling\u003dnullFormat configuration:\n\tCsvFormat:\n\t\tComment character\u003d#\n\t\tField delimiter\u003d,\n\t\tLine separator (normalized)\u003d\\n\n\t\tLine separator sequence\u003d\\n\n\t\tQuote character\u003d\"\n\t\tQuote escape character\u003d\"\n\t\tQuote escape escape character\u003dnull\nInternal state when error was thrown: line\u003d0, column\u003d0, record\u003d0","parsedContent":""},
...
Files used
No response
Validator version
4.2.0
Operating system
Debian 12
Java version
openjdk version "17.0.10" 2024-01-16
Side note: We've had Unicode-supporting tools for a long time now, and given that GTFS inherently has an international scope, I think it should be defined in the GTFS Schedule spec that the charset should be UTF-8.
@derhuerst To clarify, it looks like it's already a "should" within the spec under File Requirements: Files should be encoded in UTF-8 to support all Unicode characters.
@dancesWithCycles The validator currently relies only on UTF-8 encoding. Can you clarify why you're using us-ascii?
Hi folks, thanks for the clarification!
I am using this tool to truncate a multi-agency GTFS feed
to a GTFS feed with a single agency
head -n5 ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555
Unfortunately the resulting single agency GTFS feed
using this
java -jar -Xms20G -server ./target/onebusaway-gtfs-transformer-cli.jar --transform=./mblthk-cnnct-dhid-gtfs-bsvg.txt ~/Downloads/mblthk-cnnct-dhid-gtfs ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg
uses a character encoding other than utf-8. I would love to learn of a different truncation tool that reduces a multi-agency GTFS feed to a single-agency GTFS feed while keeping the utf-8 character encoding intact, so that the result stays compatible with the GTFS validator.
Cheers!
A few of the details here do not seem to add up: the actual error reported by the validator is java.lang.NullPointerException - Cannot invoke "java.io.InputStream.read()" because "this.in" is null, and this seems to be because the validator was asked to validate ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg.zip, whereas the OBA GTFS transformer was asked to write output to ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg (which means it will produce a directory of loose CSV files, not a Zip!).
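The mismatch described above can be checked before the validator is invoked. The following is only a minimal sketch in Python, not anything taken from the validator or the transformer; `describe_input` is a hypothetical helper name and the directory used in the demonstration is a temporary stand-in for the transformer's output path.

```python
import os
import tempfile
import zipfile


def describe_input(path):
    """Classify a path as 'missing', 'directory', 'zip', or 'other file'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        return "directory"
    if zipfile.is_zipfile(path):
        return "zip"
    return "other file"


# Demonstration: a temporary directory stands in for the transformer output.
with tempfile.TemporaryDirectory() as tmp:
    feed_dir = os.path.join(tmp, "feed-bsvg")  # hypothetical output directory
    os.mkdir(feed_dir)
    print(describe_input(feed_dir))           # directory
    print(describe_input(feed_dir + ".zip"))  # missing
```

If the second call reports "missing" for the path handed to the validator, the archive was never written and the encoding of the CSV files is not the first thing to look at.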
Without a Byte Order Mark present (which would, in fact, be unusual to find in a UTF-8-encoded file), there is no apparent difference between a plain-ASCII file and one encoded in UTF-8, so long as the file contains only ASCII characters (more precisely, characters in the Unicode Basic Latin block). In other words, the only way file knows that a file is UTF-8 is because it sees UTF-8-encoded characters. In this case, the operation to remove three agencies also removes the only character in the agency.txt file outside the Unicode Basic Latin block (an ü), and so file rightly concludes that it contains ASCII text (as in the output which I have quoted below). (Given the nature of UTF-8, such a file is inherently also valid as a UTF-8-encoded file.)
file -i ~/Downloads/mblthk-cnnct-dhid-gtfs-bsvg/agency.txt
./agency.txt: text/csv; charset=us-ascii
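The relationship between ASCII and UTF-8 described above can be illustrated with a short Python sketch. The function `looks_like` is a hypothetical, deliberately rough stand-in for the kind of guess file makes; it is not the algorithm file actually uses. The two sample rows are shortened from the agency.txt excerpts in this thread.

```python
ascii_line = "231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/"
umlaut_line = "81,Stadtwerke Osnabrück AG - Verkehrsbetriebe"

# Pure-ASCII text produces identical bytes under both encodings.
assert ascii_line.encode("ascii") == ascii_line.encode("utf-8")


def looks_like(data: bytes) -> str:
    """Guess a charset label from raw bytes (rough illustration only)."""
    if all(byte < 0x80 for byte in data):
        return "us-ascii"  # nothing distinguishes this from UTF-8
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown"


print(looks_like(ascii_line.encode("utf-8")))   # us-ascii
print(looks_like(umlaut_line.encode("utf-8")))  # utf-8
```

Both byte strings were produced with the UTF-8 encoder; only the one containing ü carries any evidence of that.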
Original agency.txt (note ü):
"agency_id","agency_name","agency_url","agency_timezone","agency_lang","agency_phone"
66,"Stadtwerke Verkehrsbetriebe Wilhelmshaven GmbH","https://swwv.de/","Europe/Berlin","de","+49 4421 291257"
81,"Stadtwerke Osnabrück AG - Verkehrsbetriebe","http://www.stadtwerke-osnabrueck.de","Europe/Berlin","de","+49 541 20022211"
106,"Verkehr und Wasser GmbH (VWG)","http://www.vwg.de/","Europe/Berlin","de","+49 441 93660"
121,"Delmenhorst-Harpstedter Eisenbahn GmbH","http://www.dhe-reisen.de/","Europe/Berlin","de","+49 4244 93550"
Transformed agency.txt (note absence of any characters outside the Unicode Basic Latin block):
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
231,Braunschweiger Verkehrs-GmbH,http://www.verkehr-bs.de/,Europe/Berlin,de,+49 531 28639555
To make a long story short, I think (without actually having these files at hand to inspect) that the matter of character encoding is a red herring. It is odd that the validator does not provide a more specific error when it cannot open the source file, but given the error message reported and the fact that the paths in the excerpts in the comment above do not match, I strongly suspect that the path mismatch was the underlying problem.
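If a Zip archive is wanted as validator input, the directory of loose files can be packed separately. This is only a sketch under that assumption, using Python's standard zipfile module rather than anything from the transformer or validator; `zip_feed` is a hypothetical helper name, and the temporary directory stands in for the transformer's output. The GTFS Schedule reference requires the files to sit at the root of the archive, which is why each file is stored under its bare name.

```python
import os
import tempfile
import zipfile


def zip_feed(feed_dir, zip_path):
    """Write every regular file in feed_dir to the root of a new Zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(os.listdir(feed_dir)):
            full = os.path.join(feed_dir, name)
            if os.path.isfile(full):
                # arcname drops the directory part so the file lands at the root.
                archive.write(full, arcname=name)
    return zip_path


with tempfile.TemporaryDirectory() as tmp:
    feed_dir = os.path.join(tmp, "feed")  # stand-in for the output directory
    os.mkdir(feed_dir)
    with open(os.path.join(feed_dir, "agency.txt"), "w", encoding="utf-8") as handle:
        handle.write("agency_id,agency_name\n231,Braunschweiger Verkehrs-GmbH\n")
    archive_path = zip_feed(feed_dir, os.path.join(tmp, "feed.zip"))
    with zipfile.ZipFile(archive_path) as archive:
        print(archive.namelist())  # ['agency.txt']
```

The archive is written outside feed_dir on purpose, so that it does not end up packed into itself.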
I have just tested round-tripping the STM's GTFS through the OBA GTFS transformer, and indeed it properly preserves characters outside the Unicode Basic Latin block. More to the point, inspection of the code in onebusaway-csv-entities for writing to loose files as well as to a Zip archive shows that in both cases files are written as UTF-8.
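A similar check can be run against any feed directory without reading the transformer's source. The sketch below is an illustration in Python, not the test described above: `non_utf8_files` is a hypothetical helper, and the two files written in the demonstration (including the stops.txt row) are made-up samples, one encoded as UTF-8 and one as Latin-1.

```python
import os
import tempfile


def non_utf8_files(feed_dir):
    """Return names of .txt files in feed_dir whose bytes are not valid UTF-8."""
    bad = []
    for name in sorted(os.listdir(feed_dir)):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(feed_dir, name), "rb") as handle:
            data = handle.read()  # whole file in memory; fine for a sketch
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return bad


with tempfile.TemporaryDirectory() as feed_dir:
    with open(os.path.join(feed_dir, "agency.txt"), "w", encoding="utf-8") as handle:
        handle.write("81,Stadtwerke Osnabrück AG - Verkehrsbetriebe\n")
    with open(os.path.join(feed_dir, "stops.txt"), "w", encoding="latin-1") as handle:
        handle.write("1,Osnabrück Hbf\n")  # hypothetical row, wrong encoding on purpose
    print(non_utf8_files(feed_dir))  # ['stops.txt']
```

A file containing only ASCII passes this check as well, since ASCII bytes are valid UTF-8.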
@kurtraschke Thank you for digging into this issue on the OBA side! From your investigation, it's clear this isn't an encoding issue in the GTFS transformer. We do have a specific error message in the case of an invalid ZIP file, but it looks like this is a case we don't sufficiently address. @dancesWithCycles, we haven't been able to reproduce your issue when we validate a folder.
For next steps:
- I'll close the issue on the OBA repo (since this is a specific problem on the validator)
- @dancesWithCycles, it would be great if you could provide a feed example so we can test it on our side.