📄🚀 – options for defining requirements for end-uses of TIDES data
Describe the feature you want and how it meets your needs or solves a problem
People want a way of standardizing which optional fields are required for various end-uses of TIDES data. How do I tell a vendor I need data in TIDES format, with at least the fields required for NTD service supplied reporting?
Describe the solution you'd like
My order of preference is Fork Repo, then Require Everything, then Feature Flags.
Describe alternatives you've considered
- Fork Repo: fork the TIDES repo, change the spec to make the fields you need required.
  - Pros:
    - forker maintains control over requirements
    - validation is easy with existing tools
  - Cons:
    - requirements not standardized across agencies
- Feature Flags: add a `features` property to each field in the table specs. `features` is an array of strings describing the features that require the field, e.g., `"features": [ "Playback", "NTDServiceSupplied" ]`.
  - Pros:
    - standardizes requirements for common end-uses
  - Cons:
    - more difficult to add or remove fields from the spec
    - requires building a validator that supports the feature flags
    - must know the requirements a priori
    - tools that produce the same output (e.g., NTD service supplied) could have different requirements depending on methods, or their own optional features (e.g., a departure prediction engine that predicts dwell time from APC data has very different requirements from one that predicts dwell time from historical dwells)
- Require Everything: require all tables and fields unless the vendor can demonstrate that they are not applicable to the system.
  - Pros:
    - simple
  - Cons:
    - self-certification of compliance can be problematic
    - validation would require forking the TIDES spec and setting the required constraint based on the vendor-negotiated requirements
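To make the Feature Flags alternative concrete, here is a minimal sketch of how a validator might derive the required fields for a requested feature from a `features` array on each field. The field descriptors, the feature assignments, and the helpers `required_fields_for` and `missing_fields` are hypothetical illustrations; none of them come from the TIDES spec or an existing tool.

```python
# Hypothetical field descriptors in the style of a table schema.
# The "features" assignments are illustrative only, not part of TIDES.
VEHICLE_LOCATIONS_FIELDS = [
    {"name": "event_timestamp", "features": ["Playback", "NTDServiceSupplied"]},
    {"name": "vehicle_id", "features": ["Playback", "NTDServiceSupplied"]},
    {"name": "trip_id_performed", "features": ["NTDServiceSupplied"]},
    {"name": "latitude"},   # no flag: optional for every feature
    {"name": "longitude"},  # no flag: optional for every feature
]


def required_fields_for(fields, wanted_features):
    """Return names of fields flagged for at least one wanted feature."""
    wanted = set(wanted_features)
    return [f["name"] for f in fields if wanted & set(f.get("features", []))]


def missing_fields(fields, wanted_features, header):
    """Return flagged fields that are absent from a data file's header."""
    present = set(header)
    return [
        name
        for name in required_fields_for(fields, wanted_features)
        if name not in present
    ]
```

For example, `missing_fields(VEHICLE_LOCATIONS_FIELDS, ["NTDServiceSupplied"], ["event_timestamp", "vehicle_id"])` would report `trip_id_performed` as missing. Note that the flags have to be decided up front, which is the "a priori" drawback listed above.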
Additional context and sample data
Describing the features required for a playback tool is a good example of the pitfalls of setting requirements based on features.
A Playback tool can use every field of the vehicle_locations, passenger_events and fare_transactions tables, as well as additional event data that aren't (yet) part of the TIDES spec, and it doesn't require some of the required fields, like trip_id_performed. The only absolutely required fields of vehicle_locations are probably timestamp and vehicle_id, since vehicle position may not always be available (position is optional in GTFS-realtime VehiclePositions).
You may want a field to be required but still allow nulls when information isn't available. For example, you might require latitude and longitude but let them be null when GPS is unavailable. Frictionless doesn't allow this: nulls and missing values are not allowed in required fields.
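One way to picture the difference between "the column must exist" and "the value must be present" is to check them separately. The sketch below is plain Python using only the standard library, not Frictionless; the function `check_columns` and the sample rows are hypothetical.

```python
import csv
import io


def check_columns(csv_text, must_exist, must_have_value):
    """Check a CSV for columns that must exist and columns that must be non-empty.

    must_exist: columns that must appear in the header (empty cells allowed).
    must_have_value: columns that must appear and be non-empty in every row.
    Returns a list of human-readable problems.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = set(reader.fieldnames or [])
    problems = [
        f"missing column: {name}"
        for name in sorted((set(must_exist) | set(must_have_value)) - header)
    ]
    # Only test values for columns that are actually present.
    value_columns = [c for c in must_have_value if c in header]
    for row_number, row in enumerate(reader, start=1):
        for name in value_columns:
            if (row[name] or "").strip() == "":
                problems.append(f"row {row_number}: empty value in {name}")
    return problems


# Hypothetical sample: the second data row has no GPS position.
sample = (
    "event_timestamp,vehicle_id,latitude,longitude\n"
    "2024-01-01T08:00:00Z,1001,44.95,-93.09\n"
    "2024-01-01T08:00:10Z,1001,,\n"
)
```

With this sample, `check_columns(sample, ["latitude", "longitude"], ["event_timestamp", "vehicle_id"])` returns an empty list, because the position columns exist even though one row leaves them blank. Moving `latitude` into `must_have_value` would flag that row.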
Finally, adding feature flags complicates changes to the spec. If a field has a feature flag and we decide it should be removed, does that mean the feature will break? If we want to add a field, do we need to figure out what features would require it? How do feature flags interact with versioning? There's a desire for a stable document for RFP requirements, but what happens when you discover that an optional field is required for a feature? Do you have to update the version?
Another option is to have Feature Flags defined inside Field Profile files.
The file name would reflect the feature flag name, e.g., NTDServiceSupplied.csv (single column). The file would contain the list of required field names (materialized paths could be used to name fields by their location in the tree). When multiple features are needed, a merge-sort would combine the files.
- Pros: a vendor or team can create their own files so their needs don't mix with the standard; can be used along with Spec Feature Flags to override spec flag properties.
- Cons: mostly the same as Feature Flags.
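A rough sketch of combining several Field Profile files follows. It assumes each file is a headerless single-column CSV and that fields are named with a `table/field` path; both conventions, and the helpers `read_profile` and `merge_profiles`, are assumptions made for illustration only.

```python
import csv
import io


def read_profile(csv_text):
    """Read a single-column profile: one required field name per line."""
    names = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0].strip():
            names.append(row[0].strip())
    return names


def merge_profiles(*profiles):
    """Combine profiles into one sorted list without duplicates."""
    return sorted(set().union(*profiles))


# Hypothetical contents of NTDServiceSupplied.csv and Playback.csv.
ntd_service_supplied = read_profile(
    "trips_performed/service_date\n"
    "trips_performed/trip_id_performed\n"
)
playback = read_profile(
    "vehicle_locations/event_timestamp\n"
    "vehicle_locations/vehicle_id\n"
    "trips_performed/trip_id_performed\n"
)

combined = merge_profiles(ntd_service_supplied, playback)
```

Here `combined` holds four entries: the field shared by both profiles appears once, and the result is sorted so that fields of the same table sit together.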
Another option to consider is separately listing the required files and fields in a file that defines a TIDES "profile". This could take the form of a JSON or CSV file. (Maybe CSV would be better for less technical users to define and read, such as writers of an RFP requiring data in TIDES format.) The absence of a file or a field would imply that the file or field is not required (but could still be optionally included). For example, in tabular format, one could have something like this in an RFP for a basic AVL system that isn't connected to doors, APC, or AFC, in a bus-only transit agency:
| File | Field | Notes |
|---|---|---|
| stop_visits | service_date | |
| stop_visits | trip_id_performed | |
| stop_visits | stop_sequence | |
| stop_visits | vehicle_id | |
| stop_visits | pattern_id | can be null when unknown |
| stop_visits | stop_id | can be null if the vehicle stops at an undefined location |
| stop_visits | actual_arrival_time | |
| stop_visits | actual_departure_time | |
| stop_visits | schedule_relationship | |
| vehicle_locations | location_ping_id | |
| vehicle_locations | service_date | |
| vehicle_locations | event_timestamp | records at least every 10 seconds when unit is on and vehicle is in motion |
| vehicle_locations | trip_id_performed | can be null when not in a defined trip |
| vehicle_locations | stop_sequence | can be null when not at a stop |
| vehicle_locations | vehicle_id | |
| vehicle_locations | pattern_id | can be null when unknown or not in a trip |
| vehicle_locations | stop_id | only defined when serving a stop |
| vehicle_locations | latitude | |
| vehicle_locations | longitude | |
| vehicle_locations | in_service | |
| vehicle_locations | schedule_relationship | |
| trips_performed | service_date | |
| trips_performed | trip_id_performed | |
| trips_performed | vehicle_id | |
| trips_performed | trip_id_scheduled | |
| trips_performed | route_id | |
| trips_performed | pattern_id | |
| trips_performed | direction_id | |
| trips_performed | block_id | |
| trips_performed | schedule_relationship | |
If we standardize the format in which requirements for apps or vendors are specified, then it will be possible to define a collection of TIDES "profiles" that people can refer to and that can be loaded into a program that checks a dataset against the profile (not the notes, but at least the presence of files and fields).
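As a sketch of what such a program might look like, the code below loads a profile in the File/Field/Notes layout from CSV text and reports missing files and fields. The function names and the `dataset_headers` structure are hypothetical; the notes column is ignored because it needs a human reviewer.

```python
import csv
import io


def load_profile(csv_text):
    """Parse a File,Field,Notes profile into {file: [field, ...]}."""
    required = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        required.setdefault(row["File"], []).append(row["Field"])
    return required


def check_dataset(profile, dataset_headers):
    """Compare a profile with the headers found in a dataset.

    dataset_headers maps each file name present in the dataset to its
    list of column names. Only presence is checked, not values or notes.
    """
    problems = []
    for file_name, fields in profile.items():
        if file_name not in dataset_headers:
            problems.append(f"missing file: {file_name}")
            continue
        present = set(dataset_headers[file_name])
        for field in fields:
            if field not in present:
                problems.append(f"missing field: {file_name}.{field}")
    return problems


# A three-row excerpt of the profile above, as CSV text.
PROFILE_CSV = (
    "File,Field,Notes\n"
    "stop_visits,service_date,\n"
    "stop_visits,stop_id,can be null if the vehicle stops at an undefined location\n"
    "vehicle_locations,vehicle_id,\n"
)

profile = load_profile(PROFILE_CSV)
problems = check_dataset(profile, {"stop_visits": ["service_date", "trip_id_performed"]})
```

In this example `problems` lists the missing `stop_visits.stop_id` field and the missing `vehicle_locations` file. Extra files or fields in the dataset are not reported, matching the idea that anything absent from the profile is optional.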
@e-lo, @jlstpaul, what do you think?
@gabriel-korbato is the profile you describe similar to the one @mpaine-act described above (on Dec 19)? They seem similar, but if not I'd like to understand the difference.
Overall I think this is a reasonable approach in the short term. In the long term, if we get a lot of different uses of TIDES data, it may become difficult to centrally manage these profiles, but for now it provides a good framework to develop the concept.
@jlstpaul Yes, very similar. I suggested a 3-column table instead of a single column table, but that's just a formatting difference and either would work. My third column with notes adds the possibility of defining extra requirements or clarifications that humans have to read, analyze, and check.