gtfs-validator icon indicating copy to clipboard operation
gtfs-validator copied to clipboard

Implement threshold for triggering trip_distance_exceeds_shape_distance

Open emmambd opened this issue 2 years ago • 12 comments

Describe the problem

We've seen trip_distance_exceeds_shape_distance trigger as an error from anywhere between 6-7% of feeds in the Mobility Database. Several producers have expressed concern with this new error.

Analysis report

Python script

Proposed solution

Implement a threshold where this ERROR only generates if the difference between the max trips.txt shape_dist_traveled and the max shapes.txt shape_dist_traveled is greater than 1.11m, similar to our plans for equal_shape_dist_traveled_diff_coordinates. Rerun the acceptance tests to see how many feeds trigger this rule.

A warning should be added for when trip_distance_exceeds_shape_distance is less than 1.11m.

Alternatives you've considered

No response

Additional context

No response

emmambd avatar Nov 27 '23 18:11 emmambd

I'm curious what the specific concern from producers is. In my opinion, if there's an error, it should be stated in the validator report. The error itself does not depend on whether the distance traveled passes a certain threshold or not, so I'd suggest not putting a threshold in whether the error is reported.

As a producer, I would want to know that the feeds I produce have the error (and some of the feeds my company has produced do in fact have this error) so that I can fix it, and not have the validator refrain from telling me it if it's "not bad enough."

stevenmwhite avatar Dec 01 '23 23:12 stevenmwhite

A few of our project managers working directly with customer agencies have noted that transit agencies are getting spooked because the NTD Program Manager is forwarding them validation reports that show this error. While they specifically state that the reports are offered "solely [as a] courtesy and the resolution of the above errors are not required for FTA to accept your 2023 NTD report" the agencies have a fear that it may impact NTD report acceptance in the future.

Perhaps this is also why some other producers are concerned?

If that's the case, maybe we adjust the suggestion of excluding errors when the difference is less than 1% to instead report a warning when the difference is less than 1%. That way, we can still get the information out of the validator regardless of the size of the difference, but don't need to spook anyone with a minor, non-impactful error?

I wouldn't say this is my preference (an error is an error and if this is truly "canonical" it should call it an error), but because the error doesn't have much impact in those minor cases, I'm certainly open to it being a warning if that helps ease concerns.

EDIT: On the other hand, if this doesn't have much impact for consumers perhaps it should just be a warning all the time? I'm not sure if that's canonically accurate, but would also be fine for me. I just don't want to fully exclude any instances.

stevenmwhite avatar Dec 07 '23 05:12 stevenmwhite

Thanks for sharing your thoughts publicly @stevenmwhite! We dug into some old conversations about distance thresholds on the validator, and plan to work on changing the rule this quarter so

  • any difference above 1.11m = ERRROR
  • any difference below 1.11m = WARNING

We'll see if we should make the threshold more permissive than that based on the number of feeds we see still triggered by the ERROR and what their differences are.

emmambd avatar Jan 12 '24 16:01 emmambd

@emmambd The one issue I see with this is that the distance along route does not have a unit specified in GTFS. Some feeds may be in meters, some in kilometers, some in miles... it could even be in "average lengths of Steve's stride" (I hope nobody's feed using that as their distance measurements, haha!)

Our feeds happen to be in meters, but that's not part of the spec and we can't assume that all feeds are.

stevenmwhite avatar Jan 12 '24 16:01 stevenmwhite

Please share if we already have a pre-release msi available for this fix. We are seeing a ton of errors due to the trip distance exceeding the shape distance by 0.2 meters or less and we would like to get this fixed before testing the same infront of the transit authority. Is it already available, if not any approximate date on when it will be?

sivakarthik07 avatar Feb 01 '24 17:02 sivakarthik07

@stevenmwhite Thanks for that note Steve! We've been exploring ways we can check the distance between the lat longs of the last stop and the end of the shape in cases where the threshold is met.

@sivakarthik07 There is no PR or fix for this issue yet, but we are planning to deliver it for our next release in the end of March. How are you using the validator?

emmambd avatar Feb 01 '24 18:02 emmambd

@emmambd Thanks for responding! I am using the windows based 4.2 version of the validator.
I guess end of March should be OK, but a pre-release version before would be much appreciated.

sivakarthik07 avatar Feb 01 '24 20:02 sivakarthik07

Thanks @sivakarthik07 ! We're currently working on a fix and plan to provide package installers for every merge to main within the next month as well.

Could you share some sample feeds to see how our new threshold impacts them?

emmambd avatar Feb 21 '24 20:02 emmambd

cc @stevenmwhite @sivakarthik07

An update on the threshold that will be used in this rule — we initially set the threshold to 1.11m between the last shape point and the last stop point, but were still concerned about it being too restrictive given the high number of concerns we've received with this notice. After doing several distribution analyses of the data, we've concluded that 11.1m is permissive enough while still being useful to consumers to flag major problems in the shape_dist_traveled values.

Why 11.1m?

  • This is an extension of our logic for setting a threshold for equal_shape_dist_traveled_diff_coordinates, capturing differences between same values that resulting from rounding to 4 decimal places in lat/long values.

  • All 50th percentiles in the min distance distributions fell under 11.1m, indicating it's not restrictive for real world use cases.

We'll proceed with this threshold and seek feedback after the upcoming release this month for any other feedback on this notice change.

emmambd avatar Mar 08 '24 18:03 emmambd

Thanks @emmambd, I know this will relieve the concern from our project teams and customers in most cases.

Just two questions:

  1. Are you measuring the meter distance using lat/long values, knowing that the shape_distance_traveled value is unit-less?
  2. Has there been consideration to still throw a warning (as opposed to an error) for values that are beyond the shape but within the threshold? As a producer, it's really helpful to know that we can trust the validator to be as clear as possible and know that we'll be able to discover any/every case where a feed we produce doesn't match the spec. With this change, we're now adding subjective judgement to the validator and there'll be cases where feeds are "incorrect" but we won't learn that from the validator.

stevenmwhite avatar Mar 08 '24 18:03 stevenmwhite

@stevenmwhite Hi Steve,

  1. Correct - the meter distance calculation is based on the lat/longs precisely because there is no unit defined for shape_distance_traveled.

  2. yes, included in the PR is a warning, so anything below this threshold will trigger a WARNING rather than nothing. We agree that all cases should be flagged when we're applying a threshold, just with varying severity.

emmambd avatar Mar 08 '24 18:03 emmambd

Perfect! That seems to cover all of my concerns!

stevenmwhite avatar Mar 08 '24 18:03 stevenmwhite

@sivakarthik07 You can add this fix into your pipeline now by using the package installers associated with the most recent commit to master.

Please note that there's been a breaking change to the JSON report to improve the validatedAt timestamp: https://github.com/MobilityData/gtfs-validator/pull/1691. More info will be included about this in the release later this month.

emmambd avatar Mar 11 '24 18:03 emmambd