[BUG] Timing data contains laps with incorrect duplicate lap times
Describe the issue:
For the Qualifying of the 2023 Canadian GP, the timing data for some drivers contains laps that have the exact same lap time as a previous lap.
For example: Perez' first two laps, Verstappen's last two laps
Reference: https://www.fia.com/sites/default/files/2023_09_can_f1_q0_timing_qualifyingsessionlaptimes_v01.pdf
Edit after first investigation:
The laps that have incorrect lap times (and sector 3 times) are laps during which the session was red-flagged. The lap time and sector 3 time of the previous lap are then received again from the API, i.e. the incorrectly duplicated data already exists in the source data.
Expected Behaviour
FastF1 should detect that these values are incorrect and ignore them.
Reproduce the code example:
import fastf1
session = fastf1.get_session(2023, 'Canada', 'Q')
session.load(telemetry=False)
ver = session.laps.pick_driver('VER')
print(ver.loc[:, ('LapNumber', 'Time', 'LapTime')])
Error message:
core INFO Loading data for Canadian Grand Prix - Qualifying [v3.0.4]
req INFO Using cached data for driver_info
req INFO Using cached data for session_status_data
req INFO Using cached data for track_status_data
req INFO Using cached data for timing_data
req INFO Using cached data for timing_app_data
core INFO Processing timing data...
req INFO Using cached data for weather_data
req INFO Using cached data for race_control_messages
core INFO Finished loading data for 20 drivers: ['1', '27', '14', '44', '63', '31', '4', '55', '81', '23', '16', '11', '18', '20', '77', '22', '10', '21', '2', '24']
LapNumber Time LapTime
0 1.0 0 days 00:25:02.786000 NaT
1 2.0 0 days 00:26:36.836000 NaT
2 3.0 0 days 00:28:00.942000 0 days 00:01:24.106000
3 4.0 0 days 00:29:23.785000 0 days 00:01:22.843000
4 5.0 0 days 00:30:54.954000 0 days 00:01:31.169000
5 6.0 0 days 00:32:16.942000 0 days 00:01:21.988000
6 7.0 0 days 00:33:56.601000 0 days 00:01:39.659000
7 8.0 0 days 00:35:22.770000 0 days 00:01:26.169000
8 9.0 0 days 00:36:44.509000 0 days 00:01:21.739000
9 10.0 0 days 00:38:41.690000 0 days 00:01:57.181000
10 11.0 0 days 00:40:02.541000 0 days 00:01:20.851000
11 12.0 0 days 00:47:02.809000 NaT
12 13.0 0 days 00:48:34.656000 NaT
13 14.0 0 days 00:49:54.791000 0 days 00:01:20.135000
14 15.0 0 days 00:51:39.433000 0 days 00:01:44.642000
15 16.0 0 days 00:53:10.331000 0 days 00:01:30.898000
16 17.0 0 days 00:54:30.708000 0 days 00:01:20.377000
17 18.0 0 days 00:55:49.800000 0 days 00:01:19.092000
18 19.0 0 days 00:57:16.184000 0 days 00:01:26.384000
19 20.0 0 days 00:58:42.033000 0 days 00:01:25.849000
20 21.0 0 days 01:10:02.660000 NaT
21 22.0 0 days 01:11:38.752000 NaT
22 23.0 0 days 01:13:05.811000 0 days 00:01:27.059000
23 24.0 0 days 01:14:31.669000 0 days 00:01:25.858000
24 25.0 0 days 01:21:59.694000 0 days 00:01:25.858000
25 26.0 0 days 01:23:44.223000 NaT
I was wondering whether we could check the race_control_messages for RED FLAG and try to find laps for which red_flag_time < Time < immediate_green_flag, where immediate_green_flag is the first green flag after that red flag. This would require some processing and helper methods to analyse the laps together with the race_control_messages.
I also noticed that the lap "Time" (start time) is, I think, in GMT, while the race control messages may be in local time at the track. Should the race control message times be converted to GMT?
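To make the red-flag idea above concrete, here is a minimal sketch using plain pandas, so it runs without downloading any session data. It is only an illustration under several assumptions: the helper name `laps_overlapping_red_flag` is made up, the red flag window is a hypothetical value (the real one would have to be derived from race_control_messages), and "Time" is treated as the end of the lap. It also flags any lap that overlaps the window instead of requiring "Time" to fall inside it, because the affected lap in the output above only ends well after the previous one. The lap values are Verstappen's laps 23 to 25 from that output.

```python
import pandas as pd

def laps_overlapping_red_flag(laps, red_flag_windows):
    """Return a boolean Series that is True for laps in progress during a red flag.

    A lap is assumed to run from the previous lap's 'Time' to its own 'Time'.
    """
    lap_end = laps["Time"]
    lap_start = laps["Time"].shift(1)  # NaT for the first row, which is never flagged
    affected = pd.Series(False, index=laps.index)
    for start, end in red_flag_windows:
        # standard interval overlap test; comparisons with NaT evaluate to False
        affected |= (lap_start < end) & (lap_end > start)
    return affected

# Verstappen's laps 23-25, copied from the output in the issue
laps = pd.DataFrame({
    "LapNumber": [23.0, 24.0, 25.0],
    "Time": pd.to_timedelta(["01:13:05.811", "01:14:31.669", "01:21:59.694"]),
    "LapTime": pd.to_timedelta(["00:01:27.059", "00:01:25.858", "00:01:25.858"]),
})

# HYPOTHETICAL red flag window in session time, for illustration only
red_flag_windows = [(pd.Timedelta("01:15:00"), pd.Timedelta("01:20:00"))]

affected = laps_overlapping_red_flag(laps, red_flag_windows)
laps.loc[affected, "LapTime"] = pd.NaT  # discard lap times set under the red flag
print(laps)
```

With this made-up window, only lap 25 is flagged and its lap time is replaced by NaT. Whether overlap is the right criterion, and how to bring both time sources onto the same time base, would still need to be checked against real sessions.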
import fastf1
session = fastf1.get_session(2023, 'Canada', 'Q')
session.load(telemetry=False)
ver = session.laps.pick_driver('VER')
# work on a copy so that the helper columns below are not written into session.laps
ver_df = ver.loc[:, ['LapNumber', 'Time', 'LapTime']].copy()
# difference between this lap's finish time and the previous lap's finish time
ver_df['sub_time'] = ver_df['Time'].diff()
# True if that difference equals the reported 'LapTime'
ver_df['bool_check'] = ver_df['sub_time'] == ver_df['LapTime']
# True if 'LapTime' equals the 'LapTime' of the previous row
ver_df['bool_previous_lap'] = ver_df['LapTime'] == ver_df['LapTime'].shift(1)
# if "bool_check" is False and "bool_previous_lap" is True, set "LapTime" to None
ver_df['LapTime'] = ver_df['LapTime'].mask(~ver_df['bool_check'] & ver_df['bool_previous_lap'], None)
# remove the helper columns that were only needed to find the duplicates
ver_df = ver_df.drop(columns=['sub_time', 'bool_check', 'bool_previous_lap'])
I tried something like this, but please correct me if it could lead to "useful" laps being ignored. The code adds a few temporary columns to check two conditions:
- The first condition checks whether the difference to the previous lap's "Time" equals the current "LapTime". For a normal lap, this should always be True;
- The second condition checks whether the current "LapTime" equals the previous "LapTime". This strengthens the first condition by checking that the lap is also a possible duplicate.
So, if the first condition is False and the second condition is True, the lap time is set to None. Finally, the temporary columns are removed from the dataframe.
@d-tomasino this seems to work, although I'm not entirely happy with a solution like this because it just assumes that any lap time that matches these criteria is incorrect. F1 drivers surprisingly often set two successive laps with exactly the same time (can happen multiple times per race actually).
So this would need more extensive testing on multiple sessions, with manual verification that the removed laps were correctly detected.
Additionally, your first check is in theory already implemented in the API parser. It should warn the user about "timing integrity errors", but apparently it is not triggered here. Before fixing this we should figure out why this warning is not shown because there has to be something else that's going on.
@theOehrly thanks for the reply! You're right, two consecutive laps with exactly the same time can happen more often than one would expect. However, in that case (as far as I understand) the difference in "Time" between the two adjacent rows should match the "LapTime" value. So for two genuine consecutive laps, the two conditions should be True and True, instead of False and True as in the red flag case here. I could be wrong though, so please correct me if I got something wrong.
In any case, as soon as I can, I will try to look first at why the "timing integrity errors" warning is not shown, so that we can solve this step by step.
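To illustrate the True/True versus False/True distinction, here is a small self-contained sketch. The helper name `suspicious_duplicates` is made up for this example, the first dataframe uses Verstappen's laps 23 to 25 from the output above, and the second dataframe is entirely hypothetical data for two genuine consecutive laps with the same time. It assumes exact equality of the timedelta values, as in the snippet posted earlier.

```python
import pandas as pd

def suspicious_duplicates(laps):
    """True where 'LapTime' repeats the previous lap's value but does not
    match the gap between the two 'Time' values."""
    matches_gap = laps["Time"].diff() == laps["LapTime"]
    repeats_previous = laps["LapTime"] == laps["LapTime"].shift(1)
    return repeats_previous & ~matches_gap

# Red flag case: laps 23-25 of Verstappen, copied from the output in the issue
red_flag_case = pd.DataFrame({
    "Time": pd.to_timedelta(["01:13:05.811", "01:14:31.669", "01:21:59.694"]),
    "LapTime": pd.to_timedelta(["00:01:27.059", "00:01:25.858", "00:01:25.858"]),
})

# HYPOTHETICAL data: two genuine consecutive laps with an identical lap time
genuine_case = pd.DataFrame({
    "Time": pd.to_timedelta(["00:10:00.000", "00:11:25.858", "00:12:51.716"]),
    "LapTime": pd.to_timedelta(["00:01:30.000", "00:01:25.858", "00:01:25.858"]),
})

red_flag_result = suspicious_duplicates(red_flag_case)
genuine_result = suspicious_duplicates(genuine_case)
print(red_flag_result.tolist())
print(genuine_result.tolist())
```

In the red flag case only the last lap is marked, because its "Time" is about seven and a half minutes after the previous one while the reported lap time is 1:25.858. In the hypothetical genuine case nothing is marked, since the gap between the "Time" values matches the repeated lap time. This does not replace the testing on multiple sessions that was asked for above.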
Noting that this may not just be limited to laps near red flags, see #612. Also remember to investigate potential relation with #473