trackintel write_csv functions allow writing CSV files that are not in trackintel format
At the moment it is possible to write a .csv file that does not conform to trackintel standards with the write_csv functions. I think this is a problem because the file cannot be opened later with the corresponding read_csv function. Put differently, every file written with a trackintel write_csv function should be readable with a trackintel read_csv function. This is currently not the case if the write_csv function is called without the accessor (e.g., ti.io.write_staypoints_csv).
I think the cause is that the dataframe is not checked before writing. An easy solution would be to call the accessor before writing the dataframe in order to validate it.
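A minimal sketch of the "validate before writing" idea, using plain pandas so that it runs without trackintel. The names `REQUIRED_COLUMNS` and `write_validated_csv` are hypothetical and only illustrate the pattern; the column set is an assumption based on the sample code in this issue, not trackintel's actual validation logic.

```python
import io

import pandas as pd

# Assumed for illustration: the columns a staypoints reader expects.
REQUIRED_COLUMNS = ("user_id", "started_at", "finished_at", "geometry")


def write_validated_csv(df, target):
    """Write df as CSV only if every required column is present."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"cannot write staypoints, missing columns: {missing}")
    df.to_csv(target, index=True)


valid = pd.DataFrame(
    {
        "user_id": [0],
        "started_at": ["1971-01-01 00:00:00+00:00"],
        "finished_at": ["1971-01-01 05:00:00+00:00"],
        "geometry": ["POINT (8.5067847 47.4)"],
    }
)
valid.index.name = "id"

buffer = io.StringIO()
write_validated_csv(valid, buffer)  # complete frame: written as usual

broken = valid.drop(columns=["finished_at"])
try:
    write_validated_csv(broken, io.StringIO())
    rejected = False
except ValueError:
    rejected = True  # incomplete frame: refused at write time
```

With a check like this, the failure surfaces when the file is written rather than later when it is read back.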
Here is some sample code:
import trackintel as ti
from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
import datetime
p1 = Point(8.5067847, 47.4)
p2 = Point(8.5067847, 47.5)
p3 = Point(8.5067847, 47.6)
t1 = pd.Timestamp("1971-01-01 00:00:00", tz="utc")
t2 = pd.Timestamp("1971-01-01 05:00:00", tz="utc")
t3 = pd.Timestamp("1971-01-02 07:00:00", tz="utc")
one_hour = datetime.timedelta(hours=1)
list_dict = [
{"user_id": 0, "started_at": t1, "finished_at": t2, "geom": p1},
{"user_id": 0, "started_at": t2, "finished_at": t3, "geom": p2},
{"user_id": 1, "started_at": t3, "finished_at": t3 + one_hour, "geom": p3},
]
sp = gpd.GeoDataFrame(data=list_dict, geometry="geom", crs="EPSG:4326")
sp.index.name = "id"
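sp.rename(inplace=True, columns={"geom": "geometry"})
# Drop a required column so that sp is no longer a valid staypoints frame.
sp.drop(columns=["finished_at"], inplace=True)
# Writing succeeds because the function interface does not validate sp ...
ti.io.write_staypoints_csv(sp, "test2.csv")
# ... but reading the file back fails with KeyError: 'finished_at'.
sp2 = ti.io.read_staypoints_csv("test2.csv", geom_col="geometry")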
This produces the following error:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\pandas\core\indexes\base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'finished_at'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-13-818f93779dfc>", line 1, in <module>
sp2 = ti.io.read_staypoints_csv("test2.csv", geom_col="geometry")
File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\trackintel\io\file.py", line 30, in wrapper
return func(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\trackintel\io\file.py", line 293, in read_staypoints_csv
df["finished_at"] = pd.to_datetime(df["finished_at"])
File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'finished_at'
I also think this is a problem, but it is more a problem with the way the library is set up. We have two ways to access this function:
ti.io.write_positionfixes(sp, "test2.csv")
sp.as_positionfixes.to_csv("test2.csv")
and the latter, which is the preferred way to access this function, checks the attributes.
If we now added a check to write_positionfixes as well, we would pay this overhead twice whenever the function is reached through the accessor.
But maybe this is not a big problem and we should add it anyway? What do you think?
Hm... it's true that this is a problem in the architecture, and I don't really see an easy way out. Maybe there is an easy way to tell the write_positionfixes function where the user is coming from, so that it knows whether the validation can be skipped?
If you have no specific idea, I would just add it anyway.
We could use inspect.currentframe().f_back (stackoverflow) to get the caller's frame, but I have to say that is really hacky. :D
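For reference, a small sketch of what the frame-inspection approach could look like. The functions `write_csv` and `to_csv` are hypothetical stand-ins, and matching on the caller's function name is an assumption made for illustration; it is fragile, because any unrelated function that happens to be named `to_csv` would also skip validation.

```python
import inspect


def write_csv(data):
    """Return True if validation would run, False if it would be skipped."""
    frame = inspect.currentframe()  # may be None on some Python implementations
    try:
        caller = frame.f_back.f_code.co_name if frame and frame.f_back else None
    finally:
        del frame  # avoid keeping a reference cycle through the frame alive
    # Skip validation only when called from the accessor-style method.
    return caller != "to_csv"


def to_csv(data):
    """Stand-in for the accessor method, which has already validated data."""
    return write_csv(data)


direct = write_csv([])       # called directly: validation would run
via_accessor = to_csv([])    # called through to_csv: validation is skipped
```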
Has either of us ever measured how expensive that check is? If not, I'll measure the difference and just add the extra check if it isn't too much.
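A rough way to measure this is to time the check and the write separately with `timeit` and compare them. In the sketch below, `validate` and `write` are hypothetical stand-ins so that it runs without trackintel; for a real measurement they would be replaced by the actual validation and the actual write_csv call on a representative dataset.

```python
import timeit


def validate(values):
    # Stand-in for the real check: every value must be non-negative.
    assert all(v >= 0 for v in values)


def write(values):
    # Stand-in for serialization: build the text that would go to disk.
    return "\n".join(str(v) for v in values)


values = list(range(100_000))

# Take the minimum over several repeats to reduce noise from other processes.
check_time = min(timeit.repeat(lambda: validate(values), number=5, repeat=3))
write_time = min(timeit.repeat(lambda: write(values), number=5, repeat=3))

overhead = check_time / write_time
print(f"validation costs {overhead:.0%} of the write time")
```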
The check runs assert obj.geometry.is_valid.all(), which isn't great in terms of performance, but I think we should just add the check to the write_csv functions and optimize performance later if necessary.
Is this solved with #490? @bifbof