
trackintel write csv function allows writing non-trackintel format csv files

Open henrymartin1 opened this issue 3 years ago • 5 comments

At the moment, the write_csv functions can write a .csv file that does not conform to the trackintel standards. I think this is a problem because the file cannot be opened later with a read_csv function. Put differently, every file written with a trackintel write_csv function should be readable with a trackintel read_csv function. This is currently not the case when the write_csv function is called without the accessor (e.g., ti.io.write_staypoints_csv).

I think the problem is that the dataframe is not checked before writing. An easy solution would be to call the accessor before writing the dataframe in order to validate it.

Here is some sample code:

import trackintel as ti

from shapely.geometry import Point
import pandas as pd
import geopandas as gpd
import datetime
p1 = Point(8.5067847, 47.4)
p2 = Point(8.5067847, 47.5)
p3 = Point(8.5067847, 47.6)

t1 = pd.Timestamp("1971-01-01 00:00:00", tz="utc")
t2 = pd.Timestamp("1971-01-01 05:00:00", tz="utc")
t3 = pd.Timestamp("1971-01-02 07:00:00", tz="utc")
one_hour = datetime.timedelta(hours=1)

list_dict = [
    {"user_id": 0, "started_at": t1, "finished_at": t2, "geom": p1},
    {"user_id": 0, "started_at": t2, "finished_at": t3, "geom": p2},
    {"user_id": 1, "started_at": t3, "finished_at": t3 + one_hour, "geom": p3},
]
sp = gpd.GeoDataFrame(data=list_dict, geometry="geom", crs="EPSG:4326")
sp.index.name = "id"

sp.rename(inplace=True, columns={"geom":"geometry"})
sp.drop(columns=['finished_at'], inplace=True)


ti.io.write_staypoints_csv(sp, "test2.csv")
sp2 = ti.io.read_staypoints_csv("test2.csv", geom_col="geometry")

This produces the following error:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\pandas\core\indexes\base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'finished_at'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-818f93779dfc>", line 1, in <module>
    sp2 = ti.io.read_staypoints_csv("test2.csv", geom_col="geometry")
  File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\trackintel\io\file.py", line 30, in wrapper
    return func(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\trackintel\io\file.py", line 293, in read_staypoints_csv
    df["finished_at"] = pd.to_datetime(df["finished_at"])
  File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\ProgramData\Anaconda3\envs\graphentropy\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'finished_at'

henrymartin1 • Dec 01 '22 15:12
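
The fix proposed above, validating before writing, could look roughly like the following. This is a minimal sketch and not trackintel code: `REQUIRED_STAYPOINT_COLUMNS`, `missing_columns`, and `write_validated_csv` are hypothetical names, and the column list is an assumption based on the sample in this issue. It works with any object that exposes `.columns` and `.to_csv`, such as a pandas DataFrame.

```python
# Hypothetical sketch, not the trackintel implementation.
# Assumed required columns, taken from the sample code in this issue.
REQUIRED_STAYPOINT_COLUMNS = ("user_id", "started_at", "finished_at", "geometry")


def missing_columns(columns, required=REQUIRED_STAYPOINT_COLUMNS):
    """Return the required column names that are absent, in declaration order."""
    present = set(columns)
    return [name for name in required if name not in present]


def write_validated_csv(frame, path, required=REQUIRED_STAYPOINT_COLUMNS):
    """Write `frame` to `path` only if it has every required column."""
    missing = missing_columns(frame.columns, required)
    if missing:
        # Fail at write time instead of at the next read.
        raise ValueError(f"cannot write {path!r}: missing columns {missing}")
    frame.to_csv(path)
```

With the sample above, where `finished_at` was dropped, such a writer would raise at write time rather than leaving a file that fails with a `KeyError` on reading.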

I also think this is a problem, but it is more a problem with the way the library is set up. We have two ways to access this function:

ti.io.write_positionfixes(sp, "test2.csv")
sp.as_positionfixes.to_csv("test2.csv")

The latter, which is the preferred way to access this function, checks the attributes. If we added a check to write_positionfixes as well, that path would pay the validation overhead twice. But maybe this is not a big problem and we should add it anyway? What do you think?

bifbof • Dec 04 '22 17:12

Hm... it's true that this is a problem in the architecture, and I don't really see an easy way out. Maybe there is an easy way to tell the write_positionfixes function where the user is coming from, and therefore whether we can skip the validation? If you have no specific idea, I would just add it anyway.

henrymartin1 • Dec 05 '22 08:12

We could use inspect.currentframe().f_back (stackoverflow) to get the caller's frame, but I have to say that is really hacky. :D Has either of us ever tested how much work that check is? Otherwise I'll measure the difference and just add the extra check if it isn't too much.

bifbof • Dec 05 '22 10:12
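
For reference, the frame-inspection idea mentioned above might be sketched as follows. The function names `caller_name`, `write_csv`, and `to_csv` are hypothetical stand-ins, not trackintel code. Note that `inspect.currentframe()` can return `None` on Python implementations without stack frame support, which is one more reason to treat this as a hack.

```python
import inspect


def caller_name():
    """Return the name of the function that called the caller of this helper."""
    frame = inspect.currentframe()
    try:
        # f_back once: whoever called caller_name(); twice: that function's caller.
        return frame.f_back.f_back.f_code.co_name
    finally:
        del frame  # avoid a reference cycle, as the inspect docs recommend


def write_csv():
    """Hypothetical writer that skips validation when called from `to_csv`."""
    return "skip validation" if caller_name() == "to_csv" else "validate"


def to_csv():
    """Hypothetical accessor method that has already validated the data."""
    return write_csv()
```

Matching on a function name is fragile: any unrelated function that happens to be called `to_csv` would also skip validation.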

The check runs assert obj.geometry.is_valid.all(), which isn't great in terms of performance, but I think we should just add the check to the write_csv functions and optimize performance later if necessary.

henrymartin1 • Dec 05 '22 10:12
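
To answer the question of how expensive the check is relative to the write itself, a small timing helper would be enough. This is a generic sketch using only the standard library; `time_call` is a hypothetical name and no measurements are implied here.

```python
import time


def time_call(func, *args, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

Assuming the `sp` GeoDataFrame from the sample above, one could then compare `time_call(lambda: sp.geometry.is_valid.all())` with `time_call(sp.to_csv, "test2.csv")` on a realistically sized dataset.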

Is this solved with #490? @bifbof

hongyeehh • Aug 26 '23 19:08