Kerberos // Failed to extract principal from ticket cache: Bad format in credentials cache
Hi there, we have a clustered Kerberos setup with SSL, and I basically used the same keytab that I'm using for other services to try out hdfs3, like this:
from hdfs3 import HDFileSystem
conf = {'hadoop.security.authentication': 'kerberos'}
ticket_path = '/home/dazer/mykey.keytab'
hdfs = HDFileSystem(host='hdfs://myhost', port=8020, pars=conf, ticket_cache=ticket_path)
and got the error:
ConnectionError: Connection Failed: HdfsIOException: FileSystem: Failed to extract principal from ticket cache: Bad format in credentials cache (filename: /home/dazer/mykey.keytab)
Any clue? Thanks!
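The error message suggests that `ticket_cache` is being pointed at a keytab, while libhdfs3 expects a Kerberos credentials cache (the file `kinit` writes). A sketch of one thing to try — the principal name and cache path below are placeholders, not taken from the report:

```shell
# A keytab is not a credentials cache. First obtain a ticket using the keytab:
kinit -kt /home/dazer/mykey.keytab myprincipal@MY.REALM   # principal is a placeholder

# klist prints the location of the resulting cache, e.g. /tmp/krb5cc_1000
klist
```

and then pass that cache file to HDFileSystem instead of the keytab path, e.g. `ticket_cache='/tmp/krb5cc_1000'`.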
I suppose PeriodIndex is to DatetimeIndex as RangeIndex is to integers? I don't know if there is a spec for how to write this into the pandas metadata tag of the parquet output (check at pyarrow? or even pandas itself?) but it certainly seems like something we ought to be able to handle.
Hi @martindurant, I am sorry, I have no knowledge of this. I tried two things with pandas/pyarrow, and I am surprised by the results.
Case 1: keeping PeriodIndex within a column: write: ok // read: ok
import os
import pandas as pd
path = os.path.expanduser('~/Documents/code/draft/data/')
file = path + 'weather_data'
datetime_index = pd.date_range(start=pd.Timestamp('2020/01/02 01:00:00'),
                               end=pd.Timestamp('2020/01/02 12:00:00'), freq='2H')
period_index = pd.period_range(start=pd.Timestamp('2020/01/02 01:00:00'),
                               end=pd.Timestamp('2020/01/02 12:00:00'), freq='2H')
df1 = pd.DataFrame({'humidity': [0.3, 0.8, 0.9, 0.3, 0.8, 0.9],
                    'pressure': [1e5, 1.1e5, 0.95e5, 1e5, 1.1e5, 0.95e5],
                    'location': ['Paris', 'Paris', 'Milan', 'Paris', 'Paris', 'Milan'],
                    'period': period_index},
                   index=datetime_index)
df1.to_parquet(file, engine='pyarrow')
df = pd.read_parquet(file)
df['period']
2020-01-02 01:00:00 2020-01-02 01:00
2020-01-02 03:00:00 2020-01-02 03:00
2020-01-02 05:00:00 2020-01-02 05:00
2020-01-02 07:00:00 2020-01-02 07:00
2020-01-02 09:00:00 2020-01-02 09:00
2020-01-02 11:00:00 2020-01-02 11:00
Name: period, dtype: period[2H]
Case 2: having PeriodIndex as an index: write: ok // read: not ok
import os
import pandas as pd
path = os.path.expanduser('~/Documents/code/draft/data/')
file = path + 'weather_data'
period_index = pd.period_range(start=pd.Timestamp('2020/01/02 01:00:00'),
                               end=pd.Timestamp('2020/01/02 12:00:00'), freq='2H')
df1 = pd.DataFrame({'humidity': [0.3, 0.8, 0.9, 0.3, 0.8, 0.9],
                    'pressure': [1e5, 1.1e5, 0.95e5, 1e5, 1.1e5, 0.95e5],
                    'location': ['Paris', 'Paris', 'Milan', 'Paris', 'Paris', 'Milan']},
                   index=period_index)
df1.to_parquet(file, engine='pyarrow')
df = pd.read_parquet(file)
Traceback (most recent call last):
File "<ipython-input-25-7de11581bfac>", line 12, in <module>
df = pd.read_parquet(file)
File "/home/pierre/anaconda3/lib/python3.8/site-packages/pandas/io/parquet.py", line 317, in read_parquet
return impl.read(path, columns=columns, **kwargs)
File "/home/pierre/anaconda3/lib/python3.8/site-packages/pandas/io/parquet.py", line 141, in read
result = self.api.parquet.read_table(
File "pyarrow/array.pxi", line 742, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1583, in pyarrow.lib.Table._to_pandas
File "/home/pierre/anaconda3/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 788, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "/home/pierre/anaconda3/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1127, in _table_to_blocks
result = pa.lib.table_to_blocks(options, block_table, categories,
File "pyarrow/table.pxi", line 1031, in pyarrow.lib.table_to_blocks
File "stringsource", line 111, in set.from_py.__pyx_convert_unordered_set_from_py_std_3a__3a_string
File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string
TypeError: expected bytes, NoneType found
So I'm not sure pyarrow fully supports PeriodIndex. But how come it works when it is a column?
So it seems fastparquet simply doesn't handle the period type - but neither does parquet itself (it is not the same as INTERVAL). So the right thing to do is convert to datetimes, but store the dtype (which includes the size of the periods, and periods might overlap). This will fail for business periods like '4Q2005'!
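The convert-and-restore idea can be sketched with plain pandas; the parquet write/read and the metadata plumbing are omitted, and `freq` here stands in for what would go into the pandas metadata tag. Monthly periods round-trip cleanly; anchored business periods would not:

```python
import pandas as pd

# Sketch of the proposed workaround: store the index as plain datetimes plus
# the period frequency, then rebuild the PeriodIndex on load.
pidx = pd.period_range('2020-01', periods=6, freq='M')
df = pd.DataFrame({'humidity': [0.3, 0.8, 0.9, 0.3, 0.8, 0.9]}, index=pidx)

freq = df.index.freqstr                  # would be stored in the pandas metadata tag
df_out = df.copy()
df_out.index = df.index.to_timestamp()   # DatetimeIndex: parquet-friendly

# ... df_out would be written to and read back from parquet here ...

df_back = df_out.copy()
df_back.index = df_out.index.to_period(freq)
assert df_back.index.equals(df.index)
```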
I suspect that pyarrow is trying to write range information for the index case, but we shouldn't worry about that. Note that, with the exception of RangeIndex, any index is stored as a normal column, except that the pandas metadata tag contains the information of which column should be made back into the index on load. Any of the columns may be sorted, this is a separate issue.
I should have been more specific: apparently this could be considered a pandas bug
pd.api.types.infer_dtype(df1.index) # Exception you were getting
pd.api.types.infer_dtype(df1.index.values) # Works as expected
although I suspect that fastparquet would still need to special-case the resultant "period" type.