
Kerberos // Failed to extract principal from ticket cache: Bad format in credentials cache

Open mehd-io opened this issue 8 years ago • 5 comments

Hi there, we have a Kerberized cluster with SSL, and I basically used the same keytab that I'm using for other services to try out hdfs3, like this:

from hdfs3 import HDFileSystem

conf = {'hadoop.security.authentication': 'kerberos'}
ticket_path = '/home/dazer/mykey.keytab'
hdfs = HDFileSystem(host='hdfs://myhost', port=8020, pars=conf, ticket_cache=ticket_path)

and got this error:

ConnectionError: Connection Failed: HdfsIOException: FileSystem: Failed to extract principal from ticket cache: Bad format in credentials cache (filename: /home/dazer/mykey.keytab)

Any clue? Thanks!
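For what it's worth, the error may simply be that a keytab is not a ticket cache: `ticket_cache` expects a credentials cache created by `kinit`, not the keytab file itself. A hedged sketch, where the principal and the cache path are placeholders for your environment:

```python
# First obtain a ticket cache from the keytab, outside Python, e.g.:
#   kinit -kt /home/dazer/mykey.keytab <your-principal>
# `klist` then shows where the cache was written (often /tmp/krb5cc_<uid>).
from hdfs3 import HDFileSystem

conf = {'hadoop.security.authentication': 'kerberos'}
hdfs = HDFileSystem(host='myhost',  # plain hostname, without the hdfs:// scheme
                    port=8020, pars=conf,
                    ticket_cache='/tmp/krb5cc_1000')  # placeholder cache path
```

This is a connection sketch, not something testable outside a Kerberized cluster.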

mehd-io avatar Apr 20 '18 14:04 mehd-io

I suppose PeriodIndex is to DatetimeIndex as RangeIndex is to integers? I don't know if there is a spec for how to write this into the pandas metadata tag of the parquet output (check at pyarrow? or even pandas itself?) but it certainly seems like something we ought to be able to handle.

martindurant avatar Jan 02 '21 18:01 martindurant

> I suppose PeriodIndex is to DatetimeIndex as RangeIndex is to integers? I don't know if there is a spec for how to write this into the pandas metadata tag of the parquet output (check at pyarrow? or even pandas itself?) but it certainly seems like something we ought to be able to handle.

Hi @martindurant, I am sorry, I have no knowledge about this. I tried two things with pandas/pyarrow, and I am surprised by the results.

Case 1: keeping PeriodIndex within a column: write: ok // read: ok

import os
import pandas as pd

path = os.path.expanduser('~/Documents/code/draft/data/')
file = path + 'weather_data'
datetime_index = pd.date_range(start=pd.Timestamp('2020/01/02 01:00:00'),
                               end=pd.Timestamp('2020/01/02 12:00:00'), freq='2H')
period_index = pd.period_range(start=pd.Timestamp('2020/01/02 01:00:00'),
                               end=pd.Timestamp('2020/01/02 12:00:00'), freq='2H')

df1 = pd.DataFrame({'humidity': [0.3, 0.8, 0.9, 0.3, 0.8, 0.9],
                    'pressure': [1e5, 1.1e5, 0.95e5, 1e5, 1.1e5, 0.95e5],
                    'location': ['Paris', 'Paris', 'Milan', 'Paris', 'Paris', 'Milan'],
                    'period': period_index},
                   index=datetime_index)
df1.to_parquet(file, engine='pyarrow')
df = pd.read_parquet(file)
df['period']

2020-01-02 01:00:00    2020-01-02 01:00
2020-01-02 03:00:00    2020-01-02 03:00
2020-01-02 05:00:00    2020-01-02 05:00
2020-01-02 07:00:00    2020-01-02 07:00
2020-01-02 09:00:00    2020-01-02 09:00
2020-01-02 11:00:00    2020-01-02 11:00
Name: period, dtype: period[2H]

Case 2: having PeriodIndex as the index: write: ok // read: not ok

import os
import pandas as pd

path = os.path.expanduser('~/Documents/code/draft/data/')
file = path + 'weather_data'
period_index = pd.period_range(start=pd.Timestamp('2020/01/02 01:00:00'),
                               end=pd.Timestamp('2020/01/02 12:00:00'), freq='2H')
df1 = pd.DataFrame({'humidity': [0.3, 0.8, 0.9, 0.3, 0.8, 0.9],
                    'pressure': [1e5, 1.1e5, 0.95e5, 1e5, 1.1e5, 0.95e5],
                    'location': ['Paris', 'Paris', 'Milan', 'Paris', 'Paris', 'Milan']},
                   index=period_index)
df1.to_parquet(file, engine='pyarrow')
df = pd.read_parquet(file)
Traceback (most recent call last):

  File "<ipython-input-25-7de11581bfac>", line 12, in <module>
    df = pd.read_parquet(file)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/pandas/io/parquet.py", line 317, in read_parquet
    return impl.read(path, columns=columns, **kwargs)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/pandas/io/parquet.py", line 141, in read
    result = self.api.parquet.read_table(

  File "pyarrow/array.pxi", line 742, in pyarrow.lib._PandasConvertible.to_pandas

  File "pyarrow/table.pxi", line 1583, in pyarrow.lib.Table._to_pandas

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 788, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)

  File "/home/pierre/anaconda3/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1127, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,

  File "pyarrow/table.pxi", line 1031, in pyarrow.lib.table_to_blocks

  File "stringsource", line 111, in set.from_py.__pyx_convert_unordered_set_from_py_std_3a__3a_string

  File "stringsource", line 15, in string.from_py.__pyx_convert_string_from_py_std__in_string

TypeError: expected bytes, NoneType found

So I am not sure pyarrow fully supports PeriodIndex. But how come it works when it is a column?

yohplala avatar Jan 03 '21 18:01 yohplala

So it seems fastparquet simply doesn't handle the period type, but neither does Parquet (period is not the same as INTERVAL). So the right thing to do is convert to datetimes, but store the dtype (which includes the size of the periods, which might overlap). This will fail for business periods like '4Q2005'!

I suspect that pyarrow is trying to write range information for the index case, but we shouldn't worry about that. Note that, with the exception of RangeIndex, any index is stored as a normal column, except that the pandas metadata tag records which column should be made back into the index on load. Whether any of the columns is sorted is a separate issue.

martindurant avatar Jan 04 '21 14:01 martindurant

I should have been more specific: apparently this could be considered a pandas bug

pd.api.types.infer_dtype(df1.index) # Exception you were getting
pd.api.types.infer_dtype(df1.index.values) # Works as expected

although I suspect that fastparquet would still need to special-case the resultant "period" type.
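If fastparquet did special-case the period type, the encoding could presumably take the same shape: timestamps into the data column, the freq string into the metadata tag. A minimal sketch, with hypothetical helper names that are not part of fastparquet:

```python
import pandas as pd

def encode_period_index(idx):
    """Hypothetical helper: split a PeriodIndex into storable parts."""
    # the datetimes would become an ordinary parquet column;
    # the freq string would go into the pandas metadata tag
    return idx.to_timestamp(), idx.freqstr

def decode_period_index(timestamps, freq):
    """Hypothetical helper: rebuild the PeriodIndex on load."""
    return pd.DatetimeIndex(timestamps).to_period(freq)

idx = pd.period_range('2020-01-02 01:00', periods=6, freq='2h')
ts, freq = encode_period_index(idx)
assert decode_period_index(ts, freq).equals(idx)
```

As noted above, `to_timestamp` only keeps period start points, so this round trip relies on the freq being regular.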

martindurant avatar Jan 04 '21 15:01 martindurant