
pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object

Open ivanpugachtd opened this issue 4 years ago • 5 comments

Hi! There is a problem when loading a column of list (array) or dict (JSON) type into a table with pandas-gbq (which uses pyarrow under the hood), even though the BigQuery documentation says that structured types such as ARRAY and JSON are supported.

df = pd.DataFrame(
    {
        "my_string": ["a", "b", "c"],
        "my_int64": [1, 2, 3],
        "my_float64": [4.0, 5.0, 6.0],
        "my_bool1": [True, False, True],
        "my_bool2": [False, True, False],
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)
pandas_gbq.to_gbq(df, **gbq_params)

As a result, the load fails with this traceback:

  in bq_to_arrow_array
    return pyarrow.Array.from_pandas(series, type=arrow_type)
  File "pyarrow/array.pxi", line 913, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 311, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
  pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object

Can anyone help with it please?

ivanpugachtd avatar Dec 28 '21 13:12 ivanpugachtd

Is this writing to an existing table? Could you share the schema of the destination table?

tswast avatar Jan 04 '22 16:01 tswast

Hi @tswast, I am uploading data to a new table. To be precise, I tried with: pyarrow==6.0.1, pandas-gbq==0.16.0

In fact, I was only able to upload the data after calling json.dumps() on the column that contains list or dict values.
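That json.dumps() workaround can be sketched like this (reusing the my_struct column from the example above; the column then lands in BigQuery as STRING rather than a struct):

```python
import json

import pandas as pd

df = pd.DataFrame(
    {
        "my_int64": [1, 2, 3],
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)

# Serialize each dict to a JSON string so pyarrow receives plain str values
# instead of dict objects:
df["my_struct"] = df["my_struct"].apply(json.dumps)

print(df["my_struct"].iloc[0])  # {"test": "str1"}
# pandas_gbq.to_gbq(df, **gbq_params) now succeeds, with my_struct as STRING.
```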

ivanpugachtd avatar Jan 12 '22 12:01 ivanpugachtd

Any updates on this? I'm getting the same error. The strange thing is that the code works fine locally and on Compute Engine, but fails in Cloud Run (even though the same service account is used for both).

grzesir avatar Jan 18 '22 03:01 grzesir

I am trying to upload data to the new table, more precisely I tried both

Ah, that probably explains it. Currently, pandas-gbq attempts to determine a schema locally based on the dtypes it detects. It likely gets this wrong for the struct/array data.

I believe we can avoid this problem with https://github.com/googleapis/python-bigquery-pandas/issues/339 where instead of pandas-gbq creating the table, we create the table as part of the load job.

tswast avatar Jan 19 '22 18:01 tswast

Has there been any progress on updating this issue? I am seeing the same error message.

Could we elaborate on:

I believe we can avoid this problem with https://github.com/googleapis/python-bigquery-pandas/issues/339 where instead of pandas-gbq creating the table, we create the table as part of the load job.

As I am seeing the same issue even with an already-created table, using if_exists='replace':

pandas_gbq.to_gbq(dataframe, table_id, project_id=project_id, if_exists='replace')

The work-around that helped me to successfully load my table was casting the dataframe column to string data type.

As an example GCP Cloud Function:

import pandas as pd
import pandas_gbq

def gbq_write(request):

  # TODO: Set project_id to your Google Cloud Platform project ID.
  project_id = "project-id"

  # TODO: Set table_id to the full destination table ID (including the dataset ID).
  table_id = 'dataset.table'

  df = pd.DataFrame(
      {
          "my_string": ["a", "b", "c"],
          "my_int64": [1, 2, 3],
          "my_float64": [4.0, 5.0, 6.0],
          "my_bool1": [True, False, True],
          "my_dates": pd.date_range("now", periods=3),
          "my_struct": [{"test":"str1"},{"test":"str2"},{"test":"str3"}],
      }
  )

  pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists='replace')

  return 'Successfully Written'

This produces the error mentioned in this thread:

pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object

With requirements.txt as

pandas==1.4.1
pandas-gbq==0.17.4

Adding the column cast is a single line; with it the code becomes:

import pandas as pd
import pandas_gbq

def gbq_write(request):

  # TODO: Set project_id to your Google Cloud Platform project ID.
  project_id = "project-id"

  # TODO: Set table_id to the full destination table ID (including the dataset ID).
  table_id = 'dataset.table'

  df = pd.DataFrame(
      {
          "my_string": ["a", "b", "c"],
          "my_int64": [1, 2, 3],
          "my_float64": [4.0, 5.0, 6.0],
          "my_bool1": [True, False, True],
          "my_dates": pd.date_range("now", periods=3),
          "my_struct": [{"test":"str1"},{"test":"str2"},{"test":"str3"}],
      }
  )

  # Column conversion added to load table
  df['my_struct'] = df['my_struct'].astype("string")

  pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists='replace')

  return 'Successfully Written'

This successfully loads the table into BigQuery with schema:

Field name    Type
my_string     STRING
my_int64      INTEGER
my_float64    FLOAT
my_bool1      BOOLEAN
my_dates      TIMESTAMP
my_struct     STRING

If you need my_struct to be an actual STRUCT, consider:

SELECT
  *,
  # retrieve value from struct
  JSON_VALUE(my_struct, '$.test') AS test,
  # recreate struct using the value for each row
  STRUCT(JSON_VALUE(my_struct, '$.test') AS test) AS my_created_struct
FROM `project-id.dataset.table`
ORDER BY my_int64

Row  my_string  my_int64  my_float64  my_bool1  my_dates                        my_struct         test  my_created_struct.test
1    a          1         4.0         true      2022-03-24 04:14:28.267319 UTC  {'test': 'str1'}  str1  str1
2    b          2         5.0         false     2022-03-25 04:14:28.267319 UTC  {'test': 'str2'}  str2  str2
3    c          3         6.0         true      2022-03-26 04:14:28.267319 UTC  {'test': 'str3'}  str3  str3
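If the column was serialized with json.dumps() before loading (as in the earlier workaround), the dicts can also be recovered client-side after querying instead of in SQL. A small sketch, using a hypothetical query result:

```python
import json

import pandas as pd

# Hypothetical query result where my_struct was loaded as a JSON string:
df = pd.DataFrame({"my_struct": ['{"test": "str1"}', '{"test": "str2"}']})

# Parse the JSON strings back into Python dicts:
df["my_struct"] = df["my_struct"].apply(json.loads)

print(df["my_struct"].iloc[0]["test"])  # str1
```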