extensions icon indicating copy to clipboard operation
extensions copied to clipboard

[firestore-bigquery-change-tracker] Can't partition table on schema's `timestamp` column

Open bergold opened this issue 3 years ago • 5 comments

[READ] Step 1: Are you in the right place?

Yes

[REQUIRED] Step 2: Describe your configuration

  • Extension name: firestore-bigquery-export
  • Extension version: 0.1.24
  • Configuration values (redacted all other values unrelated to partitioning):
    • TABLE_PARTITIONING=MONTH
    • TIME_PARTITIONING_FIELD=timestamp
    • TIME_PARTITIONING_FIELD_TYPE=TIMESTAMP
    • TIME_PARTITIONING_FIRESTORE_FIELD is left empty on purpose

Note I want to use the timestamp column, which is already part of the BigQuery schema (and which contains the date and time of the function event). I don't want to extract some field from the Firestore document itself.

[REQUIRED] Step 3: Describe the problem

Steps to reproduce:

  1. Deploy the extension with the above partitioning configuration.
Expected result

I expect the extension to create a partitioned table.

Actual result

The extensions creates a table without partitioning applied.

Here is the extension logs of the first invocation (which created the table):

Screenshot 2022-06-24 at 20 30 51

(One thing to note: We had the extension already installed without partitioning enabled. I updated the extension and its configuration and deleted only the table (not the view) from BigQuery)

Hypothesis

I think the bug lays in the fact, that the firestore-bigquery-change-tracker either expects all or none of the custom partitioning options to be set. Even though I would expect it to also work with one of the other columns, that are already part of the schema (in my use case the timestamp column) and don't need to be extracted from the document data.

https://github.com/firebase/extensions/blob/f053443834d5ef4521100966d16367799e8340b8/firestore-bigquery-export/firestore-bigquery-change-tracker/src/bigquery/partitioning.ts#L93-L97

bergold avatar Jun 24 '22 18:06 bergold

HI @bergold. There is also a scenario where we cannot partition an already existing table.

We are currently tracking this related issue https://github.com/firebase/extensions/issues/796 included are instructions on how to manually get around the issue.

dackers86 avatar Jun 29 '22 08:06 dackers86

Hey! Thanks for your response!

I just revalidated the issue I'm having with a completely fresh Firebase project.

The steps I took are:

  1. Create a new Firebase project
  2. Create a Firestore database therein
  3. In a fresh directory on my machine:
    1. Run firebase init
    2. Setup firestore with default options
    3. Run firebase ext:install firebase/firestore-bigquery-export
    4. Run firebase deploy and wait for it to finish
  4. Trigger a write to the Firestore posts collection
  5. See the newly created BigQuery dataset and table

Here is a link to the terminal output of step 3.

I will also update the issue description above to make this more clear. (and upload a link to a reproduction repo in a minute)

In my understanding, the description of this config field of the BigQuery extension suggests, that my intended use should be supported:

BigQuery Time Partitioning column name (Optional): BigQuery table column/schema field name for TimePartitioning. You can choose schema available as timestamp OR new custom defined column that will be assigned to the selected Firestore Document field below. Defaults to pseudo column _PARTITIONTIME if unspecified. Cannot be changed if Table is already partitioned.

So my question is: How do I choose the timestamp schema field for partitioning and not the default _PARTITIONTIME pseudo column?

bergold avatar Jun 29 '22 10:06 bergold

Thanks for the detailed update @bergold

To sync a partitioned column, you will also need the relevant Firestore field to be defined.

Firestore Document field name for BigQuery SQL Time Partitioning field option (Optional): This parameter will allow you to partition the BigQuery table  created by the extension based on selected. The Firestore Document field value must be a top-level TIMESTAMP, DATETIME, DATE field BigQuery string format or Firestore timestamp(will be converted to BigQuery TIMESTAMP). Cannot be changed if Table is already partitioned.
 example: postDate
? Enter a value for Firestore Document field name for BigQuery SQL Time Partitioning field option: 

This looks to be unspecified in your configuration example?

dackers86 avatar Jun 29 '22 12:06 dackers86

However, this is not what I want to do. I want to use the timestamp column, which is already part of the BigQuery schema (and which contains the date and time of the function event).

I now, that it's nearly the same as using ingestion time partitioning with the pseudo column. However, because there is already the timestamp column and we are already using it for many of our queries, it would be nice if we could use that one. And, I thought that this is supported based on the description of the config field, that I quoted in my last comment.

Thank you for your help and patience.

bergold avatar Jun 29 '22 13:06 bergold

@bergold I see what you mean now!

We would need to consider if we require a matching firestore column to partition a BQ field.

At the very least, this would be an edge case to consider as timestamp is a part of the default schema.

Developer note: Do we need a firestore equivalent field, if we use the timestamp data type in BigQuery?

dackers86 avatar Jun 29 '22 14:06 dackers86

Hi @bergold.

Apologies for the delay on this feature. Having now added it our backlog for development, there doesn't seem to be an issue in using the timestamp field.

Did you experience any issues setting when trying to configure the timestamp field?

dackers86 avatar Feb 08 '23 16:02 dackers86

I can confirm, that with the following configuration the table is still not partitioned:

BIGQUERY_PROJECT_ID=my-project-123
COLLECTION_PATH=users/{userid}/{collection}
DATASET_ID=firestore_export
DATASET_LOCATION=europe-west3
LOCATION=europe-west3
TABLE_ID=users
TABLE_PARTITIONING=MONTH
TIME_PARTITIONING_FIELD=timestamp
TIME_PARTITIONING_FIELD_TYPE=omit
USE_NEW_SNAPSHOT_QUERY_SYNTAX=yes
WILDCARD_IDS=true

bergold avatar Feb 09 '23 09:02 bergold

The TIME_PARTITIONING_FIELD_TYPE is a required field for adding partitioning. Have you tried setting this value to be TIMESTAMP to match the timestamp field value.

image

dackers86 avatar Feb 10 '23 08:02 dackers86

Hi @dackers86,

Thank you for looking into this issue!

I tried that just now (uninstalled the extension, deleted the BigQuery dataset, reinstalled the extension), but it is still not partitioned.

I think, TIME_PARTITIONING_FIRESTORE_FIELD is also a required field for adding partitioning and this is the discrepancy of how it's currently working and what I'm trying to accomplish.

bergold avatar Feb 11 '23 13:02 bergold

The Firebase console can be confusing with the optional fields. As we cannot have conditional parameters, we list them as optional.

However, for partitioning, the following fields are all required:

image

The above configuration, should create a partitioned table based on the timestamp field, perhaps we need better documentation for this also?

dackers86 avatar Feb 13 '23 10:02 dackers86

Using hour / timestamp / timestamp / TIMESTAMP in these four fields give me this error:

"Error caught creating table Field timestamp already exists in schema"

iurisilvio avatar May 08 '24 21:05 iurisilvio

For future reference, hour / not set / not set / omit works for me. The partitions are created and it uses the timestamp field. I have a _PARTITIONTIME virtual column too.

iurisilvio avatar May 09 '24 07:05 iurisilvio