Add field comparable to firstHtml to the har.request tables

Open rviscomi opened this issue 8 years ago • 2 comments

The runs.request tables include a firstHtml field to indicate that the request is for the parent document.

Queries on the har.request tables must join on the corresponding runs table to get this info. There are tens of millions of requests in each table, so the join is expensive.

To simplify queries and make them less expensive, add a boolean field comparable to firstHtml to the har.request tables. It should use the same logic as the runs table: the first 200 response with an HTML MIME type.

rviscomi avatar Aug 28 '17 21:08 rviscomi

Would this be a step in the Dataflow pipeline?

igrigorik avatar Aug 28 '17 21:08 igrigorik

Yes, it should be annotated during the iteration over the requests in the HAR file: https://github.com/HTTPArchive/bigquery/blob/0489d8e96a7b733e475af5eebfd937f92a20c2f1/dataflow/java/src/main/java/com/httparchive/dataflow/BigQueryImport.java#L252

This field would also be valuable on the har.bodies tables.
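
For illustration, a minimal sketch of what that annotation could look like while iterating over a page's requests. This is not the code in BigQueryImport.java: the class `FirstHtmlSketch`, the `Request` stand-in, and the helper names are hypothetical, and treating any MIME type that contains "html" as HTML is an assumption about the runs-table logic, not something confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FirstHtmlSketch {
    /** Hypothetical stand-in for one HAR entry (response.status, response.content.mimeType). */
    static final class Request {
        final String url;
        final int status;
        final String mimeType;
        boolean firstHtml; // the proposed new field

        Request(String url, int status, String mimeType) {
            this.url = url;
            this.status = status;
            this.mimeType = mimeType;
        }
    }

    /** Assumption: a MIME type counts as HTML if it contains "html" (e.g. text/html). */
    static boolean isHtml(String mimeType) {
        return mimeType != null && mimeType.toLowerCase(Locale.ROOT).contains("html");
    }

    /** Marks only the first 200 response with an HTML MIME type; all others get false. */
    static void annotateFirstHtml(List<Request> requests) {
        boolean found = false;
        for (Request r : requests) {
            r.firstHtml = !found && r.status == 200 && isHtml(r.mimeType);
            if (r.firstHtml) {
                found = true;
            }
        }
    }

    public static void main(String[] args) {
        List<Request> requests = new ArrayList<>();
        requests.add(new Request("http://example.com/", 301, "text/html"));
        requests.add(new Request("https://example.com/", 200, "text/html; charset=utf-8"));
        requests.add(new Request("https://example.com/frame.html", 200, "text/html"));
        annotateFirstHtml(requests);
        for (Request r : requests) {
            System.out.println(r.url + " firstHtml=" + r.firstHtml);
        }
    }
}
```

Because the flag is computed once per page at import time, the same value could be written to both the har.request and har.bodies rows without a join at query time.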

rviscomi avatar Aug 28 '17 21:08 rviscomi