
Automate export of data as a BigQuery table

Open brilee opened this issue 7 years ago • 11 comments

This will probably start as a private BQ dataset; we'll have to consult on how to offer the data publicly and how to let the general public query a ~1TB dataset for free-ish.

General schema of dataset:

  • A table with one row per game (so something like 10 million rows in total by the end of a run). That row will describe things like worker_id[string], completed_time[time], model_num[int], result[float], was_resign[bool], sgf_url[string], move_seq[array]
  • A table with one row per move in a game (so something like 1 billion moves in total by the end of a run). That row will describe things like worker_id[string], completed_time[time], model_num[int], result[float], was_resign[bool], move_num[int], turn_to_play[int], played_move[int], policy_prior[array], mcts_visit_counts_normalized[array], q_prior[float], average_q[float], resign_threshold[float].
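
Roughly, a sketch of how those two tables could be declared with the Python BigQuery client (project/dataset names are placeholders, and the exact field types are my guesses):

    # Sketch only: rough table definitions matching the field lists above.
    # Project/dataset names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client()

    games_schema = [
        bigquery.SchemaField("worker_id", "STRING"),
        bigquery.SchemaField("completed_time", "TIMESTAMP"),
        bigquery.SchemaField("model_num", "INTEGER"),
        bigquery.SchemaField("result", "FLOAT"),
        bigquery.SchemaField("was_resign", "BOOLEAN"),
        bigquery.SchemaField("sgf_url", "STRING"),
        bigquery.SchemaField("move_seq", "STRING", mode="REPEATED"),
    ]
    # The per-move table repeats the first five per-game fields, then adds move info.
    moves_schema = games_schema[:5] + [
        bigquery.SchemaField("move_num", "INTEGER"),
        bigquery.SchemaField("turn_to_play", "INTEGER"),
        bigquery.SchemaField("played_move", "INTEGER"),
        bigquery.SchemaField("policy_prior", "FLOAT", mode="REPEATED"),
        bigquery.SchemaField("mcts_visit_counts_normalized", "FLOAT", mode="REPEATED"),
        bigquery.SchemaField("q_prior", "FLOAT"),
        bigquery.SchemaField("average_q", "FLOAT"),
        bigquery.SchemaField("resign_threshold", "FLOAT"),
    ]

    client.create_table(bigquery.Table("my-project.minigo.games", schema=games_schema))
    client.create_table(bigquery.Table("my-project.minigo.moves", schema=moves_schema))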

This allows queries like "select model_num, first 5 moves of game, count(*) group by model_num, moves", which would show the most popular openings by model number. Or something like "select mcts_visit_counts_normalized[array_ordinal(move_num)] / policy_prior[array_ordinal(move_num)]", which would let you rank moves by how unexpected they were.
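
For the first of those, a rough Standard SQL sketch wrapped in the Python client (table name is a placeholder; assumes move_seq is a repeated string column as in the schema sketch above):

    # Sketch: most common 5-move openings per model, assuming the games table above.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT
          model_num,
          (SELECT STRING_AGG(m, ' ' ORDER BY o)
           FROM UNNEST(move_seq) AS m WITH OFFSET AS o
           WHERE o < 5) AS opening,
          COUNT(*) AS games
        FROM `my-project.minigo.games`
        GROUP BY model_num, opening
        ORDER BY games DESC
        LIMIT 20
    """
    for row in client.query(query):
        print(row.model_num, row.opening, row.games)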

(And by bigquery magic, these queries would probably complete in a few minutes, instead of the hour-long analyses we've been running so far...)

brilee commented Mar 13 '18 12:03

How to upload data: we can retroactively parse all the full-debug log SGFs we have, and also update the selfplay workers to stream data into the table as they finish games. We might consider separate tables for evaluation zoo games vs. selfplay games; for 20x less data we could also just do the holdout games.
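
As a rough sketch of the streaming path (table name and field values are placeholders, and this assumes the current google-cloud-bigquery client):

    # Sketch: stream one per-game row at the end of a selfplay game.
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.minigo.games"  # placeholder

    row = {
        "worker_id": "selfplay-worker-0001",
        "completed_time": "2018-03-13T12:00:00Z",
        "model_num": 85,
        "result": 1.5,
        "was_resign": False,
        "sgf_url": "gs://<bucket>/sgf/full/example.sgf",  # placeholder path
        "move_seq": ["dd", "pp", "dq"],
    }

    errors = client.insert_rows_json(table_id, [row])
    if errors:
        print("streaming insert failed:", errors)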

brilee commented Mar 13 '18 12:03

The dataset of evaluation games should be much smaller and much easier to make available in its entirety.

amj commented Mar 13 '18 23:03

It seems like we're duplicating per-game info (worker_id, completed_time, model_num, result, was_resign, resign_threshold) in the per-move table. Could we instead generate a unique key for each game and store that in the move table?

tommadams commented Mar 14 '18 02:03

Agreed, we're duplicating some of the info. I think (worker_id, completed_time) should form a primary key, and model_num is too useful to hide behind a join; we may as well denormalize it. Result/resign/resign_threshold could be normalized, but I suspect those are also useful enough that we wouldn't want to join for them every time.
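
For reference, a sketch of the join that denormalizing those columns would let us skip (table names are placeholders; assumes the moves table keeps the same worker_id/completed_time pair):

    # Sketch: join per-move rows back to their game on the (worker_id, completed_time)
    # composite key, which is what fully normalizing result/resign info would require.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT m.model_num, m.move_num, m.average_q, g.result, g.resign_threshold
        FROM `my-project.minigo.moves` AS m
        JOIN `my-project.minigo.games` AS g
          ON m.worker_id = g.worker_id AND m.completed_time = g.completed_time
        WHERE g.was_resign
    """
    rows = client.query(query).result()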

brilee commented Mar 14 '18 02:03

@tommadams what do you think this should look like, given that you would have to reimplement some form of this in the C++ version?

brilee commented Mar 14 '18 03:03

Well at that point there's probably not much to gain by normalizing the remaining data because it sounds like the primary game key would be almost as large as the data we're normalizing. So storing everything per-move does make sense.

Your schema sounds good to me. I wonder what the most reasonable import format is. I'm somewhat surprised/disappointed that BigQuery doesn't support proto as far as I can tell.

tommadams commented Mar 14 '18 04:03

Right, BQ doesn't support protos because they are not self-describing. Avro is a proto-like binary format which is self-describing, and BQ does support Avro. Other import options I'm familiar with are JSON and CSV. (CSV will be utterly incapable of handling the arrays we'd be generating)
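
A rough sketch of the newline-delimited JSON route with the Python client (file and table names are placeholders); unlike CSV, NDJSON and Avro can carry the array columns directly:

    # Sketch: batch-load newline-delimited JSON (one game per line) into BigQuery.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # or supply the explicit schema instead
    )
    with open("games.jsonl", "rb") as f:
        job = client.load_table_from_file(
            f, "my-project.minigo.games", job_config=job_config)
    job.result()  # wait for the load job to finish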

brilee commented Mar 14 '18 04:03

find sgf/ -path '*/full/*sgf' | xargs grep -o '];[BW]\[[a-s]\{2\}\]C\[-\?[0-9.]\{6\}*$' | tqdm | awk -F ":" '{ if ($1 != cf) { print cf": "tok; cf = $1; tok = ""; } } { sub(/^.*\[/, "", $2); tok = tok" "$2; } END { print cf": "tok }' | tee game_qs | wc

find sgf/ -path '*/full/*sgf' find all sgf files

xargs grep -o '];[BW]\[[a-s]\{2\}\]C\[-\?[0-9.]\{6\}*$' find all comments after a move (note: this doesn't find the first Q on line 4)

tqdm To go fast one must measure

awk -F ":" '{ if ($1 != cf) { print cf": "tok; cf = $1; tok = ""; } } { sub(/^.*\[/, "", $2); tok = tok" "$2; } END { print cf": "tok }' Split each line on the colon between filename and match (":];W[cd]C[-0.6372"); when a line comes from a new file, print the previous file's accumulated tokens and reset; then strip everything up to the last '[' and append the Q value to the token list

tee game_qs temp holding file until these get imported into dataframes or bigquery

wc verify assumptions about number of moves...


Compared to Python file reading and parsing, this command runs at light speed: I process roughly 200 games/second, so all of v5 in an hour. The limiting factor is HDD reads :/
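
In case it helps the import step, a sketch of pulling game_qs into a DataFrame (assumes each line looks like "<sgf path>: <q1> <q2> ...", as produced by the awk stage above):

    # Sketch: parse game_qs ("<sgf path>: <q1> <q2> ...") into one row per game.
    import pandas as pd

    records = []
    with open("game_qs") as f:
        for line in f:
            path, _, qs = line.partition(": ")
            if not path.strip():  # the awk prints one empty record before the first file
                continue
            values = [float(q) for q in qs.split()]
            records.append({"sgf_path": path, "qs": values, "num_moves": len(values)})

    df = pd.DataFrame(records)
    print(df["num_moves"].describe())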

sethtroisi commented Mar 24 '18 22:03

Changing the title to more accurately reflect the needed work

amj commented May 21 '18 17:05

I'm working on this as part of the minigo-pub VM.

sethtroisi commented May 21 '18 20:05

@sethtroisi @brilee do we still want to do anything with this?

amj commented Oct 08 '18 22:10