Fix!: exclude Semicolon expressions from model state
Fixes #4252
I reproduced the issue locally by following the steps outlined in the above issue:
>>> ctx.models['"db"."sqlmesh_example"."incremental_model"'].json()
'{"name":"sqlmesh_example.incremental_model","project":"","start":"2020-01-01","cron":"@daily","tags":[],"dialect":"duckdb","kind":{"name":"INCREMENTAL_BY_TIME_RANGE","on_destructive_change":"ERROR","dialect":"duckdb","forward_only":false,"disable_restatement":false,"time_column":{"column":"\\"event_date\\"","format":"%Y-%m-%d"},"partition_by_time_column":true},"partitioned_by":[],"clustered_by":[],"default_catalog":"db","audits":[],"grains":["(id, event_date)"],"references":[],"allow_partials":false,"signals":[],"enabled":true,"python_env":{},"jinja_macros":{"packages":{},"root_macros":{},"global_objs":{},"create_builtins_module":"sqlmesh.utils.jinja","top_level_packages":[]},"audit_definitions":{},"mapping_schema":{"\\"db\\"":{"\\"sqlmesh_example\\"":{"\\"seed_model\\"":{"id":"INT","item_id":"INT","event_date":"DATE"}}}},"extract_dependencies_from_query":true,"pre_statements":[],"post_statements":["SELECT\\n id,\\n item_id,\\n event_date,\\nFROM\\n sqlmesh_example.seed_model\\nWHERE\\n event_date BETWEEN @start_date AND @end_date\\n;"],"on_virtual_update":[],"query":"SELECT\\n id,\\n item_id,\\n event_date,\\nFROM\\n sqlmesh_example.seed_model\\nWHERE\\n event_date BETWEEN @start_date AND @end_date\\n;","source_type":"sql"}'
I observed that we hit this line for the Semicolon expression and store the query's SQL in that expression's meta dict. This explains why the model's query is duplicated under the post_statements key in the serialized output above.
My initial approach was to avoid setting meta["sql"] for Semicolon expressions, but then it occurred to me that we shouldn't be storing any Semicolon expressions in the model, because they can also affect the data/metadata hash.
For the issue's examples, I verified that running plan, removing the semicolon and comment, and then running plan again showed a diff. This happens because the Semicolon expression is in the post_statements list, so it gets generated and contributes a "SEMICOLON" string to _data_hash_values. I added a test for this.
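To illustrate why a stray Semicolon entry changes the hash, here is a minimal standalone sketch. It is not SQLMesh's actual implementation: `Statement`, `render`, `data_hash`, and `drop_semicolons` are hypothetical names made up for this example, and the only thing borrowed from the behaviour described above is the assumption that a Semicolon node contributes a "SEMICOLON" string to the hashed values.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Statement:
    """Hypothetical stand-in for a parsed SQL expression."""

    kind: str  # e.g. "SELECT" or "SEMICOLON"
    sql: str  # the statement's SQL text


def render(statement: Statement) -> str:
    # Assumption: a bare semicolon node renders as the string "SEMICOLON",
    # mirroring the value observed in _data_hash_values.
    return "SEMICOLON" if statement.kind == "SEMICOLON" else statement.sql


def data_hash(query: Statement, post_statements: List[Statement]) -> str:
    # Hash the query followed by every post-statement, in order.
    hasher = hashlib.sha256()
    for value in [render(query), *(render(s) for s in post_statements)]:
        hasher.update(value.encode("utf-8"))
        hasher.update(b"\x00")  # separator so adjacent values cannot merge
    return hasher.hexdigest()


def drop_semicolons(statements: List[Statement]) -> List[Statement]:
    # The idea behind the fix: never keep Semicolon nodes in model state.
    return [s for s in statements if s.kind != "SEMICOLON"]


query = Statement("SELECT", "SELECT id FROM sqlmesh_example.seed_model")
trailing = Statement("SEMICOLON", ";")

hash_without_semicolon = data_hash(query, [])
hash_with_semicolon = data_hash(query, [trailing])
hash_after_filtering = data_hash(query, drop_semicolons([trailing]))
```

In this sketch, `hash_with_semicolon` differs from `hash_without_semicolon`, which is the spurious diff, while `hash_after_filtering` matches `hash_without_semicolon` once Semicolon entries are excluded.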
I think we need a migration for this change. I'll take it to the finish line ~on Monday~ soon.