Support providing a filter (or custom query) to be used when analyzing a table
Currently soda analyze works on a small dataset to create the analyzer templates, and it does so by applying a LIMIT clause to the table. Running analysis on a limited dataset might not always be a good idea. Let's say I have a table containing data from 2010, and over 11 years some columns have changed format. For example, the user_id was a hash code and we changed it to a UUID. LIMIT 1000 will always give the earliest 1000 rows (not guaranteed as an RDBMS standard, but that is how it works on Redshift). Extracting patterns based on data that was generated over a decade ago might not be accurate.
It would be very helpful to provide a way to let the data owner define a query to be used for sampling. Something like the following option in warehouse.yml would do the job:
    connection:
      type: redshift
      host: HOST
      username: env_var(REDSHIFT_USERNAME)
      password: env_var(REDSHIFT_PASSWORD)
      region: XXXX
      database: db
      schema: schema1
    analyze_templates:
      table1: "select * from table1 where id > 1000 limit 10"
Some databases also have sampling capabilities in SQL, e.g. Snowflake: https://docs.snowflake.com/en/sql-reference/constructs/sample.html

    select * from testtable tablesample bernoulli (20.3);
IIRC there are other warehouses supporting the same or similar.
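For reference, a couple of other warehouses expose similar constructs; a quick sketch (availability and exact syntax should be verified per warehouse and version):

    -- PostgreSQL (9.5+): sample ~10% of data pages (SYSTEM) or ~10% of rows (BERNOULLI)
    SELECT * FROM testtable TABLESAMPLE SYSTEM (10);
    SELECT * FROM testtable TABLESAMPLE BERNOULLI (10);

    -- SQL Server: approximate sample based on data pages
    SELECT * FROM testtable TABLESAMPLE (10 PERCENT);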
If we support table-level configurations, they should go in the scan YAML file. A generic analyze filter or similar config should go in the warehouse file.
I suggest we first make an inventory of which databases we know still do a full table scan when LIMIT 1000 is specified, and for each of those (if we know) what the best strategy is to deal with it.
Pasting discussion for reference:
murat migdisoglu: I got another question related to the analyze phase: when I run soda analyze, DatasetAnalyzer.analyze (dataset_analyzer.py) runs some queries using a LIMIT 10000 statement. But when I check the query plan for the query being executed, it applies the limit on top of the result of the query. In other words, it executes the COUNT(CASE WHEN ...) queries using a full scan, calculates the result, and applies the limit on top of that. This is a very costly operation for big tables:

    XN Limit (cost=243595200.05..243595200.10 rows=1 width=31)
      -> XN Aggregate (cost=243595200.05..243595200.10 rows=1 width=31)
        -> XN Seq Scan on session_info_no_bot (cost=0.00..38975232.00 rows=3897523200 width=31)
murat migdisoglu: I wonder if the intent of the limit was to avoid that..
Vijay (Soda): We just noticed the performance degradation yesterday (on Redshift) - can you please open an issue?
murat migdisoglu: If I change the query to something like

    SELECT COUNT(CASE WHEN test_id ~* '^\-?\\d+$' THEN 1 END), ...
    FROM (select * from "myschema"."testtable" LIMIT 1000) t

murat migdisoglu: the query plan is much better:

    XN Aggregate (cost=72.55..72.60 rows=1 width=72)
      -> XN Network (cost=0.00..20.00 rows=1000 width=72)
           Distribute Round Robin
           -> XN Subquery Scan t (cost=0.00..20.00 rows=1000 width=72)
                -> XN Limit (cost=0.00..10.00 rows=1000 width=31)
                     -> XN Seq Scan on session_info_no_bot (cost=0.00..38975232.00 rows=3897523200 width=31)
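For clarity, the fix being discussed as a before/after sketch (the aggregate list is abbreviated to one column; real analyze queries compute many such CASE expressions at once):

    -- Before: LIMIT applies to the single-row aggregate result, so the full table is scanned
    SELECT COUNT(CASE WHEN test_id ~* '^\-?\\d+$' THEN 1 END)
    FROM "myschema"."testtable"
    LIMIT 1000;

    -- After: the LIMIT is pushed into a subquery, so the aggregate reads only 1000 rows
    SELECT COUNT(CASE WHEN test_id ~* '^\-?\\d+$' THEN 1 END)
    FROM (SELECT * FROM "myschema"."testtable" LIMIT 1000) t;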
murat migdisoglu: I'm happy to open an issue and a PR, but I wanted to verify whether the LIMIT was intended to avoid the full scan
Vijay (Soda): the limit was indeed added to prevent a full scan, since analyze is not going to do a full-table scan ..
Vijay (Soda): let me page @Tom Baeyens for more info
murat migdisoglu: if that is the case, Redshift does not execute the query that way, unfortunately
Vijay (Soda): yup, agreed .. we did notice that - it is good that you also tested to confirm it
murat migdisoglu: I've tested with PostgreSQL and the same problem is there too
Vijay (Soda): yes - Redshift and Postgres are pretty much the same
murat migdisoglu: https://github.com/sodadata/soda-sql/issues/134
sodadata/soda-sql#276: [bug] Analyze does not push down the LIMIT clause and results in a full scan (link preview added by GitHub)
murat migdisoglu: for now, I'll be fixing it on my fork and continue my tests
Vijay (Soda): Thank you for the PR!
Tom Baeyens: Yes! Thanks for reporting and submitting the PR. Super!
Tom Baeyens: Running analyze on the limited dataset might not be a good idea. That should indeed become a configuration option in analyze. But I thought that limiting to 1000 would be the best default until we introduce the config property.
Another problem related to that ^ is: even if we fix the performance issue, running analyze on a limited dataset might not be a good idea. Let's say I have a table containing data from 2010, and over 11 years some columns have changed format. For example, the user_id was a hash code and we changed it to a UUID. LIMIT 1000 will always give the earliest 1000 rows (not guaranteed as an RDBMS standard, but that is how it works on Redshift). Extracting patterns based on data that was generated over a decade ago might not be accurate.
murat migdisoglu: Ideally, I would prefer to be able to pass some filters (such as date > XXXX) to the analyzer
Vijay (Soda): analyze is just supposed to give a small scaffold to get started, probably not production-ready scan files …
murat migdisoglu: yep, I understand that. But creating the initial template is the hardest part from the user's point of view (this is what our survey shows us). And now that you've mentioned scaffolding: Great Expectations allows defining a custom query to be used for the analyze phase.
I was thinking about sampling as well. But for some databases, sampling is heavier than a full scan. For example, Redshift needs an ordering for sampling:

    select *
    from sales
    order by random()
    limit 10;
I would say that we can still move forward with the PR https://github.com/sodadata/soda-sql/pull/135 as it remediates an existing problem. Even if there is a database that does not trigger a full scan without the subquery, there is no harm in having a subquery with the limit.
For this feature request, I still believe we should provide a way for users to configure how the example data is selected. When I run analyze on a table that stores 10 years of data, I definitely want to control how the sampling is done.
If we support table-level configurations, they should go in the scan YAML file. A generic analyze filter or similar config should go in the warehouse file.
I suggest we first make an inventory of which databases we know still do a full table scan when LIMIT 1000 is specified, and for each of those (if we know) what the best strategy is to deal with it.
@tombaeyens is there a SCAN yml file? or do you mean warehouse.yml?
You're right, my bad. I realized a bit later that there is indeed no scan YAML file yet. That is probably why you proposed the config in warehouse.yml :)
Throwing out an idea: what if we split analyze into 2 phases:
- analyze tables: generates the scan YAML files, but does not yet inspect each column
- analyze columns: pattern matching on all text columns to find the valid_format
That way, as a user, you can specify a filter between step 1 and 2 in the scan YAML.
I still need to think of alternatives, as I haven't considered all the options here.
In the meantime, I'm OK with adopting the subquery solution, but I would only apply it to Redshift, not to all warehouses. I assume it must be a Redshift-specific problem that it still does a full scan when given a LIMIT 1000.
@mmigdiso, wdyt?
This would be super useful! Our biggest problem is tables with a large number of date partitions. Soda analyze fails on these because it tries to do a full table scan and retrieves too many partitions.
The partition column name is almost always the same. If we could supply a default filter (something like "partition_date >= '2022-01-01'") that would probably get us 90% of the way there.
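To illustrate how such a default filter could combine with the LIMIT subquery from the PR above (a sketch only; partition_date, the table name, and the UUID pattern are placeholders, and the exact query shape soda-sql generates may differ):

    -- Hypothetical analyze query: the user-supplied filter is applied inside the sampled subquery
    SELECT COUNT(CASE WHEN user_id ~* '^[0-9a-fA-F-]{36}$' THEN 1 END) AS uuid_count
    FROM (
        SELECT *
        FROM "myschema"."testtable"
        WHERE partition_date >= '2022-01-01'
        LIMIT 1000
    ) t;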