select rows based on simple criteria
Would it be possible to select rows based on very simple criteria?
I have a database with one very large table, and I would like to
select all rows from every table except the data table, where I want only 1000000 rows, ORDER BY timestamp DESC.
Thanks for the library.
I could see that. What do you think the command syntax should look like for that?
I'm currently using:
./pg_sample --limit="*=*, nodes_data=1000000"
I guess something like:
./pg_sample --limit="*=*, nodes_data=1000000;order by timestamp DESC"
or
./pg_sample --limit="*=*, nodes_data=1000000(order by timestamp DESC)"
Either should be easy to parse and is extensible in case other criteria are added.
You should be able to specify a WHERE condition after the =, e.g.:
--limit="users=(user_id < 10)"
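As a rough sketch of how that could approximate the original request (recent rows from nodes_data, everything from the other tables), the snippet below only assembles a --limit value and prints it. The cutoff date, the database name mydb, and the choice to list the specific rule before the wildcard are assumptions, not something confirmed in this thread. A WHERE rule filters by condition, so it will not give exactly 1000000 rows.

```shell
#!/bin/sh
# Sketch only: assembles a --limit value; it does not run pg_sample.
# Assumptions (not from the thread): the cutoff date, the database name
# "mydb", and that the more specific rule should be listed before "*".

build_limit() {
    cutoff=$1
    # WHERE-style rule for the big table, then keep every row elsewhere.
    printf "nodes_data=(timestamp >= '%s'), *=*\n" "$cutoff"
}

build_limit 2024-01-01
# Possible use: ./pg_sample --limit="$(build_limit 2024-01-01)" mydb
```

Printing the value first makes it easy to check the quoting before handing it to pg_sample.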
@mla I had a similar question: is it possible to select from EVERY table in DESC order? I think all (most?) tables in Rails, for example, have "created_at", so it'd be nice to sample rows with ORDER BY created_at DESC by default, since the early rows in a big database are usually a bunch of inactive rows. I'm trying --random, but it might be too slow for my purposes.
Hey @lustickd. Sorry for the delay in responding.
You can try this patch, which should just force that ORDER BY for every table.
diff --git a/pg_sample b/pg_sample
index a73af39..a1b5ec8 100755
--- a/pg_sample
+++ b/pg_sample
@@ -630,6 +630,7 @@ while (my $row = lower_keys($sth->fetchrow_hashref)) {
notice "No candidate key found for '$table'; ignoring --ordered";
}
}
+ $order = 'ORDER BY created_at DESC';
We'd have to look at how we can express that for general use. Rails doesn't automatically create an index on all created_at columns, does it? That would be my worry, if you have really large tables.
You might try this:
--- a/pg_sample
+++ b/pg_sample
@@ -624,7 +624,11 @@ while (my $row = lower_keys($sth->fetchrow_hashref)) {
} elsif ($opt{ordered}) {
my @cols = find_candidate_key($table);
if (@cols) {
- my $cols = join ', ', map { $dbh->quote_identifier($_) } @cols;
+ my $cols = join ', ',
+ map { "$_ DESC" }
+ map { $dbh->quote_identifier($_) }
+ @cols
+ ;
$order = "ORDER BY $cols";
} else {
notice "No candidate key found for '$table'; ignoring --ordered";
Then pass the --ordered option. We order by the first candidate key we find. Rails usually has its "id" column, which should roughly match created_at, I would think. The patch above just adds DESC to those columns. That seems like a reasonable default for that option anyway.
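To make the effect of that patch concrete, here is a small shell imitation of the clause it builds. It is not pg_sample code: order_by_desc is a hypothetical helper, and it always double-quotes names, which may differ from what quote_identifier actually returns.

```shell
#!/bin/sh
# Illustration only: mimics "quote each key column, append DESC, join with commas".
# order_by_desc is a hypothetical helper, not part of pg_sample.

order_by_desc() {
    clause=
    for col in "$@"; do
        # Double any embedded quote, wrap the name in quotes, append DESC.
        quoted="\"$(printf '%s' "$col" | sed 's/"/""/g')\" DESC"
        clause="${clause:+$clause, }$quoted"
    done
    # No columns means no clause, like the "no candidate key" branch.
    if [ -n "$clause" ]; then
        printf 'ORDER BY %s\n' "$clause"
    fi
}

order_by_desc id
order_by_desc tenant_id id
```

For a typical Rails table with a single "id" key, the first call prints ORDER BY "id" DESC, so the newest rows are sampled first.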
Ah that makes sense thanks. Yeah I think created_at doesn't have an index so I'll go with the id method 👍
I did mess around a little bit with tsm_system_rows for random sampling, and it's significantly faster than using ORDER BY random() on a table with 40 million rows. It runs in 300 milliseconds per table instead of 30 seconds. Apparently ORDER BY random() has to read and sort the entire table in Postgres, which makes it extremely slow.
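For anyone who wants to try the same comparison, the sketch below prints the SQL for both approaches so it can be piped into psql. It assumes the tsm_system_rows contrib extension can be installed on the server. sample_sql is a hypothetical helper, and the table name, row count, and database name are placeholders.

```shell
#!/bin/sh
# Sketch: emit SQL for sampling with tsm_system_rows.
# Assumes the tsm_system_rows contrib extension is available on the server.
# The table name is interpolated as-is, so pass only trusted identifiers.

sample_sql() {
    table=$1
    rows=$2
    cat <<EOF
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
-- Block-level sample that stops once ${rows} rows have been read.
SELECT * FROM ${table} TABLESAMPLE SYSTEM_ROWS(${rows});
-- Slower alternative for comparison (scans and sorts the whole table):
-- SELECT * FROM ${table} ORDER BY random() LIMIT ${rows};
EOF
}

sample_sql nodes_data 1000
# Possible use: sample_sql nodes_data 1000 | psql mydb
```

Note that SYSTEM_ROWS samples whole blocks, so the rows it returns can be clustered rather than uniformly random. That trade-off is where the speed comes from.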