pg_sample Is it possibe to use the --limit query based on the contents of one of the sample tables

Say I had tables school, class.

If I dumped 100 classes from the class table and it depends on schools. Can I then put a limit such as school = pg_sample.classes = school_ids (meaning I want to get only the schools that exist in the classes that were pulled by pg_sample already)?

Nov 12 '21 00:11 mehulp118

I think you should be able to achieve that with --limit='schools=0' which says not to include any rows from the schools table. If the resulting sample data has FKs to school rows though, then those will still be pulled in.

Nov 12 '21 00:11 mla

Ah, what I mean was what if there is no FK on schools for the classes table? I have a situation like this where the tables don't have that dependency unfortunately but I would still like to pull the data properly as such

Nov 12 '21 00:11 mehulp118

I see. Hmmm. The problem is going to be the order of the fetching, b/c we don't know when the schools table will be sampled.

So, idk... you could try something like --limit="schools=school_id IN (SELECT school_id FROM _pg_sample.public_classes)" but I doubt that's going to work great.

We have --keep and --data-only options. So theoretically you could run it the first time and --keep the sample tables around. Set --limit="schools=0" on that run. And then on the second run, specify --limit="schools=school_id IN (SELECT DISTINCT school_id FROM _pg_sample.public_classes)" --data-only

But I don't see a way to restrict to a single table for that second run. We allow you to only sample one schema, but not a specific table AFAIK. We could add that.

Nov 12 '21 03:11 mla