Extracting table names from more complex queries
I've been working with the extract_table_names.py example for about a day now to try and parse the following queries. I have been able to get sqlparse to return all the table names but it also returns the order by and group by IdentifierLists. Any thoughts would be appreciated.
SELECT a.time_updated_server/1000,
content,
nick,
name
FROM table1 a
JOIN table2 b ON a.sender_id = b.user_id
JOIN table3 c ON a.channel_id = c.channel_id
JOIN table4 d ON c.store_id = d.store_id
WHERE sender_id NOT IN
(SELECT user_id
FROM table5
WHERE store_id IN ('agent_store:1',
'ask:1'))
AND to_timestamp(a.time_updated_server/1000)::date >= '2014-05-01'
GROUP BY 1,2,3,4
HAVING sum(1) > 500
ORDER BY 1 ASC
CREATE TEMPORARY TABLE test_table1
AS (select * from test_table2)
SELECT count(*) from test_table1
WHERE id in (SELECT id from test_table2)
I have something that 'works' for my test cases (including the example sql). I'd love some feedback as I know this probably isn't the 'right' way. This also handles multiple statements in one string.
import itertools
import sqlparse
from sqlparse.sql import IdentifierList, Identifier
from sqlparse.tokens import Keyword, DML
def is_subselect(parsed):
if not parsed.is_group():
return False
for item in parsed.tokens:
if item.ttype is DML and item.value.upper() == 'SELECT':
return True
return False
def extract_from_part(parsed):
from_seen = False
for item in parsed.tokens:
if item.is_group():
for x in extract_from_part(item):
yield x
if from_seen:
if is_subselect(item):
for x in extract_from_part(item):
yield x
elif item.ttype is Keyword and item.value.upper() in ['ORDER', 'GROUP', 'BY', 'HAVING']:
from_seen = False
StopIteration
else:
yield item
if item.ttype is Keyword and item.value.upper() == 'FROM':
from_seen = True
def extract_table_identifiers(token_stream):
for item in token_stream:
if isinstance(item, IdentifierList):
for identifier in item.get_identifiers():
value = identifier.value.replace('"', '').lower()
yield value
elif isinstance(item, Identifier):
value = item.value.replace('"', '').lower()
yield value
def extract_tables(sql):
# let's handle multiple statements in one sql string
extracted_tables = []
statements = list(sqlparse.parse(sql))
for statement in statements:
if statement.get_type() != 'UNKNOWN':
stream = extract_from_part(statement)
extracted_tables.append(set(list(extract_table_identifiers(stream))))
return list(itertools.chain(*extracted_tables))
Thanks I will be using this :+1:
Great job, however during time something must be changed and I had some problems. This is a revisited code, it seems to work, but I didn't study it very deeply:
import itertools
import sqlparse
from sqlparse.sql import IdentifierList, Identifier
from sqlparse.tokens import Keyword, DML
def is_subselect(parsed):
if not parsed.is_group:
return False
for item in parsed.tokens:
if item.ttype is DML and item.value.upper() == 'SELECT':
return True
return False
def extract_from_part(parsed):
from_seen = False
for item in parsed.tokens:
if item.is_group:
for x in extract_from_part(item):
yield x
if from_seen:
if is_subselect(item):
for x in extract_from_part(item):
yield x
elif item.ttype is Keyword and item.value.upper() in ['ORDER', 'GROUP', 'BY', 'HAVING', 'GROUP BY']:
from_seen = False
StopIteration
else:
yield item
if item.ttype is Keyword and item.value.upper() == 'FROM':
from_seen = True
def extract_table_identifiers(token_stream):
for item in token_stream:
if isinstance(item, IdentifierList):
for identifier in item.get_identifiers():
value = identifier.value.replace('"', '').lower()
yield value
elif isinstance(item, Identifier):
value = item.value.replace('"', '').lower()
yield value
def extract_tables(sql):
# let's handle multiple statements in one sql string
extracted_tables = []
statements = list(sqlparse.parse(sql))
for statement in statements:
if statement.get_type() != 'UNKNOWN':
stream = extract_from_part(statement)
extracted_tables.append(set(list(extract_table_identifiers(stream))))
return list(itertools.chain(*extracted_tables))
@DuccioFabbri
Thank you! This was so helpful for parsing ~150 SQL queries, some of which are rather complex.
I found two scenarios that didn't work well
select a from foo join (select a from bar) as b on b.id=foo.id
with mycte as (select id from foo) select b.id from bar as b join foo as f on b.id=f.id
I enhanced your code (that you posted here) by adding unit tests.
Thank you for contribute your code! I'm also working on a project which need to extract the tables form sql queries. I quickly test your code with a simple query, but could not get any output. Can you give me some help on what I missed?Appreciate on your help!``
I think we are getting off topic
extract_tables(sql)
Have you tried printing the return value?
Again, I think this discussion should continue elsewhere. If you are new to Python, try a forum for beginners.
Hi Guys,
Sorry this might be an old thread. I am trying to parse a statement similar to the following
SELECT * FROM (SELECT * from t) s
But the output using the code #https://github.com/andialbrecht/sqlparse/issues/157#issuecomment-518572450 is as follows:
['(select']
Have tried a lot of different changes to the code, but no success.