Are LLMs overkill for SQL? What 20,000 queries reveal about database complexity

A Cornell study of 20,000 queries across 376 databases suggests that most SQL generation is a pattern-matching problem — not a reasoning one

By MK · March 2026

I want you to think about the last ten SQL queries you wrote at work. Not the ones you wish you wrote — not the elegant recursive CTEs or the window functions you were proud of. The actual queries. The ones that hit production.

How many of them were just a SELECT with a WHERE clause? Maybe a JOIN and a GROUP BY if you were feeling ambitious? If you're honest, probably most of them. And that's exactly the point a team at Cornell just made with data to back it up.

Yue Li, David Mimno, and Unso Eun Seo Jo collected 376 databases, generated over 20,000 natural language questions with matching SQL queries, and arrived at a conclusion that should make every LLM-for-SQL startup uncomfortable: about 600 templates can cover 70% of all SQL queries. The most frequent template? SELECT COUNT(*) FROM variable. That's it. The billion-parameter model you're paying per-token for is, in most cases, filling in a Mad Libs sheet.

The text-to-SQL gold rush

The text-to-SQL task has been around since the 1970s — Woods was doing it in 1973, Harris in 1977. But it exploded when LLMs entered the picture. Look at the BIRD benchmark leaderboard today: almost every top-ranking system is either agent-based or LLM-driven. These systems are impressive. They can handle ambiguous questions, reason over complex schemas, and self-correct.

But they're also expensive. We're talking several dollars in token cost per query in some cases. And they're unpredictable — you can't guarantee the same question will produce the same SQL twice. For enterprise databases, where queries touch financial data, medical records, or compliance-sensitive tables, that unpredictability isn't a feature. It's a liability.

So the authors asked a question that I think more people should be asking: how complex is SQL, really? Not in theory. In practice. When real humans ask real questions about real databases, how much of SQL's power do they actually use?

How they tested it

The team pulled database schemas from four established text-to-SQL benchmarks — BIRD, Spider 1.0, Spider 2.0-lite, and KaggleDBQA — plus 88 schemas from drawSQL, an open-source repository. That gives them 376 databases with an average of about 8 tables each.

Source	Databases	Tables per DB (avg)	Queries
Bird23-train-filtered (Li et al., 2023)	69	7.57	6,601
Spider 1.0 (Yu et al., 2018)	196	5.15	11,245
Spider 2.0-lite (Lei et al., 2024)	15	14.87	287
KaggleDBQA (Lee et al., 2021)	8	2.12	244
drawSQL	88	14.34	2,112
Overall	376	8.07	20,489

Table 1: Database schemas from five sources — 376 databases, 20,489 queries total
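If you want to check the "about 8 tables each" figure yourself, the overall row is just a database-weighted average of the per-source rows. Here is a quick sketch that recomputes it from the numbers in Table 1. The variable names are mine, not the paper's.

```python
# Rows copied from Table 1: (databases, average tables per database, queries).
sources = {
    "Bird23-train-filtered": (69, 7.57, 6601),
    "Spider 1.0": (196, 5.15, 11245),
    "Spider 2.0-lite": (15, 14.87, 287),
    "KaggleDBQA": (8, 2.12, 244),
    "drawSQL": (88, 14.34, 2112),
}

total_dbs = sum(n_db for n_db, _, _ in sources.values())
total_queries = sum(n_q for _, _, n_q in sources.values())

# Weight each source's average by its number of databases.
avg_tables = sum(n_db * t for n_db, t, _ in sources.values()) / total_dbs

print(total_dbs, total_queries, round(avg_tables, 2))
```

The totals line up with the Overall row, which is a small but reassuring sanity check on the table.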

For the drawSQL schemas, which didn't come with pre-existing queries, they used Claude Sonnet 4.6 to generate natural language questions (NLQs) at three difficulty levels (easy, medium, hard) and then manually verified every single NLQ–SQL pair for correctness. That's a nice touch: it means the generated queries aren't just plausible, they're executable and semantically correct.

Then they did something clever. They turned every SQL query into a template by stripping out the specific table names, column names, and literal values. They created two flavors: hard templates that preserve alias structure and schema roles, and soft templates that collapse everything into generic "variable" tokens. The soft templates are more aggressive — they only keep the SQL skeleton.
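To make the soft-template idea concrete, here is a minimal sketch of how such a normalizer could look. This is my own illustration, not the authors' code: the keyword list is deliberately tiny, the tokenizer is a regex rather than a real SQL parser, and I'm assuming literals collapse into the same "variable" token as identifiers. The function name soft_template is hypothetical.

```python
import re

# Illustrative keyword list only; a real tool would rely on a full SQL parser.
SQL_KEYWORDS = {
    "SELECT", "FROM", "WHERE", "JOIN", "INNER", "LEFT", "RIGHT", "OUTER", "ON",
    "GROUP", "BY", "ORDER", "HAVING", "LIMIT", "AS", "AND", "OR", "NOT", "IN",
    "DISTINCT", "COUNT", "SUM", "AVG", "MIN", "MAX", "ASC", "DESC",
}

TOKEN_RE = re.compile(
    r"""
      '(?:[^']|'')*'                                        # string literal
    | \d+(?:\.\d+)?                                         # numeric literal
    | [A-Za-z_][A-Za-z_0-9]*(?:\.[A-Za-z_][A-Za-z_0-9]*)*   # identifier, maybe qualified
    | <>|<=|>=|!=                                           # two-character operators
    | \S                                                    # any other single character
    """,
    re.VERBOSE,
)


def soft_template(sql: str) -> str:
    """Keep the SQL skeleton; collapse names and literals into 'variable'."""
    out = []
    for tok in TOKEN_RE.findall(sql):
        if tok.upper() in SQL_KEYWORDS:
            out.append(tok.upper())          # keywords survive, normalized to upper case
        elif tok[0] == "'" or tok[0].isdigit():
            out.append("variable")           # literal value (assumption: same token)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("variable")           # table, column, or alias name
        else:
            out.append(tok)                  # punctuation and operators
    return " ".join(out)


print(soft_template("select count(*) from singers"))
print(soft_template("SELECT name FROM users WHERE age > 30"))
```

Two queries against completely different schemas end up as the same string as long as their skeletons match, which is what lets you count template frequencies in the first place. A hard template would need to keep more information (aliases, schema roles), so it is not covered by this sketch.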

SQL has a complexity ceiling

Here's the first finding that made me sit up. The authors defined six proxies for query complexity. Table 2 lays them out.

Proxy	Definition
Num_tables	# of distinct tables referenced in the query
Num_joins	# of JOIN operations (cross-table relational reasoning)
Num_subqueries	# of nested subqueries (hierarchical reasoning)
Max_nesting_depth	Maximum depth of nested queries
Num_aggs_plus_group_by	# of aggregation ops (e.g. COUNT, SUM) plus GROUP BY clauses
Advanced_feature_count	# of window functions, FILTER, set ops, CTEs

Table 2: Six structural proxies for SQL query complexity
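For a feel of what these proxies measure, here is a rough regex-based sketch that approximates five of the six. It is my own approximation, not the paper's implementation: it will miscount on comma-separated FROM lists, on EXTRACT(... FROM ...), and it treats a CTE body as a subquery. I left out Advanced_feature_count because doing it properly needs a real parser. The function name complexity_proxies is hypothetical.

```python
import re


def complexity_proxies(sql: str) -> dict:
    """Approximate five of the structural proxies from Table 2 with regexes."""
    # Blank out string literals so keywords inside them are not counted.
    text = re.sub(r"'(?:[^']|'')*'", "''", sql).upper()

    # Names that directly follow FROM or JOIN (misses comma-separated tables).
    tables = set(re.findall(r"\b(?:FROM|JOIN)\s+([A-Z_][A-Z0-9_.]*)", text))

    # Walk the parentheses, remembering which ones open a subquery.
    depth = max_depth = num_subqueries = 0
    stack = []
    for match in re.finditer(r"\(\s*SELECT\b|\(|\)", text):
        tok = match.group()
        if tok == ")":
            if stack and stack.pop():
                depth -= 1               # closing a subquery
        elif tok == "(":
            stack.append(False)          # ordinary parenthesis, e.g. COUNT(
        else:
            stack.append(True)           # "(SELECT" opens a subquery
            num_subqueries += 1
            depth += 1
            max_depth = max(max_depth, depth)

    num_aggs = len(re.findall(r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(", text))
    num_group_by = len(re.findall(r"\bGROUP\s+BY\b", text))

    return {
        "num_tables": len(tables),
        "num_joins": len(re.findall(r"\bJOIN\b", text)),
        "num_subqueries": num_subqueries,
        "max_nesting_depth": max_depth,
        "num_aggs_plus_group_by": num_aggs + num_group_by,
    }


simple = complexity_proxies("SELECT COUNT(*) FROM singers")
nested = complexity_proxies(
    "SELECT name, COUNT(*) FROM orders o JOIN users u ON o.uid = u.id "
    "WHERE u.id IN (SELECT id FROM vip WHERE score > "
    "(SELECT AVG(score) FROM vip)) GROUP BY name"
)
print(simple)
print(nested)
```

Run it on your own query log and you get a quick picture of how far your real workload sits from the scary end of each proxy.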