A Cornell study of 20,000 queries across 376 databases suggests that most SQL generation is a pattern-matching problem — not a reasoning one
By MK · March 2026
I want you to think about the last ten SQL queries you wrote at work. Not the ones you wish you wrote — not the elegant recursive CTEs or the window functions you were proud of. The actual queries. The ones that hit production.
How many of them were just a SELECT with a WHERE clause? Maybe a JOIN and a GROUP BY if you were feeling ambitious? If you're honest, probably most of them. And that's exactly the point a team at Cornell just made with data to back it up.
Yue Li, David Mimno, and Unso Eun Seo Jo collected 376 databases, assembled over 20,000 natural language questions with matching SQL queries, and arrived at a conclusion that should make every LLM-for-SQL startup uncomfortable: about 600 templates cover 70% of the queries in their collection. The most frequent template? `SELECT COUNT(*) FROM variable`. That's it. The billion-parameter model you're paying per-token for is, in most cases, filling in a Mad Libs sheet.
The text-to-SQL task has been around since the 1970s — Woods was doing it in 1973, Harris in 1977. But it exploded when LLMs entered the picture. Look at the BIRD benchmark leaderboard today: almost every top-ranking system is either agent-based or LLM-driven. These systems are impressive. They can handle ambiguous questions, reason over complex schemas, and self-correct.
But they're also expensive. We're talking several dollars in token cost per query in some cases. And they're unpredictable — you can't guarantee the same question will produce the same SQL twice. For enterprise databases, where queries touch financial data, medical records, or compliance-sensitive tables, that unpredictability isn't a feature. It's a liability.
So the authors asked a question that I think more people should be asking: how complex is SQL, really? Not in theory. In practice. When real humans ask real questions about real databases, how much of SQL's power do they actually use?
The team pulled database schemas from four established text-to-SQL benchmarks — BIRD, Spider 1.0, Spider 2.0-lite, and KaggleDBQA — plus 88 schemas from drawSQL, an open-source repository. That gives them 376 databases with an average of about 8 tables each.
| Source | Databases | Avg. tables per database | Queries |
|---|---|---|---|
| Bird23-train-filtered (Li et al., 2023) | 69 | 7.57 | 6,601 |
| Spider 1.0 (Yu et al., 2018) | 196 | 5.15 | 11,245 |
| Spider 2.0-lite (Lei et al., 2024) | 15 | 14.87 | 287 |
| KaggleDBQA (Lee et al., 2021) | 8 | 2.12 | 244 |
| drawSQL | 88 | 14.34 | 2,112 |
| Overall | 376 | 8.07 | 20,489 |
Table 1: Database schemas from five sources — 376 databases, 20,489 queries total
For the drawSQL schemas, which didn't come with pre-existing queries, they used Claude Sonnet 4.6 to generate natural language questions at three difficulty levels (easy, medium, hard) and then manually verified every single NLQ–SQL pair for correctness. That's a nice touch — it means the generated queries aren't just plausible, they're executable and semantically correct.
Then they did something clever. They turned every SQL query into a template by stripping out the specific table names, column names, and literal values. They created two flavors: hard templates that preserve alias structure and schema roles, and soft templates that collapse everything into generic "variable" tokens. The soft templates are more aggressive — they only keep the SQL skeleton.
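To make the soft-template idea concrete, here is a minimal sketch of how such a collapse could work. It is my own illustration, not the authors' code: the keyword list, the regex tokenizer, and the names `soft_template` and `coverage` are all assumptions, and a serious version would use a real SQL parser. The `coverage` helper shows how a "top k templates cover X% of queries" figure can be computed once every query has been reduced to its skeleton.

```python
import re
from collections import Counter

# Hypothetical keyword list: an assumption for this sketch, not the paper's.
SQL_KEYWORDS = {
    "SELECT", "FROM", "WHERE", "JOIN", "INNER", "LEFT", "RIGHT", "OUTER", "ON",
    "AS", "GROUP", "BY", "ORDER", "HAVING", "LIMIT", "DISTINCT", "AND", "OR",
    "NOT", "IN", "LIKE", "BETWEEN", "IS", "NULL", "COUNT", "SUM", "AVG", "MIN",
    "MAX", "ASC", "DESC", "UNION", "INTERSECT", "EXCEPT", "WITH", "OVER",
}

# Very rough tokenizer; alternatives are tried left to right.
TOKEN_RE = re.compile(
    r"""
      '[^']*'                            # string literal
    | "[^"]*"                            # quoted identifier
    | \d+(?:\.\d+)?                      # numeric literal
    | [A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*    # bare or qualified name
    | <=|>=|<>|!=                        # two-character operators
    | \S                                 # any other single symbol
    """,
    re.VERBOSE,
)


def soft_template(sql: str) -> str:
    """Keep keywords and symbols; replace names and literals with 'variable'."""
    out = []
    for tok in TOKEN_RE.findall(sql):
        if tok.upper() in SQL_KEYWORDS:
            out.append(tok.upper())
        elif tok[0] in "'\"_" or tok[0].isalnum():
            out.append("variable")
        else:
            out.append(tok)
    return " ".join(out)


def coverage(templates: list[str], k: int) -> float:
    """Fraction of queries whose template is among the k most frequent."""
    counts = Counter(templates)
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(templates)


queries = [
    "SELECT count(*) FROM singer",
    "SELECT COUNT(*) FROM orders",
    "SELECT name FROM singer WHERE age > 30",
]
templates = [soft_template(q) for q in queries]
print(templates[0])            # SELECT COUNT ( * ) FROM variable
print(coverage(templates, 1))  # two of the three toy queries share one template
```

The first two toy queries hit different tables but land on the same skeleton, which is the effect the paper is measuring at scale. A hard template would need more bookkeeping (tracking which placeholder is a table, a column, or an alias), which this sketch deliberately skips.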
Here's the first finding that made me sit up. The authors defined six proxies for query complexity. Table 2 lays them out.
| Proxy | Definition |
|---|---|
| Num_tables | # of distinct tables referenced in the query |
| Num_joins | # of JOIN operations (cross-table relational reasoning) |
| Num_subqueries | # of nested subqueries (hierarchical reasoning) |
| Max_nesting_depth | Maximum depth of nested queries |
| Num_aggs_plus_group_by | # of aggregation ops (e.g., COUNT, SUM) and GROUP BY clauses |
| Advanced_feature_count | # of window functions, FILTER, set ops, CTEs |
Table 2: Six structural proxies for SQL query complexity
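If you want to run these six proxies against your own query logs, a rough approximation is easy to write. The sketch below is my own regex-based guess at each definition in Table 2, not the authors' implementation; the function name `complexity_proxies` and the exact counting rules are assumptions. It will miscount in known ways: keywords inside string literals, comma-separated tables in a FROM clause, CTE names counted as tables, and a WITH clause counted once no matter how many CTEs it defines.

```python
import re

FLAGS = re.IGNORECASE


def complexity_proxies(sql: str) -> dict:
    """Approximate the six structural proxies with regexes (assumed rules)."""
    # Distinct names that directly follow FROM or JOIN.
    tables = {
        name.lower()
        for name in re.findall(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", sql, FLAGS)
    }

    # Track only parentheses that open a subquery to find the deepest nesting.
    depth = max_depth = 0
    stack = []
    for m in re.finditer(r"\(\s*SELECT\b|\(|\)", sql, FLAGS):
        tok = m.group(0)
        if tok == ")":
            if stack and stack.pop():
                depth -= 1
        else:
            is_subquery = len(tok) > 1  # "(" followed by SELECT
            stack.append(is_subquery)
            if is_subquery:
                depth += 1
                max_depth = max(max_depth, depth)

    aggs = len(re.findall(r"\b(?:COUNT|SUM|AVG|MIN|MAX)\s*\(", sql, FLAGS))
    group_bys = len(re.findall(r"\bGROUP\s+BY\b", sql, FLAGS))

    advanced = (
        len(re.findall(r"\bOVER\s*\(", sql, FLAGS))       # window functions
        + len(re.findall(r"\bFILTER\s*\(", sql, FLAGS))   # FILTER clauses
        + len(re.findall(r"\b(?:UNION|INTERSECT|EXCEPT)\b", sql, FLAGS))
        + len(re.findall(r"\bWITH\b", sql, FLAGS))        # CTE blocks
    )

    return {
        "num_tables": len(tables),
        "num_joins": len(re.findall(r"\bJOIN\b", sql, FLAGS)),
        "num_subqueries": len(re.findall(r"\(\s*SELECT\b", sql, FLAGS)),
        "max_nesting_depth": max_depth,
        "num_aggs_plus_group_by": aggs + group_bys,
        "advanced_feature_count": advanced,
    }


example = (
    "SELECT T1.name, COUNT(*) FROM singer AS T1 "
    "JOIN concert AS T2 ON T1.id = T2.singer_id "
    "WHERE T2.year > (SELECT AVG(year) FROM concert) "
    "GROUP BY T1.name"
)
print(complexity_proxies(example))
```

For the example query this reports two tables, one join, one subquery at depth one, three aggregation-plus-GROUP-BY hits, and no advanced features. Anything you plan to rely on should go through a proper SQL parser instead of regexes.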