Google Releases Gemini-SQL2: Gemini 3.1 Pro Text-to-SQL Scores 80.04% on BIRD Single-Model Leaderboard

The Google Research team has announced Gemini-SQL2 on X, describing it as a breakthrough text-to-SQL capability powered by Gemini 3.1 Pro. Gemini-SQL2 posted 80.04% execution accuracy on the BIRD Text-to-SQL Leaderboard (Single Model). Google’s chart places it above the company's own Gemini-SQL, the prior top entry. The metric measures whether generated SQL runs and returns correct results, not whether it merely looks valid.

https://x.com/GoogleResearch/status/2065475343205740911

Gemini-SQL2

Gemini-SQL2 is a text-to-SQL capability, not a standalone foundation model release. It translates natural language questions into what Google calls ‘execution-ready SQL queries.’ The capability is built on Gemini 3.1 Pro.

Per the announcement on X, “data subtlety & complex business contexts make generating accurate SQL from natural language notoriously hard.” The post also stated that “improved SQL understanding can elevate natural language skills across Google’s data services.” That points toward integration targets such as BigQuery Studio, AlloyDB AI, and Cloud SQL Studio, which already ship Gemini-based SQL generation. Google has not yet confirmed which products will receive Gemini-SQL2.

Benchmarks

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is an industry standard for this task. It contains 12,751 question-SQL pairs across 95 databases spanning 37 professional domains, totaling 33.4GB. The databases include dirty values and require external knowledge grounding, unlike older benchmarks such as Spider.

BIRD measures execution accuracy (EX): the generated SQL must run and return results matching the gold query. Google stated this directly. “Per the BIRD benchmark, which measures execution-verified accuracy, GeminiSQL-2’s SQL doesn’t just look right, it also runs successfully.”
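To make the execution-accuracy idea concrete, here is a minimal sketch of how such a check could be written. It is not the official BIRD evaluator: the toy `accounts` table, the helper names, and the choice to compare rows as order-insensitive multisets are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return all rows for a query, or None if the query fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when the predicted query runs and returns the same rows as the gold query.

    Rows are compared as multisets, ignoring order (an assumption of this
    sketch; the official evaluator may differ in detail).
    """
    predicted = run_query(conn, predicted_sql)
    gold = run_query(conn, gold_sql)
    if predicted is None or gold is None:
        return False
    return Counter(predicted) == Counter(gold)


def execution_accuracy(conn, pairs):
    """Percentage of (predicted, gold) pairs whose results match."""
    hits = sum(execution_match(conn, p, g) for p, g in pairs)
    return 100.0 * hits / len(pairs)


# Hypothetical toy database, not taken from BIRD.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER, region TEXT, mrr REAL);
    INSERT INTO accounts VALUES (1, 'EMEA', 100.0), (2, 'EMEA', 50.0), (3, 'APAC', 70.0);
""")

gold = "SELECT region, SUM(mrr) FROM accounts GROUP BY region"
pairs = [
    # Correct: same rows, different ordering.
    ("SELECT region, SUM(mrr) FROM accounts GROUP BY region ORDER BY region", gold),
    # Runs, but returns wrong rows (AVG instead of SUM).
    ("SELECT region, AVG(mrr) FROM accounts GROUP BY region", gold),
    # Fails to run: the column does not exist.
    ("SELECT region, SUM(revenue) FROM accounts GROUP BY region", gold),
    # Correct: different syntax, same result.
    ("SELECT a.region, SUM(a.mrr) FROM accounts AS a GROUP BY a.region", gold),
]
score = execution_accuracy(conn, pairs)
print(f"EX = {score:.2f}%")
```

The second query is the interesting case: it is valid SQL and executes without error, yet it counts as a miss because its rows differ from the gold result.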

The Single Trained Model Track restricts the preprocessing, retrieval, and agentic frameworks that ensembles use to boost scores. It measures the model’s core text-to-SQL ability. Google Cloud’s prior record on this track, reported November 15, 2025, was 76.13. Google benchmarks human performance at 92.96, leaving a 12.92-point gap from 80.04.
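The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the three numbers quoted above; the "share closed" figure is a derived illustration, not a number Google reported.

```python
# Figures quoted in the article (BIRD execution accuracy, in points).
gemini_sql2 = 80.04
prior_google_record = 76.13  # reported November 15, 2025
human_performance = 92.96

# Improvement over Google's prior record on this track.
gain_over_prior = round(gemini_sql2 - prior_google_record, 2)

# Remaining distance to the human baseline.
gap_to_human = round(human_performance - gemini_sql2, 2)

# Share of the prior record's distance to human performance that is now closed.
share_closed = round(
    100 * gain_over_prior / (human_performance - prior_google_record), 1
)

print(gain_over_prior, gap_to_human, share_closed)
```

On these inputs the gain over the prior record is 3.91 points, the remaining gap is 12.92 points, and about 23% of the earlier distance to human performance has been closed.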

How the Leaderboard Stacks Up

Google’s chart, published in the X post, shows Gemini-SQL2 ahead of eight named competitors, along with several unlabeled points. Only the 80.04% figure is stated as text. The other values below are read from each point's position on the chart and are approximate; dates reflect each point’s horizontal placement.

| System | Organization | BIRD Execution Accuracy (Single Model) | Chart Date |
| --- | --- | --- | --- |
| Gemini-SQL2 | Google | 80.04% (stated) | Jun 2026 |
| Gemini-SQL | Google | ~77.2% | Mar 2026 |
| Q-SQL | AWS | ~76.5% | Dec 2025 |
| Databricks RLVR 32B | Databricks | ~75.7% | Jul 2025 |
| SiriusAI-Text2SQL-32B-v2 | Tencent | ~75.0% | Dec 2025 |
| Arctic-Text2SQL-R1-32B | Snowflake | ~73.9% | Jun 2025 |
| GPT-5.5-xhigh | OpenAI | ~72.5% | Apr 2026 |
| SQLWeaver-32B | Alibaba | ~71.7% | May 2026 |
| Claude Opus 4.6 | Anthropic | ~70.1% | Feb 2026 |

Two patterns are visible. Google now holds the top two named positions, Gemini-SQL2 and Gemini-SQL. Several specialized 32B SQL models also sit above some general frontier models on this chart.

Use Cases with Examples

Self-service analytics: A revenue manager asks for monthly recurring revenue by region, for accounts that churned within 90 days of upgrade. This needs joins, window logic, and date arithmetic. Execution-verified generation catches SQL that runs but returns wrong rows.

Data engineering drafts: Developers can draft BigQuery transformations from English, then review them rather than write from scratch. Google’s November 2025 work identified schema understanding as the hard part. Higher BIRD scores reflect better handling of ambiguous columns and messy values.

Embedded “ask your data” features: SaaS teams adding natural-language query interfaces still need human review at 80% accuracy, because roughly one in five queries can be wrong. The score sets expectations; it does not remove the need for review.
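A rough sense of what 80% accuracy means in practice can be sketched as follows. This assumes, purely for illustration, that the BIRD score transfers directly to a production workload and that errors are independent across queries; neither assumption is claimed by Google, and the function names are hypothetical.

```python
def expected_wrong(num_queries, accuracy=0.8004):
    """Expected number of incorrect queries at a given per-query accuracy."""
    return num_queries * (1 - accuracy)


def prob_all_correct(num_queries, accuracy=0.8004):
    """Chance that every query in a batch is correct, assuming independent errors."""
    return accuracy ** num_queries


for n in (1, 5, 10, 100):
    print(
        f"{n:>3} queries: ~{expected_wrong(n):.1f} expected wrong, "
        f"{prob_all_correct(n):.1%} chance all are correct"
    )
```

Under these assumptions, the chance that a multi-query dashboard is entirely correct falls quickly as the number of generated queries grows, which is why review remains part of the workflow.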

Gemini-SQL2 Launch: Community Reception Dashboard

Verified public engagement on Google Research’s announcement posts • first ~3 hours • Jun 12, 2026

Platform Engagement Breakdown

X / Twitter (main post)
Views: 144.4K
Likes: 2,800
Reposts: 267
Bookmarks: 1,300
Replies: 64
Engagement rate: 3.1%

LinkedIn (main post)
Reactions: 349+
Comments: 12
Reposts: 27

Reception signal: 9.3 : 1

This is the dashboard's bookmark-plus-like to reply ratio on X. A high save rate with few replies typically signals approval over controversy. Comment-level sentiment was not yet measurable because replies were still loading at capture time.

Data verified Jun 12, 2026 from Google Research posts on X (9:44 AM PT, 144.4K views) and LinkedIn (348 reactions + author, 3h after posting). Leaderboard values besides 80.04% are read from Google’s published chart and marked approximate (~). Dashboard by Marktechpost.

