A peer-reviewed paper from researchers at the University of Limerick and University College Cork recently benchmarked LLMs on Irish grammar. When I compared their results to mine, something remarkable emerged: despite using completely different methods, we’d both run into the same barrier—LLMs just don’t get the nuances of Gaeilge.
Two Ways to Test a Language Model’s Irish
Let’s talk about what these studies actually measured. It turns out there are two fundamentally different ways to test whether an AI “knows” Irish:
Receptive Competence: Can you recognise correct Irish when you see it?
Think of this as a multiple-choice test. The Irish-BLiMP researchers created 1,020 “minimal pairs”—sentences that differ by just one grammatical feature:
✅ Correct: Baineadh geit aisti ar maidin. (“She got a fright in the morning.”)
❌ Incorrect: Bhaineadh geit aisti ar maidin.
The autonomous past tense doesn’t use lenition (that ‘h’ after the ‘b’). The model just has to pick which sentence is grammatically correct.
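As a rough sketch of how a minimal-pair benchmark like Irish-BLiMP works mechanically: present both sentences, ask a judge (in the real benchmark, an LLM) to pick the grammatical one, and measure accuracy. The `toy_judge` below is a stand-in, not the actual evaluation harness.

```python
# Sketch of a minimal-pair evaluation loop, in the spirit of Irish-BLiMP.
# `judge` is any callable that picks the sentence it thinks is grammatical;
# here a toy function stands in for a real LLM call.

def evaluate_minimal_pairs(pairs, judge):
    """Return accuracy: fraction of pairs where the judge picks the correct sentence."""
    correct = 0
    for good, bad in pairs:
        if judge(good, bad) == good:
            correct += 1
    return correct / len(pairs)

# One pair from this article; the real benchmark has 1,020 of these.
pairs = [
    ("Baineadh geit aisti ar maidin.", "Bhaineadh geit aisti ar maidin."),
]

# Toy judge: rejects lenition on the autonomous past tense.
toy_judge = lambda a, b: a if not a.startswith("Bh") else b

print(evaluate_minimal_pairs(pairs, toy_judge))  # 1.0 on this single pair
```

With only two options per item, random guessing already scores 50%, which is why the human baseline of ~90% and model scores of ~65-74% need to be read against that floor.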
Productive Competence: Can you generate correct Irish from scratch?
This is what I tested in Parts 1 and 2. Instead of choosing between options, models had to produce the correct form themselves, through fill-in-the-blank or full sentence translation:
Fill in the blank: It’s a pleasure to meet you: “Is deas bualadh __________.” (Answer: leat)
Translate: “She got a fright in the morning.” (Answer: Baineadh geit aisti ar maidin.)
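Scoring a generation task is fussier than scoring multiple choice, because the model’s free-form output has to be compared against a reference. A minimal sketch (my assumption about one reasonable normalisation, not the exact method used in any of the three studies) is exact match after normalising case, punctuation, and whitespace—taking care not to strip the fadas:

```python
import re
import unicodedata

def normalise(answer: str) -> str:
    """Lowercase, strip punctuation and extra whitespace.

    Python's \\w is Unicode-aware, so accented vowels (fadas) survive.
    """
    answer = unicodedata.normalize("NFC", answer.lower().strip())
    answer = re.sub(r"[^\w\s]", "", answer)
    return re.sub(r"\s+", " ", answer)

def score_exact_match(predictions, references):
    """Fraction of model outputs matching the reference after normalisation."""
    hits = sum(normalise(p) == normalise(r) for p, r in zip(predictions, references))
    return hits / len(references)

# The two examples above, with hypothetical model outputs.
refs = ["leat", "Baineadh geit aisti ar maidin."]
preds = ["Leat", "Bhaineadh geit aisti ar maidin."]  # second one wrongly lenites
print(score_exact_match(preds, refs))  # 0.5
```

Note how the second prediction fails on exactly the lenition error the minimal pair tested—the same grammatical feature trips models up whether they are choosing or generating.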
Joseph McInerney, a researcher at Trinity College Dublin working on LLM development for Irish as part of ABAIR, took a similar approach. He used Gaelchultúr’s existing Irish grammar assessment, which consists of 100 multiple-choice fill-in-the-blank questions that get progressively more difficult (I encourage you to take it yourself to see how you do!).
The Convergence
Here’s where it gets interesting. Using three different methods, we all found essentially the same result:
Irish-BLiMP (Discrimination Task - Zero-shot)
- GPT-5: 73.5%
- Gemini-2.5-flash-lite: 67.1%
- Claude-Sonnet-4.5: 65.9%
My Study Part 1 (Generation Task)
- Claude 3.5: 73.08%
- GPT-4.1: 71.81%
- GPT-4o: 70.44%
- Gemini 2.5 Pro: 67.04%
Joseph McInerney’s Gaelchultúr MCQ Test
- Claude: 73%
- Gemini: 72%
- GPT: 64%
Look at those numbers across all three studies. Despite completely different methodologies (minimal-pair MCQ vs. fill-in-the-blank/translation vs. fill-in-the-blank MCQ), the top performers cluster tightly around the 70% mark. Coincidence? I think not. This is revealing something fundamental about how well (or rather, how poorly) current AI models understand Irish.
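A quick sanity check on the clustering claim, using only the scores listed above:

```python
from statistics import mean

# Top-performer scores reported by each study (percentages, from the lists above).
irish_blimp = [73.5, 67.1, 65.9]
my_part_1 = [73.08, 71.81, 70.44, 67.04]
gaelchultur = [73, 72, 64]

for name, scores in [("Irish-BLiMP", irish_blimp),
                     ("Part 1", my_part_1),
                     ("Gaelchultúr MCQ", gaelchultur)]:
    print(f"{name}: best {max(scores)}, mean {mean(scores):.1f}")
```

All three bests land between 73 and 73.5, and all three means fall in a band of under two points (roughly 68.8 to 70.6)—a remarkably tight cluster for three unrelated methodologies.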
What the Glass Ceiling Reveals
What the Irish-BLiMP researchers found aligns with my results: models handle regular patterns well but struggle with Irish’s more idiosyncratic features, like the autonomous verb form, personal numbers, and genitive noun phrase production.
The Irish-BLiMP researchers tested this with human participants too: native speakers achieved 90.1% accuracy with remarkable consistency (a standard deviation of just 4.29). This sets a high bar: even on a discrimination task with only two options, humans still significantly outperform the best AI models. The models, by contrast, showed high variance (8-12 points), jumping from 40% on one feature to nearly 90% on another.
This variance reveals something crucial: models haven’t internalised systematic grammatical rules. They’re pattern-matching, and when patterns are irregular or rare in training data, performance collapses.
A Reasoning Breakthrough?
Here’s where my Part 2 research adds something new to the conversation. While the Irish-BLiMP researchers tested their models with minimised reasoning capabilities to focus on baseline grammatical knowledge, I took the opposite approach with OpenAI’s reasoning models (o1, o3) by letting them reason fully.
It’s worth noting that the Irish-BLiMP researchers also tested their models with additional support: five example pairs (few-shot prompting) and grammatical explanations (grammar-book context) beyond zero-shot prompting. My tests on reasoning models, by contrast, achieved 78-79% accuracy using only zero-shot prompting—no examples, no grammatical explanations.
The difference came purely from the models’ ability to reason through multi-step transformations: a 6-point jump over the best non-reasoning models at the time (May 2025). More importantly, reasoning models showed dramatic improvements on specific features that require multi-step transformations, like prepositional pronouns and the copula “Is” form. Why does reasoning help so much with these features? Because forming a correct prepositional pronoun isn’t a single step; you need to:
1. Identify the English prepositional phrase (e.g., “of him”).
2. Map it to the correct Irish preposition (e.g., “of” in “afraid of” often maps to roimh).
3. Select the correct Irish pronoun (“sé” for “him”).
4. Form the correct prepositional pronoun (roimh + sé = roimhe).
5. Integrate it into a grammatically correct Irish sentence structure.
Reasoning models can chain these steps together. Non-reasoning models often get the first step right but fail on subsequent transformations.
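To see why step 4 alone is non-trivial, here is a minimal sketch of it as a lookup table. The fused forms below are standard Irish, but this toy table covers only two prepositions and four pronouns; a real system would need every simple preposition, plural persons, and dialectal variants.

```python
# Sketch of step 4: fusing an Irish preposition with a pronoun into the
# inflected prepositional pronoun. Deliberately tiny and incomplete.

PREPOSITIONAL_PRONOUNS = {
    ("roimh", "mé"): "romham",
    ("roimh", "tú"): "romhat",
    ("roimh", "sé"): "roimhe",
    ("roimh", "sí"): "roimpi",
    ("le", "mé"): "liom",
    ("le", "tú"): "leat",
    ("le", "sé"): "leis",
    ("le", "sí"): "léi",
}

def inflect(preposition: str, pronoun: str) -> str:
    """Return the fused prepositional pronoun, e.g. roimh + sé -> roimhe."""
    return PREPOSITIONAL_PRONOUNS[(preposition, pronoun)]

print(inflect("roimh", "sé"))  # roimhe (as in the worked example above)
print(inflect("le", "tú"))     # leat (as in "Is deas bualadh leat")
```

Notice the fused forms are not predictable by simple concatenation (roimh + mé gives romham, not *roimhmé)—exactly the kind of irregular mapping that pattern-matching without systematic rules gets wrong.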
Interestingly, reasoning sometimes hurts performance on simple tasks. For straightforward features like basic prepositions or adjectives, reasoning models actually scored lower—they overthought problems that pattern-matching solved just fine.
What This Means for Irish Language Technology
The convergence of these three independent studies paints a clear picture of where we are with AI and Irish:
The Good News
- Reasoning models show promise, but much more work is needed
- The gap between AI (~70%) and humans (~90%) is narrowing but still wide.
The Challenges
- Certain Irish features remain hard for all models (autonomous verbs, personal numbers). Models struggle in different ways than humans do, which suggests they haven’t internalised the grammar.
- Open-source models lag significantly (see Irish-BLiMP paper), which raises questions for accessibility
- Blindly using these models to produce content will ultimately produce Irish-language slop (I had an interesting thread on how to translate this very term).
The Path Forward
It’s worth noting that all three studies focus on standard Irish (An Caighdeán Oifigiúil); the real-world challenge of handling Munster, Connacht, Ulster and other dialects, with their distinct grammatical features and vocabulary, remains largely unexplored. The Irish-BLiMP researchers suggest several directions, including expanding to dialectal variations, improving training approaches for low-resource languages, and developing better multilingual transfer techniques. Personally, I’m looking forward to digging into the paper’s results and incorporating some of its techniques and context engineering into my own work.
Now we know where the ceiling is—and we’ve seen reasoning models start to crack it. It is up to the Irish-language community to decide what will happen next.
Read Part 1 for the full breakdown of non-reasoning model performance, and Part 2 for the reasoning model analysis. I hope to do a part 3 soon that builds on some of the work of the Irish-BLiMP paper.
The Irish-BLiMP paper is available at arxiv.org/pdf/2510.20957.
Joseph McInerney’s benchmarking can be found on LinkedIn.
