Our world is becoming more digital by the day. The pandemic normalised much of it, and now nearly everything we consume is online. But like the physical world, the online world is still very unequal.
Right now, 59% of the Internet is in English. Next is Russian with 5.3%, and in twentieth place is Czech with 0.5%. This is the result of colonisation and globalisation. As we know, Irish has also suffered from these forces, and finds itself classified as ‘definitely endangered’ by UNESCO. Accordingly, AI models trained on the Internet perform much worse in Irish and Czech than in English, which reduces their positive societal impact for those communities.
The most popular type of AI right now is the large language model, which is trained on text from the Internet. We see it everywhere now: in ChatGPT, in realistic chatbots for language learning on Duolingo, as a companion on Snapchat, or writing blog posts, marketing copy and reviews... and this is just the beginning.
Indeed, much of the focus has been on producing more and more content for the Internet in these popular languages, because the AI understands what good Internet content looks like. Because of this, there is a fear that language models won't improve further, as they will increasingly be trained on content produced by other language models. But these fears are greatly exaggerated. Every week in the AI space, we see more cutting-edge developments. And one area of research where development keeps growing is minority languages.
AI helps us develop AI faster. We are learning how to get more output from less input. In the earlier years, every available piece of data was used to build these models, but as we now know: garbage in, garbage out. Instead, we use heuristics and tricks to make the model produce what we want. In this way, AI and the Internet are a variance amplifier, allowing those on the periphery to gain disproportionate benefits. Additionally, with AI now in the hands of individuals, we all have the ability to shape and share the future we want for the language; we don’t have to just consume, we can also be builders.
There is a massive opportunity for smaller languages. The fact that a language has 10 or 1,000 times less data is no longer so relevant. In the case of Irish, despite it being endangered (and thus low-resourced), the efforts of groups like the ADAPT Centre, ABAIR and universities across Ireland, together with general technological developments at companies like OpenAI and Hugging Face, will bring about the rebirth of Irish online. And these developments will aid other languages, in the same way that the development of technology for other minority languages will help Irish.
AI may indeed be built on 1s and 0s, but it is not binary. It is not a zero-sum game. It is a variance amplifier, where the successes and developments of others can be honed and used to benefit your own community. And this is why I am so excited about the future of Irish online.