Languages on Google Translate vs. AI papers published on arXiv per year
It turns out that the number of languages Google has decided to translate and the number of papers researchers have decided to publish about artificial intelligence move together with the kind of eerie synchronicity usually reserved for twins separated at birth or stock markets before a crash. One would think these two things—one a mercenary exercise in making the internet legible to speakers of Icelandic and Tagalog, the other an academic arms race conducted largely in English—would drift apart like continental plates, but instead they have clung to each other with a correlation of 0.973, which is the sort of number that makes statisticians nervous. Apparently humanity, faced with a tool that can understand language, decided simultaneously to teach it more languages and write more papers about why we had taught it more languages.
The real culprit here is almost certainly not conspiratorial but boringly infrastructural: both trends are riding the same wave of cheap computing power and venture capital enthusiasm that has been washing over the technology sector since the early 2010s. As GPU costs plummeted and cloud computing became something you could actually afford, translating seventeen new languages became feasible the same year it became feasible to train a model with a hundred million parameters and publish the results. Consider that in 2010 arXiv received around 600 AI papers, a yearly figure that has since grown to over 10,000, and that Google Translate has added roughly two dozen languages in that same period, each one a small miracle of statistical inference that would have been computational fantasy in 2008. The money and the silicon were going to flow toward language problems, whether those problems were about understanding human languages or understanding how to build systems that understand human languages.
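The common-cause argument above can be checked on toy data. What follows is a minimal Python sketch, not an analysis of the real series: every number in it is synthetic, and the names `driver`, `languages`, and `papers` are hypothetical stand-ins for a shared compute-and-funding trend and the two quantities it pushes upward. It computes the Pearson correlation of the two series, then recomputes it after removing the linear effect of the driver from each one (a partial correlation).

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Assumption: all values below are made up for illustration only.
rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical shared driver, e.g. an index of cheap compute, one value per year.
driver = [float(year) for year in range(15)]

# Two series that each depend on the driver plus their own independent noise.
# Neither series appears in the formula for the other.
languages = [50 + 4 * d + rng.gauss(0, 3) for d in driver]
papers = [600 + 700 * d + rng.gauss(0, 500) for d in driver]

# Raw correlation: inflated by the trend both series inherit from the driver.
raw_r = pearson(languages, papers)

# Partial correlation: correlate what remains once the driver is regressed out.
partial_r = pearson(residuals(languages, driver), residuals(papers, driver))

print(f"raw correlation:     {raw_r:.3f}")
print(f"partial correlation: {partial_r:.3f}")
```

In this setup the raw correlation tends to be high even though neither series influences the other, while the partial correlation reflects only the leftover independent noise. That gap is the signature of a shared cause, which is the situation the paragraph above describes.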
What we are witnessing is not causation but rather two different manifestations of the same underlying technological intoxication: the belief that if you can measure something, digitize it, and feed it to a sufficiently large model, something interesting will happen. Whether you are Google trying to make translation a solved problem or a researcher trying to prove that translation is a problem worth publishing about, you are fundamentally riding the same bull market in computational linguistics. The data simply reflects our species doing what it does best: finding two seemingly unrelated things, discovering they move together, and then writing a paper about it.