The Statistics Behind Google Translate

Author: Carlos Alberto Gómez Grajales

Statistics is usually regarded as a science of numbers. Statisticians work with numbers; that is what we are good at. It is therefore curious that one of the most outstanding and recognizable breakthroughs that statistical techniques brought to science happened within linguistics, a “word” field. Suddenly, in a place where numbers never had much of a role, someone thought of introducing statistical analysis and changed the field forever.

Google Translate is Google’s well-known multi-language translation service. At first, the system was based on SYSTRAN, a software engine still used by several other online translation services such as Yahoo! Babel Fish and AOL. The revolution started in 2007, when Google announced a new, proprietary algorithm, based on statistical models, that would substantially improve the accuracy of its translations [1]. The most powerful and successful translator was based on statistics.

Using statistics to tackle problems involving words and texts wasn’t new at all by 2007. In fact, mathematicians had taken on “word” problems long before. Thomas Corwin Mendenhall, a 19th century physicist, worked on the quantitative analysis of writing style, something that would later be known as stylometry. Mendenhall attempted to characterize and compare the styles of different authors through the frequency distribution of three-letter words, four-letter words, five-letter words and so on, trying to detect patterns in each author’s writing. By using simple word counts, he raised doubts about the authorship of many of Shakespeare’s works, igniting a debate that would last well into the 21st century [2]. Earlier still, René Descartes, the French philosopher and mathematician, had discussed why a universal language would be impossible to create, based mostly on his experience with mathematical analysis and interpretation [3].
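Mendenhall’s idea is simple enough to sketch in a few lines. The toy Python function below (an illustration of the kind of word-length profile he compared across authors, not his actual procedure) computes the relative frequency of each word length in a text:

```python
from collections import Counter
import re

def word_length_profile(text):
    """Mendenhall-style profile: relative frequency of each word length."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: counts[length] / total for length in sorted(counts)}

profile = word_length_profile("To be or not to be that is the question")
print(profile)  # {2: 0.6, 3: 0.2, 4: 0.1, 8: 0.1}
```

Two texts by the same author should, on Mendenhall’s hypothesis, produce similar profiles; a marked difference between profiles suggests different authors.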

The automated translation of texts first became a research subject at MIT in 1951, followed the same year by a machine translation team at Georgetown University. The Georgetown team produced a system grounded in large bilingual dictionaries, in which entries for words of the source language gave one or more equivalents in the target language, together with rules for producing the correct word order in the output. The dictionary and grammar rules were rather limited, yet the results were impressive enough to attract massive funding [4]. The field grew and much research followed, but by the start of the sixties the romance had ended: progress was slow and the funding stopped flowing. A 1966 report by the Automatic Language Processing Advisory Committee put the final nail in the coffin: it concluded that machine translation tools were slower, less accurate, and twice as expensive as human translation, and that “there is no immediate or predictable prospect of useful machine translation” [4].

Warren Weaver, co-author with Claude Shannon of The Mathematical Theory of Communication, the book that would start the digital revolution, also found time to revolutionize the field of machine translation. In July 1949, Weaver wrote a memorandum, simply entitled “Translation”, in which he proposed the use of statistical modeling techniques as a way to improve translation algorithms. Inspired by work on an early type of neural network, Weaver hypothesized that translation could be addressed as a problem of formal logic, deducing logical “conclusions” in the target language from “premises” in the source language [5]. Early researchers did not even consider Weaver’s ideas; the concepts lay dormant until the 1990s, when IBM researchers revived them within the framework of statistical models. The “premise” would be information gathered from data in the source language, while the logical “conclusion” would be the model’s prediction. This is the basic notion behind Google’s groundbreaking algorithm, an idea devised more than sixty years ago.

Google Translate is the best-known example of statistical machine translation (SMT). Being powered by SMT algorithms means that every translation is the output of a statistical model: Google has not built into its translator any language experts’ advice, grammatical rules, or dictionaries. In a broad sense, Google Translate knows nothing about language at all. This is, roughly speaking, how it works: Google has gathered a huge database of human-revised translations of millions of documents. They include European Union records dating back to 1957, in some two dozen languages, as well as official communications made by the UN and its agencies in its six official languages. These texts formed the basis of the first Google Translate algorithms, yet more recent versions have also incorporated records of international tribunals, company reports, and all the articles and books in bilingual form that have been put on the web by individuals, libraries, booksellers, authors, and academic departments. This diversity of input is what allowed Google to include more than 90 languages in its catalog, Catalan, Hebrew, Mongolian, Urdu, and Zulu among many others [6]. The algorithm looks for patterns across this huge collection of translated texts in order to find the translation most likely to match the text you entered [7] [8]. The idea is that the text you wish to translate has probably been written and translated before, somewhere in the vast data set of translated documents that Google has collected; it may even appear many times. By detecting patterns in documents already translated by human translators, the algorithm calculates probabilities as to what an appropriate translation should be, and then offers the most likely one [8].
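The core idea, picking the translation most often produced by human translators, can be sketched with a toy phrase table. The data and the `translate` helper below are invented for illustration; the real system works over vastly larger corpora and scores whole phrases and sentences probabilistically:

```python
from collections import Counter, defaultdict

# Toy parallel corpus: (source phrase, human translation) pairs,
# standing in for Google's collection of translated documents.
corpus = [
    ("la casa", "the house"),
    ("la casa", "the house"),
    ("la casa", "the home"),
    ("el gato", "the cat"),
]

# Count how often each translation was seen for each source phrase.
table = defaultdict(Counter)
for src, tgt in corpus:
    table[src][tgt] += 1

def translate(phrase):
    """Return the translation most often produced by human translators."""
    candidates = table.get(phrase)
    if not candidates:
        return None  # phrase never seen in the corpus
    return candidates.most_common(1)[0][0]

print(translate("la casa"))  # "the house" -- seen twice, vs. once for "the home"
```

Note how a user correction would simply be one more `(source, translation)` pair appended to the corpus, shifting the counts, which is exactly why corrections improve the model.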

The exact process is a secret, yet two methods commonly used in SMT are likely candidates. One is a Bayesian algorithm that calculates the probability of a translation given a source text. Following Bayes’ theorem, each candidate translation is scored by combining two factors: the likelihood that the source string would be produced as a translation of that candidate, and the probability of seeing the candidate at all in the language you wish to translate into [9]. Another likely algorithm is based on a hidden Markov model, a Markov process with unobserved states; hidden Markov models are especially known for their applications in speech, handwriting, and gesture recognition [10]. The system’s architecture was created by Franz Josef Och, a German computer scientist and the chief architect of Google Translate. He headed Google’s translation effort, after his research won the DARPA contest for speed machine translation in 2003, up until 2014 [11].
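The Bayesian method described above is often called the noisy-channel model: by Bayes’ theorem, the best translation e of a source text f maximizes P(e|f) ∝ P(f|e) · P(e), the translation-model likelihood times the target-language probability. A minimal sketch, with all probabilities invented for illustration:

```python
# Toy language model P(e): how plausible each candidate is as English text.
lm = {"the house": 0.6, "the home": 0.3, "house the": 0.1}

# Toy translation model P(f | e): how likely the source string "la casa"
# is, given each candidate translation. (All numbers here are made up.)
tm = {
    ("la casa", "the house"): 0.7,
    ("la casa", "the home"): 0.6,
    ("la casa", "house the"): 0.7,
}

def decode(f):
    """Pick argmax over e of P(e) * P(f|e); P(f) is constant, so it is ignored."""
    return max(lm, key=lambda e: lm[e] * tm.get((f, e), 0.0))

# "house the" has the same channel score as "the house", but the language
# model penalizes its poor word order -- exactly the division of labor
# the noisy-channel factorization is designed to achieve.
print(decode("la casa"))  # "the house"
```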

But whatever the algorithm, Google Translate has some great advantages that no other translator has. First of all, any user around the world can correct Google’s translations, thus adding information to the corpus that feeds the translation model. Remember that the model’s inputs are collections of texts already translated by humans, so every single correction counts as a new piece of data. Another unique tool Google can use to improve its translations is its web search: any translated website, article, or text can be gathered to refine the pattern-detection model. The model is updated daily with new corrections, texts, and additional languages.

The fact that Google Translate relies mostly on previously translated texts is also its weak point. Google does not usually translate directly from Language 1 to Language 2. Consider that many of the translation pairs offered, Korean to Urdu for instance, have no history of translation between them, and therefore no paired texts, on the web or anywhere else. As a result, the algorithm usually translates the entered text into English, the language with the largest number of translation relationships, and then from English into the language you want [8]. In some cases a fourth language is involved: a phrase in Catalan, for instance, would first be translated into Spanish, then into English, and then into whichever language you requested. This chained process is prone to error, which explains why Google Translate is good with some languages yet barely capable with others. The algorithm depends on the amount and quality of the translated documents that serve as its input, just as the quality of raw data is key to the success of any model.
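The pivoting described above can be sketched as a chain of direct phrase tables. The data and the `translate_chain` helper are invented for illustration: direct tables exist only for pairs with parallel text, so Catalan-to-German must hop through Spanish and English.

```python
# Direct phrase tables exist only for language pairs with parallel text.
direct = {
    ("ca", "es"): {"bon dia": "buenos días"},
    ("es", "en"): {"buenos días": "good morning"},
    ("en", "de"): {"good morning": "guten Morgen"},
}

def translate_chain(text, chain):
    """Apply successive direct steps, e.g. Catalan -> Spanish -> English -> German."""
    for src, tgt in zip(chain, chain[1:]):
        text = direct.get((src, tgt), {}).get(text)
        if text is None:
            return None  # no parallel data (or no match) for this step
    return text

# Catalan to German with no direct Catalan-German table: pivot twice.
print(translate_chain("bon dia", ["ca", "es", "en", "de"]))  # "guten Morgen"
```

Each hop compounds whatever error the previous hop introduced, which is why the more pivots a pair needs, the worse its translations tend to be.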

So far, Google Translate is not capable of replacing human translators, a goal that still lies far ahead. But thanks to its intricate algorithms, grounded in solid statistical concepts, it is recognized as one of the greatest breakthroughs in machine translation. By relying on sophisticated analytical tools, Google changed the landscape of the translation business forever. Yet another field revolutionized thanks to statistics.


[1] Google Switches to Its Own Translation System – Google Operating System Website (October, 2007)

[2] Mendenhall, Thomas C. Did Marlowe write Shakespeare? Current Literature (February, 1902)

[3] Descartes Discusses the Idea of an Artificial Language – (Jan, 2010)

[4] Hutchins, John. The History of Machine Translation in a Nutshell – Self-published (November, 2005)

[5] Weaver, Warren. Translation – The Rockefeller Foundation Archives   (July 15, 1949).

[6] About Google Translate – Google Translate Official Website.

[7] Google seeks world of instant translations – Reuters (March, 2007)

[8] How Google Translate Works – The Independent (September, 2011).

[9] Statistical Machine Translation – Wikipedia The Free Encyclopedia (Last modified: May, 2015)

[10] Hidden Markov Model  – Wikipedia The Free Encyclopedia (Last modified: May, 2015)

[11] Och, Franz Josef. Statistical Machine Translation: From Single Word Models to Alignment Templates – Doctoral dissertation, RWTH Aachen (2002).