Where Probability Meets Literature and Language: Markov Models for Text Analysis

by Hari Balasubramanian

220px-Markovkate_01Is probabilistic analysis of any use in analyzing text – sequences of letters or sequences of words? Can a computer generate meaningful sentences by learning statistical properties such as how often certain strings of words or sentences occur in succession? What other uses could there be of such analysis? These were some questions I had this year as I collected material to teach a course on a special class of probability models called Markov chains. The models owe their name to the Russian mathematician Andrey Markov, who first proposed them in a 1906 paper titled “Extension of the law of large numbers to dependent quantities”.

The key phrase, as we shall see, is ‘dependent quantities'. Broadly speaking, Markov models are applications of that basic rule of conditional probability, P(A|B): the probability of Event A happening, given that B occurs. The uses of Markov chains are many and varied – from the transmission of genes through generations, to the analysis of queues in telecommunication networks, to the movements of particles in physics. In 2006 – the 100th anniversary of Markov's paper – Philipp Von Hilgers and Amy Langville summarized the five greatest applications of Markov chains. This includes the one that is unknowingly used by most of us on a daily basis: every time we search on the internet, the ranking of webpages is based on the solution to massive Markov chain.

The focus of this piece, however, is the analysis of letter and word sequences as they appear in text. In what follows, I'll look at four examples where Markov models play a role.

1. Vowel and Consonant Pairs in Pushkin's Eugene Onegin

The first such example was demonstrated by Andrey Markov himself in 1913. To illustrate an example of his theory on dependent quantities, Markov had collected data – painstakingly, by hand! – on the first 20,000 letters of Alexander Pushkin's popular novel in verse, Eugene Onegin. He was interested in counts of vowels and consonants and the order in which they appeared. Of the first 20,000 letters in Eugene Onegin 8638 were vowels and 11362 were consonants. The overall probability estimate that a letter is a vowel is therefore 8638/20000 = 0.43. For a consonant, the same estimate is 11362/20000 = 0.57.


Suppose the probability that a letter is a vowel or consonant is independent of what the previous letter was – in the same way that the outcome of a coin toss is independent of the previous toss. Just as the probability of a heads following a heads is 0.5*0.5 = 0.25, we can calculate the probability that: (1) a vowel is followed by a vowel (0.43*0.43 = 0.185), (2) a vowel is followed by a consonant (0.43*0.57 = 0.245), (3) a consonant is followed by a vowel (0.57*0.43 = 0.245) and (4) a consonant is followed by a consonant (0.57*0.57 = 0.325).

If these 4 probabilities (which sum to 1) were correct, we would expect that in 19,999 letter pairs of Eugene Onegin we should find approximately 0.185*19,999 = 3698 pairs where a vowel is followed by a vowel.

But it's not hard to see that the independence assumption is strange. A vowel is more likely to be succeeded by a consonant than it is by a vowel. Markov's counts based on 19,999 pairs of successive letters demonstrated this clearly. The number of pairs where a vowel is followed by a vowel is 1104, less than a third the number (3698) estimated assuming independence. Here are same four probabilities we discussed above, but now based on the pairs actually observed in Onegin:

v-v count: 1104, P (second letter is v, given that the first is a v) = 1104/8638 = 0.128

v-c count: 7534, P (second letter is c, given that the first is a v) = 7534/8638 = 0.872

c-v count: 7534, P (second letter is v, given that the first is a c) = 7534/11362 = 0.663

c-c count: 3827, P (second letter is c, given that the first is a c) = 3827/11362 = 0.337

[My reference is this article, and the figure above comes from here.]

What we see above is a simple illustration of dependent quantities. In this case, the probability that a letter is a consonant or vowel depends only what the previous letter was, but nothing more than that.

Markov's application of probability to letters in a text must have seemed quaint at the time. What practical value could the analysis of vowels and consonants have? Andrey Kolmogorov (1903-1987), another Russian mathematician – who came up with the axioms of probability – felt that Markov chose Eugene Onegin because he was somewhat isolated in Russia and therefore wasn't able to apply his ideas to the exciting discoveries in physics that Western Europe was abuzz in the first decades of the 20th century.

But what is quaint in one era can suddenly become important in another. As David Link notes in his article, Traces of the Mouth, Markov's efforts in retrospect “represent an early and momentous attempt to understand the phenomenon of language in mathematical terms.” It's not an exaggeration to say that Markov's analysis of text is in principle similar to what Google and other firms now routinely carry out on a massive scale: analyzing words in books and internet documents, the order in which the words occur, analyzing search phrases, detecting spam and so on.

2. Sentences with Meaning?

Markov's innocuous seeming analysis of text took off in a huge way in 1948. This was when Claude Shannon, a Bell Labs researcher, published A Mathematical Theory of Communication, a landmark paper that spawned the field of information theory. Shannon's theory impacted telecommunications the most, but also had a lasting influence on computational linguistics and machine learning.

457px-Shannon_communication_systemShannon was concerned with efficient methods of sending analog messages from one point to another. For example, telegraphs at the time used a sequence of dots, dashes and spaces to transmit the sequence of letters that made up a message. Shannon's big insight was that the statistical properties of the message, if known beforehand, would be greatly useful. If certain letters occur very frequently in the message, it makes sense to create short bits to encode them, and longer bits for less frequent letters. This way you save on the number of bits you have to send across a communication channel. Shannon's paper covers a lot of ground and includes many fascinating ideas, such as information entropy, but I am concerned here only with the link to Markov.

Like Markov, Shannon considered the case where the probability of a letter could be independent of what preceded it. He also considered the case where the probability of a letter depended only on the letter that preceded it: similar to the dependence Markov demonstrated for vowel and consonant pairs.

But it's not necessary to only have one step dependence, which is clearly too restrictive. You could go even further with two step dependence, bigrams: transitions from one letter pair to another, in such a way that 2nd letter of the first pair matches with the 1st letter. For example, the bigram ‘ab' could be followed by the bigram ‘bc' — with the letter b as the connecting link, we would get the sequence ‘abc'. In the case of trigrams, a sequence of three letters would be followed by another sequence of three such that last two letters of the former are the same as the first two letters of the latter. So ‘xyz' can link to ‘yza' to give us ‘xyza'. In probability terms, this would be: P (next sequence is yza, given that the previous sequence is is xyz).

Extending the idea more generally, you get n-grams. Bigrams, trigrams and n-grams are examples higher order Markov models, where the dependence is not restricted to just the prior step, but stretches 2, 3 or n prior steps. In essence, we are allowing more and more memory to be included.

Next, Shannon explored the possibility of having a word as the basic unit of a sequence rather than a letter and then applying bigrams (pairs of words), trigrams (a sequence of 3 successive words) and n-grams (a sequence of n successive words). A modern, everyday example is typing the first word of a search term in Google, only to find the search engine providing suggestions for the next few words. Although I am not sure, it is certainly possible that n-gram type Markov models, based on a database of what Google users have searched (in a particular region, say), are behind these automated suggestions.

Another application is completing words or sentences while you are typing on your phone. For those who are unable to type due to a disability, this could be particularly useful. Here is an example where a gaze-tracker, simply by gauging where the visual focus of a person is, can help select and type letters and words in case a hand cannot be used. According to the website, after an hour's practice, you can type 20-30 words per minute with your eye!

Once we have enough data on commonly occurring sequences of words, we could have an algorithm generate text by randomly sampling from the counts we have learned. Such generated text would have the same statistical properties as the text data we used to train it. Shannon hypothesized that higher order Markov models would get us closer to language itself. For example, bigram counts involving letters as units (space counts as a letter) leads, in Shannon's paper, to this meaningless stream of randomly generated text:


In contrast, a bigram approximation with words as units led Shannon to this randomly generated sentence, still complete nonsense, but with minor parts that do hold together:


If word bigrams can get us this far, could we get consistently meaningful sentences if we use higher order Markov chains that are trained with millions of words? Such large amounts of data were not readily available in Shannon's time, but we do now have massive amounts of text that can be processed easily. Consider this randomly generated text in Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective. The text generated from a 4-gram word model, in which the probabilities were estimated from something called the Broadcast News Corpus:



This sounds just as nonsensical as Shannon's text, even though it's a higher order model and even though the model was trained with 400 million words!

What we call ‘meaning' emerges from a sequence of letters and words, but the vast majority of such sequences have no meaning at all. So coming up with language that makes sense, purely through probabilistic approaches driven by lots of data, is not the best idea.

It's not hard to see why. If a computer is trying to predict what I'll say next, based on what it's heard me say in the past, then it's not going to get very far, other than predicting some garbled combination of greetings and generalities. I may follow the rules of grammar, and I may (uniquely as an individual) repeat some phrases and combination of phrases (which a probabilistic model will be able to detect and which might be interesting in their own right), but what I say depends on the thoughts and sensory perceptions I am currently having (some of which I am consciously aware of, others that are in the background), whom I am speaking to, what specific memories come back to me that moment, what I may or may not have said a few sentences ago or what I might have said a few days ago.

Approximating that function would be like capturing consciousness itself! This would explain why no machine can currently participate in the type of dynamic, context-driven conversations that we humans can carry on so easily. Or create a page of meaningful text that is not drivel.

But provide some structure, such as a question answer format or a search phrase, an enormous database to search from and efficient algorithms, and we get something like IBM's Watson, which is able to successfully parse through massive amounts of text, infer correlations, and be successful in formats such as Jeopardy!. There has also been progress with regard to machines recognizing the words a person is speaking. Detecting what words are being spoken is not as trivial as it sounds. If you say, “recognize speech”, the computer might feel you said “wreck a nice speech” or “wreck a nice beach” [example from Kevin Murphy's Machine Learning]. It will also struggle to figure out when a particular word ends and the next one begins. What the algorithm presumably has as input is the digitized version the acoustic signal, and there may be some extraneous noise in the background. From this, it has to probabilistically infer the correct sounds and assemble them to form words and sentences. So called Hidden Markov Models have been used for this.

3. Identifying the Author

While generating consistently meaningful sentences remains a challenge, Markov models can be useful in figuring out who might have authored a text. The hypothesis here is that the sequence in which words or letters appear is a signature, unique to a particular author. We all know that each writer has something we call as ‘style'. Naipaul's short and declarative sentences come to mind, or the effusive, over the top prose that Salman Rushdie is known for.

I am reminded of an essay by Zadie Smith many years ago in which she wrote:

“A writer's personality is his manner of being in the world: his writing style is the unavoidable trace of that manner. When you understand style in these terms, you don't think of it as merely a matter of fanciful syntax, or as the flamboyant icing atop a plain literary cake, nor as the uncontrollable result of some mysterious velocity coiled within language itself. Rather, you see style as a personal necessity, as the only possible expression of a particular human consciousness.”

From the point of view of probability models, what we call a style or sensibility has to be inferred from the collection or arrangement of sentences. Sentences in turn are composed of words in a particular order and words are simply sequences of letters. Letters by themselves don't mean anything, so is it really possible that an author could be identified based on how many times a letter in an alphabet follows another in a text?

In 2001, Dmitri Khmelev and Fiona Tweedie published results that suggested such a technique could be reasonably successful. They selected 45 authors – mostly 18th and 19th century writers – that had more than one text in Project Gutenberg, a repository of online books. In total these 45 authors had authored 387 books. To determine an author's signature, Khmelev and Tweedie used the probability of a letter following another – the simplest one-step application of a Markov chain. So the probability matrix for each author consisted of 27 rows and 27 columns (one for each alphabet and one for blank space, such as the one below, which I got from here).


If a book is selected at random, and we don't know who its author is, the algorithm has to choose the right author among 45 possible authors. It must do so based solely on how closely observed counts of alphabet pairs in the unidentified books (which the algorithm is encountering for the first time) resemble the counts observed for each author in all their other, identified texts. In the end, the algorithm produces a rank quantifying this resemblance. If an author gets a rank of 0, this means she or he is the most likely author of the randomly selected book. If an author gets a rank of 1, this means the author is second most likely author, and so on.

The algorithm did pretty well and identified the correct author in 288 of the 387 cases! In 26 other cases, the correct author was ranked 2nd. So the success rate is not close to 100% but it is still non-trivially good considering that the occurrence of pairs of letters was used as a signature. Success did depend on the author, though. All 8 of Jane Austen's books were correctly identified. So were Agatha Christie's two books. Joseph Conrad's 22 books were identified for the most part. However, the algorithm had some trouble with Charles Dickens' 57 texts. On average, Dickens was the third choice for his own books!

4. Do Ancient Symbols Constitute a Written Script?

Now to a detection problem of a different kind. If archaeological excavations have unearthed a large corpus of symbols, how do we know that these symbols are evidence of a written script? The symbols, although they appear in a sequence, could be some type of religious or artistic expression, not necessarily a linguistic script. If someone in the distant future excavated samples of printed DNA sequences, which consist of 4 letters A, G, C and T, then could they prove or disprove that the sequence is a written script? Similarly, what would the conclusion be if samples of Fortran programming code were excavated?


These are precisely the type of questions that this 2009 Science paper attempted to answer using the conditional probability principles that underlie Markov models. The corpus they applied it to was the excavated symbols of the Indus Valley Civilization, “which stretched from what is now eastern Pakistan and northwestern India” from around 2600-1900 BCE. There are over 3800 such inscriptions made up of 417 symbols. The average length of each inscription (the analogy that comes to my mind is word length) is around 5 symbols. The largest consists of 17 symbols.

The Indus script has not yet been deciphered. Indeed, because it is yet undeciphered, there still remains a question whether it represents a language at all!

If the Indus collection is indeed a language, then we should see general patterns that we see in other languages. In the same way that vowels and consonants do not occur independently of each other, letters of an alphabet do not occur independently either. Some letters occur more frequently than others in written text (see the Zipf distribution). In English, the letter pair ‘th' occurs very frequently since the word ‘the' is the most frequently used word, but you'll be hard-pressed to find the letter pair ‘wz' in English.

Thus there is a kind of imbalance that can be observed in languages. A measure called information entropy, which was proposed in Claude Shannon's paper we discussed earlier, quantifies this imbalance based on the observed counts/frequencies of letter pairs in a language. If the relative frequencies of pairs of Indus symbols exhibits similarities to the frequencies observed in other linguistic systems, then that provides supporting (but certainly not conclusive) evidence that the symbols constitute a written script.

This is what the Science paper is claiming. The entropy of the Indus symbols was closer to languages – Sumerian, Old Tamil, Sanskrit and English – than it was to the entropy of non-linguistic systems such as DNA sequences, protein sequences and programming languages such as Fortran.

Around 7 years ago when these results were published, I remember they were heavily circulated on social media. It's a cool story for sure – mathematics revealing patterns of an ancient, undeciphered script in the hotly contested ground that is Indian history. However, Richard Sproat, a computational linguist, raised concerns that provide an important counterpoint. As late as June 2014, Sproat was still doggedly pointing out technical issues in the original Science paper!

Whatever the concerns, I did find this type of work intriguing – a clever use of probabilistic approaches. If the data and parameters used in the calculations were made public, it should be possible to replicate the findings and debate the conclusions if necessary.