by Hari Balasubramanian
Is probabilistic analysis of any use in studying text – sequences of letters or sequences of words? Can a computer generate meaningful sentences by learning statistical properties such as how often certain strings of words or sentences occur in succession? What other uses could there be of such analysis? These were some questions I had this year as I collected material to teach a course on a special class of probability models called Markov chains. The models owe their name to the Russian mathematician Andrey Markov, who first proposed them in a 1906 paper titled “Extension of the law of large numbers to dependent quantities”.
The key phrase, as we shall see, is ‘dependent quantities’. Broadly speaking, Markov models are applications of that basic rule of conditional probability, P(A|B): the probability of an event A happening, given that an event B has occurred. The uses of Markov chains are many and varied – from the transmission of genes through generations, to the analysis of queues in telecommunication networks, to the movements of particles in physics. In 2006 – the 100th anniversary of Markov's paper – Philipp von Hilgers and Amy Langville summarized the five greatest applications of Markov chains. These include the one that is unknowingly used by most of us on a daily basis: every time we search on the internet, the ranking of webpages is based on the solution to a massive Markov chain.
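To make that last point concrete, here is a minimal sketch of the idea – not Google's actual PageRank algorithm, which among other refinements adds a damping factor, and with a toy three-page web invented purely for illustration. Each row of the matrix is a conditional distribution: the probability of the next page given the current one. Repeatedly applying the matrix to a distribution yields the chain's stationary distribution, and pages that accumulate more probability mass rank higher.

```python
import numpy as np

# A hypothetical web of 3 pages; P[i][j] is the probability of following
# a link from page i to page j. Each row is a conditional distribution:
# P(next page | current page) -- Markov's "dependent quantities".
P = np.array([
    [0.0, 0.5, 0.5],   # page 0 links to pages 1 and 2
    [1.0, 0.0, 0.0],   # page 1 links back to page 0
    [0.5, 0.5, 0.0],   # page 2 links to pages 0 and 1
])

# Power iteration: multiply a distribution by P until it stops changing.
# The fixed point is the stationary distribution of the chain.
rank = np.ones(3) / 3   # start with all pages equally likely
for _ in range(100):
    rank = rank @ P

print(rank)  # pages with more incoming probability mass rank higher
```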
The focus of this piece, however, is the analysis of letter and word sequences as they appear in text. In what follows, I'll look at four examples where Markov models play a role.
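Before turning to those examples, here is a minimal sketch – my own illustration, not drawn from any of the works discussed below – of how a computer might "learn" word sequences in the way the opening questions suggest. It counts which words follow which in a corpus, then generates text by sampling each next word conditioned only on the current one. The tiny corpus and the function names are invented for illustration.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Record, for each word, the words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def generate(chain, start, length=10):
    """Walk the chain: each next word depends only on the current word."""
    word, output = start, [start]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:
            break  # reached a word with no recorded successor
        word = random.choice(followers)
        output.append(word)
    return " ".join(output)

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
print(generate(chain, "the"))
```

Because duplicates are kept in each word's follower list, sampling uniformly from that list reproduces the observed conditional frequencies: after "sat", the word "on" always follows, while after "the" any of "cat", "mat", "dog", or "rug" may appear.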
