Markov Chains: How Memoryless Processes Power Search and Speech
Every time you use Google, the search engine ranks pages using a branch of mathematics invented in 1906 by a combative Russian atheist who was trying to win an argument about God and free will. Every time you speak to a voice assistant and it understands you, it relies on the same framework. Markov chains — named after Andrei Markov — are one of the most quietly consequential ideas in the history of mathematics. And their origin story is stranger than anything you'd expect.
The Concept
A Markov chain is a mathematical model for a sequence of events where the probability of what happens next depends only on the present state — not on anything that came before. This is called the Markov property, or the memoryless property.
Think of it like a board game. When it's your turn, the only thing that matters is which square you're currently on. It doesn't matter whether you got there in three moves or thirty, whether you were winning last round or struggling — all of that history is irrelevant. Only the current square determines your probabilities for the next roll.
That's the Markov property: the present is a complete summary of the past. Everything you need to predict the future is already encoded in where you are right now.
This might sound like a severe restriction — surely history matters sometimes? In practice it isn't, because you get to define what counts as a 'state.' If you need your system to remember something, you build that something into the state itself. The Markov property is always satisfiable; the trick is choosing the right level of description.
Why It Matters
The applications of Markov chains span nearly every field of science and technology, and a few surprising ones that have nothing to do with either.
Google's PageRank. When Sergey Brin and Larry Page founded Google in 1998, their core insight was to model web browsing as a Markov chain. Imagine a random surfer who clicks links at random — following whatever link appears on the current page, with some small probability of teleporting to a completely random page instead. Each webpage is a state; each hyperlink is a transition. Now ask: in the long run, what fraction of time does this random surfer spend on each page? That fraction is PageRank — the most influential website ranking algorithm in history. The math that computes it is Markov's 1906 framework. A company now worth trillions of dollars is, at its core, finding the long-run behavior of a very large Markov chain.
Speech Recognition. For decades, from roughly the 1980s through the 2010s, voice recognition technology ran on Hidden Markov Models (HMMs). When you speak a sentence, you produce a sequence of sounds. The words underneath those sounds — the phonemes you're actually saying — form a Markov chain: each phoneme has a probability of transitioning to the next. The 'hidden' part is that you can only observe the acoustic signal, not the words directly. HMMs solved the inverse problem: given the sounds, what's the most likely word sequence underneath? This powered Dragon Dictate, early Siri, call-center automation, and decades of voice research.
Markov Chain Monte Carlo. Perhaps the most powerful and surprising application. MCMC is a family of algorithms that use Markov chains to sample from complex probability distributions that are too complicated to sample from directly. The key insight: design a chain whose long-run behavior is the distribution you want. Run it long enough and your samples describe the distribution. MCMC has become indispensable in Bayesian statistics, climate modeling, drug discovery, evolutionary biology, cosmology, and even archaeology. The Metropolis algorithm — the original MCMC method — was invented at Los Alamos in 1953 to simulate atomic bomb physics. Nobody imagined it would later be used to reconstruct evolutionary trees from ancient DNA.
Cryptanalysis. MCMC has been used to crack substitution ciphers — the kind where each letter is replaced by another letter. The algorithm proposes random swaps in a decryption key, evaluates how 'English-like' the decoded result looks using a statistical model of letter transitions (itself a Markov chain), then accepts or rejects each swap probabilistically. Given enough time, it converges on the correct key without any prior knowledge of it. It works because English has strong Markov structure: 'q' is almost always followed by 'u,' 'th' appears far more often than 'ht,' certain words are far more likely to follow certain other words. Markov chains model that structure; MCMC exploits it to reverse-engineer secrets.
The Details
How the Math Works
Every Markov chain has a set of states and a transition matrix — a table of probabilities. Each row represents a current state; each column represents a next state; each entry gives the probability of that transition. Every row must sum to 1, because you always go somewhere.
Here's a simple example for weather with three states — Sunny, Cloudy, and Rainy:
If today is sunny, there's a 60% chance of sun tomorrow, 30% chance of clouds, 10% chance of rain. If today is rainy, tomorrow is most likely to be cloudy (50%). To forecast two days ahead, you simply multiply the matrix by itself. Three days ahead: multiply three times. The matrix powers do all the work — no need to enumerate every possible path.
The Long Run: Stationary Distributions
Here's where Markov chains become philosophically interesting. For a chain that has certain technical properties — it can reach any state from any other state, and it doesn't cycle with a fixed period — the distribution of the chain at time t converges to a fixed stationary distribution as t grows. No matter where you start, the chain eventually 'forgets' its initial state and settles into this stable pattern.
For our weather model, if you ran it for years, you might find that about 45% of days are sunny, 35% cloudy, and 20% rainy — regardless of whether you started in a drought or a monsoon. That's the stationary distribution.
For Google's random surfer, the stationary distribution is PageRank — the fraction of time the surfer spends on each page in the infinite long run. Important pages collect more traffic; unimportant ones receive less. The algorithm that runs the world's dominant search engine is computing a probability vector.
The Ergodic Theorem
There's a deep mathematical theorem underlying all of this: the Ergodic Theorem. It says that for a well-behaved Markov chain, the time-average of any single long trajectory equals the ensemble average across all possible trajectories. In other words, watching one chain run for a very long time gives you the same information as watching infinitely many chains run simultaneously for one step.
This is deeply counterintuitive but enormously useful. It's why MCMC works: you don't need to explore all possible configurations of a system simultaneously. You can follow a single cleverly designed Markov chain for long enough, and your samples will correctly represent the target distribution. The time dimension substitutes for the population dimension.
Hidden Markov Models
The most technically sophisticated application is the Hidden Markov Model. Regular Markov chains assume you can directly observe the state. HMMs relax this: you only observe noisy, indirect signals that depend on an underlying hidden state sequence. Two algorithms make HMMs practical: the Viterbi algorithm, which finds the most likely hidden state sequence given the observations, and the Baum-Welch algorithm, which learns the transition probabilities automatically from data. Together they turned HMMs into the workhorse of voice recognition, gesture detection, and genomic analysis for thirty years.
The Ancestor of Language Models
There's a direct intellectual lineage from Markov's 1906 framework to the AI systems of today. In 1948, Claude Shannon used a Markov chain over letters and words to model English text. His 'first-order approximation' sampled each letter independently from English frequencies — producing strings like 'OCRO HLI RGWR NMIELWIS EU LL.' His second-order model sampled each letter given the previous letter — producing 'ON IE ANTSOUTINYS ARE T INCTORE.' Higher-order models looked more and more like English. Shannon's demonstration showed that statistical dependencies between adjacent symbols capture real linguistic structure. That insight, massively scaled, is what powers modern large language models. GPT and its successors are doing essentially what Markov described in 1906 — but with billions of parameters tracking dependencies across hundreds of tokens instead of just one or two.
The Unlikely Origin Story
The history of Markov chains is one of the stranger origin stories in mathematics. Andrei Markov (1856–1922) was a student of Pafnuty Chebyshev at Saint Petersburg Imperial University. He was legendarily combative — known as 'Andrey the Furious' — and deeply opposed to religion. He eventually petitioned the Russian Orthodox Church to formally excommunicate him, and they obliged.
His mathematical nemesis was Pavel Nekrasov, a devoutly Orthodox mathematician who had published an argument in 1902 fusing probability theory with theology. Nekrasov claimed that the Law of Large Numbers — which explains why averages become predictable for large samples — only applied to independent events. Human voluntary acts, he argued, are independent because they express free will. Therefore, social statistics work because people have free will, and free will implies divine order. Mathematics, in Nekrasov's formulation, was evidence for God.
Markov found this infuriating. He set out to demolish the argument mathematically by proving that the Law of Large Numbers could apply perfectly well to dependent variables — that independence was not required. He succeeded, publishing the foundational paper in 1906.
Then, seven years later, he did something remarkable: to demonstrate the theory empirically, he hand-counted the first 20,000 letters of Pushkin's verse novel Eugene Onegin, classifying each letter as a vowel or consonant. He arranged them in 200 grids of 10×10 characters and counted pairs. He found 8,638 vowels and 11,362 consonants. Crucially, he found that the probability of any given letter being a vowel depended on whether the preceding letter was a vowel — consonant-consonant pairs appeared far more often than independent sampling would predict. The letter sequence was a Markov chain. He presented this to the Imperial Academy of Sciences on January 23, 1913 — the first computational linguistics experiment, conducted by hand, in 19th-century Russia.
The argument he won? Whether mathematics proves free will. The tool he invented to win it? One of the foundations of modern artificial intelligence.
Takeaways
- The Markov property says that the future is independent of the past, given the present — the current state contains all the information needed to predict what comes next.
- Transition matrices encode all probabilities in a chain; raising the matrix to a power gives multi-step forecasts without enumerating paths.
- Stationary distributions describe the long-run behavior of ergodic chains and are the mathematical foundation of Google's PageRank.
- Markov chains were invented in 1906 by Andrei Markov to refute a theological argument — the first empirical application was analyzing the vowel-consonant structure of Pushkin's Eugene Onegin in 1913.
- The applications are vast: speech recognition, search ranking, Bayesian statistics, drug discovery, cryptanalysis, evolutionary biology, and the ancestor of every large language model in use today.
The universe of Markov chains is vast. We've barely traced the outline. But at the center of it all is a simple, powerful idea: that sometimes, all the history you need is already here — in the present moment, in the state you're in right now.