This page explains the ideas behind ngramflow from the ground up. No prior knowledge is needed, only a genuine curiosity about how language and probability connect.
A generative language model is a system that assigns probabilities to sequences of words. Given some text that has already been produced, it estimates how likely each possible next word is.
You already have an intuition for this. If someone says "I'd like a cup of..." your brain immediately suggests "coffee", "tea", or "water" as likely continuations. "purple" or "democracy" would feel surprising. That intuition is a rough language model built from everything you have ever read or heard.
The question is: how do you teach a machine to do this? The answer that Claude Shannon proposed in 1948 is beautifully simple: count things.
Shannon was interested in the entropy of English, meaning how much information each letter or word actually carries. To measure it, he needed a model of language. His approach: look at how often sequences of characters occur in real text, and use those frequencies as probabilities.
This is exactly what ngramflow does. No neural networks, no learned parameters, no gradient descent. Just counting words and computing probabilities from those counts.
Before a model can count anything, it needs to decide what the basic unit of analysis is. These units are called tokens.
ngramflow offers two choices. Try them below:
Word-level tokens treat each word as one unit. A typical English text might have 10,000 to 100,000 unique words (the vocabulary). The model learns relationships between words.
Character-level tokens treat each letter (and space) as one unit. The vocabulary is tiny: just 26 letters plus a space. But the model has to work harder, learning to construct words from individual characters.
Both approaches are valid and reveal different things. Character-level models are closer to what Shannon originally described in 1948.
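To make the distinction concrete, here is a minimal sketch of both strategies in Python. The function names and the exact cleanup rules are illustrative assumptions, not ngramflow's actual tokenizer:

```python
import re

def word_tokens(text: str) -> list[str]:
    # Lowercase, then split on anything that is not a letter or apostrophe.
    # (Illustrative rule; ngramflow's real tokenizer may differ.)
    return re.findall(r"[a-z']+", text.lower())

def char_tokens(text: str) -> list[str]:
    # Keep only letters and spaces; collapse runs of whitespace to one space.
    cleaned = re.sub(r"\s+", " ", text.lower())
    return [c for c in cleaned if c.isalpha() or c == " "]

sentence = "Alice was beginning to get very tired"
print(word_tokens(sentence))  # ['alice', 'was', 'beginning', ...]
print(char_tokens(sentence))  # ['a', 'l', 'i', 'c', 'e', ' ', 'w', ...]
```

The same sentence becomes 7 word-level tokens or 38 character-level ones, which is exactly the trade-off described above: a small vocabulary in exchange for much longer sequences.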
An n-gram is simply a sequence of n tokens. The number n determines how much context the model looks at when predicting the next token.
A unigram model (n = 1) ignores all context. It predicts based on global word frequency alone: "the" is always likely, "zebra" is always rare.
A bigram model (n = 2) looks at the previous token. "cup of ___" will suggest "tea" more often than "the", because "of tea" is common.
A trigram model (n = 3) looks at the previous two tokens. More specific context leads to more coherent predictions.
The key trade-off: more context = better predictions, but rarer matches. A trigram like "white rabbit hurried" might appear only once in the whole corpus. A unigram like "the" appears thousands of times. When a context is never seen, the model falls back to a simpler model (trigram to bigram, bigram to unigram).
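In code, that fallback is just a short chain of lookups. The sketch below is a simplified illustration in Python; the three count tables are assumed inputs (each mapping a context tuple to a counter of next tokens), not ngramflow's real data structures:

```python
def next_token_counts(context, trigram_table, bigram_table, unigram_table):
    """Back off from trigram to bigram to unigram until a context matches.

    Each table maps a context tuple to {next_token: count}. These tables
    are assumed inputs for illustration, not ngramflow's actual internals.
    """
    key = tuple(context[-2:])
    if key in trigram_table:      # seen the last two tokens together?
        return trigram_table[key]
    key = tuple(context[-1:])
    if key in bigram_table:       # seen the last token on its own?
        return bigram_table[key]
    return unigram_table[()]      # global frequencies: always available
```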
The highlighted tokens show what the model "sees" as context before predicting the next word (shown with a dashed border).
The formula is straightforward. For a bigram model, the probability of word $w_i$ following word $w_{i-1}$ is:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$

In plain English: how many times does $w_i$ follow $w_{i-1}$, divided by how many times $w_{i-1}$ appears in total? The result is a number between 0 and 1.
Example: in the Alice in Wonderland corpus, suppose "alice" appears 95 times. Of those, it is followed by "was" 60 times. Then:

$$P(\text{was} \mid \text{alice}) = \frac{60}{95} \approx 0.63$$
The model computes this for every (context, word) pair it encounters during the build phase. It scans the entire corpus from left to right, counting every consecutive pair. The result is a large lookup table.
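A rough sketch of that build phase for a bigram model, again in Python. The toy corpus and function names are illustrative, not the app's real text or API:

```python
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Scan the token stream left to right, counting every consecutive pair."""
    table = defaultdict(Counter)
    for prev, curr in zip(tokens, tokens[1:]):
        table[prev][curr] += 1
    return table

def bigram_probability(table, prev, curr):
    """count(prev, curr) divided by count(prev): the formula above."""
    total = sum(table[prev].values())
    return table[prev][curr] / total if total else 0.0

# Toy corpus, not the real Alice in Wonderland text.
tokens = "alice was beginning to get very tired and alice was bored".split()
table = build_bigram_table(tokens)
print(bigram_probability(table, "alice", "was"))  # 1.0 in this toy corpus
```

The nested counter is the lookup table: the outer key is the context word, and the inner counts record everything that ever followed it.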
Click a word to see its bigram distribution, based on the actual Alice in Wonderland corpus used in the app.
Notice how different words have very different distributions. Common function words like "the" have diffuse distributions (many possible next words, all with low individual probability), while more specific words like "rabbit" have spiky distributions (a few very probable next words).
This difference in distribution shape is directly related to Shannon's concept of entropy: a flat distribution has high entropy (= high uncertainty), a spiky one has low entropy (= low uncertainty, easier to predict).
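Entropy is easy to compute once you have a distribution: $H = -\sum_i p_i \log_2 p_i$. A minimal sketch; the two example distributions are made up to mimic the "the" and "rabbit" shapes, not taken from the corpus:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

flat  = [0.25, 0.25, 0.25, 0.25]   # "the"-like: many options, all plausible
spiky = [0.85, 0.05, 0.05, 0.05]   # "rabbit"-like: one dominant continuation
print(entropy(flat))   # 2.0 bits: maximum uncertainty over 4 outcomes
print(entropy(spiky))  # ~0.85 bits: much easier to predict
```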
Once the model has a probability distribution over next tokens, it needs to pick one. The obvious choice would be to always pick the most probable token. This is called greedy decoding.
But greedy decoding produces boring and often repetitive text. Imagine a model that always picks "the most common next word": it would quickly get stuck in loops or generate very generic sentences.
Instead, ngramflow uses weighted random sampling: each token is picked proportionally to its probability. A 60% token gets chosen 60% of the time, not always. This produces varied, sometimes surprising output, but still respects the statistical structure of the language.
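Weighted sampling takes only a few lines with Python's standard library. The distribution below is an illustrative stand-in for the "alice" bigram distribution, not the app's exact numbers:

```python
import random
from collections import Counter

# Hypothetical next-word distribution for "alice" (illustrative numbers).
candidates = ["was", "and", "thought", "said"]
weights    = [0.60,  0.20,  0.12,      0.08]

# random.choices draws proportionally to the weights,
# so "was" wins about 60% of the time, not always.
draws = random.choices(candidates, weights=weights, k=10_000)
print(Counter(draws))  # observed frequencies converge toward the weights
```

Run it a few times: "was" dominates but never wins every draw, which is the same convergence behavior the sampling demo below lets you watch live.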
This shows the bigram distribution for "alice". Click "Sample!" to draw a token. The red line shows where the random number landed on the probability scale.
Try clicking "Sample!" many times. You will notice that "was" wins most often (because it has the highest probability), but other words appear too. Over many samples, the observed frequency converges to the true probability.
This stochastic nature is what makes language generation feel more natural. It also mirrors how Shannon originally described his experiments with approximating English text.
Shannon's 1948 paper included a remarkable demonstration. He showed that by conditioning on sequences of characters, a purely statistical model could produce text that resembled English, at least superficially; the same paper extends the experiment to sequences of words.
In the app, switch to Character mode and observe:
N-gram models are the starting point of a long line of development in language modeling. Understanding them makes it much easier to understand what came after.
At every step in this progression, the core task remains the same: predict the next token given the context. What changes is how rich and flexible the model's representation of context can be.
An n-gram model says: "I remember the last 2 words, and I look up what typically comes next." A transformer says: "I consider all previous tokens simultaneously, weigh their relevance through learned attention, and produce a sophisticated probability estimate." The math under the hood is very different, but the output is the same thing: a probability distribution over the next token.
Try unigram first, then bigram, then trigram. Switch between word and character level. Open the "How it works" panel to see the live probability calculation. Notice when and why the model falls back to a simpler model.
Open ngramflow →