Explainer

How does a computer learn to predict language?

This page explains the ideas behind ngramflow from the ground up. No prior knowledge is needed, just a genuine curiosity about how language and probability connect.

What is a generative language model?

A generative language model is a system that assigns probabilities to sequences of words. Given some text that has already been produced, it estimates how likely each possible next word is.

You already have an intuition for this. If someone says "I'd like a cup of..." your brain immediately suggests "coffee", "tea", or "water" as likely continuations. "purple" or "democracy" would feel surprising. That intuition is a rough language model built from everything you have ever read or heard.

The core task of every language model, from the simplest to GPT-4, is the same: given what came before, predict what comes next.

The question is: how do you teach a machine to do this? The answer that Claude Shannon proposed in 1948 is beautifully simple: count things.

"These sequences [of letters], however, are not completely random. In general, they form sentences and have the statistical structure of, say, English. The letter E occurs more frequently than Q, the sequence TH more frequently than XP, etc." A Mathematical Theory of Communication, Bell System Technical Journal (1948)

Shannon was interested in the entropy of English, meaning how much information each letter or word actually carries. To measure it, he needed a model of language. His approach: look at how often sequences of characters occur in real text, and use those frequencies as probabilities.

This is exactly what ngramflow does. No neural networks, no learned parameters, no gradient descent. Just counting words and computing probabilities from those counts.


Tokens: breaking text into pieces

Before a model can count anything, it needs to decide what the basic unit of analysis is. These units are called tokens.

ngramflow offers two choices. Try them below:

Interactive Demo: Token Splitter
"Alice had begun to think that very few things indeed were really impossible."
Choose a mode below to see the tokens.

Word-level tokens treat each word as one unit. A typical English text might have 10,000 to 100,000 unique words (the vocabulary). The model learns relationships between words.

Character-level tokens treat each letter (and space) as one unit. The vocabulary is tiny: just 26 letters plus a space. But the model has to work harder, learning to construct words from individual characters.

Both approaches are valid and reveal different things. Character-level models are closer to what Shannon originally described in 1948.
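
If you want to see the two modes in code, here is a minimal sketch in Python. It is not ngramflow's actual implementation, and the function names are made up for illustration:

```python
import re

def word_tokens(text):
    # Lowercase and keep runs of letters (and apostrophes), so "Alice," and "alice" become the same token.
    return re.findall(r"[a-z']+", text.lower())

def char_tokens(text):
    # Every letter and space is its own token; punctuation is dropped here for simplicity.
    return [c for c in text.lower() if c.isalpha() or c == " "]

sentence = "Alice had begun to think that very few things indeed were really impossible."
print(word_tokens(sentence)[:5])  # ['alice', 'had', 'begun', 'to', 'think']
print(char_tokens(sentence)[:5])  # ['a', 'l', 'i', 'c', 'e']
```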


N-grams: learning from sequences

An n-gram is simply a sequence of n tokens. The number n determines how much context the model looks at when predicting the next token.

Unigram (n=1)

P(w)

Ignores all context. Predicts based on global word frequency alone. "The" is always likely, "zebra" is always rare.

Bigram (n=2)

P(w | w₋₁)

Looks only at the single previous token. "cup of ___" will suggest "tea" more often than "the" because "of tea" is common.

Trigram (n=3)

P(w | w₋₂, w₋₁)

Looks at the previous two tokens. More specific context leads to more coherent predictions.

The key trade-off: more context = better predictions, but rarer matches. A trigram like "white rabbit hurried" might appear only once in the whole corpus. A unigram like "the" appears thousands of times. When a context is never seen, the model falls back to a simpler model (trigram to bigram, bigram to unigram).
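
That fallback can be sketched in a few lines of Python. Assume count tables keyed by context tuples (a simplified stand-in for whatever ngramflow stores internally); the model tries the longest context first and backs off when it has never seen it:

```python
def next_word_counts(tokens_so_far, trigram_counts, bigram_counts, unigram_counts):
    # Try the two-token context first (trigram)...
    context = tuple(tokens_so_far[-2:])
    if len(context) == 2 and context in trigram_counts:
        return trigram_counts[context]
    # ...fall back to the one-token context (bigram)...
    context = tuple(tokens_so_far[-1:])
    if context in bigram_counts:
        return bigram_counts[context]
    # ...and finally to plain word frequencies (unigram).
    return unigram_counts
```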

Interactive Demo: Context Window

The highlighted tokens show what the model "sees" as context before predicting the next word (shown with a dashed border).


How probabilities are calculated

The formula is straightforward. For a bigram model, the probability of a word w following the previous word w₋₁ is:

P(w | w₋₁)  =  count(w₋₁, w)  /  count(w₋₁, *)

In plain English: how many times does w follow w₋₁, divided by how many times w₋₁ appears in total? The result is a number between 0 and 1.

Example: In the Alice in Wonderland corpus, suppose "alice" appears 95 times. Of those, it is followed by "was" 60 times. Then:

P("was" | "alice")  =  60 / 95  =  0.632  ≈  63.2%
P("found" | "alice") = 15 / 95 = 15.8%
P("opened" | "alice") = 10 / 95 = 10.5%
... and so on for all other words that follow "alice"

The model computes this for every context-word pair it encounters during the build phase: it scans the entire corpus from left to right, counting every consecutive pair of tokens. The result is a large lookup table mapping each context to the counts of the words that follow it.
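
A toy version of that build phase in Python, assuming the corpus has already been tokenized into a list of words (a sketch of the idea, not ngramflow's source):

```python
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    # Map each word to a Counter of the words observed directly after it.
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def bigram_probs(table, prev):
    # count(prev, w) / count(prev, *) for every observed follower w.
    followers = table[prev]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}

tokens = "alice was beginning to get very tired and alice was considering".split()
print(bigram_probs(build_bigram_table(tokens), "alice"))  # {'was': 1.0}
```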

Interactive Demo: Bigram Frequency Explorer

Click a word to see its bigram distribution, based on the actual Alice in Wonderland corpus used in the app.

Notice how different words have very different distributions. Common function words like "the" have diffuse distributions (many possible next words, all with low individual probability), while more specific words like "rabbit" have spiky distributions (a few very probable next words).

This difference in distribution shape is directly related to Shannon's concept of entropy: a flat distribution has high entropy (= high uncertainty), a spiky one has low entropy (= low uncertainty, easier to predict).
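
Entropy is a one-line computation once you have a distribution. Here is a quick sketch with made-up numbers, not the real corpus counts:

```python
import math

def entropy(dist):
    # Shannon entropy in bits: H = -sum(p * log2(p)) over all outcomes.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}    # diffuse, like "the"
spiky = {"was": 0.85, "found": 0.10, "opened": 0.05}   # concentrated, like "rabbit"
print(entropy(flat))   # 2.0 bits: maximum uncertainty over four options
print(entropy(spiky))  # roughly 0.75 bits: much easier to predict
```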


Sampling: why not always pick the most likely word?

Once the model has a probability distribution over next tokens, it needs to pick one. The obvious choice would be to always pick the most probable token. This is called greedy decoding.

But greedy decoding produces boring and often repetitive text. Imagine a model that always picks "the most common next word": it would quickly get stuck in loops or generate very generic sentences.

Instead, ngramflow uses weighted random sampling: each token is picked proportionally to its probability. A 60% token gets chosen 60% of the time, not always. This produces varied, sometimes surprising output, but still respects the statistical structure of the language.
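
The sampling step itself is only a few lines. A sketch using Python's standard library, with the example "alice" distribution from earlier (the "said" entry is a stand-in for the remaining probability mass):

```python
import random

alice_next = {"was": 0.632, "found": 0.158, "opened": 0.105, "said": 0.105}

def sample(dist):
    # Draw one token, with probability proportional to its weight.
    words = list(dist)
    return random.choices(words, weights=[dist[w] for w in words], k=1)[0]

greedy = max(alice_next, key=alice_next.get)    # greedy decoding: always picks "was"
print(greedy)
print([sample(alice_next) for _ in range(5)])   # mostly "was", but not always
```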

Interactive Demo: Weighted Sampling

This shows the bigram distribution for "alice". Click "Sample!" to draw a token. The red line shows where the random number landed on the probability scale.

Try clicking "Sample!" many times. You will notice that "was" wins most often (because it has the highest probability), but other words appear too. Over many samples, the observed frequency converges to the true probability.

This stochastic nature is what makes language generation feel more natural. It also mirrors how Shannon originally described his experiments with approximating English text.


The character-level model: Shannon's original experiment

Shannon's 1948 paper included a remarkable demonstration. He showed that by conditioning on sequences of characters (not words), a purely statistical model could produce text that resembled English, at least superficially.

In the app, switch to Character mode and observe:

1. Unigram (char): Random noise. Letters appear roughly in proportion to their frequency in English (e is common, z is rare), but there is no structure.

2. Bigram (char): Common English letter pairs ("th", "he", "in") start appearing. Vowels tend to follow consonants and vice versa. It looks like scrambled words.

3. Trigram (char): Recognizable words and word fragments emerge. Some real English words appear. Spaces occur at plausible positions. The output starts to feel language-like.

The key insight: something that looks like language can emerge purely from counting character sequences in text, with no understanding of meaning whatsoever. This tells us that a large part of what makes text look "English" is its statistical regularities, not its semantics.
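
You can reproduce a miniature version of Shannon's experiment in a dozen lines: count character trigrams in some text, then repeatedly predict and sample. A sketch of the idea (the tiny corpus here is just for demonstration):

```python
import random
from collections import Counter, defaultdict

def build_char_trigrams(text):
    # Map each two-character context to a Counter of the characters that follow it.
    table = defaultdict(Counter)
    for a, b, c in zip(text, text[1:], text[2:]):
        table[a + b][c] += 1
    return table

def generate(table, seed, length=80):
    out = seed
    for _ in range(length):
        followers = table.get(out[-2:])
        if not followers:
            break  # unseen context; a real model would back off to a bigram here
        chars, weights = zip(*followers.items())
        out += random.choices(chars, weights=weights, k=1)[0]
    return out

corpus = "alice was beginning to get very tired of sitting by her sister on the bank"
print(generate(build_char_trigrams(corpus), "al"))
```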

From n-grams to modern language models

N-gram models are the starting point of a long line of development in language modeling. Understanding them makes it much easier to understand what came after.

1. N-gram models (Shannon, 1948 onwards): Count-based, with no parameters to train, transparent and interpretable. Limited by data sparsity: rare contexts get poor estimates, and the context window is fixed and small (1-2 tokens).

2. Neural n-gram models (Bengio et al., 2003): Replace the lookup table with a neural network. Instead of counting, the model learns a continuous representation (embedding) for each word. Better generalization to unseen contexts.

3. Recurrent neural networks (2010s): Process tokens one by one with a hidden state that carries information forward. The context window is unlimited in theory, but in practice the model forgets distant tokens.

4. Transformers (Vaswani et al., 2017): Use "attention" to directly compare every token with every other token in the context. No fixed window, no forgetting. Scale to billions of parameters. The basis of GPT, Claude, Gemini, and others.

At every step in this progression, the core task remains the same: predict the next token given the context. What changes is how rich and flexible the model's representation of context can be.

An n-gram model says: "I remember the last 2 words, and I look up what typically comes next." A transformer says: "I consider all previous tokens simultaneously, weigh their relevance through learned attention, and produce a sophisticated probability estimate." The math under the hood is very different, but the output is the same thing: a probability distribution over the next token.

ngramflow is showing you that core loop in its simplest possible form. Every time you press "Next Token", the model executes exactly the same predict-and-sample cycle that underpins the most powerful language models in the world today.

Go explore the app

Try unigram first, then bigram, then trigram. Switch between word and character level. Open the "How it works" panel to see the live probability calculation. Notice when and why the model falls back to a simpler model.

Open ngramflow →