Language modeling is at the heart of many Natural Language Processing systems. Whenever a system needs to guess the next word in a sequence, evaluate how natural a sentence sounds, or choose between alternative outputs, it relies on some estimate of how likely different word sequences are.
Older applications such as speech recognition, spelling correction and machine translation depended heavily on statistical language models. Modern transformer-based models can be viewed as very powerful neural language models that generalise the same basic idea. For this reason, understanding classical N-gram language models, smoothing and backoff remains extremely valuable even in the deep learning era.
This lecture explains what a language model is, how N-gram models approximate full sequence probabilities, why raw counts are not enough and how smoothing and backoff help handle sparsity. It also introduces perplexity as a standard evaluation measure and connects the concepts to practical NLP tasks.
What is a language model
Informally, a language model is a probability distribution over sequences of words. It answers questions such as:
- How likely is the sentence “The students submitted the assignment on time.”?
- Which of two sentences is more probable: “She likes to read books.” or “She like to reads book.”?
- Given the partial sentence “The weather today is very”, which word is a more probable continuation: “hot” or “elephant”?
Formally, for a sequence of words w1, w2, …, wN, a language model assigns a probability P(w1, w2, …, wN). Using the chain rule of probability, this can be written as:
P(w1, w2, …, wN) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × … × P(wN | w1, …, wN−1), i.e. the product over i from 1 to N of P(wi | w1, …, wi−1).
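For instance, applying the chain rule to the short sentence “the dog barks” gives:
P(the, dog, barks) = P(the) × P(dog | the) × P(barks | the, dog).
Even in this three-word example the last factor conditions on the entire preceding history.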
The chain rule decomposition is exact but not directly usable, because the later factors condition on ever longer histories, which are far too sparse to estimate reliably from data. N-gram models introduce a simplifying assumption to make the computation tractable.
The N-gram approximation
The key idea behind N-gram models is that the probability of a word depends mainly on a limited number of preceding words rather than the entire history. This is a Markov assumption of order N − 1.
For a bigram model, the probability of wi depends only on wi−1:
P(wi | w1, …, wi−1) ≈ P(wi | wi−1).
For a trigram model, the probability of wi depends on the previous two words:
P(wi | w1, …, wi−1) ≈ P(wi | wi−2, wi−1).
More generally, an N-gram model uses the N − 1 previous words as context.
This simplification allows probabilities to be estimated by counting how often different word sequences occur in a corpus. For example, the bigram probability P(time | on) is estimated as count(“on time”) / count(“on”) in the training corpus.
For a complete theoretical treatment of N-gram language models and smoothing, see the N-gram chapter in Speech and Language Processing.
Maximum likelihood estimation from a corpus
Assume there is a large corpus of text where all N-gram counts can be computed.
- For a unigram model, the probability of a word w is estimated as P(w) = count(w) / (total number of tokens in the corpus).
- For a bigram model, the conditional probability of word w2 given w1 is P(w2 | w1) = count(w1 w2) / count(w1).
- For a trigram model, the conditional probability of word w3 given w1 and w2 is P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
These estimates are called maximum likelihood estimates because they are the probabilities that make the observed corpus most likely under the model.
A small example makes this concrete. Suppose a tiny corpus consists of the following sentences:
- the dog chased the cat
- the cat chased the mouse
If we treat each sentence as starting with a special start token <s> and ending with an end token </s>, we can compute bigram counts. For example, count(<s> the) = 2, count(the dog) = 1 and count(the cat) = 2. Since “the” occurs 4 times in total, the estimated bigram probability P(dog | the) is 1/4 and the probability P(cat | the) is 2/4 = 1/2.
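As a quick sanity check, a few lines of Python reproduce these counts and probabilities. This is only a small sketch for the two-sentence corpus above, with <s> and </s> standing in for the start and end tokens.

from collections import Counter

# Toy corpus from the example above, written with explicit start/end tokens
sentences = [
    "<s> the dog chased the cat </s>",
    "<s> the cat chased the mouse </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)                     # word counts
    bigrams.update(zip(tokens, tokens[1:]))     # adjacent word pairs

# Maximum likelihood estimates: count(w1 w2) / count(w1)
print(bigrams[("<s>", "the")])                        # 2
print(bigrams[("the", "dog")] / unigrams["the"])      # 1/4 = 0.25
print(bigrams[("the", "cat")] / unigrams["the"])      # 2/4 = 0.5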
In real systems, corpora are many orders of magnitude larger and N-grams up to order three or four are common. Even so, a serious issue arises: data sparsity.
The problem of data sparsity
Human language is extremely rich. Even large corpora cannot contain every possible sequence of words. For example, a trigram such as “blue quantum toaster” may not appear in any training data, even though the phrase is perfectly grammatical. Under the maximum likelihood formula, any unseen N-gram receives probability zero.
This creates two problems.
- Zero probability sentences. A single unseen trigram in a sentence leads to the entire sentence receiving probability zero, which is unrealistic.
- Poor generalisation. N-gram probabilities based only on raw counts do not generalise well to rare events, and the model may overfit frequent patterns.
Smoothing addresses these issues by redistributing some probability mass from seen events to unseen ones so that nothing has zero probability and rare events are handled more sensibly.
Basic smoothing methods
Several smoothing techniques have been developed. Only the most important ones are outlined here to build intuition.
5.1 Add-one and add-k smoothing
Add-one smoothing, also called Laplace smoothing, adds 1 to every N-gram count before normalising. For a bigram model:
P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V).
Here V is the vocabulary size. The +1 in the numerator ensures that every possible continuation receives at least a small positive probability, while the +V in the denominator keeps the distribution normalised.
A more flexible version adds a fractional constant k between 0 and 1 instead of 1. This is called add-k smoothing. It reduces the distortion of probabilities for frequent N-grams.
Although conceptually simple, add-one smoothing often performs poorly on real language data because it overestimates the probability of unseen events, especially when the vocabulary is large.
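To make the formula concrete, consider add-one smoothing on the toy two-sentence corpus from earlier. Counting <s> and </s>, the vocabulary contains V = 7 word types, “the” occurs 4 times and “cat” occurs 2 times, so:
P(dog | the) = (1 + 1) / (4 + 7) = 2/11 ≈ 0.18
P(the | cat) = (0 + 1) / (2 + 7) = 1/9 ≈ 0.11
Compared with the maximum likelihood estimates (0.25 and 0), probability mass has been shifted from seen bigrams such as “the dog” to unseen ones such as “cat the”, which is exactly the purpose of smoothing.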
5.2 Good–Turing discounting
Good–Turing smoothing is based on the idea that the total probability mass of unseen events should be related to the number of events that occur once, twice and so on in the data.
Instead of using the raw count c for an N-gram, Good–Turing uses an adjusted count c* computed from the distribution of counts. N-grams that occur once have their counts reduced slightly, and the removed probability mass is assigned to the many N-grams that never occurred.
While the detailed formulas require some care, the core idea is redistribution: frequent events are discounted, and the freed probability mass is given to unseen events in a principled way.
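The following sketch shows only the basic adjustment: the adjusted count is c* = (c + 1) · N(c+1) / N(c), where N(c) is the number of distinct N-grams observed exactly c times, and the mass reserved for unseen events is N(1) divided by the total number of observed tokens. The bigram counts below are invented for illustration, and a real implementation would additionally smooth the N(c) values, which this toy version omits.

from collections import Counter

# Hypothetical bigram counts, for illustration only
bigram_counts = Counter({
    ("the", "cat"): 3, ("the", "dog"): 2, ("a", "dog"): 2,
    ("dog", "barks"): 1, ("cat", "sleeps"): 1, ("a", "cat"): 1,
})

# N(c) = number of distinct bigrams that occur exactly c times
count_of_counts = Counter(bigram_counts.values())
total = sum(bigram_counts.values())

def adjusted_count(c):
    """Good–Turing adjusted count c* = (c + 1) * N(c+1) / N(c)."""
    n_c, n_next = count_of_counts[c], count_of_counts[c + 1]
    return (c + 1) * n_next / n_c if n_c and n_next else None

print("mass reserved for unseen bigrams:", count_of_counts[1] / total)  # 3/10
print("adjusted count for c = 1:", adjusted_count(1))                   # 2 * 2/3 ≈ 1.33
print("adjusted count for c = 2:", adjusted_count(2))                   # 3 * 1/2 = 1.5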
5.3 Kneser–Ney smoothing
Kneser–Ney smoothing is widely considered one of the most effective methods for N-gram language models. At a high level, it combines discounting with an intelligent way of backing off to lower order models.
Instead of basing unigram probabilities purely on frequency, Kneser–Ney considers how many different contexts a word appears in. A word that appears in many different contexts, such as “of”, receives higher base probability than a word that only appears frequently in a single fixed phrase.
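The sketch below illustrates only this continuation-count idea, not the full Kneser–Ney recursion with discounting: a word's base probability is made proportional to the number of distinct words that precede it. The bigrams are invented for illustration; “francisco” occurs as often as “of”, but almost always after the same single word.

from collections import Counter

# Hypothetical bigram tokens, for illustration only
bigrams = [
    ("san", "francisco"), ("san", "francisco"), ("san", "francisco"),
    ("cup", "of"), ("out", "of"), ("kind", "of"),
]

# Continuation count: how many distinct left contexts each word appears after
distinct_bigrams = set(bigrams)
continuation = Counter(word for _, word in distinct_bigrams)

# P_continuation(w) = |{w' : count(w' w) > 0}| / number of distinct bigram types
for word in ("francisco", "of"):
    print(word, continuation[word] / len(distinct_bigrams))
# "of" gets 0.75 while "francisco" gets only 0.25, despite equal raw frequency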
The full algorithm is beyond the scope of a basic lecture, but it is important to know that modern N-gram toolkits often use some variant of Kneser–Ney smoothing by default because of its strong empirical performance.
Backoff and interpolation
In many situations, even with smoothing, higher order N-grams remain unreliable when counts are very low. Backoff and interpolation are strategies for combining higher order and lower order models in a sensible way.
Backoff works by using the highest order N-gram that has sufficient evidence. If a particular trigram has zero or very low count, the model “backs off” to the corresponding bigram or unigram. The final probability is computed from the lower order model, often with a scaling factor to ensure proper normalisation.
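As a rough illustration, the sketch below uses an unnormalised back-off score in the spirit of “stupid backoff”, with an assumed fixed scaling factor alpha = 0.4; a proper Katz-style backoff would instead use discounted counts and computed backoff weights so that the resulting probabilities sum to one.

def backoff_score(w1, w2, w3, trigram_counts, bigram_counts, unigram_counts,
                  total_tokens, alpha=0.4):
    """Unnormalised score for w3 following the context (w1, w2)."""
    if trigram_counts.get((w1, w2, w3), 0) > 0:
        # Enough evidence at the trigram level: use the trigram relative frequency
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:
        # Back off to the bigram, scaled by alpha
        return alpha * bigram_counts[(w2, w3)] / unigram_counts[w2]
    # Back off all the way to the unigram
    return alpha * alpha * unigram_counts.get(w3, 0) / total_tokens

# Tiny invented counts for illustration
trigrams = {("the", "dog", "chased"): 1}
bigrams = {("the", "dog"): 1, ("dog", "ate"): 1}
unigrams = {"the": 2, "dog": 1, "ate": 1, "chased": 1}
print(backoff_score("the", "dog", "chased", trigrams, bigrams, unigrams, total_tokens=5))  # 1.0
print(backoff_score("the", "dog", "ate", trigrams, bigrams, unigrams, total_tokens=5))     # 0.4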
Interpolation always combines models of different orders rather than choosing one. For example, a trigram probability can be written as a weighted sum of trigram, bigram and unigram components.
P(wi | wi−2, wi−1) = λ1 · P_trigram(wi | wi−2, wi−1) + λ2 · P_bigram(wi | wi−1) + λ3 · P_unigram(wi).
The weights λ1, λ2 and λ3 are chosen so that they are non-negative and sum to one. They may be set by hand or estimated from held-out data to optimise performance. Interpolation ensures that lower order models always contribute, providing robustness when specific higher order contexts are rare.
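Below is a minimal sketch of linear interpolation on the toy corpus, mixing only bigram and unigram maximum likelihood estimates with fixed example weights; a real system would add a trigram component and tune the weights on held-out data.

from collections import Counter

sentences = [
    "<s> the dog chased the cat </s>",
    "<s> the cat chased the mouse </s>",
]
unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
total = sum(unigrams.values())

def interpolated_prob(prev_word, word, lam_bi=0.7, lam_uni=0.3):
    """Weighted mix of the bigram and unigram maximum likelihood estimates."""
    p_bi = bigrams[(prev_word, word)] / unigrams[prev_word] if unigrams[prev_word] else 0.0
    p_uni = unigrams[word] / total
    return lam_bi * p_bi + lam_uni * p_uni

print(interpolated_prob("the", "cat"))  # seen bigram: 0.7 * 2/4 + 0.3 * 2/14
print(interpolated_prob("cat", "the"))  # unseen bigram still gets 0.3 * 4/14 from the unigram term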
Backoff and interpolation ideas are closely connected to the way modern neural models mix information from different layers and contexts, so understanding them at the N-gram level builds intuition for more advanced architectures.
Evaluating language models with perplexity
To compare different language models and smoothing strategies, an objective evaluation metric is necessary. Perplexity is the standard measure for this purpose.
Given a language model and a test set of word sequences that were not used for training, perplexity is defined as the inverse of the geometric mean per-word probability that the model assigns to the test data. In practice it is computed from the average log probability per word:
perplexity = exp(−(1/M) · Σ log P(wi | history)), where M is the total number of words in the test data.
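In code, the computation follows the definition directly: average the log probabilities the model assigns to the test words, negate, and exponentiate. The per-word probabilities below are made up purely to illustrate the calculation.

import math

def perplexity(word_probs):
    """Perplexity from per-word model probabilities: exp(-(1/M) * sum(log p))."""
    m = len(word_probs)
    avg_log_prob = sum(math.log(p) for p in word_probs) / m
    return math.exp(-avg_log_prob)

# A weaker model is more "surprised" by the test words and gets higher perplexity
print(perplexity([0.2, 0.1, 0.25, 0.05]))  # ≈ 8.0
print(perplexity([0.5, 0.4, 0.6, 0.45]))   # ≈ 2.1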
Intuitively, perplexity measures how “surprised” the model is by the test data. Lower perplexity indicates a better model, because it assigns higher probability to the sequences that actually occur in the language.
Perplexity is not a direct measure of task performance, but there is usually a strong correlation between lower perplexity and better results in tasks such as speech recognition and machine translation, where the language model is part of a larger system.
Real world applications of N-gram language models
Language models play a central role in many classical NLP applications.
First, in speech recognition systems, the acoustic model proposes several possible word sequences based on audio. The language model scores these candidates and helps pick the most plausible sentence. A trigram model with good smoothing can significantly reduce errors by preferring grammatically correct and semantically natural word sequences.
Second, in spelling correction and query suggestion, language models help evaluate entire sentences rather than isolated words. For example, in the phrase “I like to ear cake”, the word “ear” is valid but unlikely after “to” in this context. A language model can determine that “eat” yields a much higher probability for the full sentence and therefore suggest it as the correction.
Third, in statistical machine translation, language models in the target language are used to ensure that translated output is fluent. Among several possible translations that are grammatically correct at the phrase level, the one with the lowest perplexity under the target language model is often selected.
Finally, even neural sequence models can benefit from ideas developed for N-gram models. Concepts such as next word prediction, context windows, smoothing and evaluation via perplexity carry over directly to modern transformer-based architectures.
Step by step algorithm for training an N-gram language model
The basic training procedure for an N-gram language model can be summarised in a sequence of clear steps.
Step 1. Choose the order N of the model, for example unigram, bigram or trigram.
Step 2. Collect a large corpus of text representative of the domain of interest. Add special start and end tokens to each sentence.
Step 3. Preprocess the corpus using the techniques from earlier lectures. This includes normalisation, tokenisation and optional handling of rare words by mapping them to an unknown token.
Step 4. Count N-grams. Traverse the tokenized corpus and for each position record counts of the N-gram and its context. For a trigram model you maintain counts of both trigrams and their corresponding bigrams.
Step 5. Compute maximum likelihood estimates by dividing N-gram counts by the counts of their contexts, yielding preliminary conditional probabilities.
Step 6. Apply smoothing. Depending on the chosen method, adjust counts or probabilities to allocate some probability mass to unseen N-grams. For simple Laplace smoothing, add a fixed constant to each count before normalisation. For more advanced methods, use discounting and backoff formulas.
Step 7. Optionally perform interpolation by combining higher and lower order probabilities with tuned weights.
Step 8. Evaluate the model by calculating perplexity on a held-out test set. If perplexity is unsatisfactory, revisit decisions about N, corpus size, preprocessing or smoothing method.
Step 9. Use the trained language model within downstream applications such as speech recognition, spelling correction or machine translation.
This algorithm bridges theory and practical implementation, and students can follow it directly when building their own N-gram language models in code.
Python code example: a simple bigram language model
The following Python example builds a very small bigram language model with add-one smoothing and uses it to compute sentence probabilities and simple next word suggestions.
from collections import Counter
import math

# Small toy corpus
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the dog ate the food"
]

# Preprocess: add start and end tokens
tokenised_sentences = []
for sent in corpus:
    tokens = ["<s>"] + sent.lower().split() + ["</s>"]
    tokenised_sentences.append(tokens)

# Build unigram and bigram counts
unigram_counts = Counter()
bigram_counts = Counter()
for tokens in tokenised_sentences:
    for i in range(len(tokens)):
        unigram_counts[tokens[i]] += 1
        if i > 0:
            bigram = (tokens[i - 1], tokens[i])
            bigram_counts[bigram] += 1

vocab = set(unigram_counts.keys())
V = len(vocab)

def bigram_prob(prev_word, word, k=1.0):
    """Add-k smoothed bigram probability P(word | prev_word)."""
    bigram = (prev_word, word)
    num = bigram_counts[bigram] + k
    denom = unigram_counts[prev_word] + k * V
    return num / denom

def sentence_log_prob(sentence):
    """Log probability of a sentence under the smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_p = 0.0
    for i in range(1, len(tokens)):
        p = bigram_prob(tokens[i - 1], tokens[i])
        log_p += math.log(p)
    return log_p

# Compare two sentences
s1 = "the dog chased the cat"
s2 = "cat dog the chased"
print("log P(s1) =", sentence_log_prob(s1))
print("log P(s2) =", sentence_log_prob(s2))

# Simple next-word suggestion
def suggest_next(prev_word, top_k=3):
    """Return the top_k most probable words following prev_word."""
    probs = []
    for w in vocab:
        if w == "<s>":
            continue
        p = bigram_prob(prev_word, w)
        probs.append((p, w))
    probs.sort(reverse=True)
    return probs[:top_k]

print("Next word suggestions after 'the':")
for p, w in suggest_next("the"):
    print(w, "->", round(p, 3))
Even though this is a tiny corpus, the example demonstrates all the key steps: building counts, applying smoothing, computing sentence probabilities and using the model as a simple predictor. In more realistic scenarios, the same logic is applied to much larger datasets using efficient data structures and specialised toolkits.
Summary
Language models provide the probabilistic backbone of many Natural Language Processing systems. By assigning probabilities to sequences of words, they allow applications to prefer fluent sentences over improbable ones and to predict likely continuations in context. N-gram models operationalise this idea in a simple yet powerful way by using local context and corpus statistics.
Because raw counts are sparse and cannot handle unseen sequences, smoothing and backoff are essential. They ensure that every plausible sequence receives some probability mass and that information from lower order models supports higher order estimates. Perplexity offers a principled way to compare different language models and to judge the impact of choices such as N-gram order and smoothing technique.
Although modern neural language models have advanced far beyond classic N-gram models, the concepts introduced in this lecture remain fundamental. Understanding N-grams, smoothing, backoff and perplexity builds strong intuition for later topics such as neural sequence models and transformers.
People also ask:
What is a language model?
A language model is a probability distribution over sequences of words. It allows a system to estimate how likely a sentence is and to predict which words are probable in a given context.
What is an N-gram model?
An N-gram model is a type of language model that approximates the probability of a word based only on the previous N − 1 words. Common examples are bigram models that look at one previous word and trigram models that look at two previous words.
Why is smoothing needed?
Even large corpora cannot contain every possible word sequence, so many N-grams will have zero counts. Without smoothing these N-grams would have zero probability, which is unrealistic and harmful for applications. Smoothing reassigns some probability mass from frequent events to unseen ones so that all sequences have non-zero probability.
What is backoff?
Backoff is a strategy where the model uses a higher order N-gram when there is enough data, but falls back to a lower order N-gram when the higher order count is sparse or zero. This makes the model more robust while still taking advantage of longer contexts when possible.
What is perplexity?
Perplexity measures how well a language model predicts a separate test set of sentences. Lower perplexity means the model assigns higher probability to the actual data and is therefore considered better. It is widely used to compare smoothing methods and N-gram orders.