Language modeling is at the heart of many Natural Language Processing systems. Whenever a system needs to guess the next word in a sequence, evaluate how natural a sentence sounds, or choose between alternative outputs, it relies on some estimate of how likely different word sequences are.
Older applications such as speech recognition, spelling correction and machine translation depended heavily on statistical language models. Modern transformer-based models can be viewed as very powerful neural language models that generalise the same basic idea. For this reason, understanding classical N-gram language models, smoothing and backoff remains extremely valuable even in the deep learning era.
This lecture explains what a language model is, how N-gram models approximate full sequence probabilities, why raw counts are not enough and how smoothing and backoff help handle sparsity. It also introduces perplexity as a standard evaluation measure and connects the concepts to practical NLP tasks.
What is a language model
Informally, a language model is a probability distribution over sequences of words. It answers questions such as:
- How likely is the sentence “The students submitted the assignment on time.”?
- Which of two sentences is more probable: “She likes to read books.” or “She like to reads book.”?
- Given the partial sentence “The weather today is very”, which word is a more probable continuation: “hot” or “elephant”?
Formally, for a sequence of words w1, w2, …, wN, a language model assigns a probability P(w1, w2, …, wN). Using the chain rule of probability, this can be written as:
P(w1, w2, …, wN) = P(w1) × P(w2 | w1) × P(w3 | w1, w2) × … × P(wN | w1, …, wN−1), i.e. the product over i from 1 to N of P(wi | w1, …, wi−1).
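For instance, applying the chain rule to the short sentence “the dog barks” gives:
P(the, dog, barks) = P(the) × P(dog | the) × P(barks | the, dog).
Even in this three-word example the last factor conditions on the entire preceding history.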
The chain rule decomposition is exact but not directly usable, because the later factors condition on ever longer histories, which are far too sparse to estimate reliably from data. N-gram models introduce a simplifying assumption to make the computation tractable.
The N-gram approximation
The key idea behind N-gram models is that the probability of a word depends mainly on a limited number of preceding words rather than the entire history. This is a Markov assumption of order N − 1.
For a bigram model, the probability of wi depends only on wi−1:
P(wi | w1, …, wi−1) ≈ P(wi | wi−1).
For a trigram model, the probability of wi depends on the previous two words:
P(wi | w1, …, wi−1) ≈ P(wi | wi−2, wi−1).
More generally, an N-gram model uses the N − 1 previous words as context.
This simplification allows probabilities to be estimated by counting how often different word sequences occur in a corpus. For example, the bigram probability P(time | on) is estimated as count(“on time”) / count(“on”) in the training corpus.
For a complete theoretical treatment of N-gram language models and smoothing, see the N-gram chapter in Speech and Language Processing.
Maximum likelihood estimation from a corpus
Assume there is a large corpus of text where all N-gram counts can be computed.
- For a unigram model, the probability of a word w is estimated as P(w) = count(w) / (total number of tokens in the corpus).
- For a bigram model, the conditional probability of word w2 given w1 is P(w2 | w1) = count(w1 w2) / count(w1).
- For a trigram model, the conditional probability of word w3 given w1 and w2 is P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2).
These estimates are called maximum likelihood estimates because they are the probabilities that make the observed corpus most likely under the model.
A small example makes this concrete. Suppose a tiny corpus consists of the following sentences:
- the dog chased the cat
- the cat chased the mouse
If we treat each sentence as starting with a special start token <s> and ending with an end token </s>, we can compute bigram counts. For example, count(<s> the) = 2, count(the dog) = 1 and count(the cat) = 2. Since “the” occurs 4 times in total, the estimated bigram probability P(dog | the) is 1/4 and the probability P(cat | the) is 2/4 = 1/2.
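As a quick sanity check, a few lines of Python reproduce these counts and probabilities. This is only a small sketch for the two-sentence corpus above, with <s> and </s> standing in for the start and end tokens.

from collections import Counter

# Toy corpus from the example above, written with explicit start/end tokens
sentences = [
    "<s> the dog chased the cat </s>",
    "<s> the cat chased the mouse </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)                     # word counts
    bigrams.update(zip(tokens, tokens[1:]))     # adjacent word pairs

# Maximum likelihood estimates: count(w1 w2) / count(w1)
print(bigrams[("<s>", "the")])                        # 2
print(bigrams[("the", "dog")] / unigrams["the"])      # 1/4 = 0.25
print(bigrams[("the", "cat")] / unigrams["the"])      # 2/4 = 0.5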
In real systems, corpora are many orders of magnitude larger and N-grams up to order three or four are common. Even so, a serious issue arises: data sparsity.
The problem of data sparsity
Human language is extremely rich. Even large corpora cannot contain every possible sequence of words. For example, a trigram such as “blue quantum toaster” may not appear in any training data, even though the phrase is perfectly grammatical. Under the maximum likelihood formula, any unseen N-gram receives probability zero.
This creates two problems.
- Zero probability sentences. A single unseen trigram in a sentence leads to the entire sentence receiving probability zero, which is unrealistic.
- Poor generalisation. N-gram probabilities based only on raw counts do not generalise well to rare events, and the model may overfit frequent patterns.
Smoothing addresses these issues by redistributing some probability mass from seen events to unseen ones so that nothing has zero probability and rare events are handled more sensibly.
Basic smoothing methods
Several smoothing techniques have been developed. Only the most important ones are outlined here to build intuition.
5.1 Add-one and add-k smoothing
Add-one smoothing, also called Laplace smoothing, adds 1 to every N-gram count before normalising. For a bigram model:
P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V).
Here V is the vocabulary size. The +1 in the numerator ensures that every possible continuation receives at least a small positive probability, while the +V in the denominator keeps the distribution normalised.
A more flexible version adds a fractional constant k between 0 and 1 instead of 1. This is called add-k smoothing. It reduces the distortion of probabilities for frequent N-grams.
Although conceptually simple, add-one smoothing often performs poorly on real language data because it overestimates the probability of unseen events, especially when the vocabulary is large.
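To make the formula concrete, consider add-one smoothing on the toy two-sentence corpus from earlier. Counting <s> and </s>, the vocabulary contains V = 7 word types, “the” occurs 4 times and “cat” occurs 2 times, so:
P(dog | the) = (1 + 1) / (4 + 7) = 2/11 ≈ 0.18
P(the | cat) = (0 + 1) / (2 + 7) = 1/9 ≈ 0.11
Compared with the maximum likelihood estimates (0.25 and 0), probability mass has been shifted from seen bigrams such as “the dog” to unseen ones such as “cat the”, which is exactly the purpose of smoothing.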
5.2 Good–Turing discounting
Good–Turing smoothing is based on the idea that the total probability mass of unseen events should be related to the number of events that occur once, twice and so on in the data.
Instead of using the raw count c for an N-gram, Good–Turing uses an adjusted count c* computed from the distribution of counts. N-grams that occur once have their counts reduced slightly, and the removed probability mass is assigned to the many N-grams that never occurred.
While the detailed formulas require some care, the core idea is redistribution: frequent events are discounted, and the freed probability mass is given to unseen events in a principled way.
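The following sketch shows only the basic adjustment: the adjusted count is c* = (c + 1) · N(c+1) / N(c), where N(c) is the number of distinct N-grams observed exactly c times, and the mass reserved for unseen events is N(1) divided by the total number of observed tokens. The bigram counts below are invented for illustration, and a real implementation would additionally smooth the N(c) values, which this toy version omits.

from collections import Counter

# Hypothetical bigram counts, for illustration only
bigram_counts = Counter({
    ("the", "cat"): 3, ("the", "dog"): 2, ("a", "dog"): 2,
    ("dog", "barks"): 1, ("cat", "sleeps"): 1, ("a", "cat"): 1,
})

# N(c) = number of distinct bigrams that occur exactly c times
count_of_counts = Counter(bigram_counts.values())
total = sum(bigram_counts.values())

def adjusted_count(c):
    """Good–Turing adjusted count c* = (c + 1) * N(c+1) / N(c)."""
    n_c, n_next = count_of_counts[c], count_of_counts[c + 1]
    return (c + 1) * n_next / n_c if n_c and n_next else None

print("mass reserved for unseen bigrams:", count_of_counts[1] / total)  # 3/10
print("adjusted count for c = 1:", adjusted_count(1))                   # 2 * 2/3 ≈ 1.33
print("adjusted count for c = 2:", adjusted_count(2))                   # 3 * 1/2 = 1.5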
5.3 Kneser–Ney smoothing
Kneser–Ney smoothing is widely considered one of the most effective methods for N-gram language models. At a high level, it combines discounting with an intelligent way of backing off to lower order models.
Instead of basing unigram probabilities purely on frequency, Kneser–Ney considers how many different contexts a word appears in. A word that appears in many different contexts, such as “of”, receives higher base probability than a word that only appears frequently in a single fixed phrase.
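The sketch below illustrates only this continuation-count idea, not the full Kneser–Ney recursion with discounting: a word's base probability is made proportional to the number of distinct words that precede it. The bigrams are invented for illustration; “francisco” occurs as often as “of”, but almost always after the same single word.

from collections import Counter

# Hypothetical bigram tokens, for illustration only
bigrams = [
    ("san", "francisco"), ("san", "francisco"), ("san", "francisco"),
    ("cup", "of"), ("out", "of"), ("kind", "of"),
]

# Continuation count: how many distinct left contexts each word appears after
distinct_bigrams = set(bigrams)
continuation = Counter(word for _, word in distinct_bigrams)

# P_continuation(w) = |{w' : count(w' w) > 0}| / number of distinct bigram types
for word in ("francisco", "of"):
    print(word, continuation[word] / len(distinct_bigrams))
# "of" gets 0.75 while "francisco" gets only 0.25, despite equal raw frequency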
The full algorithm is beyond the scope of a basic lecture, but it is important to know that modern N-gram toolkits often use some variant of Kneser–Ney smoothing by default because of its strong empirical performance.
Backoff and interpolation
In many situations, even with smoothing, higher order N-grams remain unreliable when counts are very low. Backoff and interpolation are strategies for combining higher order and lower order models in a sensible way.
Backoff works by using the highest order N-gram that has sufficient evidence. If a particular trigram has zero or very low count, the model “backs off” to the corresponding bigram or unigram. The final probability is computed from the lower order model, often with a scaling factor to ensure proper normalisation.
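As a rough illustration, the sketch below uses an unnormalised back-off score in the spirit of “stupid backoff”, with an assumed fixed scaling factor alpha = 0.4; a proper Katz-style backoff would instead use discounted counts and computed backoff weights so that the resulting probabilities sum to one.

def backoff_score(w1, w2, w3, trigram_counts, bigram_counts, unigram_counts,
                  total_tokens, alpha=0.4):
    """Unnormalised score for w3 following the context (w1, w2)."""
    if trigram_counts.get((w1, w2, w3), 0) > 0:
        # Enough evidence at the trigram level: use the trigram relative frequency
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:
        # Back off to the bigram, scaled by alpha
        return alpha * bigram_counts[(w2, w3)] / unigram_counts[w2]
    # Back off all the way to the unigram
    return alpha * alpha * unigram_counts.get(w3, 0) / total_tokens

# Tiny invented counts for illustration
trigrams = {("the", "dog", "chased"): 1}
bigrams = {("the", "dog"): 1, ("dog", "ate"): 1}
unigrams = {"the": 2, "dog": 1, "ate": 1, "chased": 1}
print(backoff_score("the", "dog", "chased", trigrams, bigrams, unigrams, total_tokens=5))  # 1.0
print(backoff_score("the", "dog", "ate", trigrams, bigrams, unigrams, total_tokens=5))     # 0.4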
Interpolation always combines models of different orders rather than choosing one. For example, a trigram probability can be written as a weighted sum of trigram, bigram and unigram components.
P(wi | wi−2, wi−1) = λ1 · P_trigram(wi | wi−2, wi−1) + λ2 · P_bigram(wi | wi−1) + λ3 · P_unigram(wi).
The weights λ1, λ2 and λ3 are chosen so that they are non-negative and sum to one. They may be set by hand or estimated from held-out data to optimise performance. Interpolation ensures that lower order models always contribute, providing robustness when specific higher order contexts are rare.
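Below is a minimal sketch of linear interpolation on the toy corpus, mixing only bigram and unigram maximum likelihood estimates with fixed example weights; a real system would add a trigram component and tune the weights on held-out data.

from collections import Counter

sentences = [
    "<s> the dog chased the cat </s>",
    "<s> the cat chased the mouse </s>",
]
unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
total = sum(unigrams.values())

def interpolated_prob(prev_word, word, lam_bi=0.7, lam_uni=0.3):
    """Weighted mix of the bigram and unigram maximum likelihood estimates."""
    p_bi = bigrams[(prev_word, word)] / unigrams[prev_word] if unigrams[prev_word] else 0.0
    p_uni = unigrams[word] / total
    return lam_bi * p_bi + lam_uni * p_uni

print(interpolated_prob("the", "cat"))  # seen bigram: 0.7 * 2/4 + 0.3 * 2/14
print(interpolated_prob("cat", "the"))  # unseen bigram still gets 0.3 * 4/14 from the unigram term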
Backoff and interpolation ideas are closely connected to the way modern neural models mix information from different layers and contexts, so understanding them at the N-gram level builds intuition for more advanced architectures.
Evaluating language models with perplexity
To compare different language models and smoothing strategies, an objective evaluation metric is necessary. Perplexity is the standard measure for this purpose.
Given a language model and a test set of word sequences that were not used for training, perplexity is defined as the inverse of the geometric mean per-word probability that the model assigns to the test data. In practice it is computed from the average log probability per word:
perplexity = exp(−(1/M) · Σ log P(wi | history)), where M is the total number of words in the test data.
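In code, the computation follows the definition directly: average the log probabilities the model assigns to the test words, negate, and exponentiate. The per-word probabilities below are made up purely to illustrate the calculation.

import math

def perplexity(word_probs):
    """Perplexity from per-word model probabilities: exp(-(1/M) * sum(log p))."""
    m = len(word_probs)
    avg_log_prob = sum(math.log(p) for p in word_probs) / m
    return math.exp(-avg_log_prob)

# A weaker model is more "surprised" by the test words and gets higher perplexity
print(perplexity([0.2, 0.1, 0.25, 0.05]))  # ≈ 8.0
print(perplexity([0.5, 0.4, 0.6, 0.45]))   # ≈ 2.1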
Intuitively, perplexity measures how “surprised” the model is by the test data. Lower perplexity indicates a better model, because it assigns higher probability to the sequences that actually occur in the language.
Perplexity is not a direct measure of task performance, but there is usually a strong correlation between lower perplexity and better results in tasks such as speech recognition and machine translation, where the language model is part of a larger system.
Real world applications of N-gram language models
Language models play a central role in many classical NLP applications.
First, in speech recognition systems, the acoustic model proposes several possible word sequences based on audio. The language model scores these candidates and helps pick the most plausible sentence. A trigram model with good smoothing can significantly reduce errors by preferring grammatically correct and semantically natural word sequences.
Second, in spelling correction and query suggestion, language models help evaluate entire sentences rather than isolated words. For example, in the phrase “I like to ear cake”, the word “ear” is valid but unlikely after “to” in this context. A language model can determine that “eat” yields a much higher probability for the full sentence and therefore suggest it as the correction.
Third, in statistical machine translation, language models in the target language are used to ensure that translated output is fluent. Among several possible translations that are grammatically correct at the phrase level, the one with the lowest perplexity under the target language model is often selected.
Finally, even neural sequence models can benefit from ideas developed for N-gram models. Concepts such as next word prediction, context windows, smoothing and evaluation via perplexity carry over directly to modern transformer-based architectures.
Step by step algorithm for training an N-gram language model
The basic training procedure for an N-gram language model can be summarised in a sequence of clear steps.
Step 1. Choose the order N of the model, for example unigram, bigram or trigram.
Step 2. Collect a large corpus of text representative of the domain of interest. Add special start and end tokens to each sentence.
Step 3. Preprocess the corpus using the techniques from earlier lectures. This includes normalisation, tokenisation and optional handling of rare words by mapping them to an unknown token.
Step 4. Count N-grams. Traverse the tokenized corpus and for each position record counts of the N-gram and its context. For a trigram model you maintain counts of both trigrams and their corresponding bigrams.
Step 5. Compute maximum likelihood estimates by dividing N-gram counts by the counts of their contexts, yielding preliminary conditional probabilities.
Step 6. Apply smoothing. Depending on the chosen method, adjust counts or probabilities to allocate some probability mass to unseen N-grams. For simple Laplace smoothing, add a fixed constant to each count before normalisation. For more advanced methods, use discounting and backoff formulas.
Step 7. Optionally perform interpolation by combining higher and lower order probabilities with tuned weights.
Step 8. Evaluate the model by calculating perplexity on a held-out test set. If perplexity is unsatisfactory, revisit decisions about N, corpus size, preprocessing or smoothing method.
Step 9. Use the trained language model within downstream applications such as speech recognition, spelling correction or machine translation.
This algorithm bridges theory and practical implementation, and students can follow it directly when building their own N-gram language models in code.
Python code example: a simple bigram language model
The following Python example builds a very small bigram language model with add-one smoothing and uses it to compute sentence probabilities and simple next word suggestions.
from collections import Counter
import math

# Small toy corpus
corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the dog ate the food"
]

# Preprocess: add start and end tokens
tokenised_sentences = []
for sent in corpus:
    tokens = ["<s>"] + sent.lower().split() + ["</s>"]
    tokenised_sentences.append(tokens)

# Build unigram and bigram counts
unigram_counts = Counter()
bigram_counts = Counter()
for tokens in tokenised_sentences:
    for i in range(len(tokens)):
        unigram_counts[tokens[i]] += 1
        if i > 0:
            bigram = (tokens[i - 1], tokens[i])
            bigram_counts[bigram] += 1

vocab = set(unigram_counts.keys())
V = len(vocab)

def bigram_prob(prev_word, word, k=1.0):
    """Add-k smoothed bigram probability P(word | prev_word)."""
    bigram = (prev_word, word)
    num = bigram_counts[bigram] + k
    denom = unigram_counts[prev_word] + k * V
    return num / denom

def sentence_log_prob(sentence):
    """Log probability of a sentence under the smoothed bigram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_p = 0.0
    for i in range(1, len(tokens)):
        p = bigram_prob(tokens[i - 1], tokens[i])
        log_p += math.log(p)
    return log_p

# Compare two sentences
s1 = "the dog chased the cat"
s2 = "cat dog the chased"
print("log P(s1) =", sentence_log_prob(s1))
print("log P(s2) =", sentence_log_prob(s2))

# Simple next-word suggestion
def suggest_next(prev_word, top_k=3):
    """Return the top_k most probable words following prev_word."""
    probs = []
    for w in vocab:
        if w == "<s>":
            continue
        p = bigram_prob(prev_word, w)
        probs.append((p, w))
    probs.sort(reverse=True)
    return probs[:top_k]

print("Next word suggestions after 'the':")
for p, w in suggest_next("the"):
    print(w, "->", round(p, 3))
Even though this is a tiny corpus, the example demonstrates all the key steps: building counts, applying smoothing, computing sentence probabilities and using the model as a simple predictor. In more realistic scenarios, the same logic is applied to much larger datasets using efficient data structures and specialised toolkits.
Summary
Language models provide the probabilistic backbone of many Natural Language Processing systems. By assigning probabilities to sequences of words, they allow applications to prefer fluent sentences over improbable ones and to predict likely continuations in context. N-gram models operationalise this idea in a simple yet powerful way by using local context and corpus statistics.
Because raw counts are sparse and cannot handle unseen sequences, smoothing and backoff are essential. They ensure that every plausible sequence receives some probability mass and that information from lower order models supports higher order estimates. Perplexity offers a principled way to compare different language models and to judge the impact of choices such as N-gram order and smoothing technique.
Although modern neural language models have advanced far beyond classic N-gram models, the concepts introduced in this lecture remain fundamental. Understanding N-grams, smoothing, backoff and perplexity builds strong intuition for later topics such as neural sequence models and transformers.
People also ask:
What is a language model?
A language model is a probability distribution over sequences of words. It allows a system to estimate how likely a sentence is and to predict which words are probable in a given context.
What is an N-gram model?
An N-gram model is a type of language model that approximates the probability of a word based only on the previous N − 1 words. Common examples are bigram models that look at one previous word and trigram models that look at two previous words.
Why is smoothing needed?
Even large corpora cannot contain every possible word sequence, so many N-grams will have zero counts. Without smoothing these N-grams would have zero probability, which is unrealistic and harmful for applications. Smoothing reassigns some probability mass from frequent events to unseen ones so that all sequences have non-zero probability.
What is backoff?
Backoff is a strategy where the model uses a higher order N-gram when there is enough data, but falls back to a lower order N-gram when the higher order count is sparse or zero. This makes the model more robust while still taking advantage of longer contexts when possible.
What is perplexity?
Perplexity measures how well a language model predicts a separate test set of sentences. Lower perplexity means the model assigns higher probability to the actual data and is therefore considered better. It is widely used to compare smoothing methods and N-gram orders.