Perplexity: The Poetry of Uncertainty

The veil of perplexity - LLM

Perplexity

I’ve been revisiting the concept of perplexity, especially in the context of large language models (LLMs) and supervised fine-tuning (SFT). This post summarizes my notes and reflections on the topic, and I plan to update it as I continue to explore recent research and interpretations of perplexity.

This blog post is organized as follows:

In the first section, I discuss the math behind perplexity. If you want to skip the math or the geometrical intuition, here’s a quick recap. Perplexity is a measure of uncertainty, i.e., how surprised (“perplexed”) we are by a given outcome. Reading a sentence one word at a time (like right now), we make intuitive choices about the next word based on the context of the previous words. If we have too many equally likely choices, the uncertainty increases, and so does the surprise when we are wrong. For a language model, perplexity indicates the amount of uncertainty in predicting the next word, i.e., how well it models the language in the corpus.

Math Breakdown

To understand what perplexity actually measures, let’s unpack its derivation from entropy and cross-entropy.

1. Entropy & Uniform Distribution

For a discrete probability distribution \(p(x)\) where \(x\) can be a word / sub-word / character (or simply, a token), the entropy of this distribution is defined as:

\[H(p(x)) = -\sum_{x} p(x) \log_b p(x)\]

If at each time step, we assume \(N\) equally likely choices (i.e., a uniform distribution), then

\[H(p(x)) = -\sum_{x} \frac{1}{N} \log_b \frac{1}{N} = \log_b N\]

or, equivalently,

\[N = b^{H(p(x))}\]

This \(N\) is called perplexity, which we can rewrite as:

\[N = \prod_i p(x_i)^{-p(x_i)}\]

where the product runs over every possible outcome \(x_i\) of the distribution.

The distribution, in reality, is non-uniform for language modeling. Some tokens are much more likely than others, and the next token’s \(p(x)\) depends on the context; i.e.,

\[p(x_t|x_{1:t-1})\]

where \(x_{1:t-1}\) are the previous tokens in the context window. In non-uniform cases, the entropy is less than the uniform entropy \(\log_b N\). Writing \(N_{\alpha} = b^{H(p(x))}\) for the corresponding perplexity:

\[H(p(x)) = \log_b N_{\alpha} < \log_b N\]

Perplexity \(N_{\alpha}\) for non-uniform distributions can be seen as the effective number of equally likely choices, i.e., the model behaves “as if” it were picking uniformly among \(N_{\alpha}\) options. This value is smaller than the uniform perplexity \(N\), since the highest perplexity corresponds to knowing nothing about the next token and predicting every vocabulary token with equal probability.
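To make the “effective number of choices” reading concrete, here is a minimal sketch in plain Python. The two four-token distributions are made up for illustration, and the helper names `entropy` and `perplexity` are my own, not from any library.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution, in units set by `base`."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def perplexity(probs, base=2.0):
    """Perplexity b**H(p): the effective number of equally likely choices."""
    return base ** entropy(probs, base)

# Hypothetical next-token distributions over a 4-token vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]   # N = 4 equally likely tokens
skewed = [0.7, 0.1, 0.1, 0.1]        # same vocabulary, non-uniform

ppl_uniform = perplexity(uniform)
ppl_skewed = perplexity(skewed)

# Product form from the text: prod_x p(x)**(-p(x)), independent of the base.
ppl_skewed_product = math.prod(p ** -p for p in skewed)

print(ppl_uniform, ppl_skewed, ppl_skewed_product)
```

The uniform case recovers \(N = 4\), while the skewed case gives roughly 2.56: the distribution behaves as if there were only about two and a half equally likely options.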

2. Approximating the True Distribution

Since the true \(p(x)\) is unknown, language modeling aims to find an approximation \(q_{\theta}(x)\), where \(\theta\) are the model parameters. From information theory, the goal is to minimize the KL divergence between \(q(x)\) and the true \(p(x)\); in effect, training tries to make the model’s predicted distribution as close as possible to the data’s true token distribution.

Cross-Entropy

The cross-entropy between \(p(x)\) and \(q(x)\) is:

\[H(p(x), q(x)) = -\sum_{x} p(x) \log_b q(x)\]

KL Divergence

The KL divergence is:

\[D_{KL}(p(x) || q(x)) = H(p(x), q(x)) - H(p(x))\]

Cross-Entropy Perplexity

The cross-entropy-based perplexity (PPL) follows as:

\[PPL(x) = N = b^{H(p(x), q(x))} = b^{-\sum_{x} p(x) \log_b q(x)}\]

This leads to:

\[PPL(x) = b^{H(p(x))} \times b^{D_{KL}(p(x) || q(x))}\]

and thus,

\[\text{Model perplexity} = \text{True perplexity} \times \text{KL Divergence Penalty}\]

The true perplexity is the theoretical minimum that any model can achieve, and the KL divergence term serves as a penalty factor for the model’s imperfect approximation of the true distribution (the factor is at least 1, since \(D_{KL} \geq 0\)). Because \(H(p(x))\) does not depend on \(q(x)\), minimizing KL divergence is equivalent to minimizing cross-entropy and, by extension, model perplexity.

\[\arg \min_{q(x)} D_{KL}(p(x) || q(x)) = \arg \min_{q(x)} H(p(x), q(x))\]
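The decomposition above can be checked numerically. The sketch below uses natural logarithms (\(b = e\)) and two small made-up distributions, `p` standing in for the “true” distribution and `q` for the model; both are assumptions for illustration only.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) is the cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]   # hypothetical "true" distribution
q = [0.4, 0.4, 0.2]   # hypothetical model distribution

model_ppl = math.exp(cross_entropy(p, q))
true_ppl = math.exp(entropy(p))
kl_penalty = math.exp(kl_divergence(p, q))

# Model perplexity = true perplexity * KL penalty (up to float rounding).
print(model_ppl, true_ppl * kl_penalty)
```

Setting `q = p` drives the penalty to exactly 1, which is the sense in which the true perplexity is a floor.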

3. Empirical Computation

Lastly, these equations still operate under the assumption that the true distribution \(p(x)\) is known. In practice, we only have observed samples from the true distribution (i.e., the training corpus), so we estimate the cross-entropy empirically. Assuming samples \(x_1, x_2, ..., x_N\) are drawn from the true distribution, the Monte Carlo approximation of the cross-entropy is:

\[H(p, q) \approx -\frac{1}{N} \sum_{i=1}^{N} \log_b(q(x_i))\]

where \(x_i\) are the observed samples, \(N\) now denotes the number of samples (not the number of choices from the first section), and the right-hand side is the sample average (notice this is just the average negative log-likelihood). The approximation is justified by the Asymptotic Equipartition Property.

Consequently, perplexity can be estimated as:

\[PPL(x) \approx \prod_i \left(\frac{1}{q(x_i)}\right)^{1/N}\]

This is the geometric mean of the inverse model probabilities, i.e., perplexity is the average factor by which the model is “surprised” when predicting the next token. It can be read as the weighted average branching factor at every time step: the number of possible next words that can follow a word (Speech and Language Processing by Jurafsky and Martin).
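As a small worked example, the two forms of the estimate (exponentiated average negative log-likelihood, and geometric mean of inverse probabilities) can be compared directly. The token probabilities below are hypothetical values standing in for \(q(x_i)\), and the function names are mine.

```python
import math

def perplexity_from_avg_nll(token_probs):
    """exp of the average negative log-likelihood (natural log)."""
    n = len(token_probs)
    avg_nll = -sum(math.log(q) for q in token_probs) / n
    return math.exp(avg_nll)

def perplexity_geometric_mean(token_probs):
    """Geometric mean of the inverse model probabilities."""
    n = len(token_probs)
    return math.prod((1.0 / q) ** (1.0 / n) for q in token_probs)

# Hypothetical probabilities the model assigned to each observed token.
token_probs = [0.5, 0.25, 0.125, 0.25]

ppl_nll = perplexity_from_avg_nll(token_probs)
ppl_geo = perplexity_geometric_mean(token_probs)
print(ppl_nll, ppl_geo)
```

The inverse probabilities here are 2, 4, 8 and 4, whose geometric mean is 4, so both forms report a perplexity of 4: on average the model behaves as if it were choosing among four equally likely tokens.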


Perplexity Across Training Phases: From Learning to Alignment

In this section, we explore how perplexity can be used to interpret learning and alignment across model and data dimensions.

Pre-training

During pre-training, the primary objective is to minimize the model’s negative log-likelihood; perplexity directly measures progress on this goal. Because perplexity reflects how well the model captures the statistical structure of language, it serves as a strong intrinsic metric for evaluating language understanding during this phase. Unlike measuring the raw probability assigned to an evaluation set, which diminishes with longer sequences, perplexity provides a per-token view, making results more interpretable. However, perplexity does not indicate downstream task performance, such as factual accuracy or reasoning.

If we have a model A with two pre-training checkpoints, \(A_1\) and \(A_2\), where \(A_2\) is further trained than \(A_1\), a lower \(PPL(A_1)\) compared to \(PPL(A_2)\) (on the same evaluation set) would suggest that \(A_2\) has degraded in performance. Thus, perplexity is valid for comparing pre-training checkpoints or model architectures, as long as tokenization remains consistent.
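The point about raw sequence probability shrinking with length, while perplexity stays per-token, can be sketched as follows. The constant per-token probability of 0.2 is an artificial assumption chosen to keep the arithmetic obvious.

```python
import math

def sequence_log_prob(token_probs):
    """Log-probability of the whole sequence (sum of per-token logs)."""
    return sum(math.log(q) for q in token_probs)

def perplexity(token_probs):
    """Per-token perplexity: exp of the length-normalized negative log-prob."""
    return math.exp(-sequence_log_prob(token_probs) / len(token_probs))

# Hypothetical model that assigns probability 0.2 to every observed token.
short_seq = [0.2] * 10
long_seq = [0.2] * 1000

short_logp = sequence_log_prob(short_seq)
long_logp = sequence_log_prob(long_seq)
short_ppl = perplexity(short_seq)
long_ppl = perplexity(long_seq)

print(short_logp, long_logp)   # raw log-probability falls with length
print(short_ppl, long_ppl)     # perplexity stays at the same per-token value
```

The model is equally good on both sequences, yet the longer one has a far lower raw probability. Perplexity is 5 in both cases, which is what makes it usable for comparing checkpoints on evaluation sets of different lengths.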

Musings on Post-training

Now, consider post-training such as supervised fine-tuning (SFT) for a domain-specific task. What does the perplexity of the SFT training data, measured under the base model, reveal? For instance, suppose we have a base model Llama-2-7b-hf and we want to instruction fine-tune it for a healthcare question answering task. An example instruction pair could be:

Human: Explain what an EOB (Explanation of Benefits) is.
Assistant: An EOB is a statement sent by a health insurance company that explains what medical treatments or services were paid for, what was not covered, and why

Calculating perplexity on SFT data helps assess how familiar a model is with the new domain or instruction format and how well it can predict such data. Comparing perplexity values under different models and data alignments quantifies domain shift and the effectiveness of post-training or fine-tuning. In the following code snippet, we calculate perplexity values under these conditions.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, math

def get_model_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.float16, device_map="auto")
    model.eval()
    return tokenizer, model

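def ppl_fullsequence(model_name, texts):
    """Calculate PPL for raw text sequences (loss over every token)."""
    tokenizer, model = get_model_tokenizer(model_name)
    losses = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.inference_mode():
            # labels == input_ids: the model shifts them internally and returns
            # the mean next-token cross-entropy (in nats) for this sequence
            loss = model(**inputs, labels=inputs["input_ids"]).loss.item()
        losses.append(loss)
    # Each sequence is weighted equally (mean of per-sequence means), which
    # differs from a token-weighted corpus perplexity when lengths vary
    return math.exp(sum(losses)/len(losses))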

def ppl_assistant_only_chat(model_name, pairs):
    """Calculate PPL for chat template pairs by only looking at assistant tokens (aligns with Llama-2-chat-hf's instruction template)"""
    tokenizer, model = get_model_tokenizer(model_name)

    losses = []
    for example in pairs:
        messages = [
            {"role": "system", "content": "You are a helpful, respectful and honest assistant."}, # system instruction
            {"role": "user", "content": example["user"]}]
        # 1) Get token IDs for the prompt exactly as the model expects
        prompt_ids = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,  # indicate start of assistant response
            return_tensors="pt"
        ).to(model.device)               # shape: (1, prompt_len)

        # 2) Tokenize assistant without adding BOS/EOS again
        assistant_ids = tokenizer(
            " " + example["assistant"],   # add space to separate prompt from assistant
            add_special_tokens=False,
            return_tensors="pt"
        ).input_ids.to(model.device)     # shape: (1, ans_len)

        # 3) Build full input ids and labels, and mask prompt tokens
        full_ids = torch.cat([prompt_ids, assistant_ids], dim=1)  # (1, prompt_len+ans_len)
        labels = full_ids.clone()
        prompt_len = prompt_ids.shape[1]
        labels[:, :prompt_len] = -100  # ignore prompt in loss

        with torch.inference_mode():
            loss = model(input_ids=full_ids, labels=labels).loss.item()
        losses.append(loss)

    return math.exp(sum(losses)/len(losses))

Let’s calculate the perplexity values on the following data, which is similar to Llama-2-chat’s training data.


raw_texts = [
    "Human: What type of species is a orangutan?\nAssistant: An orangutan is a species of ape.",
    "Human: What is the capital of France?\nAssistant: The capital of France is Paris."
]

# Equivalent to:
pairs = [
    {
        "user": "What type of species is a orangutan?",
        "assistant": "An orangutan is a species of ape."
    },
    {
        "user": "What is the capital of France?",
        "assistant": "The capital of France is Paris."
    }
]

base = "meta-llama/Llama-2-7b-hf"
chat = "meta-llama/Llama-2-7b-chat-hf"

print("Base PPL (raw text):", ppl_fullsequence(base, raw_texts))
print("SFT model PPL (naive, raw text):", ppl_fullsequence(chat, raw_texts))
print("SFT model PPL (assistant-only, chat template):", ppl_assistant_only_chat(chat, pairs))

This results in the following output for the SFT data:

Output:
Base PPL (raw text): 10.266649146981344
SFT model PPL (naive, raw text): 6.7316364438131675
SFT model PPL (assistant-only, chat template): 3.8855575447740236

We calculate perplexity values across different dimensions, which imply different things:

| Model | Text format | Perplexity meaning |
| --- | --- | --- |
| base (Llama-2-7B-hf) | raw text | Measures how well the base LM predicts this data’s tokens |
| chat (Llama-2-7B-chat-hf) | raw text | Measures how well the SFT model predicts this data’s tokens without its expected conversational framing |
| chat (Llama-2-7B-chat-hf) | structured dialogue | Measures how well the SFT model performs the instruction-based completion given user input in its expected conversational template |

Base model perplexity on SFT data

The base model’s perplexity on its own evaluation set is not directly comparable to its perplexity on the SFT data. We hope that they are similar, but by the nature of the instruction-tuning task, the data distribution is slightly different.

  • The PPL \(\approx 10.27\) for the base model (raw text of SFT data) reflects its general linguistic fluency, since it is trained auto-regressively on all tokens. It tells us how “in-distribution” the SFT data is relative to the base model’s data domain (i.e., domain familiarity).
  • This low PPL, even before SFT, means the base model already finds the data similar in style, structure, and semantics. If the SFT domain differs substantially (e.g., moving from English classics to biology research), we should expect a higher PPL (we will see this next).
    • For instruction fine-tuning, a low PPL indicates greater familiarity with the instruction language (even if the model is not fully aligned yet). Measuring this familiarity is a useful diagnostic before alignment. This opens up questions about continued pre-training on new domain data, but we are jumping ahead here.

SFT model perplexity on SFT data

Instruction tuning is optimized for specialized behavioural alignment rather than pure language modeling, using templated examples (e.g., <s>[INST] Explain what an EOB ... [/INST] [ASSISTANT] An EOB is ...). The SFT model is trained only to predict the tokens after the [ASSISTANT] marker; the user instruction is part of the context, but not part of the training loss. SFT training shifts the language model heavily towards the behavioural (instruction) templates we desire, so the result is no longer a pure language model.

  • The PPL \(\approx 6.73\) (SFT model, raw text) is lower than the base model’s \(10.27\), showing the SFT training has improved overall prediction even without the template.
  • The PPL \(\approx 3.89\) (SFT model, structured dialogue) is the lowest, demonstrating the SFT model’s strong alignment with its specific conversational template (e.g., [INST], [/INST]). This low PPL confirms the model has learned to expect and predict the post-instruction sequence (the assistant’s response) when provided with the correct context and template. Perplexity on raw instruction data is higher because the mode of “understanding” that comes from the template is missing.
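The effect of excluding prompt tokens from the loss can be illustrated without a model. In the sketch below, the per-token negative log-likelihoods and label ids are invented, the helper `masked_mean_nll` is hypothetical, and the internal label shifting done by causal LM heads is ignored for simplicity. The value -100 matches the masking used in the snippet earlier and is the default `ignore_index` of PyTorch’s cross-entropy loss.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss

def masked_mean_nll(token_nlls, labels):
    """Average NLL over tokens whose label is not IGNORE_INDEX."""
    kept = [nll for nll, label in zip(token_nlls, labels) if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("all tokens are masked; nothing to average")
    return sum(kept) / len(kept)

# Hypothetical per-token NLLs (nats): 3 prompt tokens, then 2 response tokens.
token_nlls = [3.0, 2.5, 2.0, 0.5, 1.5]
labels_all = [11, 12, 13, 14, 15]                                # score every token
labels_assistant_only = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 14, 15]

full_ppl = math.exp(masked_mean_nll(token_nlls, labels_all))
assistant_ppl = math.exp(masked_mean_nll(token_nlls, labels_assistant_only))
print(full_ppl, assistant_ppl)
```

With these made-up numbers the prompt tokens are the hard ones, so the assistant-only perplexity is lower than the full-sequence value. This is one reason the “assistant-only” and “raw text” figures measure different things and should not be compared as if they were like-for-like.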

Out of domain data

Suppose our dataset comes from an entirely different domain, such as healthcare:

raw_texts = [
    "Human: Explain what an EOB (Explanation of Benefits) is.\nAssistant: An EOB is a statement sent by a health insurance company that explains what medical treatments or services were paid for, what was not covered, and why",
    "Human: How do you verify prior authorization status?\nAssistant: Contact the payer or check their portal using the member ID and service codes; confirm approval dates, remaining units, and any documentation required"
]
Output:
Base PPL (raw text): 13.936028931681786
SFT PPL (naive, raw text): 18.254311566365946
SFT PPL (assistant-only, chat template): 17.49617187783925

Notes:

  • We observe that aligning with the chat template of the SFT model helps reduce perplexity slightly (from \(\approx 18.25\) to \(\approx 17.50\)), but the result is still worse than the base pure language model (\(\approx 13.94\)). The healthcare domain data differs in phrasing and semantics, so the SFT model does not generalize well to this new domain. The PPL \(\approx 17.50\) indicates where the SFT model’s perplexity stands for this new domain.
  • If we instead performed SFT on the base model with the healthcare data, the PPL \(\approx 13.94\) would be the baseline indicator of how well the pretrained language model already models the domain’s token distribution before instruction tuning.
  • A higher PPL (\(\approx 17.50\)) on this data under a previously fine-tuned SFT model (Llama-2-7B-chat-hf) indicates a distribution mismatch with respect to healthcare jargon as-is. (Note: this is a naive setting; we could add in-context examples to potentially improve this).
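One way to read these numbers is to convert them back to average loss, since \(PPL = e^{\text{mean NLL}}\) when the loss is in nats. The sketch below does this arithmetic on the perplexities reported in the two outputs above; the dictionary keys and helper names are my own labels. Keep in mind that each set has only two examples, and that the chat-template figure scores assistant tokens only while the raw-text figures score every token, so the gap is indicative rather than a strict like-for-like comparison.

```python
import math

# Perplexities copied from the outputs reported above.
reported_ppl = {
    "general": {"base_raw": 10.266649146981344,
                "chat_raw": 6.7316364438131675,
                "chat_template": 3.8855575447740236},
    "healthcare": {"base_raw": 13.936028931681786,
                   "chat_raw": 18.254311566365946,
                   "chat_template": 17.49617187783925},
}

def nats_per_token(ppl):
    """Invert PPL = exp(mean NLL) to recover the mean NLL in nats."""
    return math.log(ppl)

def loss_gap(ppl_a, ppl_b):
    """Mean extra nats per token that setting A pays relative to setting B."""
    return nats_per_token(ppl_a) - nats_per_token(ppl_b)

gaps = {
    domain: loss_gap(vals["chat_template"], vals["base_raw"])
    for domain, vals in reported_ppl.items()
}
for domain, gap in gaps.items():
    print(f"{domain}: chat-template minus base = {gap:+.3f} nats/token")
```

The sign of the gap flips between the two sets: negative on the general Q&A pairs (the chat model is more confident than the base model) and positive on the healthcare pairs (the chat model is less confident).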

This exercise is a potentially useful diagnostic to understand the domain shift between the pre-training and post-training data, and the need for domain adaptation through continued pre-training. If the model doesn’t speak the same “language”, we cannot expect SFT / RLHF to perform well, as the next-token prediction is “off-track”. Instruction tuning can force the model to mimic the language, but the language is not internalized the way it would be by a pure language model.

So, in summary,

  • the pre-training PPL measures the model’s familiarity with the language.
  • the post-training PPL measures the model’s familiarity with the instruction language, i.e., alignment familiarity.

Deeper insights into perplexity

As mentioned above, perplexity on its own does not fully capture model quality. It should be considered alongside task-specific evaluations and specialized benchmarks, such as those for instruction following and other targeted abilities. Human evaluation also remains the gold standard. Still, perplexity is a useful diagnostic; it provides a baseline for how well the model predicts next tokens on task data and can reveal much about model fit and data alignment. This section collects key insights from the literature on the interpretation and limitations of perplexity.

  • Scaling Laws for Neural Language Models (2020) demonstrates that perplexity is a good proxy for model quality, and that cross-entropy loss scales as a power law in model size, dataset size, and compute. This allows us to estimate how much perplexity will drop with more training.
    • Training Compute-Optimal Large Language Models (2022) proposes the Chinchilla correction, which estimates that model size and data size should scale equally to achieve optimal performance. This prevents models from being under-trained, i.e., seeing too few tokens relative to their size.
  • Training Trajectories of Language Models Across Scales (2023) indicates that, at a given perplexity, models of different sizes can behave similarly on downstream (in-context) evaluation tasks. The authors measure validation next-token perplexity and observe that a similar subset of training tokens sees the most significant reduction in loss across these model variants.

  • Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? (2025) studies the choice of pre-training checkpoints to maximize downstream fine-tuning performance. If the perplexity of the \(A_1\) checkpoint is lower than that of the \(A_2\) checkpoint, does it tell us that \(A_1\) will perform better than \(A_2\) on the SFT task? The authors find that conventional perplexity correlates poorly with how well a model will do after supervised or instruction fine-tuning, or on reasoning tasks. Task-specific metrics are more reliable, and potentially so are the unsupervised proxy metrics that they propose.

  • Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality (2025) investigates SFT data properties (e.g., perplexity of SFT data given the base model, as in our toy example above). The authors use the same SFT datasets on various base models and observe that SFT data with a lower perplexity, that is, data whose patterns the base model finds easier to predict or more familiar, consistently led to greater improvements in downstream task performance. This aligns with our previous note about matching the SFT data to the base model’s data domain, and the need for domain adaptation through continued pre-training. Conversely, given a base model X, we can use perplexity to compare and rank different SFT datasets, which allows us to efficiently evaluate and select SFT datasets for a given base model.

  • Paloma - A Benchmark for Evaluating Language Model Fit (2024) evaluates perplexity separately on many distinct domains, rather than measuring perplexity on all text as one unit in the pre-training phase. Given that pre-training data is typically a mix of many domains, this is a useful benchmark to understand the model’s fit to specific data domains. The paper shows that assuming a good perplexity score on one distribution extrapolates to all others is flawed.
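The last point, that a single pooled perplexity can hide poor fit on individual domains, is easy to see with a toy calculation. The domain names and per-token losses below are invented for illustration and are not taken from Paloma or any other benchmark.

```python
import math

def perplexity(nlls):
    """exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-token NLLs for an evaluation set mixing two domains.
domain_nlls = {
    "web": [1.0] * 8,    # many tokens, well modeled
    "code": [3.0] * 2,   # few tokens, poorly modeled
}

per_domain = {name: perplexity(nlls) for name, nlls in domain_nlls.items()}
pooled = perplexity([nll for nlls in domain_nlls.values() for nll in nlls])

for name, ppl in per_domain.items():
    print(f"{name}: {ppl:.2f}")
print(f"pooled: {pooled:.2f}")
```

Because the pooled figure is token-weighted, it is dominated by the larger, easier domain and sits far below the perplexity of the smaller one. Reporting per-domain values alongside the pooled number avoids that blind spot.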

Summary

It’s fascinating to see how a concept as old as perplexity continues to shed light on modern fine-tuning and alignment of LLMs. It’s not a silver bullet, but a useful diagnostic, one that bridges information theory with today’s model behavior. As models get larger and training objectives more complex, revisiting these fundamentals feels less like nostalgia and more like grounding! The recent literature in this area is intriguing, and I hope to continue exploring this topic in the future.
