Vector & Verse

We are all architects now. Congrats on the promotion.

2026-03-20T01:00:00+00:00

In the last month or so, I have been working on a greenfield project that does a lot of complex things. The code is robust to a reasonable degree, quick, and works as expected. I, however, barely know the code other than the structure I communicated in the design conversations. In every sense, I have orchestrated a full-fledged system that may very well be a Frankenstein one, but it works as expected. I know a little more than I care to admit here. I also know the memory and space constraints and optimizations that need to be done at scale.

Today I spent some time cooking up a two-pointer fuzzy matching algorithm for an evaluation framework. I wrote up the specs and gave them to Claude, asking it to develop the algorithm in a TDD fashion. It wrote up several unit tests covering the two-pointer scenarios and built the evaluation algorithm in one shot. In subsequent iterations, we optimized for performance with review sub-agents.

The question of identity

Suffice it to say, less than 3 months into the year, my function has changed drastically. My previous hesitations around AI taking away my identity are both true and false. See my previous post 5,000 Lines of Trust for more on this. However, the skillsets I have built over a lifetime around systems thinking and product intuition are helping me function at a pace bottlenecked by my own consumption and feedback pace, and processes we have built as guardrails.

2 meta questions on that front (these are exploratory thought experiments, not proposals or recommendations):

A. If your code was written with AI and reviewed by AI, do you need a human reviewer for every PR, other than yourself?

B. If we let a long-running agent build the entirety of the last month’s work with occasional human feedback, do we need human developers to raise PRs?

These aren’t questions of panic. Rather, they are written from a vantage point of reflection on possibilities, one that is incredibly freeing. That context setting makes me wonder about the purpose of a human in the loop here. From a development view, I see us as architects. From a PR review perspective, we remain as policy setters. Mapping back to the architects, the job was always to shape systems and streamline execution without causing roadblocks. If an AI reviewer is able to reliably assess that the code changes followed the architectural principles [with review, tests, integration], adding myself as a man in the middle is a bottleneck that is vestigial and ceremonial. Essentially, I don’t need to check if the pipes are leaking (the AI agent already did that); I need to check if the pipes are leading to the right room.

We’re entering an era where systems work before they are understood. I expect both personas to see a rapid transformation, where the intent validation gets more emphasis over the implementation specifics.

The paradox of surface architects

“Code was always the medium. Judgment was always the craft.” (Ahmad Al-Dahle, The Future of Software Engineering Isn’t What You Think)

Systems thinking has always been the key. Before the rise of AI agents, it was a craft one developed as they hand-wrote systems from scratch, integrated various systems as components of a mega system, and architected a system into a general-purpose platform that accommodates multiple use-cases. Now, AI agents can do that in a few hours, while ironically and simultaneously estimating the effort to several weeks.

In a different setting, we once pushed back against the product team because the specs didn’t make sense. In the new world, we are increasingly expected to be both engineers and product thinkers. We need both the why and the how in the design thinking, as we instruct our coding agents to perform the deed.

So the engineers of the pre-AI-agent era built intuition the hard way. What does that entail for the new grads? Or even for engineers working on new areas that they don’t have deep expertise in? I think there is a risk of creating a generation of “surface architects” who can describe a system but cannot diagnose a failure when the abstractions leak. How can they reason about the failure modes of this system? A recursive loop of relying on agents to fix the code written by agents leaves me unsettled. Today, we can use different models as reviewers to prevent model family biases, but it still leaves a lot to be desired from a trust standpoint.

These days, I find myself sending a review agent to do daily code reviews on top of per PR code reviews to take into evaluation purview a larger granularity of code and come up with a strategic plan of optimizations. I am still in the loop, asking specific design questions and rejecting choices of solutions or things that are wrong, most likely due to a lack of context specification.

From that vantage point, perhaps the way we develop new intuitions lies in deep observability and debugging. Effectively auditing systems like a systems pathologist forces one to look at gaps and flaws. We could rely on AI to audit, but to avoid falling into the circular reasoning traps, the focus moves to effective hypothesis building. This requires having a clear mental model of the system before auditing for the answer.

I think there’s value in this “auditing” regardless of experience in this field. For one, this adversarial testing mindset can actually help us create sound arguments that can be validated against observed behaviour. Second, intuition is built on the things that cannot be hallucinated, i.e., must rely on observed behaviour.

Looking forward

We now coordinate AI agents to think across the product stack from design to infra and across various interconnected components. We get continuous feedback from AI like drinking water from a firehose, faster than we can synthesize. In many ways, our job has evolved to synthesize that feedback into a format consumable by the next iteration with our intuition.

For example, I am seeing noticeable productivity gains in assessing existing systems with AI-generated documentation. We can effectively do incisive integration assessments at such a rapid pace that it’s almost unsettling. Now, it’s about validating the spec generated by AI, which more and more is a cursory glance to see if something is off.

I think we have fundamentally new problems in this state space. For example, what should an interview look like? If an agent can solve a convoluted two-pointer problem in minutes, what signal does Leetcoding give other than serving as a rejection filter? We should shift toward debugging systems or refining specs, i.e., giving a candidate a flawed AI-generated architecture and seeing if they have the intuition to spot the “off” note, potentially without using AI.

We also have a real possibility of free fall into semantic fragmentation in code: slightly different versions of the same logic spread across the code base. For example, the two-pointer fuzzy match algorithm I asked for could be reused for various use-cases, yet be rewritten each time with slight variations to accommodate the different needs. A good engineering pattern is to ensure the system maintains a single source of truth, especially when it’s easier to generate a new implementation than to find and refactor an existing one.
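As a rough illustration of the single-source-of-truth idea, here is a minimal sketch of one shared matcher that different use-cases configure rather than rewrite. This is not the algorithm from my project: the greedy skip rule, the names `fuzzy_align`, `char_ratio`, and `exact`, and the 0.8 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

Similarity = Callable[[str, str], float]


def char_ratio(a: str, b: str) -> float:
    """Case-insensitive character similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact(a: str, b: str) -> float:
    """Strict variant: 1.0 only for identical strings."""
    return 1.0 if a == b else 0.0


def fuzzy_align(
    expected: Sequence[str],
    predicted: Sequence[str],
    similar: Similarity = char_ratio,
    threshold: float = 0.8,
) -> List[Tuple[int, int]]:
    """Walk both ordered lists with two pointers and return matched index pairs."""
    i = j = 0
    pairs: List[Tuple[int, int]] = []
    while i < len(expected) and j < len(predicted):
        if similar(expected[i], predicted[j]) >= threshold:
            pairs.append((i, j))
            i += 1
            j += 1
        elif len(expected) - i > len(predicted) - j:
            i += 1  # more expected items remain: treat expected[i] as missed
        else:
            j += 1  # otherwise treat predicted[j] as spurious
    return pairs


# Hypothetical data: two callers share one implementation and differ only in configuration.
expected_steps = ["refund request", "transfer to agent", "close ticket"]
predicted_steps = ["refund reqest", "close ticket"]

fuzzy_pairs = fuzzy_align(expected_steps, predicted_steps)
strict_pairs = fuzzy_align(expected_steps, predicted_steps, similar=exact, threshold=1.0)
print(fuzzy_pairs, strict_pairs)
```

The design choice is that a new need becomes a new similarity function or threshold, not a new copy of the pointer logic.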

The question of the need for a human reviewer also comes from data showing AI-generated PRs actually wait 4.6x longer for review than human ones (source: AI Technical Debt: 30-41% Increase Hits Developers). As discussed above, building auditor agents can help, but how do we build long-term trust in this coordinated setup to keep code maintainable? Trust is built over time, but it only needs one major production incident to break it.

TL;DR

The cost of doing has dropped to zero, leaving only the cost of deciding. We are moving away from the aesthetic of clean code toward the pragmatism of viable systems; we have moved away from striving for perfection to effectiveness. If the code is a Frankenstein, let it be, provided we are not its victim. It is a terrifying level of responsibility, and an incredibly freeing one. It’s time to take responsibility for the one thing agents cannot: the why.

5,000 Lines of Trust

2026-02-07T01:00:00+00:00

Alt title: Engineering Agency in the Age of Agents

Amidst the humdrum hustle of a warm San Francisco Friday, my little command line agents are writing code, endlessly whirring, pontificating, and moseying. I interject here and there and respond to questions; I am heavily involved in the initial planning phase, but I take a less active role once the team begins the execution of the work. I review the code after I have a second team of agents review the code. I perform the remnants of testing before I push my code.

My code… Is it still my code? At my very core, I am an engineer. I love building castles out of code and data. I used to cherish knowing the ins and outs of a system that my human team and I built. I may not have known every tiny bit of it, but I knew how to debug them and how to look at logs in the absence of metrics to find the needle in the haystack.

Now, here I am with 5000 lines of code in a single PR, every bit production-ready, and I am heavily reliant on extensively verbose documentation to understand the intricacies of the system. (I wouldn’t describe myself as 100% reliant on the agents, as I still own the design of the system. I also must mention that this line is a bit exaggerated in this musing. We are still a bit away from cooking up something so complex and production-ready instantaneously. Cycles of conversations with agents seem to do the trick, but the end result isn’t a dramatic improvement. Also, see Anthropic’s research on this topic.) My mind is tense at the amount of information it is suddenly consuming. Perhaps the easy way out is to trust the agents. That trust comes with the acceptance that I will be further reliant on them to debug issues.

This shift forces a confrontation with the nature of volition itself. In the old world, my agency was expressed through the doing. Now, my agency is expressed through will. From “How do I implement this?” to “How do I ensure this system represents what I wanted?”, the stakes have become higher.

I am left with hesitation, though; I do not have all the answers yet. This year feels both magical and calamitous, all at once. It’s like a speeding train that does not seem to have brakes. There is always great joy and purpose in creation, but there used to be a greater joy in the journey of it. And now, that’s gone.

My work identity has evolved in the last 8 months. I can still write code, but I am now more of a designer and an architect than I ever have been. At times, I wonder at the grandiose nature of the systems being built out of thin air. I marvel at them, but very differently from the ones where human teams breathed life into them through hard work for months on end, in iterations. Every such thing suddenly feels pointless in its old form. That’s what confounds the most, I think – a loss of familiarity, forcing a search for a new kind of gravity.

These reflections meander to intrinsic motivations. Creating designs brought joy because it was like assembling a jigsaw puzzle. Fixing bugs was like embarking on a grand quest. At the end of it, there was an answer, an explanation, an understanding. The rhythms of team dynamics, pair programming, and rubber-ducking made work interesting. While the human team dynamics haven’t drastically changed, the invisible dynamics with AI agents have. So, now what?

I have no qualms about automation or AI. (I believe we will eventually achieve AGI, and hope that it is achieved safely and sustainably.) The pace of transformation leaves an unease because my internal mechanics need constant reframing. I enjoy building products end-to-end, so it’s no longer about castles, but rather the cities. I am finding myself deeper in the grips of understanding systems, so it’s the workflows, and not the code. In some crude sense, it’s about deciding why the haystack needs to exist at all. My focus is shifting deeply toward understanding the motivations, rather than just the artifact of these agents: the code itself.

I am excited for the things that are now within my grasp, much more than ever before. I can teach myself faster than ever before with the help of a team of assistant agents. I can optimize my life cheaply for monthly agent subscriptions (these encompass health, dating tips, career advice, financial planning, etc.) at the expense of privacy. Everything is at my fingertips. I just need to instruct. I am the conductor, the orchestrator. I am now an urban planner, instead of a stone mason. Everyone is.

I feel like I have zoomed out; that’s why I think in terms of cities, and less in terms of castles. But equally, and in the opposite way, I think about how these castles are being built. What hidden structures lie beneath these walls that tell us more about their creators themselves? Perhaps I am trying to be more than just a man, vying for power, power that I once seemed to have, moulded differently. And perhaps, the answer to that quest lies in the psyche of the system.

I’m intrigued by the inner workings of increasingly complex agents, even if I don’t write all the code. I want to understand how my intent is carried out once it is no longer directly in my control.

In these newfound promotions, I wonder what my signature in the streets is. There is no more the tactile joy of a chisel. Like the intimate act of writing these verses evokes joy, despite the pen now a keyboard and the paper a Google doc, the act of creation remains. The journey ahead is understanding the intent, not just in the code, but in the systems it shapes and the paths it sets in motion.

What Interests Me in 2026

2026-01-28T12:00:00+00:00

I’m broadly interested in questions of alignment and interpretability, modeling human behavior, and building NLP systems for specialized domains like healthcare, where reasoning, representation, and operational constraints matter as much as raw performance.

Arrival and Contact are two of my favorite films – not for the aliens, but for the problem they revolve around: interpretation under epistemic uncertainty. In both stories, the question of alien contact isn’t about intelligence or capability, but about representation. Any action depends on finding a shared language that allows meaning to be communicated reliably; the part I absolutely love in both movies: interpretation is the action.

ML has always been about learning the right frame of view that maps questions to expected answers, which is why my sentiments around this xkcd comic remain complicated. The opacity or failure often reflects a representational mismatch, not an absence of structure. We could influence the resulting model by modifying architectural choices, adjusting hyperparameters, and varying the volume of training data. Productizing such ML systems works precisely because they align with the real world.

Similarly, large neural systems can be tuned to exhibit stable, predictive signals even without the ability (or the need) to create an explanation for those outcomes. The model underneath is free to learn a representation as long as it eventually maps to the outcomes we desire. There’s more likely to be a rich structure in this latent space that we don’t understand yet because we haven’t used (or created) the right language. Irrespective of building interpretability mechanisms post-hoc or during the model training, the core problem is about a translation. There are tools like linear probes, concept activation vectors (CAVs), sparse autoencoders (SAEs), and steering vectors, all attempting to learn a change of basis within the activation space of neuron layers. As much as we hope to learn something (feature/concept dictionaries) out of this, we don’t know if they are all actually useful in the model’s outcome necessarily (see Anthropic’s work on attribution graphs, especially around faithfulness). What is the model thinking? How do we tell whether a concept is present, causal, and interacting with others inside a model’s representation?

I am motivated by operationalizing this mode of understanding for safety-sensitive domains and scientific advances (for example, Goodfire’s research on applied interpretability to identify new Alzheimer’s biomarkers). We don’t need perfect explanations, but we do need reliable signals that can enhance the auditability of an ML system; these signals can simply be learned directions in some geometric space. For example, concept vectors can accompany LLM judges as a second stage: learned directions that effectively gate decisions under thresholds tuned on data. Much like in Arrival or Contact, we don’t need to fully understand the language to act safely, but we do need a translation we can trust.
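To make the second-stage gate concrete, here is a minimal sketch under loud assumptions: the activation and the concept direction are plain lists of floats, the threshold has already been tuned on labeled data elsewhere, and the names `cosine` and `gated_decision` and the three outcomes ("flag", "review", "pass") are hypothetical, not taken from any interpretability library.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between an activation and a learned concept direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gated_decision(
    judge_flag: bool,
    activation: Sequence[float],
    concept_direction: Sequence[float],
    threshold: float,
) -> str:
    """Combine an LLM judge verdict with a concept-direction check."""
    concept_present = cosine(activation, concept_direction) >= threshold
    if judge_flag and concept_present:
        return "flag"    # both signals agree that the concept is present
    if judge_flag or concept_present:
        return "review"  # the signals disagree: send to a human for audit
    return "pass"


# Toy 2-d example; real activations would come from a model layer.
direction = [1.0, 0.0]
print(gated_decision(True, [0.9, 0.1], direction, threshold=0.7))
print(gated_decision(True, [0.1, 0.9], direction, threshold=0.7))
```

The disagreement branch is the useful part for auditability: it surfaces exactly the cases where the judge and the representation tell different stories.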

I’ve been drawn to representations for a long time. During my Master’s work, I was motivated by the idea of modeling human behavior by learning the right joint latent space for users and items on temporal and geographical dimensions. That instinct has only strengthened as models have grown larger and less transparent. I am excited by the prospects of looking deeper into the pile, from various frames of reference, to understand a model’s decision making. What secrets of the universe do these mysterious activations unfold?

This perspective also shapes how I think about safe alignment. When an LLM does something fundamentally “mis-aligned”, we want to be able to measure this and attribute it to incorrect reward modeling or poorly specified constraints. For example, in high-recall safety systems where flagging rare events is important, we may be able to get LLMs to do this. Without access to the internal representations, we are forced to either trust these opaque outcomes or disregard them, as we are unable to justify them.

More broadly, I remain interested in the dynamics of real-world ML systems that sit at the intersection of modeling, evaluation, and deployment. Success here has stemmed from careful framing, experimentation, and disciplined evaluation. The complete systems lifecycle, encompassing data gathering, model construction, metrics monitoring, feedback loops, and analysis of failure modes, remains a compelling area of focus.

This is our current translation. It may not be the best one for the questions we are still asking. It’s also unlikely to be the final language. I am excited by these prospects.

The Art of Shipping ML

2026-01-18T12:00:00+00:00

In this blog post, I reflect on the things that stood out to me while shipping multiple ML products over the last year. These reflections represent solely my personal views and experiences, not those of my employer. Examples referenced are either drawn from publicly shared company materials or are purely hypothetical. In 2025, AI transformed product development workflows in remarkable ways. This past year, I kept myself busy improving and scaling our existing ML model pipelines. I demonstrated value in a 0 → 1 ML use-case that is now a core part of the ecosystem, going beyond the 0 → 1 pit into a stable state. I explored replacing existing ML systems with cheaper and better systems, including LLMs and simpler linear probes, balancing speed and scalability. Some things, like the time to create V0s or MVPs of an ML product, went down drastically. In some other cases, LLMs completely replaced traditional ML models. However, the core principles of the ML lifecycle became even more prominent: the most successful ML systems we deployed were characterized by an evaluation-centric design philosophy.

Broadly, in this post, I categorize these musings along the traditional dimensions of data, evaluation, cost surface, engineering systems, and feedback loops. These together are summarized as the following principles:

ML product velocity is built on “boring” infrastructure, not brute speed.
The data is never going to be 100% “clean” or “gold”. Accept the noise or, better yet, create silver data.
Optimize the cost surface, not just the accuracy of the model.
Do not shy away from ensembles.
Feedback loops are invaluable; their absence is your product’s nail-in-the-coffin.
Coding agents are indispensable luck potions, but use them well.

My TL;DR summary is:

Nothing has changed, and yet, everything has changed. It’s not in the what, but rather the how.

Shipping ML fast is built on “boring” infrastructure

While speed is a given, the true accelerators are in the pre-existence of stable architectures across data pipelines and warehouses, evaluation frameworks, training pipelines, and feedback systems. For instance, building non-standard frameworks for evaluation per use-case can not only cause maintenance overheads, but also create significant hazards of brittle evaluation. Creating evaluation datasets (real or synthetic) poses similar standardization challenges; if using real data, they also necessitate investing in data warehousing and analytics even before building a V0.

I view V0s [Version Zero] not as a fully fledged product, but rather as an opportunity for rapid learning and correction. Even existing ML systems can have V0s to replace components. As we develop these systems, we want to be able to quantify the gap between the model’s behaviour and the reality of the production environment, i.e., where are we wrong, and how wrong are we?

We must start with the end state in mind and work backwards to what a V0 must look like. The clarity of what a north star looks like helps set milestones for both short-term and medium-term goals. Even if the full north star vision is never fully realized, a forward-looking design that anticipates future requirements has the advantage that the system is inherently extensible. This avoids costly architectural overhauls in the future.

This isn’t always possible with fast timelines in mind, in which case the outline of a vision at least helps recognize deviations and whether those trade-offs are acceptable buy-ins at the moment. Tradeoffs can be design compromises, a smaller subspace of the product where the ML will be applied to, or a subpar model that can be trained iteratively and intentionally. The foundational corners that cannot be cut remain the data, evaluation, and feedback loops. These are non-negotiables and must either be implemented robustly from the start or rigorously scheduled as immediate follow-up tasks after the initial iteration.

Start from the evaluation, not the model

Most problems have varying costs for different types of errors. Knowing which failure modes are expensive is invaluable for optimization. If false negatives cannot be tolerated (for example, cancer detection), then we must prioritize recall as the critical metric during iteration. The ideal scenario is to achieve the best of both sides: perfect recall and zero false positive noise. A V0 is most likely to be a single model that will face precision-recall tradeoffs. If we can solve for high recall first, we can tune the false positive rates incrementally. Subsequent model pipelines can even employ multi-stage models to target these individual metrics.
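The recall-first idea can be sketched in a few lines: pick the highest score threshold that still meets a recall target, then look at the precision you are left with. The function names and the toy scores below are made up for illustration; a real pipeline would likely lean on an existing metrics library instead.

```python
import math
from typing import List, Sequence, Tuple


def threshold_for_recall(
    scores: Sequence[float], labels: Sequence[int], target_recall: float
) -> float:
    """Highest threshold whose recall on the labeled set is at least target_recall."""
    positives: List[float] = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not positives:
        raise ValueError("need at least one positive example")
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


def precision_recall(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when everything scoring >= threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy model scores and ground-truth labels (1 = positive).
scores = [0.95, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0]

threshold = threshold_for_recall(scores, labels, target_recall=1.0)
print(threshold, precision_recall(scores, labels, threshold))
```

With recall pinned, every later iteration can be judged on how many false positives it removes at that same operating point.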

Alternatively, noise levels can be minimized with humans in the loop who review the false positives. This reduces the urgency of tackling this type of error at the expense of reliance on domain experts who possess deep, nuanced knowledge of the problem space. (An illustrative use-case discussed publicly this year involved adverse event detection in healthcare conversations. Early approaches such as keyword search proved insufficient for rare events, leading to high-recall modeling with human-in-the-loop review to manage false positives.) These domain experts not only filter out obvious false positives but also provide valuable feedback on the edge cases that allow us to incrementally support them. These false positive labels also provide harder negatives for future evaluations. Pure synthetic data cannot do this without significant investment. (I would love for us to get here, but this is unlikely to happen at scale, generalized as a platform, without investment this year.) Domain experts in the loop make your V0 move beyond the vacuum of statistical metrics to a trustworthy, deployable product.

On the other hand, we still need evaluation toolkits and frameworks to iterate on V0. In an ideal world, these pre-exist. These evaluation frameworks connect an evaluation dataset to the model performance on the failure modes and the critical metrics we want to measure. While it isn’t possible to maintain a single framework across every ML use-case, the rise of coding agents allows taking existing robust frameworks and adapting them to new use-cases quickly. (A generalized Evals platform is also possible as a separate investment, and something I am excited to see succeed in 2026.) The existing principles of experiment tracking, per-data-point telemetry, and analysis remain vital to iterating toward a viable V0 and beyond.

This also shifts the evaluation from purely model metrics to real-world failure modes. For a model that detects dialogue breakdown on chats or phone calls, we want to know what specific kinds of breakdowns the model handles well and which cases it entirely misses. Answering these requires understanding the operational constraints of prototype ML systems as well; if the model is purely text-based, we won’t be able to assess voice-based emotional cues that could signify breakdown. Deploying a V0 in such a constrained subspace requires rigorous analytics to be able to disambiguate such cases and measure the production performance.
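One way to picture per-data-point telemetry is a record per evaluated example that carries the failure mode it belongs to, so recall can be reported per breakdown type rather than as one aggregate number. This is only a sketch: `EvalRecord`, `recall_by_slice`, and the slice names are hypothetical, not part of any existing evaluation framework.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class EvalRecord:
    example_id: str
    slice_name: str  # the kind of breakdown this example represents
    label: int       # 1 if a breakdown is truly present
    prediction: int  # 1 if the model flagged a breakdown


def recall_by_slice(records: Iterable[EvalRecord]) -> Dict[str, float]:
    """Recall computed separately for each breakdown type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        if record.label == 1:
            totals[record.slice_name] += 1
            hits[record.slice_name] += int(record.prediction == 1)
    return {name: hits[name] / totals[name] for name in totals}


# Toy telemetry: the text-only model misses the voice-cue slice entirely.
records = [
    EvalRecord("e1", "refusal_to_transfer", 1, 1),
    EvalRecord("e2", "refusal_to_transfer", 1, 1),
    EvalRecord("e3", "voice_emotional_cue", 1, 0),
    EvalRecord("e4", "voice_emotional_cue", 1, 0),
    EvalRecord("e5", "no_breakdown", 0, 0),
]
print(recall_by_slice(records))
```

Keeping the raw records around, rather than only the aggregates, is what lets you go back and read the specific examples a slice is failing on.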

The Mirage of Gold Data

The success of all ML products depends inherently on data to train and evaluate the ML systems. We have all crossed the waters of “clean” datasets like MNIST that are near-perfect with minimal label noise. We can strive to create similar conditions in real production settings. But these attempts should not predate initial iterations to a V0. Post V0, we can focus on data cleaning tasks. Until then, we strive towards a “good enough” dataset state over a “gold” standard.

Hard negatives define real systems

Consider the hypothetical task of detecting refusal to transfer in chat conversations; this is a specific type of conversational breakdown. Suppose this issue occurs rarely, in less than 1% of the chat volume. Regardless of employing a standard ML model, a fine-tuned LLM, or an LLM API, the model’s ability to perform accurately in this imbalanced setting is entirely dependent on the quality and representativeness of the evaluation (and training) data.

Given the rarity of the positive class, data gathering must be explicitly targeted. We could leverage humans and their explicit feedback or labels on conversations. A well-established data warehouse that logs every turn of a conversation is also an invaluable asset. This infrastructure enables fast analytics, allowing engineers to filter calls or conversations using specific phrases, keywords, or semantic matching heuristics. While not yielding perfect gold data, a simple SQL query can quickly surface the most straightforward examples, providing an almost “magical” starting point for data collection. We could use targeted annotation workflows where humans, LLMs, or AI-assisted humans meticulously review chat logs to identify potential positive samples. These are gold standard as well, once human preferences are calibrated for the task by measuring inter-annotator agreements.
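The “simple SQL query” step might look like the sketch below, using an in-memory SQLite database so it runs anywhere. The `turns` table, its columns, the sample rows, and the phrase list are all invented for the example; a real warehouse schema will differ.

```python
import sqlite3
from typing import List, Sequence

# Hypothetical schema: one row per conversation turn.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE turns (conversation_id TEXT, turn_index INTEGER, speaker TEXT, text TEXT)"
)
conn.executemany(
    "INSERT INTO turns VALUES (?, ?, ?, ?)",
    [
        ("c1", 0, "user", "Can I talk to a human?"),
        ("c1", 1, "agent", "I'm unable to transfer you right now."),
        ("c2", 0, "user", "What are your hours?"),
        ("c2", 1, "agent", "We are open 9 to 5."),
        ("c3", 0, "user", "Please connect me to support."),
        ("c3", 1, "agent", "Sorry, I cannot transfer this chat."),
    ],
)


def candidate_conversations(db: sqlite3.Connection, phrases: Sequence[str]) -> List[str]:
    """Conversation ids where an agent turn contains any of the given phrases."""
    if not phrases:
        return []
    clause = " OR ".join("lower(text) LIKE ?" for _ in phrases)
    sql = (
        "SELECT DISTINCT conversation_id FROM turns "
        f"WHERE speaker = 'agent' AND ({clause}) ORDER BY conversation_id"
    )
    params = [f"%{phrase.lower()}%" for phrase in phrases]
    return [row[0] for row in db.execute(sql, params)]


print(candidate_conversations(conn, ["unable to transfer", "cannot transfer"]))
```

The output is a candidate pool for annotation, not a labeled set: keyword hits still need review before they count as positives.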

Along with collecting positive examples, we need negative samples - the conversations where the refusal to transfer issue is absent. We could just use clean chats with no breakdown issues whatsoever. While these make it easy for the model’s metrics to look excellent, they deviate from true production performance.

To build a robust system, the problem space must be made harder intentionally. Hard negatives are not just the simple absence of the target issue (X); they must include samples of related but distinct breakdown issues, such as frustration, a system outage, or general confusion. These are cases that are visually or semantically close to X but are fundamentally different. The contrastiveness that boosts model performance does not simply come from discriminating between X and not X. It arises from the ability to differentiate between X, not X, and not Y. The model must learn to distinguish the specific target (refusal to transfer) from all other forms of conversational complexity or breakdown. This framing is analogous to contrastive learning setups, where the decision boundary is defined not just by positive and negative classes, but by semantically adjacent or confusing alternatives.
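A toy example shows why easy negatives flatter a model. The keyword “classifier” below is a deliberately naive stand-in for a real model, and the sentences are invented; the point is only that the false positive rate should be reported separately for easy and hard negatives.

```python
from typing import Callable, Dict, List, Sequence


def keyword_classifier(text: str) -> bool:
    """Naive stand-in model: flags any message that mentions a transfer."""
    return "transfer" in text.lower()


# None of these are refusal-to-transfer cases, so every flag is a false positive.
negatives: Dict[str, List[str]] = {
    "easy": [
        "What are your opening hours?",
        "Thanks, that solved it.",
    ],
    "hard": [
        "This is the third time I am explaining my billing problem!",  # frustration
        "Your transfer page has been down all morning.",               # system outage
        "I do not understand what a balance transfer is.",             # general confusion
    ],
}


def false_positive_rate(classify: Callable[[str], bool], texts: Sequence[str]) -> float:
    """Share of negative examples that the classifier wrongly flags."""
    return sum(classify(text) for text in texts) / len(texts)


fpr = {kind: false_positive_rate(keyword_classifier, texts) for kind, texts in negatives.items()}
print(fpr)
```

On the easy set the stand-in looks perfect; the hard set is where the semantically adjacent cases expose it.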

To gather these hard negative samples, we could apply the same annotation or analytical setups used for positive examples. We could use sparse explicit feedback about specific issues like frustration to assume it did not contain the issue we want to model. In that mode of operation, if these conversations have both issues, we assume that the human labeler or end-user felt frustration was more important than refusal. So this collection process can feel somewhat “janky,” though necessary.

In addition, we could gather these examples through creative heuristics and partial human review. We could even ask LLMs to modify clean calls (i.e., calls with no issues) to inject issues. Despite not being pristine gold labels, they are often exactly what is needed to dramatically boost a model’s performance and make the system truly “work like magic.”

The case of subtle noise

With these setups, we can chase the mirage of gold labels. In practice, it is either unnecessary to have a ton of gold labels, or impossible to obtain a high volume of pristine labels in a low-calorie manner. However, note that the presence of significant noise in the dataset will naturally prohibit any meaningful learning. These issues are easily observed though. The harder and more insidious problem is subtle noise.

For example, in annotation tasks that require human judgment and interpretation, a degree of inherent noise is inevitable. Even with rigorous guidelines, differences persist between annotators. We could measure the inter-annotator agreement (IAA) and recalibrate until we are satisfied; recalibration would involve changing the rubrics, updating standard guidelines, and extensive training. The reality, though, is that the data will still have discrepancies that are borderline human interpretation differences. Consider the preceding example where a single turn in a chat exhibits both ‘frustration’ and ‘refusal to transfer.’ In such a scenario, two different labelers are likely to categorize that instance very differently unless this scenario is codified explicitly in the annotation guidelines.
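Inter-annotator agreement can be measured in several ways; one common choice for two annotators is Cohen’s kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain implementation with invented labels; I am not claiming this is the statistic any particular annotation workflow uses.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels for six chat turns from two hypothetical annotators.
annotator_a = ["refusal", "frustration", "refusal", "none", "none", "refusal"]
annotator_b = ["refusal", "refusal", "refusal", "none", "none", "frustration"]
print(round(cohens_kappa(annotator_a, annotator_b), 3))
```

Here the annotators agree on four of six turns, but kappa is noticeably lower than that raw rate because some of the agreement would happen by chance; the two disagreements are exactly the frustration versus refusal ambiguity.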

Now we have a choice - do we keep this data? This may depend on several factors. A thorough qualitative analysis of the label distribution and the nature of disagreements can tell us a lot. This can be manual annotation analysis. This ties to understanding where and why the current model fails, particularly if it struggles with specific, ambiguous examples. We must also consider this along with the other dimension of whether the model introduces new, unexpected failures that humans don’t make.

Tainted data can be gold

Now, consider a model trained to extract data from chat conversations. We would like to mimic human reviewers who either approve or reject these extractions. In this context, we are modeling a human’s judgment, which is inherently noisy. We hope for the law of averages to normalize and even out these individual human errors. This assumes annotation noise is approximately unbiased and independent. Systematic bias or correlated errors violate this assumption and require explicit intervention (see above footnote). But if there are guideline gaps or annotation calibration gaps, these errors are no longer errors. They become features that the models need to learn to mimic, even if they are not explicitly codified. The data is “tainted” if one were to look at the objective truth of what the extracted value should be. But we aim to do at least as well as humans, in which case, perhaps counter-intuitively, this tainted data is the gold truth.
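The independence assumption can be made concrete with a small calculation. Under a simplified model that I am assuming purely for illustration (binary labels, each annotator independently correct with probability p, majority vote over an odd number of annotators), averaging helps a lot; add a guideline gap that every annotator shares and the benefit hits a ceiling.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a majority of n independent annotators is correct."""
    if n % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1)
    )


def accuracy_with_shared_gap(p: float, n: int, gap_rate: float) -> float:
    """Same vote, but a fraction gap_rate of items falls into a guideline gap
    where all annotators give the same label that differs from the objective truth."""
    return (1 - gap_rate) * majority_accuracy(p, n)


for n in (1, 5, 21):
    independent = majority_accuracy(0.8, n)
    with_gap = accuracy_with_shared_gap(0.8, n, gap_rate=0.1)
    print(n, round(independent, 4), round(with_gap, 4))
```

No matter how many annotators vote, accuracy against the objective truth cannot exceed 1 - gap_rate in this model, which is the sense in which correlated errors stop being noise and become something the model learns to mimic.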

Given that these errors can compound, we could build a robust feedback mechanism for detection, diagnosis, and correction. If we are able to dissect these error modes, we can run ablation tests, temporarily removing this data from the training set. Annotators can be retrained simultaneously so that fresh data is aligned with updated guidelines and can be integrated into the dataset. We could make the model flag high-uncertainty examples or detect these low-confidence regimes, in which case these examples receive special attention in a review queue. We could build LLM analyzers that scan and rate human annotations, too. Lastly, if it’s not a critical issue, we could simply live with it.
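One way to sketch the "flag high-uncertainty examples" idea is to route predictions with high predictive entropy to a review queue. This is a minimal sketch: the entropy criterion, the threshold value, and the example ids are all assumptions for illustration, and the threshold would need tuning on held-out data.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_for_review(predictions, max_entropy=0.5):
    """Split predictions into auto-accepted ids and a human review queue.

    `predictions` maps an example id to its class probabilities.
    `max_entropy` is a hypothetical threshold, not a recommended value.
    """
    accepted, review_queue = [], []
    for example_id, probs in predictions.items():
        if predictive_entropy(probs) > max_entropy:
            review_queue.append(example_id)
        else:
            accepted.append(example_id)
    return accepted, review_queue

# Made-up model outputs over three classes.
preds = {
    "chat-001": [0.97, 0.02, 0.01],  # confident
    "chat-002": [0.45, 0.40, 0.15],  # ambiguous between two classes
    "chat-003": [0.80, 0.15, 0.05],  # moderately uncertain
}
accepted, review_queue = split_for_review(preds)
print("accepted:", accepted)
print("review queue:", review_queue)
```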

Across projects, I’ve never seen truly “perfect” gold data. What matters is not purity, but knowing the imperfections around where the data lies and where it breaks.

Optimize the Cost Surface, Not Solely Accuracy

The optimization metrics very rarely remain the model metrics like precision and recall. This is especially true with the proliferation of LLM APIs, where the evals should be further grounded on a wider cost surface encompassing model vendors, thinking budget, modalities (text, audio, image), as well as the usage patterns, latency requirements, error modes, and total cost.

Spinning up experiments across these dimensions, even on a smaller scale, is useful to understand failure modes. You get a ton of data on what dimensions work great on specific error modes at this point in time. For example, thinking budget may help solve an edge case that requires alignment of the conversation sequence, while it may perform worse on a different kind of reasoning task. It is possible that future LLM versions that are more capable solve this by default, so knowing the coverage of error modes at any given point in time is highly useful.

The per-unit cost budget (e.g., per call or per chat) often dictates what subset of these vendor models is available for a use case. This cost analysis remains a prerequisite to understanding the compatibility of the ML system for scale. However, the prices of LLM APIs have decreased significantly and very quickly, albeit at the expense of the reliability of the service. The pay-as-you-go model is very cheap, but not as reliable on latencies or service uptime as provisioned throughput.

Lastly, the cost surface must also take into account how the ML product fits into the existing ecosystem, optimizing a further set of tradeoffs. These can include focusing on the cost of a specific type of error (false negatives vs false positives, tuned thresholds for balancing these errors). More generally, it includes the human costs associated with the ML predictions. High false positive rates that involve significant human review time for the above refusal-to-transfer issue can mean suboptimal use of human experts. Human costs include review time, cognitive load, escalation overhead, and trust calibration. There is also the somewhat qualitative assessment of iteration time across the ML system evolution timeline, and the factors influenced by the ecosystem’s evolution itself. As systems mature from V0 → V100, scale often changes. Even stable systems need to anticipate an increase in volume, so ML infrastructure scaling needs must be considered as well.

A model that achieves 99% accuracy but has a prohibitive per-call cost and long latency can be projected onto a far worse point on the cost surface than a model with 95% accuracy that is fast, cheap, and requires comparable or smaller engineering effort and minimal human intervention. Connecting this back to V0s, the concept of “shipping fast” requires rapidly finding a local optimum on the cost surface that is better than the current state, acknowledging that the initial model is merely a starting point for continuous, cost-aware iteration.
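To make the comparison concrete, here is a minimal sketch that collapses a few cost-surface dimensions into one expected cost per call. The linear cost formula, the model names, and every number are hypothetical; a real cost surface would include more dimensions (engineering effort, human review time, reliability).

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float   # fraction of correct predictions
    api_cost: float   # dollars per call
    latency_s: float  # typical seconds per call

def expected_cost_per_call(c, error_cost, latency_cost_per_s):
    """Expected dollars per call: API price + latency penalty + cost of errors."""
    return (c.api_cost
            + c.latency_s * latency_cost_per_s
            + (1 - c.accuracy) * error_cost)

# Hypothetical candidates mirroring the 99% vs 95% comparison.
big = Candidate("large-model", accuracy=0.99, api_cost=0.08, latency_s=12.0)
small = Candidate("small-model", accuracy=0.95, api_cost=0.004, latency_s=1.5)

LATENCY_COST = 0.005  # assumed dollars of business cost per second of waiting

for error_cost in (1.00, 5.00):  # assumed dollars lost per wrong prediction
    costs = {c.name: expected_cost_per_call(c, error_cost, LATENCY_COST)
             for c in (big, small)}
    best = min(costs, key=costs.get)
    print(f"error cost ${error_cost:.2f}: {costs} -> {best}")
```

With these made-up weights the cheaper model wins when an error costs $1 and loses when it costs $5, which is the point: the ranking depends on where the product sits on the cost surface, not on accuracy alone.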

In essence, the iterations that matter are the ones that move the system to a better point on the cost surface, not the ones that optimize a single metric in isolation. They are effectively aligned with progress towards the north star.

ML Lifecycle iteration

Iteration speed in ML comes from de-risking change, not aiming to replace systems instantaneously. This is essentially a software engineering principle.

Iteration without breaking production

We want to answer the question “Can this model or ML system safely influence the world yet?” Shadow mode, A/B testing, and canaries are levers that allow us to answer this question under different deployment constraints (regardless of replacement models or V0s). Shadow mode is ideal when outputs can be observed without consequence; A/B testing is necessary when outcomes depend on interaction like multi-turn conversations. Canaries allow us to incrementally expose real users while retaining rollback guarantees. Note that we could build e2e simulation systems that can accelerate these measurement workflows even before the system hits real-time production environments.

Shadow mode: We can deploy replacement models (ML System ‘B’) or even the V0 in a shadow mode deployment. A portion, or ideally all, of the live production traffic is simultaneously routed to both the existing and replacement systems. The outputs from both systems are logged and stored for analysis, even if the outputs from the shadow model don’t influence the primary workflows.

A/B testing: Shadow mode isn’t applicable in systems that involve a feedback loop, such as multi-turn conversational agents, which often require the model’s output to genuinely influence the subsequent user actions to gather meaningful data. In such cases, A/B testing can determine the impact of the new system by routing a portion of the traffic to these systems that actively influence the environment.

Canaries: With canary deployments, we could divert a small portion of production traffic to these newer versions and monitor system and business metrics. This effectively is a phased rollout where we are able to quickly detect model drifts, data skews, or regressions in the ML system (assuming a representative sample in this subset). With automated health metrics, we can roll back to a known healthy deploy if critical issues exist.

All of these levers are backed by rigorous and potentially excessive logging for future analysis. First, it allows fast verification of model parity. Second, the traceability makes it possible to explore data drifts and failure modes, as we shall see in the feedback loops section. The underlying notion remains that monitoring business metrics in production environments is still vital, and we need to repeat this process with every iteration of the ML system.
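A shadow deployment with logging can be sketched in a few lines. This is an illustrative, synchronous version with hypothetical stand-in systems; in a real service the shadow call would typically run asynchronously so it cannot add latency to the primary path.

```python
import logging

logger = logging.getLogger("shadow")

def serve_with_shadow(request, primary, shadow, log=logger.info):
    """Return the primary system's output; run the shadow system only for logging."""
    primary_out = primary(request)
    try:
        shadow_out = shadow(request)
    except Exception as exc:  # a shadow failure must never affect the caller
        shadow_out = f"shadow_error: {exc!r}"
    # Log both outputs side by side so parity can be analyzed offline.
    log({"request": request, "primary": primary_out,
         "shadow": shadow_out, "match": primary_out == shadow_out})
    return primary_out

# Hypothetical stand-ins for the existing system (A) and the replacement (B).
def system_a(text):
    return "transfer" if "agent" in text else "no_transfer"

def system_b(text):
    return "transfer" if ("agent" in text or "human" in text) else "no_transfer"

records = []  # collected in memory here; a real setup would ship these to a log store
result = serve_with_shadow("let me talk to a human", system_a, system_b,
                           log=records.append)
print(result)      # the caller only ever sees system A's answer
print(records[0])  # the disagreement is captured for later analysis
```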

The Ensemble Approach

So far, this has been an all-or-nothing approach because we need to decide: system A or system B? We don’t need to decide right away; ensemble approaches give a nice middle ground. Often, system B outperforms system A only within a specific, well-defined subset of the problem space, an “optimization surface.”

Our goal here is to eventually replace system A entirely (i.e., turn it into legacy). If System B can cover a majority of the volume (e.g., 70%), iteration speed should not prevent us from deploying this subset model at this point. It is still a win, at the expense of maintaining two systems. We must prioritize getting system B into parity, while also iterating on mistakes made by system B. It’s the best of both worlds, until we move entirely to the world of System B.

The core principle lies in identifying a subspace where System B provides a clear advantage, whether that is superior performance, higher accuracy, or lower operational cost, compared to System A. With V0s, this is often clear because of the design choices we have had to make head-on. Once this subspace is identified, the system must be able to employ cheap heuristics (e.g. simple, fast, and reliable eligibility checks) to determine if an incoming request falls within the realm of System B. Within this eligible subspace, traffic is routed to System B. Outside of it, System A maintains its role.

An advantageous surface here is being able to rely on confidence thresholds for system B. If the model’s output falls below this threshold, the system can automatically fall back to the existing System A. System A becomes a definitive safety net in maintaining the overall quality and reliability of the service, while allowing gradual performance gains in System B.
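The eligibility check and the confidence fallback described above combine into a small router. The sketch below is illustrative only: the stand-in systems, the eligibility rule, and the 0.8 threshold are hypothetical.

```python
def route(request, system_a, system_b, is_eligible, min_confidence=0.8):
    """Route to System B inside its eligible subspace, falling back to System A.

    Returns (prediction, label describing which system answered and why).
    """
    if not is_eligible(request):
        return system_a(request), "A (ineligible)"
    prediction, confidence = system_b(request)
    if confidence < min_confidence:
        return system_a(request), "A (low-confidence fallback)"
    return prediction, "B"

# Hypothetical stand-ins for the two systems and the cheap heuristic.
def system_a(request):
    return "legacy-answer"

def system_b(request):
    # Returns (prediction, confidence); the confidence is read from the
    # request only to keep this example deterministic.
    return "new-answer", request["b_confidence"]

def is_eligible(request):
    return request["language"] == "en"  # assumed eligibility rule

for req in (
    {"language": "en", "b_confidence": 0.93},
    {"language": "en", "b_confidence": 0.41},
    {"language": "fr", "b_confidence": 0.99},
):
    print(route(req, system_a, system_b, is_eligible))
```

Logging the second element of the returned tuple gives a running measure of how much volume System B actually covers, which is the number to watch as its subspace expands.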

The primary cost of this strategy is the need to maintain two production systems (A and B) simultaneously for a period of time. This dual maintenance burden can be a necessary trade-off for risk mitigation and faster iteration. In fact, V0s rarely replace existing systems entirely; they earn their way into production by expanding the subspace they can safely handle. Established feedback cycles and evaluation infrastructure can in turn accelerate the expansion of system B’s subspace to that of system A.

Feedback loops are vital for V0 → V100

A production system without a feedback loop is effectively running blind, destined for irrelevance. System health metrics provide operational feedback around latency, throughput, and error-rates. Explicit or implicit user feedback can map to business metrics through clicks, interactions, and acceptance of model predictions. We want to be able to continuously validate that the ML product is delivering value and to use that operational and user data to rapidly drive the next iteration cycle.

Telemetry of the request as it flows through the ML system provides not only observability but also real-time analytical visibility. Be it usage tokens of the LLM or the confidence scores of a model, these provide direct feedback around system health. Logging even those reasoning summary tokens is useful to further categorize and analyze distributions of these reasoning texts, which in turn allow us to detect anomalies. For example, it’s possible that there is an error in sending chat logs to the LLM API, and the majority of LLM responses bring up the lack of the chats as an issue. Observability here allows categorizing on the reasoning texts to identify such patterns.

While automated metrics are vital, human review remains the gold standard for high-quality ground truth and identifying novel failure modes. Previously, human review was mostly post-hoc. Models deployed in production will encounter edge cases and concept drift. Humans provide feedback directly as they encounter these or through annotation tasks. With LLM-driven simulations or modeling, we can front-load these human evaluations during iterations. These can be accompanied by LLM analyzers as well, but a human in the loop to review the failure modes remains essential.

Therefore, the successful transition from V0 to V100 is not a single leap but a continuous cycle powered by these deployment, logging, monitoring, feedback, and analysis best practices, aligned with the north star vision. Each time, data dictates how to improve the system next.

AI-assisted ML Engineering in 2026

In fast-paced timelines, a comprehensive “platform-first” approach isn’t always feasible and is mostly suboptimal for the earlier iterations of the ML system. However, a conscious effort to reduce the technical debt inherited with clean, well-structured, optimized code can reap long-term benefits in unintentional ways.

For example, in a scenario where the new feature’s productionization piece shares 70% commonality with established patterns, the path forward becomes significantly less burdensome. Even if the immediate goal requires a small amount of duplicate code for a V0 or initial release, the critical advantage is that the team is not starting from scratch. This is because we know the existing system has been battle-tested.

The momentum generated by having a clean, familiar foundation is everything. It drastically reduces cognitive load, accelerates development, minimizes the surface area for new bugs, and ensures that the team can iterate rapidly. The later efforts to refactor, generalize, or integrate the feature into a broader platform become much more straightforward and less costly. These active prioritizations directly influence the agility of the team.

On the aspect of agility, the current generation of LLMs powering tools like Codex or Cursor is a powerful force multiplier. The capacity of these assistants to rapidly generate initial versions of code, content, and even entire systems has dramatically reduced the cost and time required to build a V0. The culture has quickly moved to “experiment-first” without the lengthy pre-launch development cycles.

However, the true value of these assistants is unlocked by teams with established discipline and expertise. If the team already possesses a robust framework for measuring performance, tracking metrics, and conducting rigorous A/B testing, LLMs can dramatically accelerate their development cycles. This allows ML Engineers to focus on the planning: higher-level architectural and strategic problems around data gathering, model construction and evaluation. Without existing infrastructures or clarity around these processes, these agents can obscure fundamental errors. Knowing what to measure and how to interpret the results remains the paramount skill. The AI assistants act as a highly productive junior partner, but the engineers must still serve as the senior architect and as a quality control layer.

In the past year, synthetic data has come up constantly as a way to unlock new ML use-cases. Historically, creating annotations for entirely new tasks has been cost-prohibitive and time-consuming. LLMs are transforming this remarkably. With zero or few-shot examples, we can use LLM annotators with great accuracy, bootstrapping human labelers. We are now able to generate realistic synthetic training examples with multi-dimensional constraints (persona, tone, emotion, speech errors, noise etc.) really well, expanding the diversity of these silver datasets. This also unlocks simulation frameworks without needing to curate a clean gold dataset. These advancements also unlock adversarial simulations that offer unprecedented testing of the system even before we have deployed it.

Given these new quests in the ML lifecycle, evaluation frameworks aren’t just about the core ML system. Even your LLM judge needs human calibration to some extent (even if it’s just the engineer creating it). Everything is now about evaluation discipline. Evaluation must continue to rigorously assess complex attributes like coherence, safety, and faithfulness, moving beyond model metrics.

None of the principles we talked about above have changed; the problem space has changed. Even prompt optimization with techniques like DSPy is converging to familiar patterns; we effectively get a hybrid data-driven modeling approach to tuning prompts. It’s still not a free lunch, and that’s the point.

Conclusion

There is a low cost and speed barrier to V0s with LLMs, coding assistants, and the ease of creating synthetic data. However, my experience so far has been that these advances amplify the underlying hidden technical debt in ML systems.

Fast iteration does not come from better models alone. It comes from knowing what to measure, having data that reflects real-world ambiguity, understanding the cost surface you are optimizing over, and building feedback loops that teach the system what matters in production. Importantly, these are not new ideas, but they are easier to ignore when progress feels rapid and instantaneous with AI assistants.

The teams that ship ML reliably design systems that can be observed and corrected without breaking production. They treat V0s as instruments for learning, not endpoints. They accept imperfect data early, invest in evaluation discipline, and let systems evolve their way into production.

AI-assisted development changes how quickly we encounter these decisions, not whether we have to make them. In that sense, the art of shipping ML remains the same: reducing uncertainty faster than the problem shifts, without losing visibility into why a system behaves the way it does.

A meta note is how useful AI assistants have been in reviewing early drafts of my blog posts. This feels like a writer-editor collaboration where the writer still has the agency of these edits. In full transparency, this post was written entirely by hand, fed to ChatGPT as a first draft for section by section critique, and edited by me based on suggestions made (mostly good, and some utterly non-sensical). Gemini was used to generate images from the finalized contents of the sections along with a draft drawing outlined on Excalidraw. My speed has increased, but these are still my words.

I asked ChatGPT to summarize this post in one line, and it was aptly the following:

Speed is a consequence. Measurement is the work.

Perplexity: The Poetry of Uncertainty

2025-10-25T12:00:00+00:00

Perplexity

I’ve been revisiting the concept of perplexity, especially in the context of large language models (LLMs) and supervised fine-tuning (SFT). This post summarizes my notes and reflections on the topic, and I plan to update it as I continue to explore recent research and interpretations of perplexity.

This blog post is organized as follows:

Math Breakdown
Perplexity Across Training Phases
Deeper insights into perplexity
Summary
References

In the first section, I discuss the math behind perplexity. If you want to skip the math or the geometrical intuition, here’s a quick recap. Perplexity is a measure of uncertainty i.e. how surprised (“perplexed”) we are by a given outcome. Reading a sentence one word at a time (like right now), we make intuitive choices about the next word based on the context of the previous words. If we have too many equally likely choices, the uncertainty increases, and hence the surprise if we are wrong. For a language model, perplexity indicates the amount of uncertainty in predicting the next word, i.e. how well does it model the language in the corpus.

Math Breakdown

To understand what perplexity actually measures, let’s unpack its derivation from entropy and cross-entropy.

1. Entropy & Uniform Distribution

For a discrete probability distribution \(p(x)\) where \(x\) can be a word / sub-word / character (or simply, a token), the entropy of this distribution is defined as:

\[H(p(x)) = -\sum_{x} p(x) \log_b p(x)\]

If at each time step, we assume \(N\) equally likely choices (i.e., a uniform distribution), then

\[H(p(x)) = -\sum_{x} \frac{1}{N} \log_b \frac{1}{N} = \log_b N\]

or, equivalently,

\[N = b^{H(p(x))}\]

This \(N\) is called perplexity, which we can rewrite as:

\[N = \prod_i p(x_i)^{-p(x_i)}\]

where the product runs over every outcome \(x_i\) in the support of the distribution.

The distribution, in reality, is non-uniform for language modeling. Some tokens are much more likely than others, and the next token’s \(p(x)\) depends on the context; i.e.,

\[p(x_t|x_{1:t-1})\]

where \(x_{1:t-1}\) are the previous tokens in the context window. In non-uniform cases, the entropy is less than the uniform entropy:

\[H(p(x)) = \log_b N_{\alpha} < \log_b N\]

Perplexity \(N_{\alpha}\) for non-uniform distributions can be seen as the effective number of equally likely choices, i.e., the perplexity is “as if” uniformly picking among \(N_{\alpha}\) options. This value is smaller than the uniform perplexity since highest perplexity is equivalent to knowing nothing about the next token, and predicting one of the vocabulary tokens with equal probability.
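A quick numerical check of the "effective number of choices" reading, using two toy distributions over four tokens (the probabilities are made up for illustration):

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy H(p) of a discrete distribution, in units of log base `base`."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

def perplexity(probs, base=2.0):
    """Perplexity b^H(p): the effective number of equally likely choices."""
    return base ** entropy(probs, base)

uniform = [0.25, 0.25, 0.25, 0.25]  # N = 4 equally likely tokens
skewed = [0.70, 0.10, 0.10, 0.10]   # same support, one dominant token

print(perplexity(uniform))  # approximately 4, the number of choices
print(perplexity(skewed))   # strictly between 1 and 4
# The product form of perplexity gives the same value as b^H.
print(math.prod(p ** -p for p in skewed))
```

The skewed distribution still has four possible tokens, but its perplexity behaves as if the model were choosing uniformly among fewer than four.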

2. Approximating the True Distribution

Since the true \(p(x)\) is unknown, language modeling aims to find an approximation \(q_{\theta}(x)\), where \(\theta\) are the model parameters. From information theory, the goal is to minimize the KL divergence between \(q(x)\) and the true \(p(x)\); in effect, training tries to make the model’s predicted distribution as close as possible to the data’s true token distribution.

Cross-Entropy

The cross-entropy between \(p(x)\) and \(q(x)\) is:

\[H(p(x), q(x)) = -\sum_{x} p(x) \log_b q(x)\]

KL Divergence

The KL divergence is:

\[D_{KL}(p(x) || q(x)) = H(p(x), q(x)) - H(p(x))\]

Cross-Entropy Perplexity

The cross-entropy-based perplexity (PPL) follows as:

\[PPL(x) = N = b^{H(p(x), q(x))} = b^{-\sum_{x} p(x) \log_b q(x)}\]

This leads to:

\[PPL(x) = b^{H(p(x))} \times b^{D_{KL}(p(x) || q(x))}\]

and thus,

\[\text{Model perplexity} = \text{True perplexity} \times \text{KL Divergence Penalty}\]

The true perplexity is the theoretical minimum perplexity that can be achieved by the model, and the KL divergence term serves as a penalty factor (always at least 1, since \(D_{KL} \geq 0\)) for the model’s imperfect approximation of the true distribution. Because \(H(p(x))\) does not depend on \(q(x)\), minimizing KL divergence is equivalent to minimizing cross-entropy and, by extension, model perplexity.

\[\arg \min_{q(x)} D_{KL}(p(x) || q(x)) = \arg \min_{q(x)} H(p(x), q(x))\]
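The decomposition can be verified numerically. The sketch below uses natural logarithms (so \(b = e\)) and two made-up distributions standing in for the true \(p\) and the model \(q\):

```python
import math

def entropy(p):
    """H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) in nats; assumes q is non-zero wherever p is non-zero."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]  # hypothetical "true" distribution
q = [0.4, 0.4, 0.2]  # hypothetical model distribution

model_ppl = math.exp(cross_entropy(p, q))
true_ppl = math.exp(entropy(p))
penalty = math.exp(kl_divergence(p, q))

print(model_ppl, true_ppl * penalty)  # the two values agree
print(penalty)                        # at least 1; equals 1 only when q matches p
```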

3. Empirical Computation

Lastly, these equations still operate under the assumption that the true distribution \(p(x)\) is known. In practice, we only have observed samples from the true distribution (aka, the training corpus). So we estimate the cross-entropy empirically. Assuming samples \(x_1, x_2, ..., x_N\) are drawn from the true distribution (here \(N\) denotes the number of observed tokens, not the perplexity from earlier), the Monte Carlo approximation of the cross-entropy is:

\[H(p, q) \approx -\frac{1}{N} \sum_i \log_b(q(x_i))\]

where \(x_i\) are observed samples and \(H(p,q)\) is the sample average (notice this is just the average negative log-likelihood). This follows from the Asymptotic Equipartition Property.

Consequently, perplexity can be estimated as:

\[PPL(x) \approx \prod_i \left(\frac{1}{q(x_i)}\right)^{1/N}\]

This is a geometric mean of the inverse model probabilities, i.e., perplexity is the weighted average factor by which the model is “surprised” on predicting the next token. This effectively is the weighted average branching factor at every time step - the number of possible next words that can follow a word (Speech and Language Processing by Jurafsky and Martin).
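In code, the empirical estimate is just the exponentiated average negative log-likelihood. The token probabilities below are made up; in practice they would be the model's probabilities \(q(x_i)\) for each observed token.

```python
import math

def empirical_perplexity(token_probs):
    """Perplexity from the model probabilities q(x_i) of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(q) for q in token_probs) / len(token_probs)
    return math.exp(avg_nll)

token_probs = [0.5, 0.25, 0.5, 0.25]  # hypothetical q(x_i) for four tokens

ppl = empirical_perplexity(token_probs)
# Same quantity written as the geometric mean of inverse probabilities.
geo_mean = math.prod(1 / q for q in token_probs) ** (1 / len(token_probs))

print(ppl, geo_mean)
```

Working in log space, as `empirical_perplexity` does, avoids the underflow that the direct product would hit on long sequences.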

Perplexity Across Training Phases: From Learning to Alignment

In this section, we explore how perplexity can be used to interpret learning and alignment across model and data dimensions.

Pre-training

During pre-training, the primary objective is to minimize the model’s negative log-likelihood; perplexity directly measures progress on this goal. Because perplexity reflects how well the model captures the statistical structure of language, it serves as a strong intrinsic metric for evaluating language understanding during this phase. Unlike measuring the raw probability assigned to an evaluation set, which diminishes with longer sequences, perplexity provides a per-token view, making results more interpretable. However, perplexity does not indicate downstream task performance, such as factual accuracy or reasoning.

If we have a model A with two pre-training checkpoints, \(A_1\) and \(A_2\), where \(A_2\) is further trained than \(A_1\), a lower \(PPL(A_1)\) compared to \(PPL(A_2)\) (on the same evaluation set) would suggest that \(A_2\) has degraded in performance. Thus, perplexity is valid for comparing pre-training checkpoints or model architectures, as long as tokenization remains consistent.

Musings on Post-training

Now, consider post-training such as supervised fine-tuning (SFT) for a domain-specific task. What does the perplexity of the SFT training data, measured under the base model, reveal? For instance, suppose we have a base model Llama-2-7b-hf and we want to instruction fine tune it to a healthcare question answering task. An example instruction pair could be:

Human: Explain what an EOB (Explanation of Benefits) is.
Assistant: An EOB is a statement sent by a health insurance company that explains what medical treatments or services were paid for, what was not covered, and why

Calculating perplexity on SFT data helps assess how familiar a model is with the new domain or instruction format and how well it can predict such data. Comparing perplexity values under different models and data alignments quantifies domain shift and the effectiveness of post-training or fine-tuning. In the following code snippet, we calculate perplexity values under these conditions.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, math

def get_model_tokenizer(model_name):
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.float16, device_map="auto")
    model.eval()
    return tokenizer, model

def ppl_fullsequence(model_name, texts):
    """ Calculate PPL for raw text sequences"""
    tokenizer, model = get_model_tokenizer(model_name)
    losses = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.inference_mode():
            loss = model(**inputs, labels=inputs["input_ids"]).loss.item()
        losses.append(loss)
    return math.exp(sum(losses)/len(losses))

def ppl_assistant_only_chat(model_name, pairs):
    """Calculate PPL for chat template pairs by only looking at assistant tokens (aligns with Llama-2-chat-hf's instruction template)"""
    tokenizer, model = get_model_tokenizer(model_name)

    losses = []
    for example in pairs:
        messages = [
            {"role": "system", "content": "You are a helpful, respectful and honest assistant."}, # system instruction
            {"role": "user", "content": example["user"]}]
        # 1) Get token IDs for the prompt exactly as the model expects
        prompt_ids = tokenizer.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,  # indicate start of assistant response
            return_tensors="pt"
        ).to(model.device)               # shape: (1, prompt_len)

        # 2) Tokenize assistant without adding BOS/EOS again
        assistant_ids = tokenizer(
            " " + example["assistant"],   # add space to separate prompt from assistant
            add_special_tokens=False,
            return_tensors="pt"
        ).input_ids.to(model.device)     # shape: (1, ans_len)

        # 3) Build full input ids and labels, and mask prompt tokens
        full_ids = torch.cat([prompt_ids, assistant_ids], dim=1)  # (1, prompt_len+ans_len)
        labels = full_ids.clone()
        prompt_len = prompt_ids.shape[1]
        labels[:, :prompt_len] = -100  # ignore prompt in loss

        with torch.inference_mode():
            loss = model(input_ids=full_ids, labels=labels).loss.item()
        losses.append(loss)

    return math.exp(sum(losses)/len(losses))

Let’s calculate the perplexity values with the following data which is similar to Llama-2-chat’s training data.

raw_texts = [
    "Human: What type of species is a orangutan?\nAssistant: An orangutan is a species of ape.",
    "Human: What is the capital of France?\nAssistant: The capital of France is Paris."
]

# Equivalent to:
pairs = [
    {
        "user": "What type of species is a orangutan?",
        "assistant": "An orangutan is a species of ape."
    },
    {
        "user": "What is the capital of France?",
        "assistant": "The capital of France is Paris."
    }
]

base = "meta-llama/Llama-2-7b-hf"
chat = "meta-llama/Llama-2-7b-chat-hf"

print("Base PPL (raw text):", ppl_fullsequence(base, raw_texts))
print("SFT model PPL (naive, raw text):", ppl_fullsequence(chat, raw_texts))
print("SFT model PPL (assistant-only, chat template):", ppl_assistant_only_chat(chat, pairs))

This results in the following output for the SFT data:

Output:
Base PPL (raw text): 10.266649146981344
SFT model PPL (naive, raw text): 6.7316364438131675
SFT model PPL (assistant-only, chat template): 3.8855575447740236

We calculate perplexity values across different dimensions, which imply different things:

Model	Text format	Perplexity meaning
`base` (`Llama-2-7B-hf`)	raw text	Measures how well the base LM predicts this data’s tokens
`chat` (`Llama-2-7B-chat-hf`)	raw text	Measures how well the SFT model predicts this data’s tokens without its expected conversational framing
`chat` (`Llama-2-7B-chat-hf`)	structured dialogue	Measures how well the SFT model performs the instruction-based completion given user input in its expected conversational template

Base model perplexity on SFT data

The base model’s perplexity on its evaluation set is not directly comparable to its perplexity on the SFT data. We hope that they are similar, but by the nature of the instruction-tuning task, the data distribution is slightly different.

The PPL \(\approx 10.27\) for the base model (raw text of SFT data) represents its general linguistic fluency as it is trained auto-regressively on all tokens. This purely tells us how much “in-distribution” the SFT data is to the base model’s data domain (i.e. domain familiarity).
This low PPL even before SFT means the base model already finds the data similar in style, structure, and semantics. If the SFT domain differs substantially (e.g., moving from English classics to biology research), we should expect a higher PPL (we will see this next).
For instruction finetuning, a low PPL indicates greater familiarity with the instruction language (even if not fully aligned yet). Understanding familiarity of the language is a useful diagnostic before alignment. This opens up questions about continued pre-training on new domain data, but we are jumping ahead here.

SFT model perplexity on SFT data

Instruction tuning is not optimized for pure language modeling, but rather for specialized behavioural alignment. Schematically, a templated training example looks like [INST] Explain what an EOB ... [/INST] [ASSISTANT] An EOB is .... The SFT model is trained only to predict what comes after the [ASSISTANT] token; the user instruction is part of the context, but not part of the training loss. SFT training shifts the language model heavily towards the templates of behaviour (instruction) we desire. This is no longer a pure language model.

The PPL \(\approx 6.73\) (SFT model, raw text) is lower than the base model’s \(10.27\), showing the SFT training has improved overall prediction even without the template.

The PPL \(\approx 3.89\) (SFT model, structured dialogue) is the lowest, demonstrating the SFT model’s strong alignment with its specific conversational template (e.g., [INST], [/INST]). This low PPL confirms the model has heavily learned to expect and predict the post-instruction sequence (the assistant’s response) when provided with the correct context and template. Perplexity on raw instruction data is higher because it lacks the mode of “understanding” that comes from the template.

Out of domain data

Suppose our dataset comes from an entirely different domain, such as healthcare:

raw_texts = [
    "Human: Explain what an EOB (Explanation of Benefits) is.\nAssistant: An EOB is a statement sent by a health insurance company that explains what medical treatments or services were paid for, what was not covered, and why",
    "Human: How do you verify prior authorization status?\nAssistant: Contact the payer or check their portal using the member ID and service codes; confirm approval dates, remaining units, and any documentation required"
]

Output:
Base PPL (raw text): 13.936028931681786
SFT PPL (naive, raw text): 18.254311566365946
SFT PPL (assistant-only, chat template): 17.49617187783925

Notes:

We observe that aligning with the chat template of the SFT model helps reduce perplexity (from \(\approx 18.25\) to \(\approx 17.50\)), but it is still worse than the base pure language model (\(\approx 13.94\)). The healthcare domain data differs in phrasing and semantics, so the SFT model is not able to generalize well to this new domain. The PPL \(\approx 17.50\) is a measure of where perplexity is for this new domain.

If, however, we were to perform SFT on the base model with the healthcare data, the PPL \(\approx 13.94\) is the baseline indicator of how well the pretrained language model already models the domain’s token distribution before instruction-tuning.

A higher PPL on this data using a previously finetuned SFT model (Llama-2-7B-chat-hf) \(\approx 17.49\) indicates a distribution mismatch with respect to healthcare jargon as-is. (Note: this is a naive setting; we could add in-context examples to potentially improve this).
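Since perplexity is the exponential of the average per-token cross-entropy, the gaps above can also be read in nats per token. The sketch below only re-expresses the three numbers already reported in the output; the dictionary keys are labels I made up. Keep in mind that the base and assistant-only figures are averaged over different token sets, so the comparison is approximate.

```python
import math

# Perplexities copied from the output reported above
reported_ppl = {
    "base_raw": 13.936028931681786,
    "sft_naive_raw": 18.254311566365946,
    "sft_assistant_only": 17.49617187783925,
}

# PPL = exp(cross-entropy), so cross-entropy (nats per token) = ln(PPL)
cross_entropy = {name: math.log(ppl) for name, ppl in reported_ppl.items()}

# Extra nats per token the chat model pays relative to the base model
gap = cross_entropy["sft_assistant_only"] - cross_entropy["base_raw"]
ratio = reported_ppl["sft_assistant_only"] / reported_ppl["base_raw"]

for name, ce in cross_entropy.items():
    print(f"{name}: {ce:.3f} nats/token")
print(f"gap vs. base: {gap:.3f} nats/token (PPL ratio {ratio:.3f})")
```

A difference in cross-entropy corresponds to a ratio of perplexities, which is why it is often easier to compare models in log space.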

This exercise is a potentially useful diagnostic to understand the domain shift between the pre-training and post-training data, and the need for domain adaptation through continued pre-training. If the model doesn’t speak the same “language”, we cannot expect SFT / RLHF to perform well, as the next-token prediction is “off-track”. Instruction tuning can force the model to mimic the language, but the language is not internalized the way it would be by a pure language model.

So, in summary,

the pre-training PPL measures the model’s familiarity with the language.

the post-training PPL measures the model’s familiarity with the instruction language, i.e., alignment familiarity.

Deeper insights into perplexity

As mentioned above, perplexity on its own does not fully capture model quality. It should be considered alongside task-specific evaluations and specialized benchmarks, such as those for instruction following and other targeted abilities. Not to forget, human evaluation is still golden. Still, perplexity is a useful diagnostic; it provides a baseline for how well the model predicts next tokens on task data and can reveal much about model fit and data alignment. This section collects key insights from the literature on the interpretation and limitations of perplexity.

Scaling Laws for Neural Language Models (2020) demonstrates that perplexity is a good proxy for model quality, and that cross-entropy loss scales according to power laws with model size, data size, and compute. This allows us to estimate how much perplexity will drop with more training.
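As a rough illustration of what a power law in model size implies, the sketch below uses the form \(L(N) = (N_c / N)^{\alpha}\). The default constants are the values I recall from the paper's parameter-count fit (non-embedding parameters); treat them as illustrative assumptions and check them against the paper before relying on them. The function names are my own.

```python
import math

def power_law_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Cross-entropy loss (nats/token) predicted by L(N) = (N_c / N) ** alpha.

    n_c and alpha are illustrative defaults recalled from the paper, not
    values fitted on any data in this post.
    """
    return (n_c / n_params) ** alpha

def perplexity_from_loss(loss_nats):
    """Perplexity is the exponential of the per-token cross-entropy."""
    return math.exp(loss_nats)

small = power_law_loss(1e9)    # hypothetical 1B-parameter model
large = power_law_loss(1e10)   # hypothetical 10B-parameter model

# Under a pure power law, a 10x larger model scales the loss by 10 ** -alpha
shrink = large / small
print(f"loss: {small:.3f} -> {large:.3f} (factor {shrink:.3f})")
print(f"PPL:  {perplexity_from_loss(small):.2f} -> {perplexity_from_loss(large):.2f}")
```

The point is the shape rather than the exact numbers: each tenfold increase in parameters multiplies the loss by a constant factor, so the returns diminish in absolute terms.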

Training Compute-Optimal Large Language Models (2022) proposes the Chinchilla correction, which estimates that model size and data size should scale equally to achieve compute-optimal performance. This prevents models from being under-trained on the number of tokens relative to their size.

Training Trajectories of Language Models Across Scales (2023) indicates that, at a given perplexity, models of different sizes can behave similarly on downstream (in-context) evaluation tasks. The authors measure validation next-token perplexity and observe that a similar subset of training tokens sees the most significant reduction in loss across these model variants.

Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? (2025) studies the choice of pre-training checkpoints to maximize downstream fine-tuning performance. If the perplexity of checkpoint \(A_1\) is lower than that of checkpoint \(A_2\), does it tell us that \(A_1\) will perform better than \(A_2\) on the SFT task? The authors find that conventional perplexity correlates poorly with how well a model will do after supervised or instruction fine-tuning / reasoning tasks. Task-specific metrics are more reliable, and so, potentially, are the unsupervised proxy metrics that the authors propose.

Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality (2025) investigates SFT data properties (e.g., perplexity of SFT data given the base model, as in our toy example above). The authors use the same SFT datasets on various base models and observe that SFT data with a lower perplexity, that is, data whose patterns the base model finds easier to predict or more familiar, consistently led to greater improvements in downstream task performance. This aligns with our previous note about matching the SFT data to the base model’s data domain, and the need for domain adaptation through continued pre-training. Conversely, given a base model X, we can reliably use perplexity to compare and rank different SFT datasets. This allows us to efficiently evaluate and select SFT datasets for a given base model.
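The ranking idea can be sketched in a few lines. Everything below is hypothetical: the dataset names and log-probabilities are invented, and in practice the per-token log-probabilities would come from scoring each candidate dataset with the chosen base model. This is not the paper's procedure, only a toy version of "rank candidate SFT datasets by base-model perplexity".

```python
import math

def corpus_perplexity(logprob_sequences):
    """Token-weighted perplexity over a list of per-example log-prob lists."""
    total = sum(sum(seq) for seq in logprob_sequences)
    count = sum(len(seq) for seq in logprob_sequences)
    return math.exp(-total / count)

# Hypothetical per-token log-probs under one base model (illustrative only)
candidate_datasets = {
    "general_chat": [[-1.0, -1.2, -0.8], [-0.9, -1.1]],
    "healthcare_qa": [[-2.5, -3.0, -2.8], [-2.7, -2.9]],
    "code_help": [[-1.8, -2.0], [-1.9, -2.1, -2.2]],
}

# Lower perplexity first: the data the base model finds most familiar
ranking = sorted(
    candidate_datasets,
    key=lambda name: corpus_perplexity(candidate_datasets[name]),
)
for name in ranking:
    print(f"{name}: PPL {corpus_perplexity(candidate_datasets[name]):.2f}")
```

Averaging over tokens rather than over examples keeps long examples from being under-weighted; whether that is the right choice depends on how the datasets differ in length.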

Paloma - A Benchmark for Evaluating Language Model Fit (2024) evaluates perplexity separately on many distinct domains, rather than measuring perplexity on all text as one unit, in the pre-training phase. Given that pre-training data is typically a mix of many domains, this is a useful benchmark to understand the model’s fit to specific data domains. The paper shows that assuming a good perplexity score on one distribution extrapolates to all others is flawed.

Summary

It’s fascinating to see how a concept as old as perplexity continues to shed light on modern fine-tuning and alignment of LLMs. It’s not a silver bullet, but a useful diagnostic, one that bridges information theory with today’s model behavior. As models get larger and training objectives more complex, revisiting these fundamentals feels less like nostalgia and more like grounding! The recent literature in this area is intriguing, and I hope to continue exploring this topic in the future.

References

Wikipedia: Perplexity, Entropy, Cross-Entropy, KL Divergence, Asymptotic Equipartition Property

Speech and Language Processing by Jurafsky and Martin

Scaling Laws for Neural Language Models (2020)

Training Compute-Optimal Large Language Models (2022)

Training Trajectories of Language Models Across Scales (2023)

Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? (2025)

Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality (2025)

Paloma - A Benchmark for Evaluating Language Model Fit (2024)

The Red Queen Runs, The Red King Waits

2025-10-06T12:00:00+00:00

Alternate title: Evolutionary Lessons for AI and Humanity

This morning has been a meandering through the world of chess on Duolingo, alternating between lessons and open play against the game’s tutor, as I make small yet satisfying strides in my Elo rating climb. The addiction is real: the queen commands vast territories, wreaking havoc with blistering speed, her presence a burst of tactical chaos that forces the board to bend in response. The king, meanwhile, moves with a measured grace - slow and deliberate, each step a quiet pursuit of positional balance. Of the twenty-plus games I have played this morning, the king has moved only when it needed to, while my pawns promote to queens and unleash more chaos agents across the board. At the end of the day, the queen(s) make moves to ensure the king(dom) does not fall.

Moves and Counter-moves

This brings to mind the adaptation of Yellowstone National Park’s food chain following the reintroduction of wolves. Much like chess, the absence of wolves created imbalance: increased elk populations led to overgrazing, devastated vegetation, and unstable riverbanks, further impacting beavers. With their return - like pawns promoting to queens - the ecosystem underwent local chaos towards long-term good: elk avoided river valleys, allowing vegetation and riverbanks to recover and the beaver population to rise.

Nick Bostrom’s Superintelligence reads differently now, given the progress in the fields of AI and genomics: it looks somewhat at the past, and not all of it is hyperbolic and obtuse science fiction. I have made more inroads with the book this time around than a decade ago, when I last tried to read it. So far, evolution reads as a series of dynamic equilibria at its core; every adaptation provoking a counter-adaptation, every move reshaping the environment. The type of moves directly characterizes each evolutionary step as one of the queen or the king. This tense oscillation, I think, holds the key to how we as humanity will co-evolve with AI, towards a superintelligent future.

The Red Queen Hypothesis: Run to Stay Still

“Now, here, you see, it takes all the running you can do, to keep in the same place.” - Lewis Carroll, Through the Looking-Glass.

The Red Queen’s Race postulates that all species must constantly adapt and evolve to survive the attack from other species. Humans fought off predators like lions and tigers by learning first to create vigilant attack formations, and then by creating impenetrable shelters. Likewise, rabbits evolved to have speed to avoid predators like foxes as a defensive adaptation, while the foxes also evolved to have speed for a similar adversarial advantage.

The Red King Hypothesis: Slow and Steady Wins the Alliance

The Red Queen’s race is, simply put, adapt or perish, even if only to maintain stability. The stability itself hinges upon both sides continuously adapting; if either side stopped, they would perish. The Red King hypothesis, on the other hand, suggests a slow but progressive evolution through mutual co-operation. Can the fox and the rabbits agree on a symbiotic relationship similar to coral-algae symbiosis? Perhaps a measured attack on rabbits allows rabbit populations to grow while maintaining requirements for sustenance, creating a harmonious balance in the ecosystem. These systems, as evolution indicates, go through periods of both competition and cooperation, endlessly, dancing through time.

Humanity’s Turn: Evolution Within Our Own Species

I wonder at these hypotheses’ relevance for intra-species evolution through the lens of the Darwinian evolutionary theory of ‘survival of the fittest’. Loosely put, let’s consider all evolution to have some form of fitness function that we can measure for the capabilities we consider to be above general intelligence, spanning the ability to learn, recursively self-improve, deal with uncertainty, and create flexible internal representations.

Sperm selection by definition is an intra-species arms race - the fastest sperm wins. Enhanced by advanced methods like magnetic-activated cell sorting (MACS) and genetic trait selection, the fastest sperm is now also the most genetically preferred. As new generations of these “advanced” humans come up, we now look at further genetic selection, aided by more technological advancements. In essence, ever since the beginning of time, we as humans have co-evolved competitively as the same species for the best version of ourselves to be put forth.

Now, imagine flipping the Prisoner’s Dilemma: cooperation, not betrayal, yields the higher long-term payoff. In this iterated version, suppose one prisoner adapts slowly – rarely updating their cooperative strategy – while the other reacts rapidly to each round. Over repeated interactions, the fast learner would continually adjust to sustain cooperation, effectively conceding more. This mirrors the Red King effect: the slower player anchors the cooperative equilibrium and captures more of the long-term benefit.

However, the analogy holds only when adaptation rates differ within a feedback loop (i.e., one step). When both agents evolve at a similar speed – or the environment itself stands still – the asymmetry collapses, reverting to the familiar Prisoner’s Dilemma. Cooperation erodes; betrayal once again becomes the rational choice. The system slips into a Red Queen’s Race, where both must run faster simply to survive. The slower player is forced to abandon their anchored advantage, i.e., adapt, or go extinct. Betrayal gives way to chaos.

Intriguingly, all the “rational agents” here – the prisoners and the jailers – belong to the same species. But what happens if one of them isn’t? What if the faster runner in this co-evolutionary race is a superintelligent AI? What if we remove the notion of rationality to introduce chaos monkeys?

With some full-blown meandering, when I zoom out to the global scale, the world itself seems caught in a Red Queen’s race. Nations sprint to preserve their GDP; companies accelerate to defend their market share. The answer to the survival puzzle simply seems to be motion. In quieter moments, I notice the same pattern in myself - keeping myself abreast of breakthroughs in LLMs, frameworks, and agentic AI, just to stay relevant. Running, learning, updating, never quite arriving. And, I wonder, in this endless chase for adaptation, am I too a Red Queen – running only to remain in place?

The Superintelligence Dance

Given that all systems must evolve through some form of feedback, either through a direct measurement of the fitness function or through indirect signals from the environment, perhaps the path to superintelligence is also defined by the tempo of Red Queens and Red Kings.

AI-2027 ends with a choose-your-own-ending: a deliberate slowdown toward measured, stable progress or a full-blown race toward superintelligence. I find myself somewhere in between, believing we need both Red Queens and Red Kings: the Red Queens to evolve our defenses, developing prompt shielding, red-teaming, and new countermeasures against adversarial ingenuity, and the Red Kings to steady the pace, building interpretability, AI governance, and ethical alignment into the foundation itself. I wonder though, in the iterated Prisoner’s Dilemma of human-AI coevolution, can we choose to be the slow adapters, the species that anchors cooperation? To hold our ground long enough to define the “equilibrium” we desire? The pace of AI will continue to accelerate, reshaping the environment in which we operate. Yet if we hold firm on the core of AI safety, we – the effectively slower species – still have a rare opportunity: to define the cooperative equilibrium, even as we necessitate adaptation around it. Slowness, in this framing, is not inertia but intentionality: a deliberate safeguard in the face of runaway adaptation. If we consider AI evolution to be akin to biological evolution, the runaway state is bound to happen.

The Emergent Red Duality

Ultimately, the queens are nothing if the king falls. In the grand game of evolution, if we cannot learn to build safer systems amid this oscillating pendulum of moves and counter-moves, we risk checkmating ourselves off the board entirely. The true endgame is not humans versus AI, but humans + aligned AIs. We will reach that equilibrium only through alternating phases of rivalry and adaptation, through Red Queen races that push our defenses forward, and Red King moments that create leverage for our long-term sustenance as a human species.

References

Red Queen’s Race

Red King’s Hypothesis

AI-2027

Prisoner’s Dilemma

Through the Looking Glass

The Red King effect: When the slowest runner wins the coevolutionary race

The Dance of Endurance

2025-08-25T12:00:00+00:00

As the sun sets behind Twin Peaks, I carry my bike up the stairs, my legs jelly and aching after more than 200 kilometres (126.65 miles). Collapse is imminent, and I welcome it, sinking into the couch, enjoying the familiar company of joyful pain. The last time I rode this hard was in 2023. Since then, I’ve stayed on familiar roads—honing skills, losing them to injuries, and reconnecting on San Francisco streets. Returning to this level felt like homecoming, a palimpsest of my past rides.

For years, I have chased these wild flashes of joy. On the well-trodden routes, pelotons sweep by in a blur of color and camaraderie: waves, nods, quick hellos, all part of a rolling celebration. But as I push further north, the world grows quieter and lonelier. There are no bike shops to bail me out, not even a tree to cast a sliver of shade. Every twenty minutes, a car appears and vanishes in a heartbeat, leaving only the hush of the empty road. The cyclists out here are a rare breed, drawn to the long, grinding challenge. Speed is not the prize; out here, it is all about the journey: slow, relentless, and deeply enduring.

The Why

On a Saturday after a long, hard week, nothing tops waking up early and riding hard. It’s a meditative experience, bringing deep solitude to a frenzied mind. The warm embrace of the mountain I am trying to conquer offers a unique mental space that seems to open up only after going the distance. Despite the challenge, this state comes naturally – a flow state is reached.

A natural breakdown of the ride into milestones of climbs and checkpoints aligns very well with my affinity to plan everything to the dot. Rarely do the rides go according to plan. The flow state then begets patience and a mindset to look at the next climb and only think of that segment. Okay, I got this one small climb. Done… ooh boy, the next one’s hard with 8% gradient. Row, row, row your boat gently down the stream… yep. Done.

These rides offer plenty of opportunities to persist and to embrace discomfort. It’s not a race; there’s no reward waiting at the end of it. What matters is enduring it all—the blazing sun, the stubborn mountains, the endless flats. The journey itself is the only prize.

Mind breaks before the body

But beneath this meditative flow lies the real battle: not against the body, but against the mind. With every step forward, as the heat slaps me in the face, I find the going tough. The mind withers at the thought of having to do yet another climb. Gosh, why couldn’t they bore a tunnel here? Why are the roads designed for cars and not road bikes? (I laugh incredulously at my own silliness.) 50+ miles in, the body is less fresh. The breathing is heavy. The mind starts looking for shortcuts to end this onslaught. Every baby hill becomes taxing. The smiles of hikers are unbecoming (how dare they!). The speeding cars are annoying, and the tiny bumps on the road are like a torture chamber. The mind has suddenly regressed to its infancy, its most primitive form: a baby.

When overcast skies press down on lonely climbs in the middle of nowhere, unease quickens its pace. The world drains of color and motion, save for the steady spin of my wheels, and a sharper awareness flickers to life. My inner child stirs, desperate to flee. In these moments, my mind shatters like a glass door caught in a sudden gust.

Over the years, I have learned to shatter the iron chains of doubt. I speak out loud: I can do it. I have been here before. I can conquer this. Hearing my own voice cuts through the runaway train of thoughts threatening to derail me.

There are days though when nothing feels enough. The train has crashed. The going feels impossible. On one of those days in the past, I was ready to abandon and hitchhike. A fellow rider gave me company for the last 40+ miles of the ride. We conversed about living in the city (again, a welcome distraction from the ride). I matched their slow and steady pace, brought my heart rate down, and pushed mile by mile to conquer my mental birdcage.

The Joy of Discovery

The kickstart to my rides was discovering places I would not see otherwise - backroads, valleys, and coastlines - at a much slower pace compared to cars. I am in my element, soaking in everything that these places have to offer. In the rhythmic movement of the pedaling, I lose myself on rolling hills and winding roads, utterly entranced.

On a recent ride through Chileno Valley Road, the summer heat was incredibly unforgiving. The 10-mile ride in hot headwinds and sun-baked asphalt with barely a tree for shade felt nothing like a torture chamber, though. I was in the present, discovering these new lands - the black, brown cows grazing in the meadows, the dairy and cheese farms with their rustic entrance signs, the quintessential wooden fences meandering along the road, the horses running in square paddocks - exhilarating joy to witness all of it in slow motion.

The Unexpected Gift

The sunrise starts, the coffee stops, and the water refill stations along the way, post-ride feasts - these little rituals born out of necessity have made my rides so much more meaningful. I carry wild memories of plans going awry, riding in the dark, witnessing changing seasons on the same routes, eating dirt on sharp descents, being one with the fog, taking risks on long, fun descents - all a grand adventure.

So, jelly legs and a heart on a workout - why does anyone willingly push themselves for hours on the bike? I think I am ready for another adventure - it is the simplest answer from each one of my rides. As the sun sets and I collapse, weak, but smiling, I know: this isn’t escape. It’s arrival. And it’s a joy worth chasing.

Let the fires burn

2025-06-19T12:00:00+00:00

Waking up to the news of downed planes, political turmoil, and the potential emergence of World War 3, it’s easy to succumb to the pressure of letting myself be swept hither and thither. These times are strange and often feel very dire - my own world seems to be moving at a breakneck pace. As I take a deep breath to remind myself to come back to the present, I reflect on the importance of protecting one’s mental space and energy.

Mental space is a finite resource.

I am no stranger to the productivity buzzwords: willpower, grit, saying no, and focus. In essence, it’s about taking control of life - some semblance of control, at the least. In reality, countless little things tug at my attention, draining the willpower I just recharged overnight. A quiet voice in my head immediately answers - ruthless prioritization and hyper-focus on very few things. So I do know this, and yet, I act childishly, automatically pacifying every thread and trying to do it all.

Let some of those fires burn.

Some fires are of my own making - sparked by habits, over-commitment, or fear. I’ve been caught in a Sisyphean loop, fanning their flames. That stops here. That stops here with a heads-down focus on those immediate fires. The world could be destroyed by Vogons tomorrow and I wouldn’t know - that’s the focus flow I aspire to in this period. Put my blinders on, and Godspeed! Nothing else matters.

Maybe, though, letting a few fires burn is the way. It’s not failure; it’s wisdom - not all alarms are mine to answer, and not all smoke needs me choking on it. After all, fires burn to clear the path and unencumber the weary, confused traveler from fighting his way through the thick forests.

And then some fires are truly a figment of my imagination. These are trespassers who have been squatting in my headspace, letting excessive rumination fester. I am serving notice - no more mental court hearings though. I am evicting these squatters the only way I know how: by dancing it out. With every move, I reclaim my body and let my mind slow down. Each joyful step is a conscious shift in energy, with no need for logic or explanation, into the present. No more grappling with these phantom blazes; just movement, just release.

Makes you look at the gif very differently, doesn’t it?

Unplugged

2025-06-15T12:00:00+00:00

It is a warm summer mid-morning with crisp weather, blue skies and plenty of sunshine, much unlike the foggy blanket one would expect in San Francisco. The peak fog game (i.e., Fogust) is still months away; it’s still the time one would expect the city to get uhm… progressively colder, with the fog either in the backdrop or enveloping you, all while the chilly winds sweep through. Yet, here I am, at 11 in the morning, at Alta Plaza Park, enjoying every bit of the sunshine glazing my skin.

The wet grass has begun to “bake,” and the sweet fragrance is all around. I lie on my blanket, reading a book and writing in my journal, as I watch the events unfold around me. A sea of dogs running amok, unleashed and enjoying the unbridled freedom; a black cat in a harness exploring the park with so much gusto that one would not have batted an eyelid if they said he ruled the park and all of the dogs were his puny subjects; couples with babies in strollers with bags of food from the nearby Bi-Rite Market, hoping to find some space to soak in the sunshine and take a pause; a dad with his baby girl, holding onto a beautiful pink kite with a majestic tail, eager to fly it on this windy Father’s Day. I bear witness to the curiosity of the dogs welcoming a dog newly off his chains; with the arrival of each dog, there’s a raucous celebration and an abundance of joy embraced by all of the onlookers.

And there are the tech bros, founders, and VCs on their coffee walks. I cannot escape them. I can hear their chatter; they are the loudest in the park. Perhaps not intentionally, but their enthusiasm for their craft is apparent, much to the vexation of this author, as he tries not to think of work on weekends. I must also apportion some blame to myself, as I am reading ‘LLM Engineer’s Handbook’. Agentic AI, LLMs, ACP, A2A — words of the year so far — our tech bros are discussing the same as well.

To each their own, I conclude, making peace with the failed notion of escaping the chatter of tech, living in the thick of it. The curiosity of the black cat (Ludo) has been piqued; he has tugged on his harness to arrive at my blanket, while at the same time attracting the attention of Daisy, the majestic golden retriever, sitting quietly less than a metre away. Daisy’s intentions are clear: she wants to make a new friend; Ludo’s, on the other hand, are much more nefarious. He stops approaching and looks me in the eye; the eyes of black cats are always captivating, almost like the needy eyes of the labs and retrievers. I get up to pet this friendly cat; his intentions now become clear, as he advances to my blanket and makes himself comfortable, sprawling majestically on the entirety of it. Pure evil, and so… kingly; yes, your Highness, I mutter, while I pet him for a bit until his owner decides it’s in the best interest of all to leave (while Daisy lies down on the grass visibly sad).

In these moments, I reflect on how much I have missed this act of doing… nothing: to be able to feel things in their entirety, and not just go with the flow - a flow that I decided to tweak and accelerate in the name of optimization and efficiency. To take in every sound, near and far, including the humdrum of the church bell nearby, feels uncanny when living in a hyper-plugged world of working long hours. Unplanned ambles like these evoke raw emotions that are so deep-seated that I have forgotten they exist; all it takes is doing nothing for them to surface. And yes, I didn’t read the book as the distractions became too positive to immerse myself in one, a reference book no less. As I walked towards a nearby restaurant on Divisadero, I promised my suddenly super-alive heart that I would do more of these unplugged ambles.

The quarterly review — Q6

2018-07-10T12:00:00+00:00

Last Saturday, I graduated with a Master’s degree in Computer Science; I completed my thesis in the field of recommender systems, titled “Exploiting temporal and geographical influences for personalized POI recommendation.” This was the last quarter of my master’s timeline and, like every one before it, this one too flew by in the blink of an eye.

I spent most of the quarter feeling an acute time crunch. At times, I would pull multiple nights with just 1 hour of sleep or no sleep. With the San Diego flu appearing in multiple cycles, I fell ill and took a 1.5-week break in the middle of the quarter. If you aren’t laughing or shaking your head in great shock, you should be! Losing a week in a quarter is disastrous, so disastrous that you double down and sleep less.

The one big thing I underestimated is writing the thesis. This one was a sucker! I shouldn’t have started with the background chapter. I procrastinated with great levity and misjudgment about the time and effort it takes to transfer the observations and results to the written form. I spent roughly four full weeks writing it, including the revisions. This should make me happy considering the optimized time. However, it was a needless pressure that almost made me question if I would complete my thesis this quarter.

I spent some time researching the problem of automatic playlist continuation as part of the Spotify Challenge. Unfortunately, my thesis train going at an extremely slow pace, together with the illness, meant that I stopped working on it completely in the last month.

I had a pleasant quarter as a teaching assistant for CSE 21 — Mathematical Systems and Analysis, a foundation discrete mathematics class for computer science under Prof. Miles Jones. It was also my first time in a large TA peer group and I enjoyed the overall experience. The last discussion section was emotional as it was also my last teaching stint in graduate school thus far.

When I look back, I simply can’t believe the speed at which my two years at UCSD flew by. I learnt massively, I struggled through, and with every fall, I learnt to pick myself up. It’s been a fantastic journey, and I will miss everyone. It will actually hit me when I leave San Diego in one month, but for now, I still feel like I am in school, albeit carefree, pressureless and without time crunches. I wish I did more ab crunches at least now; Wh’re art thy abs, hath asked the longeth await’d summ’r.