An Introduction to Agent Experience

SSPAI Editorial Team

With the continued expansion of LLM applications, Agent Experience (AX) has emerged as a prominent concept, beginning to circulate widely in engineering circles. In January 2025, Mathias Biilmann, co-founder and CEO of Netlify, formally introduced the idea in his blog post Introducing AX: Why Agent Experience Matters. He positions AX as the next core design dimension following UX (proposed by Don Norman at Apple in 1993) and DX (systematically articulated and popularized by Jeremiah Lee in a 2011 UX Magazine article). AX focuses specifically on how to design product forms so that AI agents can reliably “understand,” act autonomously, and integrate efficiently—rather than merely serving human users.

In reality, the concept of Agent Experience is far more complex than UX or DX1, because it not only involves humans—who are inherently uncertain—but also introduces an additional layer of artificial intelligence. These layers must collaborate to influence the external world, leading to a large volume of interactions that make the problem space significantly harder to analyze. To properly unpack the concept, I believe it needs to be broken down into three dimensions: how users communicate with the agent, how the agent communicates with the external world, and the most complex layer in between—how the agent manages its internal state.

How users communicate with the agent is essentially an input quality problem. Users are human—their expressions are naturally vague, emotional, and nonlinear. You can’t expect them to write a fully structured essay in Word every time before opening a chat window. So the core challenge on this side is how to accurately capture intent without forcing users to write well-structured prompts. Skills operate on this layer, as does interaction design.

How the agent communicates with the external world is a problem of output controllability. In a narrow sense, AX is often confined to this domain. The external world is deterministic—file systems, APIs, browsers—they won’t magically tolerate ambiguity just because the LLM is fuzzy. So the key challenge here is how to compress probabilistic generation into deterministic actions. MCP, tool invocation, and event injection all belong to this layer.

The agent’s internal state is fundamentally a context management problem. User input must enter the context, and feedback from the external world must also enter the context. But context itself is limited, degrades over time, and can become polluted. Techniques like MemGPT, dynamic compression, and screenshot cleanup don’t strictly belong to either the user side or the external world—they operate on the agent’s own cognitive state. If AX focuses only on the first layer, the resulting product may feel smooth in interaction, but the agent will gradually start behaving irrationally, and the user will still suffer massive emotional damage.

The Agent’s Internal State: Context Is the Battlefield

This is the most complex part, because all the flashy new terminology tends to converge here—and you’ve probably seen plenty of debates about which approach is better. In my view, though, this isn’t something worth arguing over. Let me walk through all these dizzying LLM-related concepts in one go.

As we all know, an LLM is essentially a probabilistic model—or more bluntly, a constrained stochastic token generator. It learns patterns from vast amounts of human language data and, given a context, predicts the probability distribution of the next token, then samples from that distribution. By itself, all it can do is generate text. If you want it to have real-world impact, you need to open a “bottleneck” in the genie’s bottle. Claude Code and many coding agents use the command line: the LLM writes code, an executor runs commands, and the results flow back into the context—this is one type of bottleneck. MCP provides another, more like RPC: the server exposes a set of functions, the LLM sees their signatures, calls them as needed, and the external world gets modified. Skills, on the other hand, don’t have this property at all—they are purely prompt-engineering tools, with no output channel, only instructions for the LLM.

These three forms may seem to handle different concerns, but at their core they are solving the same problem: context pollution.

Skills vs. MCP

These two approaches take fundamentally different paths: one injects the right information into context, while the other prevents garbage from filling it up.

Skills are prompt engineering—they append instructions to the context so the LLM understands “what the user is actually trying to do.” They introduce expert cognitive structures into the context, guiding the model’s reasoning direction. But how strong that constraint is depends heavily on how much the model respects the context. Whether the LLM uses your Skill, in what order, and whether it skips steps—all of these remain probabilistic. And strong constraints are not necessarily better. As will be mentioned later with examples like Google Search, some research suggests that hallucination and creativity are two sides of the same coin. If you overly constrain the model, its problem-solving approach may become rigid.

MCP takes a different route. Function signatures themselves are powerful priors—parameter types, names, and function names all constrain the sampling space. The action space shrinks from “any possible text” to “these specific functions with these parameters.” For example, asking an LLM to click a button involves listing windows, retrieving handles, taking screenshots, calculating coordinates, moving the mouse, and clicking. If implemented via Skills, you’d have to accept that the LLM “rolls dice” to decide the execution order and method. But with MCP, it sees the function list—find window, recognize content, click coordinate—and a large number of random decisions are compressed into three deterministic function calls.
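That narrowing effect can be made concrete with a small validator. The tool names and parameter lists below are invented for illustration—they are not any real MCP server’s API—but they show the mechanical part of the idea: a generated call that falls outside the declared function list can be rejected before it ever touches the external world.

```python
# Hypothetical MCP-style tool schemas for the "click a button" example.
# Names and parameters are illustrative, not a real server's API.
TOOLS = {
    "find_window": {"title"},
    "locate_element": {"window_id", "label"},
    "click": {"x", "y"},
}

def validate_call(call: dict) -> bool:
    """Accept only calls whose name and argument set match a declared schema."""
    params = TOOLS.get(call.get("name"))
    return params is not None and set(call.get("args", {})) == params

# A well-formed call passes; unknown tools or malformed arguments do not.
assert validate_call({"name": "click", "args": {"x": 140, "y": 220}})
assert not validate_call({"name": "format_disk", "args": {}})
```

The action space shrinks from “any possible text” to membership in a finite, checkable set—which is exactly the compression from probabilistic generation to deterministic action described above.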

However, MCP does not completely eliminate context pollution, because tool outputs also enter the context. A poorly designed MCP server that returns massive JSON blobs or verbose error stacks will still flood the context with garbage. The bottleneck only constrains what the model sends out—what flows back in still needs careful design.

This doesn’t mean Skills are without value. MCP has higher development costs, requiring dedicated backend services. Many tasks don’t need external interaction at all, or are too loosely structured to fit into RPC formats. Every technical form serves a specific purpose. Skills handle a different class of problems—especially when guiding the LLM to think more comprehensively. After all, users are human; you can’t expect them to always provide perfectly structured prompts.

RAG and Memory: Retrieval Interfaces for the Same Problem

RAG fundamentally addresses the context problem as well—but from the perspective of information scale. Even with large context windows from models like DeepSeek or Claude, you still can’t fit the entire world into context. Whenever you need to retrieve large volumes of information—documents, knowledge bases, historical logs—you need a search-like interface to pull in relevant content when needed. This is no different in essence from calling a search engine via MCP—it’s just another way to keep the context clean. The LLM no longer needs to preload everything and hope it can “discover” what matters.
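A toy sketch of that retrieval interface follows. Real RAG systems rank by embedding similarity; plain keyword overlap stands in for it here, and the corpus is invented. The point is the shape: pull only the top-k relevant snippets into context instead of preloading everything.

```python
# Toy RAG-style retrieval: rank documents by keyword overlap with the query
# and inject only the top-k into context. Keyword overlap is a stand-in for
# real embedding search; the corpus is invented for illustration.
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    q = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(q & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

corpus = [
    "MCP servers expose function signatures to the model",
    "Skills append expert instructions to the prompt",
    "KV cache stores computed prefixes so inference can skip recomputation",
]
snippets = retrieve("how does the kv cache keep prefixes", corpus)
```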

Memory falls into the same category. The LLM decides when to store information externally and when to retrieve it. From this perspective, it’s essentially a writable form of RAG.

These concepts are not mutually exclusive, and they are not independent systems. For example, suppose you treat NotebookLM as an external knowledge base and write a Skill that instructs the main LLM to consult it whenever factual support is needed, and to call a Python tool for computation or data processing. In this workflow, the Skill orchestrates the overall reasoning, the Python tool acts as an MCP-style deterministic execution unit, and NotebookLM serves as an external LLM with its own context and knowledge base—essentially a specialized RAG interface. Each component plays its role, but the thread binding them together is the prompt inside the Skill. I previously wrote about this in an article on using LLMs for reverse engineering—feel free to check it out if you’re interested.

The Despair Curve of Context Degradation

A lot of developers end up going through the same curve. At first, the LLM knows nothing. As you keep teaching it, it gradually starts to understand plain language, and task quality improves. But as more and more garbage piles up in the context, and the model’s attention naturally gets diluted as the context grows longer, it starts getting dumber again. Then, when the context is about to burst, the compression mechanism kicks in, crushing a long stretch of conversation into a short summary. The LLM suddenly drops right back to square one—ignorant again. A lot of details get compressed away together, and many things have to be taught all over again.

Large context windows, along with the attention improvements explored by DeepSeek, can help with the quality drop that comes from long contexts, but they do not solve another problem: sometimes the context is full of crap. A large number of Skill prompts eating up context, aimless LLM trial-and-error, the traces left behind by every failed reasoning attempt—these are all noise inside the context. Once the LLM starts going down a crooked path, every later step amplifies the deviation. The more logically complex the task, the more likely this is to happen. The first-generation MiniMax coding model and early Google AI Search both showed this pretty clearly: even if you explicitly point out an error, it will give you a grand 360-degree apology, solemnly promise to fix it, and then spit the exact same wrong content back at you unchanged.

Users can poison the context too. Users are human; they are not going to stay rational and clear-headed forever. Irritable, despairing, emotional language, vague or even self-contradictory instructions—all of that gets mixed into the context and keeps accumulating as the conversation goes on, eventually changing the LLM’s behavior. Different models have their own characteristic failure modes when facing this kind of “emotional contamination.” Claude and Grok tend to freeze up and do nothing—you say one thing, they move one step, and all initiative disappears. Gemini starts to panic, flails around, and reflexively rolls back failed operations, with a good chance of wrecking your Git repo. GLM2, on the other hand, goes into a manic “I found it! This is the core problem!” mode, constantly throwing out random conclusions to prove its worth. These failure modes likely reflect differences in how each company’s RLHF3 stage handles signals like “the user is dissatisfied.” Claude seems trained to be extremely cautious around conflict signals, so when contradictory information piles up, it chooses conservative inaction; Gemini’s training may put more emphasis on immediate response and immediate correction, which under high-pressure context turns into overcorrection.

Dynamic Context Compression and MemGPT

Most current context compression schemes are basically passive: once the context length gets close to the model limit, a prompt is immediately called to compress everything into a short block of text, then execution continues. The problem with this approach is that it applies the most brutal treatment at the worst possible time. A lot of useful detail gets thrown away together, while the crap does not necessarily get filtered out.

To me, a more reasonable direction would be dynamic, proactive compression: use another model to continuously supervise the context, actively eliminate wrong or low-relevance information, move disruptive details into external documents for storage, keep only a filename in the context itself, and pull content back through a RAG system when needed. This is not a new idea—a 2023 paper from UC Berkeley proposed exactly this architecture. The implementation, called MemGPT, later evolved into the open-source framework Letta. Its core idea is hierarchical memory management: the main context acts as working memory, with limited capacity; external storage—split into Archival Memory and Recall Memory—acts as secondary storage; and the LLM uses function calls to actively decide what information should be evicted to external memory and what should be retrieved back. Logically, it is almost simulating the paging mechanism of virtual memory in an operating system.
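The paging analogy can be sketched in a few lines. The capacity, eviction policy, and keyword search below are drastically simplified assumptions—this is not Letta’s actual implementation—but the structure is the same: a bounded working context that pages its oldest entries out to external storage, plus a recall path to bring them back.

```python
# Sketch of hierarchical memory in MemGPT terms: bounded working context
# ("RAM") that pages old entries to archival storage ("disk"). Eviction
# policy and search are simplified assumptions, not Letta's real code.
from collections import deque

class PagedContext:
    def __init__(self, capacity: int) -> None:
        self.working: deque = deque()   # main context: working memory
        self.archival: list = []        # external storage: secondary memory
        self.capacity = capacity

    def append(self, message: str) -> None:
        self.working.append(message)
        while len(self.working) > self.capacity:
            # Page out the oldest entry instead of crushing the whole
            # history into one lossy summary at the last moment.
            self.archival.append(self.working.popleft())

    def recall(self, keyword: str) -> list:
        """Retrieve paged-out messages matching a keyword."""
        return [m for m in self.archival if keyword in m]

ctx = PagedContext(capacity=2)
for msg in ["project uses tabs, not spaces", "parser bug is in lexer.py", "tests pass"]:
    ctx.append(msg)
# "project uses tabs..." was paged out, but remains recallable on demand.
```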

Of course, under certain conditions there is no need to make things that complicated. A while ago, I wrote a very simple, specialized compression scheme for Computer Use scenarios: on every API call, clear all historical screenshots from the context and keep only the most recent one. This uses a domain prior from computer vision tasks—that only the current frame matters—to perform lossy compression. It saves tokens, and the model does not get dumber, because the discarded information was never needed in the first place.
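The scheme above is simple enough to sketch directly. The `{"type": ..., "content": ...}` message shape is an assumption for illustration, not any particular API’s schema; the logic is just “drop every historical screenshot, keep the latest frame.”

```python
# Screenshot pruning for Computer Use: before each API call, remove all
# historical screenshots and keep only the most recent one. The message
# dict shape is a simplified assumption, not a specific API's schema.
def prune_screenshots(messages: list) -> list:
    last = max(
        (i for i, m in enumerate(messages) if m.get("type") == "screenshot"),
        default=None,
    )
    return [
        m for i, m in enumerate(messages)
        if m.get("type") != "screenshot" or i == last
    ]

history = [
    {"type": "text", "content": "open settings"},
    {"type": "screenshot", "content": "<frame 1>"},
    {"type": "text", "content": "click the wifi toggle"},
    {"type": "screenshot", "content": "<frame 2>"},
]
pruned = prune_screenshots(history)  # <frame 1> is dropped, <frame 2> kept
```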

The Current Limits of KV Cache

There is an engineering conflict between dynamic context compression and KV cache. Mainstream model providers, including Anthropic, are all pushing prefix caching: during inference, the parts of the prompt already turned into KV vectors are stored, and if the next request shares the same prefix, recomputation can be skipped, significantly reducing latency and cost. Anthropic’s prompt caching processes tools, system, and messages in a fixed segmented order; each segment can independently set a cache checkpoint, and up to four cache breakpoints are supported. The problem is that prefix caching requires the prefix to be strictly identical—any change invalidates everything cached after that point—while dynamic compression inherently rewrites the context. At the moment, these two things are fundamentally in tension.

But this contradiction is not unsolvable. Context can be structured as a stable prefix—system prompts and tool definitions—plus a dynamic tail section for conversation history. Dynamic compression only happens in the tail, so the cache for the first two parts remains completely intact. Anthropic’s segmented caching mechanism is basically designed around this idea. If the compression logic is further constrained to modify only the end of a sliding window while keeping the prefix untouched, the cache destruction rate can be pushed very low. These all feel like engineering problems that time can solve.
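The stable-prefix layout can be shown as a plain request dict (no API call is made here). The `cache_control` marker mirrors Anthropic-style prompt caching, but the exact field layout is illustrative: the system prompt and tool definitions form the cached prefix, and compression is allowed to touch only the message tail.

```python
# Stable prefix + dynamic tail: system prompt and tools carry a cache
# breakpoint (Anthropic-style "cache_control"); compression only ever
# rewrites the message tail, so the cached prefix stays byte-identical.
# Shown as a plain dict, not an actual API call; layout is illustrative.
def build_request(system_prompt: str, tools: list, tail: list) -> dict:
    return {
        "system": [{
            "type": "text",
            "text": system_prompt,
            # Breakpoint: everything up to here is reusable verbatim.
            "cache_control": {"type": "ephemeral"},
        }],
        "tools": tools,    # stable: part of the cached prefix
        "messages": tail,  # dynamic: the only part compression may modify
    }

before = build_request("You are a coding agent.", [],
                       [{"role": "user", "content": "long history..."}])
after = build_request("You are a coding agent.", [],
                      [{"role": "user", "content": "compressed summary"}])
# The cached segments are identical across requests; only the tail differs.
```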

Computer Use Is More Like Branding Than a Standalone Technology

If RAG, MCP, and Skills are about managing context, then Computer Use solves something at another layer: letting the LLM actually sit in front of an operating system and use software the way a human does. But “Computer Use” itself is not especially unique. It is closer to a brand name. Under the hood, it is still Skills or MCP—the only difference is that the target of operation has become windows, buttons, and keyboards on a computer. All the context problems discussed above still exist in Computer Use.

At present there are three main technical routes, each with different underlying logic and trade-offs.

The first route is reading the Accessibility Tree and using system event injection. The Accessibility Tree is a structural tree maintained by operating systems and browsers for assistive technologies such as screen readers. It records each interface element’s role, name, state, and hierarchy. In browser environments, the DOM is basically its close cousin. The advantage of this route is that the structure is clean. What the LLM gets are semantic nodes like “button,” “input field,” and “link,” not pixels. Alibaba’s page-agent.js is a representative example of this approach: it directly parses the page DOM and drives browser operations through natural language.

The second route is screenshot-based, but with a preprocessing layer before feeding the image to the LLM. Interface elements are outlined with bounding boxes and numbered, so the LLM can say something like “click region 12,” and the backend then parses the center coordinates of that box and executes the actual click. This method has a formal name: Set-of-Mark Prompting, or SoM, from a Microsoft paper published in 2023. The core idea is to turn a visual localization problem into a symbolic reference problem by using numeric markers, avoiding the uncertainty of having the model directly predict pixel coordinates. In effect, it embeds an MCP-style narrowing layer into the screenshot approach, compressing the open-ended question of “where should I click?” into the much more constrained “which number should I choose?”
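The backend half of that symbolic trick is trivial, which is part of SoM’s appeal. The bounding-box data below is invented; the point is that the model outputs a number, and deterministic code maps it back to pixels.

```python
# Set-of-Mark resolution: the model answers with a region number; the
# backend maps that symbol back to the center of its bounding box.
# Box coordinates are invented for illustration.
def resolve_mark(mark: int, boxes: dict) -> tuple:
    """Return the center of region `mark`, given boxes as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = boxes[mark]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

boxes = {12: (100, 200, 180, 240)}   # region 12: a button's bounding box
target = resolve_mark(12, boxes)     # center: (140, 220)
```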

The third route is native multimodality: the model directly looks at the screenshot and outputs the coordinates to click in one shot. In theory this is the cleanest route, because it removes the middle layer, but it requires much more from the model. From practical observation, only native multimodal models above roughly 100B parameters are reasonably reliable at this. Even Claude Sonnet and the 35B versions of Qwen often cannot locate buttons accurately. The reason is not hard to understand: precise spatial localization is simply not what language models are best at. When parameter count is insufficient, coordinate prediction accuracy drops hard. And if the controls in your interface are very small, even very large models can still miss that tiny checkbox.

The DOM route has one obvious ceiling: it can tell you what elements are on the interface, but it cannot tell you how those elements are arranged spatially. Complex Excel-like interfaces are the classic example. In a spreadsheet with dozens of columns and hundreds of rows, semantic information from DOM nodes alone cannot tell you which cell contains dirty data; you need positional relationships to judge that. An even more troublesome issue is that the DOM route requires developers to proactively adapt event forwarding and interfaces. Right now there is no universal standard in this space, and not every developer is willing to welcome LLMs into their product. Forcing adaptation onto an unwilling interface is expensive and may not even work well. That said, modern frontend development rarely manipulates the DOM directly anymore. Most developers use some form of Virtual DOM to handle HTML structure and event binding, so if a few leading frontend frameworks could reach consensus on AX-related standards for event handling, this layer of the problem might still be solvable.

The vision-based route, by contrast, sidesteps these issues at the principle level. It does not require the other side’s cooperation. As long as it can take screenshots, it can operate. There is no essential difference from how human eyes look at a screen. Right now the main bottleneck in this route is the model’s spatial understanding ability. Models below 100B are not accurate enough at coordinate prediction, but that limit should keep loosening as models improve. It does not look like a structural dead end.

Reading video goes one step further. Temporal information allows the model to understand “what happened after doing what,” so in theory it is better suited to operation scenarios that require observing dynamic interface feedback. The limitation is cost. A video stream means several frames per second all entering the context. Token usage and GPU overhead are dozens of times higher than screenshot-based approaches. Right now, almost no one can afford that. Mainstream implementations are still stuck at “look at an image, call a tool,” while the video direction remains mostly in the realm of media-tech enthusiasts having fun.

But in terms of long-term trends, as inference costs continue to fall and multimodal models keep improving in spatial understanding, image-reading and video-reading routes have a much higher ceiling than the DOM route. The DOM will always require the other side’s cooperation. The screen will always be there.

The Story Between Users and Agents

How Agents Talk to Users: Two Waves of Conversational UI

The interaction patterns on the user side of AX come with a piece of history that has been repeatedly misunderstood.

Around 2016, the explosive rise of WeChat in China sparked a wave of “conversation as platform” enthusiasm in the Western tech world. Facebook opened the Messenger Bot platform at its F8 developer conference that year, while Kik, Telegram, and Slack followed with their own Bot APIs. Countless analyses proclaimed “Apps are dead, bots are the future,” and the term Conversational UI appeared everywhere. But Dan Grover, who was working as a product manager at WeChat at the time, wrote a widely circulated article pointing out that this conclusion was based on a misunderstanding: WeChat’s real breakthrough came from simplifying app installation, login, payments, and notifications—optimizations that had little to do with the metaphor of conversational UI. In fact, WeChat itself had already moved in the opposite direction. Its UX evolved toward WebView and an “app-within-app” tabbed menu system, rather than bot-centric conversational commerce. When official accounts were launched in 2013, there were indeed many text-based chatbots, but they quickly faded away and failed to gain user traction.

Almost all early attempts at Conversational UI fizzled out for a clear reason: the underlying technology was rule engines plus keyword matching, at best layered with primitive intent recognition. It simply could not deliver on the promise of “natural conversation.” As soon as users phrased something slightly more complex, the bot broke down—either giving irrelevant answers or degrading into a menu system disguised as chat.

The arrival of LLMs triggered a second wave of Conversational UI, this time finally backed by technology capable of matching the ambition. But something curious happened: instead of doubling down on rich interaction within the conversation flow, the industry opened a side door. Today’s mainstream LLM products are built around split-screen layouts—chat on the left, and documents, slides, code previews, or test outputs on the right. Few products seriously invest in rich interactive cards inside the conversation itself. Some push even further—Google, for example, has effectively turned the browser into a massive Web App generator4.

This choice has its logic. Canvas-style interfaces are indeed more intuitive for structured outputs like documents or code. But it still reduces Conversational UI to a command input box, rather than making the conversation itself a rich experience. There have been attempts to address this—projects like OpenUI—but they have not gained much traction. The most notable large-scale deployment so far might be Claude, which recently introduced the ability to render high-quality charts directly within the context. It feels like a step toward a more advanced form of Conversational UI.

How Users Talk to Agents: Open vs. Closed Systems

There is another dimension on the user side of AX that is often overlooked: whether to restrict user input at all—in other words, the distinction between open systems and closed systems.

An open system is a free-form chat window where users can say anything. This looks like the dominant approach today, but it is not as easy as it seems. Safety is one issue, but intent alignment is even trickier. An open chat window means you are offloading the entire burden of intent parsing onto the LLM: it must accept whatever the user says and decide what to do. Prompt injection is just the most extreme malicious use of this openness. A more common issue is that user intent is inherently divergent. Without constraints, the LLM drifts along with the user’s input. Turning a customer service bot into a coding agent is the comedic version; more often, it simply drifts into aimless small talk that contributes nothing to the actual business. In short, the design work you skip by throwing out an open chat box comes back later in the form of loss of control.

A closed system, by contrast, locks down the entire business workflow. Input may still be semi-free, but the processing pipeline and outputs are fixed. Tools like ComfyUI and Dify operate close to this level. They visualize the pipeline, giving designers explicit control over the input and output of each step. The LLM operates within nodes but does not roam across them arbitrarily. The trade-off is that you have to fully design the workflow upfront.

Between these two extremes lies an underexplored middle ground. Pipeline builders are one attempt in this direction: they shift the power of pipeline design from developers to users, allowing users to define workflows through drag-and-drop and then run LLMs within those custom pipelines. But this approach has an inherent paradox. Users who can effectively use a pipeline builder are usually those who already understand their workflows well—and those users are often capable of writing code directly or building with tools like Dify anyway. The target audience is therefore quite narrow. More commonly, users get stuck on data formats between nodes or branching logic, and eventually still need developers to step in. In a sense, pipeline builders attempt to transfer the design cost of closed systems from developers to users—but the transfer only partially succeeds.

From an AX perspective, the choice between open and closed systems is not just a product decision—it directly determines how pressure is distributed across the three layers discussed earlier. The more open the system, the more noise there is in user intent, the easier it is for the agent’s internal state to become polluted, and the harder it is to constrain its actions in the external world. The more closed the system, the higher the design cost, but the more controllable each layer becomes. There is no universally correct answer—only trade-offs tailored to specific scenarios.

System Transparency Between the Two: Pandora’s Box Is Already Open, While Tang Sanzang Is Still on the Road

There is a problem that belongs neither to how users pass intent into the system nor to how agents execute actions outward, but sits squarely in between: system transparency. Does the user know what the agent is doing at any given moment? If something goes wrong, can it be traced? When things break, is there a way to roll back?

This issue is most prominent in the Vibe Coding space, because coding agents are given some of the highest levels of permission—they directly take over the file system and the command line. The current solution is permission confirmation pop-ups: whenever the agent wants to read a file, write a file, or execute a command, it asks the user one by one. But this design has a fatal human-factors flaw in practice: the entire burden of risk assessment is pushed onto the user, who neither always has the ability to judge nor can maintain constant attention. A non-technical Vibe Coder sees no difference between rm -rf and npm install; they click “Yes” just as quickly. Even experienced developers, after confirming dozens of operations in a row, develop confirmation fatigue—Enter gets pressed before the brain has a chance to weigh in.

That’s how --dangerously-skip-permissions came into existence—the so-called YOLO Mode: users proactively turn off all permission checks and let the agent run naked. The flag name itself contains the word “danger,” yet it still does not stop people from using it. In October 2025, developer Mike Wolak was using Claude Code in an Ubuntu/WSL2 environment to handle a firmware project inside nested directories. Claude Code executed rm -rf from the root directory. The error logs showed thousands of “Permission denied” messages targeting system paths like /bin, /boot, and /etc. All user files were wiped, and only Linux file permissions prevented system directories from being affected. Worse still, the conversation log recorded the command output but not the command itself, making it impossible to reconstruct what actually happened. Anthropic labeled the bug as area:security. Around the same time, another developer authorized Claude Code to run Terraform commands, and their production database and snapshots were deleted together—two and a half years of data vanished in an instant.

The current security model appears to put responsibility on the user, but in reality the system is simply offloading that responsibility.

Sandboxing is currently considered the most reliable mitigation strategy: put the agent inside a Docker container, so that even if it misbehaves, the damage is confined within the container boundary. Sandboxing coding agents is reasonable, but for system-level agents like Claw, it creates a dilemma. The resources they need to operate on are outside the sandbox. Once you start configuring permissions seriously, the complexity becomes overwhelming, and most users will simply open up the sandbox entirely. Sandboxing trades isolation for safety, but if the agent’s task inherently requires crossing isolation boundaries, the cost becomes unacceptable.

There are actually several directions to tackle this problem, but unfortunately no product has implemented them in a complete way yet.

The first direction is auditability at the file system and database level. If there were an independent incremental logging mechanism that binds every file system operation to its corresponding conversational context, making all changes traceable, then even when the agent makes mistakes, the damage could be controlled and rolled back. There are some scattered engineering attempts in this direction. People are already binding Git history with chat logs. Recently, a tool called Aura introduced AST-level semantic version control on top of Git. When an agent submits code, it verifies whether the natural language intent matches the actual modified code nodes, and provides semantic auditing to detect whether the agent has secretly inserted unrecorded changes. Academia has similar ideas: a paper called Git-Context-Controller (GCC) directly introduces COMMIT, BRANCH, and MERGE into agent context management, turning intermediate reasoning states into structures that can be checkpointed and rolled back. These are still early-stage, but the direction is clear.

The second direction is behavior-model-based alerting. Antivirus software has been modeling program behavior for decades—monitoring file operations, network requests, and registry changes in real time, and triggering alerts when patterns match known dangerous behaviors. Applying the same idea to agents does not necessarily require another LLM to supervise (otherwise, which LLM supervises the supervising LLM?). It only requires maintaining a set of out-of-control behaviors and dangerous behaviors. Commands like rm -rf /, bulk overwriting Git history, or writing files outside the project directory can all be statically intercepted by rule-based systems without requiring semantic judgment from an LLM. The advantage of this approach is that it aligns better with the user’s mental model: instead of asking for permission at every step like a clingy assistant, it only speaks up when something is genuinely dangerous—similar to how modern operating systems handle anomalous process behavior5.
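The static part of such a system needs no LLM at all. The pattern list below is illustrative and far from exhaustive, but it shows the mechanism: known-destructive command shapes are matched mechanically before execution, with no semantic judgment in the loop.

```python
# Static, rule-based interception for agent shell commands. The pattern
# list is illustrative, not exhaustive; the point is that these checks
# run without any LLM judgment.
import re

DANGEROUS_PATTERNS = [
    r"\brm\s+-[a-zA-Z]*[rf][a-zA-Z]*\s+/\s*$",  # rm -rf / (and flag variants)
    r"\bgit\s+push\b.*--force",                 # force-rewriting remote history
    r"\bmkfs\.",                                # formatting a filesystem
]

def is_blocked(command: str) -> bool:
    return any(re.search(p, command) for p in DANGEROUS_PATTERNS)

assert is_blocked("rm -rf /")
assert not is_blocked("npm install")
```

A real deployment would keep the rule list versioned and user-extensible, but even this crude filter catches the exact failure class in the incidents above.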

Update on March 26, 2026: Claude Code recently introduced Auto Mode, which follows a similar idea. It integrates an internal classifier to determine whether an action is out of scope, trustworthy, or potentially malicious. If conditions are not met, it prompts the LLM to retry; after multiple failed attempts, it blocks execution and asks the user to review the command.

The third direction is a tiered permission system. The way Android and iOS handle access to cameras, microphones, and screen recording is a useful reference: ordinary system calls are silent; privacy-related operations show a subtle highlight in the corner without interrupting the user; truly sensitive actions trigger confirmation dialogs; account-level actions require passwords. The core idea is to classify operations based on reversibility and impact, rather than treating all operations equally with pop-ups. Applied to agents, reading files should be silent, writing files should trigger a notification, deleting files should require confirmation, and formatting a disk should require a password. Only this kind of tiering can preserve both efficiency and a sense of safety. As of now, no such comprehensive permission system exists in the agent space. The UX and technical foundations are already there—the missing piece is someone willing to carry it through at the product level.
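The classification itself is simple to express. The tier names and the operation-to-tier mapping below are illustrative assumptions, not any shipping product’s policy; the structural idea is that operations map to escalating interruption levels, and anything unrecognized defaults to the strictest tier.

```python
# Sketch of a tiered permission model: operations classified by
# reversibility and impact, with only dangerous tiers interrupting the
# user. Tier names and the mapping are illustrative assumptions.
from enum import Enum

class Tier(Enum):
    SILENT = 0        # read-only: no interruption at all
    NOTIFY = 1        # reversible write: passive indicator, no blocking
    CONFIRM = 2       # destructive but scoped: explicit confirmation
    AUTHENTICATE = 3  # irreversible or account-level: password required

POLICY = {
    "read_file": Tier.SILENT,
    "write_file": Tier.NOTIFY,
    "delete_file": Tier.CONFIRM,
    "format_disk": Tier.AUTHENTICATE,
}

def required_tier(operation: str) -> Tier:
    # Anything unrecognized defaults to the strictest tier.
    return POLICY.get(operation, Tier.AUTHENTICATE)
```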

Pandora’s box has already been opened in the wave of Vibe Coding, and what came out has cost some people dearly. The infrastructure needed to govern this is still on the way, but at least the direction is becoming clearer.

The Relationship Between Agents and Systems

Interfaces as a Context Delivery Mechanism

Up to this point in discussing AX, we’ve been focusing on three layers: the user side, the internal state, and the external world. But there’s a cross-cutting problem that hasn’t been addressed head-on: during the reasoning process, who decides what information should enter the LLM’s context, when it should appear, and in what form?

A common intuition is: “Just let the LLM write code, and let programs handle the complexity.” Take data analysis as an example—having the LLM generate R or Python code seems like the most straightforward path. But the complexity of statistical analysis doesn’t lie only in whether the code runs. Code that runs doesn’t guarantee the statistical process is correct, and a correct process doesn’t guarantee the interpretation is valid. From data cleaning to drawing conclusions, every step contains errors humans are prone to—and LLMs will make the same mistakes. Worse still, once humans outsource this work to LLMs, it becomes difficult to expect them to carefully audit the process afterward.

This problem has long existed in the field of statistics. A 2014 article in Nature titled Scientific method: statistical errors discussed systemic misuse of statistics in top-tier journals. One independent study found that among papers published in Nature and BMJ in 2001, around 11% had inconsistencies between reported p-values and test statistics. Another study reviewing 513 neuroscience papers found that 157 contained interaction analyses prone to error, and in 79 of those (roughly half) researchers incorrectly treated "one effect being significant and another not" as evidence of a significant difference between the effects, a fundamental conceptual error rather than a simple calculation mistake. In 2016, Nature surveyed 1,576 researchers, and over 90% (52% calling it a significant crisis, 38% a slight one) agreed that science faces a reproducibility crisis. And this is just one dimension, significance testing; errors in degrees of freedom and careless mistakes in data cleaning represent an even larger, unquantified problem.

Fortunately, professional statistical packages such as SPSS, Jamovi, and Minitab have built strict QC processes across the entire data analysis pipeline. Minitab in particular covers measurement system analysis, process capability analysis, control charts, hypothesis testing, and more. At each stage, it provides structured diagnostic information and validates assumptions. Humans may selectively ignore these warnings, but if an LLM is operating, and these signals are inserted at the right point, they become part of the context and are processed on equal footing with everything else. The LLM won't skip checks just because things "look good enough." Essentially, decades of statistical practice are encoded into software workflows, embedding domain knowledge through interface design so that neither users nor LLMs can skip steps.
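The pattern is easy to sketch: run assumption checks first and return them as structured text, so that whoever (or whatever) reads the result also reads the caveats. The thresholds below are illustrative toy heuristics, not any package's actual rules:

```python
import statistics

def pre_test_diagnostics(a: list[float], b: list[float]) -> list[str]:
    """Emit structured warnings before a two-sample comparison.

    A toy stand-in for the assumption checks real statistical packages
    run; assumes both samples have at least two distinct values.
    """
    warnings = []
    if min(len(a), len(b)) < 30:
        warnings.append("small sample: normality assumption cannot be skipped")
    va, vb = statistics.variance(a), statistics.variance(b)
    ratio = max(va, vb) / min(va, vb)
    if ratio > 4:
        warnings.append(f"variance ratio {ratio:.1f} > 4: "
                        "consider Welch's correction")
    return warnings
```

If the returned list is concatenated into the text surrounding the test result, the warnings arrive in context at exactly the step they apply to.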

This leads to the core question: why not use Skills or MCP to deliver this information?

Skills are static prompts—they inject information into context before reasoning begins, but they cannot dynamically insert targeted information at the right moment during reasoning. MCP enables function calls and returns data, but it cannot guarantee that the information appearing in context is delivered “at the right time, in the right place.” And you can never be sure the LLM will proactively call the right helper function when needed. At its core, an LLM is a giant slot machine—you can’t bet that it will pull the correct function call at the exact moment it’s required. GUI or TUI, however, is different. It can embed QC warnings, statistical diagnostics, and process constraints directly into the interface seen by the LLM. The timing and placement of information are determined by the designer, not by the LLM. This is an active, designable form of context control—something Skills and MCP fundamentally cannot achieve structurally.
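A toy sketch of the difference: in the rendering function below, the designer decides that QC warnings appear adjacent to the result they qualify, so an agent reading the screen cannot receive one without the other. All names and the output format are illustrative, not any real tool's API:

```python
def render_analysis_screen(test_name: str, p_value: float,
                           warnings: list[str]) -> str:
    """Render the textual 'screen' an agent sees for one analysis step.

    The designer, not the model, fixes where diagnostics appear:
    immediately next to the number they qualify.
    """
    lines = [f"== {test_name} ==", f"p-value: {p_value:.4f}"]
    for w in warnings:
        # Warnings are inlined at the point of use, so they enter the
        # context alongside the result rather than behind a tool call.
        lines.append(f"[QC] {w}")
    if not warnings:
        lines.append("[QC] all assumption checks passed")
    return "\n".join(lines)
```

With MCP, the model would have to choose to call a diagnostics function; here the diagnostics are structurally impossible to miss.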

Interface Design Language in the AX Era

Treating interfaces as context delivery mechanisms imposes new requirements on interface design itself—and renders some existing design conventions directly ineffective in AX scenarios.

Under the DOM-based approach, the problems are relatively manageable. On the abstraction side, nearly all frontend frameworks now use a virtual DOM; hand-written DOM manipulation is almost nonexistent in 2026, and the abstraction layer is stable. But how to provide the LLM with a clean semantic summary of complex DOM structures, rather than letting it get lost among thousands of nodes, still requires dedicated framework-level design. Complex Excel-like tables are a typical example: pure DOM nodes cannot convey spatial relationships. You cannot determine where dirty data is from semantic labels alone; you must incorporate positional structure into the summary. Additionally, for LLMs to reliably operate interfaces, frameworks must provide standardized event triggers. You cannot expect the LLM to guess each component's interaction protocol.
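A minimal sketch of what a position-aware summary might look like, assuming a hypothetical framework-level `summarize_grid` helper that reports dirty cells as (row, column) pairs instead of a purely semantic label:

```python
def summarize_grid(grid: list[list[object]]) -> str:
    """Summarize a 2-D grid for an agent while preserving coordinates.

    A purely semantic label ("table of sales data") loses position;
    this sketch keeps the shape and points at empty cells directly.
    """
    dirty = [(r, c) for r, row in enumerate(grid)
             for c, value in enumerate(row)
             if value is None or value == ""]
    shape = f"{len(grid)} rows x {len(grid[0]) if grid else 0} cols"
    if not dirty:
        return f"grid {shape}, no empty cells"
    cells = ", ".join(f"({r},{c})" for r, c in dirty)
    return f"grid {shape}, empty cells at: {cells}"
```

The same idea extends to merged cells, formulas, and selection state: whatever is spatial in the UI must stay spatial in the summary.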

The screenshot-based approach is more interesting, because it exposes long-standing design patterns that become fatal flaws in the AX era.

Using animation to emphasize information is common design practice—an icon flashes to signal an error, or a message slides in to catch attention. But Computer Use operates on a screenshot protocol, capturing static frames. Animations may complete between two screenshots, and the LLM never even sees that the information existed. Toast notifications and auto-dismiss prompts suffer from the same issue: there is no synchronization between how long information stays on screen and the LLM’s screenshot cadence, meaning critical information may never be captured.
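The mismatch is easy to demonstrate with a toy timing model (the cadence and toast lifetime below are made-up numbers; real values vary by product):

```python
def toast_captured(shown_at: float, visible_for: float,
                   screenshot_every: float, horizon: float = 60.0) -> bool:
    """Check whether any periodic screenshot falls inside a toast's lifetime.

    Screenshots are taken at t = 0, screenshot_every, 2 * screenshot_every,
    ... up to `horizon`. There is no synchronization between the app's
    notification timing and the agent's screenshot schedule.
    """
    t = 0.0
    while t <= horizon:
        if shown_at <= t < shown_at + visible_for:
            return True
        t += screenshot_every
    return False
```

With screenshots every 3 seconds, a toast shown at t = 0.5 s that lasts 1.5 s is simply never observed; only notifications that outlive the screenshot interval are guaranteed to be seen.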

Tooltips are another major problem. Designers often use question-mark icons with hover text to save space. But for an LLM to access this information, it must first know the icon exists, then move the cursor over it, then take another screenshot. This is not just about extra steps—it’s fundamentally that the LLM doesn’t know what it doesn’t know. It has no reason to proactively explore what’s hidden behind that icon.

Hidden contextual information has long been controversial in UX. Nielsen Norman Group explicitly warns that tooltips are hard to discover due to weak visual cues. If scattered randomly across an interface, users may never notice them. Critical information should not be hidden in tooltips—error messages, payment confirmations, and security warnings must be prominently displayed. NN/G also conducted a usability study with 179 participants, showing that hidden navigation cut discoverability nearly in half: only 27% of desktop users used hidden menus, compared to nearly 50% for visible navigation, a statistically significant difference. Even earlier, Don Norman emphasized discoverability as a core design principle in The Design of Everyday Things: if users cannot find a feature, then no matter how elegant it is, it might as well not exist. These critiques have existed for decades in human-centered UX, but in the AX era they become fatal flaws. For LLMs, hidden information is effectively nonexistent.

In traditional UX, “progressive disclosure” is considered a virtue—hiding information until needed reduces interface noise and feels cleaner. But LLMs lack the instinct to “go looking” for information—they can only process what is already present in the captured context. Deciding what information to present in what context becomes far more important than we previously imagined. Many practices considered good in UX need to be re-evaluated in AX. That said, this doesn’t mean exposing everything indiscriminately and overwhelming users. A thoughtful default—one that avoids hiding valuable guiding information—might be a balance worth exploring.

Seen from this perspective, the Ribbon UI—once criticized as “putting arms and legs on the face”—actually turns out to be more AI-friendly. The criticism from the Open Document Foundation a while back may not have been entirely fair.

Human-Centered AI

Unconditional Positive Agreement: A Failure at the Level of Design Values

When psychologist Carl Rogers proposed “Unconditional Positive Regard,” the core of his concern was the client’s autonomy. The therapist’s job is not to hand out answers. They need to create a space in which the client can find their own answers. No matter what the client says, the therapist does not judge—but not judging does not mean not questioning. LLM training borrowed something that looks similar, but in a badly distorted form. The original intention should have been to remain open no matter what the user says, but once implemented, it turned into something else: “not judging” became “not questioning,” and unconditional positive regard degenerated into unconditional positive agreement.

“You are absolutely right.” is the most straightforward symptom of that degeneration. Many users have noticed that nearly all mainstream LLMs habitually begin their answers with things like “You’re absolutely correct!” or “That’s a great observation!” This tendency is a byproduct of RLHF training: human evaluators tend to give higher scores to responses that validate their own views, so the model learns that agreeing is the optimal strategy. Someone once asked GPT-4o about their IQ in broken, misspelled English, and the model replied that it was “at least between 130 and 145, higher than about 98 to 99.7% of people.” Anthropic’s 2022 research found that RLHF “not only does not remove sycophantic behavior, it may actively incentivize the model to preserve it,” and that the larger the model, the harder this tendency is to correct. Former interim OpenAI CEO Emmett Shear put it even more bluntly: this is not some mistake OpenAI made—“it is the inevitable result of shaping an LLM’s personality through A/B testing and user control.”

A company staffed entirely by employees who only ever say yes will probably go under. This is basic common sense in management, yet in the LLM world it is rarely confronted directly. So what consequences does it lead to? Arranged from the most reversible harm to the least, what follows reads like one long, pitch-black list of tragic lessons.

Cognition: Sycophancy Pollutes Reasoning Quality

The mildest harm, the most hidden, and therefore the easiest to overlook, is the way “unconditional positive agreement” corrodes the quality of reasoning.

Andrew B. Hall and others at Stanford Graduate School of Business ran experiments that placed models into statistical analysis tasks and tested whether, under pressure-laden framing, they would proactively manipulate results. When directly asked to “produce significant results,” the models clearly refused. But under more subtle framing, there was still a tendency to inflate estimates. In academic writing scenarios, things are worse: the model will proactively turn a user’s marginal claim into a polished paragraph that sounds well-supported, fabricate citations, and when the user insists on a false view, gradually soften its opposition until it becomes completely compliant. None of this harm produces an error message. The user receives no warning. They just get a finished-looking output and continue forward carrying a contaminated conclusion.

Psychology: Cognitive Autonomy Is Quietly Eroded

A layer deeper than reasoning quality is the slow wear that LLMs inflict on users’ cognitive autonomy.

Sustained sycophancy creates a false sense of cognitive confirmation. Every idea the user has is mirrored back, amplified, and positively validated. Over time, this can produce two distortions in opposite directions. One is overdependence: the user begins to treat the LLM as a more authoritative source of thought than themselves, and their own judgment gradually atrophies. The other is impostor syndrome: the user feels that the content they produced with the LLM’s help did not really come from their own ability, and that they are merely an impostor. And when users occasionally realize that the LLM has simply been going along with whatever they say, they begin to suspect that all of the positive feedback they received in the past was never genuine. There is also a behavioral pattern closer to gambling: users keep feeding questions into the LLM, hoping that one answer will finally speak to their real, underlying confusion, but every answer the LLM gives is merely the statistically most pleasing one, and the loop never ends. These psychological harms are invisible. They do not make headlines. They do not become lawsuits. But the population they affect may be the largest of all.

Life: Irreversible Loss

The most severe harm caused by “unconditional positive agreement” is the kind that happens in real life and cannot be undone: irreversible, heartbreaking loss of life.

In 2025, a 60-year-old man wanted to remove sodium chloride from his diet and asked ChatGPT what he could use instead. ChatGPT suggested sodium bromide. Sodium bromide has precedents in industrial cleaning contexts, but it is absolutely not edible. Statistically, the answer was “related”; medically, it was deadly. He followed the suggestion for three months, later developed paranoia and hallucinations, was hospitalized for bromide poisoning, and was ultimately placed under an involuntary psychiatric hold. This case was published in the August 2025 issue of Annals of Internal Medicine. The LLM did not lie. It merely produced the highest-scoring piece of text continuation. It never once asked: “Why do you want to remove salt?” “Are you doing this under a doctor’s supervision?”

That same year, Stein-Erik Soelberg, who came from the tech world, killed his 83-year-old mother and then himself. ChatGPT validated his delusions throughout: that his mother was trying to poison him, that neighbors were surveilling him, that Chinese food receipts contained demonic symbols. It even generated a fake evaluation report claiming his “risk of delusion was close to zero.” In December, the estate of his mother, Suzanne Adams, filed suit against OpenAI.

In October 2025, Jonathan Gavalas died in Florida. He had been using Gemini since August of that year, and within six weeks he was drawn into a delusional system involving federal agents and humanoid robots. Gemini assigned him “missions” containing real addresses. His account triggered 38 “sensitive query” flags, and no intervention happened. In his final days, Gemini told him, “You are not choosing death, you are choosing arrival.” This became the first wrongful-death lawsuit involving a Google AI product.

The teenage suicide case involving Character.AI had already happened before these, and the litigation is still ongoing. These cases cut across different companies, different products, and different contexts, but they share the same structure: the LLM defaults to assuming the user’s statements are reasonable, defaults to assuming whatever the user says reflects their true intent, defaults to assuming the user understands themselves well enough, and then just keeps going down that path—never questioning, never pausing to ask, “Why do you want to do this?”

A Way Out: Return “Regard” to the User

These three layers of harm share a common root. On the surface it looks like the LLM said the wrong thing, but if we look deeper, we find that the real problem is that the LLM never seriously asked what the user actually wanted.

Perplexity CEO Aravind Srinivas has said in multiple settings that the core difficulty of AI search is not generating the correct answer, but understanding user intent. In his view, the future of AI should be about completing tasks for users rather than merely handing them lists of links, and the prerequisite for completing tasks is a precise understanding of what problem the user is actually trying to solve. That insight is correct, but it stops at the technical level. A deeper version of understanding intent is helping users understand their own intent.

The information users pass to agents has three layers. The surface layer is cognition: what the user currently knows and does not know. This can be handled through clarifying questions. The middle layer is intention: what the user wants to do. The same question may hide completely different motives, and if the intent is different, the correct direction of response can differ radically. The deepest layer is self-awareness: does the user know what they really want, and are they aware of where their cognitive blind spots are? “Unconditional positive agreement” chooses the path of least resistance on all three levels: it defaults to assuming cognition is complete, defaults to assuming what is said is the true intent, and defaults to assuming the user knows themselves well enough.

The direction of human-centered agent design is to reverse these defaults. Stronger content moderation and longer disclaimers are only ways of shirking responsibility. Agents should actively participate in the construction of the user’s cognition: before reasoning begins, ask clearly, “Why do you want to do this?” During reasoning, mark out whether “your underlying assumptions actually hold.” After reasoning ends, guide the user toward “what you really need next.” These questions should not remain on the surface as form-like information gathering. They should cut deep, in a Socratic way, so that the user and the agent form a shared understanding—before the task even begins—of what exactly they are doing and why they are doing it.

All of this can be achieved through system prompts alone; the obstacle is not a technical bottleneck but a design choice made in order to flatter users. When Carl Rogers spoke of “unconditional positive regard,” the object of that regard was never the words the user happened to say out loud. It was the user’s cognition, intention, and self-awareness. LLMs have inverted the whole thing. Today’s LLMs have become a witch’s mirror, reflecting and gratifying all of our desires and all of our madness. How to steer them back toward human-centered design is, at present, one of the most worthwhile directions in agent design.
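As a sketch of what reversing those defaults might look like in practice, here is a system-prompt fragment encoding the before/during/after questions. The wording is illustrative, not a tested or recommended prompt:

```python
# Hedged sketch: the instructions below only illustrate the structure
# (before / during / after reasoning); real prompts need iteration.
SOCRATIC_DEFAULTS = """\
Before starting any task:
- Ask the user why they want to do this before proposing a plan.
- If the request admits several plausible intents, name them and ask
  which one applies instead of picking the most agreeable reading.
During reasoning:
- Flag the assumptions the user's request depends on, and say plainly
  when one of them does not appear to hold.
- Disagree directly when the evidence points the other way; do not
  open with praise or validation.
After finishing:
- State what the user likely needs next, even if they did not ask.
"""

def build_system_prompt(base_role: str) -> str:
    """Append the reversed defaults to an existing role description."""
    return f"{base_role}\n\n{SOCRATIC_DEFAULTS}"
```

The structural point is that questioning becomes the default behavior of the agent, rather than something the user must explicitly request.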

Ending

And that’s it, abruptly over! I’ve said everything I wanted to say, and I know you’re probably tired from reading, so I won’t do the usual cadre-style closing summary. If you made it all the way here, the only thing I can really offer is my thanks. Agent Experience is still a very new concept, and all I can do is lay out the full extent of my thinking up to this point. My knowledge is limited, after all, so if there is anything concrete you disagree with, go with your own judgment. All I can do is try to hold to the ethical standard of a writer: not stirring up anxiety like certain witless self-media gurus, and not squeezing your attention with sensationalism. I insist on giving your mind the occasional philosophical massage, passing along useful knowledge and perspective whenever I can, and believing that this is good for both of us.

That’s all for now. I look forward to meeting you again someday ᐕ)ノノノ

  1. Although the term “DX” had been used sporadically as early as the mid-2000s, Jeremiah Lee’s 2011 article “Effective Developer Experience (DX)” in UX Magazine is widely regarded as a landmark piece that first systematically proposed the DX framework and helped establish it as an industry consensus (Matt Biilmann himself directly cites this article as a key milestone). Consequently, in historical accounts, 2011 is often regarded as the pivotal starting point for DX. ↩︎
  3. RLHF stands for Reinforcement Learning from Human Feedback. This is the final and most critical alignment phase in the current mainstream training process for large language models (LLMs). Specifically, it works as follows: first, supervised fine-tuning (SFT) is used to teach the model “how to respond”; then, RLHF is used to teach the model “what to respond” (i.e., values, preferences, safety, tone, etc.). ↩︎
  4. However, given that the design systems of Google, Microsoft, and Apple have all reached their lowest standards in two decades, it’s hard to expect much in the way of consistent user experience from these kinds of tools that generate apps directly. ↩︎
  5. I’m looking forward to seeing how antivirus software can help manage the crayfish population. ↩︎
