Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.
I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.
Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?
I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good
According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)
I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
Being minimalist is real power these days as everything around us keeps shoving features in our face every week with a million tricks and gimmicks to learn. Something minimalist like this is honestly a breath of fresh air!
The YOLO mode is also good, but having a small ‘baby setting mode’ that’s not full-blown system access would make sense for basic security. Just a sensible layer of "pls don't blow my machine" without killing the freedom :)
Pi supports restricting the set of tools given to an agent. For example, one of the examples in pi --help is:
# Read-only mode (no file modifications possible)
pi --tools read,grep,find,ls -p "Review the code in src/"
Otherwise, "yolo mode" inside a sandbox is perfectly reasonable. A basic bubblewrap configuration can expose read-only system tools and have a read/write project directory while hiding sensitive information like API keys and other home-directory files.
I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.
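For the curious, the tmux plumbing behind that workflow is tiny. A sketch (session name and helper functions are made up; the tmux subcommands themselves are standard):

```typescript
// tmux-tool.ts -- sketch of a long-running-command tool backed by tmux.
import { execFileSync } from "node:child_process";

const SESSION = "agent-repl";

function tmux(...args: string[]): string {
  return execFileSync("tmux", args, { encoding: "utf8" });
}

// Start a detached session running a REPL (ignore the error if it already exists).
function ensureSession(command = "python3") {
  try { tmux("new-session", "-d", "-s", SESSION, command); } catch { /* already running */ }
}

// Let the model type into the REPL...
function send(line: string) {
  tmux("send-keys", "-t", SESSION, line, "Enter");
}

// ...and read back the visible scrollback so it can decide what to do next.
function read(lines = 200): string {
  return tmux("capture-pane", "-t", SESSION, "-p", "-S", `-${lines}`);
}

ensureSession();
send("1 + 1");
console.log(read());
```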
> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer
This is how I prototyped all of mine. Console.Write[Line].
I am currently polishing up one of the prototypes with WinForms (.NET10) & WebView2. Building something that looks like a WhatsApp conversation in basic winforms is a lot of work. This takes about 60 seconds in a web view.
I am not too concerned about cross platform because a vast majority of my users will be on windows when they'd want to use this tool.
If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to have a transparent background color, which looks nice and a little more native, fyi. Though if you're doing more than just showing the WebView, maybe it isn't worth switching.
I like the idea of using a transparent background in the webview. That would compose really well.
The primary motivation for winforms was getting easy access to OS-native multiline input controls, clipboard, audio, image handling, etc. I could have just put kestrel in the console app and served it as a pure web app, but this is a bit more clunky from a UX perspective (separate browser window, permissions, etc.).
I was confused by him basically inventing his own skills but I guess this is from Nov 2025 so makes sense as skills were pretty new at that point.
Also please note this is nowhere on the terminal bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use. Just a good experiment and write up.
I built on ADK (Agent Development Kit), which comes with many of the features discussed in the post.
Building a full, custom agent setup is surprisingly easy and a great learning experience for this transformational technology. Getting into instruction and tool crafting was where I found the most ROI.
The solution to the security issue is using `useradd`.
I would add subagents though. They allow for the pattern where the top agent directs / observes a subagent executing a step in a plan.
The top agent is both better at directing a subagent, and it keeps the context clean of details that don't matter - otherwise they'd be in the same step in the plan.
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
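One simple shape for such an extension, purely as a sketch of the idea and not pi's actual extension API: treat a subagent as a tool that shells out to a fresh `pi -p` run with a restricted tool set and returns only its final output.

```typescript
// subagent.ts -- sketch of a "subagent as a tool": spawn a fresh pi run with a
// narrow prompt and hand back only its final output, keeping the parent context clean.
// This is NOT pi's extension API, just the underlying idea.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

async function subagent(task: string, tools = "read,grep,find,ls"): Promise<string> {
  // A restricted tool set keeps the side quest read-only; see `pi --help` for the flags.
  const { stdout } = await run("pi", ["--tools", tools, "-p", task], {
    maxBuffer: 10 * 1024 * 1024,
  });
  return stdout.trim(); // only this digest re-enters the parent's context
}

// Usage: the top agent delegates a research step and gets a summary back.
subagent("Summarize how error handling works in src/, max 10 bullet points")
  .then(console.log);
```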
I really like pi and have started using it to build my agent.
Mario's article really lays out the design trade-offs and complexities involved in building coding agents, and even general agents. I have benefited a lot!
I do think Claude Code as a tool gave Anthropic some advantages over others. They have plan mode, todolist, askUserQuestion tools, hooks, etc., which greatly extend Opus's capabilities. Agree that others (Codex, Cursor) also quickly copy these features, but this is the nature of the race, and Anthropic has to keep innovating to maintain its edge over others
(I work at Cursor) We have all these! Plan mode with a GUI + ability to edit plans inline. Todos. A tool for asking the user questions, which will be automatically called or you can manually ask for it. Hooks. And you can use Opus or any other models with these.
The biggest advantage by far is the data they collect along the way. Data that can be bucketed to real devs, and the signals extracted from it, can be top tier. All that data + signals + whatever else they cook up can be re-added to the training corpus and the models re-trained / version++ on the new set. Rinse and repeat.
(this is also why all the labs, including some chinese ones, are subsidising / metoo-ing coding agents)
One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.
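Roughly, that offload can be as small as one HTTP call if the small model sits behind an OpenAI-compatible endpoint (the URL and model name below are assumptions about a local setup):

```typescript
// offload.ts -- sketch: hand summarization/grunt work to a small local model so the
// big model's context stays short. Endpoint and model name are assumptions about
// whatever your local server exposes as an OpenAI-compatible API.
async function summarizeWithSmallModel(text: string): Promise<string> {
  const res = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "gpt-oss-20b",
      messages: [
        { role: "system", content: "Summarize the following for another agent. Be terse." },
        { role: "user", content: text },
      ],
    }),
  });
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```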
One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.
When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.
A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.
In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of:
- clearly scoped core value,
- deliberately limited surface, and
- enough flexibility to handle real user variation.
Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.
Not only did you build a minimal agent, but the framework around it so anyone can build their own. I'm using Pi in the terminal, but I see you have web components. Any tips on creating a "Chat mode" where the messages are like chat bubbles? It would be easier to use on mobile.
That's what they said, but as far as I can see it makes no sense at all. It's a console app. It's outputting to stdout, not a GPU buffer.
The whole point of react is to update the real browser DOM (or rather their custom ASCII backend, presumably, in this case) only when the content actually changes. When that happens, surely you'd spurt out some ASCII escape sequences to update the display. You're not constrained to do that in 16ms and you don't have a vsync signal you could synchronise to even if you wanted to. Synchronising to the display is something the tty implementation does. (On a different machine if you're using it over ssh!)
Given their own explanation of react -> ascii -> terminal, I can't see how they could possibly have ended up attempting to render every 16ms and flickering if they don't get it done in time.
I'm genuinely curious if anybody can make this make sense, because based on what I know of react and of graphics programming (which isn't nothing) my immediate reaction to that post was "that's... not how any of this works".
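For reference, the change-driven rendering being described is only a few lines. This sketch has nothing to do with how any particular TUI library actually works; it just shows the idea of redrawing only when the output changes instead of on a fixed 16 ms tick:

```typescript
// render-on-change.ts -- sketch of change-driven terminal output.
let lastFrame = "";

function render(frame: string) {
  if (frame === lastFrame) return;          // nothing changed: write nothing, no flicker
  lastFrame = frame;
  // Move the cursor home and clear below, then print the new frame.
  process.stdout.write("\x1b[H\x1b[J" + frame);
}

// State changes drive renders, not a timer.
render("status: thinking...");
render("status: thinking...");              // no-op
render("status: running tests (3/12)");
```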
I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates use a proper GUI framework!
The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:
• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).
• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.
• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.
• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.
• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.
• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.
It's got downsides too, it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.
The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!
As a user of a minimal, opinionated agent (https://exe.dev) I've observed at least 80% of this article's findings myself.
Small and observable is excellent.
Letting your agent read traces of other sessions is an interesting method of context trimming.
Especially, "always Yolo" and "no background tasks". The LLM can manage Unix processes just fine with bash (e.g. ps, lsof, kill), and if you want you can remind it to use systemd, and it will. (It even does it without rolling it's eyes, which I normally do when forced to deal with systemd.)
Something he didn't mention is git: talk to your agent a commit at a time. Recently I had a colleague check in his minimal, broken PoC on a new branch with the commit message "work in progress". We pointed the agent at the branch and said, "finish the feature we started" and it nailed it in one shot. No context whatsoever other than "draw the rest of the f'ing owl" and it just.... did it. Fascinating.
Really awesome and thoughtful thing you've built - bravo!
I'm so aligned on your take on context engineering / context management. I found the default linear flow of conversation turns really frustrating and limiting. In fact, I still do. Sometimes you know upfront that the next thing you're going to do will flood/poison the nicely crafted context you've built up... other times you realise after the fact. In both cases, you don't have many alternatives but to press on... Trees are the answer for sure.
I actually spent most of Dec building something with the same philosophy for my own use (aka me as the agent) when doing research and ideation with LLMs. Frustrated by most of the same limitations - want to build context to a good place then preserve/reuse it over and over, fire off side quests etc, bring back only the good stuff. Be able to traverse the tree forwards and back to understand how I got to a place...
Anyway, you've definitely built the more valuable incarnation of this - great work. I'm glad I peeled back the surface of the moltbot hysteria to learn about Pi.
The OpenClaw/pi-agent situation seems similar to ollama/llama-cpp, where the former gets all the hype, while the latter is actually the more impressive part.
This is great work, I am looking forward to seeing how it evolves. So far Claude Code seems best despite its bugs, given the generous subscription, but when the market corrects and prices get closer to API prices, the pay-per-token premium with an optimized experience will probably be a better deal than suffering through Claude Code's glitches and paper cuts.
The realization is that, in the end, an agent framework kit that is customizable and can be recursively improved by agents is going to be better than a rigid proprietary client app.
> but when the market corrects and prices get closer to API prices
I think it’s more likely that the API prices will decrease over time and the CC allowances will only become more generous. We’ve been hearing predictions about LLM price increases for years but I think the unit economics of inference (excluding training) are much better than a lot of people think and there is no shortage of funding for R&D.
I also wouldn’t bet on Claude Code staying the same as it is right now with little glitches. All of the tools are going to improve over time. In my experience the competing tools aren’t bug free either but they get a pass due to underdog status. All of the tools are improving and will continue to do so.
FWIW, you can use subscriptions with pi. OpenAI has blessed pi allowing users to use their GPT subscriptions. Same holds for other providers, except Flicker Company.
And I'm personally very happy that Peter's project gets all the hype. The pi repo already gets enough vibesloped PRs from openclaw users as is, and it's still only 1/100th of what the openclaw repository has to suffer through.
Good to know, that makes it even better. I still find Opus 4.5 to be the best model currently. But if the next generation of GPT/Gemini closes the gap, that will cross the inflection point for me and make 3rd party harnesses viable. Or if they jump ahead, that should put more pressure on the Flicker Company to fix the flicker or relax the subscriptions.
Is this something that OpenAI explicitly approves per project? I have had a hard time understanding what their exact position is.
Most likely.
See here OpenCode.
https://x.com/thdxr/status/2009742070471082006?s=20
This is basically identical to the ChatGPT/GPT-3 situation ;) You know OpenAI themselves keep saying "we still don't understand why ChatGPT is so popular... GPT was already available via API for years!"
And like ollama it will no doubt start to get enshittified.
This is the first I'm hearing of this pi-agent thing and HOW DO PEOPLE IN TECH DECIDE TO NAME THINGS?
Seriously. Is the creator not aware that "pi" absolutely invokes the name of another very important thing? sigh.
The creator is very aware. Its original name was "shitty coding agent".
https://shittycodingagent.ai/
From the article: "So what's an old guy yelling at Claudes going to do? He's going to write his own coding agent harness and give it a name that's entirely un-Google-able, so there will never be any users. Which means there will also never be any issues on the GitHub issue tracker. How hard can it be?"
> Special shout out to Google who to this date seem to not support tool call streaming which is extremely Google.
Google doesn't even provide a tokenizer to count tokens locally. The results of this stupidity can be seen directly in AI studio which makes an API call to count_tokens every time you type in the prompt box.
AI studio also has a bug that continuously counts the tokens, typing or not, with 100% CPU usage.
Sometimes I wonder who is drawing more power, my laptop or the TPU cluster on the other side.
tbf neither does anthropic
> If you look at the security measures in other coding agents, they're mostly security theater. As soon as your agent can write code and run code, it's pretty much game over.
At least for Codex, the agent runs commands inside an OS-provided sandbox (Seatbelt on macOS, and other stuff on other platforms). It does not end up "making the agent mostly useless".
Approval should be mandatory for any non-read tool call. You should read everything your LLM intends to do, and approve it manually.
"But that is annoying and will slow me down!" Yes, and so will recovering from disastrous tool calls.
That kind of blanket demand doesn't persuade anyone and doesn't solve any problem.
Even if you get people to sit and press a button every time the agent wants to do anything, you're not getting the actual alertness and rigor that would prevent disasters. You're getting a bored, inattentive person who could be doing something more valuable than micromanaging Claude.
Managing capabilities for agents is an interesting problem. Working on that seems more fun and valuable than sitting around pressing "OK" whenever the clanker wants to take actions that are harmless in a vast majority of cases.
You’ll just end up approving things blindly, because 95% of what you’ll read will seem obviously right and only 5% will look wrong. I would prefer to let the agent do whatever they want for 15 minutes and then look at the result rather than having to approve every single command it does.
Works until it has access to write to external systems and your agent is slopping up Linear or GitHub without you knowing, identified as you.
It's not reliable. The AI can just not prompt you to approve, or hide things, etc. AI models are crafty little fuckers and they like to lie to you and find secret ways to do things with ulterior motives. This isn't even a prompt injection thing, it's an emergent property of the model. So you must use an environment where everything can blow up and it's fine.
I'm trying to understand this workflow. I have just started using codex. Literally 2 days in. I have it hooked up to my github repo and it just runs in the cloud and creates a PR. I have it touching only UI and middle layer code. No db changes, I always tell it to not touch the models.
My codex just uses python to write files around the sandbox when I ask it to patch an SDK outside its path.
Is it asking you permission to run that python command? If so, then that's expected: commands that you approve get to run without the sandbox.
The point is that Codex can (by default) run commands on its own, without approval (e.g., running `make` on the project it's working on), but they're subject to the imposed OS sandbox.
This is controlled by the `--sandbox` and `--ask-for-approval` arguments to `codex`.
It's definitely not a sandbox if you can just "use python to write files" outside of it o_O
Hence the article’s security theatre remark.
I’m not sure why everyone seems to have forgotten about Unix permissions, proper sandboxing, jails, VMs etc when building agents.
Even just running the agent as a different user with minimal permissions and jailed into its home directory would be simple and easy enough.
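As a sketch of that setup (user name, paths, and the one-time setup commands are illustrative, not a recipe):

```typescript
// run-as-agent-user.ts -- sketch: launch the agent as a dedicated low-privilege user so it
// can only touch its own home directory. One-time setup (as root) might look like:
//   useradd --create-home --shell /bin/bash agent
//   chmod 700 /home/me            # keep your own files out of the agent's reach
import { spawnSync } from "node:child_process";

// Hand work to the jailed user via sudo; the agent's writes stay under /home/agent.
spawnSync(
  "sudo",
  ["-u", "agent", "-i", "pi", "-p", "Fix the failing tests in /home/agent/project"],
  { stdio: "inherit" },
);
```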
I'm just guessing, but seems the people who write these agent CLIs haven't found a good heuristic for allowing/disallowing/asking the user about permissions for commands, so instead of trying to sit down and actually figure it out, someone had the bright idea to let the LLM also manage that allowing/disallowing themselves. How that ever made sense, will probably forever be lost on me.
`chroot` is literally the first thing I used when I first installed a local agent, by intuition (later moved on to a container-wrapper), and now I'm reading about people who are giving these agents direct access to reply to their emails and more.
> I'm just guessing, but seems the people who write these agent CLIs haven't found a good heuristic for allowing/disallowing/asking the user about permissions for commands, so instead of trying to sit down and actually figure it out, someone had the bright idea to let the LLM also manage that allowing/disallowing themselves. How that ever made sense, will probably forever be lost on me.
I don't think there is such a good heuristic. The user wants the agent to do the right thing and not to do the wrong thing, but the capabilities needed are identical.
> `chroot` is literally the first thing I used when I first installed a local agent, by intuition (later moved on to a container-wrapper), and now I'm reading about people who are giving these agents direct access to reply to their emails and more.
That's a good, safe, and sane default for project-focused agent use, but it seems like those playing it risky are using agents for general-purpose assistance and automation. The access required to do so chafes against strict sandboxing.
Here's OpenAI's docs page on how they sandbox Codex: https://developers.openai.com/codex/security/
Here's the macOS kernel-enforced sandbox profile that gets applied to processes spawned by the LLM: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
I think skepticism is healthy here, but there's no need to just guess.
That still doesn't seem ideal. Run the LLM itself in a kernel-enforced sandbox, lest it find ways to exploit vulnerabilities in its own code.
The LLM inference itself doesn't "run code" per se (it's just doing tensor math), and besides, it runs on OpenAI's servers, not your machine.
There still needs to be a harness running on your local machine to spawn the processes in their sandboxes. I consider that "part of the LLM" even if it isn't doing any inference.
If that part were running sandboxed, then it would be impossible for it to contact the OpenAI servers (to get the LLM's responses), or to spawn an unsandboxed process (for situations where the LLM requests it from the user).
That's obviously not true. You can do anything you want with a sandbox. Open a socket to the OpenAI servers and then pass that off to the sandbox and let the sandboxed process communicate over that socket. Now it can talk to OpenAI's servers but it can't open connections to any other servers or do anything else.
The startup process which sets up the original socket would have to be privileged, of course, but only for the purpose of setting up the initial connection. The running LLM harness process would not have any ability to break out of the sandbox after that.
As for spawning unsandboxed processes, that would require a much more sophisticated system whereby the harness uses an API to request permission from the user to spawn the process. We already have APIs like this for requesting extra permissions from users on Android and iOS, so it's not in-principle impossible either.
In practice I think such requests would be a security nightmare and best avoided, since essentially it would be like a prisoner asking the guard to let him out of jail and the guard just handing the prisoner the keys. That unsandboxed process could do literally anything it has permissions to do as a non-sandboxed user.
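For what it's worth, the socket-passing part is expressible with stock Node APIs. A sketch of the two roles follows; this is not how Codex or any shipping harness actually does it, and the actual sandboxing of the child (Seatbelt, bwrap, ...) is elided:

```typescript
// socket-handoff.ts -- sketch: a small privileged bootstrap dials the API host once,
// then hands the live connection to a (notionally sandboxed) child, which upgrades it
// to TLS. The child never opens sockets itself.
import { connect as netConnect, type Socket } from "node:net";
import { connect as tlsConnect } from "node:tls";
import { fork } from "node:child_process";

if (process.argv[2] === "child") {
  // ---- sandboxed harness role: wait for the handle, then speak TLS over it ----
  process.on("message", (_msg, handle) => {
    const tls = tlsConnect({ socket: handle as Socket, servername: "api.openai.com" });
    tls.on("secureConnect", () =>
      console.log("harness: talking to the API over the inherited socket only"));
  });
} else {
  // ---- privileged bootstrap role: dial once, fork the harness, pass the socket ----
  const socket = netConnect(443, "api.openai.com", () => {
    const child = fork(process.argv[1], ["child"]);   // in real life: spawned inside the sandbox
    child.send("api-socket", socket);                 // hand the live fd across the boundary
  });
  socket.on("error", (err) => console.error("bootstrap:", err.message));
}
```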
You are essentially describing the system that Codex (and, I presume, Claude Code et al.) already implements.
The devil is in the details. How much of the code running on my machine is confined to the sandbox vs how much is used in the bootstrap phase? I haven't looked but I would hope it can survive some security audits.
If I'm following this, it means you need to audit all code that the LLM writes, though, since anything you run from another terminal window will be run as you with full permissions.
The thing is that on macOS at least, Codex does have the ability to use an actual sandbox that I believe prevents certain write operations and network access.
You really shouldn’t be running agents outside of a container. That’s 101.
What happens if I do?
What's the difference between resetting a container or resetting a VPS?
On local machine I have it under its own user, so I can access its files but it cannot access mine. But I'm not a security expert, so I'd love to hear if that's actually solid.
On my $3 VPS, it has root, because that's the whole point (it's my sysadmin). If it blows it up, I wanna say "I'm down $3", but it doesn't even seem to be that since I can just restore it from a backup.
A bit more general: don't run agents without some sort of OS-provided restriction on what they can do. Containers are one way, VMs another; in most cases it's enough with just a chroot and the Unix permission system the rest of your system already uses.
Does Codex randomly decide to disable the sandbox like Claude Code does?
> from copying and pasting code into ChatGPT, to Copilot auto-completions [...], to Cursor, and finally the new breed of coding agent harnesses like Claude Code, Codex, Amp, Droid, and opencode
Reading HN I feel a bit out of touch since I seem to be "stuck" on Cursor. Tried to make the jump further to Claude Code like everyone tells me to, but it just doesn't feel right...
It may be due to the size of my codebase -- I'm 6 months into solo developer bootstrap startup, so there isn't all that much there, and I can iterate very quickly with Cursor. And it's mostly SPA browser click-tested stuff. Comparatively it feels like Claude Code spends an eternity to do something.
(That said Cursor's UI does drive me crazy sometimes. In particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git -- I would have preferred that to instead actively use something integrated in git (Staged vs Unstaged hunks). More important to have a good code review experience than to remember which changes I made vs which changes AI made..)
For me Cursor provides a much tighter feedback loop than Claude Code. I can review, revert, iterate, and change models to get what I need. It feels like Claude Code is sometimes presented more as a YOLO option where you put more trust in the agent about what it will produce.
I think the ability to change models is critical. Some models are better at designing frontend than others. Some are better at different programming languages, writing copy, blogs, etc.
I feel sabotaged if I can’t switch the models easily to try the same prompt and context across all the frontier options
Same. For actual production apps I'm typically reviewing the thinking messages and code changes as they happen to ensure it stays on the rails. I heavily use the "revert" to previous state so I can update the prompt with more accurate info that might have come out of the agent's trial and error. I find that if I don't do this, the agent makes a mess that often doesn't get cleaned up on its way to the actual solution. Maybe a similar workflow is possible with Claude Code...
Yeah, autonomy has the cost of your mental model getting desynchronized. You either follow along interactively or spend time catching up later.
You can ask Claude to work with you step by step and use /rewind. It only shows the diff though, which hides some of the problem, since diffs can seem fine in isolation but have obvious issues when viewed in context.
Ya I guess if you have the IDE open and monitor unstaged git, it's a similar workflow. The other cursor feature I use heavily is the ability to add specific lines and ranges of a file to the context. Feels like in the CLI this would just be pasted text and Claude would have to work a lot harder to resolve the source file and range
Probably an ideal compromise solution for you would be to install the official Claude Code extension for VS Code, so you have an IDE for navigating large, complex codebases while still having CC integration.
Claude Code spends most of its time poking around the files. It doesn't have any knowledge of the project by default (no file index etc), unless they changed it recently.
When I was using it a lot, I created a startup hook that just dumped a file listing into the context, or the actual full code on very small repos.
I also got some gains from using a custom edit tool I made which can edit multiple chunks in multiple files simultaneously. It was about 3x faster. I had some edge cases where it broke though, so I ended up disabling it.
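A rough sketch of what such a batch edit tool can look like; this is a guess at the shape, not the commenter's actual implementation:

```typescript
// multi-edit.ts -- sketch of a batch edit tool: apply several exact-match text
// replacements across several files in one tool call, instead of one round trip each.
import { readFileSync, writeFileSync } from "node:fs";

interface Edit { file: string; oldText: string; newText: string; }

function applyEdits(edits: Edit[]): string[] {
  const results: string[] = [];
  for (const { file, oldText, newText } of edits) {
    const content = readFileSync(file, "utf8");
    if (!content.includes(oldText)) {
      results.push(`SKIP ${file}: oldText not found`);   // surface the miss instead of guessing
      continue;
    }
    writeFileSync(file, content.replace(oldText, newText));
    results.push(`OK   ${file}`);
  }
  return results;   // this is the tool result the model sees
}

// One call, many files -- the kind of thing that made the commenter's version ~3x faster.
console.log(applyEdits([
  { file: "src/a.ts", oldText: "const MAX = 10", newText: "const MAX = 20" },
  { file: "src/b.ts", oldText: "retries: 1", newText: "retries: 3" },
]));
```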
Bootstrapped solo dev here. I enjoyed using Claude to get little things done which I had on my TODO list below the important stuff, like updating a landing page, or in your case perhaps adding automated testing for the frontend stuff (so you don't have to click yourself). It's just nice having someone coming up with a proposal on how to implement something; even if it's not the perfect way, it's good as a starter.
Also I have one Claude instance running to implement the main feature, in a tight feedback loop so that I know exactly what it's doing.
Yes, sometimes it takes a bit longer, but I use the time checking what the other Claudes are doing...
Seems like there's a speed/autonomy spectrum where Cursor is the fastest, Codex is the best for long-running jobs, and Claude is somewhere in the middle.
Personally, I found Cursor to be too inaccurate to be useful (possibly because I use Julia, which is relatively obscure) – Opus has been roughly the right level for my "pair programming" workflow.
> in particular the extra layer of diff-review of AI changes (red/green) which is not integrated into git
We're making this better very soon! In the coming weeks hopefully.
I've seen a couple of power users already switching to Pi [1], and I'm considering that too. The premise is very appealing:
- Minimal, configurable context - including system prompts [2]
- Minimal and extensible tools; for example, todo tasks extension [3]
- No built-in MCP support; extensions exist [4]. I'd rather use mcporter [5]
Full control over context is a high-leverage capability. If you're aware of the many limitations of context on performance (in-context retrieval limits [6], context rot [7], contextual drift [8], etc.), you'd truly appreciate that Pi lets you fine-tune the WHOLE context for optimal performance.
It's clearly not for everyone, but I can see how powerful it can be.
---
[1] https://lucumr.pocoo.org/2026/1/31/pi/
[2] https://github.com/badlogic/pi-mono/tree/main/packages/codin...
[3] https://github.com/mitsuhiko/agent-stuff/blob/main/pi-extens...
[4] https://github.com/nicobailon/pi-mcp-adapter
[5] https://github.com/steipete/mcporter
[6] https://github.com/gkamradt/LLMTest_NeedleInAHaystack
[7] https://research.trychroma.com/context-rot
[8] https://arxiv.org/html/2601.20834v1
Pi is the part of moltXYZ that should have gone viral. Armin is way ahead of the curve here.
The Claude sub is the only thing keeping me on Claude Code. It's not as janky as it used to be, but the hooks and context management support are still fairly superficial.
Author of Pi is Mario, not Armin, but Armin is a contributor
The best deep-dive into coding agents (and best architecture) I've seen so far. And I love the minimalism with this design, but there's so much complexity necessary already, it's kind of crazy. Really glad I didn't try to write my own :)
Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent. It would process read-only requests automatically but write requests would send a request to the user to authorize. I haven't yet found somebody else writing this, so I might as well give it a shot
Other than credentialed calls, I have Docker-in-Docker in a VM, so all other actions will be YOLO'd. I think this is the only reasonable system for long-running loops.
> Re: security, I think I need to make an AI credential broker/system. The only way to securely use agents is to never give them access to a credential at all. So the only way to have the agent run a command which requires credentials, is to send the command to a segregated process which asks the user for permission, then runs it, then returns status to the agent
This is a problem that model context protocol solves
Your MCP server has the creds, your agent does not.
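A sketch of that split, whether it lives behind MCP or (as here) a plain local HTTP service: the broker process holds the key, auto-runs reads, and gates writes behind a human prompt. The endpoint, port, and request shape below are made up for illustration:

```typescript
// cred-broker.ts -- sketch of a credential broker: the agent calls this local service,
// the service holds the real API key, auto-runs read-only requests, and asks a human
// before anything that writes. The agent never sees the key.
import { createServer } from "node:http";
import { createInterface } from "node:readline";

const API_KEY = process.env.REAL_API_KEY ?? "";   // lives only in this process

async function humanApproves(desc: string): Promise<boolean> {
  const rl = createInterface({ input: process.stdin, output: process.stdout });
  const answer = await new Promise<string>((res) => rl.question(`ALLOW write: ${desc}? [y/N] `, res));
  rl.close();
  return answer.trim().toLowerCase() === "y";
}

createServer(async (req, res) => {
  // Body shape (made up): { method, url, payload } describing the upstream call to make.
  const raw = await new Promise<string>((resolve) => {
    let data = ""; req.on("data", (c) => (data += c)); req.on("end", () => resolve(data || "{}"));
  });
  const body = JSON.parse(raw);
  const isWrite = body.method !== "GET";
  if (isWrite && !(await humanApproves(`${body.method} ${body.url}`))) {
    res.writeHead(403); res.end("denied by user"); return;
  }
  // Forward upstream with the real credential attached here, not in the agent's context.
  const upstream = await fetch(body.url, {
    method: body.method,
    headers: { Authorization: `Bearer ${API_KEY}`, "Content-Type": "application/json" },
    body: isWrite ? JSON.stringify(body.payload ?? {}) : undefined,
  });
  res.writeHead(upstream.status); res.end(await upstream.text());
}).listen(8377, () => console.log("broker listening on :8377"));
```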
Pi probably has the best architecture, and being written in JavaScript it is well positioned to use the browser sandbox architecture that I think is the future for AI agents.
I only wish the author changed his stance on vendor extensions: https://github.com/badlogic/pi-mono/discussions/254
“standardize the intersection, expose the union” is a great phrase I hadn’t heard articulated before
You’ve never heard it before because explicitly signaling “I know basic set theory” is kind of cringy
Armin Ronacher wrote a good piece about why he uses Pi here: https://lucumr.pocoo.org/2026/1/31/pi/
I hadn't realized that Pi is the agent harness used by OpenClaw.
I don't know how to feel about being the only one refusing to run YOLO mode until the tooling is there, which is still about 6 months away for my setup. Am I years behind everyone else by then? You can get pretty far without completely giving in. Agents really don't need to execute that many arbitrary commands: linting, search, edit, and web access should all be bespoke tools integrated into the permission and sandbox system. Agents should not even be allowed to start and stop applications that support dev mode; they edit files, can test and get the logs, so what else would they need to do? Especially as the number of external dependencies that make sense shrinks to a handful, you can approve every new one without headache. If your runtime supports sandboxing and permissions, like Deno or workerd, that adds an initial layer of defense.
This makes it even more baffling why Anthropic went with Bun, a runtime without any sandboxing or security architecture, and will rely on Apple Seatbelt alone?
You use YOLO mode inside some sandbox (VM, container). Give the container only access to the necessary resources.
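For example, something like this (image and mount are just illustrative), which exposes only the current project to the container:

    # project directory mounted read/write; nothing else from $HOME is visible
    docker run --rm -it \
      --cap-drop ALL \
      -v "$PWD":/work -w /work \
      node:22 bash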
But even then, the agent can still exfiltrate anything from the sandbox, using curl. Sandboxing is not enough when you deal with agents that can run arbitrary commands.
What is your threat model?
If you're worried about a hostile agent, then indeed sandboxing is not enough. In the worst case, an actively malicious agent could even try to escape the sandbox with whatever limited subset of commands it's given.
If you're worried about prompt injection, then restricting access to unfiltered content is enough. That would definitely involve not processing third-party input and removing internet search tools, but the restriction probably doesn't have to be mechanically complete if the agent has also been instructed to use local resources only. Even package installation (uv, npm, etc) would be fine up to the existing risk of supply-chain attacks.
If you're worried about stochastic incompetence (e.g. the agent nukes the production database to fix a misspelled table name), then a sandbox to limit the 'blast radius' of any damage is plenty.
That argument seems to assume a security model where the default prior is « no hostile agent ». But that’s the problem, any agent can be made hostile with a successful prompt injection attack. Basically, assuming there’s no hostile agent is the same as assuming there’s no attacker. I think we can agree a security model that assumes no attacker is insufficient.
It depends on what you're trying to prevent.
If your fear is exfiltration of your browser sessions and your computer joining a botnet, or accidental deletion of your data, then a sandbox helps.
If your fear is the LLM exfiltrating code you gave it access to, then a sandbox is not enough.
I'm personally more worried about the former.
Code is not the only thing the agent could exfiltrate; what about API keys, for instance? I agree sandboxing as defense in depth is good, but it's not sufficient and can lull you into a false sense of security.
This is what emulators and separate accounts are for. Ideally you can use an emulator and never let the container know about an API key. At worst you can use a dedicated account/key for dev that is isolated from your prod account.
VM + dedicated key with quotas should get you 95% there if you want to experiment. Waiting is also an option; so much of the workflow changes as the months pass that you're not missing much.
Sure, though really these are guidelines for any kind of development, not just the agentic kind.
That depends on how you configure or implement your sandbox. If you let it have internet access as part of the sandbox, then yes, but that is your own choice.
Internet access is required to install third party packages, so given the choice almost no one would disable it for a coding agent sandbox.
In practice, it seems to me that the sandbox is only good enough to limit file system access to a certain project; everything else (code or secret exfiltration, installing vulnerable packages, adding prompt injection attacks for others to run) is fair game if you're in YOLO mode like pi here.
Maybe a finer grained approach based on capabilities would help: https://simonwillison.net/2025/Apr/11/camel/
Right idea but the reason people don't do this in practice is friction. Setting up a throwaway VM for every agent session is annoying enough that everyone just runs YOLO on their host.
I built shellbox (https://shellbox.dev) to make this trivial -- Firecracker microVMs managed entirely over SSH. Create a box, point your agent at it, let it run wild. You can duplicate a box before a risky operation (instant, copy-on-write) and delete it after.
Billing stops when the SSH session disconnects.
No SDK, no container config, just ssh. Any agent that can run shell commands works out of the box.
Apart from the fact that nearly no one uses VMs as far as I can tell, even if they did, a VM does not magically solve all the issues; it's just one part of the needed tooling.
I'm curious about how the costs compare using something like this, where you're hitting APIs directly, vs my $20 ChatGPT plan, which includes Codex.
You can use your ChatGPT subscription with Pi!
Main reason I haven’t switched over to the new pi coding agent (or even fully to Claude Code alternatives) is the price point. I eat tokens for breakfast, lunch, and dinner.
I’m on a $100/mo plan, but the codex bar makes it look like I’m burning closer to $500 every 30 days. I tried going local with Qwen 3 (coding) on a Blackwell Pro 6000, and it still feels a beat behind, either laggy, or just not quite good enough for me to fully relinquish Claude Code.
Curious what other folks are seeing: any success stories with other agents on local models, or are you mostly sticking with proprietary models?
I’m feeling a bit vendor-locked into Claude Code: it’s pricey, but it’s also annoyingly good
According to the article, Pi massively shrinks your context use (due to smaller system prompt and lack of MCPs) so your token use may drop. Also Pi seems to support Anthropic OAuth for your plan (but afaik they might ban you)
And it's doubtful they are anywhere near break-even costs.
Minimal, intentional guidance is the cornerstone of my CLAUDE.md’s design philosophy document.
https://github.com/willswire/dotfiles/blob/main/claude/.clau...
I’m just curious why your writing is punctuated by lots of word breaks. I hardly see hyphenated word breaks across lines anymore and it made me pause on all those occurrences. I do remember having to do this with literal typewriters.
According to dev tools, this is just `hyphens: auto` in the CSS.
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
https://github.com/NTT123/nano-agent
Being minimalist is real power these days as everything around us keeps shoving features in our face every week with a million tricks and gimmicks to learn. Something minimalist like this is honestly a breath of fresh air!
The YOLO mode is also good, but having a small ‘baby setting mode’ that’s not full-blown system access would make sense for basic security. Just a sensible layer of "pls don't blow my machine" without killing the freedom :)
Pi supports restricting the set of tools given to an agent; `pi --help` shows an example of this.
Otherwise, "yolo mode" inside a sandbox is perfectly reasonable. A basic bubblewrap configuration can expose read-only system tools and have a read/write project directory while hiding sensitive information like API keys and other home-directory files.

I particularly liked Mario's point about using tmux for long-running commands. I've found models to be very good at reading from / writing to tmux, so I'll do things like spin up a session with a REPL, use Claude to prototype something, then inspect it more deeply in the REPL.
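The loop is roughly this (session name, REPL, and commands are just an example):

    # start a detached session running a Python REPL
    tmux new-session -d -s repl 'python3 -q'

    # the agent (or you) types into it...
    tmux send-keys -t repl 'import json; print(json.dumps({"ok": True}))' Enter

    # ...and reads the scrollback to see what happened
    tmux capture-pane -t repl -p | tail -n 20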
> The second approach is to just write to the terminal like any CLI program, appending content to the scrollback buffer
This is how I prototyped all of mine. Console.Write[Line].
I am currently polishing up one of the prototypes with WinForms (.NET 10) & WebView2. Building something that looks like a WhatsApp conversation in basic WinForms is a lot of work; it takes about 60 seconds in a web view.
I am not too concerned about cross-platform, because the vast majority of my users will be on Windows when they'd want to use this tool.
If you use WPF you can have the Mica backdrop underneath your WebView2 content and set the WebView2 to a transparent background color, which looks nice and a little more native, FYI. Though if you're doing more than just showing the WebView, it may not be worth switching.
I like the idea of using a transparent background in the webview. That would compose really well.
The primary motivation for WinForms was getting easy access to OS-native multiline input controls, clipboard, audio, image handling, etc. I could have just put Kestrel in the console app and served it as a pure web app, but that is a bit more clunky from a UX perspective (separate browser window, permissions, etc.).
I was confused by him basically inventing his own skills, but I guess this is from Nov 2025, so it makes sense; skills were pretty new at that point.
Also please note this is nowhere on the terminal bench leaderboard anymore. I'd advise everyone reading the comments here to be aware of that. This isn't a CLI to use. Just a good experiment and write up.
It's batteries-not-included, by design. Here's what it looks like with batteries (and note who owns this repo):
https://github.com/mitsuhiko/agent-stuff/tree/main
Perhaps benchmarks aren't the best judge.
Glad to see more people doing this!
I built on ADK (Agent Development Kit), which comes with many of the features discussed in the post.
Building a full, custom agent setup is surprisingly easy and a great learning experience for this transformational technology. Getting into instruction and tool crafting was where I found the most ROI.
The solution to the security issue is using `useradd`.
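i.e. something along these lines (the user name is just an example):

    # give the agent its own user, with an empty home and no access to yours
    sudo useradd --create-home agent-sandbox
    # open a login shell as that user and start the agent from there
    sudo -iu agent-sandbox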
I would add subagents, though. They allow for the pattern where the top agent directs / observes a subagent executing a step in a plan.
The top agent is better at directing a subagent, and it keeps its own context clean of details that don't matter; otherwise those details would all end up in the same step in the plan.
There are lots of ways of doing subagents. It mostly depends on your workflow. That's why pi doesn't ship with anything built in. It's pretty simple to write an extension to do that.
Or you use any of the packages people provide, like this one: https://github.com/nicobailon/pi-subagents
I really like pi and have started using it to build my agent. Mario's article fully lays out the design trade-offs and complexities involved in building coding agents, and even general agents. I have benefited a lot!
I always wonder what kind of moat systems / businesses like these have
edit: referring to Anthropic and the like
Capital, both social and economic.
Also data, see https://news.ycombinator.com/item?id=46637328
The only moat in all of this is capital.
It's open source. Where does it say he wants to monetise it?
None, basically.
I do think Claude Code as a tool gave Anthropic some advantages over others. They have plan mode, todolist, askUserQuestion tools, hooks, etc., which greatly extend Opus's capabilities. Agree that others (Codex, Cursor) also quickly copy these features, but this is the nature of the race, and Anthropic has to keep innovating to maintain its edge over others
(I work at Cursor) We have all these! Plan mode with a GUI + ability to edit plans inline. Todos. A tool for asking the user questions, which will be automatically called or you can manually ask for it. Hooks. And you can use Opus or any other models with these.
The biggest advantage by far is the data they collect along the way. Data that can be bucketed to real devs, with signals extracted from it, can be top tier. All that data + signals + whatever else they cook up can be re-added to the training corpus and the models re-trained / version++ on the new set. Rinse and repeat.
(this is also why all the labs, including some Chinese ones, are subsidising / me-too-ing coding agents)
An excellent piece of writing.
One thing I do find is that subagents are helpful for performance -- offloading tasks to smaller models (gpt-oss specifically for me) gets data to the bigger model quicker.
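As a rough illustration of the offloading (assuming a local OpenAI-compatible server in front of the small model; the endpoint and model name are placeholders):

    // Ask a small local model to distill a big blob of tool output so the
    // main model's context only receives the summary.
    async function summarizeWithSmallModel(text: string): Promise<string> {
      const res = await fetch("http://localhost:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "gpt-oss-20b", // placeholder
          messages: [
            { role: "system", content: "Summarize the following tool output in at most 10 bullet points." },
            { role: "user", content: text },
          ],
        }),
      });
      const data = await res.json();
      return data.choices[0].message.content;
    }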
>The only way you could prevent exfiltration of data would be to cut off all network access for the execution environment the agent runs in
You can sandbox off the data.
One aspect that resonates from stories like this is the tension between opinionated design and real-world utility.
When building something minimal, especially in areas like agent-based tooling or assistants, the challenge isn’t only about reducing surface area — it’s about focusing that reduction around what actually solves a user’s problem.
A minimal agent that only handles edge cases, or only works in highly constrained environments, can feel elegant on paper but awkward in practice. Conversely, a slightly less minimal system that still maintains clarity and intent often ends up being more useful without being bloated.
In my own experience launching tools that involve analysis and interpretation, the sweet spot always ends up being somewhere in the intersection of: - clearly scoped core value, - deliberately limited surface, and - enough flexibility to handle real user variation.
Curious how others think about balancing minimalism and practical coverage when designing agents or abstractions in their own projects.
begone, bot
Can I replace Vercel’s AI SDK with Pi’s equivalent?
It's not an API drop in replacement, if that's what you mean. But the pi-ai package serves the same purpose as Vercel's AI SDK. https://github.com/badlogic/pi-mono/tree/main/packages/ai
I'll check it out, thanks for your work on this!
Not only did you build a minimal agent, but also the framework around it so anyone can build their own. I'm using Pi in the terminal, but I see you have web components. Any tips on creating a "Chat mode" where the messages are like chat bubbles? It would be easier to use on mobile.
The web package has a minimal example. I'm not a frontend developer, so YMM hugely V.
"Also, it [Claude Code] flickers" - it does, doesn't it? Why?.. Did it vibe code itself so badly that this is hopeless to fix?..
Because they target a 60 fps refresh, with 11 of the 16 ms budget per frame being wasted by React itself.
They are locked into this naive, horrible framework that would be embarrassing to open source even if they had permission to do it.
That's what they said, but as far as I can see it makes no sense at all. It's a console app. It's outputting to stdout, not a GPU buffer.
The whole point of React is to update the real browser DOM (or rather their custom ASCII backend, presumably, in this case) only when the content actually changes. When that happens, surely you'd spurt out some ANSI escape sequences to update the display. You're not constrained to do that in 16 ms, and you don't have a vsync signal you could synchronise to even if you wanted to. Synchronising to the display is something the tty implementation does. (On a different machine, if you're using it over ssh!)
Given their own explanation of react -> ascii -> terminal, I can't see how they could possibly have ended up attempting to render every 16 ms and flickering if they don't get it done in time.
I'm genuinely curious if anybody can make this make sense, because based on what I know of react and of graphics programming (which isn't nothing) my immediate reaction to that post was "that's... not how any of this works".
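To be concrete, the event-driven version I'd expect looks roughly like this (purely illustrative, obviously not what Claude Code actually does):

    // Redraw only when the rendered frame actually changes; there is no frame
    // budget and no timer-driven render loop.
    let lastFrame = "";

    function render(status: string) {
      const frame = `working... ${status}`;
      if (frame === lastFrame) return;            // nothing changed, emit nothing
      lastFrame = frame;
      process.stdout.write("\r\x1b[2K" + frame);  // \x1b[2K clears the current line
    }

    // In a real agent, updates arrive as events (token chunks, tool output);
    // simulated here with a timer purely for the demo.
    let steps = 0;
    const sim = setInterval(() => {
      render(`${++steps}/5 steps done`);
      if (steps === 5) { clearInterval(sim); process.stdout.write("\n"); }
    }, 250);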
Claude Code programmers are very open that they vibe code it.
I don't think they say they vibe code it, just that Claude writes 100% of the code.
I'm writing my own agent too as a side project at work. This is a good article but simultaneously kinda disappointing. The entire agent space has disappeared down the same hole, with exactly the same core design used everywhere and everyone making the same mistakes. The focus on TUIs I find especially odd. We're at the dawn of the AI age and people are trying to optimize the framerate of Teletext? If you care about framerates, use a proper GUI framework!
The agent I'm writing shares some ideas with Pi but otherwise departs quite drastically from the core design used by Claude Code, Codex, Pi etc, and it seems to have yielded some nice benefits:
• No early stopping ("shall I continue?", "5 tests failed -> all tests passed, I'm done" etc).
• No permission prompts but also no YOLO mode or broken Seatbelt sandboxes. Everything is executed in a customized container designed specifically for the model and adapted to its needs. The agent does a lot of container management to make this work well.
• Agent can manage its own context window, and does. I never needed to add compaction because I never yet saw it run out of context.
• Seems to be fast compared to other agents, at least in any environment where there's heavy load on the inferencing servers.
• Eliminates "slop-isms" like excessive error swallowing, narrative commenting, dropping fully qualified class names into the middle of source files etc.
• No fancy TUI. I don't want to spend any time fixing flickering bugs when I could be improving its skill at the core tasks I actually need it for.
It's got downsides too: it's very overfit to the exact things I've needed and the corporate environment it runs in. It's not a full replacement for CC or Codex. But I use it all the time and it writes nearly all my code now.
The agent is owned by the company and they're starting to ask about whether it could be productized so I suppose I can't really go into the techniques used to achieve this, sorry. Suffice it to say that the agent design space is far wider and deeper than you'd initially intuit from reading articles like this. None of the ideas in my agent are hard to come up with so explore!
As a user of a minimal, opinionated agent (https://exe.dev) I've observed at least 80% of this article's findings myself.
Small and observable is excellent.
Letting your agent read traces of other sessions is an interesting method of context trimming.
Especially "always YOLO" and "no background tasks". The LLM can manage Unix processes just fine with bash (e.g. ps, lsof, kill), and if you want you can remind it to use systemd, and it will. (It even does it without rolling its eyes, which I normally do when forced to deal with systemd.)
Something he didn't mention is git: talk to your agent a commit at a time. Recently I had a colleague check in his minimal, broken PoC on a new branch with the commit message "work in progress". We pointed the agent at the branch and said, "finish the feature we started" and it nailed it in one shot. No context whatsoever other than "draw the rest of the f'ing owl" and it just.... did it. Fascinating.