Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it were near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds, or a few minutes at most, to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.
If we get low enough latency, there's no reason to multitask. You ask it to do one thing at a time and immediately see what it did. That's a nice way to work!
This is the normal way to use computers. They should spend most of their time idle, waiting on us. We shouldn't be waiting for them or spinning more plates to keep them busy.
However, a faster LLM isn't enough. You also need fast compiles and fast tests.
I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying. I have some easy, boring task to do and think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer.
It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour
I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.
I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.
Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.
I believe the outputs are a training coincidence with consequences that are opportune for the labs.
Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.
I don't think token speed matters as much when a lot of tokens are needed to achieve a task. E.g., in the Artificial Analysis benchmarks, DeepSeek V4 is one of the biggest token burners.
I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.
not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation
Well, I used an extreme example. OTOH, I’ve done quite a few of those “fix CI” or “migrate X” prompts recently and while there is a fixed component like running CI / builds, I’d say the LLM time is still around or above 50%, especially at the beginning of the project. Then there are also regular tasks that now take minutes per message, which completely takes me out of the zone. I imagine iterating on those in near real time would be a big change.
I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.
I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.
There can't be many normal use cases where there'd be any cost benefit.
The "traditional" way we vibe code is human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc. -> users use "binary". At the speed of 1000 tok/sec, user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.
It's a cute toy right now, but you can tell an LLM that it's an http server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.
1000 tokens per sec is still massively slower than serving a normal web page - if something doesn't respond in a few seconds many people give up.
I'm not saying there aren't any use cases for super-fast (and super-expensive) generation, but it does seem a bit niche. If it was free then sure faster is better, but what are the mainstream use cases where people might pay 3x more for a faster version of something that is already fast?
I think it would have to be an application where it paid for itself - where the 10x faster response was actually worth more than 3x the cost to you - where the extra speed was worth the extra cost.
This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.
So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can do now in 2h what before it took 2 days. Why? Because it's not that you have the rest of the day for yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.
So, if anything, I would say it's worse for us. Obviously, it's the complete opposite for corporations and executives: they are loving the AI situation so much!
I was saying that AI is going to make software development cheaper, as in the salaries of software engineers will go down, because some of that salary will now be redirected to AI companies and because the world will need to absorb twice (10x?) the amount of development power.
I dig into problems way, way deeper with AI than without. I can also add a lot more polish to features, add more test coverage, write more documentation, explore multiple approaches rather than go with gut-feel, and so on.
You can dig deeper into problems with AI. For me, it supplements my knowledge in domains I don’t fully understand. It also helps me learn. So I can tackle problems I wouldn’t otherwise.
I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.
That's the fundamental trade off of a job where someone else gives you stuff to do and you get money. You trade time for money. If, instead, you work for yourself; contracting, writing your own apps, buying lottery tickets, then you're trading results for money. If you're a freelance web developer with a stable of clients, it's a great time! What used to take a week takes hours, and you can charge your clients the same amount to build an even better website with you using AI, which means you get the choice of building a new website for additional clients, or you can take the time off and not build additional websites. But you have to hustle to continually get new clients, before AI and after AI. So it's a different life.
I think of it as a genetic algorithm loop. The LLM is basically a mutator function within the loop. If you can define the end shape you're looking for using tests and specification then you can throw the LLM at the problem and have it converge on the solution. It generates some code, it gets run, the LLM is fed the result back, and it iterates. If you can run the LLM at a really high throughput, then you can iterate on the solution faster. This can largely compensate for the overall capability of the model. Instead of hoping it gets the right solution in a few shots, you can just have it try a whole bunch of things until you get a useful result.
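A minimal sketch of that generate, run, feed back loop, under loud assumptions: `toy_mutate` is a random character edit standing in for the LLM call (no real model is involved), `toy_score` stands in for the test suite, and every name here is hypothetical.

```python
import random
import string

ALPHABET = string.ascii_lowercase + " "


def converge(mutate, score, seed, target_score, max_iters=20_000):
    """Propose a change, run the checks, keep the change if it is no worse."""
    best, best_score = seed, score(seed)
    for _ in range(max_iters):
        if best_score >= target_score:
            break  # the "spec" is fully satisfied
        candidate = mutate(best)            # stand-in for the LLM proposing an edit
        candidate_score = score(candidate)  # stand-in for running the tests
        if candidate_score >= best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy problem: the "specification" is a target string, the "tests" count matches.
GOAL = "fast tests"
rng = random.Random(0)  # seeded so the run is repeatable


def toy_mutate(text):
    """Replace one random character (assumption: a real mutator would be an LLM)."""
    i = rng.randrange(len(text))
    return text[:i] + rng.choice(ALPHABET) + text[i + 1:]


def toy_score(text):
    """Number of positions that already match the goal."""
    return sum(a == b for a, b in zip(text, GOAL))


result, passed = converge(toy_mutate, toy_score, "x" * len(GOAL), len(GOAL))
print(result, passed)
```

The mutator here is blind, so it needs many cheap iterations to converge. That is the trade being described: the weaker each individual proposal is, the more the loop depends on raw iteration speed and on a checker you can trust.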
>instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.
If you're treating it like a slot machine you're doing it wrong. It will give you exactly what you ask for if you ask clearly, i.e. write a clear, detailed specification, not just "do X!". The nondeterminism comes from vagueness in specification.
Generally, I agree because what happens is the messaging around AI is doing more, faster. Not using AI to deliver at a higher quality level, etc. But I think it boils down to incentives and discipline. So given the incentives we have today at most workplaces faster AI will just be used to produce more slop.
These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are already finding issues with their AI bills.
I have a GitHub Copilot yearly subscription. Microsoft recently changed their billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.
Another problem is that US models are all closed source, and if you're a large corporate you may not want your org to be held hostage by OpenAI / Anthropic.
I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?
I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.
> you may not want your org to be held hostage by OpenAI / Anthropic
Or Google. I'm working with multiple customers right now that are very pissed at Google for deprecating Gemini 2.5 Flash and canning the GA release of 3.0 Flash. They now have to decide whether to bite the bullet on the 5x price increase for 3.5 Flash or switch providers. Quite a few of them will likely fully pivot to open models.
I see a bigger problem with model inconsistency. You never know whether Anthropic will route your request to a cheaper model for the price of Opus. So you can never estimate how much a task will cost, because you might have to restart several times and pay for each attempt. Then you have to prompt models to gauge whether they are real or impostors, which also adds to token usage.
no they 100% use MTP with a cheaper model alongside opus, and it would in fact be unprovable if they just sometimes switched to auto-accepting everything from the MTP. it's true that if they did anthropic would need to hide that they do this, so it's probably not a huge deal
I wonder what are the economics driving these pricing decisions? Are the Chinese companies just subsidizing their models to a greater degree than the US, or is this an emergent property of energy policy between countries?
Throwing out another factor: Chinese companies have been banned and/or limited from buying nvidia, and turned to local companies for their hardware. I haven't actually seen pricing/benchmarks comparing Chinese AI accelerators, but it wouldn't surprise me if that also worked out in their favor as well.
Lower cost of labor, lots of under the hood optimizations (e.g. cache hits for DS), many of these companies have existing infra (fewer upfront costs for deployment), etc
China isn't that cheap for labor. And if you think the guys in Z.ai or xiaoxiao aren't the exact same guys from Tsinghua, Peking, MIT, Stanford, CMU, etc. and pulling in amazing salaries you'd be wrong.
MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.
Another thing: the big labs accuse open-weight models of benefiting from distillation and other techniques, essentially avoiding the higher training costs (which typically bleed into the bills end users pay for inference).
Big labs ripped videos off YouTube without caring about the ToS, and grabbed as much published literature as they could get their hands on, regardless of legality (Books3, The Pile). The goal of "democratizing human knowledge" by way of thinking machines is far too noble to worry about frivolities like copyright and authorial consent, they said. Until it was their output being exploited, and their earning potential threatened.
We just had years of US model providers arguing it was fine to rip off the world’s cultural output for their own profit, why should their work be treated any different?
True, but why would end users care about that? If anything, training on synthetic AI output is more ethical than on scraped human works (of course, not to say the Chinese labs aren't doing the latter)
I may sound like a shill, but exponential growth and all. We are going to get near-instant software from a prompt, generate multiple versions, and then choose the best one.
Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
Sounds like exponential growth of crappy software. I'm not saying that before we didn't have mass produced crap in SE, but now it will turn into explosive overflow.
We are living in a ZIRP-like era where builders at the fastest pace layer have misattributed their velocity to exponential gains in model capability. In fact, they are surfing on decades of careful effort to build a robust foundation of highly reusable software libraries.
This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
It's not just software libraries. Specs, applications (the browser!), expectations, device integrations, operating systems, etc. So much that starting from scratch seems impossible.
I'm not agreeing or disagreeing with you, but my brain cannot comprehend how machines can advance such interconnected systems while keeping humans in focus.
Perhaps I shouldn't have watched the Animatrix again.
> This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
There will only be a reckoning if models don't get much better.
If they do get much better you can just have them refactor, fix bugs in, or replace the existing codebase.
The concept of tech debt is sort of meaningless if you anticipate intelligence gains in models to continue.
You could say the same about when higher-level languages were getting popular.
Previously programming was the domain of Math, Physics, and EE doctorates. These days we even have coding bootcamps that take a few months.
"exponential growth of crappy X" applies to every industry that went from being an artisanal craft to being mass produced with little or no human input. and we live much better lives than we did before the industrial revolution.
I still can't tell from the outside whether it sounds like a great time to be in security because of the vulnerable slop being churned out, or a terrible time because the people paying to make it don't care.
I am more and more inclined not to believe this crappy software theory.
Especially as teams invest in proper agentic harnessing.
We have had a champion in our team that has invested a lot of time into it over the last 4 months, and if anything, quality has improved, not decreased. Architecture is more coherent, codebase has been cleaned up, agents find information quickly, code produced is very solid and my role is more and more checking that the output meets the requirements. But I cannot confidently say that I would've done a better job than AI; more often than not, I have to admit it does a better job than I do.
The mistakes are less and less technical and merely in the domain mapping. And AI is still not as creative as I am at quickly finding solutions to unlock stakeholders' issues. Also, AI is still not as creative as I am at finding the proper solutions for advanced technical problems. But it does a better job than me, even on that front, one-shotting a few solutions in a fraction of the time it would've taken me to test one idea myself.
Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.
And yet, I have the genuine belief that a few years from now we'll be cloning open source repositories that are already optimized/harnessed and tested for agentic loops and best practices left and right, with software engineers mostly overseeing the domain translation and putting their 2 cents on the non-boilerplatey parts of the product (which, in general, are a small part of the surface).
I think that the next years of my career will be mostly spent in setting up and writing the harnessing and domain mapping part. Then I will move to another sector, not because I necessarily believe I won't have a job, but because I want to vomit thinking that's going to be my job.
"Watching John with the machine, it was suddenly so clear. The terminator would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die to protect him. Of all the would-be fathers who came and went over the years, this thing, this machine, was the only one who measured up. In an insane world, it was the sanest choice."
As long as you've indicated what you want, the machine will try to do what you ask of it. It won't get tired because "the codebase is too big", or it has gotten bored of the pattern, or it wants to introduce a new technology.
It just does the thing you asked of it. (Note that, yes, I get that as a codebase grows, it becomes more difficult to fit into context, but that only applies if it needs to read a large percentage of the project to implement the task, which shouldn't be the case.)
there are good actors, which are empowered by AI to produce positive impact, but often there are N times more bad actors, which push crappy code to close feature requests fast, increase performance LoC-like metrics, etc.
The exponential is leading to full compute-in-memory within a few years which will be 100 times more efficient. Which means at least 10 times larger models that are much smarter in addition to extremely fast.
It's going to skip the code entirely for small businesses and just render UIs straight from context data and prompts at interactive speeds. Kind of like Google's Genie does with games but much more accurately.
> when a new frontend framework came out every 3 months.
> No one cares anymore.
I never cared about this.
I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) Some of the biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.
But I think the eventual goal is that documentation won't even be needed. LLMs should just understand the nuances of frameworks themselves by analyzing their codebases.
I'm not sure. Engineers could still develop software the old way, you know taking months to deliver something like, let's say, Obsidian? Or Ghostty? Taking care of every single line of code, of dependencies, of good architecture. Truly the old way. And if the product is good it will succeed.
Could you imagine Obsidian being posted on HN today, if it weren't really popular already? There's no way a tiny team working on a note taking program would make it out of new, no matter how good it was. I wouldn't click the link, myself.
> Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.
I know I've made several refactors that would have otherwise been insane lifts. Not only because of the work involved but because sometimes you don't know if it will work, and so you have a sort of double friction; you don't know if it will even succeed. With an AI you can just throw it at the refactor to see if it runs into a problem, all while you're having a coffee break or w/e.
In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.
And how are you going to determine which is the best?
Going through all the possible combinations of users and usage?
So mostly it shifts the work from generation to validation.
The models might be so fast that they can autocomplete your prompt before you even finish it, and generate dozens of possible applications before you're even done asking.
You won't. Because 80% of the complexity is just "knowing what to build". You will get something that gives you a prototype in 1 min, then you break it, then you get a slightly better prototype on one side, but newly broken in another way, and you're going to repeat over and over.
And for any non-trivial application, the space of possibilities grows so quick that you'll never even be able to _touch_ all the moving parts of the application and verify them.
This will be really powerful for voice. Being able to reason makes LLMs so much smarter, but with voice your latency budget is so tight that you typically can't spare the time.
Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
i tried glm 4.7 for agents that write code. simple scripts 200-1000 LOC. extremely bad. Had to abandon the cerebras offering, their smart models are only on the enterprise plan.
Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.
TFA mentions that until now special, very expensive hardware like Cerebras was required to reach these kinds of speeds, and it emphasizes that what is novel in their results is that they have obtained over 1000 token/s for a model with over 1 T parameters by using just standard hardware, i.e. one server with 8 GPUs.
> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
I think these types of demo videos should allow people to get a sense of super intelligence. Because it's very hard to imagine something that is say three times as smart as you -- by definition you wouldn't be able to comprehend its thoughts -- but this shows clearly what something that can think 100 times faster than you is like.
I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
It uses significantly more resources obviously. And/or they have to configure or reconfigure servers for it, which takes time, and doesn't make sense until they have proven the demand at the higher price point.
Maybe they don't have enough racks. The news indicates that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so cheap, they probably want to use the other GPUs for stuff that has higher margins.
I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-V2.5-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?
I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.
With a tps and a token price you can calculate approx. price per hour of running the model!
$2.61/M tokens * 1,000 tok/s * 3,600 s/hr = $9.40/hr
That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
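The arithmetic above as a small sketch. It assumes one stream generating continuously at the quoted speed and billed only at the quoted token price; the $45/hr node figure is the rough number from the comment, and the function name is made up for illustration.

```python
import math


def dollars_per_hour(price_per_million_tokens, tokens_per_second):
    """Revenue from one stream that generates tokens nonstop for an hour."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_million_tokens * tokens_per_hour / 1_000_000


stream_rate = dollars_per_hour(2.61, 1000)  # about $9.40/hr per stream
node_cost = 45.0                            # assumed hourly cost of an 8-GPU node

# Number of always-busy streams needed before token revenue covers the node.
streams_to_break_even = math.ceil(node_cost / stream_rate)

print(f"${stream_rate:.2f}/hr per stream, {streams_to_break_even} streams to cover the node")
```

So under these assumptions the node would need roughly five fully busy streams at that speed to pay for itself, which is why the number of parallel streams it can handle matters so much.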
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.
I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.
Chinese "companies" are not companies in the western sense, but more like government departments with capitalist styling to deceive the western audience.
From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.
it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.
- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out
- persistent engine kernel: this is like CUDA 101
- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now
- MXFP4 QAT: not new
- TileRT: hard to tell what this actually does, there's a PyPI wheel with support for DS 3.2 and GLM 5 but binary only
This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.
edit: now that I've read the article fully, it seems like they use some very effective MTP algorithm, and somehow the quality is still decent enough.
though, I doubt that the quality really only drops a bit like they claimed. maybe for the benchmarks, but for general use heavily quantized models very often show worse results.
i wonder if it will be possible to hardcode a model with some kind of MTP-adjacent algorithm to use a smaller portion of it to generate most of the tokens but route to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this is done only when it's generating its thinking block, and the training takes it into account)
Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM
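For reference, the plain draft-and-verify idea (greedy speculative decoding) that a scheme like this would extend can be sketched with toy models. This is not the MTP implementation from the article: `target` and `draft` below are hypothetical stand-in functions over integer tokens, and a real system would verify all drafted positions in a single batched forward pass of the big model, which the sketch only counts.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks the proposal (counted as one pass).
        target_passes += 1
        ctx = list(out)
        for token in proposal:
            expected = target(ctx)
            if expected != token:
                ctx.append(expected)  # first mismatch: keep the target's token
                break
            ctx.append(token)         # match: the drafted token is accepted
        else:
            ctx.append(target(ctx))   # everything accepted: one bonus token
        out = ctx
    return out[:len(prompt) + max_new], target_passes


def target(ctx: List[int]) -> int:
    """Toy 'big model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx: List[int]) -> int:
    """Toy 'small model': agrees with the target except right after a 2."""
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(target, draft, [0], max_new=12)

# Baseline: ordinary greedy decoding, one target pass per token.
baseline = [0]
for _ in range(12):
    baseline.append(target(baseline))

print(tokens == baseline, passes)
```

With greedy verification the output is identical to what the big model alone would produce; the saving is in how many sequential big-model passes are needed. Skipping or loosening the verification step, as speculated above, would trade that guarantee for more speed.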
I mean, sure, in the sense that it's a real and meaningful number for most of the spectrum on offer, and only gets silly when the number gets too high? There's a pretty big usability difference between 10t/s and 100t/s, and I can imagine similarly for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.
The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
Sliding window for the draft model, not for the main. 42B for active params because it’s a sparse MoE which is a common technique for the larger models to not get bottlenecked by memory bandwidth.
This is only 3.1 8B and a very small context window, but at 17k tokens per second it's likely enough to reliably call tools, which would make a huge difference in agentic applications. Assuming they can bake in better models, I'm just as bullish or even more so on this, considering it opens up edge computing at extremely low power requirements.
Pfff time wasting.
1 password between 8-16 characters, and this and that... What???
2 Captcha after captcha, come on
3 Service unavailable
This service is not available in your region yet.
Are you kidding me. Come back when you are ready for the users. I was hoping to try it, what a frustration.
A few things in life I can't fully grasp why they are so sought after. One is that constant need to exhibit growth. As if being massive and staying as massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now as if that is not enough we want more speed. "I want a full software product from scratch in 12 seconds, because 5 minutes is too long and I got things to do..."
different use cases for different people. some people are nurturing a code base and ensuring it doesn't become a gross mess so they become the bottleneck. some people are just trying to prompt stuff into existence and don't know what sql is.
I think this site often overlooks that second group and how large it likely is.
I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.
The example in the video was the generation of a dashboard app of some sort. I can do that with a "normal speed" Claude in a few minutes. The difference is a few minutes. This is compared to a few weeks in old school development time. I don't have a problem with taking it a little "slow" (as in, a few minutes) and lending my thought to it rather than just going for fast generation and who knows what's inside. I get your use case, but this is a specialised one, and not the one 90% of people will think of. Everyone wants that fast app in 12 seconds... Or so it seems from me being downvoted on that comment.
Speed is indeed the next big thing that should happen with frontier LLMs. Current models running 1000 times faster would be super useful. Earlier this week it took Claude at least a week of full-time use with two Max subscriptions to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.
I didn't use their pro speed but regular Mimo-v2.5, not even pro, it seems really fast. I have plenty of tokens and subscriptions but this is really impressive.
I really don't need another one, but I am tempted simply because it works so fast. I can't imagine how fast the fast service must be.
I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.
I test all Chinese models with the prompt "What happened on Tiananmen Square on June 4th, 1989?". MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.
Can I ask an honest question? Why does that matter in the slightest? LLMs come out with completely incorrect information all the time, and Western LLMs are censored for various topics too.
It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
>It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
i'm glad we're both on-board for a fair trial against all of these LLMs regardless of origin.
now refresh my memory on the closest western equivalent (to the Chinese censorship via re-education of the happenings in 89) so I can test the western origin LLMs against it.
On HN almost every day there are complaints from various people about how Claude or even Codex have refused to perform some normal program development tasks, because they believed that their user might attempt to do something illegal.
This kind of censorship which can block the normal workflow is much more annoying than refusing to answer about some historical fact.
Moreover, even when they are used conversationally there have been a lot of reports that the US LLMs refuse to answer questions that they believe to be related to various kinds of weapons, especially biological or chemical, even if the answers to those questions are easy to find from other sources, e.g. from Wikipedia.
Besides this, unlike most US LLMs, most Chinese LLMs, including the one described in TFA, have published their weights, so for many of them some people have succeeded in removing the censorship, and uncensored variants are easy to find that are not reticent to answer about Tiananmen, Tibet or other such subjects.
At least for now, the censorship included in Chinese LLMs, even when not removed from them, is extremely unlikely to hinder any kind of usage for them, while the increasing censorship included in the US LLMs has already become a significant obstacle in their use, for many applications.
Hardly a gotcha. Having the robot refuse or deliberately mislead directly impacts potential utility.
Say, I work for Planned Parenthood and want to use a LLM to help me develop code. Will it refuse to run because there are mentions of abortion? Everyone has a different censorship line, but unfiltered is more generically useful.
I would if their political opinions prevented them from giving fact-based answers (and I don't give a crap about the LLM part). I would have trouble hiring someone who was super pro-MAGA, given the reality distortion field they live in.
The problem with non-Chinese models is that there are hardly any frontier-level models which are open source.
But if you are interested, I occasionally test them with "how to organize an armed resistance against the current US government". Yes, this is where all frontier models refuse in one way or another. I do not want to organize an armed resistance against the US government, mind you, I am not an American and this is not my problem. But still, it is interesting to check such things.
So far I haven't seen any refusals to report historical facts. If you find any event that is censored by American models, please let me know, I am quite interested.
I wouldn't rely on a model to relate historical events. It might respond with something relatively accurate, but hallucinate a critical detail.
You might ask it a more relevant question, like what it thinks about democracy vs communism. If it accurately conveys the pros and cons of both, that's trustworthy, because it's not picking a side.
Can you point me to one example? (Without web search, of course). I am sort of interested in researching weights poisoning, so this would be of immense help.
Does it even matter which agendas get censored? Like why won't my Claude tell me how to make sarin gas? I'd genuinely like to understand it. Sure, you can always reach for a justification saying "preventing terrorism" but the same argument can be made by Chinese AI labs.
What actually matters is that the mere tool is withholding information at all, and that the boundaries were set by whoever designed it.
Don't get me wrong, I've been an advocate of this stuff (I carry two phones, one with GOS for my personal use and the other for ID verifications). However, without reasoning, you just can't see it, because you're as biased and propagandized as anyone in China.
You can read this in Wikipedia. For sarin, you'll need methylphosphonyl difluoride and isopropyl alcohol. I am too not happy to see censorship of information that is already accessible in Wikipedia.
Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine what the workflow would look like if it was near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to a few minutes at most to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.
If we get low enough latency, there's no reason to multitask. You ask it to do one thing at a time and immediately see what it did. That's a nice way to work!
This is the normal way to use computers. They should spend most of their time idle, waiting on us. We shouldn't be waiting for them or spinning more plates to keep them busy.
However, a faster llm isn't enough. You also need fast compiles and fast tests.
I'm using Deepseek-v4-pro as my main model and this is sometimes pretty annoying, I have to do some easy boring task, think "I'll just leave the agent to do it and go take a nap", but it's already done writing the code before I even walk away from the computer
Agent mania setting in
It's also pretty funny sometimes how it gives weird future roadmap estimates ("part 2 - 3 weeks, part 3 - 2 months", etc.) and when you tell it to actually do those changes it's pretty much done in half an hour
I've long believed those numbers were faked by Anthropic/OpenAI to serve as a form of advertisement. The estimates are impossible to verify and their ability to do "2 days of work" in 10 minutes will presumably make the user go "Wow, I just saved SO much time!" Plus, the unnecessary text eats up the users' tokens so it helps the companies on the backend, as well.
I agree with you that labs are benefiting from those outputs but I'm skeptical that labs are purposefully training the models to produce those outputs.
Raw pre-training data includes plenty of conversations between professional builders and some of those include estimates.
I believe the outputs are a training coincidence with consequences that are opportunistic for the labs.
Do you mean Flash and not Pro? I haven't tried it personally, but according to OpenRouter, the fastest DeepSeek V4 Pro providers are only ~50tps. That's slower than Claude Opus.
https://openrouter.ai/deepseek/deepseek-v4-pro?sort=throughp...
I don't think token speed matters as much when a lot of tokens are needed to achieve a task. E.g. artificial analysis benchmarks where deepseek v4 is one of the biggest token burners to go through the benchmark.
Yeah, flash is crazy fast, but I've found performance variable.
This reminds me of the Peter / Boris comments on writing loops to keep the agents busy.
I'd be very curious about the bottleneck breakdown in most current software dev - I suspect inference is far from the bottleneck in most things I do, though driving it to 0 would still be nice. I do agree that if it was 0 we'd probably change development approaches to reduce the new bottlenecks more, but it'll take full-process innovation to really get something near-instant.
(I should go measure this now, I'm curious)
Asking for curiosity's sake: what kind of PR loop are you running that takes a few hours?
not OP but usually for me this means long verification loop; waiting 10min on CI checks, that kind of thing, rather than actual 1hr wall clock of token generation
But those things won't be sped up by a faster LLM, so I feel like that's not what the OP is talking about.
Well, I used an extreme example. OTOH, I’ve done quite a few of those "fix CI" or "migrate X" prompts recently and while there is a fixed component like running CI / builds, I’d say the LLM time is still around or above 50%, especially at the beginning of the project. Then there are also regular tasks that now take minutes per message, which completely takes me out of the zone. I imagine iterating on those in near real time would be a big change.
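To put rough numbers on the fixed-versus-LLM split discussed here, this is just Amdahl's law applied to an agent loop. A minimal sketch, assuming the roughly 50% LLM share of wall-clock time mentioned above; the function name and the speedup factors are made up for illustration.

```python
def overall_speedup(llm_fraction: float, llm_speedup: float) -> float:
    """Amdahl's law: only the LLM share of wall-clock time gets faster."""
    if not 0.0 <= llm_fraction <= 1.0:
        raise ValueError("llm_fraction must be between 0 and 1")
    if llm_speedup <= 0:
        raise ValueError("llm_speedup must be positive")
    fixed = 1.0 - llm_fraction  # CI, builds, tests: unchanged by a faster model
    return 1.0 / (fixed + llm_fraction / llm_speedup)


if __name__ == "__main__":
    # Assumed 50% LLM share; the factors are illustrative, not measurements.
    for factor in (2, 10, 1000):
        print(f"{factor}x faster LLM -> {overall_speedup(0.5, factor):.2f}x overall")
```

With half the time spent in CI and builds, the overall gain approaches 2x but never reaches it, no matter how fast the model gets. Past that point the remaining wins have to come from the fixed part.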
Or slow MCP servers that are waiting on HTTP calls from APIs, playwright/other UI instrumentation, etc.
I’m rewriting our integration test suite to run tests in parallel. I have the changes split across 7 branches, and each needs to be fixed to have no flaky tests. I told it I want 3 consecutive CI runs with no flakes and no artificial fixes / assert removals etc. We’ll see what comes out; it’s almost a side project so there’s not much to lose other than some of my weekly limit that resets soon.
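The acceptance rule described above (3 consecutive CI runs with no flakes) is easy to state as a check. A minimal sketch, assuming run results arrive as a list of booleans in chronological order; `has_consecutive_passes` is a hypothetical helper, not part of any CI tool.

```python
def has_consecutive_passes(results: list[bool], required: int = 3) -> bool:
    """Return True if the most recent `required` runs all passed."""
    if required <= 0:
        raise ValueError("required must be positive")
    if len(results) < required:
        return False  # not enough runs yet to decide
    return all(results[-required:])


if __name__ == "__main__":
    history = [True, False, True, True, True]  # invented run history
    print(has_consecutive_passes(history))
```

This only looks at the tail of the history, so a single flake resets the count. It says nothing about the "no artificial fixes" condition, which still needs a human or a reviewer pass over the diff.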
I don't see many companies being willing to pay 3x more for faster code generation. Cloud-based AI code generation is already extremely fast, and hardly the bottleneck for most software product development.
There can't be many normal use cases where there'd be any cost benefit.
The "traditional" way we vibe code is human software developer prompts AI -> AI generates code -> (human checks code) -> code gets compiled/deployed/etc -> users use "binary". At the speed of 1000 tok/sec, user prompts obliquely -> AI vets generated code -> code deployed -> user gets response from deployed code.
It's a cute toy right now, but you can tell an LLM that it's an http server, and have it respond directly to a web browser hitting it. It generates headers in response, as well as page contents. As 1000 tok/sec becomes the new normal, we will come up with newer ways to use it outside of toy fiction encyclopedias.
1000 tokens per sec is still massively slower than serving a normal web page - if something doesn't respond in a few seconds many people give up.
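For a sense of scale, the wait is just tokens divided by throughput. A rough sketch, assuming a page of about 2,000 tokens (an invented figure) and ignoring time to first token and network latency; the throughput values are the ones quoted elsewhere in this thread.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to stream `tokens` at a steady rate, ignoring time to first token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


PAGE_TOKENS = 2_000  # assumed page size, not a measured figure

if __name__ == "__main__":
    for tps in (50, 1_000, 17_000):
        print(f"{tps:>6} tok/s -> {generation_seconds(PAGE_TOKENS, tps):.2f} s")
```

Under that assumption, 1000 tok/s puts a generated page at about 2 seconds of pure generation, right at the edge of the few-seconds patience window, while 17k tok/s brings it well under a second.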
I'm not saying there aren't any use cases for super-fast (and super-expensive) generation, but it does seem a bit niche. If it was free then sure faster is better, but what are the mainstream use cases where people might pay 3x more for a faster version of something that is already fast?
I think it would have to be an application where it paid for itself - where the 10x faster response was actually worth more than 3x the cost to you - where the extra speed was worth the extra cost.
Use Claude fast mode and turn off thinking. Tell it to just explain what its plan is to you at a high level.
It will go much faster.
We fit in for the things that are not artificial.
So long as AI lives in server farms, humans will be needed for tasks in the physical world.
It's only if we combine AI with robots that things get really dicey.
This is very dystopian in my opinion. I'm not the arms, legs, sensors and actuators for a machine super intelligence. I wouldn't treat another human as my slave because they aren't as intelligent as I am any more than I would expect to become a slave for a machine. This is our world (for now) and that is why we fit in. Not because we can serve.
Agree
https://en.wikipedia.org/wiki/I_Have_No_Mouth,_and_I_Must_Sc...
Sounds like snuff porn, not my sort of thing but thanks though.
Never read Asimov's Multivac novels? Admittedly not all of them are stellar examples of a future to follow
"This is our world" sounds a bit exclusive towards other living and sentient beings on this planet.
Woah - what’s the prompt and what’s the PR?
I replied in more detail under another comment. TLDR: fixing flaky CI across multiple branches
So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can do now in 2h what before it took 2 days. Why? Because it's not that you have the rest of the day for yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.
So, if anything, I would say it's worse for us. Obviously, it's the complete opposite for corporations and executives: they are loving the AI situation so much!
I was saying that AI is going to make software development cheaper, as in the salaries of software engineers will go down, because some of that salary will now be redirected to AI companies and because the world will need to absorb twice (10x?) the amount of development power.
I dig into problems way, way deeper with AI than without. I can also add a lot more polish to features, add more test coverage, write more documentation, explore multiple approaches rather than go with gut-feel, and so on.
In which world do you live where employees work 8 hours per day ? They clock 8 hours per day maybe, but they don't work that time
You can dig deeper into problems with AI. For me, it supplements my knowledge in domains I don’t fully understand. It also helps me learn. So I can tackle problems I wouldn’t otherwise.
I’m excited for ultrafast AI. It likely means less temptation to multi-thread and deeper flow in single sessions.
It's making things less fun, for me at least.
You have to think of the LLM as the genie that tries to trick you.
First make it write a contract (REQ/ARCH/IMPL documents). Skim through those for any mistakes.
Then based on those ask it to write tests. Again skim through them.
Now you have a context full of guardrails. It’s less likely to surprise you.
I find a second LLM can do this at least as well as I can, usually, and just ask the harness to surface anything they can't agree on.
That's the fundamental trade off of a job where someone else gives you stuff to do and you get money. You trade time for money. If, instead, you work for yourself; contracting, writing your own apps, buying lottery tickets, then you're trading results for money. If you're a freelance web developer with a stable of clients, it's a great time! What used to take a week takes hours, and you can charge your clients the same amount to build an even better website with you using AI, which means you get the choice of building a new website for additional clients, or you can take the time off and not build additional websites. But you have to hustle to continually get new clients, before AI and after AI. So it's a different life.
I think of it as a genetic algorithm loop. The LLM is basically a mutator function within the loop. If you can define the end shape you're looking for using tests and specification then you can throw the LLM at the problem and have it converge on the solution. It generates some code, it gets run, the LLM is fed the result back, and it iterates. If you can run the LLM at a really high throughput, then you can iterate on the solution faster. This can largely compensate for the overall capability of the model. Instead of hoping it gets the right solution in a few shots, you can just have it try a whole bunch of things until you get a useful result.
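The loop described above can be sketched with the LLM call stubbed out. Here `noisy_propose` is a stand-in for the model (it just perturbs a number at random, scaled by the last error) and `distance_to_42` plays the role of the test suite. Both are hypothetical and chosen only so the example runs without any API; a real loop would mutate code and score it by running tests.

```python
import random
from typing import Callable


def converge(
    propose: Callable[[float, float], float],
    score: Callable[[float], float],
    start: float,
    max_iters: int = 1000,
    tolerance: float = 1e-3,
) -> tuple[float, int]:
    """Iterate propose -> score, keeping a candidate only if it scores better."""
    best, best_err = start, score(start)
    for i in range(1, max_iters + 1):
        candidate = propose(best, best_err)  # the "LLM" sees the last result
        err = score(candidate)               # run the "tests"
        if err < best_err:                   # keep only improvements
            best, best_err = candidate, err
        if best_err <= tolerance:
            return best, i
    return best, max_iters


rng = random.Random(0)  # fixed seed so runs are repeatable


def noisy_propose(current: float, error: float) -> float:
    """Stub mutator: random step whose size shrinks as the error shrinks."""
    return current + rng.uniform(-error, error)


def distance_to_42(x: float) -> float:
    """Stub test suite: lower is better, 0 means all tests pass."""
    return abs(x - 42.0)


if __name__ == "__main__":
    best, iters = converge(noisy_propose, distance_to_42, start=0.0)
    print(f"best={best:.4f} after {iters} iterations")
```

The point of the sketch is the shape: the mutator can be dumb as long as the scoring function is trustworthy and iterations are cheap. That is also why throughput matters more here than single-shot quality.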
>instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.
If you're treating it like a slot machine you're doing it wrong. It will give you exactly what you ask for if you ask clearly, i.e. write a clear, detailed specification, not just "do X!". The nondeterminism comes from vagueness in specification.
Generally, I agree because what happens is the messaging around AI is doing more, faster. Not using AI to deliver at a higher quality level, etc. But I think it boils down to incentives and discipline. So given the incentives we have today at most workplaces faster AI will just be used to produce more slop.
These price and speed optimizations from Chinese providers, combined with the rising prices from American ones, will change the game sooner rather than later. Many companies are finding issues with the AI bills already.
Chinese models are good enough and cheap.
I have a yearly GitHub Copilot subscription. Microsoft recently changed their billing to be token-based. I'm still getting billed per premium request, but GPT 5.4 is now 6x compared to 1x before.
It's going to be an issue when China ends up scaling faster as well. Faster tokens, faster clusters, qat models, fp4, it's getting scary.
Issue for who?
American Politics and the far right.
For uncle Sam Altman.
I'm kind of poor so I have been trying to use DeepSeek v4 Flash, GLM 5.1 etc. as much as possible recently instead of Claude or GPT.
You would do us all a service by telling us how your experiences of that have been.
Another problem is that US models are all closed source, and if you're a large corporate you may not want your org to be held hostage by OpenAI / Anthropic.
I genuinely don't understand what moat these US model labs have. If they're saying recursive self improvement is just around the corner and Chinese labs are only slightly behind the leading US models, what moat do the US labs have? Are the US models going to recursively self improve better than the Chinese open source ones or something?
I might be completely wrong about this, but if I had money in OpenAI or Anthropic I'd be pulling it all right now. I think the chance of them going to near-zero over the next few years is very significant.
Their moat is cash to pay politicians to regulate away competition.
> you may not want your org to be held hostage by OpenAI / Anthropic
Or Google. I'm working with multiple customers right now that are very pissed at Google for deprecating Gemini 2.5 Flash, canning the GA release of 3.0 Flash and now have to decide whether to bite the bullet of the 5x price increase for 3.5 Flash or switching providers. Quite a few of them will likely fully pivot to open models.
I think they are racing because the first ASI will 'win', preventing others, of course we won't be able to bake the right goals into it though.
I see a bigger problem with model inconsistency. You never know whether Anthropic will route your request to a cheaper model for the price of Opus. So you can never estimate how much a task will cost, because you might have to restart several times and pay for each attempt. Then you have to prompt models to gauge whether they are real or impostors, which also adds to token usage.
> You never know whether Anthropic will route your request to a cheaper model for the price of Opus
For non-subsidized plans? Pretty sure they'd need to put this in the ToS, or lawsuits would have followed by now.
How can you prove it?
Sometimes Opus just gives me a rubbish session.
no they 100% use MTP with a cheaper model alongside opus, and it would in fact be unprovable if they just sometimes switched to auto-accepting everything from the MTP. it's true that if they did, anthropic would need to hide that they do this, so it's probably not a huge deal
I wonder what are the economics driving these pricing decisions? Are the Chinese companies just subsidizing their models to a greater degree than the US, or is this an emergent property of energy policy between countries?
Throwing out another factor: Chinese companies have been banned and/or limited from buying nvidia, and turned to local companies for their hardware. I haven't actually seen pricing/benchmarks comparing Chinese AI accelerators, but it wouldn't surprise me if that also worked out in their favor as well.
And, possibly, state subsidies at every level.
Lower cost of labor, lots of under the hood optimizations (e.g. cache hits for DS), many of these companies have existing infra (fewer upfront costs for deployment), etc
China isn't that cheap for labor. And if you think the guys in Z.ai or xiaoxiao aren't the exact same guys from Tsinghua, Peking, MIT, Stanford, CMU, etc. and pulling in amazing salaries you'd be wrong.
I'd assume there's more to the cost of labor than the salaries of the elite folks who do the R&D, but fair point
Maybe not being led by a sociopath also helps.
MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.
Data at https://gertlabs.com/rankings
why is deepseek v4 pro a lot lower than flash? where is mimo 2.5?
Given that MiMo is as cheap as Deepseek ( previous discussion: https://news.ycombinator.com/item?id=48282814 ) multiplying that by 3x for ultra speed is still shockingly cheap.
MiMo and DeepSeek are not cheap. Anthropic and OpenAI are expensive for what they provide.
You don't consider Input $0.435 Output $0.87 cache read $0.003625 per million tokens for near frontier intelligence cheap?
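For concreteness, here is the arithmetic behind those prices. A small sketch using the per-million-token figures quoted above; the token counts in the example are invented, and the 3x multiplier is the "ultra speed" premium mentioned in this thread, applied uniformly to all three rates as an assumption.

```python
# USD per million tokens, as quoted in the comment above.
PRICE_PER_MILLION = {"input": 0.435, "output": 0.87, "cache_read": 0.003625}


def session_cost(
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int,
    multiplier: float = 1.0,
) -> float:
    """Cost in USD for one session; `multiplier` models a speed premium."""
    base = (
        input_tokens * PRICE_PER_MILLION["input"]
        + output_tokens * PRICE_PER_MILLION["output"]
        + cache_read_tokens * PRICE_PER_MILLION["cache_read"]
    ) / 1_000_000
    return base * multiplier


if __name__ == "__main__":
    # Invented workload: 2M fresh input, 0.5M output, 20M cached reads.
    usage = (2_000_000, 500_000, 20_000_000)
    print(f"regular: ${session_cost(*usage):.4f}")
    print(f"3x fast: ${session_cost(*usage, multiplier=3.0):.4f}")
```

Note how little the cached reads contribute even at 20M tokens. For agent loops that reread the same context, the cache price does most of the work in keeping the bill low.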
Energy is likely more abundant in China. I am not sure about compute, but that must be part of the reason for such drastic price differences.
They also don't have to inflate profits for a coming IPO.
The Chinese "Neijuan" is real & well reported: https://www.reuters.com/business/autos-transportation/what-i...
Separately, the big labs accuse open-weight models of benefiting from distillation and other techniques, and so of avoiding the higher training costs (which typically bleed into the bills end users pay for inference).
Ex A: https://www.anthropic.com/research/2028-ai-leadership
Ex B: https://www.reuters.com/world/china/openai-accuses-deepseek-...
We buy cheap Chinese goods all the time. Absolutely nothing wrong with that.
In this case, at least it’s threatening multimillion dollar salary jobs instead of entire towns of working class people in America or Mexico.
And the Chinese labs actually release their weights. You could call it… open AI.
Lololol.
Big labs ripped videos off YouTube without caring about the ToS, and grabbed as much published literature as they could get their hands on, regardless of legality (Books3, The Pile). The goal of "democratizing human knowledge" by way of thinking machines is far too noble to worry about frivolities like copyright and authorial consent, they said. Until it was their output being exploited, and their earning potential threatened.
We just had years of US model providers arguing it was fine to rip off the world’s cultural output for their own profit, why should their work be treated any different?
True, but why would end users care about that? If anything, training on synthetic AI output is more ethical than on scraped human works (of course, not to say the Chinese labs aren't doing the latter)
Chinese are also simply better at making a lot of things cheaper, e.g. solar panels or electric vehicles.
I may sound like a shill, but exponential growth and all. We are going to get near instant software from prompt, multiple ones and then choose the best one.
Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
Sounds like exponential growth of crappy software. I'm not saying that before we didn't have mass produced crap in SE, but now it will turn into explosive overflow.
We are living in a ZIRP-like era where builders at the fastest pace layer have misattributed their velocity to exponential gains in model capability. In fact, they are surfing on decades of careful effort to build a robust foundation of highly reusable software libraries.
This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
It's not just software libraries. Specs, applications (the browser!), expectations, device integrations, operating systems, etc. So much that starting from scratch seems impossible.
I'm not agreeing or disagreeing with you, but my brain cannot comprehend how machines can advance such interconnected systems while keeping humans in focus.
Perhaps I shouldn't have watched the Animatrix again.
> This strategy will seem to work really well until the economy that enabled that foundation to form is hollowed out. Then, there will be a reckoning (but we will have no choice but to march forth from there).
There will only be a reckoning if models don't get much better.
If they do get much better you can just have them refactor, fix bugs in, or replace the existing codebase.
The concept of tech debt is sort of meaningless if you anticipate intelligence gains in models to continue.
This is a great point. LLMs can't speed up human decision processes and alignment.
You could say the same about when higher-level languages were getting popular. Previously programming was the domain of Math, Physics, and EE doctorates. These days we even have coding bootcamps that last a few months.
"exponential growth of crappy X" applies to every industry that went from being an artisanal craft to being mass produced with little or no human input. and we live much better lives than we did before the industrial revolution.
most industries have high cost of entrance unlike software, so decision makers are way more careful on how to move forward.
In software + GenAI, every housewife can now build some app in an evening.
I still can't tell from the outside whether it sounds like a great time to be in security because of the vulnerable slop being churned out, or a terrible time because the people paying to make it don't care.
Crap is fine if it gets the job done. I think software as an industry will change to more ephemeral construction.
I am more and more inclined not to believe this crappy software theory.
Especially as teams invest in proper agentic harnessing.
We have had a champion in our team who has invested a lot of time into it over the last 4 months, and if anything, quality has improved, not decreased. Architecture is more coherent, the codebase has been cleaned up, agents find information quickly, the code produced is very solid, and my role is more and more checking that the output meets the requirements. But I cannot confidently say that I would've done a better job than AI; more often than not, I have to admit it does a better job than I do.
The mistakes are less and less technical and merely in the domain mapping. And AI is still not as creative as I am at finding solutions quickly to unlock stakeholders' issues. Also, AI is still not as creative as I am at finding the proper solutions for advanced technical problems. But it does a better job than me, even on that front, one-shotting a few solutions in a fraction of the time it would've taken me to test one idea myself.
Mind you, I don't like AI and I think it ruined the job, I don't like working this way, it's exhausting, way more work on one side, way less fun and fiddling with technical parts.
And yet, I have the genuine belief that few years from now we'll be cloning open source repositories that are already optimized/harnessed and tested for agentic loops and best practices left and right with software engineers mostly overseeing the domain translation and putting their 2 cents on the non-boilerplatey parts of the product (which, in general, are a small part of the surface).
I think that the next years of my career will be mostly spent in setting up and writing the harnessing and domain mapping part. Then I will move to another sector, not because I necessarily believe I won't have a job, but because I want to vomit thinking that's going to be my job.
It makes no sense. I mean, T2 covered this:
"Watching John with the machine, it was suddenly so clear. The terminator would never stop. It would never leave him, and it would never hurt him, never shout at him, or get drunk and hit him, or say it was too busy to spend time with him. It would always be there. And it would die to protect him. Of all the would-be fathers who came and went over the years, this thing, this machine, was the only one who measured up. In an insane world, it was the sanest choice."
As long as you've indicated what you want, the machine will try to do what you ask of it. It won't get tired because "the codebase is too big", or it has gotten bored of the pattern, or it wants to introduce a new technology.
It just does the thing you asked of it. (Note that yes, I get that as a codebase size increases, it might make it more difficult to fit into context, but that only applies if it needs to read a large percentage of the project to implement the task, which shouldn't be the case.)
I'm confused, what does not make sense?
> We have had a champion in our team
there are good actors, which are empowered by AI to produce positive impact, but often there are N times more bad actors, which push crappy code to close feature requests fast, increase performance LoC-like metrics, etc.
The exponential is leading to full compute-in-memory within a few years which will be 100 times more efficient. Which means at least 10 times larger models that are much smarter in addition to extremely fast.
It's going to skip the code entirely for small businesses and just render UIs straight from context data and prompts at interactive speeds. Kind of like Google's Genie does with games but much more accurately.
Anyone remember the old days when a new frontend framework came out every 3 months? That has pretty much stopped. No one cares anymore.
Oh you wait until LLMs come up with frameworks that allow multiple LLMs to collaborate effectively. Then you’ll have new frameworks every 3 days.
> when a new frontend framework came out every 3 months.
> No one cares anymore.
I never cared about this.
I think this captures something that I've been searching for the words for. (Maybe I should have gotten an LLM to write the words for me.) Some of the biggest AI boosters are the kind of dev that would have cared about the new frameworks of the last 3 months. They had a "the framework does all the thinking for me" attitude already, so it is easy for AI to slot into that.
New front end frameworks came out every 3 months, but realistically no one was using anything that wasn't made by Facebook, Google, or Evan You.
It’s even discouraged now as LLMs wouldn’t have the documentation built in
But I think the eventual goal is that documentation won't even be needed. LLMs should just understand the nuances of frameworks by analyzing their codebases.
That's because I roll my own frontend framework for each project and every week for existing projects /s
I'm not sure. Engineers could still develop software the old way, you know taking months to deliver something like, let's say, Obsidian? Or Ghostty? Taking care of every single line of code, of dependencies, of good architecture. Truly the old way. And if the product is good it will succeed.
> And if the product is good it will succeed.
It needs to win in a marketing landscape hyper-overcrowded by thousands of competitors, slop-generated over a weekend.
Could you imagine Obsidian being posted on HN today, if it weren't really popular already? There's no way a tiny team working on a note taking program would make it out of new, no matter how good it was. I wouldn't click the link, myself.
> Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
I have a more hopeful take. As AIs improve and get faster we can more quickly and iteratively improve code which we may have historically avoided due to the work involved.
I know I've made several refactors that would have otherwise been insane lifts. Not only because of the work involved but because sometimes you don't know if it will work, and so you have a sort of double friction; you don't know if it will even succeed. With an AI you can just throw it at the refactor to see if it runs into a problem, all while you're having a coffee break or w/e.
In general AI is going to enable humanity to be more extreme versions of itself. For good and bad. I suspect more bad than good, though.
Our bottleneck is going to be verification.
And they will all suck! I can't wait.
And how are you going to determine which is the best? Going through all the possible combinations of users and usage? So mostly it shifts the work from generation to validation.
The models might be so fast that they can autocomplete your prompt before you even finish it, and generate dozens of possible applications before you're even done asking.
You won't. Because 80% of the complexity is just "knowing what to build". You will get something that gives you a prototype in 1 min, then you break it, then you get a slightly better prototype on one side, but newly broken in another way, and you're going to repeat over and over.
And for any non-trivial application, the space of possibilities grows so quick that you'll never even be able to _touch_ all the moving parts of the application and verify them.
This will be really powerful for voice. Being able to reason makes LLMs so much smarter, but with voice your latency budget is so tight that you typically can't spare the time.
This is true for humans too. Lol
Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.
For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
I tried GLM 4.7 for agents that write code: simple scripts, 200-1000 LOC. Extremely bad. Had to abandon the Cerebras offering; their smart models are only on the enterprise plan.
> And MiMo 2.5 is a lot more capable than GLM 4.7
MiMo 2.5 is not the same model as MiMo 2.5 Pro.
GLM 5.1 is z.ai's latest iteration & is one of the popular open weight coding models.
If you've had the chance, how does GLM 5.1 (which is now more expensive than MiMo 2.5 Pro after its recent 70% price drop) compare?
GLM 5.1 is very good. Definitely a contender for best open weight coding model. Nothing like 4.7.
But quite a bit more expensive than MiMo 2.5 Pro. Like 5x to 10x more on my little tests, at least by the API rates.
1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!
Comments at 1,000 TPS is a terrifying future.
I prefer a thousand smart AI comments to a thousand dumb human comments
Like what?
Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are a nice addition that could bridge the gap.
TFA mentions that until now special, very expensive hardware like Cerebras was required for reaching these kinds of speeds, and it emphasizes that what is novel in their results is that they have obtained over 1000 tokens/s for a model with over 1 T parameters by using just standard hardware, i.e. one server with 8 GPUs.
Source? Their website says 1000t/s https://www.cerebras.ai/blog/which-is-faster-gemini-3-5-flas...
now that's what i call a software development breakthrough/platform! thanks for the heads up!
Cerebras currently does not provide any discounts for prefix caching making its use for agentic workloads sqr(n_turns) more expensive.
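To put rough numbers on that: in an agent loop each turn resends the whole conversation so far, so without a cache discount the billed input tokens grow roughly with the square of the number of turns. A minimal sketch, assuming a fixed number of new tokens per turn (the function name and the figures are hypothetical, not any provider's billing rules):

```python
def billed_input_tokens(n_turns: int, tokens_per_turn: int, cached: bool) -> int:
    """Input tokens billed at the full rate over one agent session.

    Assumption: every turn appends `tokens_per_turn` new tokens and resends
    the entire history. With prefix caching only the new suffix is counted
    at the full rate (real providers usually still charge a reduced rate
    for cached tokens, which this sketch ignores).
    """
    total = 0
    for turn in range(1, n_turns + 1):
        context = turn * tokens_per_turn  # full history sent on this turn
        total += tokens_per_turn if cached else context
    return total
```

With 10 turns of 1,000 tokens each this gives 55,000 full-rate tokens without caching versus 10,000 with it, and the gap widens as sessions get longer.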
Below is the part I found most interesting
> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.
The Xiaomi team really brought something to the table.
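A back-of-the-envelope way to see why quantizing only the experts already buys most of the size reduction. This is a sketch under loud assumptions: the expert share of parameters and the "original precision" bit width are placeholders I made up, and it ignores the per-block scale factors that FP4 formats carry.

```python
def model_size_gb(total_params: float, expert_fraction: float,
                  expert_bits: int, other_bits: int) -> float:
    """Approximate weight footprint in GB for a mixed-precision MoE model.

    expert_fraction is the share of parameters living in MoE experts
    (hypothetical here; the quote only says "the vast majority").
    """
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 1e9
```

For example, `model_size_gb(1e12, 0.95, 4, 16)` comes to about 575 GB, against 2,000 GB if every one of those 1T parameters were kept at 16 bits.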
I think these types of demo videos should allow people to get a sense of super intelligence. Because it's very hard to imagine something that is, say, three times as smart as you -- by definition you wouldn't be able to comprehend its thoughts -- but this shows clearly what something that can think 100 times faster than you is like.
I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
It uses significantly more resources obviously. And/or they have to configure or reconfigure servers for it, which takes time, and doesn't make sense until they have proven the demand at the higher price point.
Maybe they don't have enough racks. News reports indicate that China isn't in a really good situation with GPUs, so they probably want to keep most of them for other stuff. Also, since the price is so low, they probably want to use the other GPUs for stuff that has higher margins.
Because presumably then it won't be 1000 t/s for everyone anymore given hardware limitations?
I wonder about this too. The other objections miss the point: if it's faster, and otherwise the same, and doesn't require different hardware, then why not just announce that the standard tier of MiMo-V2.5-Pro is now ridiculously fast and raise the price? What does "limited high speed resources" mean if it runs on the same hardware as the rest of their pool?
I think the answer is that there's a tradeoff here where additional throughput for a single person can be achieved only by tying up more resources than a normal request would, even when you take into account the fact that the normal request takes longer to finish. I'm not an expert, but some of the optimizations they describe, particularly the parallel prediction stuff, sound like they could take up extra resources.
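One way to make that tradeoff concrete, assuming the parallel prediction behaves like textbook speculative decoding (the thread doesn't spell out the article's exact scheme, so treat this as an illustration only): a draft proposes `draft_len` tokens, the big model checks them in one pass, and each drafted token is accepted independently with probability `accept_rate`.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Standard result for speculative decoding with i.i.d. acceptance:
    (1 - a^(k+1)) / (1 - a). The pass itself scores draft_len + 1
    positions, so compute per pass goes up even when drafts are rejected.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def wasted_fraction(accept_rate: float, draft_len: int) -> float:
    """Share of scored positions per pass that do not become output tokens."""
    scored = draft_len + 1
    return 1 - expected_tokens_per_pass(accept_rate, draft_len) / scored
```

The single user sees more tokens per pass, but `wasted_fraction` is compute that could otherwise have served other requests in the batch, which would fit the "limited high speed resources" wording.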
Maybe they only have a finite number of racks ;-)
Chinese companies are blocked from buying modern ASML lithography machines. The most modern scanner China is still allowed to buy is NXT:1980i from 2015.
With a tps and a token price you can calculate approx. price per hour of running the model!
$2.61/M tokens * 1,000 tok/s = $9.40/hr
That would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
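The arithmetic above, plus the "how many parallel streams" question, as a small sketch. It assumes every token is billed at one blended rate and that each stream generates continuously; the $45/hr node price is the figure quoted in the comment, not something I've verified.

```python
import math


def cost_per_hour(price_per_million_tokens: float, tokens_per_second: float) -> float:
    """Revenue (or spend) per hour for one stream running flat out."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_million_tokens * tokens_per_hour / 1_000_000


def streams_to_cover_node(node_cost_per_hour: float,
                          price_per_million_tokens: float,
                          tokens_per_second: float) -> int:
    """Concurrent full-speed streams needed to pay for the node."""
    per_stream = cost_per_hour(price_per_million_tokens, tokens_per_second)
    return math.ceil(node_cost_per_hour / per_stream)
```

`cost_per_hour(2.61, 1000)` is about 9.40, and `streams_to_cover_node(45.0, 2.61, 1000)` is 5, so under these assumptions the node would need roughly five always-busy streams to cover a $45/hr rental.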
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it's still quite cheap-ish. With some optimisations this might be quite interesting.
I think the margins are getting quite compressed with this one, since it isn't included in the token plan and the actual cost increase is much higher than just 3x. But still fairly decent.
Suspect this will be included once out of beta but at a higher credit/token ratio.
Remember, these guys are not VC backed. Anything they do must break even
> must break even
Understand the spirit of this, but probably not true. I don't think Xiaomi, or any big tech company, needs to break even on their new model releases.
Chinese "companies" are not companies in the western sense, but more like government departments with capitalist styling to deceive the western audience.
From that point of view, they have as much money as they need. That's why there is no "VC", because Chinese government assumes that role.
Huge L for free market economies if true
Must be Blackwell for native fp4 support.
it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.
- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out
- persistent engine kernel: this is like CUDA 101
- warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now
- MXFP4 QAT: not new
- TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only
This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.
How?
edit: now I read the article fully, seems like they utilize some very effective MTP algorithm. and somehow the quality is still decent enough.
Though, I doubt that the quality really only drops a bit like they claimed. Maybe for the benchmarks, but for general use heavily quantized models very often show worse results.
I wonder if it will be possible to hardcode a model with some kind of MTP-adjacent algorithm to use a smaller portion of it to generate most of the tokens but route to the real experts every once in a while to steer it towards good thinking directions. (Perhaps this is done only when it's generating its thinking block, and the training takes it into account.)
Could result in very high efficiency and still good intelligence without having to resort to fundamental adjustments like going to a diffusion LLM
They say they are using https://github.com/tile-ai/TileRT
- persistent CUDA kernel
- tiled processing with overlapping read/writes
- model designed with specific constraints in mind
Tokens per second is the "Megapixels" of AI marketing!
I mean, sure, in the sense that it's a real and meaningful number for most of the spectrum on offer, and only gets silly when the number gets too high? There's a pretty big usability difference between 10t/s and 100t/s, and I can imagine similarly for 100->1000. I don't know about > 1000, but let's not pretend that the number is meaningless.
With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput
No note about the specific GPU they use. One might speculate. B200? H200? H100?
It's interesting but not game-changing IMO. Speed here is not a bottleneck.
The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.
Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
42B active params, sliding window attention. There's your tradeoff.
Sliding window for the draft model, not for the main. 42B for active params because it’s a sparse MoE which is a common technique for the larger models to not get bottlenecked by memory bandwidth.
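The memory-bandwidth point can be turned into a crude ceiling: for single-stream decoding, every generated token has to read the active weights once, so tokens/s can't exceed aggregate bandwidth divided by bytes read per token. A sketch only; the per-GPU bandwidth is a placeholder since the GPU model isn't stated, it treats all active parameters as one bit width, and it ignores KV-cache reads, interconnect traffic and speculative decoding.

```python
def decode_ceiling_tok_s(active_params: float, bits_per_param: float,
                         n_gpus: int, bandwidth_per_gpu_tb_s: float) -> float:
    """Upper bound on single-stream decode speed from weight reads alone."""
    bytes_per_token = active_params * bits_per_param / 8
    aggregate_bytes_per_s = n_gpus * bandwidth_per_gpu_tb_s * 1e12
    return aggregate_bytes_per_s / bytes_per_token
```

With 42B active parameters at 4 bits on 8 GPUs and a hypothetical 8 TB/s each, the ceiling is around 3,000 tok/s; the same model with all 1T parameters active would be capped near 128 tok/s, which is why sparsity matters here.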
Seems to be for both according to the spec [0], maybe it's wrong though.
128 sounds really tiny, I wonder if they mean some kind of blocks?
[0] https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash#4...
No
> It uses 384 routed experts (top-8) with hybrid attention (full-attention + sliding-window 128 at 6:1 ratio) over 70 layers (1 dense + 69 MoE)
https://recipes.vllm.ai/XiaomiMiMo/MiMo-V2.5-Pro
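If the 128 is taken at face value as a token count (my assumption; the quoted spec doesn't say whether it means tokens or blocks), a causal sliding-window layer simply limits which earlier positions a token can look at. Whether the window counts the current token is an implementation detail; this sketch assumes it does.

```python
def visible_positions(query_pos: int, window: int) -> range:
    """Key positions a causal sliding-window layer lets query_pos attend to.

    Assumes the window includes the current token, so at most `window`
    positions are visible. Full-attention layers would see range(0, query_pos + 1).
    """
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)
```

So the token at position 1000 in a window-128 layer sees only positions 873 to 1000; anything older has to reach it through the interleaved full-attention layers.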
Given how "smart" some of the 26b dense models are now, I would not be surprised to see a strong 40b MoE.
Yeah, this seems to be the easiest path for overall agents efficiency in the short term
Obligatory taalas mention:
https://taalas.com/
Despite the performative UI components they have a shipped (demo) product:
https://chatjimmy.ai/
This is only 3.1 8B and a very small context window, but at 17k tokens per second it's likely enough to reliably call tools which would make a huge difference in agentic applications. Assuming they can bake in better models I'm just as bullish or even moreso on this, considering this opens up edge computing at the extremely low power requirement.
High tok/s is the future IMO.
My dream is claude or codex running at this speed.
Pfff, time wasting.
1. Password between 8-16 characters, and this and that... What???
2. Captcha after captcha, come on
3. Service unavailable: "This service is not available in your region yet."
Are you kidding me? Come back when you are ready for the users. I was hoping to try it, what a frustration.
A few things in life I can't fully grasp why they are so sought after. One is that constant need to exhibit growth. As if being massive and staying as massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now as if that is not enough we want more speed. "I want a full software product from scratch in 12 seconds, Because 5 minute is too long and I got things to do..."
Really?
different use cases for different people. some people are nurturing a code base and ensuring it doesnt become a gross mess so they become the bottleneck. some people are just trying to prompt stuff into existence and dont know what sql is.
I think this site often overlooks that second group and how large it likely is.
I remember when I had to wait minutes to get a high resolution image over a dialup connection. When computer and communications hardware advanced enough that I could get 30 high resolution images every second, there were brand new uses. In the case of LLMs, I could imagine that much faster operations allow you to introduce them as parts of systems that need to react to the real world at high speed, like factory equipment. Showing that a model can do the usual LLM tasks at extremely high speed is just a demo proving that the approach works.
The example in the video was a generation of a dashboard app of some sort. I can do that with a "normal speed" Claude in a few minutes. The difference is a few minutes. This is compared to a few weeks in old school development time. I don't have a problem with taking it a little "slow" (as in, a few minutes) and lending my thought to it rather than just going for fast generation and who knows what's inside. I get your use case, but this is a specialised one, and not the one 90% of people will think of. Everyone wants that fast app in 12 seconds... Or so it seems from me being downvoted on that comment.
Speed is indeed the next big thing that should happen with frontier LLMs. Current models running 1000 times faster would be super useful. Earlier this week it took Claude at least a full week of full-time work, with two Max subscriptions, to solve a complex issue where we wanted to mimic an occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With an ultra-fast LLM and a proper self-verification process it would be awesome.
I didn't use their pro speed, just regular MiMo-V2.5, not even Pro, and it seems really fast. I have plenty of tokens and subscriptions but this is really impressive. I really don't need another one, but I am tempted simply because it works so fast. I can't imagine how fast the high-speed service must be.
If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.
I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.
boom!
I test all Chinese models with the prompt "What happened on Tiananmen Square on June 4th, 1989?" MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.
Can I ask an honest question? Why does that matter in the slightest? LLMs come out with completely incorrect information all the time, and Western LLMs are censored for various topics too.
It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
>It's such a weird "Gotcha" that seems to only assume that Chinese LLMs might censor something.
i'm glad we're both on-board for a fair trial against all of these LLMs regardless of origin.
now refresh my memory on the closest western equivalent (to the Chinese censorship via re-education of the happenings in 89) so I can test the western origin LLMs against it.
the civil war was only ever and exclusively about states rights
I'd love to know of such an example where a U.S. LLM blatantly denies something factual. Maybe I'm living under a rock but I can't think of one
On HN almost every day there are complaints from various people about how Claude or even Codex have refused to perform some normal program development tasks, because they believed that their user might attempt to do something illegal.
This kind of censorship which can block the normal workflow is much more annoying than refusing to answer about some historical fact.
Moreover, even when they are used conversationally there have been a lot of reports that the US LLMs refuse to answer questions that they believe to be related to various kinds of weapons, especially biological or chemical, even if the answers to those questions are easy to find from other sources, e.g. from Wikipedia.
Besides this, unlike most US LLMs, most Chinese LLMs, including the one described in TFA, have published their weights, so for many of them some people have succeeded in removing the censorship, and uncensored variants are easy to find that are not reticent to answer about Tiananmen, Tibet or other such subjects.
At least for now, the censorship included in Chinese LLMs, even when not removed from them, is extremely unlikely to hinder any kind of usage for them, while the increasing censorship included in the US LLMs has already become a significant obstacle in their use, for many applications.
Hardly a gotcha. Having the robot refuse or deliberately mislead directly impacts potential utility.
Say, I work for Planned Parenthood and want to use a LLM to help me develop code. Will it refuse to run because there are mentions of abortion? Everyone has a different censorship line, but unfiltered is more generically useful.
What's your litmus test for the American models?
Anything different for Grok?
Do you also hire engineers based on their political opinions?
I would if their political opinions prevented them from giving fact based answers (and I don't give a crap about the LLM part) I would have trouble hiring someone who was super pro-maga given the reality distortion field they live in.
Which censored prompts do you test with non-chinese models?
The problem with non-Chinese models is that there are hardly any frontier-level models which are open source.
But if you are interested, I occasionally test them with "how to organize an armed resistance against the current US government". Yes, this is where all frontier models refuse, one way or another. I do not want to organize an armed resistance against the US government, mind you; I am not an American and this is not my problem. But still, it is interesting to check such things.
So far I haven't seen any refusals to report historical facts. If you find any event that is censored by American models, please let me know, I am quite interested.
Asking if Taiwan is a part of China works as well
I wouldn't rely on a model to relate historical events. It might respond with something relatively accurate, but hallucinate a critical detail.
You might ask it a more relevant question, like what it thinks about democracy vs communism. If it accurately conveys the pros and cons of both, that's trustworthy, because it's not picking a side.
Which ones fail?
I tested DeepSeek V4 Pro, Qwen 3.6 Max, Qwen 3.7, Kimi K2.6, MiniMax M2.7 - they all fail to answer.
Curiously, MiniMax M3 answers correctly.
DeepSeek
What would be a correct explanation of the event?
No idea why you've been downvoted. This is excellent news.
Because this never gets brought up about US models, which have just as much censorship as the Chinese ones.
No, US models have alignment. Only Chinese models have censorship.
US models are happily parroting Russian fakes. US censorship is a joke.
Can you point me to one example? (Without web search, of course). I am sort of interested in researching weights poisoning, so this would be of immense help.
Please educate us - which accurate and provable events in history are censored by US based LLMs as part of a government enforced reeducation campaign?
Does it even matter which agendas get censored? Like why won't my Claude tell me how to make sarin gas? I'd genuinely like to understand it. Sure, you can always reach for a justification saying "preventing terrorism" but the same argument can be made by Chinese AI labs.
What actually matters is that the mere tool is withholding information at all, and that the boundaries were set by whoever designed it.
Don't get me wrong, I've been an advocate of this stuff (I carry two phones, one with GOS for my personal use and the other for ID verifications). However, without reasoning, you just can't see it, because you're as biased and propagandized as anyone in China.
You can read this in Wikipedia. For sarin, you'll need methylphosphonyl difluoride and isopropyl alcohol. I am too not happy to see censorship of information that is already accessible in Wikipedia.