>_TheQuery

The Chip That Only Does One Thing. And Does It 28x Faster Than NVIDIA.

March 8, 2026

Every AI chip on the market today makes the same promise. Flexibility. Run any model. Switch between GPT and Llama and Mistral without changing your hardware. Load the weights, run the inference, swap them out when something better arrives.

It is a reasonable promise. It is also why inference is still slow and expensive for applications that cannot afford to be either.

Taalas is making a different promise. One model. Forever. 28x faster than an NVIDIA B200.


The Flexibility Tax

When you run a model on an NVIDIA H100 or a Groq chip or a Cerebras wafer, the weights load into memory at runtime. The chip reads them, processes them, produces output. The hardware is general purpose. It can handle any model you throw at it.

That flexibility has a cost. General purpose hardware cannot be perfectly optimized for any specific model because it has to be ready for all of them. The memory bandwidth, the interconnects, the cooling requirements — all of it designed for the worst case, not the optimal case.

The result is inference speeds that feel fast until you build something that needs to be faster.

The NVIDIA B200 manages roughly 594 tokens per second per user. Groq pushes that to around 600. Cerebras reaches approximately 2,000. These numbers represent billions of dollars of engineering and years of iteration.

Taalas HC1: 17,000 tokens per second per user.
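The headline multiple falls straight out of these published figures. A quick sketch, using the vendor-reported tokens-per-second numbers quoted above (approximate, single-user throughput):

```python
# Per-user throughput figures as quoted in this article (vendor-reported,
# so treat them as approximate).
throughput_tps = {
    "NVIDIA B200": 594,
    "Groq": 600,
    "Cerebras": 2_000,
    "Taalas HC1": 17_000,
}

hc1 = throughput_tps["Taalas HC1"]
for chip, tps in throughput_tps.items():
    # Ratio of HC1 throughput to each chip's throughput
    print(f"{chip:>12}: {tps:>6,} tok/s  ->  HC1 is {hc1 / tps:.1f}x faster")
```

Against the B200's 594 tok/s, the ratio is about 28.6x, which is where the "28x" claim comes from.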


What Taalas Actually Built

The HC1 chip runs Llama 3.1 8B. Only Llama 3.1 8B. It cannot run anything else.

The reason it is 28x faster than a B200 is that it does not load model weights at runtime. The weights are baked directly into the transistors during chip manufacturing. Llama 3.1 8B is not software running on hardware. It is the hardware.

Manufactured on TSMC's 6nm process, the HC1 measures 815mm² and packs 53 billion transistors. Taalas calls this the Hardcore AI architecture — model parameters embedded directly into custom silicon instead of being executed in software at runtime.

No HBM required. HBM (High Bandwidth Memory) is the expensive, power-hungry, physically complex memory that makes modern AI chips large and hot and costly. Remove the need to store weights in memory and you remove the need for HBM entirely.

No liquid cooling. No complex packaging. A dramatically simpler chip that costs less to manufacture and less to run.

The inference cost drops to approximately $0.0075 per million tokens. Standard cloud GPU inference runs $0.20 to $0.50 per million tokens. Taalas is roughly 27 to 67x cheaper per token.
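The per-token cost ratio can be checked directly from the figures above:

```python
# Cost figures as quoted in this article, in $ per million tokens.
taalas_cost = 0.0075
cloud_low, cloud_high = 0.20, 0.50

ratio_low = cloud_low / taalas_cost    # cheapest cloud case
ratio_high = cloud_high / taalas_cost  # most expensive cloud case

print(f"Taalas is {ratio_low:.0f}x to {ratio_high:.0f}x cheaper per token")
```

The low end works out to about 27x, the high end to about 67x.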

The company has raised $219 million total, with $169 million in its latest round from investors including Quiet Capital, Fidelity, and veteran chip investor Pierre Lamond.


The Obvious Problem

The HC1 runs Llama 3.1 8B. As newer Llama generations arrive and Llama 3.1 8B becomes obsolete, the chip becomes obsolete with it.

You cannot update it. You cannot swap the model. You tape out a new chip for the new model and the old one becomes expensive silicon.

In a field where the state of the art shifts every few months, this sounds like a fatal flaw. Why build infrastructure that expires?

The answer is that not every application needs the latest model. Some applications need speed.


What 17,000 Tokens Per Second Actually Unlocks

At 600 tokens per second, real-time voice AI has noticeable latency. The gap between the user speaking and the model responding is long enough to feel unnatural. You build around it. You add filler sounds. You design the product to hide the delay.

At 17,000 tokens per second, that problem disappears. The latency drops below the threshold of human perception. Voice AI stops feeling like talking to a machine with a slow connection and starts feeling like talking to something present.
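The arithmetic behind that threshold is simple. The sketch below counts generation time only, ignoring speech recognition, text-to-speech, and network overhead, and assumes a short spoken reply of roughly 40 tokens (both figures are illustrative assumptions, not from the article):

```python
def generation_time_ms(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `tokens` tokens at a given per-user throughput.

    Generation time only; real voice pipelines add ASR, TTS, and
    network latency on top of this.
    """
    return tokens / tokens_per_sec * 1000

# A short spoken reply of ~40 tokens (illustrative), at each throughput:
for tps in (600, 17_000):
    print(f"{tps:>6,} tok/s -> {generation_time_ms(40, tps):.1f} ms")
```

At 600 tok/s the reply alone takes roughly 67 ms; at 17,000 tok/s it takes under 3 ms, leaving essentially the entire latency budget for the rest of the pipeline.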

The same threshold shift applies to robotics. Industrial control systems. Gaming NPCs that need to respond to player behavior faster than a player can register the response. Any application where the bottleneck was inference latency rather than model quality.

There is a large and mostly untapped category of applications where a capable 8B parameter model is good enough and speed is the actual constraint. Taalas HC1 changes the economics and the experience for all of them simultaneously.

$0.0075 per million tokens at sub-millisecond latency opens use cases that were not viable at $0.30 per million tokens at 600 tokens per second. The math changes. The product changes with it.


The Broader Bet

Taalas is not trying to replace NVIDIA. They are not building a general purpose AI accelerator.

They are betting that the AI infrastructure market will bifurcate. On one side: flexible hardware for research, development, and applications that need to run the latest foundation models. On the other side: specialized hardware for production applications where the model is stable and speed and cost are the primary constraints.

The flexibility tax is worth paying when you need flexibility. It is not worth paying when you do not.

For applications that have found their model, locked their use case, and need to scale inference as cheaply and quickly as possible, Taalas is building something that the flexible hardware market structurally cannot match.

Taalas has already announced that its second model — a mid-sized reasoning LLM still on the HC1 platform — is expected in its labs this spring.

The chip does one thing. It does it 28x faster than the alternative.

For the right application, that is not a limitation. That is the entire value proposition.


Sources:
Taalas HC1 hardwired Llama-3.1 8B AI accelerator delivers up to 17,000 tokens/s — CNX Software, February 2026.
Taalas Etches AI Models Onto Transistors To Rocket Boost Inference — The Next Platform, February 2026.
Chip startup Taalas raises $169 million to help build AI chips to take on Nvidia — Reuters via Yahoo Finance.
AI chip startup Taalas raises $169m, unveils HC1 processor optimized for Llama 3.1 8B — Data Center Dynamics.