We’ve spent the last decade obsessing over how to scale models up to trillions of parameters, but we’ve remained remarkably mediocre at scaling them down.
There is a quiet, frustrating crisis happening at the edge of the network: our best vision models are essentially "cloud-native." They are mathematically elegant in 32-bit floating point on an H100, but they start to become unreliable or even "hallucinate" the moment they touch the cheap, low-power silicon found in drones, CCTV cameras, and IoT devices.
The culprit is usually a hidden precision wall.
To keep things snappy on edge hardware, we rely on fast convolution techniques—specifically Winograd transforms. These algorithms are the workhorses of efficient vision, but they are notoriously fragile. As you drop the bit-depth to save power, the numerical stability of these transforms starts to crumble. In the industry, we often handle this with a kind of "numerical voodoo"—manually picking transform points and praying the rounding errors don't explode into a catastrophe. It’s a bit like trying to balance a needle on its tip while the table is shaking.
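For readers who haven't met Winograd convolution, here is the textbook F(2,3) case: two outputs of a 3-tap filter computed with four multiplications instead of six, using the transform matrices for the classic interpolation points {0, 1, -1, ∞}. This is standard background, not the NOVA method itself:

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap correlation with 4 multiplies.
# These are the classic transform matrices for points {0, 1, -1, inf}.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """Two outputs of the 3-tap correlation of a 4-element input tile."""
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile
g = np.array([0.5, -1.0, 2.0])       # filter
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(np.allclose(winograd_f23(d, g), direct))  # True
```

In full precision the two paths agree exactly; the trouble starts when the transform constants and intermediate products are quantized.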
I’ve shared a preprint titled NOVA (https://arxiv.org/abs/2512.18453) that tries to tackle this from a more "bitter lesson"-pilled perspective.
The core realization is that we shouldn't be hand-tuning these transform points based on 1980s heuristics. Instead, we should treat the discovery of numerical primitives as an optimization problem. In the paper, we show how to use numerical optimization to search for "well-conditioned" Winograd transforms through Vandermonde arithmetic.
By searching for points that are inherently robust to low-precision regimes, we can find configurations that remain stable where traditional methods fail. It turns out that if you give the optimizer the right objective function and enough search budget, it can discover numerical primitives that are significantly more reliable than the ones we've been using for decades.
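To make the idea concrete, here is a toy sketch of such a search loop: a random hill-climb that scores candidate point sets by the condition number of their Vandermonde matrix. The paper's actual optimizer and objective are more involved; the search strategy, ranges, and proxy objective below are all illustrative assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def vandermonde_cond(points):
    """Condition number of the Vandermonde matrix built from the
    candidate interpolation points (a proxy for transform stability)."""
    V = np.vander(points, increasing=True)
    return np.linalg.cond(V)

def search_points(n, iters=20000):
    """Toy random hill-climb for n well-conditioned points.
    A stand-in for the paper's optimizer, not the NOVA method."""
    best = rng.uniform(-2, 2, size=n)
    best_cond = vandermonde_cond(best)
    for _ in range(iters):
        cand = best + rng.normal(scale=0.1, size=n)
        c = vandermonde_cond(cand)
        if c < best_cond:
            best, best_cond = cand, c
    return best, best_cond

naive = np.arange(6, dtype=float)   # points 0..5, a deliberately poor choice
found, c = search_points(6)
print(f"naive cond:    {vandermonde_cond(naive):.2e}")
print(f"searched cond: {c:.2e}")
```

Even this crude search beats the naive point set by orders of magnitude, which is the basic intuition: conditioning is something you can optimize for, not just hope for.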
The broader goal here is reliability. If we want AI to be truly pervasive, it has to be robust enough to handle the "messiness" of reality and the constraints of cheap hardware without needing a PhD to babysit every deployment. We need algorithms that "want" to be stable.
NOVA is a small step toward making high-performance AI more of a commodity and less of a high-wire act.
Appendix
Why Vandermonde? Standard point choices lead to increasingly ill-conditioned Vandermonde matrices as the transform size grows. By optimizing over the Vandermonde structure directly, we can keep the transform matrices well-conditioned even in 8-bit or 16-bit regimes.
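A quick way to see the ill-conditioning: build Vandermonde matrices from the classic point sequence 0, 1, -1, 2, -2, … and watch the condition number blow up as the transform grows. This is a standard observation, sketched here with NumPy:

```python
import numpy as np

def classic_points(n):
    """The traditional Winograd point sequence: 0, 1, -1, 2, -2, ..."""
    pts = [0.0]
    k = 1
    while len(pts) < n:
        pts += [float(k), float(-k)]
        k += 1
    return np.array(pts[:n])

# Condition number vs. transform size: grows roughly exponentially.
for n in range(4, 13, 2):
    V = np.vander(classic_points(n), increasing=True)
    print(f"n={n:2d}  cond={np.linalg.cond(V):.2e}")
```

At float32, a condition number in the millions already eats most of your mantissa; at 8 bits there is simply nothing left, which is why the point choice has to change.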
The "Bitter Lesson" connection: Sutton's point about leveraging compute applies even at the lowest levels of the stack. We are moving from "engineering the math" to "searching for the math."