June 12, 2025

A knockout blow for LLMs?

Gary Marcus:

If you can’t use a billion dollar AI system to solve a problem that Herb Simon (one of the actual “godfathers of AI”, current hype aside) solved with AI in 1957, and that first semester AI students solve routinely, the chances that models like Claude or o3 are going to reach AGI seem truly remote.

Many of the AI criticisms I see are anecdotal: self-experimentation that pokes fun at common mistakes and that, at its core, is more like gossip than evidence of a broken technology.

What we need is evidence that clearly compares LLM limitations to human capabilities and frames them in the context of historical research in the field. Gary cites an Apple report that tests the current “reasoning” capabilities of AIs like Claude and DeepSeek, highlighting that LLMs fail to scale reasoning the way humans do. As cited in Gary’s post, “[t]hey think MORE up to a point… [t]hen give up early when they have plenty left to compute. […] [T]hey’re super expensive pattern matchers that break as soon as we step outside their training distribution.”

The ultimate evidence is AI’s performance at the Tower of Hanoi, a classic puzzle of reason and logic: you move a stack of varying-sized discs across three pegs, one disc at a time, never placing a larger disc on a smaller one, until the tower is rebuilt with the discs ordered correctly by size. The more discs, the more complicated.

As you can intuit from the opening quote, AI failed at the game, of which Gary says, “[w]ith practice, a bright (and patient) seven-year-old can do it. And it’s trivial for a computer.”
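
To give a sense of just how trivial: here is a minimal sketch of the classic recursive solution, the kind of exercise first-semester AI students write. Solving n discs takes 2^n − 1 moves, but the algorithm itself is a few lines. (The function name and peg labels are my own illustration, not from either post.)

```python
def hanoi(n: int, source: str, target: str, spare: str) -> None:
    """Print the optimal move sequence for n discs (2**n - 1 moves)."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target)            # clear the n-1 smaller discs
    print(f"move disc {n}: {source} -> {target}")  # move the largest disc
    hanoi(n - 1, spare, target, source)            # restack the smaller discs


hanoi(3, "A", "C", "B")  # 7 moves for 3 discs; 10 discs would print 1,023
```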

I just so happened to land on Heydon Pickering’s post about “Accessible Rickrolling” this morning, moments before Gary’s post, and I think Heydon taps the same vein, though in a practical, real-world application:

In fact, you can’t even automate testing for good writing. No automated testing rig can adequately determine whether your labels “describe [the] topic or purpose”, as mandated by WCAG’s Headings and Labels criterion.

Why? Ultimately, because no parser is complex enough to know if your labels befit your intentions. Only a human, having pressed a button labeled “reset”, can judge whether an adequately reset-like state change, characteristic to the context, can be said to have taken place. It’s highly subjective.
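
To put Heydon’s point in code: an automated check can verify that a label exists and is non-empty, but it has no way to judge whether the label describes the control’s purpose. A minimal sketch (the function and examples are my own illustration, not from Heydon’s post):

```python
def has_label(accessible_name: str) -> bool:
    """All an automated check can really verify: a non-empty label."""
    return bool(accessible_name.strip())


# Both pass the automated check; only a human can tell which one
# actually "describes topic or purpose" (WCAG 2.4.6, Headings and Labels).
assert has_label("Reset")   # plausible, but does the button truly reset?
assert has_label("Banana")  # satisfies the parser, fails the human
```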

AI slop can be funny. AI failing to reason can be scary, misleading, and downright dangerous. I continue to struggle with a technology that is so clearly hyped and force-fed to the masses, yet also so clearly half-baked. The AI field dates back decades and seems to have been co-opted by modern-day capitalists in search of capital over true innovation and progress.
