Why IQ is a poor test for AI

During a recent press appearance, OpenAI CEO Sam Altman said that he’s observed the “IQ” of AI rapidly improve over the past several years.

“Very roughly, it feels to me like — this is not scientifically accurate, this is just a vibe or spiritual answer — every year we move one standard deviation of IQ,” Altman said.

Altman isn’t the first to use IQ, an estimate of a person’s intelligence, as a benchmark for AI progress. AI influencers on social media have given models IQ tests and ranked the results.

But many experts say that IQ is a poor measure of a model’s capabilities — and a misleading one.

“It can be very tempting to use the same measures we use for humans to describe capabilities or progress, but this is like comparing apples with oranges,” Sandra Wachter, a researcher studying tech and regulation at Oxford, told TechCrunch.

In his comments at the presser, Altman equated IQ with intelligence. Yet IQ tests are relative — not objective — measures of certain kinds of intelligence. There’s some consensus that IQ is a reasonable test of logic and abstract reasoning. But it doesn’t measure practical intelligence — knowing how to make things work — and it’s at best a snapshot of a person’s abilities at a single point in time.

“IQ is a tool to measure human capabilities — a contested one no less — based on what scientists believe human intelligence looks like,” Wachter noted. “But you can’t use the same measure to describe AI capabilities. A car is faster than humans, and a submarine is better at diving. But this doesn’t mean cars or submarines surpass human intelligence. You’re equivocating one aspect of performance with human intelligence, which is much more complex.”

To excel at an IQ test, a test taker must have a strong working memory and knowledge of Western cultural norms. (Some historians trace the test’s origins back to eugenics, the widely discredited theory that humans can be improved through selective breeding.) That reliance on cultural knowledge invites bias, which is why one psychologist has called IQ tests “ideologically corruptible mechanical models” of intelligence.

That a model might do well on an IQ test indicates more about the test’s flaws than the model’s performance, according to Os Keyes, a doctoral candidate at the University of Washington studying ethical AI.

“[These] tests are pretty easy to game if you have a practically infinite amount of memory and patience,” Keyes said. “IQ tests are a highly limited way of measuring cognition, sentience, and intelligence, something we’ve known since before the invention of the digital computer itself.”

AI likely has an unfair advantage on IQ tests as well: models have massive amounts of memory and internalized knowledge at their disposal. And because models are often trained on public web data, and the web is full of example questions taken from IQ tests, many models have effectively seen the test before.
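Researchers sometimes probe for this kind of train-test contamination with simple text-overlap checks. The sketch below is purely illustrative, not any lab’s actual method; the function names and the eight-word n-gram size are assumptions for the example:

```python
# Illustrative sketch only: a crude word-level n-gram overlap check of
# the kind used to look for benchmark contamination in training data.
# All names here are hypothetical; real contamination audits are far
# more involved.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(test_question: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the question's n-grams that also appear in the document.

    A high score suggests the question (or a near-copy) was in the
    training data, so a model's "correct" answer may reflect recall
    rather than reasoning.
    """
    q = ngrams(test_question, n)
    if not q:
        return 0.0
    return len(q & ngrams(training_doc, n)) / len(q)

# A verbatim reposted puzzle scores 1.0; unrelated text scores 0.0.
question = "which number comes next in the sequence 2 4 8 16 32 and why"
doc = "blog post: which number comes next in the sequence 2 4 8 16 32 and why ..."
print(overlap_score(question, doc))  # -> 1.0
```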

“Tests tend to repeat very similar patterns — a pretty foolproof way to raise your IQ is to practice taking IQ tests, which is essentially what every [model] has done,” said Mike Cook, a research fellow at King’s College London specializing in AI. “When I learn something, I don’t get it piped into my brain with perfect clarity 1 million times, unlike AI, and I can’t process it with no noise or signal loss, either.”

Ultimately, IQ tests, biased as they are, were designed for humans as a way to evaluate general problem-solving abilities, Cook added. They’re inappropriate for a technology that approaches problem-solving very differently than people do.

“A crow might be able to use a tool to recover a treat from a box, but that doesn’t mean it can enroll at Harvard,” Cook said. “When I solve a mathematics problem, my brain is also contending with its ability to read the words on the page correctly, to not think about the shopping I need to do on the way home, or if it’s too cold in the room right now. In other words, human brains contend with a lot more things when they solve a problem — any problem at all, IQ tests or otherwise — and they do it with a lot less help [than AI.]”

All this points to the need for better AI tests, Heidy Khlaaf, chief AI scientist at the AI Now Institute, told TechCrunch.

“In the history of computation, we haven’t compared computing abilities to that of humans’ precisely because the nature of computation means systems have always been able to complete tasks already beyond human ability,” Khlaaf said. “This idea that we directly compare systems’ performance against human abilities is a recent phenomenon that is highly contested, and what surrounds the controversy of the ever-expanding — and moving — benchmarks being created to evaluate AI systems.”
