My theory is that Reinforcement Learning from Human Feedback (RLHF) makes it very difficult to actually assess LLMs’ intelligence but is a very good way to optimize for hype.
LLMs look very smart when you speak with them. But the way agents fail to accomplish even the simplest tasks reveals that this is largely a trick: they will often enter a loop or start producing nonsense that never gets corrected. This is because in a conversation between an LLM and a human, the human sets the truth and the expectations. Being vague is often the LLM's best strategy: it says something the human will interpret charitably, and the human's reply then provides more clues about what to say next.
We don’t notice this because we instinctively assume our conversation partners think about what they are saying. We “understand” implications the LLM never actually made, and the LLM then relies on the clues in our reply to keep going in that direction.
That being said, I think LLMs will be used in truly intelligent systems, but in the same way image recognition or Text-to-Speech (TTS) is used: as a component, not as the place where the reasoning happens. Even though LLMs have the incredible property of carrying all of human culture as context, they are fundamentally bad at reasoning about anything other than what we test for during RLHF (which explains their supposedly high IQ scores).
But I think we are still relatively far from something truly intelligent; OpenAI might just not really (want to) understand that.
I like Yann LeCun’s take that we cannot build AGI before we can build something even remotely as smart as a housecat.