Recent research from Stanford University, conducted in collaboration with Tsinghua University, has revealed a surprising shift in how we evaluate the performance of large language models (LLMs). Rather ...