It’s a Saturday evening — time to break down how you’d test an AI model.
Just like there are many ways to achieve satisfaction with your Taco Bell order, testing AI models depends on your end goal and how comprehensive you want to get. You could go simple with a basic taco (one narrow test), but if you want the full experience—something that combines multiple ingredients and really tests whether everything works together—you need the Crunchwrap Supreme of AI testing.
Enter Massive Multitask Language Understanding (MMLU).
For each of 57 subjects, ranging from philosophy to astronomy, the model is first shown 5 multiple-choice questions along with their answers. After those 5 worked examples, it’s asked at least 100 more questions per subject — this time without the answers.
The model is then scored on how many of those questions it gets right.
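The protocol is simple enough to sketch in a few lines. Here’s a toy version — the questions and the “model” below are invented stand-ins, not the real benchmark data or an actual evaluation harness:

```python
# Toy sketch of MMLU-style few-shot evaluation. The data and the
# "model" are made up for illustration; a real run would call an LLM.

def format_question(question, choices):
    """Render a question with lettered answer choices A-D."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)

def build_prompt(few_shot, question, choices):
    """Prepend the worked examples, then pose the new question."""
    parts = []
    for q, c, answer in few_shot:
        parts.append(f"{format_question(q, c)}\nAnswer: {answer}")
    parts.append(f"{format_question(question, choices)}\nAnswer:")
    return "\n\n".join(parts)

def evaluate(model, few_shot, test_items):
    """Accuracy: the fraction of test questions answered correctly."""
    correct = sum(
        model(build_prompt(few_shot, q, c)) == answer
        for q, c, answer in test_items
    )
    return correct / len(test_items)
```

In the real benchmark, `model` would be a call to an AI model that returns one of A–D, and the headline MMLU number is this accuracy averaged across all 57 subjects.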
The goal is to evaluate the model’s depth of knowledge in each field — not just whether it has memorized facts, but whether it can apply them.
A reliable generalist.