The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical ...
When I was in school, multiple-choice exams were the backbone of testing. Teachers relied on them because they were efficient: Scantron sheets could be graded quickly, objectively and consistently.
It looks like other multiple choice tests but it’s not, so skills that were well developed in years of standardized testing are rendered irrelevant. 2) multiple choice is only one axis of evaluation ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results