Vision-language models (VLMs) show promise in medicine, but their evaluation remains challenging due to the open-ended nature of their outputs. Current metrics often fail to capture nuances in human judgment, while model-based evaluations are computationally expensive and unstable. We propose converting open-ended questions into a multiple-choice format to address these limitations. Using an agent-based framework with GPT-4, we transform questions through iterative refinement. Our results demonstrate a strong correlation between multiple-choice and open-ended performance across three datasets. We evaluate 18 models on these converted datasets, showing improved capability discrimination. This work contributes a novel evaluation framework, aiming to enable easier and more consistent VLM evaluation in medicine.
We identified challenges in open-ended medical VQA evaluation and chose conversion to a multiple-choice format as an alternative solution. Given an open-ended question, its answer, and the corresponding image, we aim to generate three challenging distractors, then combine the question, answer, and distractors into a multiple-choice question.
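A minimal sketch of this conversion target is shown below. The class and function names (OpenEndedExample, MultipleChoiceExample, to_multiple_choice) are illustrative, not the paper's actual API; they only make the input/output contract concrete.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class OpenEndedExample:
    image_path: str   # path to the medical image
    question: str     # open-ended question about the image
    answer: str       # ground-truth free-text answer

@dataclass
class MultipleChoiceExample:
    image_path: str
    question: str
    options: List[str]    # shuffled ground-truth answer + three distractors
    correct_option: str   # letter of the ground-truth answer, e.g. "B"

def to_multiple_choice(example: OpenEndedExample,
                       distractors: List[str]) -> MultipleChoiceExample:
    """Combine the original answer with three generated distractors
    into a four-option multiple-choice question."""
    assert len(distractors) == 3, "exactly three distractors are expected"
    options = distractors + [example.answer]
    random.shuffle(options)
    letter = "ABCD"[options.index(example.answer)]
    return MultipleChoiceExample(example.image_path, example.question,
                                 options, letter)
```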
We revisit open-ended medical VQA evaluation and find critical limitations in existing methods.
We find that: (Left) rule-based metrics have relatively low correlation with human evaluation results and penalize models that do not strictly follow the expected format. (Right) model-based evaluations are time-consuming and expensive; using two different versions of GPT yields substantially different scores, making comparisons inconsistent and raising reproducibility concerns.
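The toy example below (not the paper's exact metric) illustrates the formatting penalty: an exact-match rule scores a correct but differently phrased answer as wrong.

```python
def exact_match(prediction: str, reference: str) -> float:
    """Score 1.0 only if the normalized strings match exactly."""
    return float(prediction.strip().lower() == reference.strip().lower())

reference = "pneumonia"
print(exact_match("pneumonia", reference))                  # 1.0
print(exact_match("The finding is pneumonia.", reference))  # 0.0, despite being correct
```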
Given the challenges of evaluating open-ended questions for vision-language models (VLMs) detailed in the previous section, how can we mitigate these issues? We propose to convert open-ended questions into a multiple-choice format, capitalizing on the simplicity and objectivity of evaluating multiple-choice questions. However, creating multiple-choice questions, especially reasonable yet challenging distractor options, has traditionally required substantial human expertise and effort. Therefore, we present AutoConverter, an agentic pipeline that automatically generates high-quality multiple-choice questions from open-ended ones.
In AutoConverter, we divide the pipeline into two stages: the first increases the difficulty of the generated distractors, and the second ensures their correctness.
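The sketch below illustrates this two-stage idea under simplifying assumptions: it takes a generic `llm` callable that maps a prompt to text, and the prompts, agent roles, and stopping criteria are placeholders rather than the actual AutoConverter implementation.

```python
from typing import Callable, List

def generate_distractors(llm: Callable[[str], str],
                         question: str, answer: str,
                         n_rounds: int = 3) -> List[str]:
    """Two-stage sketch: (1) iteratively increase distractor difficulty,
    (2) verify that no distractor is itself a correct answer."""
    # Stage 1: propose distractors, then refine them over several rounds.
    prompt = (f"Question: {question}\nCorrect answer: {answer}\n"
              "Propose three plausible but incorrect options, one per line.")
    distractors = [d.strip() for d in llm(prompt).splitlines() if d.strip()][:3]
    for _ in range(n_rounds):
        refine = (f"Question: {question}\nCorrect answer: {answer}\n"
                  "Rewrite these distractors to be harder to rule out, while "
                  "keeping them incorrect, one per line:\n" + "\n".join(distractors))
        distractors = [d.strip() for d in llm(refine).splitlines() if d.strip()][:3]

    # Stage 2: correctness check -- reject distractors that could be correct.
    check = (f"Question: {question}\nCorrect answer: {answer}\nOptions:\n"
             + "\n".join(distractors)
             + "\nAnswer yes or no: could any option above also be a correct answer?")
    if llm(check).strip().lower().startswith("yes"):
        raise ValueError("a distractor may be correct; regenerate before use")
    return distractors
```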
(Left) We compare a naïve distractor generation method (“create 3 distractors for this question”) with our agentic pipeline for question generation, using model accuracy as the evaluation metric. Results show that questions generated by our agentic pipeline are significantly more challenging for the model than those created using the naïve approach.
(Right) We use a model-based score as a substitute for human evaluation. Comparing correlations with this score, we find that our multiple-choice questions correlate more strongly with model-based scores than rule-based metrics do, indicating their potential to serve as a more stable and efficient substitute for open-ended evaluation.
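The correlation check can be reproduced with a rank correlation over per-model scores, as in the sketch below; the score lists here are placeholders, not results from the paper.

```python
from scipy.stats import spearmanr

model_based_scores = [0.72, 0.55, 0.81, 0.60, 0.47]   # proxy for human judgment
rule_based_scores  = [0.30, 0.42, 0.50, 0.28, 0.35]   # e.g. exact match / BLEU
mcq_accuracies     = [0.70, 0.52, 0.85, 0.63, 0.45]   # converted multiple-choice accuracy

rho_rule, _ = spearmanr(rule_based_scores, model_based_scores)
rho_mcq, _ = spearmanr(mcq_accuracies, model_based_scores)
print(f"rule-based vs model-based:      {rho_rule:.2f}")
print(f"multiple-choice vs model-based: {rho_mcq:.2f}")
```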
@article{Medical_AutoConverter-ML4H-24,
  title={Converting Open-ended Questions to Multiple-choice Questions Simplifies Biomedical Vision-Language Model Evaluation},
  author={Su, Yuchang and Zhang, Yuhui and Liu, Yiming and Schmidt, Ludwig and Yeung-Levy, Serena},
  journal={ML4H 2024},
  year={2024}
}