Converting Open-ended Questions to Multiple-choice Questions Simplifies Biomedical Vision-Language Model Evaluation

\(^1\)Tsinghua University \(^2\)Stanford University
Preliminarily Accepted to ML4H

Abstract

Vision-language models (VLMs) show promise in medicine, but their evaluation remains challenging due to the open-ended nature of their responses. Current metrics often fail to capture nuances in human judgment, while model-based evaluations are computationally expensive and unstable. We propose converting open-ended questions into a multiple-choice format to address these limitations. Using an agent-based framework with GPT-4, we transform questions through iterative refinement. Our results demonstrate a strong correlation between multiple-choice and open-ended performance across three datasets. We evaluate 18 models on these converted datasets, showing improved capability discrimination. This work contributes a novel evaluation framework that aims to enable easier and more consistent VLM evaluation in medicine.


What are we doing?

We identified challenges in open-ended medical VQA evaluation and propose converting questions to a multiple-choice format as an alternative. Given an open-ended question, its reference answer, and the corresponding image, we aim to generate three challenging distractors and then combine the question, answer, and distractors into a multiple-choice question.
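To make the conversion target concrete, here is a minimal Python sketch of the input and output structures; the names (OpenEndedQA, MultipleChoiceQA, to_multiple_choice) are illustrative, not part of a released codebase.

from dataclasses import dataclass
import random

@dataclass
class OpenEndedQA:
    image_path: str  # path to the medical image
    question: str    # open-ended question about the image
    answer: str      # reference free-text answer

@dataclass
class MultipleChoiceQA:
    image_path: str
    question: str
    options: list       # the correct answer plus 3 distractors, shuffled
    correct_index: int  # position of the correct answer in options

def to_multiple_choice(item, distractors):
    # Combine the original question/answer with 3 generated distractors.
    assert len(distractors) == 3
    options = distractors + [item.answer]
    random.shuffle(options)  # avoid a fixed position for the correct answer
    return MultipleChoiceQA(item.image_path, item.question, options,
                            options.index(item.answer))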



Why are we doing this?

We revisit open-ended medical VQA evaluation, finding critical limitations of existing methods.

We find that: (Left) rule-based metrics have relatively low correlation with human evaluation results and penalize models that do not strictly follow the expected answer format. (Right) Model-based evaluations are time-consuming and expensive; using two different versions of GPT yields substantially different scores, making comparisons inconsistent and raising reproducibility issues.
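As a small illustration of the first point, consider exact match and token recall as stand-ins for rule-based metrics (the paper's exact metric suite may differ): a correct but verbose answer can score zero purely because of its format.

def exact_match(prediction, reference):
    # Rule-based metric: 1 if the normalized strings are identical, else 0.
    normalize = lambda s: " ".join(s.lower().strip().split())
    return float(normalize(prediction) == normalize(reference))

def token_recall(prediction, reference):
    # Fraction of reference tokens that also appear in the prediction.
    pred_tokens = set(prediction.lower().split())
    ref_tokens = reference.lower().split()
    return sum(t in pred_tokens for t in ref_tokens) / len(ref_tokens)

reference = "pneumothorax"
prediction = "The image shows a right-sided pneumothorax."
print(exact_match(prediction, reference))   # 0.0 -- the strings do not match
print(token_recall(prediction, reference))  # 0.0 -- naive tokenization keeps the trailing period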



How are we doing it?

Given the challenge of evaluating open-ended questions for vision language models (VLMs) detailed in the previous section, how can we mitigate these issues? We propose to convert open-ended questions into a multiple-choice format, capitalizing on the simplicity and objectivity of evaluating multiple-choice questions. However, traditionally, creating multiple-choice questions, especially reasonable yet challenging distractor options, requires substantial human expertise and effort. Therefore, we present AutoConverter, an agentic pipeline that automatically generates high-quality multiple-choice questions from open-ended ones.


AutoConverter divides the pipeline into two stages, aiming to increase the difficulty and ensure the correctness of the generated distractors.

  • (Left) In stage 1, we define 5 different types of errors and design an expert agent for each error type. Each expert creates 6 distractors; a reviewer agent then comments on each distractor, and the expert revises its distractors based on these comments.
  • (Right) In stage 2, we first use a selector to identify the 3 hardest distractors from the pool of 30 candidates. We then employ an evaluator to check their correctness and refine them iteratively. Once the correctness score of the distractors meets the specified threshold, we output the final selection (a sketch of the full loop follows below).
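Putting the two stages together, the following is a high-level sketch of the loop, not AutoConverter's released code: the agents object and its methods (propose, review, revise, select_hardest, score_correctness, refine) are hypothetical stand-ins for the GPT-4-backed agent calls, and the threshold and round count are illustrative defaults.

def autoconvert(question, answer, image, agents, threshold=0.9, max_rounds=3):
    # Stage 1: each of the 5 error-type experts proposes 6 distractors,
    # which are revised after reviewer comments (5 x 6 = 30 candidates).
    candidates = []
    for error_type in agents.error_types:
        distractors = agents.propose(question, answer, image, error_type, n=6)
        comments = agents.review(question, answer, distractors)
        candidates += agents.revise(distractors, comments)

    # Stage 2: keep the 3 hardest candidates, then iteratively check and
    # refine them until the correctness score clears the threshold.
    selected = agents.select_hardest(question, answer, candidates, k=3)
    for _ in range(max_rounds):
        if agents.score_correctness(question, answer, selected) >= threshold:
            break
        selected = agents.refine(question, answer, selected)
    return selected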



How well can we do it?

  • Convert 3 open-ended datasets (VQA-RAD, PathVQA, Slake) to multiple-choice format
  • Evaluate 18 models on both open-ended and multiple-choice questions (a minimal accuracy-scoring sketch follows this list)
  • Conduct experiments on correlation and difficulty
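As noted above, scoring a model on the converted questions reduces to a string comparison rather than an LLM judge. The sketch below assumes a hypothetical model.generate(image=..., prompt=...) interface and reuses the MultipleChoiceQA fields from the earlier sketch.

import string

def format_prompt(item):
    # Render a converted question as an A/B/C/D prompt.
    letters = string.ascii_uppercase
    lines = [item.question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with the letter of the correct option only.")
    return "\n".join(lines)

def multiple_choice_accuracy(model, dataset):
    correct = 0
    for item in dataset:
        reply = model.generate(image=item.image_path, prompt=format_prompt(item))
        predicted_letter = reply.strip()[:1].upper()  # first character as the chosen letter
        correct += predicted_letter == string.ascii_uppercase[item.correct_index]
    return correct / len(dataset)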

(Left) We compare a naïve distractor generation method (“create 3 distractors for this question”) with our agentic pipeline for question generation, using model accuracy as the evaluation metric. Results show that questions generated by our agentic pipeline are significantly more challenging for the model than those created using the naïve approach.

(Right) We use a model-based score as a substitute for human evaluation. Comparing how well rule-based scores and multiple-choice results each correlate with this model-based score, we find that our multiple-choice questions exhibit a higher correlation, indicating their potential to serve as a more stable and efficient substitute for open-ended evaluation.
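The correlation itself is computed with standard statistics over per-model scores. The numbers below are placeholders for illustration only, not results from the paper; each element corresponds to one evaluated VLM.

from scipy.stats import pearsonr, spearmanr

mc_accuracy = [0.62, 0.55, 0.71, 0.48, 0.66]  # multiple-choice accuracy per model (placeholder)
judge_score = [0.58, 0.51, 0.69, 0.45, 0.60]  # model-based open-ended score per model (placeholder)

r, _ = pearsonr(mc_accuracy, judge_score)
rho, _ = spearmanr(mc_accuracy, judge_score)
print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")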


Explore our converted datasets!


BibTeX

@article{Medical_AutoConverter-ML4H-24,
      title={Converting Open-ended Questions to Multiple-choice Questions Simplifies Biomedical Vision-Language Model Evaluation},
      author={Su, Yuchang and Zhang, Yuhui and Liu, Yiming and Schmidt, Ludwig and Yeung-Levy, Serena},
      journal={ML4H 2024},
      year={2024}
}