Abstract

Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice.

We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training.

To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems.

Key Contributions

  • MedRAX, a specialized AI agent framework that seamlessly integrates multiple CXR analysis tools without additional training, dynamically orchestrating specialized components for complex medical queries.
  • ChestAgentBench, a comprehensive evaluation framework with 2,500 complex medical queries across 7 categories, built from 675 expert-curated clinical cases to assess multi-step reasoning in CXR interpretation.
  • Experiments show that MedRAX outperforms both general-purpose and biomedical specialist models, demonstrating substantial improvements in complex reasoning tasks while maintaining transparent workflows.
  • A user-friendly interface enabling flexible deployment options, from local to cloud-based, to address healthcare privacy requirements.

MedRAX Framework

We present MedRAX, an open-source agent-based framework that can dynamically reason, plan, and execute multi-step CXR workflows. MedRAX combines multimodal reasoning with structured tool-based decision-making, enabling real-time CXR interpretation without unnecessary computational overhead. The framework integrates heterogeneous machine learning models, from lightweight classifiers to large multimodal models (LMMs), each specialized for a different downstream task, allowing it to decompose complex medical queries and solve them by reasoning across multiple analytical skills.
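To make the orchestration concrete, below is a minimal, self-contained sketch of how an agent loop might select CXR tools, call them, and accumulate their outputs in a shared state. The tool names, the planner heuristic, and the returned strings are illustrative assumptions, not the actual MedRAX implementation.

# Minimal sketch of tool orchestration in an agent loop (illustrative only).
# Tool names, the plan_next_step heuristic, and tool outputs are assumptions,
# not MedRAX's actual components.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    tool: str      # which tool the agent chose
    output: str    # what the tool returned


@dataclass
class AgentState:
    query: str
    history: List[Step] = field(default_factory=list)


# Hypothetical tool registry: each tool maps (image path, question) to a finding.
TOOLS: Dict[str, Callable[[str, str], str]] = {
    "report_generation": lambda image, q: "Report: left-sided lucency, no focal consolidation.",
    "segmentation":      lambda image, q: "Segmentation: left lung volume reduced vs. right.",
    "visual_qa":         lambda image, q: "VQA: tube projects over the lateral pleural space.",
}


def plan_next_step(state: AgentState) -> Optional[str]:
    """Stand-in for the LLM planner: pick the next tool, or stop when done."""
    used = {s.tool for s in state.history}
    for tool in ("report_generation", "segmentation", "visual_qa"):
        if tool not in used:
            return tool
    return None  # planner decides enough evidence has been gathered


def run_agent(image_path: str, query: str) -> str:
    state = AgentState(query=query)
    while (tool := plan_next_step(state)) is not None:
        output = TOOLS[tool](image_path, query)
        state.history.append(Step(tool=tool, output=output))
    # In MedRAX the LMM synthesizes the final answer; here we just concatenate.
    evidence = "\n".join(f"[{s.tool}] {s.output}" for s in state.history)
    return f"Question: {query}\nEvidence:\n{evidence}"


if __name__ == "__main__":
    print(run_agent("case_cxr.png", "What is the predominant finding?"))

Keeping every intermediate tool output in the agent state is what makes the workflow transparent and auditable, rather than a single opaque model response.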

Figure 1

ChestAgentBench

ChestAgentBench is a medical VQA benchmark that offers several distinctive advantages:

  • It represents one of the largest medical VQA benchmarks, with 2,500 questions derived from 675 expert-validated clinical cases, each with comprehensive radiological findings, detailed discussions, and multi-modal imaging data.
  • The benchmark combines complex multi-step reasoning assessment with a structured six-choice format, enabling both rigorous assessment of advanced reasoning capabilities and straightforward, reproducible scoring.
  • The benchmark features diverse questions across seven core competencies in CXR interpretation, requiring integration of multiple visual findings and reasoning to mirror the complexity of real-world clinical decision-making.

We established seven core competencies, alongside reasoning, that are essential for CXR interpretation (a sketch of a question record built around these competencies follows the list below):

  • Detection: Identifying specific findings. (e.g., "Is there a nodule present in the right upper lobe?")
  • Classification: Classifying specific findings. (e.g., "Is this mass benign or malignant in appearance?")
  • Localization: Precise positioning of findings. (e.g., "In which bronchopulmonary segment is the mass located?")
  • Comparison: Analyzing relative sizes and positions. (e.g., "How has the pleural effusion volume changed compared to prior imaging?")
  • Relationship: Understanding relationships between findings. (e.g., "Does the mediastinal lymphadenopathy correlate with the lung mass?")
  • Diagnosis: Interpreting findings for clinical decisions. (e.g., "Given the CXR, what is the likely diagnosis?")
  • Characterization: Describing specific finding attributes. (e.g., "What are the margins of the nodule - smooth, spiculated, or irregular?")
  • Reasoning: Explaining medical rationale and thought. (e.g., "Why do these findings suggest infectious rather than malignant etiology?")
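Putting the six-choice format and the competency labels together, a ChestAgentBench-style question could be represented as below. The field names and example content are assumptions for illustration, not the released benchmark schema.

# Illustrative sketch of a ChestAgentBench-style question record;
# field names and example content are assumptions, not the released schema.
from dataclasses import dataclass
from typing import List


@dataclass
class BenchmarkQuestion:
    case_id: str        # source clinical case the question is derived from
    competency: str     # one of the seven competencies (plus reasoning)
    question: str
    options: List[str]  # structured six-choice format
    answer: str         # correct option label, e.g. "C"


example = BenchmarkQuestion(
    case_id="eurorad_16703",
    competency="diagnosis",
    question="Given the CXR, what is the most likely predominant finding?",
    options=["A) Pneumonia", "B) Pleural effusion", "C) Pneumothorax",
             "D) Pulmonary edema", "E) Lung mass", "F) Normal study"],
    answer="C",
)

if __name__ == "__main__":
    print(example.competency, example.answer)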

Experiments

We evaluate MedRAX against four models: LLaVA-Med, a LLaVA-13B model fine-tuned for biomedical visual question answering (Li et al. 2024); CheXagent, a Vicuna-13B-based VLM trained for CXR interpretation (Chen et al. 2024); and GPT-4o and Llama-3.2-90B Vision, popular closed-source and open-source multimodal LLMs, respectively.

We evaluate models on two complementary benchmarks:

  • (1) ChestAgentBench, our proposed benchmark, which assesses comprehensive CXR reasoning through 2,500 six-choice questions across seven categories: detection, classification, localization, comparison, relationship, characterization, and diagnosis. Model performance is measured by accuracy across all questions (see the scoring sketch after this list).
  • (2) CheXbench, a popular benchmark that evaluates seven clinically relevant CXR interpretation tasks. We specifically focus on the visual question answering (238 questions from the Rad-Restruct and SLAKE datasets) and fine-grained image-text reasoning (380 questions from the OpenI dataset) subsets, as they most closely mirror complex clinical workflows that require precise differentiation between similar findings.
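The following is a minimal sketch of how accuracy on six-choice questions can be computed, both overall and per category. The prediction format and category names are assumptions for illustration.

# Sketch of scoring six-choice questions: overall and per-category accuracy.
# The (category, predicted, gold) record format is an assumption.
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_category(
    records: List[Tuple[str, str, str]],  # (category, predicted_option, gold_option)
) -> Dict[str, float]:
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, pred, gold in records:
        total[category] += 1
        correct[category] += int(pred.strip().upper() == gold.strip().upper())
    scores = {c: correct[c] / total[c] for c in total}
    scores["overall"] = sum(correct.values()) / sum(total.values())
    return scores


if __name__ == "__main__":
    demo = [("detection", "A", "A"), ("diagnosis", "C", "B"), ("localization", "D", "D")]
    print(accuracy_by_category(demo))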

Case Study

We present two representative cases that compare MedRAX to GPT-4o.

Medical Device Identification (Eurorad Case 17576)

This question asks the model to determine the type of tube present in the CXR. GPT-4o incorrectly suggests an endotracheal tube based on the central positioning of the tube alone. MedRAX integrates findings from multiple tools, such as report generation and visual QA, and correctly identifies a chest tube even though one tool (LLaVA-Med) suggests otherwise. This demonstrates MedRAX's ability to resolve conflicting tool outputs through systematic reasoning.

Multi-step Disease Diagnosis (Eurorad Case 16703)

This question asks about diagnosing the predominant disease and comparing its severity across the lungs. GPT-4o misinterprets the CXR as showing pneumonia with right-lung predominance. MedRAX, through sequential application of report generation for disease identification and segmentation for lung opacity analysis, correctly identifies a left pneumothorax as the main finding. This demonstrates MedRAX's ability to break down complex queries into targeted analytical steps.
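A small sketch of the kind of step-by-step decomposition described above for this case is shown here; the plan, tool interface, and returned strings are assumed for illustration and are not MedRAX's actual outputs.

# Sketch of a two-step decomposition for a case like Eurorad 16703
# (illustrative only; tool interface and outputs are assumptions).
from typing import Callable, Dict, List, Tuple

# Hypothetical plan: each step names a tool and the sub-question it answers.
PLAN: List[Tuple[str, str]] = [
    ("report_generation", "What is the predominant disease on this CXR?"),
    ("segmentation", "Compare opacity and volume between the left and right lungs."),
]


def execute_plan(tools: Dict[str, Callable[[str, str], str]], image: str) -> List[str]:
    """Run each planned step in order and collect intermediate findings."""
    return [f"{name}: {tools[name](image, sub_q)}" for name, sub_q in PLAN]


if __name__ == "__main__":
    mock_tools = {
        "report_generation": lambda img, q: "Findings consistent with left pneumothorax.",
        "segmentation": lambda img, q: "Left lung volume markedly reduced relative to right.",
    }
    for finding in execute_plan(mock_tools, "case_16703_cxr.png"):
        print(finding)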

Conclusion

MedRAX establishes a new benchmark in AI-driven CXR interpretation by integrating structured tool orchestration with large-scale reasoning. Our evaluation on ChestAgentBench demonstrates its superiority over both general-purpose and domain-specific models, reinforcing the advantages of explicit stepwise reasoning in medical AI. These findings highlight the potential of combining foundation models with specialized tools, a principle that could be applied to broader domains in healthcare and beyond. Future work should focus on optimizing tool selection, uncertainty-aware reasoning, and expanding MedRAX's capabilities to multimodal medical imaging for greater clinical impact.

BibTeX

@misc{fallahpour2025medraxmedicalreasoningagent,
  title={MedRAX: Medical Reasoning Agent for Chest X-ray},
  author={Adibvafa Fallahpour and Jun Ma and Alif Munim and Hongwei Lyu and Bo Wang},
  year={2025},
  eprint={2502.02673},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.02673},
}