The recent study “Artificial intelligence for TNM staging in NSCLC – a critical appraisal of segmentation utility in [¹⁸F]FDG PET/CT” provides a critical evaluation of the clinical value of artificial intelligence (AI)–based segmentation in non-small cell lung cancer (NSCLC).
Whereas most publications in this field focus on technical performance metrics, this translational study asks whether seemingly strong segmentation results translate into clinically meaningful outcomes: accurate lesion detection, correct TNM/UICC classification, and ultimately informed treatment decisions.
Study Design and Methodology
This retrospective, single-center study analyzed [¹⁸F]FDG PET/CT scans from 306 treatment-naïve patients with newly diagnosed NSCLC.
Reference Standard: Manual lesion segmentations generated in consensus by two hybrid imaging experts
- Reporting and staging performed using the CE-certified structured reporting platform mint Lesion
- TNM classification according to the 9th edition of the TNM staging system, incorporating multidisciplinary tumor board decisions. The TNM/UICC classification was performed semi-automatically in mint Lesion, then reviewed by experts and confirmed as accurate
AI Comparison: Lesion segmentations generated using the best-performing algorithm from the autoPET III Challenge
- Segmentation outputs were classified according to the 9th edition TNM system
The rule-based structured segmentation framework in mint Lesion enables reliable semi-automated TNM and UICC classification. In addition, mint Lesion provided the technical foundation of the study by enabling manual lesion segmentation and structured data export for downstream analysis.
Key Study Findings
Technical Segmentation Performance
- Mean Dice Similarity Coefficient (DSC): 0.64
- Systematic volumetric overestimation by the AI algorithm (mean volume difference: +56.1 mL compared with manual segmentation)
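For readers less familiar with these two metrics, the following is a minimal sketch of how a Dice Similarity Coefficient and a signed volume difference can be computed for one pair of segmentations. It is not the study's evaluation code: the function names, the representation of masks as sets of voxel coordinates, the toy masks, and the voxel volume are all assumptions made for illustration.

```python
def dice_coefficient(reference, prediction):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    reference, prediction = set(reference), set(prediction)
    if not reference and not prediction:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0
    overlap = len(reference & prediction)
    return 2.0 * overlap / (len(reference) + len(prediction))


def volume_difference_ml(reference, prediction, voxel_volume_mm3):
    """Signed volume difference (prediction minus reference) in millilitres."""
    voxel_difference = len(set(prediction)) - len(set(reference))
    return voxel_difference * voxel_volume_mm3 / 1000.0  # 1 mL = 1000 mm^3


# Hypothetical toy masks, given as (z, y, x) voxel coordinates.
manual = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
ai = {(0, 0, 1), (0, 1, 1), (0, 0, 2), (0, 1, 2), (0, 2, 2), (0, 2, 1)}

dsc = dice_coefficient(manual, ai)
diff = volume_difference_ml(manual, ai, voxel_volume_mm3=64.0)  # assumed 4 mm isotropic voxels
print(f"DSC = {dsc:.2f}, volume difference = {diff:+.3f} mL")
```

A positive volume difference means the AI mask is larger than the manual one, which is the direction of the systematic overestimation reported above.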
Lesion Detection
- Very high lesion-level sensitivity: 95.8% overall, by category:
  - T category: 96.7%
  - N category: 95.9%
  - M category: 94.8%
Precision and Sources of Error
- Moderate precision in the M category (PPV: 73.7%)
- Most frequent error source: false-positive distant metastases
- 70.4% of false-positive M lesions represented clinically relevant but benign or non-oncologic findings, including:
  - Degenerative musculoskeletal changes
  - Inflammatory processes such as pneumonia
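The sensitivity and precision (PPV) figures above follow the standard definitions based on true-positive, false-negative, and false-positive lesion counts. The sketch below shows those definitions under stated assumptions: the per-category counts are invented for illustration and are not the study's data, and the function names are hypothetical.

```python
def sensitivity(true_positives, false_negatives):
    """Share of reference lesions that the algorithm detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no reference lesions to evaluate")
    return true_positives / total


def positive_predictive_value(true_positives, false_positives):
    """Share of algorithm-detected lesions that are real (precision)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no predicted lesions to evaluate")
    return true_positives / total


# Hypothetical lesion counts per TNM category (NOT taken from the study).
counts = {
    "T": {"tp": 29, "fn": 1, "fp": 2},
    "N": {"tp": 47, "fn": 2, "fp": 6},
    "M": {"tp": 14, "fn": 1, "fp": 5},
}

for category, c in counts.items():
    sens = sensitivity(c["tp"], c["fn"])
    ppv = positive_predictive_value(c["tp"], c["fp"])
    print(f"{category}: sensitivity = {sens:.1%}, PPV = {ppv:.1%}")
```

The toy M row illustrates the pattern described in the study: few missed lesions keep sensitivity high, while a handful of false positives is enough to pull the PPV down noticeably.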
Impact on Clinical Staging
- UICC stage concordance with the reference standard in only 67.6% of patients
- Upstaging observed in 88 of 306 cases (28.8%)
- Primary drivers of staging discrepancies:
  - False-positive M lesions
  - Undersegmentation in the hilar region
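Stage concordance and upstaging are patient-level comparisons between the AI-derived stage and the reference stage. The following minimal sketch shows one way such a comparison could be tallied. It assumes the UICC lung cancer stage groups can be treated as a simple ordered list. The helper name, the patient pairs, and the omission of occult carcinoma and stage 0 are illustrative choices, not the study's implementation.

```python
# Assumed ordering of UICC stage groups for lung cancer, lowest to highest.
UICC_ORDER = ["IA1", "IA2", "IA3", "IB", "IIA", "IIB",
              "IIIA", "IIIB", "IIIC", "IVA", "IVB"]
RANK = {stage: index for index, stage in enumerate(UICC_ORDER)}


def compare_staging(pairs):
    """Tally concordant, upstaged and downstaged patients.

    pairs: iterable of (reference_stage, ai_stage) tuples, one per patient.
    """
    summary = {"concordant": 0, "upstaged": 0, "downstaged": 0}
    for reference_stage, ai_stage in pairs:
        delta = RANK[ai_stage] - RANK[reference_stage]
        if delta == 0:
            summary["concordant"] += 1
        elif delta > 0:
            summary["upstaged"] += 1
        else:
            summary["downstaged"] += 1
    total = sum(summary.values())
    summary["concordance_rate"] = summary["concordant"] / total if total else float("nan")
    return summary


# Hypothetical patients (reference stage, AI-derived stage); not study data.
patients = [
    ("IIIA", "IIIA"),
    ("IIB", "IVA"),    # e.g. a false-positive distant metastasis
    ("IA2", "IA2"),
    ("IIIB", "IIIA"),
    ("IIIA", "IVB"),
]

result = compare_staging(patients)
print(result)
```

In this toy cohort a single false-positive M lesion moves a stage IIB patient to stage IVA, which is the kind of error the study identifies as the main driver of discordant staging.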
Clinical Interpretation and Conclusions
The study concludes that, despite excellent lesion detection sensitivity (95.8%), the best-performing autoPET III algorithm achieved only 67.6% concordance in UICC staging, indicating substantial limitations for autonomous clinical use.
The clinical relevance of segmentation errors varied considerably, with false-positive lesions leading to upstaging identified as the main cause of staging discrepancies.
Key takeaway: High technical performance does not necessarily equate to clinical reliability.
Accordingly, the authors clearly recommend:
- AI as a tool for workflow optimization and decision support, not as a replacement for expert assessment
- Mandatory expert oversight, particularly for:
  - M-stage predictions
  - Complex cases involving multiple lesions
Conclusion
This study underscores the importance of task-oriented AI evaluation that extends beyond conventional segmentation metrics. For the safe clinical integration of AI in oncologic hybrid imaging, structured reporting, transparent workflows, and physician expertise remain indispensable.
For a comprehensive understanding of the methodology, detailed error analysis, and clinical implications, readers are encouraged to consult the original publication.
This work was supported by the German Federal Ministry for Research, Technology and Space Affairs (Bundesministerium für Forschung, Technologie und Raumfahrt, BMFTR) under the ‘DataXperiment’ funding initiative (project ID FKZ 01KD2431).
Heimer, Maurice M. et al. 2025. "Artificial intelligence for TNM staging in NSCLC: a critical appraisal of segmentation utility in [¹⁸F]FDG PET/CT." European Journal of Nuclear Medicine and Molecular Imaging. https://doi.org/10.1007/s00259-025-07677-2



