VLM3D Challenge – Task 4: Text‑Conditional CT Generation¶
Welcome to Task 4 of the Vision‑Language Modeling in 3D Medical Imaging (VLM3D) Challenge. In this task, teams must synthesize realistic 3‑D chest CT volumes from free‑form radiology text prompts.
Contents¶
- Overview
- Dataset
- Task Objective
- Participation Rules
- Evaluation & Ranking
- Prizes & Publication
- Citation
- Contact
Overview¶
Synthetic 3‑D imaging unlocks:
- Data augmentation for scarce pathologies
- Privacy‑preserving sharing of realistic cases
- Pre‑training of downstream models
Task 4 challenges participants to convert clinical text (radiology reports) into high‑fidelity chest CT scans that faithfully reflect the described anatomy and pathology.
Dataset¶
| Split | Patients | CT Volumes | Reports | Source |
|---|---|---|---|---|
| Train | 20 000 | ≈ 47 k | 20 000 | Istanbul Medipol University |
| Validation | 1 304 | ≈ 3 k | 1 564 | Istanbul Medipol University |
| Internal Test | 2 000 | 2 000 | hidden | Istanbul Medipol University |
| External Test | 1 024 | 1 024 | hidden | Boston University Hospital |
Each report serves as the conditioning prompt; each NIfTI volume is the target output.
Task Objective¶
Given a radiology report, generate a 3‑D NIfTI chest CT volume that:
- Matches anatomical context (lungs, mediastinum, pleura)
- Reflects all described pathologies (e.g., “right lower‑lobe nodule 5 mm”)
- Exhibits realistic Hounsfield distributions, spacing & slice thickness
The output volume must use the voxel spacing specified in the submission template.
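The spacing requirement is encoded in the NIfTI affine matrix, whose diagonal carries the voxel size in millimetres. A minimal NumPy sketch, assuming a hypothetical 0.75 × 0.75 × 1.5 mm spacing — substitute whatever the submission template actually specifies:

```python
import numpy as np

def make_affine(spacing_xyz):
    """Build a 4x4 NIfTI-style affine whose diagonal encodes voxel spacing (mm)."""
    affine = np.eye(4)
    affine[0, 0], affine[1, 1], affine[2, 2] = spacing_xyz
    return affine

# Hypothetical target spacing and grid size; follow the submission template.
affine = make_affine((0.75, 0.75, 1.5))
volume = np.full((512, 512, 201), -1000, dtype=np.int16)  # air-filled placeholder in HU
```

With nibabel, wrapping these as `nib.Nifti1Image(volume, affine)` and calling `nib.save(...)` would serialize the result as a `.nii.gz` file.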
Participation Rules¶
- Inference: Fully automatic – no manual editing.
- Training data: CT‑RATE + public data/models permitted.
- External masks: Allowed, but output must be a full CT volume.
- Submissions: One compressed archive per scan; at most one run per day; the last run counts.
- Organizers: Visible on the leaderboard, but not prize‑eligible.
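The one-archive-per-scan rule is easy to script. A sketch using Python's standard-library `tarfile`; the `case_0001.nii.gz` name and the `.tar.gz` format are illustrative assumptions — use the naming scheme and archive format the submission system actually requires:

```python
import os
import tarfile
import tempfile

def pack_scan(nifti_path, archive_path):
    """Wrap one generated scan in its own compressed archive (one archive per scan)."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(nifti_path, arcname=os.path.basename(nifti_path))

# Demo with a placeholder file standing in for a generated volume.
tmp = tempfile.mkdtemp()
scan = os.path.join(tmp, "case_0001.nii.gz")       # hypothetical naming scheme
with open(scan, "wb") as f:
    f.write(b"\x00" * 16)                          # dummy bytes, not a real volume
archive = os.path.join(tmp, "case_0001.tar.gz")
pack_scan(scan, archive)
```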
Evaluation & Ranking¶
Generation Metrics¶
| Metric | Role |
|---|---|
| FVD‑I3D | Spatio‑temporal fidelity of the slice sequence |
| FVD‑CT‑Net | CT‑specific Fréchet distance (anatomical realism) |
| CT‑CLIP Score | Text–image semantic alignment |
| FID | Global visual realism |
Metrics are averaged over the test set.
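The FVD and FID metrics share the same core computation: a Fréchet distance between Gaussians fitted to deep features, differing only in the feature extractor (I3D, CT‑Net, or a 2‑D Inception network). A sketch of that core on toy features, assuming features are already extracted — the real metrics run the named backbones first:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets (core of FID/FVD)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(((mu_a - mu_b) ** 2).sum()
                 + np.trace(cov_a + cov_b - 2 * covmean))

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 8))       # toy stand-in for extracted features
d_same = frechet_distance(feats, feats)          # identical sets -> distance ~ 0
d_shifted = frechet_distance(feats, feats + 5.0) # mean shift -> large distance
```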
Final Ranking¶
- Compute all metrics above.
- For each metric, run a two‑sided permutation test (10 000 permutations) between each pair of teams.
- Award 1 point per significant win; sum points across metrics.
- Order teams by total points (higher = better). Ties share the same rank.
Missing volumes receive the worst score for that scan.
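The pairwise ranking procedure above can be sketched as follows. The team names, score values, and the 0.05 significance threshold are illustrative assumptions (the organizers define the actual threshold), and note that "better" depends on the metric's direction — lower is better for FID:

```python
import numpy as np

def perm_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of per-scan score means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    obs = abs(a.mean() - b.mean())
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        count += diff >= obs
    return (count + 1) / (n_perm + 1)   # add-one smoothing avoids p = 0

# Hypothetical per-scan FID scores for two teams (lower FID is better).
rng = np.random.default_rng(1)
scores = {"team_a": rng.normal(12.0, 1.0, 100),
          "team_b": rng.normal(15.0, 1.0, 100)}
points = {t: 0 for t in scores}
if perm_test(scores["team_a"], scores["team_b"]) < 0.05:   # assumed threshold
    winner = min(scores, key=lambda t: scores[t].mean())   # lower mean FID wins
    points[winner] += 1
```

In the full ranking, this pairwise step repeats for every team pair and every metric, and points are summed before ordering teams.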
Prizes & Publication¶
- Awards – details TBA.
- Every team with a valid submission will be invited to co‑author the joint challenge paper (MedIA / IEEE TMI).
- An overview manuscript describing baseline results will appear on arXiv before the test phase closes.
Citation¶
    @article{hamamci2024developing,
      title   = {Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography},
      author  = {Hamamci, Ibrahim Ethem and Er, Sezgin and others},
      journal = {arXiv preprint arXiv:2403.17834},
      year    = {2024}
    }
Contact¶
Technical questions: open an issue or post on the challenge forum. Other inquiries: use “Help → Email organizers” on the challenge site.