VLM3D Challenge – Task 4: Text‑Conditional CT Generation

Welcome to Task 4 of the Vision‑Language Modeling in 3D Medical Imaging (VLM3D) Challenge. In this task, teams must synthesize realistic 3‑D chest CT volumes from free‑form radiology text prompts.


Contents

  1. Overview
  2. Dataset
  3. Task Objective
  4. Participation Rules
  5. Evaluation & Ranking
  6. Prizes & Publication
  7. Citation
  8. Contact

Overview

Synthetic 3‑D imaging unlocks:

  • Data augmentation for scarce pathologies
  • Privacy‑preserving sharing of realistic cases
  • Pre‑training of downstream models

Task 4 challenges participants to convert clinical text (radiology reports) into high‑fidelity chest CT scans that faithfully reflect the described anatomy and pathology.


Dataset

Split           Patients   CT Volumes   Reports   Source
Train           20,000     ≈47,000      20,000    Istanbul Medipol University
Validation      1,304      ≈3,000       1,564     Istanbul Medipol University
Internal Test   2,000      2,000        hidden    Istanbul Medipol University
External Test   1,024      1,024        hidden    Boston University Hospital

Each report serves as the conditioning prompt; each NIfTI volume is the target output.


Task Objective

Given a radiology report, generate a 3‑D NIfTI chest CT volume that:

  • Matches anatomical context (lungs, mediastinum, pleura)
  • Reflects all described pathologies (e.g., “right lower‑lobe nodule 5 mm”)
  • Exhibits realistic Hounsfield unit distributions, voxel spacing, and slice thickness

Each output volume must use the voxel spacing specified in the submission template.
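
For reference, here is a minimal sketch of wrapping a generated array as a NIfTI volume with explicit voxel spacing, using nibabel. The array shape, spacing values, and output path below are illustrative assumptions, not challenge requirements; the submission template is authoritative.

import numpy as np
import nibabel as nib

def save_ct_volume(volume_hu, spacing_xyz=(0.75, 0.75, 1.5),
                   out_path="generated_ct.nii.gz"):
    """Save an (X, Y, Z) array of Hounsfield units as a NIfTI volume."""
    # Axis-aligned affine encoding the voxel spacing in millimetres.
    affine = np.diag([*spacing_xyz, 1.0])
    img = nib.Nifti1Image(volume_hu.astype(np.int16), affine)
    img.header.set_zooms(spacing_xyz)
    nib.save(img, out_path)

# Example: a dummy 512 x 512 x 300 volume filled with air (-1000 HU).
save_ct_volume(np.full((512, 512, 300), -1000, dtype=np.int16))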


Participation Rules

  • Inference: Fully automatic – no manual editing.
  • Training data: CT‑RATE + public data/models permitted.
  • External masks: Allowed, but output must be a full CT volume.
  • Submissions: One compressed archive per scan (see the packaging sketch after this list); at most one run per day; the last run counts.
  • Organizers: Visible on leaderboard, not prize‑eligible.
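
A hypothetical packaging helper for the one-archive-per-scan rule; the archive naming scheme below is an assumption, so defer to the official submission template for the required layout.

import tarfile
from pathlib import Path

def package_scan(nifti_path):
    """Bundle a single generated NIfTI scan into its own .tar.gz archive."""
    nifti = Path(nifti_path)
    archive = nifti.parent / (nifti.name.split(".")[0] + ".tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(nifti, arcname=nifti.name)  # store only the file, no directories
    return archive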

Evaluation & Ranking

Generation Metrics

Metric          Role
FVD (I3D)       Spatio‑temporal fidelity of the slice sequence
FVD (CT‑Net)    CT‑specific Fréchet distance (anatomical realism)
CT‑CLIP Score   Text‑image semantic alignment
FID             Global visual realism

Metrics are averaged over the test set.
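
FID and both FVD variants reduce to the same underlying computation: a Fréchet distance between Gaussians fitted to real and generated feature embeddings. The sketch below assumes (N, D) embedding arrays produced by whichever feature extractor applies (I3D, CT‑Net, Inception); it is not the official evaluation code.

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two (N, D) embedding sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean.real))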

Final Ranking

  1. Compute all metrics above.
  2. For each metric, run a two‑sided permutation test (10,000 resamples) between each pair of teams.
  3. Award 1 point per significant win; sum points across metrics.
  4. Order teams by total points (higher = better). Ties share the same rank.

Missing volumes receive the worst score for that scan.
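
A minimal sketch of the pairwise significance scheme, assuming paired per-scan scores for each team on a single metric. The mean-difference statistic, sign-flip resampling, and 0.05 threshold are assumptions, not official evaluation details.

import numpy as np

def permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on the mean score difference."""
    rng = np.random.default_rng(seed)
    diffs = a - b                      # paired per-scan differences
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_perm):
        signs = rng.choice((-1.0, 1.0), size=diffs.size)  # random sign flips
        if abs((signs * diffs).mean()) >= observed:
            hits += 1
    return hits / n_perm

def metric_points(scores, higher_is_better=True, alpha=0.05):
    """One point per statistically significant pairwise win on one metric."""
    points = {team: 0 for team in scores}
    teams = list(scores)
    for i, t1 in enumerate(teams):
        for t2 in teams[i + 1:]:
            if permutation_pvalue(scores[t1], scores[t2]) < alpha:
                t1_better = scores[t1].mean() > scores[t2].mean()
                points[t1 if t1_better == higher_is_better else t2] += 1
    return points

Summing these per-metric points across all four metrics yields the totals used in step 4.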


Prizes & Publication

  • Awards – details TBA.
  • Every team with a valid submission will be invited to co‑author the joint challenge paper (MedIA / IEEE TMI).
  • An overview manuscript describing baseline results will appear on arXiv before the test phase closes.

Citation

@article{hamamci2024developing,
  title   = {Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography},
  author  = {Hamamci, Ibrahim Ethem and Er, Sezgin and others},
  journal = {arXiv preprint arXiv:2403.17834},
  year    = {2024}
}

Contact

Technical questions: open an issue or post on the challenge forum. Other inquiries: use “Help → Email organizers” on the challenge site.