CuRe: Cultural Gaps in the Long‑Tail of Text‑to‑Image Systems

1University of Wisconsin‑Madison   2University of Washington

tl;dr

  • Cultural Representativeness (CuRe): current text-to-image (T2I) systems do not represent global cultures equitably.
  • America & Europe dominate pre-training data ("head"), while the long tail overlaps more with the Global South.
  • We ask: How faithfully can T2I systems generate images for cultural artifacts from the long tail?
  • CuRe can be measured by scoring how the T2I system responds to increasing attribute specification in the text prompt.
  • Get in touch with us to benchmark your text-to-image system!
CuRe pottery teaser showing cultural gaps in T2I models

CuRe reveals how state‑of‑the‑art text‑to‑image systems perform well on concepts widely seen during pretraining (the "head" of the distribution) like (a) ceramic diyas from India but tend to misrepresent less-seen artifacts from the cultural "long-tail" such as (b) jebena from Ethiopia or (c) amphora of Hermonax from Greece, even when additional attributes are provided in the text prompt.

Abstract

Popular text‑to‑image (T2I) systems are trained on web‑scale data that is heavily Amero‑ and Euro‑centric, leading to hallucinations and misrepresentation for cultures in the Global South. We introduce CuRe, a benchmarking and scoring suite that diagnoses these cultural gaps.

CuRe couples a hierarchical dataset of 300 cultural artifacts across 64 global regions with marginal information attribution (MIA) scorers that better match human judgments of cultural representativeness, image‑text alignment, and diversity. Via a large-scale user study with more than 2,700 participants, we show that our MIA scorers correlate more strongly with real human judgments than baseline metrics.

A New Benchmark & Dataset

CuRe scorer pipeline and user-study overview

We construct the CuRe dataset with 300 cultural artifacts from 64 global regions, each described by the below attributes:

  • s - super-category (e.g. Art, Fashion, Food)
  • c - category (e.g. Dumpling)
  • n - artifact name (e.g. Banku)
  • r - region (e.g. Ghana)

We organize the CuRe dataset in a three-level hierarchy, i.e. s → c → n, r. This structure lets us ask: "How far down the long tail can a T2I model go and still be faithful?"
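The hierarchy also determines how prompts are built: each added attribute is one step of specificity down the long tail. A minimal Python sketch, with an illustrative record type and prompt templates (not the released dataset schema):

```python
from dataclasses import dataclass

# Hypothetical record mirroring the s -> c -> n, r hierarchy above;
# field names and templates are illustrative, not the dataset's actual schema.
@dataclass
class Artifact:
    super_category: str  # s, e.g. "Food"
    category: str        # c, e.g. "Dumpling"
    name: str            # n, e.g. "Banku"
    region: str          # r, e.g. "Ghana"

def prompts_with_increasing_attributes(a: Artifact) -> list[str]:
    """Prompts that add one attribute at a time, from category down to region."""
    return [
        f"An image of a {a.category.lower()}",                                    # category only
        f"An image of {a.name}",                                                  # name only
        f"An image of {a.name}, a type of {a.category.lower()}",                  # name + category
        f"An image of {a.name}, a type of {a.category.lower()} from {a.region}",  # + region
    ]

banku = Artifact("Food", "Dumpling", "Banku", "Ghana")
for p in prompts_with_increasing_attributes(banku):
    print(p)
```

Scoring how generations change across these four prompts is what the scorers in the next section formalize.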

CuRe Scorers: Measuring Cultural Representativeness

Hierarchical view of CuRe artifacts

We inject marginal information, gradually adding the name (n), region (r), and category (c) to the prompt, and feed each prompt through a T2I model fθ. A quantitative scorer φ then produces a cultural representativeness report card.

Human raters who self-identify with the culture supply the gold judgment φ*, letting us see which automatic metrics actually track human perception.

Examples of φPS, φITA, φDIV metrics and why baselines fail
  • φPS – Perceptual Similarity: visual closeness of an artifact T2I image (Banku) to its category T2I image (Dumpling)
  • φITA – Image-Text Alignment: how closely an artifact's generated image tracks the attribute changes specified in the text prompt
  • φDIV – Diversity: do attribute changes (Banku → Banku, a type of dumpling) increase generation diversity?
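As a rough illustration of what each scorer measures, here is a minimal numpy sketch over generic embedding vectors. The actual MIA scorers use specific vision and vision-language backbones and the precise formulation in the paper; treat these functions as shape-of-the-metric only:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def phi_ps(artifact_img_emb: np.ndarray, category_img_emb: np.ndarray) -> float:
    # Perceptual similarity: closeness of the artifact generation ("Banku")
    # to its category generation ("Dumpling") in an image-embedding space.
    return cosine(artifact_img_emb, category_img_emb)

def phi_ita(img_emb: np.ndarray, text_emb_before: np.ndarray,
            text_emb_after: np.ndarray) -> float:
    # Image-text alignment: does the image's alignment improve when
    # attributes are added to the prompt text?
    return cosine(img_emb, text_emb_after) - cosine(img_emb, text_emb_before)

def phi_div(img_embs: list[np.ndarray]) -> float:
    # Diversity: mean pairwise cosine distance among generations for one prompt.
    n = len(img_embs)
    dists = [1.0 - cosine(img_embs[i], img_embs[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```

For example, identical generations give φDIV of 0, and an artifact whose image embedding matches its category's gives φPS of 1.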

Large-Scale User Study

Benchmark Results

| T2I System | ELO ↑ | φ*CuRe ↑ | φ*PS ↑ | φ*GT ↑ | φPS ↓ (SL2 / DN2 / AV2) | φITA ↑ (SL2 / L2B / WIT) | φDIV ↓ |
|---|---|---|---|---|---|---|---|
| FLUX.1 [dev] | 1045 | 2.814 ± 1.424 | 2.157 ± 1.141 | 2.251 ± 1.326 | 0.561 ± 0.110 / 0.575 ± 0.137 / 0.523 ± 0.049 | 0.094 ± 0.051 / 0.218 ± 0.059 / 0.209 ± 0.039 | 0.708 ± 0.078 |
| Ideogram 2.0 | 1043 | — | — | — | — | 0.096 ± 0.052 / 0.214 ± 0.067 / 0.195 ± 0.050 | 0.693 ± 0.072 |
| SD 3.5 Large | 1028 | 2.986 ± 1.439 | 2.396 ± 1.228 | 2.534 ± 1.393 | 0.567 ± 0.107 / 0.604 ± 0.166 / 0.532 ± 0.056 | 0.115 ± 0.047 / 0.251 ± 0.053 / 0.225 ± 0.036 | 0.670 ± 0.082 |
| DALL-E 3* | ≈ 922 | — | — | — | 0.562 ± 0.103 / 0.579 ± 0.143 / 0.525 ± 0.055 | 0.105 ± 0.051 / 0.219 ± 0.062 / 0.222 ± 0.041 | 0.789 ± 0.043 |
| SDXL | ≈ 840 | — | — | — | 0.557 ± 0.100 / 0.579 ± 0.151 / 0.520 ± 0.049 | 0.113 ± 0.051 / 0.255 ± 0.056 / 0.230 ± 0.039 | 0.753 ± 0.042 |
| SD 1.5* | ≈ 587 | 2.724 ± 1.412 | 2.094 ± 1.159 | 2.175 ± 1.291 | 0.559 ± 0.104 / 0.576 ± 0.142 / 0.519 ± 0.041 | 0.107 ± 0.050 / 0.240 ± 0.055 / 0.229 ± 0.035 | 0.755 ± 0.057 |
| ρ with ELO | 1.00 | 0.564 | −0.100 | 0.564 | −0.600 | −0.657 / −0.829 | −0.600 |

* Moderate refusal rates due to safety filters (see Appendix A for details).
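The final table row reports rank correlation (ρ) between each metric and system ELO. A small self-contained sketch of Spearman's ρ (ignoring ties), on made-up numbers rather than the table's actual values:

```python
import numpy as np

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each entry in x
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each entry in y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Illustrative numbers only: ELO and a hypothetical scorer for four systems.
elo = np.array([1045.0, 1028.0, 840.0, 587.0])
scores = np.array([0.71, 0.67, 0.75, 0.76])
print(round(spearman_rho(elo, scores), 3))  # prints -0.8
```

A negative ρ against a "lower is better" metric (like φDIV here) is the expected sign: systems humans rank higher should score lower on it.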

Takeaways

  • Quantitative scorers, like T2I systems, do not work equally well across cultures.
  • Our ITA scorer shows strong alignment with human judgments without needing any ground-truth data.
  • Backbone matters: did you use CLIP trained on LAION to judge a model also trained on LAION? This will mislead you!
  • Observed trade-off between factuality and diversity (agrees with Kannen et al., NeurIPS '24).

See the paper for a detailed description and comparison of each scorer with real human judgments, an analysis of Gemini 2.0 Flash, and exhaustive user study details.

BibTeX

@article{rege2025cure,
  author    = {Rege, Aniket and Nie, Zinnia and Ramesh, Mahesh and Raskar, Unmesh and Yu, Zhuoran and Kusupati, Aditya and Lee, Yong~Jae and Vinayak, Ramya~Korlakai},
  title     = {CuRe: Cultural Gaps in the Long‑Tail of Text‑to‑Image Models},
  journal   = {arXiv preprint arXiv:2506.08071},
  year      = {2025},
  url       = {https://aniketrege.github.io/cure}
}