CuRe: Cultural Gaps in the Long‑Tail of Text‑to‑Image Systems

1University of Wisconsin‑Madison   2University of Washington

tl;dr

  • Cultural Representativeness (CuRe): current text-to-image (T2I) systems do not represent global cultures equitably.
  • America & Europe dominate pre-training data ("head"), while the long tail overlaps more with the Global South.
  • We ask: How faithfully can T2I systems generate images for cultural artifacts from the long tail?
  • CuRe can be measured by scoring how the T2I system responds to increasing attribute specification in the text prompt.
  • Get in touch with us to benchmark your text-to-image system!
CuRe pottery teaser showing cultural gaps in T2I models

CuRe reveals how state‑of‑the‑art text‑to‑image systems perform well on concepts widely seen during pretraining (the "head" of the distribution) like (a) ceramic diyas from India but tend to misrepresent less-seen artifacts from the cultural "long-tail" such as (b) jebena from Ethiopia or (c) amphora of Hermonax from Greece, even when additional attributes are provided in the text prompt.

Abstract

Popular text‑to‑image (T2I) systems are trained on web‑scale data that is heavily Amero‑ and Euro‑centric, leading to hallucinations and misrepresentation for cultures in the Global South. We introduce CuRe, a benchmarking and scoring suite that diagnoses these cultural gaps.

CuRe couples a hierarchical dataset of 300 cultural artifacts across 64 global regions with marginal information attribution (MIA) scorers that better match human judgments of cultural representativeness, image‑text alignment, and diversity. Via a large-scale user study with more than 2,700 participants, we show that our MIA scorers correlate more strongly with real human judgments than baseline metrics.

A New Benchmark & Dataset

CuRe scorer pipeline and user-study overview

We construct the CuRe dataset with 300 cultural artifacts from 64 global regions, each described by the below attributes:

  • s - super-category (e.g. Art, Fashion, Food)
  • c - category (e.g. Dumpling)
  • n - artifact name (e.g. Banku)
  • r - region (e.g. Ghana)

We organize the CuRe dataset in a three-level hierarchy, i.e. s → c → n, r. This structure lets us ask: "How far down the long tail can a T2I model go and still be faithful?"
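The hierarchy also determines how prompts are built: each added attribute is one step of specificity down the long tail. A minimal Python sketch, with an illustrative record type and prompt templates (not the released dataset schema):

```python
from dataclasses import dataclass

# Hypothetical record mirroring the s -> c -> n, r hierarchy above;
# field names and templates are illustrative, not the dataset's actual schema.
@dataclass
class Artifact:
    super_category: str  # s, e.g. "Food"
    category: str        # c, e.g. "Dumpling"
    name: str            # n, e.g. "Banku"
    region: str          # r, e.g. "Ghana"

def prompts_with_increasing_attributes(a: Artifact) -> list[str]:
    """Prompts that add one attribute at a time, from category down to region."""
    return [
        f"An image of a {a.category.lower()}",                                    # category only
        f"An image of {a.name}",                                                  # name only
        f"An image of {a.name}, a type of {a.category.lower()}",                  # name + category
        f"An image of {a.name}, a type of {a.category.lower()} from {a.region}",  # + region
    ]

banku = Artifact("Food", "Dumpling", "Banku", "Ghana")
for p in prompts_with_increasing_attributes(banku):
    print(p)
```

Scoring how generations change across these four prompts is what the scorers in the next section formalize.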

CuRe Scorers: Measuring Cultural Representativeness

Hierarchical view of CuRe artifacts

We inject marginal information, gradually adding the name (n), region (r), and category (c) to the prompt, and feed each prompt through a T2I model fθ. A quantitative scorer φ then produces a cultural representativeness report card.

Human raters who self-identify with the culture supply the gold judgment φ*, letting us see which automatic metrics actually track human perception.

Examples of φPS, φITA, φDIV metrics and why baselines fail
  • φPS – Perceptual Similarity: visual closeness of an artifact T2I image (Banku) to its category T2I image (Dumpling)
  • φITA – Image-Text Alignment: how closely an artifact's generated image tracks the attribute changes specified in the text prompt
  • φDIV – Diversity: do attribute changes (Banku → Banku, a type of dumpling) increase generation diversity?
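As a rough illustration of what each scorer measures, here is a minimal numpy sketch over generic embedding vectors. The actual MIA scorers use specific vision and vision-language backbones and the precise formulation in the paper; treat these functions as shape-of-the-metric only:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def phi_ps(artifact_img_emb: np.ndarray, category_img_emb: np.ndarray) -> float:
    # Perceptual similarity: closeness of the artifact generation ("Banku")
    # to its category generation ("Dumpling") in an image-embedding space.
    return cosine(artifact_img_emb, category_img_emb)

def phi_ita(img_emb: np.ndarray, text_emb_before: np.ndarray,
            text_emb_after: np.ndarray) -> float:
    # Image-text alignment: does the image's alignment improve when
    # attributes are added to the prompt text?
    return cosine(img_emb, text_emb_after) - cosine(img_emb, text_emb_before)

def phi_div(img_embs: list[np.ndarray]) -> float:
    # Diversity: mean pairwise cosine distance among generations for one prompt.
    n = len(img_embs)
    dists = [1.0 - cosine(img_embs[i], img_embs[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))
```

For example, identical generations give φDIV of 0, and an artifact whose image embedding matches its category's gives φPS of 1.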

Large-Scale User Study

Benchmark Results

| T2I System | ELO ↑ | φ*CuRe ↑ | φ*PS ↑ | φ*GT ↑ | φPS ↓ (SL2 / DN2 / AV2) | φITA ↑ (SL2 / L2B / WIT) | φDIV ↓ |
|---|---|---|---|---|---|---|---|
| FLUX.1 [dev] | 1045 | 2.814 ± 1.424 | 2.157 ± 1.141 | 2.251 ± 1.326 | 0.561 ± 0.110 / 0.575 ± 0.137 / 0.523 ± 0.049 | 0.094 ± 0.051 / 0.218 ± 0.059 / 0.209 ± 0.039 | 0.708 ± 0.078 |
| Ideogram 2.0 | 1043 | — | — | — | — | 0.096 ± 0.052 / 0.214 ± 0.067 / 0.195 ± 0.050 | 0.693 ± 0.072 |
| SD 3.5 Large | 1028 | 2.986 ± 1.439 | 2.396 ± 1.228 | 2.534 ± 1.393 | 0.567 ± 0.107 / 0.604 ± 0.166 / 0.532 ± 0.056 | 0.115 ± 0.047 / 0.251 ± 0.053 / 0.225 ± 0.036 | 0.670 ± 0.082 |
| DALL-E 3* | ≈ 922 | — | — | — | 0.562 ± 0.103 / 0.579 ± 0.143 / 0.525 ± 0.055 | 0.105 ± 0.051 / 0.219 ± 0.062 / 0.222 ± 0.041 | 0.789 ± 0.043 |
| SDXL | ≈ 840 | — | — | — | 0.557 ± 0.100 / 0.579 ± 0.151 / 0.520 ± 0.049 | 0.113 ± 0.051 / 0.255 ± 0.056 / 0.230 ± 0.039 | 0.753 ± 0.042 |
| SD 1.5* | ≈ 587 | 2.724 ± 1.412 | 2.094 ± 1.159 | 2.175 ± 1.291 | 0.559 ± 0.104 / 0.576 ± 0.142 / 0.519 ± 0.041 | 0.107 ± 0.050 / 0.240 ± 0.055 / 0.229 ± 0.035 | 0.755 ± 0.057 |
| ρ with ELO | 1.00 | 0.564 | −0.100 | 0.564 | −0.600 | −0.657 / −0.829 | −0.600 |

* Moderate refusal rates due to safety filters (see Appendix A for details).
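The final table row reports rank correlation (ρ) between each metric and system ELO. A small self-contained sketch of Spearman's ρ (ignoring ties), on made-up numbers rather than the table's actual values:

```python
import numpy as np

def spearman_rho(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)  # rank of each entry in x
    ry = np.argsort(np.argsort(y)).astype(float)  # rank of each entry in y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Illustrative numbers only: ELO and a hypothetical scorer for four systems.
elo = np.array([1045.0, 1028.0, 840.0, 587.0])
scores = np.array([0.71, 0.67, 0.75, 0.76])
print(round(spearman_rho(elo, scores), 3))  # prints -0.8
```

A negative ρ against a "lower is better" metric (like φDIV here) is the expected sign: systems humans rank higher should score lower on it.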

Takeaways

  • Quantitative scorers, like T2I systems, do not work equally well across cultures.
  • Our ITA scorer shows strong alignment with human judgments without needing any ground-truth data.
  • Backbone matters: did you use CLIP trained on LAION to judge a model also trained on LAION? This will mislead you!
  • Observed trade-off between factuality and diversity (agrees with Kannen et al., NeurIPS '24).

See the paper for a detailed description and comparison of each scorer with real human judgments, an analysis of Gemini 2.0 Flash, and exhaustive user study details.

BibTeX

@article{rege2025cure,
  author    = {Rege, Aniket and Nie, Zinnia and Ramesh, Mahesh and Raskar, Unmesh and Yu, Zhuoran and Kusupati, Aditya and Lee, Yong~Jae and Vinayak, Ramya~Korlakai},
  title     = {CuRe: Cultural Gaps in the Long‑Tail of Text‑to‑Image Models},
  journal   = {arXiv preprint arXiv:2506.08071},
  year      = {2025},
  url       = {https://aniketrege.github.io/cure}
}