Evaluating Cultural Diversity in Text-to-Image Models

Text-to-Image (T2I) models have advanced significantly, leading to widespread global adoption. However, these improvements are not equally reflected across cultures. Evaluating the cultural knowledge of these models and assessing cross-cultural differences requires benchmarks. Current cultural benchmarks rely on simple text prompts with English cultural concepts, overlooking the complexity of individuals, locations, and their semantic and spatial interactions in realistic settings. To address this gap, we present a new T2I benchmark for evaluating cultural knowledge, containing 1) a multilingual cultural concept dataset and 2) modular prompt templates for generating compositional and complex prompts. Our dataset is built using a pipeline that automatically extracts cultural concepts from Wikipedia, which are then refined through Large Language Models and human assessment, covering 4 geographically and typologically diverse Geo-Cultures across 12 categories. With 37 prompt templates, each containing 5 unique individuals and locations per category, our framework enables comprehensive cultural evaluation of T2I models by generating up to 2.3 million unique text prompts. By comparing embedding-based models on aligned Wikipedia image-caption pairs, we demonstrate that existing metrics fail to adequately assess the generation quality of cultural concepts, and we propose an automatic metric that uses Visual Question Answering models to evaluate text-to-image alignment. Our analysis of three state-of-the-art T2I models reveals that they handle compositional prompts well but are limited in their generative capabilities by insufficient cultural knowledge. An assessment of their multilingual understanding, carried out by translating prompts into the concept’s native language and evaluating cross-lingual consistency, reveals a bias in non-multilingual models towards Western languages. This underscores the need to improve cross-cultural and multilingual capabilities in T2I models.
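As a rough illustration of how modular templates can multiply into a large prompt set, the following Python sketch fills template slots with every combination of individuals, locations, and cultural concepts. The template wording, the slot names (`individual`, `location`, `concept`), the helper `expand_prompts`, and all example values are assumptions made for illustration; they are not taken from the benchmark, and the sketch does not reproduce its actual prompt count.

```python
from itertools import product

# Hypothetical toy values for illustration only; not taken from the benchmark.
templates = [
    "A photo of {individual} holding {concept} in {location}.",
    "{individual} standing next to {concept} at {location}.",
]
individuals = ["a child", "an elderly woman"]
locations = ["a market", "a kitchen"]
concepts = ["a baguette", "a tagine"]


def expand_prompts(templates, individuals, locations, concepts):
    """Yield one filled-in prompt per combination of template and slot values."""
    for template, individual, location, concept in product(
        templates, individuals, locations, concepts
    ):
        yield template.format(
            individual=individual, location=location, concept=concept
        )


prompts = list(expand_prompts(templates, individuals, locations, concepts))
print(len(prompts))  # 2 templates * 2 individuals * 2 locations * 2 concepts = 16
```

Because the number of prompts is the product of the sizes of all slot lists, even modest lists of templates, individuals, locations, and concepts grow quickly into a very large prompt set.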

Keywords

Text-to-Image Generat...

Multimodal Models

Cultural Benchmarking...

Cross-Cultural Evalua...

Multilingual Datasets...

Visual Question Answe...

Multilingual Understa...

Language

English

Department/Field

20 Department of Computer Science > Ubiquitous Knowledge Processing

DDC

000 Generalities, computer science, information science > 004 Computer science

Institution

Technische Universität Darmstadt

Location

Darmstadt

Date of oral examination

5 December 2024

Reviewers

Gurevych, Iryna

Liu, Chen Cecilia

Name of degree-granting institution

Technische Universität Darmstadt

Location of degree-granting institution

Darmstadt

PPN

532047788