CreativityPrism

A benchmark for creative reasoning in LLMs: Quality · Novelty · Diversity

17 LLMs · 8 Tasks · 3 Dimensions · 17 Metrics

What is CreativityPrism?

Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as generating creative text, there is still no holistic and scalable framework for evaluating their creativity across diverse scenarios. Existing methods of LLM creativity evaluation either rely heavily on humans, limiting speed and scalability, or are fragmented across different domains and different definitions of creativity. To address this gap, we propose CREATIVITYPRISM, an evaluation and analysis framework that consolidates eight tasks from three domains (divergent thinking, creative writing, and logical reasoning) into a taxonomy of creativity that emphasizes three dimensions: quality, novelty, and diversity of LLM generations. The framework is designed to be scalable, with reliable automatic evaluation judges that have been validated against human annotations. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CREATIVITYPRISM and find that while proprietary LLMs lead open-source ones by 15% on creative writing and logical reasoning tasks, they offer no significant advantage in divergent thinking, a domain much less explored in existing post-training regimes. Our analysis also shows that high performance in one creative dimension or domain rarely generalizes to others; specifically, novelty metrics often show weak or negative correlations with other metrics. This fragmentation confirms that a holistic, multi-dimensional framework like CREATIVITYPRISM is essential for meaningful assessment of LLM creativity.
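
The claim that novelty correlates weakly or negatively with other metrics can be checked with a rank correlation over per-model scores. The following is a minimal sketch, not the benchmark's own analysis code: it assumes Spearman correlation as the measure, and the dimension names and score values in `scores` are made up purely for illustration (they are not CreativityPrism results).

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-model scores (one entry per model); illustrative only.
scores = {
    "quality": [0.82, 0.75, 0.68, 0.60, 0.55],
    "novelty": [0.40, 0.52, 0.47, 0.61, 0.58],
    "diversity": [0.70, 0.66, 0.71, 0.59, 0.50],
}

for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: {spearman(scores[a], scores[b]):+.2f}")
```

A rank-based measure is used here because leaderboard metrics often sit on different scales; whether the original analysis used Spearman, Pearson, or another statistic is not stated in this section.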


Overview

CreativityPrism overview diagram
Domains, tasks, and metric dimensions.

Leaderboard

Model | Overall | Quality | Novelty | Diversity | Creative Writing | Divergent Thinking | Logical Reasoning

Overall Results

Overall performance bar chart

Detailed Results

Quality performance

Novelty performance

Diversity performance

Creative writing performance

Divergent thinking performance

Logical reasoning performance