3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models

Yuhan Zhang* 1,3, Mengchen Zhang* 2,3, Tong Wu† 4, Tengfei Wang3,
Gordon Wetzstein4, Dahua Lin5, Ziwei Liu† 6
1Fudan University 2Zhejiang University 3Shanghai Artificial Intelligence Laboratory 4Stanford University 5The Chinese University of Hong Kong 6S-Lab, Nanyang Technological University
* These authors contributed equally to this work. † Corresponding author.

Overview of 3DGen-Bench, a large-scale, comprehensive human preference dataset for 3D generative models. For efficient data collection, we build the 3DGen-Arena platform, which gathers preferences through pairwise battles. Based on the annotated data, we perform a comprehensive evaluation of state-of-the-art 3D generative models and propose two scoring models, 3DGen-Score and 3DGen-Eval, which serve as automated 3D evaluators that align well with human judgments.
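
The arena collects pairwise votes between anonymized models. A standard way to turn such votes into a leaderboard is an Elo-style rating, as popularized by LLM arenas; whether 3DGen-Arena uses exactly this update and these constants is an assumption, so the Python sketch below is illustrative only.

    from collections import defaultdict

    def elo_update(ratings, winner, loser, k=32.0, scale=400.0):
        """One Elo update after a pairwise battle that `winner` won."""
        ra, rb = ratings[winner], ratings[loser]
        # Expected win probability of `winner` under the logistic Elo model.
        expected = 1.0 / (1.0 + 10.0 ** ((rb - ra) / scale))
        ratings[winner] = ra + k * (1.0 - expected)
        ratings[loser] = rb - k * (1.0 - expected)

    # Hypothetical battle log: (model_a, model_b, vote), vote in {"a", "b"}.
    battles = [("model_a", "model_b", "a"), ("model_b", "model_c", "b")]

    ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000
    for a, b, vote in battles:
        winner, loser = (a, b) if vote == "a" else (b, a)
        elo_update(ratings, winner, loser)

    print(sorted(ratings.items(), key=lambda kv: -kv[1]))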

Video

Abstract

3D generation is advancing rapidly, while the development of 3D evaluation has not kept pace. Keeping automatic evaluation equitably aligned with human perception has become a well-recognized challenge. Recent advances in language and image generation have explored human preferences and shown a respectable ability to fit them. However, the 3D domain still lacks such a comprehensive preference dataset for generative models.

To fill this gap, we develop 3DGen-Arena, an integrated platform for pairwise battles. We carefully design diverse text and image prompts and use the arena to gather human preferences from both public users and expert annotators, resulting in 3DGen-Bench, a large-scale, multi-dimensional human preference dataset. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models unify the quality evaluation of text-to-3D and image-to-3D generation and, with their complementary strengths, jointly form our automated evaluation system.

Extensive experiments demonstrate the efficacy of our scoring models in predicting human preferences, exhibiting superior correlation with human rankings compared to existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.
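
The exact architecture and training loss of 3DGen-Score are specified in the paper; the sketch below only illustrates the general recipe of a CLIP-based preference scorer: encode the prompt and rendered views with a frozen CLIP backbone, map the features to a scalar with a small head (the linear `head` here is a hypothetical choice), and train on annotated preference pairs with a Bradley-Terry-style ranking loss.

    import torch
    import torch.nn.functional as F
    from transformers import CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    # Illustrative score head; the real 3DGen-Score head may differ.
    head = torch.nn.Linear(2 * clip.config.projection_dim, 1)

    def score(prompt, views):
        """Score one asset: `views` is a list of PIL renders of the asset."""
        inputs = proc(text=[prompt], images=views, return_tensors="pt", padding=True)
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        feats = torch.cat([F.normalize(txt, dim=-1),
                           F.normalize(img.mean(0, keepdim=True), dim=-1)], dim=-1)
        return head(feats).squeeze()

    def preference_loss(prompt, views_win, views_lose):
        # Bradley-Terry objective: the human-preferred asset should score higher.
        return -F.logsigmoid(score(prompt, views_win) - score(prompt, views_lose))

At evaluation time, score() alone suffices to rank candidate assets for a given prompt.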

Data Construction

Prompts

6 basic domains: 1,000+ prompts spanning 270+ categories.

Models

19 generative models: 9 for text-to-3D and 13 for image-to-3D (models that support both settings are counted once).

Text-to-3D

Human evaluation of 9 text-to-3D models

Image-to-3D

Human evaluation of the top 9 image-to-3D models

3DGen-Evaluator

Visual Results

High Human Alignment

We quantitatively measure consistency with human judgments and compare our evaluators against existing metrics; a sketch of such a computation follows.
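
As a minimal sketch of such a consistency computation (with made-up ranks, not the paper's results), one can report rank correlation between metric-induced and human rankings, plus agreement with individual pairwise votes:

    from scipy.stats import kendalltau, spearmanr

    # Made-up ranks for nine models: human rank (1 = best) vs. the rank
    # induced by an automatic metric; real numbers come from 3DGen-Bench.
    human_rank  = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    metric_rank = [1, 3, 2, 4, 6, 5, 7, 9, 8]

    tau, _ = kendalltau(human_rank, metric_rank)
    rho, _ = spearmanr(human_rank, metric_rank)
    print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")

    def pairwise_accuracy(scores_a, scores_b, votes):
        """Fraction of human pairwise votes ('a' or 'b') the scorer agrees with."""
        hits = sum((sa > sb) == (v == "a")
                   for sa, sb, v in zip(scores_a, scores_b, votes))
        return hits / len(votes)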

Application

We optimize generative models by using 3DGen-Score as a reward model, taking MVDream as an example; a qualitative comparison and a training-step sketch are shown below.

Qualitative comparison of MVDream outputs before ("Original") and after optimization ("+ Score Reward"), on prompts including:
- Several chairs are arranged around the table
- A rusted anchor, its chains worn with age, lies forgotten on the sandy ocean floor
- A traffic light stands tall on the intersection, with red, yellow, and green lights
- A dog took shelter from the rain under an umbrella
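
The paper's exact reward-optimization procedure is described there; as a minimal sketch under assumed components, one common recipe adds the score on differentiably rendered views as a reward term to the generator's own objective. `render`, `generator_loss`, `scorer`, and `reward_weight` below are hypothetical placeholders, not MVDream's actual training code.

    def training_step(params, prompt, render, generator_loss, scorer,
                      optimizer, reward_weight=0.1):
        """One refinement step; all callables here are hypothetical placeholders."""
        views = render(params)                 # differentiable renders of the asset
        base = generator_loss(params, prompt)  # the generator's own objective (e.g. SDS)
        reward = scorer(prompt, views)         # 3DGen-Score used as the reward signal
        loss = base - reward_weight * reward   # descend the loss, ascend the reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()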

BibTeX


      @misc{zhang20253dgenbenchcomprehensivebenchmarksuite,
        title={3DGen-Bench: Comprehensive Benchmark Suite for 3D Generative Models}, 
        author={Yuhan Zhang and Mengchen Zhang and Tong Wu and Tengfei Wang and Gordon Wetzstein and Dahua Lin and Ziwei Liu},
        year={2025},
        eprint={2503.21745},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2503.21745}, 
      }