OmniEarth-Bench

Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions 
with Multimodal Observational Earth Data

Fengxiang Wang1,2, Mingshuo Chen3, Xuming He2,4, Yi-Fan Zhang9
Feng Liu2,5, Zijie Guo6, Zhenghao Hu7, Jiong Wang2,6, Jingyi Xu2,6
Zhangrui Li2,8, Fenghua Ling2, Ben Fei2, Weijia Li7
Long Lan1, Wenjing Yang1 †, Wenlong Zhang2 †, Lei Bai2
1 National University of Defense Technology, China,  2 Shanghai Artificial Intelligence Laboratory, China, 
3 Beijing University of Posts and Telecommunications, China,  4 Zhejiang University, China,
5 Shanghai Jiao Tong University, China,  6 Fudan University, China,  7 Sun Yat-sen University, China,
8 Nanjing University, China,  9 Chinese Academy of Sciences, China
📄 Paper 💻 Code 📦 Dataset

Benchmark Results (VQA)

| Method | Cross-sphere | Atmosphere | Lithosphere | Oceansphere | Cryosphere | Biosphere | Human-activities | Avg. |
|---|---|---|---|---|---|---|---|---|
| Claude-3.7-Sonnet | 30.68 | 24.72 | 28.15 | 23.12 | 54.46 | 31.21 | 11.18 | 29.07 |
| Gemini-2.0 | 16.93 | 20.83 | 38.94 | 16.94 | 58.52 | 20.83 | 23.74 | 28.10 |
| GPT-4o | 0.04 | 9.64 | 12.80 | 13.35 | 37.48 | 1.97 | 2.76 | 11.15 |
| InternVL3-72B | 19.19 | 33.98 | 23.39 | 20.22 | 74.56 | 31.99 | 29.46 | 33.26 |
| InternVL3-7B | 42.85 | 30.10 | 37.47 | 20.28 | 49.27 | 28.74 | 23.18 | 33.13 |
| LLaVA-Onevision-7B | 19.26 | 33.69 | 28.72 | 24.54 | 46.40 | 37.31 | 30.62 | 31.51 |
| InternLM-XComposer-2.5-7B | 19.78 | 17.45 | 28.88 | 21.06 | 40.04 | 30.67 | 24.76 | 26.09 |
| Qwen 2.5-VL-7B | 9.85 | 9.25 | 18.65 | 13.95 | 17.85 | 10.94 | 6.23 | 12.39 |
| Qwen 2.5-VL-72B | 3.92 | 4.82 | 22.43 | 16.27 | 5.88 | 14.91 | 8.63 | 10.98 |

Benchmark Coverage

OmniEarth-Bench evaluates models over six spheres and cross-sphere interactions:

Atmosphere 🌤️ Lithosphere 🏔️ Oceansphere 🌊 Cryosphere ❄️ Biosphere 🌳 Human Activities 🏙️ Cross-Sphere 🔄

Data & Tasks

Dataset Overview

Overview of OmniEarth-Bench

Fig 1. Overview of OmniEarth-Bench.

We introduce OmniEarth-Bench, the first comprehensive multimodal benchmark spanning all six Earth-science spheres (atmosphere, lithosphere, oceansphere, cryosphere, biosphere, and human activities) and cross-sphere scenarios, with 100 expert-curated evaluation dimensions. Leveraging observational data from satellite sensors and in-situ measurements, OmniEarth-Bench integrates 29,779 annotations across four tiers: perception, general reasoning, scientific-knowledge reasoning, and chain-of-thought (CoT) reasoning.

Dataset

Comparison & Examples

Comparison with existing benchmarks

Fig 2. Comparison with existing benchmarks.

OmniEarth-Bench defines tasks across four hierarchical levels (L1–L4): 7 L1 spheres, 23 L2 dimensions, 4 L3 dimensions, and 103 expert-defined L4 subtasks with real-world applicability.

Examples of OmniEarth-Bench

Fig 3. Representative L4 subtasks from each sphere.

Benchmark Results

For detailed results on every dimension, please refer to the paper appendix.

Experimental results on each sphere

Fig 4. Experimental results on each sphere of VQA tasks.

Following MME-CoT, we report precision, recall, and F1 on CoT tasks:

CoT performance

Fig 5. CoT performance on OmniEarth-Bench.
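The step-level metrics above can be sketched as follows. This is a minimal illustration of how precision, recall, and F1 are typically computed over matched reasoning steps in the MME-CoT formulation; the function and variable names are illustrative assumptions, not from the official OmniEarth-Bench evaluation code.

```python
# Sketch of step-level precision/recall/F1 for CoT evaluation.
# Assumed formulation: "matched" steps are predicted reasoning steps
# judged to align with the reference chain of thought.

def cot_metrics(num_matched: int, num_predicted: int, num_reference: int):
    """Return (precision, recall, f1) over reasoning steps.

    num_matched:   predicted steps matched to the reference chain
    num_predicted: total steps in the model's chain of thought
    num_reference: total steps in the reference chain of thought
    """
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_reference if num_reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Example: 6 of 8 predicted steps match a 10-step reference chain.
p, r, f1 = cot_metrics(6, 8, 10)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```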

Benchmark Quickstart

Please refer to the evaluation instructions for setup and usage.
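For orientation, scoring multiple-choice VQA predictions against an answer key reduces to a simple accuracy computation. The field names below (`prediction`, `answer`) are illustrative assumptions, not the official OmniEarth-Bench schema; consult the evaluation code for the actual format.

```python
# Minimal sketch of VQA accuracy scoring (hypothetical record layout).

def vqa_accuracy(samples) -> float:
    """samples: iterable of dicts with 'prediction' and 'answer' letters.

    Returns accuracy as a percentage, comparing choices case-insensitively.
    """
    samples = list(samples)
    correct = sum(
        1 for s in samples
        if s["prediction"].strip().upper() == s["answer"].strip().upper()
    )
    return 100.0 * correct / len(samples)

demo = [
    {"prediction": "A", "answer": "A"},
    {"prediction": "b", "answer": "B"},
    {"prediction": "C", "answer": "D"},
    {"prediction": "D", "answer": "D"},
]
print(vqa_accuracy(demo))  # 75.0
```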

Citation

```bibtex
@article{wang2025omniearth,
  title   = {OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data},
  author  = {Fengxiang Wang and Mingshuo Chen and Xuming He and others},
  journal = {arXiv preprint arXiv:2505.23522},
  year    = {2025}
}
```