To address the lack of standardized evaluation in Earth Observation and Earth Sciences, we introduce the first manually curated benchmark suite for domain-specific LLMs. The dataset contains 5,693 samples produced through joint human–LLM generation and vetted by 25 domain specialists.
The benchmark evaluates multiple capabilities essential for Earth Intelligence (a schematic sample layout is sketched after the list):
- Multiple-choice QA (single and multiple answers)
- Open-ended QA (with and without retrieval context)
- Hallucination detection and factual reliability
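To make the task families concrete, the sketch below shows one possible way a benchmark sample could be represented. The field names, the `BenchmarkSample` class, and the example item are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BenchmarkSample:
    """Hypothetical layout covering the task families listed above."""
    task: str                                   # e.g. "mcqa_single", "mcqa_multiple", "open_qa", "hallucination"
    question: str
    options: Optional[List[str]] = None         # present for multiple-choice tasks
    gold_answers: List[str] = field(default_factory=list)  # correct option(s) or a reference answer
    retrieval_context: Optional[str] = None     # supplied only for context-grounded open-ended QA
    is_hallucination: Optional[bool] = None     # label for hallucination-detection samples


# A multiple-answer MCQA item: the model must select every correct option.
sample = BenchmarkSample(
    task="mcqa_multiple",
    question="Which of the following are optical Earth-observation missions?",
    options=["Sentinel-2", "Sentinel-1", "Landsat 8", "SMOS"],
    gold_answers=["Sentinel-2", "Landsat 8"],
)
```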
Across this suite, EVE-Instruct achieves the strongest overall performance among models in its size range, leading on multiple-choice QA, hallucination detection, and open-ended QA, and remaining competitive on open-ended QA when retrieval context is provided.

Table 3:
Model performance across EO and Earth Sciences benchmark tasks presented in Table 1 (0-shot). EVE WR (win rate) is the percentage of pairwise comparisons in which EVE-Instruct is preferred over the comparison model (>50% means EVE-Instruct is preferred). Rank ↓ (lower is better) reports the average per-metric rank across MCQA multiple (IoU and Accuracy), MCQA single (Accuracy), Hallucination (F1), Open-ended QA (Judge), and Open-ended QA with Context (Judge).
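To make the caption's aggregation concrete, the snippet below sketches how the reported quantities could be computed: set-level IoU for multiple-answer MCQA, the pairwise win rate, and the average per-metric rank. Function names, data shapes, and the toy numbers are illustrative assumptions, not the evaluation code or results of the benchmark.

```python
from typing import Dict, List, Sequence


def mcqa_multiple_iou(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Set IoU between predicted and gold answer options for multiple-answer
    MCQA; 1.0 when the selected option sets match exactly."""
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


def win_rate(eve_preferred: Sequence[bool]) -> float:
    """Fraction of pairwise comparisons in which EVE-Instruct is preferred
    over the comparison model; values above 0.5 mean EVE-Instruct wins."""
    return sum(eve_preferred) / len(eve_preferred)


def average_rank(per_metric_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average per-metric rank for each model (lower is better), assuming
    every metric in `per_metric_scores` is higher-is-better.
    Maps metric name -> {model name -> score}."""
    ranks: Dict[str, List[int]] = {}
    for scores in per_metric_scores.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    return {model: sum(r) / len(r) for model, r in ranks.items()}


# Toy usage: one multiple-answer question, a few pairwise preferences,
# and a two-metric ranking.
print(mcqa_multiple_iou(["A", "C"], ["A", "B", "C"]))   # 0.666...
print(win_rate([True, True, False, True]))              # 0.75
print(average_rank({
    "MCQA single (Accuracy)": {"EVE-Instruct": 0.71, "Baseline": 0.64},
    "Hallucination (F1)": {"EVE-Instruct": 0.58, "Baseline": 0.61},
}))  # {'EVE-Instruct': 1.5, 'Baseline': 1.5}
```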