EVE is the first open-source, end-to-end initiative to build domain-specialised large language models for Earth Intelligence

Àlex R. Atrio, Antonio Lopez, Jino Rohit, Yassine El Ouahidi, Marcello Politi, Vijayasri Iyer, Umar Jamil, Sébastien Bratières, and Nicolas Longépé.

Abstract

EVE's core model, EVE-Instruct (24B), achieves strong reasoning and QA performance on new Earth Observation and Earth Sciences benchmarks while preserving general capabilities. EVE combines curated data, domain benchmarks, RAG, and hallucination detection in a production system deployed via API and GUI, with models and datasets released openly.


Contributions

  • EVE-Instruct: a specialized 24B LLM for Earth Intelligence.
  • A curated EO and Earth Sciences corpus (2.8B tokens) plus a large-scale synthetic instruction dataset (10.7B tokens).
  • The first manually created EO and Earth Sciences benchmarks (5,693 samples) covering QA and factuality.
  • A deployed RAG- and hallucination-aware chat system available via GUI and API.
  • Open release of models, datasets, and code to enable reproducible domain-specific LLM development.
System architecture of EVE depicting component interactions

Benchmark of the model

To address the lack of standardized evaluation in Earth Observation and Earth Sciences, we introduce the first manually curated benchmark suite for domain LLMs. The dataset contains 5,693 samples created through human–LLM generation and expert review by 25 domain specialists.

The benchmark evaluates multiple capabilities essential for Earth Intelligence:

  • Multiple-choice QA (single and multiple answers)
  • Open-ended QA (with and without retrieval context)
  • Hallucination detection and factual reliability

Across this suite, EVE-Instruct achieves the strongest overall performance among comparable models in its size range, leading on multiple-choice QA, hallucination detection, and open-ended QA while remaining competitive when retrieval context is provided.


Table 3: Model performance across the EO and Earth Sciences benchmark tasks presented in Table 1 (0-shot). EVE WR (win rate) is the percentage of pairwise comparisons in which EVE-Instruct is preferred over the comparison model (> 50% means EVE is preferred). Rank ↓ (lower is better) reports the average per-metric rank across MCQA multiple (IoU and Accuracy), MCQA single (Accuracy), Hallucination (F1), Open-ended QA (Judge), and Open-ended QA with Context (Judge).
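Assuming the MCQA-multiple IoU is the standard Jaccard overlap between predicted and gold answer-option sets, and the win rate is the fraction of pairwise judge comparisons won, a minimal sketch of these two metrics:

```python
def mcqa_iou(predicted: set, gold: set) -> float:
    """Jaccard overlap between predicted and gold answer-option sets."""
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

def win_rate(preferences: list) -> float:
    """Percentage of pairwise comparisons where the model is preferred."""
    return 100.0 * sum(preferences) / len(preferences)

# Example: model selects {A, C}; gold answer set is {A, B, C}
print(mcqa_iou({"A", "C"}, {"A", "B", "C"}))  # 2/3
```

The IoU rewards partial credit on multiple-answer questions, while plain accuracy would score the same prediction as simply wrong.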

Data

  • Open-access curated EO & Earth Sciences corpus: 2.8B tokens publicly released, collected from 172 sources across 22 trusted institutions, spanning scientific literature, institutional knowledge bases, and technical documentation to support reproducible domain LLM research and downstream development.
  • Synthetic instruction generation: large-scale instruction dataset totaling 10.7B released tokens (derived from ~21B generated tokens), covering QA, long-form reasoning, and multi-document tasks.
  • Evaluation benchmarks alignment: the data pipeline also produces the domain benchmark suite with 5,693 expert-reviewed samples across QA and factuality tasks.

Reliability & Grounding

  • Retrieval grounding: answers are generated using relevant documents retrieved from curated knowledge bases (~365k documents), ensuring responses are anchored in verifiable sources.
  • Evidence-aware generation: retrieved passages are injected into the model context, enabling answers that reference domain evidence rather than relying solely on parametric knowledge.
  • Hallucination detection: EVE includes a dedicated factuality stage where the model evaluates its own response and flags potential hallucinations before final output.
  • Revision/verification loop: when issues are detected, the system reformulates the query, retrieves additional evidence, generates a revised answer, and selects the most reliable version.
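The retrieve → generate → self-check → revise pipeline above can be sketched with toy stubs; the retrieval, generation, and groundedness checks here are illustrative placeholders, not EVE's actual components:

```python
def retrieve(query, kb):
    # Naive keyword retrieval over an in-memory "knowledge base".
    return [d for d in kb if any(w in d.lower() for w in query.lower().split())]

def generate(query, docs):
    # Stand-in for the LLM: answer from evidence, or admit there is none.
    return docs[0] if docs else "I could not find supporting evidence."

def is_grounded(answer, docs):
    # Stand-in factuality stage: the answer must be backed by retrieved text.
    return answer in docs

def answer_with_verification(query, kb):
    docs = retrieve(query, kb)          # ground in the knowledge base
    answer = generate(query, docs)      # evidence-aware generation
    if not is_grounded(answer, docs):
        # Revision loop: reformulate, re-retrieve, regenerate, and keep
        # whichever candidate answer passes the groundedness check.
        broadened = " ".join(set(query.lower().split()))
        docs = retrieve(broadened, kb)
        revised = generate(query, docs)
        answer = revised if is_grounded(revised, docs) else answer
    return answer

kb = ["Sentinel-2 provides 10 m optical imagery.", "SAR penetrates cloud cover."]
print(answer_with_verification("Sentinel-2 imagery resolution", kb))
# Sentinel-2 provides 10 m optical imagery.
```

In the deployed system, each stub is replaced by a real component: dense retrieval over ~365k documents, LLM generation, and the dedicated hallucination-detection stage.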

Why this matters: EVE is designed to prioritize grounded, evidence-backed responses, reducing hallucinations while preserving practical latency for real-world use.

Deployment

  • GUI interface: interactive chat environment with streaming responses, source citations, knowledge-base selection, and feedback/evaluation tools for iterative exploration.
  • Transparent retrieval: users can inspect retrieved documents, control grounding settings, and refine queries directly from the interface.
  • Production API: FastAPI-based service exposing endpoints for chat/completion, RAG queries, document ingestion, evaluation, and knowledge-base management.
  • System integration: supports programmatic access, workflow automation, and deployment within existing EO data platforms.
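Programmatic access to the FastAPI service can be sketched with a minimal stdlib client; the `/rag/query` endpoint path, request schema, and base URL below are illustrative assumptions, not the published API:

```python
import json
import urllib.request

def build_rag_request(question: str, knowledge_base: str = "eo-docs") -> dict:
    """Assemble the JSON body for a grounded RAG query (hypothetical schema)."""
    return {"query": question, "knowledge_base": knowledge_base}

def rag_query(question: str, base_url: str = "http://localhost:8000") -> dict:
    """POST the query and return the JSON response (answer plus citations)."""
    body = json.dumps(build_rag_request(question)).encode()
    req = urllib.request.Request(
        f"{base_url}/rag/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A client like this would let the service slot into existing EO data pipelines, with document ingestion and knowledge-base management exposed through analogous endpoints.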
End-to-end architecture of the deployed EVE system

Citation

If you use this project in academic or research settings, please cite:

@article{atrio2026eve,
  title   = {{EVE}: A Domain-Specific {LLM} Framework for {E}arth {I}ntelligence},
  author  = {Atrio, Àlex R. and Lopez, Antonio and Rohit, Jino and 
             El Ouahidi, Yassine and Politi, Marcello and Iyer, Vijayasri and 
             Jamil, Umar and Bratières, Sébastien and Longépé, Nicolas},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026},
}

Acknowledgements

This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, as part of the Foresight Element within the FutureEO Block 4 programme.

EVE Public Launch

EVE is launching publicly in Q2 2026. Register now to be notified when access opens. Be among the first to explore a large language model built for Earth Intelligence: trained on curated Earth Observation and Earth Sciences data and designed for real-world geospatial applications.