EVE is the first open-source, end-to-end initiative to build domain-specialised large language models for Earth Intelligence

Àlex R. Atrio1, Antonio Lopez1, Jino Rohit1, Yassine El Ouahidi2, Marcello Politi1, Vijayasri Iyer1, Umar Jamil2, Sébastien Bratières1,3, and Nicolas Longépé4.

1Pi School, 2Mistral AI, 3Translated, 4ESA Φ-lab

Abstract

EVE's core model, EVE-Instruct (24B), achieves strong reasoning and QA performance on new Earth Observation and Earth Sciences benchmarks while preserving general capabilities. EVE combines curated data, domain benchmarks, RAG, and hallucination detection in a production system deployed via API and GUI, with models and datasets released openly.


Contributions

  • EVE-Instruct: a specialized 24B LLM for Earth Intelligence.
  • A curated EO and Earth Sciences corpus (2.8B tokens) plus a large-scale synthetic instruction dataset (10.7B tokens).
  • The first manually created EO and Earth Sciences benchmarks (5,693 samples) covering QA and factuality.
  • A deployed RAG- and hallucination-aware chat system available via GUI and API.
  • Open release of models, datasets, and code to enable reproducible domain-specific LLM development.
System architecture of EVE depicting component interactions

Benchmarks

To address the lack of standardized evaluation in Earth Observation and Earth Sciences, we introduce the first manually curated benchmark suite for domain LLMs. The dataset contains 5,693 samples created through human–LLM generation and expert review by 25 domain specialists.

The benchmark evaluates multiple capabilities essential for Earth Intelligence:

  • Multiple-choice QA (single and multiple answers)
  • Open-ended QA (with and without retrieval context)
  • Hallucination detection and factual reliability

Across this suite, EVE-Instruct achieves the strongest overall performance among comparable models in its size range, leading on multiple-choice QA, hallucination detection, and open-ended QA while remaining competitive when retrieval context is provided.
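As a concrete illustration of how multiple-choice QA with single and multiple answers can be scored, here is a minimal exact-match sketch; the set-based scoring rule is an assumption for illustration, not necessarily the project's official metric.

```python
# Minimal sketch of multiple-choice QA scoring with single- and
# multiple-answer items. The exact-match rule is illustrative.

def score_mcq(predictions, references):
    """Exact-match accuracy: a multiple-answer item counts as correct
    only if the predicted set of options equals the gold set."""
    correct = sum(
        1 for pred, gold in zip(predictions, references)
        if set(pred) == set(gold)
    )
    return correct / len(references)

# Single-answer items are one-element lists; multiple-answer items are larger.
preds = [["B"], ["A", "C"], ["D"]]
golds = [["B"], ["A", "C"], ["C"]]
print(score_mcq(preds, golds))  # 2 of 3 exact matches
```

Under this rule, partially correct option sets on multiple-answer items score zero, which penalizes both missing and spurious selections.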

EVE-Instruct performance on EO benchmarks vs. comparable and larger models.

Data

  • Open-access curated EO & Earth Sciences corpus: 2.8B tokens publicly released, collected from 172 sources across 22 trusted institutions, spanning scientific literature, institutional knowledge bases, and technical documentation to support reproducible domain LLM research and downstream development.
  • Synthetic instruction generation: large-scale instruction dataset totaling 10.7B released tokens (derived from ~21B generated tokens), covering QA, long-form reasoning, and multi-document tasks.
  • Evaluation benchmarks alignment: the data pipeline also produces the domain benchmark suite with 5,693 expert-reviewed samples across QA and factuality tasks.
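The gap between ~21B generated tokens and 10.7B released tokens implies a filtering step over the synthetic instructions. A minimal quality-filter sketch follows; the field names, thresholds, and rejection criteria are hypothetical, not the project's actual pipeline.

```python
# Hypothetical quality filter over synthetic instruction records.
# Field names and thresholds are illustrative only.

def keep_record(record, min_len=20, max_len=8000):
    text = record["instruction"] + record["response"]
    if not (min_len <= len(text) <= max_len):
        return False  # drop degenerate or truncated generations
    if record["response"].strip().lower() in {"", "i don't know"}:
        return False  # drop non-answers
    return True

records = [
    {"instruction": "Define NDVI.",
     "response": "NDVI is a normalized index of vegetation greenness "
                 "computed from red and near-infrared reflectance."},
    {"instruction": "Explain SAR.", "response": ""},
]
kept = [r for r in records if keep_record(r)]
print(len(kept))  # 1
```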

EVE data assets: curated corpus, synthetic fine-tuning datasets, and evaluation benchmarks.

Reliability & Grounding

  • Retrieval grounding: answers are generated using relevant documents retrieved from curated knowledge bases (~365k documents), ensuring responses are anchored in verifiable sources.
  • Evidence-aware generation: retrieved passages are injected into the model context, enabling answers that reference domain evidence rather than relying solely on parametric knowledge.
  • Hallucination detection: EVE includes a dedicated factuality stage where the model evaluates its own response and flags potential hallucinations before final output.
  • Revision / verification loop: when issues are detected, the system reformulates the query, retrieves additional evidence, generates a revised answer, and selects the most reliable version.
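The retrieve, generate, self-check, and revise steps above can be sketched as a simple control loop. The functions `retrieve`, `generate`, and `detect_hallucination` below are stand-ins for EVE's actual components, and the query-reformulation strategy is hypothetical.

```python
# Sketch of a retrieve -> generate -> self-check -> revise loop.
# All components are stubs standing in for the real system.

def answer_with_verification(query, retrieve, generate, detect_hallucination,
                             max_revisions=2):
    docs = retrieve(query)
    answer = generate(query, docs)
    for _ in range(max_revisions):
        if not detect_hallucination(answer, docs):
            return answer                        # grounded: accept as-is
        reformulated = query + " (cite retrieved evidence)"  # hypothetical
        docs = docs + retrieve(reformulated)     # widen the evidence pool
        answer = generate(reformulated, docs)
    return answer                                # best effort after revisions

# Toy run with stub components:
retrieve = lambda q: [f"doc about {q.split()[0]}"]
generate = lambda q, d: f"answer grounded in {len(d)} docs"
detect = lambda a, d: len(d) < 2                 # flag until >= 2 docs found
print(answer_with_verification("NDVI saturation effects",
                               retrieve, generate, detect))
```

Capping `max_revisions` bounds the number of extra retrieval/generation rounds, which is one way to keep the latency of the verification loop practical.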

Why this matters: EVE is designed to prioritize grounded, evidence-backed responses, reducing hallucinations while preserving practical latency for real-world use.

Deployment

  • GUI interface: interactive chat environment with streaming responses, source citations, knowledge-base selection, and feedback/evaluation tools for iterative exploration.
  • Transparent retrieval: users can inspect retrieved documents, control grounding settings, and refine queries directly from the interface.
  • Production API: FastAPI-based service exposing endpoints for chat/completion, RAG queries, document ingestion, evaluation, and knowledge-base management.
  • System integration: supports programmatic access, workflow automation, and deployment within existing EO data platforms.
  • Visual walkthrough: GUI screens and representative FastAPI endpoints are showcased to illustrate real usage and developer integration.
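For programmatic access along the lines described above, a client would build a JSON request and POST it to the service. The endpoint path, knowledge-base identifier, and field names below are assumptions for illustration; consult the released API documentation for the real schema.

```python
# Minimal sketch of a client request payload for the chat/RAG API.
# All field names and identifiers are hypothetical.
import json

payload = {
    "query": "How does Sentinel-2 revisit time vary with latitude?",
    "knowledge_base": "eo-curated",   # hypothetical knowledge-base id
    "use_rag": True,
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")
# A real client would then POST `body` to the service's RAG query endpoint.
print(json.loads(body)["use_rag"])
```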
End-to-end architecture of the deployed EVE system

Citation

If you use this project in academic or research settings, please cite:

@misc{atrio2026evedomainspecificllmframework,
      title={{EVE}: A Domain-Specific {LLM} Framework for Earth Intelligence},
      author={Àlex R. Atrio and Antonio Lopez and Jino Rohit and Yassine El Ouahidi and Marcello Politi and Vijayasri Iyer and Umar Jamil and Sébastien Bratières and Nicolas Longépé},
      year={2026},
      eprint={2604.13071},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.13071},
}

Acknowledgements

This project is supported by the European Space Agency (ESA) Φ-lab through the Large Language Model for Earth Observation and Earth Science project, as part of the Foresight Element within FutureEO Block 4 programme.

EVE Public Launch

EVE is launching publicly in Q2 2026. Register now to be notified when access opens. Be among the first to explore a foundation model built for Earth Observation: trained on curated Earth Observation and Earth Sciences data and designed for real-world geospatial applications.