Publishing Environmental Data APIs for use in AI workflows: recommendations and demonstrators of a standard approach within the NERC Environmental Data Service
The advent of artificial intelligence as a scientific tool is driving new demand for multidisciplinary
data analyses that cross-cut scientific domain boundaries. Developing
interoperable solutions for data delivery opens new avenues for scientific investigation.
This report summarises work on developing interoperability tools for environmental
research.
The NERC Environmental Data Service (EDS) consists of five domain-specific data
centres supplying data to environmental scientists. The data is findable through data
catalogues and web search engines, thanks to decades of collaborative effort
implementing standardised discovery metadata. However, the access methods, formats
and content of the delivered data vary, and users must spend time navigating
and understanding them. Data access through Application Programming Interfaces (APIs)
is preferred over bulk data download because it allows programmatic querying and
repeatable workflows, and it is recommended for data that is large, complex
or continuously updated.
This project’s specific aim was a greater level of standardisation of data access APIs
across the EDS, with a particular focus on their use in AI and machine learning (ML)
applications. Standardisation will reduce the effort needed by the EDS as data publishers and by
environmental researchers as data consumers, saving development time and easing
data integration. It also supports systematic AI analysis of multiple
environmental data types to underpin the development of predictive environmental
modelling and digital twins.
Through co-design and Agile development processes, we identified and recommended
the MLCommons Croissant specification as a common standard to help ML consumers
interface with data APIs, and bulk downloads, of any design. Croissant extends
existing metadata standards, is understood by web search engines and AI agents, which
supports findability, and is integrated into ML Python libraries and popular ML platforms, which
supports usability.
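To give a sense of what such a descriptor contains, the following is a minimal sketch of a Croissant-style JSON-LD document for a single CSV file, built with the Python standard library only. The dataset name, URL and column names are hypothetical placeholders, the JSON-LD context is abbreviated, and the file checksum is omitted; the sketch has not been validated against the Croissant specification and is not one of the project's descriptors.

```python
import json

# Hypothetical placeholder: not a real EDS dataset or endpoint.
DATA_URL = "https://example.org/data/river-levels.csv"


def make_field(record_set_id: str, column: str, data_type: str) -> dict:
    """Describe one CSV column as a Croissant field drawn from the file object."""
    return {
        "@type": "cr:Field",
        "@id": f"{record_set_id}/{column}",
        "dataType": data_type,
        "source": {
            "fileObject": {"@id": "levels.csv"},
            "extract": {"column": column},
        },
    }


def build_descriptor() -> dict:
    """Return a minimal Croissant-style description of one CSV distribution."""
    return {
        # Abbreviated context (assumption): a real descriptor should use the
        # full context published with the Croissant specification.
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
            "dct": "http://purl.org/dc/terms/",
            "conformsTo": "dct:conformsTo",
        },
        "@type": "sc:Dataset",
        "conformsTo": "http://mlcommons.org/croissant/1.0",
        "name": "river_levels_demo",
        "description": "Illustrative river level time series (hypothetical).",
        "url": "https://example.org/datasets/river-levels",
        # Where the bytes live: could equally point at a bulk download or an API URL.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": "levels.csv",
                "contentUrl": DATA_URL,
                "encodingFormat": "text/csv",
            }
        ],
        # How the bytes map onto typed records that an ML library can iterate over.
        "recordSet": [
            {
                "@type": "cr:RecordSet",
                "@id": "levels",
                "field": [
                    make_field("levels", "time", "sc:Text"),
                    make_field("levels", "level_m", "sc:Float"),
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_descriptor(), indent=2))
```

The same two-part structure, a distribution that says where the data is served and record sets that say how to read it, is what allows one descriptor format to sit in front of differently designed APIs and bulk downloads.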
We created a number of Croissant descriptors from each of the data centres, a new data
API, and extensions of metadata APIs to serve Croissant metadata. We created
demonstrator ML workflow notebooks using the Croissant descriptors and data APIs and
ran these on different data science platforms to demonstrate portability.
Croissant [26] is a relatively new standard and was not built primarily for data access by API
or for multi-dimensional spatiotemporal data. We identified areas where Croissant and
its implementing libraries could work better for these use cases, such as use of the
emerging geo-croissant extension, integration with OpenAPI [38] specifications, and
support for authenticated data access.
At the API implementation level our recommendations were more flexible and in line
with existing EDS practice: use API standards appropriate to the data type (e.g. OGC
[37], STAC [43]) and describe APIs using the OpenAPI specification.
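As an illustration of why a machine-readable API description helps repeatable workflows, the sketch below holds a very small OpenAPI 3 document as a Python dictionary and uses it to check and build a request URL. The path and the `bbox`, `datetime` and `limit` query parameters follow the OGC API - Features pattern; the server URL, the collection name and the helper functions are hypothetical and do not describe any EDS service.

```python
from urllib.parse import urlencode

# Hypothetical OpenAPI 3 description of one OGC API - Features style endpoint.
OPENAPI_DOC = {
    "openapi": "3.0.3",
    "info": {"title": "Demo observations API (hypothetical)", "version": "0.1.0"},
    "servers": [{"url": "https://example.org/api"}],
    "paths": {
        "/collections/{collectionId}/items": {
            "get": {
                "summary": "Fetch features from a collection",
                "parameters": [
                    {"name": "collectionId", "in": "path", "required": True,
                     "schema": {"type": "string"}},
                    {"name": "bbox", "in": "query",
                     "schema": {"type": "string"}},
                    {"name": "datetime", "in": "query",
                     "schema": {"type": "string"}},
                    {"name": "limit", "in": "query",
                     "schema": {"type": "integer", "minimum": 1}},
                ],
                "responses": {"200": {"description": "Matching features"}},
            }
        }
    },
}


def query_parameters(doc: dict, path: str, method: str = "get") -> list:
    """List the query parameter names that the description declares for a path."""
    params = doc["paths"][path][method]["parameters"]
    return [p["name"] for p in params if p["in"] == "query"]


def build_url(doc: dict, path: str, path_values: dict, **query) -> str:
    """Build a request URL, rejecting query parameters the description lacks."""
    unknown = set(query) - set(query_parameters(doc, path))
    if unknown:
        raise ValueError(f"Undeclared query parameters: {sorted(unknown)}")
    url = doc["servers"][0]["url"] + path.format(**path_values)
    return f"{url}?{urlencode(query)}" if query else url


if __name__ == "__main__":
    items_path = "/collections/{collectionId}/items"
    print(build_url(OPENAPI_DOC, items_path, {"collectionId": "boreholes"},
                    bbox="-3.5,50.0,-3.0,50.5", limit=10))
```

Because the allowed parameters come from the description rather than from hand-written client code, the same helper would work unchanged against any endpoint documented in this way.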
Details
Publication status:
Unpublished
Author(s):
Card, Chris; Heaven, Rachel; Kingdon, Andrew; Bell, Patrick; Baldwin, Alex; Carter, Jeremy; Coney, Jonathan; Cooper, Jonathan; Gonzalez Alvarez, Itahisa; Hollaway, Michael; McCormack, Matthew; Poulter, David; Stephens, Ag; Trembath, Philip