Information explosion vs analysis gap
Not so long ago, collecting, storing, owning and, where necessary, digitizing data was vital for any data-driven application. Today we are swimming in data, and arguably drowning in it. An ever-increasing amount of public and commercial data is available through accompanying access APIs (Application Programming Interfaces). Yet downloading large swaths of data to local storage for subsequent in-house processing on dedicated hardware is inefficient and at odds with the big data processing philosophy. While the FAIR principles are fulfilled insofar as the data is findable, accessible, and interoperable, actually reusing the data to gain new insights still depends on the data user's local storage and processing capabilities. Scientists who are aware of the potentially available data and processing capabilities are therefore still unable to easily identify, access, and utilize the resources they need to perform their work; while the analysis gap created by the information explosion is increasingly highlighted, remediation lags behind.
Networks of data
Every individual data source describes a clearly defined aspect of a system, process, or model; answering complex questions, such as addressing EU Green Deal action items, requires networks of data from diverse sources. Data sources must be set in relation to each other, aligning the semantics of common concepts used across sources. In addition, they must reflect their provenance, expose their intrinsic dependencies and relations, and be structured in a manageable form. Some data is too big or too complex, other data too sparse or incomplete, to be processed directly by static algorithms with pre-defined parameters. Moreover, some data sources do not align in terms of underlying methodologies, such as sampling, referencing of commonly agreed concepts, or data format.
Datacubes
The concept of multidimensional datacubes can help overcome many, if not all, of the challenges mentioned above with respect to performance, scalability, interoperability, semantics, sampling, geo-referencing, and readiness for ML applications. Peter Baumann (professor of Computer Science at Constructor (formerly Jacobs) University, Bremen, Germany, and founder and CEO of Rasdaman GmbH) introduces the concept and use of datacubes in the following video clip.
While initial effort is required to transform existing point and vector sources into multidimensional grid formats and to harmonize domain semantics, the discrete spatio-temporal basis of these formats facilitates alignment and is advantageous for automated processing. Depending on the technology used, datacubes can be created either by referencing existing archive files, which generates "virtual cubes" for users, or by importing (i.e., copying) the data with re-gridding, which requires some computational effort as a pre-processing step. In both cases, the user can focus directly on data analysis and interpretation.
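As a rough illustration of the import-with-re-gridding route, the following sketch snaps scattered point observations onto a regular grid and assembles them into a small cube. It uses Python with numpy, pandas, and xarray purely as an assumed toolchain; the coordinates, grid spacing, and variable names are hypothetical and not tied to any specific datacube technology.

```python
# Sketch: importing point observations into a gridded cube via re-gridding.
# All names, coordinates, and grid spacings below are illustrative assumptions.
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical point observations: time, latitude, longitude, measured value.
points = pd.DataFrame({
    "time": pd.to_datetime(["2023-06-01", "2023-06-01", "2023-06-02"]),
    "lat": [52.1, 52.4, 52.1],
    "lon": [13.2, 13.5, 13.2],
    "value": [1.3, 2.7, 1.9],
})

# Target grid: 0.25 degree spacing (the pre-defined cube grid).
lat_grid = np.arange(52.0, 53.0, 0.25)
lon_grid = np.arange(13.0, 14.0, 0.25)

# Snap each observation to the nearest grid-cell centre (a very simple re-gridding).
points["lat"] = lat_grid[np.abs(points["lat"].values[:, None] - lat_grid).argmin(axis=1)]
points["lon"] = lon_grid[np.abs(points["lon"].values[:, None] - lon_grid).argmin(axis=1)]

# Average duplicate observations per cell and pivot into a (time, lat, lon) cube.
gridded = points.groupby(["time", "lat", "lon"]).mean()
cube = xr.Dataset.from_dataframe(gridded)
print(cube)  # a small datacube with dimensions time, lat, lon
```

A "virtual cube", by contrast, would leave the archive files in place and only expose a cube-shaped view onto them, trading the one-off re-gridding cost for extra work at query time.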
Each dimension of a datacube can represent a different property, integrating data from diverse sources along standard dimensions such as space and time; all of this data is typically hosted in one or more cloud environments. As long as the spatio-temporal dimensions of all component sources can be aligned, additional dimensions for the properties provided by these sources can be added, and each grid point of the datacube can be populated with a variety of physical, biological, socio-economic, geographical, and other properties. Given completeness, or additional gap-filling techniques applied to the datacube, accessing the dimensions associated with a single spatio-temporal grid point returns a multitude of information uniquely defined in time and/or space. Subsets of this data (data windows or ranges) can be defined and accessed from a cloud-based data hosting service, then processed and transformed into spatial features and indicators that provide insights on a wide range of properties not necessarily contained in the originally accessed data. These outputs can be visualized directly or passed down the pipeline for additional processing. Aggregating data over individual dimensions of the cube and processing across dimensions is handled efficiently by the underlying datacube engine. By providing discretely structured and aligned data, these cube capabilities allow for straightforward and efficient application of ML algorithms. For the large spatio-temporal datasets common in Earth Observation, raster formats aligned with common datacubes appear to be the most appropriate form of data storage, handling, and processing.
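To make the windowing and aggregation steps concrete, a minimal sketch follows, assuming a cube is already available as an xarray dataset; the file name, the variable names (ndvi, soil_moisture), and the coordinate ranges are illustrative placeholders, and a datacube engine such as rasdaman would expose equivalent operations through its own query interface.

```python
# Sketch: defining a data window on a multi-thematic cube and aggregating
# over individual dimensions. File name, variables, and ranges are hypothetical.
import xarray as xr

cube = xr.open_dataset("multi_thematic_cube.nc")  # assumed local or cloud-hosted cube

# Data window: a spatial bounding box combined with a time range.
window = cube.sel(
    lat=slice(50.0, 54.0),
    lon=slice(10.0, 15.0),
    time=slice("2022-01-01", "2022-12-31"),
)

# Aggregate over individual dimensions: monthly means collapse time,
# a spatial mean collapses lat/lon into an area-averaged time series.
monthly_maps = window["ndvi"].resample(time="1MS").mean()
area_series = window["soil_moisture"].mean(dim=("lat", "lon"))

# A simple cross-dimensional indicator derived from two thematic layers.
greenness_per_moisture = window["ndvi"] / window["soil_moisture"]
print(monthly_maps.dims, area_series.dims, greenness_per_moisture.dims)
```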
FAIR processing and analysis
The challenge posed by the potential of such multi-thematic datacubes pertains to processing: while established processing paradigms align well with the constrained datasets traditionally available, they must be rethought when confronting both the structure and the sheer volume of data available in cube formats. Fortunately, the experience and compute resources needed to run artificial intelligence (AI) and specifically machine learning (ML) applications have evolved significantly, developing from what was long seen as a buzzword for opaque data analysis into viable tools for extracting real value and insights from large and complex data collections. Therefore, when providing data, processing functionality, and data products to relevant stakeholders such as governmental authorities, civil society and NGOs, commercial players, or researchers, all of these aspects must be considered. Following the FAIR principles, data is becoming increasingly findable, accessible, and interoperable, but true reusability depends on the availability and functionality of suitable processing mechanisms. As ever more decisions are taken on the basis of live, historical, or synthetic data, applying the FAIR principles to analysis and processing is essential to maintaining trust in the data and analyses underpinning these actions. In this project, we aim to advance the FAIRness of both data and data analysis, and of subsequent products, by enhancing the reusability of existing data and applying the FAIR principles to advanced data analytics algorithms and concepts.
Machine learning and datacubes
Datacube management and access are well established, and processing, analysis, and visualisation have matured. Creating a cube and transforming data to match a standard grid and grid spacing is becoming less of a challenge as advanced import routines become available for diverse source data. Thanks to the scalability and performance benefits, datacubes can be used across a wide range of thematic and geographical scopes, e.g., creating a city cube or a cube covering a whole continent. A wide range of publications further demonstrates the performance, flexibility, and usability of datacubes for Earth Observation data. However, the broader potential of machine learning applied to multi-thematic datacubes has rarely been demonstrated. Most ML applications on datacubes focus on a limited number of data sources, often just on temporal steps within one dataset.
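As an illustration of how a multi-thematic cube can feed an ML algorithm, the sketch below flattens several thematic layers into a samples-by-features table and clusters the grid cells. The cube file, the variable names, and the choice of k-means are assumptions made for demonstration, not a prescription for any particular application.

```python
# Sketch: applying an ML algorithm (k-means clustering) to several thematic
# layers of a cube. The cube file and variable names are assumptions.
import xarray as xr
from sklearn.cluster import KMeans

cube = xr.open_dataset("multi_thematic_cube.nc")         # hypothetical cube
layers = ["ndvi", "soil_moisture", "land_surface_temp"]  # assumed thematic layers

# Collapse the (assumed) time dimension and stack the spatial dimensions so
# that every grid cell becomes one sample and every layer one feature.
stacked = cube[layers].mean("time").stack(sample=("lat", "lon"))
features = stacked.to_array("feature").transpose("sample", "feature")
features = features.dropna("sample")  # ML models cannot ingest gaps directly

# Unsupervised example: group grid cells into five landscape classes.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features.values)

# Reattach the class labels to their coordinates and unstack back onto the grid.
classified = features.isel(feature=0).copy(data=labels).rename("cluster").unstack("sample")
print(classified)
```

The same stacking pattern extends to supervised models: any number of thematic layers becomes the feature matrix, and a reference layer (e.g., a land-cover map) provides the labels.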