The data lake constitutes the central infrastructure of the project, designed to support the acquisition, standardisation, and analysis of heterogeneous datasets at national scale. Requirements are organised into two categories: functional and non-functional.
RF1: Heterogeneous Data Acquisition
The system must support the acquisition of over 180 datasets from diverse public sources, as catalogued in the project data catalogue. By institution, ISTAT is the primary source with 113 datasets (62%), followed by the Ministry of Education with 35 datasets (19%). Complementary sources include Agenas for healthcare facilities (6 datasets), ISPRA for environmental data (6 datasets), the Ragioneria Generale dello Stato (4 datasets), the Ministry of Health (3 datasets), the Ministry of Justice (2 datasets), and 8 other public bodies with smaller contributions. The variety of distribution formats (CSV, XLSX, HTML, PDF) requires differentiated parsing capabilities for each type.
RF2: Historical Series Management
The temporal dimension covers the period 2010-2025, for a total of 15 years of observations. The reference granularity is the municipal level, entailing management of approximately 8,000 Italian municipalities. The system must handle temporal discontinuities introduced by the COVID-19 pandemic in 2020-2021, during which many surveys were interrupted or underwent methodological changes. It must also track municipal mergers and splits occurring in the period through a historical lookup system.
RF3: Data Standardisation
The standardisation process involves several critical operations. ISTAT codes must be normalised to ensure consistency of territorial references. Spatial coordinates are aligned to the reference system EPSG:32632 (WGS84 / UTM zone 32N). The system must distinguish between missing values and semantic nulls, applying differentiated management strategies. Reconciliation of municipal mergers and splits ensures continuity of historical series even in the presence of administrative changes.
RF4: Medallion Pattern
The data architecture follows the Medallion pattern, articulated in three distinct layers. The Bronze layer stores immutable raw data, exactly as acquired from sources. The Silver layer applies standardisation through an EAV (Entity-Attribute-Value) schema that manages the variability of municipal attributes over time. The Gold layer produces analytical data marts optimised for business applications, with denormalised tables and PostGIS geometries ready for spatial querying. Implementation details are described in chapter 2.
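The Silver layer's EAV shape can be illustrated with a small melt function: one wide record per municipality becomes one row per attribute, which absorbs the fact that different municipalities carry different attributes in different years. The function and field names below are a sketch, not the project schema; the population figure is illustrative.

```python
from datetime import date

def to_eav(record: dict, entity_key: str, observed: date) -> list[tuple]:
    """Melt one wide record into (entity, attribute, value, observed_at) rows,
    the core shape of an Entity-Attribute-Value table."""
    entity = record[entity_key]
    return [
        (entity, attr, value, observed)
        for attr, value in record.items()
        if attr != entity_key and value is not None
    ]

rows = to_eav(
    {"istat_code": "001272", "population": 848748, "area_km2": 130.0},
    entity_key="istat_code",
    observed=date(2023, 1, 1),
)
```

Attributes absent from a given record simply produce no row, so the schema never needs to change when a new source adds a new municipal attribute.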
RNF1: Completeness
The system must guarantee coverage equal to or greater than 95% of Italian municipalities for each priority dataset. The reference metric is the percentage of municipalities with complete data relative to the total number of municipalities existing in the reference period.
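The metric reduces to a set ratio. A sketch of the check, with hypothetical inputs (`covered` is the set of municipality codes with complete data for a dataset, `universe` the codes existing in the reference period):

```python
def completeness(covered: set[str], universe: set[str]) -> float:
    """Share of the municipality universe that has complete data."""
    if not universe:
        return 0.0
    return len(covered & universe) / len(universe)

# A priority dataset passes RNF1 when completeness(...) >= 0.95.
```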
RNF2: Accuracy
Accuracy requires that at least 99.9% of records contain a valid ISTAT code. Validation is based on comparison with the reference dataset of official administrative boundaries published by ISTAT.
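A record-level version of the same check, sketched with hypothetical field names: each record's code is compared against the official reference set, and the invalid records are returned for quarantine rather than silently dropped.

```python
def validate_codes(records: list[dict], reference: set[str]):
    """Split records into (valid, invalid) by ISTAT code membership
    in the official reference set of administrative boundaries."""
    valid = [r for r in records if r.get("istat_code") in reference]
    invalid = [r for r in records if r.get("istat_code") not in reference]
    accuracy = len(valid) / len(records) if records else 1.0
    return valid, invalid, accuracy

# RNF2 requires accuracy >= 0.999 for each dataset.
```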
RNF3: Timeliness
Datasets with monthly update frequency must be processed within 5 days of their availability. Priority ETL pipelines must complete processing with a latency of less than 24 hours.
RNF4: Lineage
Every record must be traceable through the entire transformation chain, from the original source to the Bronze layer, then to the Silver layer, and finally to the Gold layer. The system must maintain versioning of data sources to allow historical reconstruction of transformations.
RNF5: Scalability
The system is sized to support a data volume of up to 1TB over the three-year project period. ETL pipeline throughput must be equal to or greater than 100 records per second to guarantee acceptable processing times.
RNF6: Governance
Governance requires complete cataloguing of all datasets with structured metadata. A real-time dashboard must monitor data quality, highlighting anomalies and deviations from defined thresholds. All operations on sensitive data must be recorded in complete logs for audit purposes.
The initial phase of the project (months 1-4) focuses on the acquisition and validation of 5 critical datasets, which constitute the minimum information base for developing the SLO algorithms. The following table summarises the main characteristics of each dataset.
| Dataset | Source | Format | Frequency | Priority |
|---|---|---|---|---|
| Municipal boundaries + metadata | ISTAT | Shapefile/JSON | Annual | High |
| Resident population | ISTAT | CSV | Annual | High |
| Commuting matrices | ISTAT | CSV | Decennial | High |
| Public transport network | Transit operators | GTFS (ZIP) | Monthly | Medium |
| Healthcare facilities | Min. Salute | Excel | Annual | Medium |
The system architecture is subject to four fundamental constraints, derived from both strategic choices and specific technical requirements.
Constraint V1: Self-Hosted Open Source
The choice to use exclusively open source technologies in self-hosted mode responds to the need for vendor independence and cost predictability. The technology stack comprises PostgreSQL for primary storage, PostGIS for spatial operations, Prefect for pipeline orchestration, OpenMetadata for data governance, and DuckDB for embedded analytics. Managed services offered by BigQuery, Azure, and AWS are explicitly excluded, as they would introduce vendor dependencies and variable costs that are difficult to forecast.
Constraint V2: Scale Adequacy
The expected data volumes are on the order of 10^4 rows by 10^3 columns, typical of projects managed by DEPP. At this scale, adopting BigQuery would be overkill relative to actual needs, adding unjustified cost and management complexity.
Constraint V3: Territorial Standards
PostGIS is the de facto industry standard for managing spatial data in a PostgreSQL environment. The adopted reference system is EPSG:32632 (WGS84 / UTM zone 32N), which is the standard system for Italian territory and ensures compatibility with national geographic datasets.
Constraint V4: FAIR Compliance
Released datasets must comply with FAIR principles. Findable: data must be catalogued with structured metadata that facilitates discovery. Accessible: the adopted licences (CC-BY, CC0) guarantee free and open access. Interoperable: the distribution formats (GeoJSON, GeoParquet, CSV) are open standards that ensure interoperability with other systems. Reusable: the assignment of DOI guarantees scientific citability and version traceability.
The project faces four significant technical challenges, each requiring specific architectural solutions.
Challenge S1: Absence of Point-Level Geolocation
Most of the services analysed (schools, hospitals, post offices) do not have precise geographic coordinates. The approach adopted is based on detecting the presence or absence of a service at municipal level, forgoing point-level geolocation. Consequently, isochrone calculations operate on municipality-to-municipality distances rather than point-to-point.
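Municipality-to-municipality distances can be approximated from centroid coordinates. A minimal sketch using the haversine great-circle formula on WGS84 latitude/longitude (a simplification: the production pipeline works in the projected EPSG:32632 system, where planar distances apply); the coordinates below are illustrative centroids for Turin and Milan.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two municipal centroids."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Illustrative centroids (Turin, Milan); real centroids come from the
# ISTAT boundary dataset.
d = haversine_km(45.0703, 7.6869, 45.4642, 9.1900)
```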
Challenge S2: Administrative Boundary Evolution
In the period 2010-2025, approximately 100 municipal mergers or splits occurred, altering the reference territorial structure. The adopted solution uses a historical lookup with temporal validity implemented according to the Slowly Changing Dimension Type 2 (SCD Type 2) pattern. The Silver schema includes the valid_from and valid_to columns that delimit the validity interval of each territorial configuration.
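The lookup can be sketched as a date-aware resolution function over SCD Type 2 rows. The history rows below are hypothetical (the codes and merger date are invented for illustration); `valid_to = None` marks the currently open version, following the pattern described above.

```python
from datetime import date

# Hypothetical SCD Type 2 rows: (old_code, resolved_code, valid_from, valid_to).
# valid_to = None means the mapping is still in force.
HISTORY = [
    ("028001", "028106", date(2014, 2, 1), None),  # merged into a new comune
]

def resolve(code: str, on: date, history=HISTORY) -> str:
    """Map a municipality code to the code valid on the given date."""
    for old, new, valid_from, valid_to in history:
        if old == code and valid_from <= on and (valid_to is None or on < valid_to):
            return new
    return code  # no recorded change: the code is already current
```

Resolving every record through this function before loading the Silver layer is what keeps historical series continuous across administrative changes.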
Challenge S3: COVID Temporal Gaps
Data relating to the 2020-2021 period presents methodological discontinuities due to restrictions imposed by the pandemic. The system applies a covid_affected flag to impacted datasets, enabling differentiated analysis strategies. Options include statistical imputation techniques to estimate missing values, or explicit exclusion of the problematic periods from longitudinal analyses.
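The two strategies can be sketched over a year→value series. The imputation branch is deliberately naive (mean of the nearest non-COVID neighbours) and serves only to illustrate the branching that the `covid_affected` flag enables; function and strategy names are invented.

```python
COVID_YEARS = {2020, 2021}

def handle_covid_gap(series: dict[int, float],
                     strategy: str = "exclude") -> dict[int, float]:
    """Apply a handling strategy to a year -> value series flagged as
    covid_affected: drop 2020-2021, or impute them naively."""
    if strategy == "exclude":
        return {y: v for y, v in series.items() if y not in COVID_YEARS}
    if strategy == "impute":
        out = dict(series)
        lo = series.get(2019)  # last pre-COVID observation
        hi = series.get(2022)  # first post-COVID observation
        if lo is not None and hi is not None:
            for y in COVID_YEARS & set(series):
                out[y] = (lo + hi) / 2
        return out
    raise ValueError(f"unknown strategy: {strategy}")

s = {2019: 10.0, 2020: 3.0, 2021: 4.0, 2022: 12.0}
```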
Challenge S4: Format Heterogeneity
The presence of 207 datasets distributed in different formats (CSV, XLSX, HTML, PDF) requires a multi-layer extraction stack. Docling handles extraction from complex PDFs with structured tables. Pandas processes CSV and Excel files. BeautifulSoup parses HTML tables. Final normalisation occurs through the flexible EAV schema of the Silver layer, which absorbs the structural variability of sources.
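The routing logic of the multi-layer stack reduces to a dispatch table keyed by file extension. The sketch below returns backend names as strings rather than invoking the real libraries, so it only illustrates the routing described above; the table contents mirror the stack but are not the project's actual configuration.

```python
from pathlib import Path

# Dispatch table mirroring the extraction stack: each distribution
# format is routed to the tool responsible for it.
PARSERS = {
    ".csv": "pandas.read_csv",
    ".xlsx": "pandas.read_excel",
    ".html": "beautifulsoup",
    ".pdf": "docling",
}

def pick_parser(path: str) -> str:
    """Return the extraction backend for a source file, by extension."""
    ext = Path(path).suffix.lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise ValueError(f"unsupported distribution format: {ext!r}")
```

Whatever backend runs, its output converges on the same EAV rows in the Silver layer, which is what isolates downstream consumers from the format heterogeneity of the sources.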
At the conclusion of Work Package 2, the data lake must satisfy five measurable objectives:
Work Package 2 produces five documentary deliverables: