The data lake constitutes the central infrastructure of the project, designed to support the acquisition, standardisation, and analysis of heterogeneous datasets at national scale. Requirements are organised into two categories: functional and non-functional.
RF1: Heterogeneous Data Acquisition
The system must support the acquisition of over 180 datasets from diverse public sources, as catalogued in the project data catalogue. By institution, ISTAT is the primary source with 113 datasets (62%), followed by the Ministry of Education with 35 datasets (19%). Complementary sources include Agenas for healthcare facilities (6 datasets), ISPRA for environmental data (6 datasets), the Ragioneria Generale dello Stato (4 datasets), the Ministry of Health (3 datasets), the Ministry of Justice (2 datasets), and 8 other public bodies with smaller contributions. The variety of distribution formats (CSV, XLSX, HTML, PDF) requires differentiated parsing capabilities for each type.
RF2: Historical Series Management
The temporal dimension covers the period 2010-2025, for a total of 15 years of observations. The reference granularity is the municipal level, entailing management of approximately 8,000 Italian municipalities. The system must handle temporal discontinuities introduced by the COVID-19 pandemic in 2020-2021, during which many surveys were interrupted or underwent methodological changes. It must also track municipal mergers and splits occurring in the period through a historical lookup system.
RF3: Data Standardisation
The standardisation process involves several critical operations. ISTAT codes must be normalised to ensure consistency of territorial references. Spatial coordinates are aligned to the reference system EPSG:32632 (WGS84 / UTM zone 32N). The system must distinguish between missing values and semantic nulls, applying differentiated management strategies. Reconciliation of municipal mergers and splits ensures continuity of historical series even in the presence of administrative changes.
RF4: Medallion Pattern
The data architecture follows the Medallion pattern, articulated in three distinct layers. The Bronze layer stores immutable raw data, exactly as acquired from sources. The Silver layer applies standardisation through an EAV (Entity-Attribute-Value) schema that manages the variability of municipal attributes over time. The Gold layer produces analytical data marts optimised for business applications, with denormalised tables and PostGIS geometries ready for spatial querying. Implementation details are described in chapter 2.
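The Silver layer's EAV shape can be illustrated with a small melt function: one wide record per municipality becomes one row per attribute, which absorbs the fact that different municipalities carry different attributes in different years. The function and field names below are a sketch, not the project schema; the population figure is illustrative.

```python
from datetime import date

def to_eav(record: dict, entity_key: str, observed: date) -> list[tuple]:
    """Melt one wide record into (entity, attribute, value, observed_at) rows,
    the core shape of an Entity-Attribute-Value table."""
    entity = record[entity_key]
    return [
        (entity, attr, value, observed)
        for attr, value in record.items()
        if attr != entity_key and value is not None
    ]

rows = to_eav(
    {"istat_code": "001272", "population": 848748, "area_km2": 130.0},
    entity_key="istat_code",
    observed=date(2023, 1, 1),
)
```

Attributes absent from a given record simply produce no row, so the schema never needs to change when a new source adds a new municipal attribute.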
RNF1: Completeness
The system must guarantee coverage equal to or greater than 95% of Italian municipalities for each priority dataset. The reference metric is the percentage of municipalities with complete data relative to the total number of municipalities existing in the reference period.
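The metric reduces to a set ratio. A sketch of the check, with hypothetical inputs (`covered` is the set of municipality codes with complete data for a dataset, `universe` the codes existing in the reference period):

```python
def completeness(covered: set[str], universe: set[str]) -> float:
    """Share of the municipality universe that has complete data."""
    if not universe:
        return 0.0
    return len(covered & universe) / len(universe)

# A priority dataset passes RNF1 when completeness(...) >= 0.95.
```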
RNF2: Accuracy
Accuracy requires that at least 99.9% of records contain a valid ISTAT code. Validation is based on comparison with the reference dataset of official administrative boundaries published by ISTAT.
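A record-level version of the same check, sketched with hypothetical field names: each record's code is compared against the official reference set, and the invalid records are returned for quarantine rather than silently dropped.

```python
def validate_codes(records: list[dict], reference: set[str]):
    """Split records into (valid, invalid) by ISTAT code membership
    in the official reference set of administrative boundaries."""
    valid = [r for r in records if r.get("istat_code") in reference]
    invalid = [r for r in records if r.get("istat_code") not in reference]
    accuracy = len(valid) / len(records) if records else 1.0
    return valid, invalid, accuracy

# RNF2 requires accuracy >= 0.999 for each dataset.
```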
RNF3: Timeliness
Datasets with monthly update frequency must be processed within 5 days of their availability. Priority ETL pipelines must complete processing with a latency of less than 24 hours.
RNF4: Lineage
Every record must be traceable through the entire transformation chain, from the original source to the Bronze layer, then to the Silver layer, and finally to the Gold layer. The system must maintain versioning of data sources to allow historical reconstruction of transformations.
RNF5: Scalability
The system is sized to support a data volume of up to 1TB over the three-year project period. ETL pipeline throughput must be equal to or greater than 100 records per second to guarantee acceptable processing times.
RNF6: Governance
Governance requires complete cataloguing of all datasets with structured metadata. A real-time dashboard must monitor data quality, highlighting anomalies and deviations from defined thresholds. All operations on sensitive data must be recorded in complete logs for audit purposes.
The initial phase of the project (months 1-4) focuses on the acquisition and validation of 5 critical datasets, which constitute the minimum information base for developing the SLO algorithms. The following table summarises the main characteristics of each dataset.
| Dataset | Source | Format | Frequency | Priority |
|---|---|---|---|---|
| Municipal boundaries + metadata | ISTAT | Shapefile/JSON | Annual | High |
| Resident population | ISTAT | CSV | Annual | High |
| Commuting matrices | ISTAT | CSV | Decennial | High |
| Public transport network | Transit operators | GTFS (ZIP) | Monthly | Medium |
| Healthcare facilities | Min. Salute | Excel | Annual | Medium |
The system architecture is subject to four fundamental constraints, derived from both strategic choices and specific technical requirements.
Constraint V1: Self-Hosted Open Source
The choice to use exclusively open source technologies in self-hosted mode responds to the need for vendor independence and cost predictability. The technology stack comprises PostgreSQL for primary storage, PostGIS for spatial operations, Prefect for pipeline orchestration, OpenMetadata for data governance, and DuckDB for embedded analytics. Managed services offered by BigQuery, Azure, and AWS are explicitly excluded, as they would introduce vendor dependencies and variable costs that are difficult to forecast.
Constraint V2: Scale Adequacy
The expected data volumes are on the order of 10^4 rows by 10^3 columns, typical of projects managed by DEPP. At this scale, adopting BigQuery would be overkill relative to actual needs, adding unjustified cost and management complexity.
Constraint V3: Territorial Standards
PostGIS is the de facto industry standard for managing spatial data in a PostgreSQL environment. The adopted reference system is EPSG:32632 (WGS84 / UTM zone 32N), which is the standard system for Italian territory and ensures compatibility with national geographic datasets.
Constraint V4: FAIR Compliance
Released datasets must comply with FAIR principles. Findable: data must be catalogued with structured metadata that facilitates discovery. Accessible: the adopted licences (CC-BY, CC0) guarantee free and open access. Interoperable: the distribution formats (GeoJSON, GeoParquet, CSV) are open standards that ensure interoperability with other systems. Reusable: the assignment of DOI guarantees scientific citability and version traceability.
The project faces four significant technical challenges, each requiring specific architectural solutions.
Challenge S1: Absence of Point-Level Geolocation
Most of the services analysed (schools, hospitals, post offices) do not have precise geographic coordinates. The approach adopted is based on detecting the presence or absence of a service at municipal level, forgoing point-level geolocation. Consequently, isochrone calculations operate on municipality-to-municipality distances rather than point-to-point.
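Municipality-to-municipality distances can be approximated from centroid coordinates. A minimal sketch using the haversine great-circle formula on WGS84 latitude/longitude (a simplification: the production pipeline works in the projected EPSG:32632 system, where planar distances apply); the coordinates below are illustrative centroids for Turin and Milan.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two municipal centroids."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Illustrative centroids (Turin, Milan); real centroids come from the
# ISTAT boundary dataset.
d = haversine_km(45.0703, 7.6869, 45.4642, 9.1900)
```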
Challenge S2: Administrative Boundary Evolution
In the period 2010-2025, approximately 100 municipal mergers or splits occurred, altering the reference territorial structure. The adopted solution uses a historical lookup with temporal validity implemented according to the Slowly Changing Dimension Type 2 (SCD Type 2) pattern. The Silver schema includes the valid_from and valid_to columns that delimit the validity interval of each territorial configuration.
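The lookup can be sketched as a date-aware resolution function over SCD Type 2 rows. The history rows below are hypothetical (the codes and merger date are invented for illustration); `valid_to = None` marks the currently open version, following the pattern described above.

```python
from datetime import date

# Hypothetical SCD Type 2 rows: (old_code, resolved_code, valid_from, valid_to).
# valid_to = None means the mapping is still in force.
HISTORY = [
    ("028001", "028106", date(2014, 2, 1), None),  # merged into a new comune
]

def resolve(code: str, on: date, history=HISTORY) -> str:
    """Map a municipality code to the code valid on the given date."""
    for old, new, valid_from, valid_to in history:
        if old == code and valid_from <= on and (valid_to is None or on < valid_to):
            return new
    return code  # no recorded change: the code is already current
```

Resolving every record through this function before loading the Silver layer is what keeps historical series continuous across administrative changes.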
Challenge S3: COVID Temporal Gaps
Data relating to the 2020-2021 period presents methodological discontinuities due to restrictions imposed by the pandemic. The system applies a covid_affected flag to impacted datasets, enabling differentiated analysis strategies. Options include statistical imputation techniques to estimate missing values, or explicit exclusion of the problematic periods from longitudinal analyses.
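The two strategies can be sketched over a year→value series. The imputation branch is deliberately naive (mean of the nearest non-COVID neighbours) and serves only to illustrate the branching that the `covid_affected` flag enables; function and strategy names are invented.

```python
COVID_YEARS = {2020, 2021}

def handle_covid_gap(series: dict[int, float],
                     strategy: str = "exclude") -> dict[int, float]:
    """Apply a handling strategy to a year -> value series flagged as
    covid_affected: drop 2020-2021, or impute them naively."""
    if strategy == "exclude":
        return {y: v for y, v in series.items() if y not in COVID_YEARS}
    if strategy == "impute":
        out = dict(series)
        lo = series.get(2019)  # last pre-COVID observation
        hi = series.get(2022)  # first post-COVID observation
        if lo is not None and hi is not None:
            for y in COVID_YEARS & set(series):
                out[y] = (lo + hi) / 2
        return out
    raise ValueError(f"unknown strategy: {strategy}")

s = {2019: 10.0, 2020: 3.0, 2021: 4.0, 2022: 12.0}
```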
Challenge S4: Format Heterogeneity
The presence of 207 datasets distributed in different formats (CSV, XLSX, HTML, PDF) requires a multi-layer extraction stack. Docling handles extraction from complex PDFs with structured tables. Pandas processes CSV and Excel files. BeautifulSoup parses HTML tables. Final normalisation occurs through the flexible EAV schema of the Silver layer, which absorbs the structural variability of sources.
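The routing logic of the multi-layer stack reduces to a dispatch table keyed by file extension. The sketch below returns backend names as strings rather than invoking the real libraries, so it only illustrates the routing described above; the table contents mirror the stack but are not the project's actual configuration.

```python
from pathlib import Path

# Dispatch table mirroring the extraction stack: each distribution
# format is routed to the tool responsible for it.
PARSERS = {
    ".csv": "pandas.read_csv",
    ".xlsx": "pandas.read_excel",
    ".html": "beautifulsoup",
    ".pdf": "docling",
}

def pick_parser(path: str) -> str:
    """Return the extraction backend for a source file, by extension."""
    ext = Path(path).suffix.lower()
    try:
        return PARSERS[ext]
    except KeyError:
        raise ValueError(f"unsupported distribution format: {ext!r}")
```

Whatever backend runs, its output converges on the same EAV rows in the Silver layer, which is what isolates downstream consumers from the format heterogeneity of the sources.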
At the conclusion of Work Package 2, the data lake must satisfy five measurable objectives:
Work Package 2 produces five documentary deliverables: