Deliverable D2.1.1: Data-Lake Technical Design Document
The MAPS Data Lake architecture adopts the Medallion pattern (also known as Multi-Hop Architecture), a best practice in the Modern Data Stack ecosystem that organizes data into three layers of progressive refinement with increasing levels of quality, cleanliness, and business-readiness:
EXTERNAL SOURCES → BRONZE → SILVER → GOLD → APPLICATIONS
For the MAPS project, the Medallion architecture is particularly well-suited for five main reasons.
Data source heterogeneity is the first determining factor. The project manages over 180 datasets from different public sources (ISTAT, Ministries, local authorities), distributed across multiple formats (CSV, Excel, PDF, JSON, HTML) with variable quality ranging from structured datasets to semi-structured documents. The layered approach allows this heterogeneity to be standardized progressively without losing the original information.
Audit and compliance requirements represent the second critical element. Although the data is public, scientific research requires complete traceability of provenance and transformations applied. Article 30 of the GDPR also requires documentation of data processing activities. The immutable Bronze layer provides the original "source of truth" needed to satisfy these requirements.
Transformation complexity requires a multi-stage architecture. Pipelines operate in multi-step mode, moving from acquisition from heterogeneous sources (CSV, Excel, PDF, JSON, HTML) to validation, then to normalization into EAV schema, and finally to territorial aggregations. Managing time series from 2010 to 2025 with municipal mergers and splits adds further complexity. The logical separation between raw acquisition (Bronze), cleaning (Silver), and business logic (Gold) simplifies the development and debugging of transformations.
Data reuse justifies the introduction of a validated intermediate layer. The same datasets are used for multiple analyses (demographics, services, mobility), making re-processing from the external source for each use inefficient. The Silver layer acts as a validated enterprise cache, accelerating downstream analyses.
Performance and scalability constitute the final enabling factor. Analytical queries operate on over 8,000 municipalities with dozens of attributes each, while PostGIS spatial operations are computationally intensive. The denormalized Gold layer optimizes the performance of frequent queries, reducing response times from seconds to milliseconds.
graph TB
A[External Sources
ISTAT, GTFS, Ministries, OpenData] --> B[Prefect Orchestration
Worker Pools: istat, pdf, analytics]
B --> C[Bronze Layer
File System - Raw Files]
C --> D[Silver Layer
PostgreSQL - EAV Schema]
D --> E[Gold Layer
PostgreSQL + PostGIS]
B -.metadata.-> F[OpenMetadata
Internal Governance]
E --> F
E --> G[DuckDB
Analytics]
E --> H[API v2
Web Apps]
E --> I[CKAN
Open Data Catalog]
I --> J[dati.gov.it
National Portal]
style C fill:#cd7f32
style D fill:#c0c0c0
style E fill:#ffd700
Data flows from external sources through three progressive layers, orchestrated by Prefect. The Bronze layer preserves the original files; the Silver layer normalizes them into EAV schema on PostgreSQL; the Gold layer produces analytical tables with PostGIS spatial extensions. Published data reaches the national portal via CKAN, while OpenMetadata tracks governance throughout the entire flow.
Role: Immutable archive of original data exactly as downloaded from the source.
Key principle: "Never modify, always preserve"
| Aspect | MAPS Specification |
|---|---|
| Storage | Local file system /data/bronze/ (op-linkurious server) |
| Format | Original files without transformations (CSV, XLSX, PDF, JSON, HTML, Parquet) |
| Naming convention | /data/bronze/{source}/{year}/{dataset}_{timestamp}.{ext} |
| Retention policy | Indefinite (storage cost is low: ~50GB total for 180 datasets) |
| Backup | Daily snapshots via backup.sh script |
| Access mode | Write-once, read-rarely (only for reprocessing or audit) |
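The naming convention in the table above lends itself to a small path-building helper. The following sketch is illustrative (the function name and `BRONZE_ROOT` constant are hypothetical, not part of the MAPS codebase) and stamps the UTC acquisition date as in the directory layout shown below:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

BRONZE_ROOT = PurePosixPath("/data/bronze")

def bronze_path(source: str, year: int, dataset: str, ext: str,
                now: Optional[datetime] = None) -> PurePosixPath:
    """Build /data/bronze/{source}/{year}/{dataset}_{timestamp}.{ext},
    stamping the UTC acquisition date (YYYYMMDD)."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y%m%d")
    return BRONZE_ROOT / source / str(year) / f"{dataset}_{ts}.{ext}"

p = bronze_path("istat", 2024, "popolazione_comuni", "csv",
                now=datetime(2024, 2, 18, tzinfo=timezone.utc))
print(p)  # → /data/bronze/istat/2024/popolazione_comuni_20240218.csv
```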
/data/bronze/
├── istat/
│ ├── 2024/
│ │ ├── popolazione_comuni_20240218.csv
│ │ ├── pendolarismo_matrix_20240218.csv
│ │ ├── confini_amministrativi_20240218.geojson
│ │ └── _metadata/
│ │ ├── popolazione_comuni_20240218.json # Metadata file
│ │ └── checksums.sha256
│ └── 2023/
│ └── popolazione_comuni_20230315.csv
├── minlavoro/
│ └── 2024/
│ ├── tabacchi_adm_report_20240218.pdf
│ └── _metadata/
│ └── tabacchi_adm_report_20240218.json
├── minsalute/
│ └── 2023/
│ ├── strutture_sanitarie_asl_20231120.xlsx
│ └── _metadata/
│ └── strutture_sanitarie_asl_20231120.json
└── minambiente/
└── 2024/
├── ato_gas_page_20240115.html # ← HTML file
└── _metadata/
└── ato_gas_page_20240115.json
Tracking occurs at two complementary levels: a local JSON file alongside each acquired file, and a centralized table in PostgreSQL that records the status of all acquisitions.
Each Bronze file has a corresponding JSON file with acquisition metadata:
Example: /data/bronze/istat/2024/_metadata/popolazione_comuni_20240218.json
{
"file_path": "/data/bronze/istat/2024/popolazione_comuni_20240218.csv",
"fonte": "istat",
"dataset": "popolazione",
"anno_riferimento": 2024,
"download_timestamp": "2024-02-18T03:00:15Z",
"download_url": "https://www.istat.it/storage/cartografia/popolazione_comuni_2024.csv",
"file_size_bytes": 3355482,
"file_hash_sha256": "a3f5b8c9d2e1f4a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0",
"mime_type": "text/csv",
"encoding": "UTF-8",
"rows_detected": 7901,
"columns_detected": 12,
"prefect_flow_run_id": "abc123-456-def-789",
"ingestion_status": "completed"
}
The file is written by the Prefect flow at the end of each acquisition and resides in the _metadata/ subdirectory next to the corresponding data file. It records essential operational information: provenance, integrity hash, status, and the identifier of the flow that performed the download. It serves three functions in the architecture: it enables targeted reprocessing (if an error occurs in the Silver layer, the pipeline identifies the Bronze file to reprocess without re-downloading from the source); it guarantees the audit trail via SHA256 hash; and it provides the technical metadata that the flow publishes to OpenMetadata via API, feeding the centralized governance catalog.
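The hashing and sidecar-writing steps described above can be sketched as follows. This is a minimal illustration, not the actual MAPS flow code: the helper names are hypothetical, and the usage example runs on a throwaway file with made-up content.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large Bronze files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_metadata(data_file: Path, extra: dict) -> Path:
    """Write the sidecar JSON into the _metadata/ directory next to the data file."""
    meta_dir = data_file.parent / "_metadata"
    meta_dir.mkdir(exist_ok=True)
    meta = {
        "file_path": str(data_file),
        "file_size_bytes": data_file.stat().st_size,
        "file_hash_sha256": sha256_of(data_file),
        **extra,  # provenance fields supplied by the flow (fonte, dataset, ...)
    }
    out = meta_dir / (data_file.stem + ".json")
    out.write_text(json.dumps(meta, indent=2))
    return out

# Usage on a throwaway file (illustrative content, not real ISTAT data)
tmp = Path(tempfile.mkdtemp()) / "popolazione_comuni_20240218.csv"
tmp.write_bytes(b"codice,popolazione\n001001,2635\n")
meta = json.loads(write_metadata(tmp, {"fonte": "istat",
                                       "ingestion_status": "completed"}).read_text())
```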
Database tracking (PostgreSQL):
While the JSON is local to the file system, the bronze.ingestion_log table centralizes the status of all acquisitions, enabling aggregate queries by source, dataset, and status. It acts as a registry that Prefect can query to manage re-executions and guarantee deduplication via the uniqueness constraint on the file hash.
-- Schema: bronze
CREATE TABLE bronze.ingestion_log (
id BIGSERIAL PRIMARY KEY,
fonte VARCHAR(50) NOT NULL,
dataset VARCHAR(100) NOT NULL,
anno_rif INTEGER NOT NULL,
file_path TEXT NOT NULL,
file_size_bytes BIGINT,
file_hash_sha256 CHAR(64),
download_url TEXT,
download_timestamp TIMESTAMP NOT NULL,
prefect_flow_run_id UUID,
status VARCHAR(20) NOT NULL, -- 'completed', 'failed', 'in_progress'
error_message TEXT,
created_at TIMESTAMP DEFAULT NOW(),
UNIQUE (fonte, dataset, anno_rif, file_hash_sha256)
);
-- Index for frequent queries
CREATE INDEX idx_ingestion_log_fonte_dataset ON bronze.ingestion_log(fonte, dataset);
CREATE INDEX idx_ingestion_log_status ON bronze.ingestion_log(status);
CREATE INDEX idx_ingestion_log_timestamp ON bronze.ingestion_log(download_timestamp DESC);
The Bronze layer provides three fundamental guarantees: immutability, because original files are never modified after acquisition; traceability, through the SHA256 hash and sidecar metadata that document the provenance of every file; and deduplication, enforced by the UNIQUE (fonte, dataset, anno_rif, file_hash) constraint.
Some Italian public sources publish data exclusively as HTML tables embedded in web pages, without making downloadable CSV or Excel files available. This is the case for several regional registries of accredited healthcare facilities, lists of public bodies, and sections of ministerial portals where data is presented for online consultation but cannot be exported directly. This type of source requires a dedicated acquisition approach, based on HTTP GET of the page and parsing with BeautifulSoup, before proceeding to data normalization.
The Bronze-HTML flow follows the same logic as other formats: the HTML page is saved in full to the Bronze layer before any transformation.
sequenceDiagram
participant S as Source Website
participant P as Prefect Flow
participant B as Bronze Layer
participant SI as Silver Layer
S->>P: HTTP GET HTML page
P->>B: Save raw HTML file
P->>B: Log metadata JSON
Note over B: HTML preserved unchanged
P->>P: Parse HTML with BeautifulSoup
P->>P: Clean & Validate DataFrame
P->>SI: Load to EAV schema
The advantage of the Bronze pattern over direct acquisition without intermediate archiving is resilience: if the parser fails because the HTML structure of the page has changed, the code can be corrected and the already-saved file reprocessed, without returning to the source. If the source server is temporarily unreachable or the page is removed, the Bronze layer preserves the original copy and allows reprocessing. The history of HTML versions acquired over time also makes it possible to detect retroactive changes to published data.
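The BeautifulSoup parsing step can be sketched as follows. This is a simplified version of the parse_html_to_dataframe function mentioned elsewhere in this document: it takes the HTML text directly rather than a Bronze file path, and assumes the simplest case of a single table with header cells; real pages typically require selecting the table by id or class.

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_html_to_dataframe(html: str) -> pd.DataFrame:
    """Extract the first <table> of a saved Bronze HTML page into a DataFrame.
    Sketch only: assumes <th> headers in the first row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find("td")  # skip the header row, which has no <td> cells
    ]
    return pd.DataFrame(rows, columns=headers)

# Illustrative fragment of an ATO-style registry page (not real data)
html = """<table>
  <tr><th>denominazione</th><th>ato_gas_id</th></tr>
  <tr><td>AGLIE</td><td>ATO-PIE-01</td></tr>
</table>"""
df = parse_html_to_dataframe(html)
print(df.shape)  # → (1, 2)
```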
Role: Single reference source for all analytical processes in the project. Data is clean, validated, and normalized from the Bronze layer.
Key principle: "Verify before trusting, then archive"
| Aspect | MAPS Specification |
|---|---|
| Storage | PostgreSQL schema silver |
| Format | Normalized relational tables (EAV schema) |
| Naming convention | Schema silver, core tables: territory_types, territories, territory_identifiers, territory_names, territory_containments, territory_attributes |
| Retention policy | Temporal versioning (valid_from, valid_to) for time series |
| Backup | Daily PostgreSQL snapshots + WAL archiving |
| Access mode | Read-heavy (analytical queries), write-moderate (batch ingestion) |
The MAPS project acquires over 180 datasets with heterogeneous structures: population, healthcare facilities, transport infrastructure, environmental data. Adopting a traditional relational model would require a separate table for each dataset, making it difficult to manage changes over time and to add new sources. The EAV (Entity-Attribute-Value) model solves this problem by representing each observation as a row (territory_id, attribute, value): the schema remains stable as acquired datasets change, and new attributes are added without modifications to the database structure.
A further requirement emerged during the project: data is not always available at municipal level. Some datasets are published at provincial or regional level; others refer to functional aggregations such as Labour Market Areas (Sistemi Locali del Lavoro, SLL) or the new Homogeneous Local Systems (Sistemi Locali Omogenei, SLO) that this project aims to define. The Silver schema therefore supports territorial entities of any type, not just municipalities.
The model is articulated in six core tables: territory_types for the registry of entity types; territories for the master registry of all territorial entities, each with a stable surrogate id that never changes even when ISTAT codes are reassigned; territory_identifiers for temporal ISTAT codes and other external identifiers; territory_names for official and alternate names with language support; territory_containments for temporal hierarchical containment (comune → provincia, with precise dates for province changes); and territory_attributes for all observed values with complete traceability. Administrative boundary change events (mergers, incorporations, splits) are tracked through territory_relationships, which stores predecessor/successor links derived from ISTAT variazioni amministrative data.
erDiagram
TERRITORY_TYPES ||--o{ TERRITORIES : "typed as"
TERRITORIES ||--o{ TERRITORY_IDENTIFIERS : "has identifiers"
TERRITORIES ||--o{ TERRITORY_NAMES : "has names"
TERRITORIES ||--o{ TERRITORY_CONTAINMENTS : "member"
TERRITORIES ||--o{ TERRITORY_CONTAINMENTS : "container"
TERRITORIES ||--o{ TERRITORY_ATTRIBUTES : "has attributes"
TERRITORY_TYPES {
varchar code PK
varchar label
varchar authority
text coding_scheme
integer hierarchy_level
}
TERRITORIES {
serial id PK
varchar type_code FK
varchar label
date valid_from
date valid_to
text end_reason
}
TERRITORY_IDENTIFIERS {
serial id PK
integer territory_id FK
varchar scheme
varchar identifier
date valid_from
date valid_to
}
TERRITORY_NAMES {
serial id PK
integer territory_id FK
varchar name
varchar language
varchar name_type
date valid_from
date valid_to
}
TERRITORY_CONTAINMENTS {
serial id PK
integer member_id FK
integer container_id FK
date valid_from
date valid_to
}
TERRITORY_ATTRIBUTES {
serial id PK
integer territory_id FK
varchar type_code
varchar attribute
text value
varchar data_type
varchar source
date valid_from
date valid_to
}
The territory_attributes table is the heart of the Silver schema. Each row represents a single observed value for a territorial entity over a validity period, with complete traceability of the source. The valid_from and valid_to columns implement temporal versioning according to the SCD Type 2 pattern. Crucially, territory_attributes references the stable surrogate territory_id rather than the ISTAT code directly: when a municipality changes province and receives a new ISTAT code, the attribute history is preserved without any data migration.
-- Territory type registry
CREATE TABLE silver.territory_types (
code VARCHAR(30) PRIMARY KEY,
label VARCHAR(100) NOT NULL,
authority VARCHAR(100), -- "ISTAT", "SNAI", "MAPS", ...
coding_scheme TEXT,
hierarchy_level INTEGER -- 1=nation ... 5=municipality; NULL=functional
);
-- Master registry — stable surrogate id, never changes
CREATE TABLE silver.territories (
id SERIAL PRIMARY KEY,
type_code VARCHAR(30) NOT NULL REFERENCES silver.territory_types(code),
label VARCHAR(255),
valid_from DATE, -- from variazioni CS events; NULL = unknown start
valid_to DATE, -- from variazioni ES events; NULL = still active
end_reason TEXT
);
-- Temporal external identifiers (ISTAT codes, cadastral codes, ...)
CREATE TABLE silver.territory_identifiers (
id SERIAL PRIMARY KEY,
territory_id INTEGER NOT NULL REFERENCES silver.territories(id),
scheme VARCHAR(50) NOT NULL, -- 'istat', 'catasto', 'fiscale', 'uts', ...
identifier VARCHAR(50) NOT NULL,
valid_from DATE,
valid_to DATE,
fonte VARCHAR(100)
);
-- Temporal names (official + bilingual + alternate)
CREATE TABLE silver.territory_names (
id SERIAL PRIMARY KEY,
territory_id INTEGER NOT NULL REFERENCES silver.territories(id),
name TEXT NOT NULL,
language VARCHAR(10), -- 'it', 'de', 'fr', 'sl', ...
name_type VARCHAR(50), -- 'official', 'local', 'alias', ...
valid_from DATE,
valid_to DATE
);
-- Temporal hierarchical containment (comune → provincia, with province-change dates)
CREATE TABLE silver.territory_containments (
id SERIAL PRIMARY KEY,
member_id INTEGER NOT NULL REFERENCES silver.territories(id),
container_id INTEGER NOT NULL REFERENCES silver.territories(id),
valid_from DATE,
valid_to DATE
);
-- Generalized EAV — references stable territory_id, not ISTAT code
CREATE TABLE silver.territory_attributes (
id SERIAL PRIMARY KEY,
territory_id INTEGER NOT NULL REFERENCES silver.territories(id),
type_code VARCHAR(30) NOT NULL, -- denormalised from territories for query convenience
attribute VARCHAR(255) NOT NULL,
value TEXT,
data_type VARCHAR(50), -- integer, float, string, boolean, date
source VARCHAR(100),
valid_from DATE,
valid_to DATE, -- NULL = current value
UNIQUE (territory_id, attribute, valid_from)
);
CREATE INDEX idx_ta_territory ON silver.territory_attributes(territory_id);
CREATE INDEX idx_ta_type ON silver.territory_attributes(type_code);
CREATE INDEX idx_ta_attribute ON silver.territory_attributes(attribute);
CREATE INDEX idx_ta_source ON silver.territory_attributes(source);
To make the model concrete, a real example is provided: the municipality of AGLIÈ (ISTAT code 001001), with attributes from three distinct sources, alongside a provincial-level record and a Labour Market Area record. All entity types coexist in the same table thanks to the generalised EAV schema. Territory resolution always goes through the stable territory_id FK; the ISTAT code is looked up in territory_identifiers.
-- First, resolve territory_id for AGLIÈ (ISTAT code 001001)
-- This join is handled by territory_resolver.py at transform time
SELECT t.id FROM silver.territories t
JOIN silver.territory_identifiers ti ON ti.territory_id = t.id
WHERE t.type_code = 'comune' AND ti.scheme = 'istat' AND ti.identifier = '001001';
-- → id = 42 (example)
-- Municipality AGLIÈ (territory_id=42) — attributes from three sources
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, data_type, source, valid_from)
VALUES
(42, 'comune', 'popolazione', '2635', 'integer', 'ISTAT', '2024-01-01'),
(42, 'comune', 'superficie_kmq', '13.98', 'float', 'ISTAT', '2024-01-01'),
(42, 'comune', 'num_tabacchi', '2', 'integer', 'ADM', '2024-01-01'),
(42, 'comune', 'num_asl', '1', 'integer', 'MinSalute', '2023-01-01'),
(42, 'comune', 'ato_gas_id', 'ATO-PIE-01', 'string', 'MinAmbiente', '2024-01-01');
-- Province of Turin (territory_id=107) — provincial-level dataset
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, data_type, source, valid_from)
VALUES
(107, 'provincia', 'popolazione', '2259523', 'integer', 'ISTAT', '2023-01-01'),
(107, 'provincia', 'pil_pro_capite', '32400', 'integer', 'ISTAT', '2022-01-01');
-- Labour Market Area (territory_id=8) — functional aggregation
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, data_type, source, valid_from)
VALUES
(8, 'sll', 'addetti_industria', '15200', 'integer', 'ISTAT', '2021-01-01');
Each dataset acquired in the Bronze layer is transformed into Silver through a standard Prefect pipeline in four phases: parsing the source file in its original format (CSV, Excel, PDF, HTML), cleaning and normalizing values, validation with Great Expectations, and loading into the EAV schema. This structure is uniform for all datasets, regardless of source.
from prefect import task
import pandas as pd

@task(name="bronze-to-silver-transform")
def transform_bronze_to_silver(
    bronze_file_path: str,
    entity_type: str,  # 'comune', 'provincia', 'regione', 'sll', ... — used to resolve territory_id
    fonte: str,
    dataset: str,
    anno: int
):
    """
    Standard transformation pipeline: Bronze → Silver
    Steps:
    1. Parse: Read Bronze file (CSV/PDF/Excel/HTML)
    2. Clean: Normalize encoding, trim whitespace, fix typos
    3. Validate: Check data quality, enforce business rules
    4. Load: Resolve territory_id and insert into silver.territory_attributes
    """
    # Step 1: Parse
    if bronze_file_path.endswith('.csv'):
        df = pd.read_csv(bronze_file_path)
    elif bronze_file_path.endswith('.pdf'):
        df = extract_pdf_tables(bronze_file_path)  # Docling
    elif bronze_file_path.endswith('.xlsx'):
        df = pd.read_excel(bronze_file_path)
    elif bronze_file_path.endswith('.html'):
        df = parse_html_to_dataframe(bronze_file_path)  # BeautifulSoup
    else:
        raise ValueError(f"Unsupported Bronze format: {bronze_file_path}")
    # Step 2: Clean
    df = clean_dataframe(df)
    # Step 3: Validate with Great Expectations
    validation_results = validate_dataframe(df, fonte, dataset)
    if not validation_results.passed:
        raise ValueError(f"Validation failed: {validation_results.errors}")
    # Step 4: Resolve territory_id and load to Silver EAV
    load_to_territory_attributes(df, entity_type, fonte, anno)
In the 2010-2025 period, approximately 100 municipal mergers or splits occurred. When two municipalities merge into a new one, the historical data of the pre-existing entities remains valid for the period preceding the merger: it is not deleted or overwritten, but temporally bounded via the valid_from and valid_to columns. The new municipality receives its own records from its establishment date. Queries can therefore interrogate the state of the territory at any historical date with a simple temporal filter.
The following example shows a merger that occurred in 2019: municipalities 001234 and 001235 merge into the new municipality 001236. In the surrogate identity model, the two original settlements retain their territories rows (now with valid_to set); the new municipality gets a new row. The relationship is recorded in territory_relationships.
-- territory_id=10 → comune 001234 (valid_to 2019-01-01)
-- territory_id=11 → comune 001235 (valid_to 2019-01-01)
-- territory_id=12 → comune 001236 (valid_from 2019-01-01, valid_to NULL)
-- Attributes before the merger (valid_to closes on merger date)
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, source, valid_from, valid_to)
VALUES
(10, 'comune', 'popolazione', '1500', 'ISTAT', '2018-01-01', '2019-01-01'),
(11, 'comune', 'popolazione', '800', 'ISTAT', '2018-01-01', '2019-01-01');
-- Attributes after the merger
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, source, valid_from, valid_to)
VALUES
(12, 'comune', 'popolazione', '2300', 'ISTAT', '2019-01-01', NULL);
-- Query: Population of the settlement with ISTAT code 001234 in 2018
SELECT ta.value FROM silver.territory_attributes ta
JOIN silver.territory_identifiers ti ON ti.territory_id = ta.territory_id
WHERE ti.scheme = 'istat' AND ti.identifier = '001234'
AND ta.attribute = 'popolazione'
AND '2018-12-31' BETWEEN ta.valid_from AND COALESCE(ta.valid_to, '9999-12-31');
-- Result: 1500
Role: Data optimized for specific applications — aggregated, denormalized, enriched with spatial geometries. This is the layer that directly feeds dashboards, APIs, and SLO algorithms.
Key principle: "Optimize for queries, not for space"
| Aspect | MAPS Specification |
|---|---|
| Storage | PostgreSQL schema gold with PostGIS extensions |
| Format | Denormalized tables (wide table) with spatial geometries |
| Naming convention | Schema gold, thematic tables by domain (municipalities, attractors, time series) |
| Retention policy | Snapshots updated periodically (daily or weekly); rebuildable from Silver layer |
| Backup | Daily snapshots (but rebuildable from Silver layer) |
| Access mode | High-frequency read-only (aggregate queries, point access via API, full scans for SLO algorithms) |
A data mart is a thematic, optimized view of data designed to answer a specific set of analytical questions. In the Gold layer, each data mart is a denormalized table that aggregates information from multiple Silver sources into a single structure, eliminating the need for complex joins at query time. This approach sacrifices storage space — data is duplicated compared to Silver — in favor of query speed and ease of use for application and algorithm developers.
The MAPS Gold layer defines three main data marts, each oriented toward a distinct analytical domain.
A. gold.comuni_aggregati
This is the central table of the Gold layer: a denormalized representation of each Italian municipality, with all relevant attributes collected in a single row. Each record integrates demographic data from ISTAT, service counts from the Ministries, geographic information with PostGIS geometry, and attributes calculated by the SLO algorithm (attractiveness level, 60-minute isochrone). This structure enables complex analytical queries in under 10 ms, with no joins between tables.
CREATE TABLE gold.comuni_aggregati (
-- Identifiers
codice_istat VARCHAR(6) PRIMARY KEY,
denominazione VARCHAR(255) NOT NULL,
denominazione_full VARCHAR(255), -- With province abbreviation
-- Administrative hierarchy
codice_regione CHAR(2) NOT NULL,
denominazione_regione VARCHAR(100) NOT NULL,
codice_provincia CHAR(3),
denominazione_provincia VARCHAR(100),
sigla_provincia CHAR(2),
-- Demographic attributes (from ISTAT)
popolazione_2024 INTEGER,
popolazione_2023 INTEGER,
popolazione_2022 INTEGER,
crescita_popolazione_pct NUMERIC(5,2), -- Calculated: (2024-2023)/2023*100
-- Geographic attributes
superficie_kmq NUMERIC(10,2),
densita_abitanti_kmq NUMERIC(10,2), -- Calculated: pop/superficie
altitudine_m INTEGER,
zona_altimetrica VARCHAR(50), -- 'montagna', 'collina', 'pianura'
-- PostGIS geometry
geometria GEOMETRY(MultiPolygon, 4326) NOT NULL,
centroide GEOMETRY(Point, 4326),
-- Services (from Ministries)
num_strutture_sanitarie INTEGER DEFAULT 0,
num_asl INTEGER DEFAULT 0,
num_scuole_primarie INTEGER DEFAULT 0,
num_scuole_secondarie INTEGER DEFAULT 0,
num_tabacchi INTEGER DEFAULT 0,
num_uffici_postali INTEGER DEFAULT 0,
-- Infrastructure (from OpenData)
ha_stazione_ferroviaria BOOLEAN DEFAULT FALSE,
ha_casello_autostradale BOOLEAN DEFAULT FALSE,
ha_aeroporto BOOLEAN DEFAULT FALSE,
-- Territorial areas (from public registries)
ato_gas_id VARCHAR(50),
ato_gas_denominazione VARCHAR(255),
ato_gas_gestore VARCHAR(255),
ato_acqua_id VARCHAR(50),
ato_rifiuti_id VARCHAR(50),
-- DLS attributes (Calculated)
attractor_level VARCHAR(50), -- 'metropolitan', 'urban', 'semi-urban', 'rural'
cluster_dls_id INTEGER,
isochrone_60min GEOMETRY(MultiPolygon, 4326),
-- Metadata
last_updated TIMESTAMP NOT NULL DEFAULT NOW(),
data_completeness_pct NUMERIC(5,2), -- % populated attributes
-- Constraints
CONSTRAINT valid_codice CHECK (codice_istat ~ '^\d{6}$')
);
-- Spatial indexes
CREATE INDEX idx_gold_comuni_geometria ON gold.comuni_aggregati USING GIST(geometria);
CREATE INDEX idx_gold_comuni_centroide ON gold.comuni_aggregati USING GIST(centroide);
CREATE INDEX idx_gold_comuni_isochrone ON gold.comuni_aggregati USING GIST(isochrone_60min);
-- Attribute indexes
CREATE INDEX idx_gold_comuni_regione ON gold.comuni_aggregati(codice_regione);
CREATE INDEX idx_gold_comuni_provincia ON gold.comuni_aggregati(codice_provincia);
CREATE INDEX idx_gold_comuni_attractor ON gold.comuni_aggregati(attractor_level);
The following example query shows how the denormalized structure allows filtering by gas territorial area and population threshold with a single statement, no joins, and an estimated execution time under 10 ms.
-- Query: Municipalities in ATO Gas "ATO-PIE-01" with population > 5000
SELECT
denominazione,
popolazione_2024,
ato_gas_gestore,
num_tabacchi,
attractor_level
FROM gold.comuni_aggregati
WHERE ato_gas_id = 'ATO-PIE-01'
AND popolazione_2024 > 5000
ORDER BY popolazione_2024 DESC;
-- Execution: Index scan on ato_gas_id, no joins, < 10ms
B. gold.dls_attractors
Contains the results of the territorial attractor analysis produced by the SLO algorithm. Each row represents a municipality classified according to its attractiveness level (metropolitan, urban, semi-urban, rural), with partial scores by service category, the 60-minute isochrone geometry, and the list of reachable municipalities. This table is the main output of Work Package 3 and directly feeds the cartographic visualizations and the territorial policy simulator.
CREATE TABLE gold.dls_attractors (
codice_istat VARCHAR(6) PRIMARY KEY REFERENCES gold.comuni_aggregati(codice_istat),
denominazione VARCHAR(255) NOT NULL,
-- Attractor classification
attractor_level VARCHAR(50) NOT NULL, -- 'metropolitan', 'urban', 'semi-urban', 'rural'
attractor_score NUMERIC(5,2), -- 0-100 score
-- Service availability (weighted scores)
servizi_sanitari_score NUMERIC(5,2),
servizi_educativi_score NUMERIC(5,2),
servizi_commerciali_score NUMERIC(5,2),
servizi_trasporti_score NUMERIC(5,2),
-- Isochrone analysis (60 min travel time)
isochrone_60min GEOMETRY(MultiPolygon, 4326),
comuni_raggiungibili_60min INTEGER[], -- Array of ISTAT codes
popolazione_raggiungibile_60min INTEGER,
-- Clustering
cluster_id INTEGER NOT NULL,
cluster_centroid GEOMETRY(Point, 4326),
-- Metadata
calculation_date TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT valid_attractor_level CHECK (attractor_level IN ('metropolitan', 'urban', 'semi-urban', 'rural'))
);
C. Additional data marts
The two data marts described above cover the analytical requirements identified in the design phase. The Gold layer is however designed to accommodate additional thematic tables as the project's needs grow. New data marts can be added without modifying the Silver structure, simply by defining a new refresh function that reads from the EAV layer and materializes the data in the required form. Possible examples include views for mobility flow analysis, support tables for the territorial policy simulator (WP5), or thematic aggregations for specific territorial disadvantage indicators.
The Gold data marts are the primary data access interface for the project, but with different modes depending on the type of use. REST APIs and SLO algorithms query the Gold tables directly via PostgreSQL, leveraging spatial indexes and low-latency point queries. For exploratory analysis — Jupyter notebooks, researchers' Python scripts, hypothesis testing — DuckDB is used in embedded mode instead.
DuckDB reads Gold tables from PostgreSQL via direct federation, without copying any data, using standard SQL syntax:
import duckdb
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
ATTACH 'dbname=maps_db host=localhost user=maps password=...'
AS maps (TYPE postgres, READ_ONLY);
""")
# Query on the Gold layer: high-attractiveness municipalities in mountain areas
result = con.execute("""
SELECT denominazione, attractor_score, popolazione_2024, zona_altimetrica
FROM maps.gold.dls_attractors a
JOIN maps.gold.comuni_aggregati c USING (codice_istat)
WHERE a.attractor_level = 'urban'
AND c.zona_altimetrica = 'montagna'
ORDER BY a.attractor_score DESC
LIMIT 20
""").df()When data must be distributed to external researchers or archived for offline analysis, the Gold layer can be exported in Parquet format — a compressed columnar format that DuckDB reads directly, without a database, with performance equivalent to in-memory.
The Gold layer is populated through a SQL function scheduled by Prefect on a daily basis. The function executes a complete table recreation cycle: first it empties the existing content with TRUNCATE, then rebuilds it with a pivot query that transforms EAV rows from the Silver layer into columns of the denormalized Gold table. At the end, an ANALYZE command updates the query optimizer statistics.
-- Refresh Gold tables from Silver (scheduled daily)
CREATE OR REPLACE FUNCTION gold.refresh_comuni_aggregati()
RETURNS void AS $$
BEGIN
-- Truncate and rebuild (snapshot approach)
TRUNCATE gold.comuni_aggregati;
-- Pivot EAV → Wide table
INSERT INTO gold.comuni_aggregati
(codice_istat, denominazione, geometria, centroide,
 popolazione_2024, popolazione_2023, superficie_kmq, crescita_popolazione_pct,
 ato_gas_id, ato_gas_gestore, last_updated)
SELECT
c.codice_istat,
c.denominazione,
c.geometria,
ST_Centroid(c.geometria) AS centroide,
-- Pivot attributes from Silver
MAX(CASE WHEN e.attribute = 'popolazione_2024' THEN e.value::INTEGER END) AS popolazione_2024,
MAX(CASE WHEN e.attribute = 'popolazione_2023' THEN e.value::INTEGER END) AS popolazione_2023,
MAX(CASE WHEN e.attribute = 'superficie_kmq' THEN e.value::NUMERIC END) AS superficie_kmq,
-- Calculated fields
(MAX(CASE WHEN e.attribute = 'popolazione_2024' THEN e.value::NUMERIC END) -
MAX(CASE WHEN e.attribute = 'popolazione_2023' THEN e.value::NUMERIC END)) /
NULLIF(MAX(CASE WHEN e.attribute = 'popolazione_2023' THEN e.value::NUMERIC END), 0) * 100
AS crescita_popolazione_pct,
-- Territorial areas
MAX(CASE WHEN e.attribute = 'ato_gas_id' THEN e.value END) AS ato_gas_id,
MAX(CASE WHEN e.attribute = 'ato_gas_gestore' THEN e.value END) AS ato_gas_gestore,
NOW() AS last_updated
FROM
gold.comuni_anagrafica c
LEFT JOIN silver.territory_identifiers ti ON ti.scheme = 'istat' AND ti.identifier = c.codice_istat
LEFT JOIN silver.territory_attributes e ON e.territory_id = ti.territory_id
AND e.valid_to IS NULL -- Current data only
GROUP BY
c.codice_istat, c.denominazione, c.geometria;
-- Analyze for query optimizer
ANALYZE gold.comuni_aggregati;
END;
$$ LANGUAGE plpgsql;
-- Schedule refresh (called by Prefect flow daily)
SELECT gold.refresh_comuni_aggregati();
The SQL function approach scheduled by Prefect is the choice adopted for the MAPS project, as it keeps the transformation logic close to the database and leverages PostGIS spatial capabilities. There are, however, equally valid alternative approaches.
With dbt (data build tool), Silver → Gold transformations are expressed as versioned SQL models, with automatic dependency graphs, integrated tests, and generated documentation; it is the preferable choice when the number of Gold models grows and maintainability becomes a priority.
With pure Prefect, transformations occur in Python via dedicated tasks, eliminating the dependency on SQL functions in the database; it is preferable when transformations require complex logic not expressible in SQL, or when a single control point for the entire pipeline is desired.
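To illustrate the pure-Python route, the EAV → wide pivot can be expressed as an ordinary function. This is a minimal, dependency-free sketch: in a real Prefect deployment it would be wrapped in an `@task` decorator, and the function and column names here are hypothetical.

```python
from collections import defaultdict

def pivot_eav_rows(eav_rows, attributes):
    """Pivot EAV tuples (entity, attribute, value) into one wide dict per entity.

    Illustrative stand-in for a Prefect task; not the project's actual code.
    """
    wide = defaultdict(dict)
    for entity, attribute, value in eav_rows:
        if attribute in attributes:
            wide[entity][attribute] = value
    # Emit a fixed set of columns per entity; missing attributes become None
    return [
        {"codice_istat": entity, **{a: row.get(a) for a in attributes}}
        for entity, row in sorted(wide.items())
    ]

rows = [
    ("001001", "popolazione", 45230),
    ("001001", "n_asili_nido", 12),
    ("001002", "popolazione", 8420),
]
gold = pivot_eav_rows(rows, ["popolazione", "n_asili_nido"])
```

The trade-off matches the one described above: the logic is easier to test and debug in Python, but the database no longer owns the transformation.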
The MAPS project acquires over 180 datasets from heterogeneous sources — ISTAT, Ministries, public registries, OpenData — each with its own attributes, variable temporal coverage, and non-uniform availability across municipalities. This heterogeneity poses a Silver layer design problem: how to structure a schema that can accommodate new attributes without requiring database changes, that records the provenance of each value, and that manages changes over time?
The traditional approach, based on a "wide" table with one column per attribute, is inadequate in this context. The table would have hundreds of columns, most of which would be empty for any specific municipality, and would require a schema migration every time a new dataset is added. Temporal traceability would be absent or would require parallel structures.
| codice_istat | popolazione | n_asili | has_ospedale | n_scuole | ato_gas_id | ... |
|---|---|---|---|---|---|---|
| 001001 | 45230 | 12 | true | ? | ? | ... |
| 001002 | 8420 | 2 | false | ? | ? | ... |
The EAV (Entity-Attribute-Value) model solves these problems by inverting the structure: instead of having one column per attribute, each attribute becomes a row. The schema remains fixed — three fundamental columns plus metadata — while the content grows vertically as new datasets are added, with no structural modifications.
| entity | attribute | value | valid_from | source |
|---|---|---|---|---|
| 001001 | popolazione | 45230 | 2021-01-01 | ISTAT |
| 001001 | n_asili_nido | 12 | 2021-01-01 | ISTAT |
| 001001 | has_ospedale | true | 2015-01-01 | MinSalute |
| 001002 | popolazione | 8420 | 2021-01-01 | ISTAT |
| 001002 | cod_pre_fusione | 001045 | 2017-01-01 | ISTAT |
The source column guarantees traceability of provenance for each individual value; the valid_from and valid_to columns allow managing time series and municipal changes without parallel structures. Storage is efficient because only attributes that are actually available are recorded, not missing values.
The main trade-off is query complexity: reading a set of attributes for a municipality requires a pivot operation (MAX(CASE WHEN ...)) that turns EAV rows into columns. This complexity is, however, confined to the Silver → Gold transformation layer and invisible to users of the data marts. Read performance, potentially penalized by the higher row count, is mitigated by appropriate indexes on codice_istat and attribute. Values are stored as text and cast to the correct type during Gold layer population.
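The pivot and the text-to-type cast can be demonstrated in miniature with SQLite standing in for PostgreSQL — the `MAX(CASE WHEN …)` pattern is the same idea used in the Gold refresh function, though the table here is an illustrative stand-in, not the project schema.

```python
import sqlite3

# In-memory stand-in for the Silver EAV table; values are stored as text,
# as in the MAPS schema, and cast at read time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE territory_attributes (entity TEXT, attribute TEXT, value TEXT)")
con.executemany(
    "INSERT INTO territory_attributes VALUES (?, ?, ?)",
    [
        ("001001", "popolazione", "45230"),
        ("001001", "n_asili_nido", "12"),
        ("001002", "popolazione", "8420"),
    ],
)
# One MAX(CASE WHEN ...) per attribute turns EAV rows into columns;
# missing attributes surface as NULL (None in Python).
pivot_sql = """
SELECT entity,
       MAX(CASE WHEN attribute = 'popolazione'  THEN CAST(value AS INTEGER) END) AS popolazione,
       MAX(CASE WHEN attribute = 'n_asili_nido' THEN CAST(value AS INTEGER) END) AS n_asili_nido
FROM territory_attributes
GROUP BY entity
ORDER BY entity
"""
wide = con.execute(pivot_sql).fetchall()
# wide → [('001001', 45230, 12), ('001002', 8420, None)]
```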
The EAV schema in the Silver layer represents the optimal trade-off for MAPS: maximum flexibility in acquisition, long-term manageability, and no structural coupling between the arrival of new data and the database schema.
The choice of technology stack for MAPS is guided by three criteria: adequacy for the scale of data handled, independence from commercial vendors, and tool maturity. The project works with approximately 180 datasets at municipal granularity — a volume in the order of tens of gigabytes, not petabytes — that does not require large-scale cloud infrastructure but benefits from specialized tools for spatial data and territorial analysis.
The entire stack consists of open source software, each chosen for a precise and non-overlapping role. PostgreSQL with the PostGIS extension is the core of the system: it manages both the Silver layer EAV schema and the Gold layer denormalized tables, with native support for geographic operations such as isochrone calculation. Prefect orchestrates pipeline execution, from Bronze data collection through to Gold layer population, with monitoring and error handling. OpenMetadata catalogs all datasets internally, tracking lineage and quality; CKAN exposes approved datasets outward for open data publication. Great Expectations oversees data quality in the Bronze → Silver transition, while DuckDB provides researchers with a lightweight and portable analytical interface on the Gold layer.
| Component | Technology | Version | Role |
|---|---|---|---|
| Primary storage | PostgreSQL + PostGIS | 17 + 3.5 | Master data, spatial operations |
| Orchestration | Prefect | 3.x | Scheduling, monitoring, ETL lineage |
| Analytics layer | DuckDB | 1.x | Analytics queries (federation from PostgreSQL) |
| Internal governance | OpenMetadata | 1.x | Internal catalog, lineage, data quality, profiling |
| Open data catalog | CKAN | 2.11 | Public open data catalog, DCAT-AP_IT |
| Data quality | Great Expectations | 1.x | Validation, anomaly detection |
| PDF extraction | Docling | Latest | Table extraction from PDF (97.9% accuracy) |
| Object storage | MinIO (or GCS) | Latest | Raw files, PDFs, Excel archives |
```mermaid
graph TB
    subgraph "Data Sources"
        A1[ISTAT CSV]
        A2[Min PDF]
        A3[HTML Tables]
    end
    subgraph "Orchestration"
        B[Prefect<br/>Worker Pools]
    end
    subgraph "Storage Layers"
        C[Bronze<br/>File System]
        D[Silver<br/>PostgreSQL EAV]
        E[Gold<br/>PostgreSQL + PostGIS]
    end
    subgraph "Internal Governance & Analytics"
        F[OpenMetadata<br/>Internal Catalog]
        G[Great Expectations<br/>Quality]
        H[DuckDB<br/>Analytics]
    end
    subgraph "Public Access"
        I[CKAN<br/>Open Data Catalog]
        J[dati.gov.it]
    end
    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C --> D
    D --> E
    D --> G
    B -.metadata.-> F
    E --> F
    E --> H
    E --> I
    I --> J
```
The detailed rationale for technology choices — quantitative comparisons between the alternatives considered for each component, cost analysis, and constraint evaluation — is documented in Appendix A.
MAPS platform services are distributed as orchestrated containers on a managed Kubernetes cluster (DigitalOcean DOKS). The PostgreSQL + PostGIS database is managed as a managed service by the same provider, eliminating the operational overhead of backup, failover, and patching. Persistent storage for the Bronze layer is provided by block volumes. HTTPS traffic is terminated by a Load Balancer with Let's Encrypt certificates, distributed to exposed services via Ingress Controller.
The complete solution design — infrastructure choice rationale, sizing, service topology, security, and cost estimates — is documented in deliverable D2.1.2.
Data governance is a structural component of the MAPS architecture, not a subsequent addition. A project that integrates over 200 datasets from heterogeneous public sources, transforms them through three progressive layers, and publishes them as open data for scientific research purposes needs to know at all times where each piece of data comes from, what transformations it has undergone, and with what level of quality it has reached its destination. Without this traceability, the analytical results of WP3 — the map of Daily Life Systems, the classification of attractors — would not be verifiable or reproducible.
Governance is articulated in three complementary dimensions: lineage tracking (where the data comes from and how it was transformed), quality measurement (how complete, consistent, and reliable the data is), and cataloging (what datasets exist, who is responsible for them, how they are structured).
Lineage describes the journey of each piece of data from the external source to its final use. In the context of the Medallion architecture, this journey passes through at least three steps: acquisition in the Bronze layer, normalization in the Silver layer, and aggregation in the Gold layer. Each step may involve significant transformations — parsing a PDF, reconciling ISTAT codes for municipal mergers, calculating spatial aggregations — and lineage must document them all.
OpenMetadata is the chosen tool for lineage tracking. It integrates natively with PostgreSQL, from which it automatically reads the schema structure (tables, columns, types), and with Prefect, from which it receives pipeline execution metadata (flow runs, task runs, status, duration, errors). This dual integration allows the complete chain to be reconstructed: which flow produced which table, starting from which source data, in which execution, with what outcome.
Lineage is articulated at three levels of granularity. The dataset level tracks dependencies between tables: the Gold table comuni_aggregati depends on the Silver table territory_attributes, which in turn depends on the Bronze files of each source. The column level tracks the provenance of individual attributes: the popolazione_residente field in the Gold layer derives from the same-named attribute in the Silver layer, originating from the ISTAT population CSV file. The pipeline level tracks executions: which Prefect run updated a given table, when, with what duration, and with what outcome.
```mermaid
graph TB
    subgraph "External Sources"
        S1[ISTAT CSV<br/>Population]
        S2[MinAmbiente HTML<br/>ATO Gas]
        S3[MinSalute PDF<br/>ASL]
    end
    subgraph "Bronze Layer"
        B1[popolazione_2024.csv]
        B2[ato_gas_page.html]
        B3[elenco_asl.pdf]
    end
    subgraph "Silver Layer"
        SV[territory_attributes<br/>Generalized EAV Schema]
    end
    subgraph "Gold Layer"
        G1[comuni_aggregati]
        G2[dls_attractors]
        G3[isochrone_maps]
    end
    subgraph "Applications"
        A1[CKAN Open Data]
        A2[API Web]
        A3[DuckDB Analytics]
    end
    S1 -->|"Prefect: download"| B1
    S2 -->|"Prefect: download"| B2
    S3 -->|"Prefect: download"| B3
    B1 -->|"Prefect: parse CSV"| SV
    B2 -->|"Prefect: parse HTML"| SV
    B3 -->|"Prefect: Docling extract"| SV
    SV -->|"SQL: aggregate"| G1
    SV -->|"PostGIS: spatial"| G2
    SV -->|"PostGIS: isochrone"| G3
    G1 --> A1
    G1 --> A2
    G1 --> A3
    G2 --> A2
    G3 --> A2
    style B1 fill:#cd7f32,color:#000
    style B2 fill:#cd7f32,color:#000
    style B3 fill:#cd7f32,color:#000
    style SV fill:#c0c0c0,color:#000
    style G1 fill:#ffd700,color:#000
    style G2 fill:#ffd700,color:#000
    style G3 fill:#ffd700,color:#000
```
The practical value of lineage emerges in several operational scenarios. When an external source publishes an update, lineage immediately indicates which Silver and Gold tables must be recalculated, avoiding unnecessary re-executions or, worse, oversights. When an analysis produces unexpected results, lineage allows tracing back the chain and verifying whether the cause is an anomalous source datum, a transformation error, or a change in the source structure. When a dataset is published on CKAN, lineage provides the methodological documentation required by DCAT-AP_IT standards: provenance, transformations applied, date of last update.
Data quality is verified systematically at each stage of the pipeline, with validation rules defined using Great Expectations. Validations are not generic checks applied uniformly to all datasets: each source has its own expected quality profile, with thresholds and rules calibrated to the specific nature of the data.
In the Bronze → Silver transition, validations oversee structural integrity. For datasets at municipal granularity, the fundamental check is coverage: the number of municipalities present in the file must be consistent with the reference universe (7,896 municipalities as of 2024, with annual variations due to mergers and splits). ISTAT code validity (format, length, membership in the reference municipal registry), absence of duplicates, data type consistency (numeric values in numeric fields, dates in valid format), and completeness (percentage of null values for each attribute) are also verified.
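A plain-Python stand-in for such a validation suite can make the rules concrete. In the project these checks would be expressed as Great Expectations expectations; the rule names and row format below are illustrative.

```python
import re

EXPECTED_MUNICIPALITIES = 7896            # reference universe as of 2024
ISTAT_CODE = re.compile(r"^\d{6}$")       # six-digit municipal code

def validate_structure(rows):
    """Return the list of violated structural rules for a Bronze dataset.

    Minimal sketch of a Bronze → Silver validation suite (illustrative).
    """
    errors = []
    codes = [r["codice_istat"] for r in rows]
    if len(rows) != EXPECTED_MUNICIPALITIES:
        errors.append(f"coverage: {len(rows)} rows, expected {EXPECTED_MUNICIPALITIES}")
    if len(set(codes)) != len(codes):
        errors.append("uniqueness: duplicate ISTAT codes")
    bad = [c for c in codes if not ISTAT_CODE.match(c)]
    if bad:
        errors.append(f"validity: malformed codes {bad[:5]}")
    return errors

# A file with only 5,000 rows fails the coverage check and halts the pipeline
sample = [{"codice_istat": f"{i:06d}"} for i in range(1, 5001)]
errors = validate_structure(sample)
```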
In the Silver → Gold transition, validations focus on semantic consistency. Territorial aggregations must produce totals consistent with the source data (for example, the sum of the population of municipalities in a province must correspond to the provincial figure). PostGIS spatial operations must generate valid geometries (no self-intersections, complete territorial coverage). Time series must maintain continuity: a municipality that existed in 2023 must have a successor in 2024, or be documented as dissolved.
Validation results are recorded in OpenMetadata as quality metrics associated with each table. The five main metrics are: completeness (percentage of non-null values), uniqueness (absence of duplicate records), validity (conformity of values to the expected domain), consistency (compatibility between correlated attributes), and timeliness (distance between the source publication date and the acquisition date). These metrics are accessible in the internal catalog and allow assessing the health of each dataset at a glance.
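The first two metrics can be computed in a few lines of Python — an illustrative sketch of the definitions given above, not the OpenMetadata implementation; the helper and field names are ours.

```python
def quality_metrics(rows, key, attributes):
    """Compute completeness and uniqueness as percentages.

    completeness: share of non-null values per attribute
    uniqueness:   share of distinct key values over total rows
    """
    n = len(rows)
    completeness = {
        a: 100.0 * sum(1 for r in rows if r.get(a) is not None) / n
        for a in attributes
    }
    keys = [r[key] for r in rows]
    uniqueness = 100.0 * len(set(keys)) / n
    return {"completeness": completeness, "uniqueness": uniqueness}

rows = [
    {"codice_istat": "001001", "popolazione": 45230, "n_asili": 12},
    {"codice_istat": "001002", "popolazione": 8420, "n_asili": None},
]
m = quality_metrics(rows, "codice_istat", ["popolazione", "n_asili"])
# m["completeness"]["n_asili"] → 50.0; m["uniqueness"] → 100.0
```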
When a validation fails — for example, a CSV file contains only 5,000 municipalities instead of the expected 7,896 — the pipeline halts and the data is not propagated to the next layer. The error is recorded in the ingestion log, reported in the Prefect interface, and associated with the corresponding table in OpenMetadata. The Bronze data remains available for problem analysis, but the Silver layer is protected from partial or corrupted ingestions.
OpenMetadata acts as the internal catalog of the entire data platform. Every table, in each of the three layers, is registered with its metadata: structure (columns, types, constraints), provenance (source, pipeline that generated it), quality (metrics from the latest validation), responsibility (dataset owner), and documentation (textual description, methodological notes).
The catalog is populated partly automatically — OpenMetadata reads the structure of PostgreSQL tables and the metadata from Prefect executions — and partly manually, for information that requires domain knowledge: the description of an attribute, the calculation methodology of an indicator, caveats on the use of a dataset. This combination of automation and curation allows the catalog to be kept up to date without an excessive documentation burden.
The catalog serves three operational needs.
The MAPS project's ingestion pipelines follow three architectural principles that guarantee reliability and maintainability over time: operation idempotency, incremental data loading, and controlled schema evolution.
Every pipeline must be re-executable at any time without generating duplicates or corrupting existing data. This requirement is essential in a context where data sources are heterogeneous and partial failures are frequent: an interrupted download, a malformed PDF file, a network timeout can stop execution midway, and the pipeline must be able to restart from the beginning without side effects.
Idempotency is implemented at the database level through upsert operations: when a record is inserted into the Silver layer, the composite key (ISTAT code, attribute, source, dataset, reference year) determines whether it is new data or an update. If the record already exists, the value is overwritten and the update timestamp is recorded; if it does not exist, it is created. The result is identical whether on the first execution or on any subsequent execution: the final state of the database depends exclusively on the source data, not on the history of executions.
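The upsert semantics can be demonstrated with SQLite standing in for PostgreSQL (the `ON CONFLICT … DO UPDATE` clause is analogous in both). The table mirrors the composite key described above in simplified form; this is a sketch, not the project schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE territory_attributes (
    codice_istat TEXT, attribute TEXT, source TEXT, dataset TEXT, anno INTEGER,
    value TEXT, updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (codice_istat, attribute, source, dataset, anno)
)""")

def upsert(row):
    # ON CONFLICT on the composite key makes the load idempotent:
    # re-running the same ingestion overwrites rather than duplicates.
    con.execute("""
        INSERT INTO territory_attributes
            (codice_istat, attribute, source, dataset, anno, value)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (codice_istat, attribute, source, dataset, anno)
        DO UPDATE SET value = excluded.value, updated_at = CURRENT_TIMESTAMP
    """, row)

record = ("001001", "popolazione", "ISTAT", "pop_residente", 2024, "45230")
upsert(record)  # first execution
upsert(record)  # re-execution: same final state, no duplicates
count = con.execute("SELECT COUNT(*) FROM territory_attributes").fetchone()[0]
```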
In the Bronze layer, idempotency is guaranteed by the structure of the layer itself: downloaded files overwrite any previous versions at the same position in the file system, identified by the source/dataset/year combination. The SHA-256 hash of the file allows verifying whether the content has actually changed compared to the previous version, avoiding unnecessary transformations in cases where the source has not published updates.
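The hash-based change detection amounts to a few lines of `hashlib`; the function names below are ours, as a sketch of the skip-if-unchanged step.

```python
import hashlib
from typing import Optional

def sha256_bytes(data: bytes) -> str:
    """SHA-256 digest of a downloaded file's content."""
    return hashlib.sha256(data).hexdigest()

def needs_reprocessing(new_content: bytes, previous_hash: Optional[str]) -> bool:
    """Skip downstream transformation when the source file is byte-identical
    to the previously acquired version."""
    return sha256_bytes(new_content) != previous_hash

first = b"codice_istat;popolazione\n001001;45230\n"
h = sha256_bytes(first)                                      # stored with the Bronze file
unchanged = needs_reprocessing(first, h)                     # False -> skip transformation
changed = needs_reprocessing(first + b"001002;8420\n", h)    # True  -> reprocess
```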
The loading strategy differs between the Silver layer and the Gold layer, reflecting the different needs of the two strata.
The Silver layer adopts an incremental approach with temporal versioning. When a dataset is updated, the pipeline compares the new data with what is already present: new records are inserted, modified records are updated by closing the previous version (field valid_to) and opening a new version (field valid_from), unchanged records are left untouched. This strategy minimizes write volume and preserves the history of changes, making it possible to query the state of data at any past date.
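The close-previous-version/open-new-version logic can be sketched in memory — an illustrative model of the Silver-layer behaviour, not the actual SQL.

```python
from datetime import date

def apply_update(history, entity, attribute, value, as_of):
    """Temporal-versioning upsert: if the value changed, close the open
    version (set valid_to) and append a new one (valid_from = as_of);
    if unchanged, leave the record untouched. Illustrative sketch."""
    current = next(
        (h for h in history
         if h["entity"] == entity and h["attribute"] == attribute
         and h["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["value"] == value:
            return  # unchanged record: no write
        current["valid_to"] = as_of  # close the previous version
    history.append({"entity": entity, "attribute": attribute, "value": value,
                    "valid_from": as_of, "valid_to": None})

history = []
apply_update(history, "001001", "popolazione", 45230, date(2023, 1, 1))
apply_update(history, "001001", "popolazione", 45230, date(2024, 1, 1))  # no-op
apply_update(history, "001001", "popolazione", 45500, date(2025, 1, 1))  # new version
```

Querying the state at any past date then reduces to filtering on `valid_from <= d` and (`valid_to` is null or `valid_to > d`).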
The Gold layer instead adopts a snapshot approach: denormalized tables are fully rebuilt from Silver data every time an update requires it. Full reconstruction is preferable to incremental updating for two reasons. The first is the aggregated nature of Gold data: a change to a single Silver record can propagate across multiple Gold tables and multiple rows of each table, making point updates complex and fragile. The second is the contained volume: with approximately 8,000 municipalities and a few hundred attributes per table, full reconstruction takes a few minutes and does not justify the additional complexity of an incremental mechanism.
The architecture must handle two types of evolution over time: the addition of new data sources and modification of the structure of existing data.
Adding new sources is the most frequent and simplest case. The Silver layer's EAV schema is designed to accommodate new attributes without structural modifications: a new dataset translates into new rows in the territory_attributes table with a new value in the attribute field, without needing to add columns or alter tables. In the Gold layer, adding a new attribute requires a new column in the corresponding denormalized table, but this operation is non-breaking: the column is added with a default value and populated by the pipeline, with no impact on existing queries.
Modifying the structure of existing data is less frequent but more delicate. It can happen that an ISTAT source changes a file's format, adds or removes columns, or changes value encoding between one release and the next. In these cases, the Bronze → Silver transformation pipeline must be updated to handle the new format, maintaining compatibility with previous versions still present in the Bronze layer. The Silver layer's temporal versioning ensures that data already acquired with the old format remains valid and queryable, while new data follows the updated format.
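One way a transformation pipeline can absorb such a format change is to dispatch on the file header — the two layouts below are hypothetical examples, not actual ISTAT column names.

```python
import csv, io

def parse_population_csv(text):
    """Parse a population CSV, handling two known header layouts.

    Dispatching on the detected columns keeps old Bronze files parseable
    after the source changes format (layouts here are illustrative).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    fields = set(reader.fieldnames or [])
    if {"Codice comune", "Popolazione"} <= fields:      # legacy layout
        return [(r["Codice comune"], int(r["Popolazione"])) for r in reader]
    if {"codice_istat", "pop_residente"} <= fields:     # current layout
        return [(r["codice_istat"], int(r["pop_residente"])) for r in reader]
    raise ValueError(f"unrecognized layout: {sorted(fields)}")

old = "Codice comune;Popolazione\n001001;45230\n"
new = "codice_istat;pop_residente\n001001;45230\n"
# Both layouts normalize to the same (code, population) tuples
```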
OpenMetadata records every schema change as a lineage event, allowing tracking of when a table was altered, by which pipeline, and for what reason. This tracking is particularly useful during debugging: when an analysis produces results different from those expected, the history of schema changes makes it possible to verify whether the cause is a change in data structure rather than in the data itself.
A three-layer architecture with over 200 datasets and chained transformation pipelines requires a monitoring system that can answer three questions at all times: Is data entering correctly? Are transformations working? Is the produced data usable?
Monitoring is organized along the three layers of the Medallion architecture, with specific indicators for each.
For the Bronze layer, the main indicator is the acquisition success rate. Each execution of a download pipeline is recorded in the bronze.ingestion_log table with the outcome (success, failure, partial), the size of the downloaded file, and the content hash. Monitoring verifies that planned acquisitions are actually executed and that downloaded files have sizes consistent with expectations: an ISTAT population CSV file weighing a few kilobytes instead of the expected megabytes signals a problem at the source or in the download. Bronze storage growth is contained — in the order of a few gigabytes per year, given the semi-annual or annual update frequency of sources — and is monitored to ensure that persistent volumes have sufficient capacity.
For the Silver layer, the key indicators are ingestion latency and data quality. Latency measures the time elapsed between the completion of the Bronze download and the availability of the normalized data in the Silver layer: for most datasets (CSV, Excel) this time is in the order of minutes; for complex PDFs requiring extraction with Docling it can reach several tens of minutes. Quality metrics — completeness, uniqueness, validity — are those produced by Great Expectations validations and recorded in OpenMetadata, as described in the previous section. A deterioration of these indicators over time signals a change in the structure or quality of the source, requiring intervention on the transformation pipeline.
For the Gold layer, monitoring focuses on query performance and aggregation consistency. Response times of analytical queries are tracked via the PostgreSQL pg_stat_statements extension, with the objective of keeping the 95th percentile below 100 milliseconds for the most common queries. Reconstruction of Gold tables from the Silver layer must complete in reasonable times — in the order of minutes for the current project volume — and monitoring signals any slowdowns that may indicate data growth beyond planned thresholds or aggregation queries requiring optimization.
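The threshold check itself is simple once latency samples are available; below is a nearest-rank 95th-percentile computation as a sketch (in production the figure would come from `pg_stat_statements`, and the function name is ours).

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of query latencies, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank position
    return ordered[rank - 1]

latencies = list(range(1, 101))        # 1..100 ms, one sample each
within_slo = p95(latencies) < 100      # objective: p95 below 100 ms
```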
Prefect natively provides monitoring of pipeline execution, with visibility into the status of each flow run (running, completed, failed), the execution times of each task, and detailed logs in case of error. The Prefect web interface allows consulting execution history, identifying failed tasks, and re-executing pipelines with a single intervention.
Pipeline monitoring integrates with the lineage tracked in OpenMetadata: when a flow fails, it is immediately possible to trace which Silver and Gold tables have not been updated, and therefore which downstream data may be stale. This correlation between pipeline status and data status is particularly useful when multiple pipelines share destination tables: the failure of a single pipeline does not compromise the entire layer, but monitoring must signal which portions of the data have remained at the last successful update.
At the infrastructure level, monitoring of the Kubernetes cluster and the PostgreSQL database covers the classic resource utilization metrics: CPU, memory, disk I/O, storage space. DigitalOcean provides basic metrics for DOKS cluster nodes and for the managed database; for more detailed monitoring, the Prometheus + Grafana stack can be deployed in the cluster as an additional service.
Attention thresholds are calibrated on the load profiles described in the sizing section: sustained CPU utilization above 80% on worker nodes during ETL execution windows is normal; the same level in the absence of active pipelines indicates a problem. Database disk occupancy is monitored with a threshold at 70% of allocated capacity, leaving margin for growth and for PostgreSQL maintenance operations (vacuum, reindex) that require temporary space.