Deliverable D2.1.1: Data-Lake Technical Design Document
The MAPS Data Lake architecture adopts the Medallion pattern (also known as Multi-Hop Architecture), a best practice in the Modern Data Stack ecosystem that organizes data into three layers of progressive refinement with increasing levels of quality, cleanliness, and business-readiness:
EXTERNAL SOURCES → BRONZE → SILVER → GOLD → APPLICATIONS
For the MAPS project, the Medallion architecture is particularly well-suited for five main reasons.
Data source heterogeneity is the first determining factor. The project manages over 180 datasets from different public sources (ISTAT, Ministries, local authorities), distributed across multiple formats (CSV, Excel, PDF, JSON, HTML) with variable quality ranging from structured datasets to semi-structured documents. The layered approach allows this heterogeneity to be standardized progressively without losing the original information.
Audit and compliance requirements represent the second critical element. Although the data is public, scientific research requires complete traceability of provenance and transformations applied. Article 30 of the GDPR also requires documentation of data processing activities. The immutable Bronze layer provides the original "source of truth" needed to satisfy these requirements.
Transformation complexity requires a multi-stage architecture. Pipelines operate in multi-step mode, moving from acquisition from heterogeneous sources (CSV, Excel, PDF, JSON, HTML) to validation, then to normalization into EAV schema, and finally to territorial aggregations. Managing time series from 2010 to 2025 with municipal mergers and splits adds further complexity. The logical separation between raw acquisition (Bronze), cleaning (Silver), and business logic (Gold) simplifies the development and debugging of transformations.
Data reuse justifies the introduction of a validated intermediate layer. The same datasets are used for multiple analyses (demographics, services, mobility), making re-processing from the external source for each use inefficient. The Silver layer acts as a validated enterprise cache, accelerating downstream analyses.
Performance and scalability constitute the final enabling factor. Analytical queries operate on over 8,000 municipalities with dozens of attributes each, while PostGIS spatial operations are computationally intensive. The denormalized Gold layer optimizes the performance of frequent queries, reducing response times from seconds to milliseconds.
graph TB
A[External Sources
ISTAT, GTFS, Ministries, OpenData] --> B[Prefect Orchestration
Worker Pools: istat, pdf, analytics]
B --> C[Bronze Layer
File System - Raw Files]
C --> D[Silver Layer
PostgreSQL - EAV Schema]
D --> E[Gold Layer
PostgreSQL + PostGIS]
B -.metadata.-> F[OpenMetadata
Internal Governance]
E --> F
E --> G[DuckDB
Analytics]
E --> H[API v2
Web Apps]
E --> I[CKAN
Open Data Catalog]
I --> J[dati.gov.it
National Portal]
style C fill:#cd7f32
style D fill:#c0c0c0
style E fill:#ffd700
Data flows from external sources through three progressive layers, orchestrated by Prefect. The Bronze layer preserves the original files; the Silver layer normalizes them into EAV schema on PostgreSQL; the Gold layer produces analytical tables with PostGIS spatial extensions. Published data reaches the national portal via CKAN, while OpenMetadata tracks governance throughout the entire flow.
Role: Immutable archive of original data exactly as downloaded from the source.
Key principle: "Never modify, always preserve"
| Aspect | MAPS Specification |
|---|---|
| Storage | Local file system /data/bronze/ (op-linkurious server) |
| Format | Original files without transformations (CSV, XLSX, PDF, JSON, HTML, Parquet) |
| Naming convention | /data/bronze/{source}/{year}/{dataset}_{timestamp}.{ext} |
| Retention policy | Indefinite (storage cost is low: ~50GB total for 180 datasets) |
| Backup | Daily snapshots via backup.sh script |
| Access mode | Write-once, read-rarely (only for reprocessing or audit) |
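The naming convention in the table above lends itself to a small path-building helper. The following sketch is illustrative (the function name and `BRONZE_ROOT` constant are hypothetical, not part of the MAPS codebase) and stamps the UTC acquisition date as in the directory layout shown below:

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional

BRONZE_ROOT = PurePosixPath("/data/bronze")

def bronze_path(source: str, year: int, dataset: str, ext: str,
                now: Optional[datetime] = None) -> PurePosixPath:
    """Build /data/bronze/{source}/{year}/{dataset}_{timestamp}.{ext},
    stamping the UTC acquisition date (YYYYMMDD)."""
    ts = (now or datetime.now(timezone.utc)).strftime("%Y%m%d")
    return BRONZE_ROOT / source / str(year) / f"{dataset}_{ts}.{ext}"

p = bronze_path("istat", 2024, "popolazione_comuni", "csv",
                now=datetime(2024, 2, 18, tzinfo=timezone.utc))
print(p)  # → /data/bronze/istat/2024/popolazione_comuni_20240218.csv
```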
/data/bronze/
├── istat/
│ ├── 2024/
│ │ ├── popolazione_comuni_20240218.csv
│ │ ├── pendolarismo_matrix_20240218.csv
│ │ ├── confini_amministrativi_20240218.geojson
│ │ └── _metadata/
│ │ ├── popolazione_comuni_20240218.json # Metadata file
│ │ └── checksums.sha256
│ └── 2023/
│ └── popolazione_comuni_20230315.csv
├── minlavoro/
│ └── 2024/
│ ├── tabacchi_adm_report_20240218.pdf
│ └── _metadata/
│ └── tabacchi_adm_report_20240218.json
├── minsalute/
│ └── 2023/
│ ├── strutture_sanitarie_asl_20231120.xlsx
│ └── _metadata/
│ └── strutture_sanitarie_asl_20231120.json
└── minambiente/
└── 2024/
├── ato_gas_page_20240115.html # ← HTML file
└── _metadata/
└── ato_gas_page_20240115.json
Tracking occurs at two complementary levels: a local JSON file alongside each acquired file, and a centralized table in PostgreSQL that records the status of all acquisitions.
Each Bronze file has a corresponding JSON file with acquisition metadata:
Example: /data/bronze/istat/2024/_metadata/popolazione_comuni_20240218.json
{
"file_path": "/data/bronze/istat/2024/popolazione_comuni_20240218.csv",
"fonte": "istat",
"dataset": "popolazione",
"anno_riferimento": 2024,
"download_timestamp": "2024-02-18T03:00:15Z",
"download_url": "https://www.istat.it/storage/cartografia/popolazione_comuni_2024.csv",
"file_size_bytes": 3355482,
"file_hash_sha256": "a3f5b8c9d2e1f4a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0",
"mime_type": "text/csv",
"encoding": "UTF-8",
"rows_detected": 7901,
"columns_detected": 12,
"prefect_flow_run_id": "abc123-456-def-789",
"ingestion_status": "completed"
}
The file is written by the Prefect flow at the end of each acquisition and resides in the _metadata/ subdirectory next to the corresponding data file. It records essential operational information: provenance, integrity hash, status, and the identifier of the flow that performed the download. It serves three functions in the architecture: it enables targeted reprocessing (if an error occurs in the Silver layer, the pipeline identifies the Bronze file to reprocess without re-downloading from the source); it guarantees the audit trail via SHA256 hash; and it provides the technical metadata that the flow publishes to OpenMetadata via API, feeding the centralized governance catalog.
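The hashing and sidecar-writing steps described above can be sketched as follows. This is a minimal illustration, not the actual MAPS flow code: the helper names are hypothetical, and the usage example runs on a throwaway file with made-up content.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large Bronze files never load fully into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_metadata(data_file: Path, extra: dict) -> Path:
    """Write the sidecar JSON into the _metadata/ directory next to the data file."""
    meta_dir = data_file.parent / "_metadata"
    meta_dir.mkdir(exist_ok=True)
    meta = {
        "file_path": str(data_file),
        "file_size_bytes": data_file.stat().st_size,
        "file_hash_sha256": sha256_of(data_file),
        **extra,  # provenance fields supplied by the flow (fonte, dataset, ...)
    }
    out = meta_dir / (data_file.stem + ".json")
    out.write_text(json.dumps(meta, indent=2))
    return out

# Usage on a throwaway file (illustrative content, not real ISTAT data)
tmp = Path(tempfile.mkdtemp()) / "popolazione_comuni_20240218.csv"
tmp.write_bytes(b"codice,popolazione\n001001,2635\n")
meta = json.loads(write_metadata(tmp, {"fonte": "istat",
                                       "ingestion_status": "completed"}).read_text())
```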
Database tracking (PostgreSQL):
While the JSON is local to the file system, the bronze.ingestion_log table centralizes the status of all acquisitions, enabling aggregate queries by source, dataset, and status. It acts as a registry that Prefect can query to manage re-executions and guarantee deduplication via the uniqueness constraint on the file hash.
-- Schema: bronze
CREATE TABLE bronze.ingestion_log (
id BIGSERIAL PRIMARY KEY,
fonte VARCHAR(50) NOT NULL,
dataset VARCHAR(100) NOT NULL,
anno_rif INTEGER NOT NULL,
file_path TEXT NOT NULL,
file_size_bytes BIGINT,
file_hash_sha256 CHAR(64),
download_url TEXT,
download_timestamp TIMESTAMP NOT NULL,
prefect_flow_run_id UUID,
status VARCHAR(20) NOT NULL, -- 'completed', 'failed', 'in_progress'
error_message TEXT,
created_at TIMESTAMP DEFAULT NOW(),
UNIQUE (fonte, dataset, anno_rif, file_hash_sha256)
);
-- Index for frequent queries
CREATE INDEX idx_ingestion_log_fonte_dataset ON bronze.ingestion_log(fonte, dataset);
CREATE INDEX idx_ingestion_log_status ON bronze.ingestion_log(status);
CREATE INDEX idx_ingestion_log_timestamp ON bronze.ingestion_log(download_timestamp DESC);
The Bronze layer provides three fundamental guarantees: immutability, because original files are never modified after acquisition; traceability, through the SHA256 hash and sidecar metadata that document the provenance of every file; and deduplication, enforced by the UNIQUE (fonte, dataset, anno_rif, file_hash) constraint.
Some Italian public sources publish data exclusively as HTML tables embedded in web pages, without making downloadable CSV or Excel files available. This is the case for several regional registries of accredited healthcare facilities, lists of public bodies, and sections of ministerial portals where data is presented for online consultation but cannot be exported directly. This type of source requires a dedicated acquisition approach, based on HTTP GET of the page and parsing with BeautifulSoup, before proceeding to data normalization.
The Bronze-HTML flow follows the same logic as other formats: the HTML page is saved in full to the Bronze layer before any transformation.
sequenceDiagram
participant S as Source Website
participant P as Prefect Flow
participant B as Bronze Layer
participant SI as Silver Layer
S->>P: HTTP GET HTML page
P->>B: Save raw HTML file
P->>B: Log metadata JSON
Note over B: HTML preserved unchanged
P->>P: Parse HTML with BeautifulSoup
P->>P: Clean & Validate DataFrame
P->>SI: Load to EAV schema
The advantage of the Bronze pattern over direct acquisition without intermediate archiving is resilience: if the parser fails because the HTML structure of the page has changed, the code can be corrected and the already-saved file reprocessed, without returning to the source. If the source server is temporarily unreachable or the page is removed, the Bronze layer preserves the original copy and allows reprocessing. The history of HTML versions acquired over time also makes it possible to detect retroactive changes to published data.
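The BeautifulSoup parsing step can be sketched as follows. This is a simplified version of the parse_html_to_dataframe function mentioned elsewhere in this document: it takes the HTML text directly rather than a Bronze file path, and assumes the simplest case of a single table with header cells; real pages typically require selecting the table by id or class.

```python
import pandas as pd
from bs4 import BeautifulSoup

def parse_html_to_dataframe(html: str) -> pd.DataFrame:
    """Extract the first <table> of a saved Bronze HTML page into a DataFrame.
    Sketch only: assumes <th> headers in the first row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")
        if tr.find("td")  # skip the header row, which has no <td> cells
    ]
    return pd.DataFrame(rows, columns=headers)

# Illustrative fragment of an ATO-style registry page (not real data)
html = """<table>
  <tr><th>denominazione</th><th>ato_gas_id</th></tr>
  <tr><td>AGLIE</td><td>ATO-PIE-01</td></tr>
</table>"""
df = parse_html_to_dataframe(html)
print(df.shape)  # → (1, 2)
```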
Role: Single reference source for all analytical processes in the project. Data is clean, validated, and normalized from the Bronze layer.
Key principle: "Verify before trusting, then archive"
| Aspect | MAPS Specification |
|---|---|
| Storage | PostgreSQL schema silver |
| Format | Normalized relational tables (EAV schema) |
| Naming convention | Schema silver, core tables: territory_types, territories, territory_identifiers, territory_names, territory_containments, territory_attributes |
| Retention policy | Temporal versioning (valid_from, valid_to) for time series |
| Backup | Daily PostgreSQL snapshots + WAL archiving |
| Access mode | Read-heavy (analytical queries), write-moderate (batch ingestion) |
The MAPS project acquires over 180 datasets with heterogeneous structures: population, healthcare facilities, transport infrastructure, environmental data. Adopting a traditional relational model would require a separate table for each dataset, making it difficult to manage changes over time and to add new sources. The EAV (Entity-Attribute-Value) model solves this problem by representing each observation as a row (territory_id, attribute, value): the schema remains stable as acquired datasets change, and new attributes are added without modifications to the database structure.
A further requirement emerged during the project: data is not always available at municipal level. Some datasets are published at provincial or regional level; others refer to functional aggregations such as Labour Market Areas (Sistemi Locali del Lavoro, SLL) or the new Homogeneous Local Systems (Sistemi Locali Omogenei, SLO) that this project aims to define. The Silver schema therefore supports territorial entities of any type, not just municipalities.
The model is articulated in six core tables: territory_types for the registry of entity types; territories for the master registry of all territorial entities, each with a stable surrogate id that never changes even when ISTAT codes are reassigned; territory_identifiers for temporal ISTAT codes and other external identifiers; territory_names for official and alternate names with language support; territory_containments for temporal hierarchical containment (comune → provincia, with precise dates for province changes); and territory_attributes for all observed values with complete traceability. Administrative boundary change events (mergers, incorporations, splits) are tracked through territory_relationships, which stores predecessor/successor links derived from ISTAT variazioni amministrative data.
erDiagram
TERRITORY_TYPES ||--o{ TERRITORIES : "typed as"
TERRITORIES ||--o{ TERRITORY_IDENTIFIERS : "has identifiers"
TERRITORIES ||--o{ TERRITORY_NAMES : "has names"
TERRITORIES ||--o{ TERRITORY_CONTAINMENTS : "member"
TERRITORIES ||--o{ TERRITORY_CONTAINMENTS : "container"
TERRITORIES ||--o{ TERRITORY_ATTRIBUTES : "has attributes"
TERRITORY_TYPES {
varchar code PK
varchar label
varchar authority
text coding_scheme
integer hierarchy_level
}
TERRITORIES {
serial id PK
varchar type_code FK
varchar label
date valid_from
date valid_to
text end_reason
}
TERRITORY_IDENTIFIERS {
serial id PK
integer territory_id FK
varchar scheme
varchar identifier
date valid_from
date valid_to
}
TERRITORY_NAMES {
serial id PK
integer territory_id FK
varchar name
varchar language
varchar name_type
date valid_from
date valid_to
}
TERRITORY_CONTAINMENTS {
serial id PK
integer member_id FK
integer container_id FK
date valid_from
date valid_to
}
TERRITORY_ATTRIBUTES {
serial id PK
integer territory_id FK
varchar type_code
varchar attribute
text value
varchar data_type
varchar source
date valid_from
date valid_to
}
The territory_attributes table is the heart of the Silver schema. Each row represents a single observed value for a territorial entity over a validity period, with complete traceability of the source. The valid_from and valid_to columns implement temporal versioning according to the SCD Type 2 pattern. Crucially, territory_attributes references the stable surrogate territory_id rather than the ISTAT code directly: when a municipality changes province and receives a new ISTAT code, the attribute history is preserved without any data migration.
-- Territory type registry
CREATE TABLE silver.territory_types (
code VARCHAR(30) PRIMARY KEY,
label VARCHAR(100) NOT NULL,
authority VARCHAR(100), -- "ISTAT", "SNAI", "MAPS", ...
coding_scheme TEXT,
hierarchy_level INTEGER -- 1=nation ... 5=municipality; NULL=functional
);
-- Master registry — stable surrogate id, never changes
CREATE TABLE silver.territories (
id SERIAL PRIMARY KEY,
type_code VARCHAR(30) NOT NULL REFERENCES silver.territory_types(code),
label VARCHAR(255),
valid_from DATE, -- from variazioni CS events; NULL = unknown start
valid_to DATE, -- from variazioni ES events; NULL = still active
end_reason TEXT
);
-- Temporal external identifiers (ISTAT codes, cadastral codes, ...)
CREATE TABLE silver.territory_identifiers (
id SERIAL PRIMARY KEY,
territory_id INTEGER NOT NULL REFERENCES silver.territories(id),
scheme VARCHAR(50) NOT NULL, -- 'istat', 'catasto', 'fiscale', 'uts', ...
identifier VARCHAR(50) NOT NULL,
valid_from DATE,
valid_to DATE,
fonte VARCHAR(100)
);
-- Temporal names (official + bilingual + alternate)
CREATE TABLE silver.territory_names (
id SERIAL PRIMARY KEY,
territory_id INTEGER NOT NULL REFERENCES silver.territories(id),
name TEXT NOT NULL,
language VARCHAR(10), -- 'it', 'de', 'fr', 'sl', ...
name_type VARCHAR(50), -- 'official', 'local', 'alias', ...
valid_from DATE,
valid_to DATE
);
-- Temporal hierarchical containment (comune → provincia, with province-change dates)
CREATE TABLE silver.territory_containments (
id SERIAL PRIMARY KEY,
member_id INTEGER NOT NULL REFERENCES silver.territories(id),
container_id INTEGER NOT NULL REFERENCES silver.territories(id),
valid_from DATE,
valid_to DATE
);
-- Generalized EAV — references stable territory_id, not ISTAT code
CREATE TABLE silver.territory_attributes (
id SERIAL PRIMARY KEY,
territory_id INTEGER NOT NULL REFERENCES silver.territories(id),
type_code VARCHAR(30) NOT NULL, -- denormalised from territories for query convenience
attribute VARCHAR(255) NOT NULL,
value TEXT,
data_type VARCHAR(50), -- integer, float, string, boolean, date
source VARCHAR(100),
valid_from DATE,
valid_to DATE, -- NULL = current value
UNIQUE (territory_id, attribute, valid_from)
);
CREATE INDEX idx_ta_territory ON silver.territory_attributes(territory_id);
CREATE INDEX idx_ta_type ON silver.territory_attributes(type_code);
CREATE INDEX idx_ta_attribute ON silver.territory_attributes(attribute);
CREATE INDEX idx_ta_source ON silver.territory_attributes(source);
To make the model concrete, a real example is provided: the municipality of AGLIÈ (ISTAT code 001001), with attributes from three distinct sources, alongside a provincial-level record and a Labour Market Area record. All entity types coexist in the same table thanks to the generalised EAV schema. Territory resolution always goes through the stable territory_id FK; the ISTAT code is looked up in territory_identifiers.
-- First, resolve territory_id for AGLIÈ (ISTAT code 001001)
-- This join is handled by territory_resolver.py at transform time
SELECT t.id FROM silver.territories t
JOIN silver.territory_identifiers ti ON ti.territory_id = t.id
WHERE t.type_code = 'comune' AND ti.scheme = 'istat' AND ti.identifier = '001001';
-- → id = 42 (example)
-- Municipality AGLIÈ (territory_id=42) — attributes from three sources
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, data_type, source, valid_from)
VALUES
(42, 'comune', 'popolazione', '2635', 'integer', 'ISTAT', '2024-01-01'),
(42, 'comune', 'superficie_kmq', '13.98', 'float', 'ISTAT', '2024-01-01'),
(42, 'comune', 'num_tabacchi', '2', 'integer', 'ADM', '2024-01-01'),
(42, 'comune', 'num_asl', '1', 'integer', 'MinSalute', '2023-01-01'),
(42, 'comune', 'ato_gas_id', 'ATO-PIE-01', 'string', 'MinAmbiente', '2024-01-01');
-- Province of Turin (territory_id=107) — provincial-level dataset
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, data_type, source, valid_from)
VALUES
(107, 'provincia', 'popolazione', '2259523', 'integer', 'ISTAT', '2023-01-01'),
(107, 'provincia', 'pil_pro_capite', '32400', 'integer', 'ISTAT', '2022-01-01');
-- Labour Market Area (territory_id=8) — functional aggregation
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, data_type, source, valid_from)
VALUES
(8, 'sll', 'addetti_industria', '15200', 'integer', 'ISTAT', '2021-01-01');
Each dataset acquired in the Bronze layer is transformed into Silver through a standard Prefect pipeline in four phases: parsing the source file in its original format (CSV, Excel, PDF, HTML), cleaning and normalizing values, validation with Great Expectations, and loading into the EAV schema. This structure is uniform for all datasets, regardless of source.
from prefect import task
import pandas as pd

@task(name="bronze-to-silver-transform")
def transform_bronze_to_silver(
    bronze_file_path: str,
    entity_type: str,  # 'comune', 'provincia', 'regione', 'sll', ... — used to resolve territory_id
    fonte: str,
    dataset: str,
    anno: int
):
    """
    Standard transformation pipeline: Bronze → Silver
    Steps:
    1. Parse: Read Bronze file (CSV/PDF/Excel/HTML)
    2. Clean: Normalize encoding, trim whitespace, fix typos
    3. Validate: Check data quality, enforce business rules
    4. Load: Resolve territory_id and insert into silver.territory_attributes
    """
    # Step 1: Parse
    if bronze_file_path.endswith('.csv'):
        df = pd.read_csv(bronze_file_path)
    elif bronze_file_path.endswith('.pdf'):
        df = extract_pdf_tables(bronze_file_path)  # Docling
    elif bronze_file_path.endswith('.xlsx'):
        df = pd.read_excel(bronze_file_path)
    elif bronze_file_path.endswith('.html'):
        df = parse_html_to_dataframe(bronze_file_path)  # BeautifulSoup
    else:
        raise ValueError(f"Unsupported Bronze format: {bronze_file_path}")
    # Step 2: Clean
    df = clean_dataframe(df)
    # Step 3: Validate with Great Expectations
    validation_results = validate_dataframe(df, fonte, dataset)
    if not validation_results.passed:
        raise ValueError(f"Validation failed: {validation_results.errors}")
    # Step 4: Resolve territory_id and load to Silver EAV
    load_to_territory_attributes(df, entity_type, fonte, anno)
In the 2010-2025 period, approximately 100 municipal mergers or splits occurred. When two municipalities merge into a new one, the historical data of the pre-existing entities remains valid for the period preceding the merger: it is not deleted or overwritten, but temporally bounded via the valid_from and valid_to columns. The new municipality receives its own records from its establishment date. Queries can therefore interrogate the state of the territory at any historical date with a simple temporal filter.
The following example shows a merger that occurred in 2019: municipalities 001234 and 001235 merge into the new municipality 001236. In the surrogate identity model, the two original settlements retain their territories rows (now with valid_to set); the new municipality gets a new row. The relationship is recorded in territory_relationships.
-- territory_id=10 → comune 001234 (valid_to 2019-01-01)
-- territory_id=11 → comune 001235 (valid_to 2019-01-01)
-- territory_id=12 → comune 001236 (valid_from 2019-01-01, valid_to NULL)
-- Attributes before the merger (valid_to closes on merger date)
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, source, valid_from, valid_to)
VALUES
(10, 'comune', 'popolazione', '1500', 'ISTAT', '2018-01-01', '2019-01-01'),
(11, 'comune', 'popolazione', '800', 'ISTAT', '2018-01-01', '2019-01-01');
-- Attributes after the merger
INSERT INTO silver.territory_attributes
(territory_id, type_code, attribute, value, source, valid_from, valid_to)
VALUES
(12, 'comune', 'popolazione', '2300', 'ISTAT', '2019-01-01', NULL);
-- Query: Population of the settlement with ISTAT code 001234 in 2018
SELECT ta.value FROM silver.territory_attributes ta
JOIN silver.territory_identifiers ti ON ti.territory_id = ta.territory_id
WHERE ti.scheme = 'istat' AND ti.identifier = '001234'
AND ta.attribute = 'popolazione'
AND '2018-12-31' BETWEEN ta.valid_from AND COALESCE(ta.valid_to, '9999-12-31');
-- Result: 1500
Role: Data optimized for specific applications — aggregated, denormalized, enriched with spatial geometries. This is the layer that directly feeds dashboards, APIs, and SLO algorithms.
Key principle: "Optimize for queries, not for space"
| Aspect | MAPS Specification |
|---|---|
| Storage | PostgreSQL schema gold with PostGIS extensions |
| Format | Denormalized tables (wide table) with spatial geometries |
| Naming convention | Schema gold, thematic tables by domain (municipalities, attractors, time series) |
| Retention policy | Snapshots updated periodically (daily or weekly); rebuildable from Silver layer |
| Backup | Daily snapshots (but rebuildable from Silver layer) |
| Access mode | High-frequency read-only (aggregate queries, point access via API, full scans for SLO algorithms) |
A data mart is a thematic, optimized view of data designed to answer a specific set of analytical questions. In the Gold layer, each data mart is a denormalized table that aggregates information from multiple Silver sources into a single structure, eliminating the need for complex joins at query time. This approach sacrifices storage space — data is duplicated compared to Silver — in favor of query speed and ease of use for application and algorithm developers.
The MAPS Gold layer defines three main data marts, each oriented toward a distinct analytical domain.
A. gold.comuni_aggregati
This is the central table of the Gold layer: a denormalized representation of each Italian municipality, with all relevant attributes collected in a single row. Each record integrates demographic data from ISTAT, service counts from the Ministries, geographic information with PostGIS geometry, and attributes calculated by the SLO algorithm (attractiveness level, 60-minute isochrone). This structure enables complex analytical queries in under 10 ms, with no joins between tables.
CREATE TABLE gold.comuni_aggregati (
-- Identifiers
codice_istat VARCHAR(6) PRIMARY KEY,
denominazione VARCHAR(255) NOT NULL,
denominazione_full VARCHAR(255), -- With province abbreviation
-- Administrative hierarchy
codice_regione CHAR(2) NOT NULL,
denominazione_regione VARCHAR(100) NOT NULL,
codice_provincia CHAR(3),
denominazione_provincia VARCHAR(100),
sigla_provincia CHAR(2),
-- Demographic attributes (from ISTAT)
popolazione_2024 INTEGER,
popolazione_2023 INTEGER,
popolazione_2022 INTEGER,
crescita_popolazione_pct NUMERIC(5,2), -- Calculated: (2024-2023)/2023*100
-- Geographic attributes
superficie_kmq NUMERIC(10,2),
densita_abitanti_kmq NUMERIC(10,2), -- Calculated: pop/superficie
altitudine_m INTEGER,
zona_altimetrica VARCHAR(50), -- 'montagna', 'collina', 'pianura'
-- PostGIS geometry
geometria GEOMETRY(MultiPolygon, 4326) NOT NULL,
centroide GEOMETRY(Point, 4326),
-- Services (from Ministries)
num_strutture_sanitarie INTEGER DEFAULT 0,
num_asl INTEGER DEFAULT 0,
num_scuole_primarie INTEGER DEFAULT 0,
num_scuole_secondarie INTEGER DEFAULT 0,
num_tabacchi INTEGER DEFAULT 0,
num_uffici_postali INTEGER DEFAULT 0,
-- Infrastructure (from OpenData)
ha_stazione_ferroviaria BOOLEAN DEFAULT FALSE,
ha_casello_autostradale BOOLEAN DEFAULT FALSE,
ha_aeroporto BOOLEAN DEFAULT FALSE,
-- Territorial areas (from public registries)
ato_gas_id VARCHAR(50),
ato_gas_denominazione VARCHAR(255),
ato_gas_gestore VARCHAR(255),
ato_acqua_id VARCHAR(50),
ato_rifiuti_id VARCHAR(50),
-- DLS attributes (Calculated)
attractor_level VARCHAR(50), -- 'metropolitan', 'urban', 'semi-urban', 'rural'
cluster_dls_id INTEGER,
isochrone_60min GEOMETRY(MultiPolygon, 4326),
-- Metadata
last_updated TIMESTAMP NOT NULL DEFAULT NOW(),
data_completeness_pct NUMERIC(5,2), -- % populated attributes
-- Constraints
CONSTRAINT valid_codice CHECK (codice_istat ~ '^\d{6}$')
);
-- Spatial indexes
CREATE INDEX idx_gold_comuni_geometria ON gold.comuni_aggregati USING GIST(geometria);
CREATE INDEX idx_gold_comuni_centroide ON gold.comuni_aggregati USING GIST(centroide);
CREATE INDEX idx_gold_comuni_isochrone ON gold.comuni_aggregati USING GIST(isochrone_60min);
-- Attribute indexes
CREATE INDEX idx_gold_comuni_regione ON gold.comuni_aggregati(codice_regione);
CREATE INDEX idx_gold_comuni_provincia ON gold.comuni_aggregati(codice_provincia);
CREATE INDEX idx_gold_comuni_attractor ON gold.comuni_aggregati(attractor_level);
The following example query shows how the denormalized structure allows filtering by gas territorial area and population threshold with a single statement, no joins, and an estimated execution time under 10 ms.
-- Query: Municipalities in ATO Gas "ATO-PIE-01" with population > 5000
SELECT
denominazione,
popolazione_2024,
ato_gas_gestore,
num_tabacchi,
attractor_level
FROM gold.comuni_aggregati
WHERE ato_gas_id = 'ATO-PIE-01'
AND popolazione_2024 > 5000
ORDER BY popolazione_2024 DESC;
-- Execution: Index scan on ato_gas_id, no joins, < 10ms
B. gold.dls_attractors
Contains the results of the territorial attractor analysis produced by the SLO algorithm. Each row represents a municipality classified according to its attractiveness level (metropolitan, urban, semi-urban, rural), with partial scores by service category, the 60-minute isochrone geometry, and the list of reachable municipalities. This table is the main output of Work Package 3 and directly feeds the cartographic visualizations and the territorial policy simulator.
CREATE TABLE gold.dls_attractors (
codice_istat VARCHAR(6) PRIMARY KEY REFERENCES gold.comuni_aggregati(codice_istat),
denominazione VARCHAR(255) NOT NULL,
-- Attractor classification
attractor_level VARCHAR(50) NOT NULL, -- 'metropolitan', 'urban', 'semi-urban', 'rural'
attractor_score NUMERIC(5,2), -- 0-100 score
-- Service availability (weighted scores)
servizi_sanitari_score NUMERIC(5,2),
servizi_educativi_score NUMERIC(5,2),
servizi_commerciali_score NUMERIC(5,2),
servizi_trasporti_score NUMERIC(5,2),
-- Isochrone analysis (60 min travel time)
isochrone_60min GEOMETRY(MultiPolygon, 4326),
comuni_raggiungibili_60min INTEGER[], -- Array of ISTAT codes
popolazione_raggiungibile_60min INTEGER,
-- Clustering
cluster_id INTEGER NOT NULL,
cluster_centroid GEOMETRY(Point, 4326),
-- Metadata
calculation_date TIMESTAMP NOT NULL DEFAULT NOW(),
CONSTRAINT valid_attractor_level CHECK (attractor_level IN ('metropolitan', 'urban', 'semi-urban', 'rural'))
);
C. Additional data marts
The two data marts described above cover the analytical requirements identified in the design phase. The Gold layer is however designed to accommodate additional thematic tables as the project's needs grow. New data marts can be added without modifying the Silver structure, simply by defining a new refresh function that reads from the EAV layer and materializes the data in the required form. Possible examples include views for mobility flow analysis, support tables for the territorial policy simulator (WP5), or thematic aggregations for specific territorial disadvantage indicators.
The Gold data marts are the primary data access interface for the project, but with different modes depending on the type of use. REST APIs and SLO algorithms query the Gold tables directly via PostgreSQL, leveraging spatial indexes and low-latency point queries. For exploratory analysis — Jupyter notebooks, researchers' Python scripts, hypothesis testing — DuckDB is used in embedded mode instead.
DuckDB reads Gold tables from PostgreSQL via direct federation, without copying any data, using standard SQL syntax:
import duckdb
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("""
ATTACH 'dbname=maps_db host=localhost user=maps password=...'
AS maps (TYPE postgres, READ_ONLY);
""")
# Query on the Gold layer: high-attractiveness municipalities in mountain areas
result = con.execute("""
SELECT denominazione, attractor_score, popolazione_2024, zona_altimetrica
FROM maps.gold.dls_attractors a
JOIN maps.gold.comuni_aggregati c USING (codice_istat)
WHERE a.attractor_level = 'urban'
AND c.zona_altimetrica = 'montagna'
ORDER BY a.attractor_score DESC
LIMIT 20
""").df()When data must be distributed to external researchers or archived for offline analysis, the Gold layer can be exported in Parquet format — a compressed columnar format that DuckDB reads directly, without a database, with performance equivalent to in-memory.
The Gold layer is populated through a SQL function scheduled by Prefect on a daily basis. The function executes a complete table recreation cycle: first it empties the existing content with TRUNCATE, then rebuilds it with a pivot query that transforms EAV rows from the Silver layer into columns of the denormalized Gold table. At the end, an ANALYZE command updates the query optimizer statistics.
-- Refresh Gold tables from Silver (scheduled daily)
CREATE OR REPLACE FUNCTION gold.refresh_comuni_aggregati()
RETURNS void AS $$
BEGIN
-- Truncate and rebuild (snapshot approach)
TRUNCATE gold.comuni_aggregati;
-- Pivot EAV → Wide table
INSERT INTO gold.comuni_aggregati
(codice_istat, denominazione, geometria, centroide,
 popolazione_2024, popolazione_2023, superficie_kmq, crescita_popolazione_pct,
 ato_gas_id, ato_gas_gestore, last_updated)
SELECT
c.codice_istat,
c.denominazione,
c.geometria,
ST_Centroid(c.geometria) AS centroide,
-- Pivot attributes from Silver
MAX(CASE WHEN e.attribute = 'popolazione_2024' THEN e.value::INTEGER END) AS popolazione_2024,
MAX(CASE WHEN e.attribute = 'popolazione_2023' THEN e.value::INTEGER END) AS popolazione_2023,
MAX(CASE WHEN e.attribute = 'superficie_kmq' THEN e.value::NUMERIC END) AS superficie_kmq,
-- Calculated fields
(MAX(CASE WHEN e.attribute = 'popolazione_2024' THEN e.value::NUMERIC END) -
MAX(CASE WHEN e.attribute = 'popolazione_2023' THEN e.value::NUMERIC END)) /
NULLIF(MAX(CASE WHEN e.attribute = 'popolazione_2023' THEN e.value::NUMERIC END), 0) * 100
AS crescita_popolazione_pct,
-- Territorial areas
MAX(CASE WHEN e.attribute = 'ato_gas_id' THEN e.value END) AS ato_gas_id,
MAX(CASE WHEN e.attribute = 'ato_gas_gestore' THEN e.value END) AS ato_gas_gestore,
NOW() AS last_updated
FROM
gold.comuni_anagrafica c
LEFT JOIN silver.territory_identifiers ti ON ti.scheme = 'istat' AND ti.identifier = c.codice_istat
LEFT JOIN silver.territory_attributes e ON e.territory_id = ti.territory_id
AND e.valid_to IS NULL -- Current data only
GROUP BY
c.codice_istat, c.denominazione, c.geometria;
-- Analyze for query optimizer
ANALYZE gold.comuni_aggregati;
END;
$$ LANGUAGE plpgsql;
-- Schedule refresh (called by Prefect flow daily)
SELECT gold.refresh_comuni_aggregati();
The SQL function approach scheduled by Prefect is the choice adopted for the MAPS project, as it keeps the transformation logic close to the database and leverages PostGIS spatial capabilities. There are, however, equally valid alternative approaches.
With dbt (data build tool), Silver → Gold transformations are expressed as versioned SQL models, with automatic dependency graphs, integrated tests, and generated documentation; it is the preferable choice when the number of Gold models grows and maintainability becomes a priority.
With pure Prefect, transformations occur in Python via dedicated tasks, eliminating the dependency on SQL functions in the database; it is preferable when transformations require complex logic not expressible in SQL, or when a single control point for the entire pipeline is desired.
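To illustrate the pure-Python route, the EAV → wide pivot can be expressed as an ordinary function. This is a minimal, dependency-free sketch: in a real Prefect deployment it would be wrapped in an `@task` decorator, and the function and column names here are hypothetical.

```python
from collections import defaultdict

def pivot_eav_rows(eav_rows, attributes):
    """Pivot EAV tuples (entity, attribute, value) into one wide dict per entity.

    Illustrative stand-in for a Prefect task; not the project's actual code.
    """
    wide = defaultdict(dict)
    for entity, attribute, value in eav_rows:
        if attribute in attributes:
            wide[entity][attribute] = value
    # Emit a fixed set of columns per entity; missing attributes become None
    return [
        {"codice_istat": entity, **{a: row.get(a) for a in attributes}}
        for entity, row in sorted(wide.items())
    ]

rows = [
    ("001001", "popolazione", 45230),
    ("001001", "n_asili_nido", 12),
    ("001002", "popolazione", 8420),
]
gold = pivot_eav_rows(rows, ["popolazione", "n_asili_nido"])
```

The trade-off matches the one described above: the logic is easier to test and debug in Python, but the database no longer owns the transformation.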
The MAPS project acquires over 180 datasets from heterogeneous sources — ISTAT, Ministries, public registries, OpenData — each with its own attributes, variable temporal coverage, and non-uniform availability across municipalities. This heterogeneity poses a Silver layer design problem: how to structure a schema that can accommodate new attributes without requiring database changes, that records the provenance of each value, and that manages changes over time?
The traditional approach, based on a "wide" table with one column per attribute, is inadequate in this context. The table would have hundreds of columns, most of which would be empty for any specific municipality, and would require a schema migration every time a new dataset is added. Temporal traceability would be absent or would require parallel structures.
| codice_istat | popolazione | n_asili | has_ospedale | n_scuole | ato_gas_id | ... |
|---|---|---|---|---|---|---|
| 001001 | 45230 | 12 | true | ? | ? | ... |
| 001002 | 8420 | 2 | false | ? | ? | ... |
The EAV (Entity-Attribute-Value) model solves these problems by inverting the structure: instead of having one column per attribute, each attribute becomes a row. The schema remains fixed — three fundamental columns plus metadata — while the content grows vertically as new datasets are added, with no structural modifications.
| entity | attribute | value | valid_from | source |
|---|---|---|---|---|
| 001001 | popolazione | 45230 | 2021-01-01 | ISTAT |
| 001001 | n_asili_nido | 12 | 2021-01-01 | ISTAT |
| 001001 | has_ospedale | true | 2015-01-01 | MinSalute |
| 001002 | popolazione | 8420 | 2021-01-01 | ISTAT |
| 001002 | cod_pre_fusione | 001045 | 2017-01-01 | ISTAT |
The source column guarantees traceability of provenance for each individual value; the valid_from and valid_to columns allow managing time series and municipal changes without parallel structures. Storage is efficient because only attributes that are actually available are recorded, not missing values.
The main trade-off is query complexity: reading a set of attributes for a municipality requires a pivot operation (MAX(CASE WHEN ...)) that turns EAV rows into columns. This complexity is, however, confined to the Silver → Gold transformation layer and invisible to users of the data marts. Read performance, potentially penalized by the higher row count, is mitigated by appropriate indexes on codice_istat and attribute. Values are stored as text and cast to the correct type during Gold layer population.
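The pivot and the text-to-type cast can be demonstrated in miniature with SQLite standing in for PostgreSQL — the `MAX(CASE WHEN …)` pattern is the same idea used in the Gold refresh function, though the table here is an illustrative stand-in, not the project schema.

```python
import sqlite3

# In-memory stand-in for the Silver EAV table; values are stored as text,
# as in the MAPS schema, and cast at read time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE territory_attributes (entity TEXT, attribute TEXT, value TEXT)")
con.executemany(
    "INSERT INTO territory_attributes VALUES (?, ?, ?)",
    [
        ("001001", "popolazione", "45230"),
        ("001001", "n_asili_nido", "12"),
        ("001002", "popolazione", "8420"),
    ],
)
# One MAX(CASE WHEN ...) per attribute turns EAV rows into columns;
# missing attributes surface as NULL (None in Python).
pivot_sql = """
SELECT entity,
       MAX(CASE WHEN attribute = 'popolazione'  THEN CAST(value AS INTEGER) END) AS popolazione,
       MAX(CASE WHEN attribute = 'n_asili_nido' THEN CAST(value AS INTEGER) END) AS n_asili_nido
FROM territory_attributes
GROUP BY entity
ORDER BY entity
"""
wide = con.execute(pivot_sql).fetchall()
# wide → [('001001', 45230, 12), ('001002', 8420, None)]
```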
The EAV schema in the Silver layer represents the optimal trade-off for MAPS: maximum flexibility in acquisition, long-term manageability, and no structural coupling between the arrival of new data and the database schema.
The choice of technology stack for MAPS is guided by three criteria: adequacy for the scale of data handled, independence from commercial vendors, and tool maturity. The project works with approximately 180 datasets at municipal granularity — a volume in the order of tens of gigabytes, not petabytes — that does not require large-scale cloud infrastructure but benefits from specialized tools for spatial data and territorial analysis.
The entire stack consists of open source software, each chosen for a precise and non-overlapping role. PostgreSQL with the PostGIS extension is the core of the system: it manages both the Silver layer EAV schema and the Gold layer denormalized tables, with native support for geographic operations such as isochrone calculation. Prefect orchestrates pipeline execution, from Bronze data collection through to Gold layer population, with monitoring and error handling. OpenMetadata catalogs all datasets internally, tracking lineage and quality; CKAN exposes approved datasets outward for open data publication. Great Expectations oversees data quality in the Bronze → Silver transition, while DuckDB provides researchers with a lightweight and portable analytical interface on the Gold layer.
| Component | Technology | Version | Role |
|---|---|---|---|
| Primary storage | PostgreSQL + PostGIS | 17 + 3.5 | Master data, spatial operations |
| Orchestration | Prefect | 3.x | Scheduling, monitoring, ETL lineage |
| Analytics layer | DuckDB | 1.x | Analytics queries (federation from PostgreSQL) |
| Internal governance | OpenMetadata | 1.x | Internal catalog, lineage, data quality, profiling |
| Open data catalog | CKAN | 2.11 | Public open data catalog, DCAT-AP_IT |
| Data quality | Great Expectations | 1.x | Validation, anomaly detection |
| PDF extraction | Docling | Latest | Table extraction from PDF (97.9% accuracy) |
| Object storage | MinIO (or GCS) | Latest | Raw files, PDFs, Excel archives |
```mermaid
graph TB
    subgraph "Data Sources"
        A1[ISTAT CSV]
        A2[Min PDF]
        A3[HTML Tables]
    end
    subgraph "Orchestration"
        B[Prefect<br/>Worker Pools]
    end
    subgraph "Storage Layers"
        C[Bronze<br/>File System]
        D[Silver<br/>PostgreSQL EAV]
        E[Gold<br/>PostgreSQL + PostGIS]
    end
    subgraph "Internal Governance & Analytics"
        F[OpenMetadata<br/>Internal Catalog]
        G[Great Expectations<br/>Quality]
        H[DuckDB<br/>Analytics]
    end
    subgraph "Public Access"
        I[CKAN<br/>Open Data Catalog]
        J[dati.gov.it]
    end
    A1 --> B
    A2 --> B
    A3 --> B
    B --> C
    C --> D
    D --> E
    D --> G
    B -.metadata.-> F
    E --> F
    E --> H
    E --> I
    I --> J
```
The detailed rationale for technology choices — quantitative comparisons between the alternatives considered for each component, cost analysis, and constraint evaluation — is documented in Appendix A.
MAPS platform services are distributed as orchestrated containers on a managed Kubernetes cluster (DigitalOcean DOKS). The PostgreSQL + PostGIS database is managed as a managed service by the same provider, eliminating the operational overhead of backup, failover, and patching. Persistent storage for the Bronze layer is provided by block volumes. HTTPS traffic is terminated by a Load Balancer with Let's Encrypt certificates, distributed to exposed services via Ingress Controller.
The complete solution design — infrastructure choice rationale, sizing, service topology, security, and cost estimates — is documented in deliverable D2.1.2.
Data governance is a structural component of the MAPS architecture, not a subsequent addition. A project that integrates over 200 datasets from heterogeneous public sources, transforms them through three progressive layers, and publishes them as open data for scientific research purposes needs to know at all times where each piece of data comes from, what transformations it has undergone, and with what level of quality it has reached its destination. Without this traceability, the analytical results of WP3 — the map of Daily Life Systems, the classification of attractors — would not be verifiable or reproducible.
Governance is articulated in three complementary dimensions: lineage tracking (where the data comes from and how it was transformed), quality measurement (how complete, consistent, and reliable the data is), and cataloging (what datasets exist, who is responsible for them, how they are structured).
Lineage describes the journey of each piece of data from the external source to its final use. In the context of the Medallion architecture, this journey passes through at least three steps: acquisition in the Bronze layer, normalization in the Silver layer, and aggregation in the Gold layer. Each step may involve significant transformations — parsing a PDF, reconciling ISTAT codes for municipal mergers, calculating spatial aggregations — and lineage must document them all.
OpenMetadata is the chosen tool for lineage tracking. It integrates natively with PostgreSQL, from which it automatically reads the schema structure (tables, columns, types), and with Prefect, from which it receives pipeline execution metadata (flow runs, task runs, status, duration, errors). This dual integration allows the complete chain to be reconstructed: which flow produced which table, starting from which source data, in which execution, with what outcome.
Lineage is articulated at three levels of granularity. The dataset level tracks dependencies between tables: the Gold table comuni_aggregati depends on the Silver table territory_attributes, which in turn depends on the Bronze files of each source. The column level tracks the provenance of individual attributes: the popolazione_residente field in the Gold layer derives from the same-named attribute in the Silver layer, originating from the ISTAT population CSV file. The pipeline level tracks executions: which Prefect run updated a given table, when, with what duration, and with what outcome.
```mermaid
graph TB
    subgraph "External Sources"
        S1[ISTAT CSV<br/>Population]
        S2[MinAmbiente HTML<br/>ATO Gas]
        S3[MinSalute PDF<br/>ASL]
    end
    subgraph "Bronze Layer"
        B1[popolazione_2024.csv]
        B2[ato_gas_page.html]
        B3[elenco_asl.pdf]
    end
    subgraph "Silver Layer"
        SV[territory_attributes<br/>Generalized EAV Schema]
    end
    subgraph "Gold Layer"
        G1[comuni_aggregati]
        G2[dls_attractors]
        G3[isochrone_maps]
    end
    subgraph "Applications"
        A1[CKAN Open Data]
        A2[API Web]
        A3[DuckDB Analytics]
    end
    S1 -->|"Prefect: download"| B1
    S2 -->|"Prefect: download"| B2
    S3 -->|"Prefect: download"| B3
    B1 -->|"Prefect: parse CSV"| SV
    B2 -->|"Prefect: parse HTML"| SV
    B3 -->|"Prefect: Docling extract"| SV
    SV -->|"SQL: aggregate"| G1
    SV -->|"PostGIS: spatial"| G2
    SV -->|"PostGIS: isochrone"| G3
    G1 --> A1
    G1 --> A2
    G1 --> A3
    G2 --> A2
    G3 --> A2
    style B1 fill:#cd7f32,color:#000
    style B2 fill:#cd7f32,color:#000
    style B3 fill:#cd7f32,color:#000
    style SV fill:#c0c0c0,color:#000
    style G1 fill:#ffd700,color:#000
    style G2 fill:#ffd700,color:#000
    style G3 fill:#ffd700,color:#000
```
The practical value of lineage emerges in several operational scenarios. When an external source publishes an update, lineage immediately indicates which Silver and Gold tables must be recalculated, avoiding unnecessary re-executions or, worse, oversights. When an analysis produces unexpected results, lineage allows tracing back the chain and verifying whether the cause is an anomalous source datum, a transformation error, or a change in the source structure. When a dataset is published on CKAN, lineage provides the methodological documentation required by DCAT-AP_IT standards: provenance, transformations applied, date of last update.
Data quality is verified systematically at each stage of the pipeline, with validation rules defined using Great Expectations. Validations are not generic checks applied uniformly to all datasets: each source has its own expected quality profile, with thresholds and rules calibrated to the specific nature of the data.
In the Bronze → Silver transition, validations oversee structural integrity. For datasets at municipal granularity, the fundamental check is coverage: the number of municipalities present in the file must be consistent with the reference universe (7,896 municipalities as of 2024, with annual variations due to mergers and splits). ISTAT code validity (format, length, membership in the reference municipal registry), absence of duplicates, data type consistency (numeric values in numeric fields, dates in valid format), and completeness (percentage of null values for each attribute) are also verified.
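A plain-Python stand-in for such a validation suite can make the rules concrete. In the project these checks would be expressed as Great Expectations expectations; the rule names and row format below are illustrative.

```python
import re

EXPECTED_MUNICIPALITIES = 7896            # reference universe as of 2024
ISTAT_CODE = re.compile(r"^\d{6}$")       # six-digit municipal code

def validate_structure(rows):
    """Return the list of violated structural rules for a Bronze dataset.

    Minimal sketch of a Bronze → Silver validation suite (illustrative).
    """
    errors = []
    codes = [r["codice_istat"] for r in rows]
    if len(rows) != EXPECTED_MUNICIPALITIES:
        errors.append(f"coverage: {len(rows)} rows, expected {EXPECTED_MUNICIPALITIES}")
    if len(set(codes)) != len(codes):
        errors.append("uniqueness: duplicate ISTAT codes")
    bad = [c for c in codes if not ISTAT_CODE.match(c)]
    if bad:
        errors.append(f"validity: malformed codes {bad[:5]}")
    return errors

# A file with only 5,000 rows fails the coverage check and halts the pipeline
sample = [{"codice_istat": f"{i:06d}"} for i in range(1, 5001)]
errors = validate_structure(sample)
```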
In the Silver → Gold transition, validations focus on semantic consistency. Territorial aggregations must produce totals consistent with the source data (for example, the sum of the population of municipalities in a province must correspond to the provincial figure). PostGIS spatial operations must generate valid geometries (no self-intersections, complete territorial coverage). Time series must maintain continuity: a municipality that existed in 2023 must have a successor in 2024, or be documented as dissolved.
Validation results are recorded in OpenMetadata as quality metrics associated with each table. The five main metrics are: completeness (percentage of non-null values), uniqueness (absence of duplicate records), validity (conformity of values to the expected domain), consistency (compatibility between correlated attributes), and timeliness (distance between the source publication date and the acquisition date). These metrics are accessible in the internal catalog and allow assessing the health of each dataset at a glance.
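The first two metrics can be computed in a few lines of Python — an illustrative sketch of the definitions given above, not the OpenMetadata implementation; the helper and field names are ours.

```python
def quality_metrics(rows, key, attributes):
    """Compute completeness and uniqueness as percentages.

    completeness: share of non-null values per attribute
    uniqueness:   share of distinct key values over total rows
    """
    n = len(rows)
    completeness = {
        a: 100.0 * sum(1 for r in rows if r.get(a) is not None) / n
        for a in attributes
    }
    keys = [r[key] for r in rows]
    uniqueness = 100.0 * len(set(keys)) / n
    return {"completeness": completeness, "uniqueness": uniqueness}

rows = [
    {"codice_istat": "001001", "popolazione": 45230, "n_asili": 12},
    {"codice_istat": "001002", "popolazione": 8420, "n_asili": None},
]
m = quality_metrics(rows, "codice_istat", ["popolazione", "n_asili"])
# m["completeness"]["n_asili"] → 50.0; m["uniqueness"] → 100.0
```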
When a validation fails — for example, a CSV file contains only 5,000 municipalities instead of the expected 7,896 — the pipeline halts and the data is not propagated to the next layer. The error is recorded in the ingestion log, reported in the Prefect interface, and associated with the corresponding table in OpenMetadata. The Bronze data remains available for problem analysis, but the Silver layer is protected from partial or corrupted ingestions.
OpenMetadata acts as the internal catalog of the entire data platform. Every table, in each of the three layers, is registered with its metadata: structure (columns, types, constraints), provenance (source, pipeline that generated it), quality (metrics from the latest validation), responsibility (dataset owner), and documentation (textual description, methodological notes).
The catalog is populated partly automatically — OpenMetadata reads the structure of PostgreSQL tables and the metadata from Prefect executions — and partly manually, for information that requires domain knowledge: the description of an attribute, the calculation methodology of an indicator, caveats on the use of a dataset. This combination of automation and curation allows the catalog to be kept up to date without an excessive documentation burden.
The catalog serves three operational needs.
The MAPS project's ingestion pipelines follow three architectural principles that guarantee reliability and maintainability over time: operation idempotency, incremental data loading, and controlled schema evolution.
Every pipeline must be re-executable at any time without generating duplicates or corrupting existing data. This requirement is essential in a context where data sources are heterogeneous and partial failures are frequent: an interrupted download, a malformed PDF file, a network timeout can stop execution midway, and the pipeline must be able to restart from the beginning without side effects.
Idempotency is implemented at the database level through upsert operations: when a record is inserted into the Silver layer, the composite key (ISTAT code, attribute, source, dataset, reference year) determines whether it is new data or an update. If the record already exists, the value is overwritten and the update timestamp is recorded; if it does not exist, it is created. The result is identical whether on the first execution or on any subsequent execution: the final state of the database depends exclusively on the source data, not on the history of executions.
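The upsert semantics can be demonstrated with SQLite standing in for PostgreSQL (the `ON CONFLICT … DO UPDATE` clause is analogous in both). The table mirrors the composite key described above in simplified form; this is a sketch, not the project schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE territory_attributes (
    codice_istat TEXT, attribute TEXT, source TEXT, dataset TEXT, anno INTEGER,
    value TEXT, updated_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (codice_istat, attribute, source, dataset, anno)
)""")

def upsert(row):
    # ON CONFLICT on the composite key makes the load idempotent:
    # re-running the same ingestion overwrites rather than duplicates.
    con.execute("""
        INSERT INTO territory_attributes
            (codice_istat, attribute, source, dataset, anno, value)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (codice_istat, attribute, source, dataset, anno)
        DO UPDATE SET value = excluded.value, updated_at = CURRENT_TIMESTAMP
    """, row)

record = ("001001", "popolazione", "ISTAT", "pop_residente", 2024, "45230")
upsert(record)  # first execution
upsert(record)  # re-execution: same final state, no duplicates
count = con.execute("SELECT COUNT(*) FROM territory_attributes").fetchone()[0]
```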
In the Bronze layer, idempotency is guaranteed by the structure of the layer itself: downloaded files overwrite any previous versions at the same position in the file system, identified by the source/dataset/year combination. The SHA-256 hash of the file allows verifying whether the content has actually changed compared to the previous version, avoiding unnecessary transformations in cases where the source has not published updates.
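The hash-based change detection amounts to a few lines of `hashlib`; the function names below are ours, as a sketch of the skip-if-unchanged step.

```python
import hashlib
from typing import Optional

def sha256_bytes(data: bytes) -> str:
    """SHA-256 digest of a downloaded file's content."""
    return hashlib.sha256(data).hexdigest()

def needs_reprocessing(new_content: bytes, previous_hash: Optional[str]) -> bool:
    """Skip downstream transformation when the source file is byte-identical
    to the previously acquired version."""
    return sha256_bytes(new_content) != previous_hash

first = b"codice_istat;popolazione\n001001;45230\n"
h = sha256_bytes(first)                                      # stored with the Bronze file
unchanged = needs_reprocessing(first, h)                     # False -> skip transformation
changed = needs_reprocessing(first + b"001002;8420\n", h)    # True  -> reprocess
```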
The loading strategy differs between the Silver layer and the Gold layer, reflecting the different needs of the two strata.
The Silver layer adopts an incremental approach with temporal versioning. When a dataset is updated, the pipeline compares the new data with what is already present: new records are inserted, modified records are updated by closing the previous version (field valid_to) and opening a new version (field valid_from), unchanged records are left untouched. This strategy minimizes write volume and preserves the history of changes, making it possible to query the state of data at any past date.
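The close-previous-version/open-new-version logic can be sketched in memory — an illustrative model of the Silver-layer behaviour, not the actual SQL.

```python
from datetime import date

def apply_update(history, entity, attribute, value, as_of):
    """Temporal-versioning upsert: if the value changed, close the open
    version (set valid_to) and append a new one (valid_from = as_of);
    if unchanged, leave the record untouched. Illustrative sketch."""
    current = next(
        (h for h in history
         if h["entity"] == entity and h["attribute"] == attribute
         and h["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["value"] == value:
            return  # unchanged record: no write
        current["valid_to"] = as_of  # close the previous version
    history.append({"entity": entity, "attribute": attribute, "value": value,
                    "valid_from": as_of, "valid_to": None})

history = []
apply_update(history, "001001", "popolazione", 45230, date(2023, 1, 1))
apply_update(history, "001001", "popolazione", 45230, date(2024, 1, 1))  # no-op
apply_update(history, "001001", "popolazione", 45500, date(2025, 1, 1))  # new version
```

Querying the state at any past date then reduces to filtering on `valid_from <= d` and (`valid_to` is null or `valid_to > d`).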
The Gold layer instead adopts a snapshot approach: denormalized tables are fully rebuilt from Silver data every time an update requires it. Full reconstruction is preferable to incremental updating for two reasons. The first is the aggregated nature of Gold data: a change to a single Silver record can propagate across multiple Gold tables and multiple rows of each table, making point updates complex and fragile. The second is the contained volume: with approximately 8,000 municipalities and a few hundred attributes per table, full reconstruction takes a few minutes and does not justify the additional complexity of an incremental mechanism.
The architecture must handle two types of evolution over time: the addition of new data sources and modification of the structure of existing data.
Adding new sources is the most frequent and simplest case. The Silver layer's EAV schema is designed to accommodate new attributes without structural modifications: a new dataset translates into new rows in the territory_attributes table with a new value in the attribute field, without needing to add columns or alter tables. In the Gold layer, adding a new attribute requires a new column in the corresponding denormalized table, but this operation is non-breaking: the column is added with a default value and populated by the pipeline, with no impact on existing queries.
Modifying the structure of existing data is less frequent but more delicate. It can happen that an ISTAT source changes a file's format, adds or removes columns, or changes value encoding between one release and the next. In these cases, the Bronze → Silver transformation pipeline must be updated to handle the new format, maintaining compatibility with previous versions still present in the Bronze layer. The Silver layer's temporal versioning ensures that data already acquired with the old format remains valid and queryable, while new data follows the updated format.
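One way a transformation pipeline can absorb such a format change is to dispatch on the file header — the two layouts below are hypothetical examples, not actual ISTAT column names.

```python
import csv, io

def parse_population_csv(text):
    """Parse a population CSV, handling two known header layouts.

    Dispatching on the detected columns keeps old Bronze files parseable
    after the source changes format (layouts here are illustrative).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    fields = set(reader.fieldnames or [])
    if {"Codice comune", "Popolazione"} <= fields:      # legacy layout
        return [(r["Codice comune"], int(r["Popolazione"])) for r in reader]
    if {"codice_istat", "pop_residente"} <= fields:     # current layout
        return [(r["codice_istat"], int(r["pop_residente"])) for r in reader]
    raise ValueError(f"unrecognized layout: {sorted(fields)}")

old = "Codice comune;Popolazione\n001001;45230\n"
new = "codice_istat;pop_residente\n001001;45230\n"
# Both layouts normalize to the same (code, population) tuples
```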
OpenMetadata records every schema change as a lineage event, allowing tracking of when a table was altered, by which pipeline, and for what reason. This tracking is particularly useful during debugging: when an analysis produces results different from those expected, the history of schema changes makes it possible to verify whether the cause is a change in data structure rather than in the data itself.
A three-layer architecture with over 200 datasets and chained transformation pipelines requires a monitoring system that can answer three questions at all times: Is data entering correctly? Are transformations working? Is the produced data usable?
Monitoring is organized along the three layers of the Medallion architecture, with specific indicators for each.
For the Bronze layer, the main indicator is the acquisition success rate. Each execution of a download pipeline is recorded in the bronze.ingestion_log table with the outcome (success, failure, partial), the size of the downloaded file, and the content hash. Monitoring verifies that planned acquisitions are actually executed and that downloaded files have sizes consistent with expectations: an ISTAT population CSV file weighing a few kilobytes instead of the expected megabytes signals a problem at the source or in the download. Bronze storage growth is contained — in the order of a few gigabytes per year, given the semi-annual or annual update frequency of sources — and is monitored to ensure that persistent volumes have sufficient capacity.
For the Silver layer, the key indicators are ingestion latency and data quality. Latency measures the time elapsed between the completion of the Bronze download and the availability of the normalized data in the Silver layer: for most datasets (CSV, Excel) this time is in the order of minutes; for complex PDFs requiring extraction with Docling it can reach several tens of minutes. Quality metrics — completeness, uniqueness, validity — are those produced by Great Expectations validations and recorded in OpenMetadata, as described in the previous section. A deterioration of these indicators over time signals a change in the structure or quality of the source, requiring intervention on the transformation pipeline.
For the Gold layer, monitoring focuses on query performance and aggregation consistency. Response times of analytical queries are tracked via the PostgreSQL pg_stat_statements extension, with the objective of keeping the 95th percentile below 100 milliseconds for the most common queries. Reconstruction of Gold tables from the Silver layer must complete in reasonable times — in the order of minutes for the current project volume — and monitoring signals any slowdowns that may indicate data growth beyond planned thresholds or aggregation queries requiring optimization.
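The threshold check itself is simple once latency samples are available; below is a nearest-rank 95th-percentile computation as a sketch (in production the figure would come from `pg_stat_statements`, and the function name is ours).

```python
import math

def p95(samples_ms):
    """Nearest-rank 95th percentile of query latencies, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest-rank position
    return ordered[rank - 1]

latencies = list(range(1, 101))        # 1..100 ms, one sample each
within_slo = p95(latencies) < 100      # objective: p95 below 100 ms
```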
Prefect natively provides monitoring of pipeline execution, with visibility into the status of each flow run (running, completed, failed), the execution times of each task, and detailed logs in case of error. The Prefect web interface allows consulting execution history, identifying failed tasks, and re-executing pipelines with a single intervention.
Pipeline monitoring integrates with the lineage tracked in OpenMetadata: when a flow fails, it is immediately possible to trace which Silver and Gold tables have not been updated, and therefore which downstream data may be stale. This correlation between pipeline status and data status is particularly useful when multiple pipelines share destination tables: the failure of a single pipeline does not compromise the entire layer, but monitoring must signal which portions of the data have remained at the last successful update.
At the infrastructure level, monitoring of the Kubernetes cluster and the PostgreSQL database covers the classic resource utilization metrics: CPU, memory, disk I/O, storage space. DigitalOcean provides basic metrics for DOKS cluster nodes and for the managed database; for more detailed monitoring, the Prometheus + Grafana stack can be deployed in the cluster as an additional service.
Attention thresholds are calibrated on the load profiles described in the sizing section: sustained CPU utilization above 80% on worker nodes during ETL execution windows is normal; the same level in the absence of active pipelines indicates a problem. Database disk occupancy is monitored with a threshold at 70% of allocated capacity, leaving margin for growth and for PostgreSQL maintenance operations (vacuum, reindex) that require temporary space.