๐ฌ๐ง English
๐ฌ๐ง English
Appearance
๐ฌ๐ง English
๐ฌ๐ง English
Appearance
Deliverable D2.3: ETL Pipeline Validation Report
[TODO] This chapter will be completed after full implementation and testing (Task 2.3, M10-M12)
The ETL pipelines must satisfy five classes of acceptance criteria. Temporal integrity is the primary concern for the territorial foundation layer, since all downstream attribute pipelines depend on the correctness of the silver.territories graph.
silver.territories and related tables carry valid valid_from/valid_to timestamps derived from ISTAT variazioni events; no overlapping time windows exist for the same entity; every currently active entity has valid_to IS NULLterritories, territory_identifiers, territory_names, territory_containmentssilver.territories with at least one active record(territori_con_record_attivo / totale_comuni_istat) ร 100silver.territory_attributes carry a valid territory_id resolving to a row in silver.territories(record_con_territory_id_valido / totale_record) ร 100fonte, anno_rif, and a corresponding entry in bronze.ingestion_logThe Great Expectations validation suite for each pipeline is executed as Flow 3 immediately after the transform flow. Results are persisted to the Prefect run log. A failed expectation blocks Flow 4 (metadata registration) to prevent lineage entries for incomplete data.
After running all four foundation pipelines in sequence, the following SQL assertions must return zero rows or zero counts.
-- (1) No two active records for the same territory type and id
SELECT type_code, COUNT(*) AS duplicates
FROM silver.territories
WHERE valid_to IS NULL
GROUP BY type_code, id
HAVING COUNT(*) > 1;
-- (2) No date inversion (valid_from after valid_to)
SELECT id FROM silver.territories
WHERE valid_to IS NOT NULL AND valid_from > valid_to;
-- (3) Every territory has at least one identifier
SELECT t.id FROM silver.territories t
LEFT JOIN silver.territory_identifiers ti ON ti.territory_id = t.id
WHERE ti.id IS NULL;
-- (4) Every active comune is contained in a province
SELECT t.id FROM silver.territories t
WHERE t.type_code = 'COM'
AND t.valid_to IS NULL
AND NOT EXISTS (
SELECT 1 FROM silver.territory_containments tc
WHERE tc.territory_id = t.id AND tc.container_type_code = 'PRO'
);SELECT
COUNT(*) AS active_comuni,
7904 AS istat_baseline_2024,
ROUND(COUNT(*) * 100.0 / 7904, 2) AS coverage_pct
FROM silver.territories
WHERE type_code = 'COM'
AND valid_to IS NULL;| Metric | Target | Result | Status |
|---|---|---|---|
| Active comuni | 7,904 | TBD | pending |
| Coverage | 100% | TBD | pending |
| Temporal violations | 0 | TBD | pending |
| Orphan identifiers | 0 | TBD | pending |
| Containment violations | 0 | TBD | pending |
| Correction | Lookup key (COD_CATASTO) | Expected outcome | Status |
|---|---|---|---|
| Baranzate (MI) stale valid_to | A618 | valid_to IS NULL | pending |
| Castegnero (VI) ADM alias | C056 | alias name present in territory_names | pending |
| Metric | Target | Result | Status |
|---|---|---|---|
| Municipal coverage | โฅ95% | TBD | pending |
| Resolution failures | 0% | TBD | pending |
| Execution time | <24h | TBD | pending |
| Lineage completeness | 100% | TBD | pending |
Additional attribute pipeline results will be added as each pipeline reaches production.
[TODO] Actual measurements after full run on production data
| Pipeline | Records processed | Execution time |
|---|---|---|
| variazioni-amministrative (ingestion) | ~500K events | TBD |
| province-regioni (transform) | ~120 entities | TBD |
| comuni (transform) | ~8,000 comuni ร time windows | TBD |
| territory-corrections | ~20 patches | TBD |
| ADM Tabaccherie | ~8,000 municipalities | TBD |
Description: municipalities that changed ISTAT code due to Assegnazione Provinciale (AP) events could not be resolved by code alone, producing duplicate or missing territory records in naive implementations. Solution: the comuni transform uses a union-find algorithm to group all code variants for the same physical municipality under a single stable territory_id, keyed on the Belfiore cadastral code (COD_CATASTO), which is invariant across province reorganisations. All historical ISTAT codes are stored as entries in silver.territory_identifiers with appropriate valid_from/valid_to ranges. Status: Resolved
Description: the ADM JSF/PrimeFaces portal timed out after 10 minutes on pages listing many municipalities per province. Solution: increased timeout to 30 minutes; implemented retry with exponential backoff in the rate_limiter utility. Status: Resolved
silver.territory_attributes.territory_id to enforce referential integrity at the storage layer, not only at the application layer.[TODO] Final summary after testing completion
The MAPS Data Lake validation strategy treats temporal integrity as the primary quality gate. Because every attribute pipeline ultimately resolves source identifiers to territory_id values maintained by the territorial foundation, the correctness of that foundation โ expressed through sound valid_from/valid_to chains and complete containment hierarchies โ is the necessary condition for any downstream quality claim. The acceptance criteria and SQL assertions defined in this chapter will be evaluated against production data as each pipeline completes Task 2.3 hardening.