Skip to content

ETL Pipeline Validation Report โ€‹

Deliverable D2.3: ETL Pipeline Validation Report

[TODO] This chapter will be completed after full implementation and testing (Task 2.3, M10-M12)

Validation methodology โ€‹

Acceptance criteria โ€‹

The ETL pipelines must satisfy five classes of acceptance criteria. Temporal integrity is the primary concern for the territorial foundation layer, since all downstream attribute pipelines depend on the correctness of the silver.territories graph.

Temporal integrity (foundation pipelines) โ€‹

  • Target: all territory records in silver.territories and related tables carry valid valid_from/valid_to timestamps derived from ISTAT variazioni events; no overlapping time windows exist for the same entity; every currently active entity has valid_to IS NULL
  • Metric: count of integrity violations across territories, territory_identifiers, territory_names, territory_containments
  • Threshold: PASS if 0 violations; FAIL on any violation

Coverage โ€‹

  • Target: all Italian municipalities (7,904 as of 2024) represented in silver.territories with at least one active record
  • Metric: (territori_con_record_attivo / totale_comuni_istat) ร— 100
  • Threshold: FAIL if <99%, WARNING if 99โ€“99.9%, PASS if 100%

Accuracy โ€‹

  • Target: all records in silver.territory_attributes carry a valid territory_id resolving to a row in silver.territories
  • Metric: (record_con_territory_id_valido / totale_record) ร— 100
  • Threshold: FAIL if <99.9%

Timeliness โ€‹

  • Target: ingestion pipelines complete within defined scheduling windows
  • Metric: flow run duration as reported by Prefect
  • Threshold: <24h for foundation and priority pipelines

Lineage โ€‹

  • Target: complete traceability from source URL โ†’ Bronze file โ†’ Silver record
  • Metric: % of Silver records with populated fonte, anno_rif, and a corresponding entry in bronze.ingestion_log
  • Threshold: 100%

Validation pipeline โ€‹

The Great Expectations validation suite for each pipeline is executed as Flow 3 immediately after the transform flow. Results are persisted to the Prefect run log. A failed expectation blocks Flow 4 (metadata registration) to prevent lineage entries for incomplete data.

Foundation pipeline validation โ€‹

Temporal integrity checks โ€‹

After running all four foundation pipelines in sequence, the following SQL assertions must return zero rows or zero counts.

sql
-- (1) No two active records for the same territory type and id
SELECT type_code, COUNT(*) AS duplicates
FROM silver.territories
WHERE valid_to IS NULL
GROUP BY type_code, id
HAVING COUNT(*) > 1;

-- (2) No date inversion (valid_from after valid_to)
SELECT id FROM silver.territories
WHERE valid_to IS NOT NULL AND valid_from > valid_to;

-- (3) Every territory has at least one identifier
SELECT t.id FROM silver.territories t
LEFT JOIN silver.territory_identifiers ti ON ti.territory_id = t.id
WHERE ti.id IS NULL;

-- (4) Every active comune is contained in a province
SELECT t.id FROM silver.territories t
WHERE t.type_code = 'COM'
  AND t.valid_to IS NULL
  AND NOT EXISTS (
    SELECT 1 FROM silver.territory_containments tc
    WHERE tc.territory_id = t.id AND tc.container_type_code = 'PRO'
  );

Coverage check โ€‹

sql
SELECT
  COUNT(*)             AS active_comuni,
  7904                 AS istat_baseline_2024,
  ROUND(COUNT(*) * 100.0 / 7904, 2) AS coverage_pct
FROM silver.territories
WHERE type_code = 'COM'
  AND valid_to IS NULL;
MetricTargetResultStatus
Active comuni7,904TBDpending
Coverage100%TBDpending
Temporal violations0TBDpending
Orphan identifiers0TBDpending
Containment violations0TBDpending

Territory corrections validation โ€‹

CorrectionLookup key (COD_CATASTO)Expected outcomeStatus
Baranzate (MI) stale valid_toA618valid_to IS NULLpending
Castegnero (VI) ADM aliasC056alias name present in territory_namespending

Attribute pipeline validation โ€‹

ADM Tabaccherie โ€‹

MetricTargetResultStatus
Municipal coverageโ‰ฅ95%TBDpending
Resolution failures0%TBDpending
Execution time<24hTBDpending
Lineage completeness100%TBDpending

Additional attribute pipeline results will be added as each pipeline reaches production.

Performance benchmarks โ€‹

[TODO] Actual measurements after full run on production data

PipelineRecords processedExecution time
variazioni-amministrative (ingestion)~500K eventsTBD
province-regioni (transform)~120 entitiesTBD
comuni (transform)~8,000 comuni ร— time windowsTBD
territory-corrections~20 patchesTBD
ADM Tabaccherie~8,000 municipalitiesTBD

Issues identified and resolutions โ€‹

Issue #1: Municipal code instability across province reorganisations โ€‹

Description: municipalities that changed ISTAT code due to Assegnazione Provinciale (AP) events could not be resolved by code alone, producing duplicate or missing territory records in naive implementations. Solution: the comuni transform uses a union-find algorithm to group all code variants for the same physical municipality under a single stable territory_id, keyed on the Belfiore cadastral code (COD_CATASTO), which is invariant across province reorganisations. All historical ISTAT codes are stored as entries in silver.territory_identifiers with appropriate valid_from/valid_to ranges. Status: Resolved

Issue #2: ADM portal scraping timeout โ€‹

Description: the ADM JSF/PrimeFaces portal timed out after 10 minutes on pages listing many municipalities per province. Solution: increased timeout to 30 minutes; implemented retry with exponential backoff in the rate_limiter utility. Status: Resolved

Recommendations โ€‹

Short-term (before Task 2.3 sign-off) โ€‹

  1. Run the complete foundation pipeline on production data and record the benchmark results in the tables above.
  2. Add a database-level FK constraint on silver.territory_attributes.territory_id to enforce referential integrity at the storage layer, not only at the application layer.
  3. Configure Prefect alerting (email or Slack) on foundation pipeline failures so that data-quality regressions are caught before downstream pipelines run.

Medium-term (Task 2.3 hardening phase) โ€‹

  1. Extend Great Expectations Silver suites to all attribute pipelines as they reach production.
  2. Schedule automated temporal integrity checks (the SQL assertions in ยง2 of this chapter) as a daily Prefect flow.
  3. Implement incremental ingestion for SITUAS datasets to avoid full re-downloads on each pipeline run.

Long-term โ€‹

  1. Expose foundation pipeline validation results in OpenMetadata as dataset quality metrics, making the territorial graph health visible in the data catalog.
  2. Integrate temporal integrity checks into the CI pipeline so that regressions in transform logic are caught before deployment.

Conclusions โ€‹

[TODO] Final summary after testing completion

The MAPS Data Lake validation strategy treats temporal integrity as the primary quality gate. Because every attribute pipeline ultimately resolves source identifiers to territory_id values maintained by the territorial foundation, the correctness of that foundation โ€” expressed through sound valid_from/valid_to chains and complete containment hierarchies โ€” is the necessary condition for any downstream quality claim. The acceptance criteria and SQL assertions defined in this chapter will be evaluated against production data as each pipeline completes Task 2.3 hardening.