ETL Pipeline Validation Report

Deliverable D2.3: ETL Pipeline Validation Report

[TODO] This chapter will be completed after full implementation and testing (Task 2.3, M10-M12)

Validation methodology

Acceptance criteria

The ETL pipelines must satisfy five classes of acceptance criteria. Temporal integrity is the primary concern for the territorial foundation layer, since all downstream attribute pipelines depend on the correctness of the silver.territories graph.

Temporal integrity (foundation pipelines)

Target: all territory records in silver.territories and related tables carry valid valid_from/valid_to timestamps derived from ISTAT variazioni events; no overlapping time windows exist for the same entity; every currently active entity has valid_to IS NULL
Metric: count of integrity violations across territories, territory_identifiers, territory_names, territory_containments
Threshold: PASS if 0 violations; FAIL on any violation

Coverage

Target: all Italian municipalities (7,904 as of 2024) represented in silver.territories with at least one active record
Metric: (territori_con_record_attivo / totale_comuni_istat) × 100
Threshold: FAIL if <99%, WARNING if 99–99.9%, PASS if 100%

Accuracy

Target: all records in silver.territory_attributes carry a valid territory_id resolving to a row in silver.territories
Metric: (record_con_territory_id_valido / totale_record) × 100
Threshold: FAIL if <99.9%

Timeliness

Target: ingestion pipelines complete within defined scheduling windows
Metric: flow run duration as reported by Prefect
Threshold: <24h for foundation and priority pipelines

Lineage

Target: complete traceability from source URL → Bronze file → Silver record
Metric: % of Silver records with populated fonte, anno_rif, and a corresponding entry in bronze.ingestion_log
Threshold: 100%

Validation pipeline

The Great Expectations validation suite for each pipeline is executed as Flow 3 immediately after the transform flow. Results are persisted to the Prefect run log. A failed expectation blocks Flow 4 (metadata registration) to prevent lineage entries for incomplete data.

Foundation pipeline validation

Temporal integrity checks

After running all four foundation pipelines in sequence, the following SQL assertions must return zero rows or zero counts.

sql

-- (1) No two active records for the same territory type and id
SELECT type_code, COUNT(*) AS duplicates
FROM silver.territories
WHERE valid_to IS NULL
GROUP BY type_code, id
HAVING COUNT(*) > 1;

-- (2) No date inversion (valid_from after valid_to)
SELECT id FROM silver.territories
WHERE valid_to IS NOT NULL AND valid_from > valid_to;

-- (3) Every territory has at least one identifier
SELECT t.id FROM silver.territories t
LEFT JOIN silver.territory_identifiers ti ON ti.territory_id = t.id
WHERE ti.id IS NULL;

-- (4) Every active comune is contained in a province
SELECT t.id FROM silver.territories t
WHERE t.type_code = 'COM'
  AND t.valid_to IS NULL
  AND NOT EXISTS (
    SELECT 1 FROM silver.territory_containments tc
    WHERE tc.territory_id = t.id AND tc.container_type_code = 'PRO'
  );

Coverage check

sql

SELECT
  COUNT(*)             AS active_comuni,
  7904                 AS istat_baseline_2024,
  ROUND(COUNT(*) * 100.0 / 7904, 2) AS coverage_pct
FROM silver.territories
WHERE type_code = 'COM'
  AND valid_to IS NULL;

Metric	Target	Result	Status
Active comuni	7,904	TBD	pending
Coverage	100%	TBD	pending
Temporal violations	0	TBD	pending
Orphan identifiers	0	TBD	pending
Containment violations	0	TBD	pending

Territory corrections validation

Correction	Lookup key (COD_CATASTO)	Expected outcome	Status
Baranzate (MI) stale valid_to	A618	valid_to IS NULL	pending
Castegnero (VI) ADM alias	C056	alias name present in territory_names	pending

Attribute pipeline validation

ADM Tabaccherie

Metric	Target	Result	Status
Municipal coverage	≥95%	TBD	pending
Resolution failures	0%	TBD	pending
Execution time	<24h	TBD	pending
Lineage completeness	100%	TBD	pending

Additional attribute pipeline results will be added as each pipeline reaches production.

Performance benchmarks

[TODO] Actual measurements after full run on production data

Pipeline	Records processed	Execution time
variazioni-amministrative (ingestion)	~500K events	TBD
province-regioni (transform)	~120 entities	TBD
comuni (transform)	~8,000 comuni × time windows	TBD
territory-corrections	~20 patches	TBD
ADM Tabaccherie	~8,000 municipalities	TBD

Issues identified and resolutions

Issue #1: Municipal code instability across province reorganisations

Description: municipalities that changed ISTAT code due to Assegnazione Provinciale (AP) events could not be resolved by code alone, producing duplicate or missing territory records in naive implementations. Solution: the comuni transform uses a union-find algorithm to group all code variants for the same physical municipality under a single stable territory_id, keyed on the Belfiore cadastral code (COD_CATASTO), which is invariant across province reorganisations. All historical ISTAT codes are stored as entries in silver.territory_identifiers with appropriate valid_from/valid_to ranges. Status: Resolved

Issue #2: ADM portal scraping timeout

Description: the ADM JSF/PrimeFaces portal timed out after 10 minutes on pages listing many municipalities per province. Solution: increased timeout to 30 minutes; implemented retry with exponential backoff in the rate_limiter utility. Status: Resolved

Recommendations

Short-term (before Task 2.3 sign-off)

Run the complete foundation pipeline on production data and record the benchmark results in the tables above.
Add a database-level FK constraint on silver.territory_attributes.territory_id to enforce referential integrity at the storage layer, not only at the application layer.
Configure Prefect alerting (email or Slack) on foundation pipeline failures so that data-quality regressions are caught before downstream pipelines run.

Medium-term (Task 2.3 hardening phase)

Extend Great Expectations Silver suites to all attribute pipelines as they reach production.
Schedule automated temporal integrity checks (the SQL assertions in §2 of this chapter) as a daily Prefect flow.
Implement incremental ingestion for SITUAS datasets to avoid full re-downloads on each pipeline run.

Long-term

Expose foundation pipeline validation results in OpenMetadata as dataset quality metrics, making the territorial graph health visible in the data catalog.
Integrate temporal integrity checks into the CI pipeline so that regressions in transform logic are caught before deployment.

Conclusions

[TODO] Final summary after testing completion

The MAPS Data Lake validation strategy treats temporal integrity as the primary quality gate. Because every attribute pipeline ultimately resolves source identifiers to territory_id values maintained by the territorial foundation, the correctness of that foundation — expressed through sound valid_from/valid_to chains and complete containment hierarchies — is the necessary condition for any downstream quality claim. The acceptance criteria and SQL assertions defined in this chapter will be evaluated against production data as each pipeline completes Task 2.3 hardening.

ETL Pipeline Validation Report ​

Validation methodology ​

Acceptance criteria ​

Temporal integrity (foundation pipelines) ​

Coverage ​

Accuracy ​

Timeliness ​

Lineage ​

Validation pipeline ​

Foundation pipeline validation ​

Temporal integrity checks ​

Coverage check ​

Territory corrections validation ​

Attribute pipeline validation ​

ADM Tabaccherie ​

Performance benchmarks ​

Issues identified and resolutions ​

Issue #1: Municipal code instability across province reorganisations ​

Issue #2: ADM portal scraping timeout ​

Recommendations ​

Short-term (before Task 2.3 sign-off) ​

Medium-term (Task 2.3 hardening phase) ​

Long-term ​

Conclusions ​