๐ฌ๐ง English
๐ฌ๐ง English
Appearance
๐ฌ๐ง English
๐ฌ๐ง English
Appearance
Deliverable D2.3.1: QA/QC verification and validation of raw dataset acquisition, cleaning, and normalisation
This chapter documents the framework and methodology for validating the MAPS Data Lake ETL pipelines. It defines the acceptance criteria, the temporal integrity control assertions, quality governance through OpenMetadata tags, the territorial reconciliation methodology, and the content validation carried out by the data stewards. Per-dataset development and execution on production data are carried out by the development team; the authoritative source for the validation status of each dataset is the ETL dashboard, described in the delivery report.
The ETL pipelines must satisfy five classes of acceptance criteria. Temporal integrity is the primary concern for the territorial foundation layer, since all downstream attribute pipelines depend on the correctness of the silver.territories graph.
silver.territories and related tables carry valid valid_from/valid_to timestamps derived from ISTAT variazioni events; no overlapping time windows exist for the same entity; every currently active entity has valid_to IS NULLterritories, territory_identifiers, territory_names, territory_containmentssilver.territories with at least one active record(territori_con_record_attivo / totale_comuni_istat) ร 100silver.territory_attributes carry a valid territory_id resolving to a row in silver.territories(record_con_territory_id_valido / totale_record) ร 100fonte, anno_rif, and a corresponding entry in bronze.ingestion_logThe data quality framework of each pipeline runs as Flow 3 immediately after the transform flow. It is a custom framework that checks the Bronze files via DuckDB and the Silver records via SQL queries; the results are persisted to the Prefect run log and the overall outcome is written as a Quality tag on OpenMetadata. A failed check blocks Flow 4 (metadata registration) to prevent lineage entries for incomplete data.
After running all four foundation pipelines in sequence, the following SQL assertions must return zero rows or zero counts.
-- (1) No two active records for the same territory type and id
SELECT type_code, COUNT(*) AS duplicates
FROM silver.territories
WHERE valid_to IS NULL
GROUP BY type_code, id
HAVING COUNT(*) > 1;
-- (2) No date inversion (valid_from after valid_to)
SELECT id FROM silver.territories
WHERE valid_to IS NOT NULL AND valid_from > valid_to;
-- (3) Every territory has at least one identifier
SELECT t.id FROM silver.territories t
LEFT JOIN silver.territory_identifiers ti ON ti.territory_id = t.id
WHERE ti.id IS NULL;
-- (4) Every active comune is contained in a province
SELECT t.id FROM silver.territories t
WHERE t.type_code = 'COM'
AND t.valid_to IS NULL
AND NOT EXISTS (
SELECT 1 FROM silver.territory_containments tc
WHERE tc.territory_id = t.id AND tc.container_type_code = 'PRO'
);SELECT
COUNT(*) AS active_comuni,
7904 AS istat_baseline_2024,
ROUND(COUNT(*) * 100.0 / 7904, 2) AS coverage_pct
FROM silver.territories
WHERE type_code = 'COM'
AND valid_to IS NULL;The following table reports the thresholds applied by the framework to the coverage and integrity metrics of the territorial foundation. These are acceptance criteria, not production results: the validation status of each dataset is observable from the ETL dashboard (last_sync synchronisation of 12 June 2026), the authoritative source of per-dataset status.
| Metric | PASS threshold | FAIL threshold |
|---|---|---|
| Active comuni | = 7,904 (ISTAT 2024 baseline) | count below baseline with coverage <99% |
| Coverage | 100% | <99% (WARNING between 99% and 99.9%) |
| Temporal violations | 0 | any violation |
| Orphan identifiers | 0 | any orphan entry |
| Containment violations | 0 | any active comune without a province |
The framework verifies that known territory corrections are reflected in silver.territories and related tables. The table lists the cases and the expected outcome according to the applied criteria; verification on production data is tracked by the ETL dashboard.
| Correction | Lookup key (COD_CATASTO) | Expected outcome |
|---|---|---|
| Baranzate (MI) stale valid_to | A618 | valid_to IS NULL |
| Castegnero (VI) ADM alias | C056 | alias name present in territory_names |
Attribute pipelines resolve source identifiers to territory_id values maintained by the territorial foundation and populate silver.territory_attributes. The framework applies to each of them the same classes of acceptance criteria (coverage, accuracy, timeliness, lineage). The table reports the applied thresholds, illustrated on the reference pipeline ADM Tabaccherie; the actual validation status of individual datasets is observable from the ETL dashboard.
| Metric | Threshold |
|---|---|
| Municipal coverage | โฅ95% |
| Resolution failures | 0% |
| Execution time | <24h |
| Lineage completeness | 100% |
The same thresholds apply to the further attribute pipelines as the development team brings them into production; per-dataset progress is tracked by the ETL dashboard.
The timeliness criterion sets, for foundation and priority pipelines, a flow-run completion threshold below 24 hours, measured by Prefect. The table reports the expected data volumes for the main pipelines, which size the processing against that threshold. Actual execution times are recorded in the Prefect flow-run logs and are not available in consolidated form outside those logs; the per-dataset execution status is observable from the ETL dashboard.
| Pipeline | Records processed (expected) |
|---|---|
| variazioni-amministrative (ingestion) | ~500K events |
| province-regioni (transform) | ~120 entities |
| comuni (transform) | ~8,000 comuni ร time windows |
| territory-corrections | ~20 patches |
| ADM Tabaccherie | ~8,000 municipalities |
Description: municipalities that changed ISTAT code due to Assegnazione Provinciale (AP) events could not be resolved by code alone, producing duplicate or missing territory records in naive implementations. Solution: the comuni transform uses a union-find algorithm to group all code variants for the same physical municipality under a single stable territory_id, keyed on the Belfiore cadastral code (COD_CATASTO), which is invariant across province reorganisations. All historical ISTAT codes are stored as entries in silver.territory_identifiers with appropriate valid_from/valid_to ranges. Status: Resolved
Description: the ADM JSF/PrimeFaces portal timed out after 10 minutes on pages listing many municipalities per province. Solution: increased timeout to 30 minutes; implemented retry with exponential backoff in the rate_limiter utility. Status: Resolved
The validation outcome does not remain confined to the flow-run logs but is published in the data catalogue as a tag, making the quality status of each dataset consultable directly from OpenMetadata. Traceability is founded on a taxonomy of tags that classifies each entity along multiple dimensions:
bronze, silver, gold).The Qualitร tag is the direct outcome of the data quality framework and takes four values, two for each validated layer: bronze-validato and bronze-in-revisione for the checks on the raw file, silver-validato and silver-in-revisione for the checks on the EAV records. The validato value is written when all the checks of the respective layer pass the defined thresholds; otherwise the dataset remains in-revisione and Flow 4 does not proceed to register lineage.
The writing of the Quality tag is being extended to all data quality flows. At present about 13 of the 46 data quality flows write the tag to OpenMetadata; the remaining roughly 33 have yet to implement its writing. Completion of this coverage is the responsibility of the development team and proceeds according to the per-dataset rollout; the ETL dashboard is the authoritative source of the validation status of each dataset.
Validating the time series requires comparing observations published on the administrative geography of different years. Because Italian municipal boundaries change over time, mainly through mergers, a direct comparison on territorial identifiers fails at the years of variation. The reconciliation methodology projects all historical observations onto a fixed reference geography โ the municipalities active on 31 December of the target year, currently 2025 โ reducing the code variants of the same physical municipality to a stable identifier through the transitive closure of the succession chains, and aggregating the values of the merged municipalities according to the measure type of the variable.
The aggregation follows five rules, chosen on the basis of the nature of the quantity and defined per column:
sum (extensive quantities): the values of the merged municipalities are summed. It applies to population, number of enterprises, dwellings, and counts and totals in general.mean (intensive quantities referring to population): rates, percentages over population, and average incomes are combined as the mean of the predecessors' values weighted by their population at the merger date.mean_area (intensive quantities referring to area): land-use and soil-consumption percentages are combined as a mean weighted by the surface of the administrative boundaries.last (categorical and administrative values): for territorial labels and codings (altimetric zone, climatic zone, NUTS3 code, morphological classification) the value of the municipality with the larger population at the merger date is taken.none (non-reconcilable variables): for quantities whose aggregation method is not yet established, or for which reconciliation is methodologically inappropriate, the result is NULL for municipalities born from a merger, with a warning to the user; observations of municipalities not involved in mergers pass through unchanged.The aggregation rule is a column-level metadatum, stored in datastore.variable_definitions.aggregation_rule. It is inferred automatically from heuristics on the attribute's type and name and completed and validated column by column by the data stewards: this is the step that makes the entire historical series of a dataset reconcilable. Population weights come from the materialised view gold.population_weights (ISTAT ASTER for 2002โ2018, ISTAT DEMO POS from 2019, contiguous and referring to 1 January); area weights from the administrative boundary geometries. In its current application the methodology handles the 348 municipality extinctions recorded by ISTAT since 1991. The complete methodology โ succession chains, multi-year builds, weight sources, and known limitations โ is described in gst-maps-pipelines/docs/reconciliation-methodology.md.
The technical verification (Chapter 5) ascertains that each phase has produced a result that is present and structurally correct; the content correctness of the data instead requires human judgement, entrusted to the data stewards and articulated across three activities.
The first is the validation of dataset and column metadata in OpenMetadata โ labels, descriptions, dimensions, and tags. In this phase the data stewards also complete the definition of the aggregation rules for reconciliation to the 2025 territories, column by column, as described in the previous section.
The second is the verification of the BronzeโGold correspondence: through the Datastore, the data stewards generate the Gold tables of one or more sources and compare the values in the final tables with those of the original Bronze file, ascertaining that the chain of acquisition, cleaning, and normalisation has not introduced losses or alterations. This is the QA/QC validation of raw dataset acquisition, cleaning, and normalisation that gives this deliverable its name.
The third is the feedback to development: problems detected in the data or in the application's functionality are reported on ClickUp and taken up by the development team for correction. The cycle closes when a new verification confirms the fix and the dataset's status is updated in the dashboard.
The following recommendations concern rollout and consolidation activities that are the responsibility of the development team. The data quality framework, the acceptance criteria, the temporal integrity assertions, the tag taxonomy, and the reconciliation methodology are already delivered and operational; what follows completes their application on production data and hardens their operation.
silver.territory_attributes.territory_id to enforce referential integrity at the storage layer, not only at the application layer.gold.population_weights, so that population weights stay aligned with new extinctions.The MAPS Data Lake validation strategy treats temporal integrity as the primary quality gate. Because every attribute pipeline ultimately resolves source identifiers to territory_id values maintained by the territorial foundation, the correctness of that foundation โ expressed through sound valid_from/valid_to chains and complete containment hierarchies โ is the necessary condition for any downstream quality claim.
What has been delivered in terms of validation is complete and operational: the custom data quality framework run as Flow 3, the acceptance criteria and thresholds for coverage, accuracy, timeliness, temporal integrity, and lineage, the SQL control assertions, the OpenMetadata tag taxonomy with the Quality tag as the validation outcome, and the territorial reconciliation methodology that handles 348 municipality extinctions through the gold.population_weights view. Execution on production data and the per-dataset rollout remain the responsibility of the development team, in particular the completion of the Quality tag writing in the data quality flows that do not yet produce it. Consistently with the delivery report, the authoritative source of the validation status of each dataset is the ETL dashboard, which ensures its verifiability over time.