
ETL Pipeline and Documented Scripts

Deliverable D2.2: Documented and tested ETL scripts (Python/SQL)

Deliverable D2.2 consists of the ETL scripts that acquire, transform, and validate data for the MAPS Data Lake. The scripts cover approximately 200 datasets from ISTAT and public institutions, normalised to municipal or supra-municipal granularity and stored in the Silver EAV layer. This chapter provides an overview of the pipeline architecture and describes the implemented pipelines. The complete developer guide — conventions, patterns, deployment procedures, and a reference implementation — is in Appendix A2 — ETL Pipeline Developer Guide.

Pipeline development process

Each pipeline in the MAPS Data Lake follows a standardised four-flow structure that separates data acquisition from transformation, validation, and cataloguing. The separation is deliberate: if the transformation logic changes, only the transform flow needs to be re-run without re-downloading source data; if a validation rule is updated, only the quality flow is triggered. This independence makes iterative development significantly faster and reduces the risk of re-introducing errors when correcting a single stage.

The development process for a new pipeline proceeds through four phases:

  1. Flow 1 — Ingestion: the developer identifies the data source, implements the download or scraping logic, and writes the raw files to the Bronze layer. The flow is considered complete when at least one well-formed file is available at the expected Bronze path and logged in bronze.ingestion_log.
  2. Flow 2 — Transform: the developer reads the Bronze files, resolves territory identifiers via territory_resolver, maps the source columns to EAV attributes, and writes to silver.territory_attributes. This is the most dataset-specific flow.
  3. Flow 3 — Data quality: Great Expectations suites validate both the Bronze files and the Silver records against the acceptance criteria defined for completeness, accuracy, and timeliness.
  4. Flow 4 — Metadata: the pipeline registers Bronze files and Silver tables in OpenMetadata, creating a complete lineage entry from the source to the EAV layer.
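The independence of the four flows can be pictured as stages wired by file-system contracts rather than in-memory hand-offs. The sketch below illustrates the idea with plain functions and hypothetical file contents; in the repository each flow is a separate Prefect flow file:

```python
from pathlib import Path

# Hypothetical Bronze path for an example dataset.
BRONZE = Path("data/bronze/example-ente/example-dataset")

def ingest() -> None:
    """Flow 1: download raw source files into the Bronze layer."""
    BRONZE.mkdir(parents=True, exist_ok=True)
    (BRONZE / "source.csv").write_text("territory;value\nMI;42\n")

def transform() -> list[dict]:
    """Flow 2: read Bronze files and map columns to EAV records.

    Because it reads from disk rather than from Flow 1's memory,
    it can be re-run after a logic change without a new download.
    """
    lines = (BRONZE / "source.csv").read_text().strip().splitlines()[1:]
    records = []
    for line in lines:
        territory, value = line.split(";")
        records.append({"territory": territory,
                        "attribute": "example_count",
                        "value": int(value)})
    return records

ingest()          # run once
records = transform()  # re-runnable independently of ingest()
```

The same contract holds between the transform, quality, and metadata flows: each reads the durable output of the previous stage, never its in-process state.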

Each pipeline lives in a dedicated directory under flows/{ente}/{dataset}/ in the gst-maps-pipelines repository, containing the four flow files, a prefect.yaml deployment manifest, a requirements.txt, and a README.md describing the source, format, and update frequency. The naming convention, shared utilities, team workflow patterns, and deployment procedures are described in detail in Appendix A2.
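A deployment manifest for such a pipeline might look like the following sketch. All names, schedules, and keys here are illustrative; the authoritative conventions and the exact manifest layout used in gst-maps-pipelines are defined in Appendix A2:

```yaml
# flows/example-ente/example-dataset/prefect.yaml (hypothetical example)
name: example-ente-example-dataset
deployments:
  - name: ingestion
    entrypoint: flow_1_ingestion.py:ingest
    schedules:
      - cron: "0 3 1 * *"   # monthly, aligned with the source update frequency
  - name: transform
    entrypoint: flow_2_transform.py:transform
```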

Implemented pipelines

ISTAT territorial foundation

The territorial foundation is a set of four coordinated pipelines that must run in sequence before any attribute pipeline can load data into Silver. All attribute pipelines resolve territory identifiers through silver.territories; this table — and its associated identifier, name, containment, and relationship tables — is populated exclusively by the territorial foundation.

flows/istat/variazioni-amministrative downloads the complete ISTAT administrative change history via the SITUAS REST API at situas-servizi.istat.it/publish. SITUAS organises its datasets by pfun code; the pipeline requests ten datasets covering regioni (pfun 107, 106, 108), province (pfun 113, 112, 114), and comuni (pfun 129, 98, 104, 105), each as a full time series from 17 March 1861. The parameter type differs by dataset: some accept a period range (pdatada/pdataa), others a single reference date (pdata). The shared situas_client.py utility handles both variants. Raw responses are written to Bronze as CSV under data/bronze/istat/variazioni-amministrative/.
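The two parameter variants can be handled with a small dispatch along these lines. This is a sketch: the split into range and single-date datasets shown here is illustrative, and the real situas_client.py utility also manages retries and response decoding:

```python
from urllib.parse import urlencode

BASE_URL = "https://situas-servizi.istat.it/publish"  # SITUAS REST endpoint

# Illustrative split; the authoritative pfun-to-variant mapping
# lives in the pipeline configuration.
RANGE_DATASETS = {107, 113, 129}

def build_query(pfun: int, start: str, end: str) -> str:
    """Build the query string for one SITUAS dataset.

    Range datasets take a period (pdatada/pdataa); the others take
    a single reference date (pdata), here the end of the period.
    """
    if pfun in RANGE_DATASETS:
        params = {"pfun": pfun, "pdatada": start, "pdataa": end}
    else:
        params = {"pfun": pfun, "pdata": end}
    return f"{BASE_URL}?{urlencode(params)}"
```

A caller requests the full time series with, for example, `build_query(107, "17/03/1861", "31/12/2024")`.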

flows/istat/province-regioni reads the variazioni Bronze files and populates silver.territories, silver.territory_identifiers, silver.territory_names, and silver.territory_containments for ripartizioni, regioni, and province. The processing order is deliberate: ripartizioni first, then regioni, then province, then variazioni events (CS = Costituzione, ES = Estinzione). Administrative change events that affected a province's name or boundaries are applied inline, so that valid_from and valid_to timestamps on each record reflect the actual lifecycle of that administrative unit.
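The inline application of lifecycle events can be sketched as follows. The record and event shapes are hypothetical simplifications (the real transform operates on SITAS variazioni columns), but the versioning logic is the same: each event closes the current version's valid_to and, unless the unit is extinguished, opens a new version:

```python
from datetime import date

def apply_events(record: dict, events: list[dict]) -> list[dict]:
    """Expand one territorial unit into temporally versioned rows.

    Each event closes the open version (sets valid_to); any event
    other than an extinction (ES) opens a successor version.
    """
    versions = [dict(record, valid_to=None)]
    for ev in sorted(events, key=lambda e: e["date"]):
        versions[-1]["valid_to"] = ev["date"]
        if ev["type"] != "ES":  # ES = Estinzione: unit ends for good
            versions.append(dict(versions[-1],
                                 name=ev.get("new_name", versions[-1]["name"]),
                                 valid_from=ev["date"],
                                 valid_to=None))
    return versions

# Illustrative example: a province renamed once in its history.
province = {"name": "Forlì", "valid_from": date(1861, 3, 17)}
history = apply_events(province, [
    {"type": "RN", "date": date(1992, 4, 16), "new_name": "Forlì-Cesena"},
])
```

After the call, the first version carries valid_to = 1992-04-16 and the second version the new name with an open valid_to, mirroring the unit's actual lifecycle.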

flows/istat/comuni is the most complex transform, structured in two phases. Phase 1 uses a union-find algorithm to identify equivalence chains: when a municipality's ISTAT code changes due to an AP (Assegnazione Provinciale) or RN (Ridenominazione) event, the linked settlements are grouped under a single stable territory_id. The stable key is the Belfiore cadastral code (COD_CATASTO), which is invariant across province reorganisations. Phase 2 writes annual snapshots of names, ISTAT codes, and containments for each municipality, with valid_from/valid_to derived from the corresponding variazioni event.
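The Phase 1 grouping is standard union-find. The sketch below uses hypothetical ISTAT-code pairs extracted from AP/RN events; in the real flow the pairs come from the SITAS variazioni rows and the resulting groups are keyed by the shared Belfiore code:

```python
def find(parent: dict, x: str) -> str:
    """Return the representative of x, with path halving."""
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent: dict, a: str, b: str) -> None:
    """Merge the groups containing a and b."""
    parent[find(parent, a)] = find(parent, b)

# Each AP/RN event links the old and new ISTAT code of the same
# municipality (hypothetical codes), so the chain collapses into
# one equivalence class mapped to a single stable territory_id.
parent: dict = {}
for old_code, new_code in [("018003", "097003"), ("097003", "097003")]:
    union(parent, old_code, new_code)
```

After processing, `find(parent, "018003") == find(parent, "097003")`, so both historical codes resolve to the same territory chain.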

flows/istat/territory-corrections applies a curated set of corrections that cannot be derived algorithmically from the SITUAS source data. It uses the Belfiore code as the lookup key and performs two tasks: inserting ADM alias names for municipalities whose official label differs from the ADM administrative registry label (enabling territory resolution from ADM-sourced datasets), and patching known data-quality anomalies in the SITUAS source (stale valid_to values, missing containment entries).
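A curated correction can be represented as data keyed by Belfiore code; the row shapes below are hypothetical (the actual curated list and its columns live in the pipeline), but they show the two task kinds side by side:

```python
# Curated corrections, looked up by Belfiore cadastral code
# (hypothetical rows for illustration).
CORRECTIONS = [
    {"belfiore": "A271", "kind": "adm_alias", "adm_name": "ANCONA"},
    {"belfiore": "B354", "kind": "patch_valid_to", "valid_to": None},
]

def corrections_for(belfiore: str) -> list[dict]:
    """Return the curated corrections that apply to one municipality."""
    return [c for c in CORRECTIONS if c["belfiore"] == belfiore]
```

The alias entries feed territory resolution for ADM-sourced datasets; the patch entries overwrite the anomalous SITAS values after the main load.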

Together, these four pipelines establish a complete, temporally versioned graph of Italian territorial entities from 1861 to the present. Every downstream attribute pipeline resolves its source identifiers to territory_id values via territory_resolver before writing to silver.territory_attributes.

ADM Tabaccherie

The ADM Tabaccherie pipeline acquires the list of licensed tobacco retailers from the Agenzia delle Dogane e dei Monopoli (ADM) web portal. The portal uses a JSF/PrimeFaces interface that requires javax.faces.ViewState token management; the html_parser utility handles this transparently. For each municipality, the pipeline queries the portal, parses the HTML result table, and writes raw HTML pages to Bronze under data/bronze/agenzie-dogane/tabacchi/.
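ViewState management amounts to extracting the token from each response and echoing it back on the next POST. A minimal sketch of the extraction step (the real html_parser utility additionally manages session cookies and the remaining JSF form fields):

```python
import re

# JSF embeds the token as a hidden input on every rendered page.
VIEWSTATE_RE = re.compile(
    r'name="javax\.faces\.ViewState"[^>]*value="([^"]+)"'
)

def extract_viewstate(html: str) -> str:
    """Pull the JSF ViewState token out of a PrimeFaces response page."""
    match = VIEWSTATE_RE.search(html)
    if match is None:
        raise ValueError("no javax.faces.ViewState token in page")
    return match.group(1)

page = '<input type="hidden" name="javax.faces.ViewState" value="-123:456" />'
token = extract_viewstate(page)  # "-123:456", sent back on the next request
```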

The transform flow resolves municipal identifiers through territory_resolver using the province abbreviation and municipality name from the ADM response, then writes one EAV record per attribute (outlet count, addresses) to silver.territory_attributes. This pipeline is the reference implementation for the service-count attribute pattern used by other ADM-sourced datasets.
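The service-count attribute pattern can be pictured as fanning one resolved portal row out into one EAV record per attribute. The helper and value names below are hypothetical; the target columns follow the silver.territory_attributes schema described above:

```python
def to_eav(territory_id: int, parsed: dict) -> list[dict]:
    """Fan one parsed ADM row out into one EAV record per attribute."""
    return [
        {"territory_id": territory_id, "attribute": attr, "value": value}
        for attr, value in parsed.items()
    ]

# One municipality resolved by territory_resolver, two attributes out.
rows = to_eav(4201, {"tabaccherie_count": 12,
                     "tabaccherie_addresses": "Via Roma 1; Corso Italia 3"})
```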

Additional pipeline entries will be added as each pipeline reaches production.

Data quality framework

Great Expectations suites

Each pipeline includes two Great Expectations suites: one validating the Bronze file structure and one validating the Silver records. The Bronze suite checks that the raw source file is well-formed and contains the expected columns; the Silver suite checks that the EAV records are complete, consistent, and correctly linked to silver.territories.

The example below shows the Bronze suite for the ISTAT variazioni-amministrative pipeline:

```python
# expectations/istat_variazioni_bronze.py
import great_expectations as gx

suite = gx.core.ExpectationSuite(name="istat_variazioni_bronze")

# Required columns present
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_table_columns_to_match_set",
        kwargs={"column_set": ["COD_COM_STORICO", "COD_CATASTO", "COMUNE_IT",
                               "COD_PROV_STORICO", "DATA_INI_EFF", "DATA_FIN_EFF"]}
    )
)

# Cadastral code never null (stable key for union-find equivalence chains)
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "COD_CATASTO"}
    )
)

# Date format
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_match_regex",
        kwargs={
            "column": "DATA_INI_EFF",
            "regex": r"^\d{2}/\d{2}/\d{4}$"
        }
    )
)
```

The Silver suite validates that silver.territory_attributes records written by an attribute pipeline are correctly linked:

```python
# expectations/territory_attributes_silver.py
import great_expectations as gx

suite = gx.core.ExpectationSuite(name="territory_attributes_silver")

# No null territory_id (every record must resolve to a known territory)
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "territory_id"}
    )
)

# Coverage: at least 95% of active municipalities must have a record
suite.add_expectation(
    gx.core.ExpectationConfiguration(
        expectation_type="expect_table_row_count_to_be_between",
        kwargs={
            "min_value": 7500,  # ~95% of municipalities (7,904 active as of 2024)
            "max_value": 8500
        }
    )
)
```

Monitoring and alerting

Prefect dashboard

Logging

Worker logs are available in the Prefect UI under each flow run and written to the mapped data/logs/workers/ volume on the host.


[WIP] This chapter will be completed with:

  • Complete documentation for all priority pipelines
  • Unit tests for each task
  • End-to-end integration tests
  • Performance benchmarks