Appendix A1 — Technology Choice Rationale

This appendix documents the detailed rationale behind the technology choices adopted for the MAPS platform, with quantitative comparisons among the alternatives considered for each component. The document is a companion to the Data-Lake Technical Design Document.

1. PostgreSQL + PostGIS vs BigQuery / Cloud Data Warehouse

The choice of PostgreSQL 17 with the PostGIS 3.5 extension, deployed self-hosted, rests on three factors specific to the MAPS project. The first is data scale: the reference dataset consists of approximately 10,000 municipalities with up to 1,000 attributes each, a volume in the tens of gigabytes, not the petabytes for which cloud data warehouses such as BigQuery are designed. The second is the centrality of spatial operations: isochrone calculation, geographic buffers, and geometry intersections are routine operations in the project, and PostGIS is the industry standard for this type of processing, with an ecosystem of tools and documentation far more mature than BigQuery GIS. The third is cost control: PostgreSQL has fixed and predictable infrastructure costs (approximately €2,400/year), whereas BigQuery applies per-query pricing that, under intensive access patterns, can range from €600 to over €6,000/year and cannot be budgeted in advance.

| Criterion | PostgreSQL + PostGIS | BigQuery | Decision |
|---|---|---|---|
| Data scale | Optimized for 10⁴-10⁶ rows | Optimized for petabytes | PostgreSQL (MAPS scale: ~10⁴ municipalities × ~10³ attributes) |
| Spatial operations | PostGIS = industry standard | BigQuery GIS = limited | PostgreSQL (isochrones, buffers, native intersections) |
| Costs | Predictable (fixed infra) | Per-query pricing | PostgreSQL (~€2.4k/year vs €600-6k+/year, unpredictable) |
| Vendor lock-in | Zero | High | PostgreSQL (independence principle) |
| Maturity | 30+ years | ~15 years | PostgreSQL (established reliability) |
| Integration | Universal | GCP ecosystem | PostgreSQL (works with every tool) |

BigQuery would be justified in scenarios with volumes exceeding 100 GB per single query, with hundreds of concurrent users, with the need for a fully managed service without operational overhead, or with an analytics budget exceeding €5,000 per year. None of these conditions apply to MAPS.
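The spatial operations cited above (buffers, intersections) map directly onto PostGIS SQL. The sketch below builds such a query in Python; the table and column names (`municipalities`, `geom`, `istat_code`) are hypothetical, while the spatial functions (`ST_Buffer`, `ST_Intersects`, `ST_Transform`) are standard PostGIS.

```python
def buffer_intersection_query(radius_m: int) -> str:
    """Build a PostGIS query selecting municipalities whose boundary
    intersects a metric buffer around a reference point.

    The point is projected to EPSG:3857 so the buffer radius is in
    metres, then back to EPSG:4326 for the intersection test.
    Table/column names are illustrative, not the real MAPS schema.
    """
    return f"""
        SELECT m.istat_code, m.name
        FROM municipalities AS m
        WHERE ST_Intersects(
            m.geom,
            ST_Transform(
                ST_Buffer(
                    ST_Transform(ST_SetSRID(ST_MakePoint(%s, %s), 4326), 3857),
                    {radius_m}
                ),
                4326
            )
        );
    """

sql = buffer_intersection_query(5000)  # 5 km buffer
```

BigQuery GIS offers a subset of these functions, but without the metric-projection workflow and tooling ecosystem that PostGIS provides.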

2. DuckDB vs BigQuery for analytics

For exploratory data analysis, DuckDB in embedded mode is the adopted alternative to BigQuery. The main reason is the capability for direct federation with PostgreSQL: DuckDB reads the Gold tables without needing to export data, eliminating an intermediate ETL step and keeping data within the reference system. Added to this is the absence of licensing costs — BigQuery charges approximately €5 per terabyte of data analysed — and the simplicity of use in Python contexts: DuckDB installs as a library and requires no separate infrastructure. On MAPS volumes (approximately 50 GB), typical analytical queries execute in under 10 ms, performance equivalent to that of BigQuery on datasets of this size.

| Criterion | DuckDB | BigQuery | Decision |
|---|---|---|---|
| Sizing | GB scale | TB/PB scale | DuckDB (MAPS volumes: ~50 GB) |
| Costs | Zero | €5/TB queried | DuckDB (saving €600-1,200/year) |
| Performance | ms on GB | ms on TB | DuckDB (queries < 10 ms on MAPS volume) |
| Federation | Reads from PostgreSQL | Requires export | DuckDB (no additional ETL) |
| Portability | Self-contained file | Vendor lock-in | DuckDB (portable Parquet export) |
| Developer UX | Embedded Python | REST API | DuckDB (zero setup overhead) |

3. Prefect vs Airflow/Dagster

Prefect 3.x was chosen as the pipeline orchestrator due to its low adoption overhead compared to the alternatives. Airflow is the most widely used tool in the sector, but requires complex infrastructure, a steep learning curve, and a high operational burden that is difficult to justify for a project of MAPS's size. Dagster offers a modern architecture and a good user experience, but also has a more demanding learning curve. Prefect allows moving from zero to operational pipelines in a few hours, with simple Python decorators, a modern web interface, and an architecture that cleanly separates the orchestration server from the execution workers.

| Criterion | Prefect | Airflow | Dagster | Decision |
|---|---|---|---|---|
| Setup complexity | Low | High | Medium | Prefect |
| Learning curve | Gentle | Steep | Steep | Prefect |
| Python-native | Full | Partial | Full | Prefect |
| Time-to-value | Fast (hours) | Slow (days) | Medium | Prefect |
| UI/UX | Modern | Dated | Modern | Prefect/Dagster |
| Operational overhead | Low | High | Medium | Prefect |

4. Docling vs alternative PDF extraction

A significant portion of the datasets acquired by the project is distributed in PDF format, often containing complex tables. IBM's Docling library, based on the TableFormer model, achieves 97.9% accuracy in PDF table extraction in an independent benchmark on complex financial documents. Camelot and Tabula, the libraries traditionally most widely used for this type of operation, are both in maintenance mode and no longer under active development; available comparative evaluations indicate significantly lower performance on complex tables compared to Docling. Docling is released under the MIT License, natively integrates export to Pandas DataFrame format, and is entirely self-hosted, with no dependency on cloud services. PyMuPDF is used as a fallback for PDFs without complex tables, and pdfplumber for documents with multi-column layouts or particularly problematic structures. Azure Document Intelligence, despite comparable performance (~95%), was excluded due to vendor lock-in constraints and unpredictable costs.

| Library | Accuracy | License | Active development | Native Pandas | Decision |
|---|---|---|---|---|---|
| Docling (TableFormer AI) | 97.9% | MIT | Yes (2025, LF AI) | Yes | Primary |
| pdfplumber | 85-90% | MIT | Yes | Partial | Fallback |
| PyMuPDF | 75-80% | AGPL-3.0 | Yes | Partial | Fallback |
| Camelot | ~70% | MIT | Maintenance mode | No | Excluded |
| Tabula | ~70% | MIT | Maintenance mode | No | Excluded |
| Azure Doc Intelligence | ~95% | Commercial | Yes | Yes | Excluded (vendor lock-in) |
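The fallback policy described above (Docling for complex tables, pdfplumber for multi-column or problematic layouts, PyMuPDF otherwise) can be sketched as a routing function. This is an illustrative helper, not project code: the real pipeline would inspect the document itself, while here the two characteristics are assumed to be already known.

```python
def choose_extractor(has_complex_tables: bool, multi_column: bool) -> str:
    """Route a PDF to an extraction library per the policy above."""
    if has_complex_tables:
        return "docling"      # TableFormer: primary for complex tables
    if multi_column:
        return "pdfplumber"   # fallback for multi-column / problematic layouts
    return "pymupdf"          # fallback for simple PDFs without complex tables
```

Keeping the routing explicit makes it easy to log, per document, which extractor produced the output, which matters when accuracy differs this much between libraries.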

5. OpenMetadata vs DataHub/Atlan/Atlas

OpenMetadata 1.x is the tool chosen for internal data governance. Compared to the alternatives, it combines a contained infrastructure cost (approximately €1,200/year for a dedicated VM), single-machine installation, a modern interface, and direct integration with the project's PostgreSQL/Prefect stack. DataHub requires two or three virtual machines and has slightly higher costs; Atlan offers an excellent user experience but with commercial licenses in the range of €12,000-30,000/year; Apache Atlas is designed for Hadoop ecosystems and would require more than ten virtual machines to operate correctly, making it entirely disproportionate. OpenMetadata is operational within one to two days and introduces no vendor dependencies.

| Criterion | OpenMetadata | DataHub | Atlan | Atlas | Decision |
|---|---|---|---|---|---|
| Annual cost | €1.2k (infra) | €1.8-2.4k | €12-30k (lic.) | €3-5k | OpenMetadata |
| Setup | 1 VM | 2-3 VMs | 1 VM | 11+ VMs | OpenMetadata |
| UI/UX | Modern | Functional | Excellent | Dated | OpenMetadata |
| Stack fit | PostgreSQL/Prefect | Good | Good | Hadoop-only | OpenMetadata |
| Vendor lock-in | Zero | Zero | High | Zero | OpenMetadata |
| Time-to-value | 1-2 days | 2-4 days | Medium | Weeks | OpenMetadata |

6. CKAN for Open Data vs extended use of OpenMetadata

Open data publication requires a tool distinct from OpenMetadata, which is optimised for internal governance and does not support the DCAT-AP_IT standard or integration with the national portal dati.gov.it. CKAN 2.11, self-hosted, is the adopted choice for the public catalogue: it natively implements DCAT-AP_IT, includes a harvester for dati.gov.it, exposes a web portal optimised for non-technical users, and manages multi-format downloads with data preview. Extending OpenMetadata to cover these requirements would require custom development estimated at two to three person-months, with no guarantee of standards compliance.

| Criterion | CKAN | Extended OpenMetadata | Decision |
|---|---|---|---|
| Target audience | External public | Internal data operators | CKAN (separation of concerns) |
| DCAT-AP_IT standard | Native | Not supported | CKAN (WP6 D6.3 requirement) |
| dati.gov.it integration | Built-in harvester | Requires custom development | CKAN (standard CSW/DCAT API) |
| User-friendly portal | Public-optimised web UI | Technical data-engineer UI | CKAN (user experience) |
| Rate limiting & API keys | Native | Limited | CKAN (public access control) |
| Dataset downloads | Multi-format with preview | Metadata only | CKAN (full data access) |
| SEO and discoverability | Search-engine optimised | Not optimised | CKAN (public findability) |
| Licensing metadata | Complete (Creative Commons) | Basic | CKAN (open data compliance) |

The two tools serve complementary and non-overlapping roles. OpenMetadata receives technical metadata from Prefect during pipeline execution, tracks the Bronze → Silver → Gold lineage, records Great Expectations validation results, and provides the internal team with an operational catalogue. CKAN receives from the Gold layer only the datasets approved for publication, enriches them with complete DCAT-AP_IT metadata (licence, temporal coverage, geographic extent, methodology), and makes them accessible to citizens, researchers, and third-party developers, with automatic harvesting towards dati.gov.it.

```mermaid
graph LR
    subgraph "Internal - Data Operators"
        A[Prefect Pipeline] --> B[OpenMetadata]
        B --> C[Lineage Tracking]
        B --> D[Data Quality Metrics]
        B --> E[Schema Profiling]
    end

    subgraph "Public - External Users"
        F[Gold Layer] --> G[CKAN]
        G --> H[DCAT-AP_IT Metadata]
        G --> I[dati.gov.it Harvester]
        G --> J[Public Downloads]
    end

    F -.selected datasets.-> A
```
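Publication to CKAN goes through its Action API (the dataset-creation endpoint is `POST /api/3/action/package_create`). The sketch below builds the metadata payload for one Gold dataset; the field values are hypothetical, while the top-level keys (`name`, `title`, `license_id`, `extras`) follow the standard CKAN package schema, with DCAT-AP_IT properties carried as extras.

```python
# Sketch: metadata payload for CKAN's package_create action.
# Values are illustrative, not real MAPS datasets.

def build_package(name: str, title: str) -> dict:
    return {
        "name": name,            # URL slug, lowercase, no spaces
        "title": title,
        "license_id": "cc-by",   # open licence, per DCAT-AP_IT requirements
        "notes": "Methodology and sources described in the dataset notes.",
        "extras": [
            {"key": "temporal_coverage", "value": "2020/2024"},
            {"key": "geographic_extent", "value": "IT"},
        ],
    }

pkg = build_package("maps-municipalities", "MAPS municipal indicators")

# The payload would then be POSTed with an API key, e.g.:
# requests.post(f"{ckan_url}/api/3/action/package_create",
#               json=pkg, headers={"Authorization": api_key})
```

Datasets published this way are then picked up by the dati.gov.it harvester without further per-dataset work.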

From a cost perspective, adding CKAN to the infrastructure entails an increase of approximately €600/year for the dedicated VM, against a saving of €15,000-20,000 in custom development that would otherwise be required to bring OpenMetadata into compliance with open data standards.

| Item | OpenMetadata only | OpenMetadata + CKAN | Delta |
|---|---|---|---|
| Infra (VM) | €1.2k/year | €1.8k/year | +€600/year |
| Custom development | €15-20k (DCAT-AP_IT) | €0 | -€15-20k |
| Maintenance | High (custom code) | Low (standard stack) | Significant |