Appendix A1 — Technology Choice Rationale

This appendix documents the detailed rationale behind the technology choices adopted for the MAPS platform, with quantitative comparisons among the alternatives considered for each component. The document is a companion to the Data-Lake Technical Design Document.

1. PostgreSQL + PostGIS vs BigQuery / Cloud Data Warehouse

The choice of PostgreSQL 17 with the PostGIS 3.5 extension, deployed self-hosted, rests on three factors specific to the MAPS project. The first is data scale: the reference dataset consists of approximately 10,000 municipalities with up to 1,000 attributes each, a volume in the tens of gigabytes, not the petabytes for which cloud data warehouses such as BigQuery are designed. The second is the centrality of spatial operations: isochrone calculation, geographic buffers, and geometry intersections are routine operations in the project, and PostGIS is the industry standard for this type of processing, with an ecosystem of tools and documentation far more mature than BigQuery GIS. The third is cost control: PostgreSQL has fixed and predictable infrastructure costs (approximately €2,400/year), whereas BigQuery applies per-query pricing that, under intensive access patterns, can range from €600 to over €6,000/year and cannot be budgeted in advance.

| Criterion | PostgreSQL + PostGIS | BigQuery | Decision |
|---|---|---|---|
| Data scale | Optimized for 10⁴-10⁶ rows | Optimized for petabytes | PostgreSQL (MAPS scale: ~10⁴ municipalities × ~10³ attributes) |
| Spatial operations | PostGIS = industry standard | BigQuery GIS = limited | PostgreSQL (isochrones, buffers, native intersections) |
| Costs | Predictable (fixed infra) | Per-query pricing | PostgreSQL (~€2.4k/year vs €600-6k+/year, unpredictable) |
| Vendor lock-in | Zero | High | PostgreSQL (independence principle) |
| Maturity | 30+ years | ~15 years | PostgreSQL (established reliability) |
| Integration | Universal | GCP ecosystem | PostgreSQL (works with every tool) |

BigQuery would be justified in scenarios with volumes exceeding 100 GB per single query, with hundreds of concurrent users, with the need for a fully managed service without operational overhead, or with an analytics budget exceeding €5,000 per year. None of these conditions apply to MAPS.
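The spatial operations cited above (buffers, intersections) map directly onto PostGIS SQL. The sketch below builds such a query in Python; the table and column names (`municipalities`, `geom`, `istat_code`) are hypothetical, while the spatial functions (`ST_Buffer`, `ST_Intersects`, `ST_Transform`) are standard PostGIS.

```python
def buffer_intersection_query(radius_m: int) -> str:
    """Build a PostGIS query selecting municipalities whose boundary
    intersects a metric buffer around a reference point.

    The point is projected to EPSG:3857 so the buffer radius is in
    metres, then back to EPSG:4326 for the intersection test.
    Table/column names are illustrative, not the real MAPS schema.
    """
    return f"""
        SELECT m.istat_code, m.name
        FROM municipalities AS m
        WHERE ST_Intersects(
            m.geom,
            ST_Transform(
                ST_Buffer(
                    ST_Transform(ST_SetSRID(ST_MakePoint(%s, %s), 4326), 3857),
                    {radius_m}
                ),
                4326
            )
        );
    """

sql = buffer_intersection_query(5000)  # 5 km buffer
```

BigQuery GIS offers a subset of these functions, but without the metric-projection workflow and tooling ecosystem that PostGIS provides.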

2. DuckDB vs BigQuery for analytics

For exploratory data analysis, DuckDB in embedded mode is the adopted alternative to BigQuery. The main reason is the capability for direct federation with PostgreSQL: DuckDB reads the Gold tables without needing to export data, eliminating an intermediate ETL step and keeping data within the reference system. Added to this is the absence of licensing costs — BigQuery charges approximately €5 per terabyte of data analysed — and the simplicity of use in Python contexts: DuckDB installs as a library and requires no separate infrastructure. On MAPS volumes (approximately 50 GB), typical analytical queries execute in under 10 ms, performance equivalent to that of BigQuery on datasets of this size.

| Criterion | DuckDB | BigQuery | Decision |
|---|---|---|---|
| Sizing | GB scale | TB/PB scale | DuckDB (MAPS volumes: ~50 GB) |
| Costs | Zero | €5/TB queried | DuckDB (saving €600-1,200/year) |
| Performance | ms on GB | ms on TB | DuckDB (queries < 10 ms on MAPS volume) |
| Federation | Reads from PostgreSQL | Requires export | DuckDB (no additional ETL) |
| Portability | Self-contained file | Vendor lock-in | DuckDB (portable Parquet export) |
| Developer UX | Embedded Python | REST API | DuckDB (zero setup overhead) |

3. Prefect vs Airflow/Dagster

Prefect 3.x was chosen as the pipeline orchestrator due to its low adoption overhead compared to the alternatives. Airflow is the most widely used tool in the sector, but requires complex infrastructure, a steep learning curve, and a high operational burden that is difficult to justify for a project of MAPS's size. Dagster offers a modern architecture and a good user experience, but also has a more demanding learning curve. Prefect allows moving from zero to operational pipelines in a few hours, with simple Python decorators, a modern web interface, and an architecture that cleanly separates the orchestration server from the execution workers.

| Criterion | Prefect | Airflow | Dagster | Decision |
|---|---|---|---|---|
| Setup complexity | Low | High | Medium | Prefect |
| Learning curve | Gentle | Steep | Steep | Prefect |
| Python-native | Full | Partial | Full | Prefect |
| Time-to-value | Fast (hours) | Slow (days) | Medium | Prefect |
| UI/UX | Modern | Dated | Modern | Prefect/Dagster |
| Operational overhead | Low | High | Medium | Prefect |

4. Docling vs alternative PDF extraction

A significant portion of the datasets acquired by the project is distributed in PDF format, often containing complex tables. IBM's Docling library, based on the TableFormer model, achieves 97.9% accuracy in PDF table extraction in an independent benchmark on complex financial documents. Camelot and Tabula, the libraries traditionally most widely used for this type of operation, are both in maintenance mode and no longer under active development; available comparative evaluations indicate significantly lower performance on complex tables compared to Docling. Docling is released under the MIT License, natively integrates export to Pandas DataFrame format, and is entirely self-hosted, with no dependency on cloud services. PyMuPDF is used as a fallback for PDFs without complex tables, and pdfplumber for documents with multi-column layouts or particularly problematic structures. Azure Document Intelligence, despite comparable performance (~95%), was excluded due to vendor lock-in constraints and unpredictable costs.

| Library | Accuracy | License | Active development | Native Pandas | Decision |
|---|---|---|---|---|---|
| Docling (TableFormer AI) | 97.9% | MIT | Yes (2025, LF AI) | Yes | Primary |
| pdfplumber | 85-90% | MIT | Yes | Partial | Fallback |
| PyMuPDF | 75-80% | AGPL-3.0 | Yes | Partial | Fallback |
| Camelot | ~70% | MIT | Maintenance mode | No | Excluded |
| Tabula | ~70% | MIT | Maintenance mode | No | Excluded |
| Azure Doc Intelligence | ~95% | Commercial | Yes | Yes | Excluded (vendor lock-in) |
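The fallback policy described above (Docling for complex tables, pdfplumber for multi-column or problematic layouts, PyMuPDF otherwise) can be sketched as a routing function. This is an illustrative helper, not project code: the real pipeline would inspect the document itself, while here the two characteristics are assumed to be already known.

```python
def choose_extractor(has_complex_tables: bool, multi_column: bool) -> str:
    """Route a PDF to an extraction library per the policy above."""
    if has_complex_tables:
        return "docling"      # TableFormer: primary for complex tables
    if multi_column:
        return "pdfplumber"   # fallback for multi-column / problematic layouts
    return "pymupdf"          # fallback for simple PDFs without complex tables
```

Keeping the routing explicit makes it easy to log, per document, which extractor produced the output, which matters when accuracy differs this much between libraries.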

5. OpenMetadata vs DataHub/Atlan/Atlas

OpenMetadata 1.x is the tool chosen for internal data governance. Compared to the alternatives, it combines a contained infrastructure cost (approximately €1,200/year for a dedicated VM), single-machine installation, a modern interface, and direct integration with the project's PostgreSQL/Prefect stack. DataHub requires two or three virtual machines and has slightly higher costs; Atlan offers an excellent user experience but with commercial licenses in the range of €12,000-30,000/year; Apache Atlas is designed for Hadoop ecosystems and would require more than ten virtual machines to operate correctly, making it entirely disproportionate. OpenMetadata is operational within one to two days and introduces no vendor dependencies.

| Criterion | OpenMetadata | DataHub | Atlan | Atlas | Decision |
|---|---|---|---|---|---|
| Annual cost | €1.2k (infra) | €1.8-2.4k | €12-30k (lic.) | €3-5k | OpenMetadata |
| Setup | 1 VM | 2-3 VMs | 1 VM | 11+ VMs | OpenMetadata |
| UI/UX | Modern | Functional | Excellent | Dated | OpenMetadata |
| Stack fit | PostgreSQL/Prefect | Good | Good | Hadoop-only | OpenMetadata |
| Vendor lock-in | Zero | Zero | High | Zero | OpenMetadata |
| Time-to-value | 1-2 days | 2-4 days | Medium | Weeks | OpenMetadata |

6. CKAN for Open Data vs extended use of OpenMetadata

Open data publication requires a tool distinct from OpenMetadata, which is optimised for internal governance and does not support the DCAT-AP_IT standard or integration with the national portal dati.gov.it. CKAN 2.11, self-hosted, is the adopted choice for the public catalogue: it natively implements DCAT-AP_IT, includes a harvester for dati.gov.it, exposes a web portal optimised for non-technical users, and manages multi-format downloads with data preview. Extending OpenMetadata to cover these requirements would require custom development estimated at two to three person-months, with no guarantee of standards compliance.

| Criterion | CKAN | Extended OpenMetadata | Decision |
|---|---|---|---|
| Target audience | External public | Internal data operators | CKAN (separation of concerns) |
| DCAT-AP_IT standard | Native | Not supported | CKAN (WP6 D6.3 requirement) |
| dati.gov.it integration | Built-in harvester | Requires custom development | CKAN (standard CSW/DCAT API) |
| User-friendly portal | Public-optimised web UI | Technical data-engineer UI | CKAN (user experience) |
| Rate limiting & API keys | Native | Limited | CKAN (public access control) |
| Dataset downloads | Multi-format with preview | Metadata only | CKAN (full data access) |
| SEO and discoverability | Search-engine optimised | Not optimised | CKAN (public findability) |
| Licensing metadata | Complete (Creative Commons) | Basic | CKAN (open data compliance) |

The two tools serve complementary and non-overlapping roles. OpenMetadata receives technical metadata from Prefect during pipeline execution, tracks the Bronze → Silver → Gold lineage, records Great Expectations validation results, and provides the internal team with an operational catalogue. CKAN receives from the Gold layer only the datasets approved for publication, enriches them with complete DCAT-AP_IT metadata (licence, temporal coverage, geographic extent, methodology), and makes them accessible to citizens, researchers, and third-party developers, with automatic harvesting towards dati.gov.it.

```mermaid
graph LR
    subgraph "Internal - Data Operators"
        A[Prefect Pipeline] --> B[OpenMetadata]
        B --> C[Lineage Tracking]
        B --> D[Data Quality Metrics]
        B --> E[Schema Profiling]
    end

    subgraph "Public - External Users"
        F[Gold Layer] --> G[CKAN]
        G --> H[DCAT-AP_IT Metadata]
        G --> I[dati.gov.it Harvester]
        G --> J[Public Downloads]
    end

    F -.selected datasets.-> A
```
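Publication to CKAN goes through its Action API (the dataset-creation endpoint is `POST /api/3/action/package_create`). The sketch below builds the metadata payload for one Gold dataset; the field values are hypothetical, while the top-level keys (`name`, `title`, `license_id`, `extras`) follow the standard CKAN package schema, with DCAT-AP_IT properties carried as extras.

```python
# Sketch: metadata payload for CKAN's package_create action.
# Values are illustrative, not real MAPS datasets.

def build_package(name: str, title: str) -> dict:
    return {
        "name": name,            # URL slug, lowercase, no spaces
        "title": title,
        "license_id": "cc-by",   # open licence, per DCAT-AP_IT requirements
        "notes": "Methodology and sources described in the dataset notes.",
        "extras": [
            {"key": "temporal_coverage", "value": "2020/2024"},
            {"key": "geographic_extent", "value": "IT"},
        ],
    }

pkg = build_package("maps-municipalities", "MAPS municipal indicators")

# The payload would then be POSTed with an API key, e.g.:
# requests.post(f"{ckan_url}/api/3/action/package_create",
#               json=pkg, headers={"Authorization": api_key})
```

Datasets published this way are then picked up by the dati.gov.it harvester without further per-dataset work.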

From a cost perspective, adding CKAN to the infrastructure entails an increase of approximately €600/year for the dedicated VM, against a saving of €15,000-20,000 in custom development that would otherwise be required to bring OpenMetadata into compliance with open data standards.

| Item | OpenMetadata only | OpenMetadata + CKAN | Delta |
|---|---|---|---|
| Infra (VM) | €1.2k/year | €1.8k/year | +€600/year |
| Custom development | €15-20k (DCAT-AP_IT) | €0 | -€15-20k |
| Maintenance | High (custom code) | Low (standard stack) | Significant |