This appendix documents the detailed rationale behind the technology choices adopted for the MAPS platform, with quantitative comparisons among the alternatives considered for each component. The document is a companion to the Data-Lake Technical Design Document.
The choice of PostgreSQL 17 with the PostGIS 3.5 extension, deployed self-hosted, arises from the combination of three factors specific to the MAPS project. The first is data scale: the reference dataset consists of approximately 10,000 municipalities with up to 1,000 attributes each — a volume in the tens of gigabytes, not the petabytes for which cloud data warehouses such as BigQuery are designed. The second is the centrality of spatial operations: isochrone calculation, geographic buffers, and geometry intersections are routine operations in the project, and PostGIS is the industry standard for this type of processing, with an ecosystem of tools and documentation incomparably more mature than BigQuery GIS. The third is cost control: PostgreSQL has fixed and predictable infrastructure costs (approximately €2,400/year), whereas BigQuery applies per-query pricing that, with intensive access patterns, can range from €600 to over €6,000/year, making costs impossible to forecast.
| Criterion | PostgreSQL + PostGIS | BigQuery | Decision |
|---|---|---|---|
| Data scale | Optimized for 10⁴-10⁶ rows | Optimized for petabytes | PostgreSQL (MAPS scale: ~10⁴ municipalities × ~10³ attributes) |
| Spatial operations | PostGIS = industry standard | BigQuery GIS = limited | PostgreSQL (isochrones, buffers, native intersections) |
| Costs | Predictable (fixed infra) | Per-query pricing | PostgreSQL (~€2.4k/year vs €600-6k+/year unpredictable) |
| Vendor lock-in | Zero | High | PostgreSQL (independence principle) |
| Maturity | 30+ years | ~15 years | PostgreSQL (established reliability) |
| Integration | Universal | GCP ecosystem | PostgreSQL (works with every tool) |
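To make the "spatial operations" criterion concrete, the sketch below builds the kind of buffer-plus-intersection query that PostGIS handles natively. The schema and table names (`gold.municipalities`, `gold.service_points`, `istat_code`, `geom`) are assumptions for illustration, not the actual MAPS schema; the PostGIS functions (`ST_Buffer`, `ST_Intersects`) are real.

```python
# Hypothetical illustration of routine PostGIS work: given a municipality,
# find all service points within a 5 km buffer around its geometry.
# Table and column names are placeholders, not the real MAPS schema.

def buffer_query(municipality_code: str, radius_m: int = 5000) -> tuple[str, tuple]:
    """Build a parameterised PostGIS query combining buffer and intersection."""
    sql = """
        SELECT s.id, s.name
        FROM gold.service_points AS s
        JOIN gold.municipalities AS m
          ON ST_Intersects(
               s.geom,
               ST_Buffer(m.geom::geography, %s)::geometry
             )
        WHERE m.istat_code = %s;
    """
    return sql, (radius_m, municipality_code)

sql, params = buffer_query("058091")  # ISTAT code used purely as an example
# Against a live database the query would run as, e.g., cur.execute(sql, params)
# with psycopg; no connection is opened in this sketch.
```

Casting to `geography` before buffering makes the radius a true metric distance rather than degrees, which matters for nationwide accessibility analyses.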
BigQuery would be justified in scenarios with volumes exceeding 100 GB per single query, with hundreds of concurrent users, with the need for a fully managed service without operational overhead, or with an analytics budget exceeding €5,000 per year. None of these conditions apply to MAPS.
For exploratory data analysis, DuckDB in embedded mode is the adopted alternative to BigQuery. The main reason is the capability for direct federation with PostgreSQL: DuckDB reads the Gold tables without needing to export data, eliminating an intermediate ETL step and keeping data within the reference system. Added to this is the absence of licensing costs — BigQuery charges approximately €5 per terabyte of data analysed — and the simplicity of use in Python contexts: DuckDB installs as a library and requires no separate infrastructure. On MAPS volumes (approximately 50 GB), typical analytical queries execute in under 10 ms, performance equivalent to that of BigQuery on datasets of this size.
| Criterion | DuckDB | BigQuery | Decision |
|---|---|---|---|
| Sizing | GB scale | TB/PB scale | DuckDB (MAPS volumes: ~50GB) |
| Costs | Zero | €5/TB query | DuckDB (saving €600-1,200/year) |
| Performance | ms on GB | ms on TB | DuckDB (queries < 10ms on MAPS volume) |
| Federation | Reads from PostgreSQL | Requires export | DuckDB (no additional ETL) |
| Portability | Self-contained file | Vendor lock-in | DuckDB (portable Parquet export) |
| Developer UX | Embedded Python | REST API | DuckDB (zero setup overhead) |
Prefect 3.x was chosen as the pipeline orchestrator due to its low adoption overhead compared to the alternatives. Airflow is the most widely used tool in the sector, but requires complex infrastructure, a steep learning curve, and a high operational burden that is difficult to justify for a project of MAPS's size. Dagster offers a modern architecture and a good user experience, but also has a more demanding learning curve. Prefect allows moving from zero to operational pipelines in a few hours, with simple Python decorators, a modern web interface, and an architecture that cleanly separates the orchestration server from the execution workers.
| Criterion | Prefect | Airflow | Dagster | Decision |
|---|---|---|---|---|
| Setup complexity | Low | High | Medium | Prefect |
| Learning curve | Gentle | Steep | Steep | Prefect |
| Python-native | Full | Partial | Full | Prefect |
| Time-to-value | Fast (hours) | Slow (days) | Medium | Prefect |
| UI/UX | Modern | Dated | Modern | Prefect/Dagster |
| Operational overhead | Low | High | Medium | Prefect |
A significant portion of the datasets acquired by the project is distributed in PDF format, often containing complex tables. IBM's Docling library, based on the TableFormer model, achieves 97.9% accuracy in PDF table extraction in an independent benchmark on complex financial documents. Camelot and Tabula, the libraries traditionally most widely used for this type of operation, are both in maintenance mode and no longer under active development; available comparative evaluations indicate significantly lower performance on complex tables compared to Docling. Docling is released under the MIT License, natively integrates export to Pandas DataFrame format, and is entirely self-hosted, with no dependency on cloud services. PyMuPDF is used as a fallback for PDFs without complex tables, and pdfplumber for documents with multi-column layouts or particularly problematic structures. Azure Document Intelligence, despite comparable performance (~95%), was excluded due to vendor lock-in constraints and unpredictable costs.
| Library | Accuracy | License | Active development | Native Pandas | Decision |
|---|---|---|---|---|---|
| Docling (TableFormer AI) | 97.9% | MIT | Yes (2025, LF AI) | Yes | Primary |
| pdfplumber | 85-90% | MIT | Yes | Partial | Fallback |
| PyMuPDF | 75-80% | AGPL-3.0 | Yes | Partial | Fallback |
| Camelot | ~70% | MIT | Maintenance mode | No | Excluded |
| Tabula | ~70% | MIT | Maintenance mode | No | Excluded |
| Azure Doc Intelligence | ~95% | Commercial | Yes | Yes | Excluded (vendor lock-in) |
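The Docling-first, pdfplumber/PyMuPDF-fallback strategy described above amounts to a priority chain. The sketch below shows that dispatch logic in isolation: the extractor callables are placeholders (in the real pipeline they would wrap Docling, pdfplumber, and PyMuPDF, each returning extracted tables or raising on failure), so no PDF library is invoked here.

```python
from typing import Callable

def extract_tables(pdf_path: str,
                   extractors: list[tuple[str, Callable[[str], list]]]) -> tuple[str, list]:
    """Try each extractor in priority order; return the first that succeeds."""
    errors = []
    for name, fn in extractors:
        try:
            return name, fn(pdf_path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))

# Stubs simulating the priority order Docling -> pdfplumber (-> PyMuPDF):
def docling_stub(path):
    raise ValueError("complex layout not parsed")

def pdfplumber_stub(path):
    return [["header"], ["row"]]

winner, tables = extract_tables(
    "report.pdf",
    [("docling", docling_stub), ("pdfplumber", pdfplumber_stub)],
)
```

Recording which extractor produced each table (the `winner` value) also gives the quality layer a per-document provenance signal.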
OpenMetadata 1.x is the tool chosen for internal data governance. Compared to the alternatives, it combines a contained infrastructure cost (approximately €1,200/year for a dedicated VM), single-machine installation, a modern interface, and direct integration with the project's PostgreSQL/Prefect stack. DataHub requires two or three virtual machines and has slightly higher costs; Atlan offers an excellent user experience but with commercial licenses in the range of €12,000-30,000/year; Apache Atlas is designed for Hadoop ecosystems and would require more than ten virtual machines to operate correctly, making it entirely disproportionate. OpenMetadata is operational within one to two days and introduces no vendor dependencies.
| Criterion | OpenMetadata | DataHub | Atlan | Atlas | Decision |
|---|---|---|---|---|---|
| Annual cost | €1.2k (infra) | €1.8-2.4k | €12-30k (lic.) | €3-5k | OpenMetadata |
| Setup | 1 VM | 2-3 VMs | 1 VM | 11+ VMs | OpenMetadata |
| UI/UX | Modern | Functional | Excellent | Dated | OpenMetadata |
| Stack fit | PostgreSQL/Prefect | Good | Good | Hadoop-only | OpenMetadata |
| Vendor lock-in | Zero | Zero | High | Zero | OpenMetadata |
| Time-to-value | 1-2 days | 2-4 days | Medium | Weeks | OpenMetadata |
Open data publication requires a tool distinct from OpenMetadata, which is optimised for internal governance and does not support the DCAT-AP_IT standard or integration with the national portal dati.gov.it. CKAN 2.11, self-hosted, is the adopted choice for the public catalogue: it natively implements DCAT-AP_IT, includes a harvester for dati.gov.it, exposes a web portal optimised for non-technical users, and manages multi-format downloads with data preview. Extending OpenMetadata to cover these requirements would require custom development estimated at two to three person-months, with no guarantee of standards compliance.
| Criterion | CKAN | Extended OpenMetadata | Decision |
|---|---|---|---|
| Target audience | External public | Internal data operators | CKAN (separation of concerns) |
| DCAT-AP_IT standard | Native | Not supported | CKAN (WP6 D6.3 requirement) |
| dati.gov.it integration | Built-in harvester | Requires custom development | CKAN (standard CSW/DCAT API) |
| User-friendly portal | Public-optimised web UI | UI aimed at data engineers | CKAN (user experience) |
| Rate limiting & API keys | Native | Limited | CKAN (public access control) |
| Dataset downloads | Multi-format with preview | Metadata only | CKAN (full data access) |
| SEO and discoverability | Search engine optimised | Not optimised | CKAN (public findability) |
| Licensing metadata | Complete (Creative Commons) | Basic | CKAN (open data compliance) |
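As a sketch of what publication through CKAN looks like, the function below builds the payload for CKAN's `package_create` action. The Action API and the core fields (`name`, `title`, `license_id`, `extras`) are real CKAN concepts; the specific `extras` keys and values are assumptions illustrating DCAT-AP_IT-style enrichment, and no network call is made.

```python
# Sketch of the metadata payload a Gold dataset approved for publication
# would carry into CKAN. The extras keys below are illustrative assumptions,
# not the exact DCAT-AP_IT property names.

def build_ckan_package(dataset_id: str, title: str) -> dict:
    return {
        "name": dataset_id,
        "title": title,
        "license_id": "cc-by",                   # Creative Commons attribution
        "extras": [
            {"key": "temporal_coverage", "value": "2020/2024"},
            {"key": "spatial", "value": "ITA"},  # geographic extent
            {"key": "frequency", "value": "ANNUAL"},
        ],
    }

pkg = build_ckan_package("maps-municipal-services",
                         "Municipal services indicators")
# A real publication step would POST this to /api/3/action/package_create
# with an API key, e.g. via the requests library.
```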
The two tools serve complementary and non-overlapping roles. OpenMetadata receives technical metadata from Prefect during pipeline execution, tracks the Bronze → Silver → Gold lineage, records Great Expectations validation results, and provides the internal team with an operational catalogue. CKAN receives from the Gold layer only the datasets approved for publication, enriches them with complete DCAT-AP_IT metadata (licence, temporal coverage, geographic extent, methodology), and makes them accessible to citizens, researchers, and third-party developers, with automatic harvesting towards dati.gov.it.
```mermaid
graph LR
    subgraph "Internal - Data Operators"
        A[Prefect Pipeline] --> B[OpenMetadata]
        B --> C[Lineage Tracking]
        B --> D[Data Quality Metrics]
        B --> E[Schema Profiling]
    end
    subgraph "Public - External Users"
        F[Gold Layer] --> G[CKAN]
        G --> H[DCAT-AP_IT Metadata]
        G --> I[dati.gov.it Harvester]
        G --> J[Public Downloads]
    end
    F -. selected datasets .-> A
```
From a cost perspective, adding CKAN to the infrastructure entails an increase of approximately €600/year for the dedicated VM, against a saving of €15,000-20,000 in custom development that would otherwise be required to bring OpenMetadata into compliance with open data standards.
| Item | OpenMetadata only | OpenMetadata + CKAN | Delta |
|---|---|---|---|
| Infra (VM) | €1.2k/year | €1.8k/year | +€600/year |
| Custom development | €15-20k (DCAT-AP_IT) | €0 | -€15-20k |
| Maintenance | High (custom code) | Low (standard stack) | Significant |
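A back-of-envelope check of the table above, assuming a 5-year horizon and taking the lower bound (€15k) of the custom-development estimate:

```python
# Cumulative cost comparison over an assumed 5-year horizon, using the
# figures from the table above (lower bound of the custom-development range).
years = 5
om_only = 1_200 * years + 15_000  # infra + one-off DCAT-AP_IT custom work
om_ckan = 1_800 * years           # infra only, no custom development
savings = om_only - om_ckan       # net saving of the OpenMetadata + CKAN option
```

Even under the most conservative assumptions, the €600/year VM cost is recovered many times over by avoiding the custom compliance work.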