# Cloud architecture solution design

Deliverable D2.1.2: Cloud Architecture Solution Design

## Guiding principles

The MAPS project handles Italian public data at municipal granularity, with volumes on the order of tens of gigabytes and a small operational team. The cloud architecture is designed around three principles derived from this context.

The first is cost predictability. Large cloud platforms (AWS, GCP, Azure) adopt variable pricing models — per query, per GB processed, per request — that make it difficult to estimate monthly expenditure in advance. For a research project with a defined budget, this risk is unacceptable. The chosen infrastructure adopts fixed monthly pricing for all main components.

The second is operational simplicity. Management complexity must be proportional to the scale of the project. Managed services (Kubernetes cluster, PostgreSQL database) eliminate the maintenance burden of the control plane, security updates, and backup management, leaving the team responsible only for application workloads.

The third is scale appropriateness. The infrastructure is sized for the project's actual workloads: approximately 200 datasets at municipal granularity, ETL pipelines running in sporadic batch mode, and concurrent users in the single digits. There is no need to design for enterprise-scale loads.

## Provider selection: DigitalOcean

The infrastructure is hosted on DigitalOcean. Compared to hyperscalers (AWS, GCP, Azure), DigitalOcean offers a more straightforward management experience, a shallower learning curve, and a control panel that does not require specialised cloud expertise. Pricing is fixed monthly for all main resources (cluster nodes, managed database, block storage, load balancer), with no variable pricing mechanisms.

DigitalOcean supports PostGIS extensions in its managed database service, an essential requirement for the project's spatial operations. The service catalogue covers all components needed for the MAPS architecture without requiring integrations with external providers.

## Container orchestration: Kubernetes (DOKS)

Platform services are deployed as containers orchestrated by Kubernetes, using the managed version offered by DigitalOcean (DOKS — DigitalOcean Kubernetes Service).

The choice of Kubernetes over Docker Compose is motivated by three considerations. The first is declarative infrastructure management: Kubernetes describes the desired state of the system in versionable manifests, with automatic convergence. If a container stops, it is recreated; if a node becomes unavailable, workloads are redistributed to the remaining ones. Docker Compose offers no equivalent self-healing or distribution across multiple nodes.

The second is workload separation. Platform services have very different resource consumption profiles: Prefect ETL pipelines are CPU-intensive and sporadic, PostgreSQL requires constant disk I/O, and OpenMetadata and CKAN have moderate but continuous consumption. Kubernetes allows resources (CPU requests and limits, memory) to be allocated per service, preventing a spike in one component from degrading the others.

The third is the evolution outlook. The project envisions subsequent phases that will introduce new services and a larger number of users. Kubernetes supports this growth without requiring an architectural migration: adding nodes to the cluster or new deployments is a routine operation.

DOKS eliminates the complexity of managing the control plane (API server, etcd, scheduler), which remains the provider's responsibility, leaving the team to manage only the application workloads.
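To make the per-service allocation described above concrete, the sketch below shows how the Prefect worker's profile might be declared as Kubernetes resource requests and limits. The namespace matches the topology later in this document; the Deployment name, labels, and image tag are illustrative assumptions, not final manifests.

```yaml
# Indicative sketch: resource allocation for the Prefect worker.
# Requests reserve capacity for ETL bursts; limits cap the spike so
# one component cannot degrade the others.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-worker
  namespace: maps-app          # application workloads namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-worker
  template:
    metadata:
      labels:
        app: prefect-worker
    spec:
      containers:
        - name: worker
          image: prefecthq/prefect:2-latest   # illustrative tag
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "2"
              memory: 4Gi
```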

## Database: managed vs self-hosted

The choice between a PostgreSQL instance managed as a service (DigitalOcean Managed Databases) and a self-hosted instance in the Kubernetes cluster has implications for cost, operations, and flexibility.

With the managed service, the provider handles automatic backups with point-in-time recovery, automatic failover, security patching, and version upgrades. Operational overhead is minimal. The main limitation is reduced configuration flexibility: advanced tuning parameters and non-standard extensions may not be available. The cost is higher than self-hosted for equivalent allocated resources (approximately 20–30% more), but for a small team the savings in maintenance hours more than compensate.

With PostgreSQL self-hosted in the cluster (via a Kubernetes operator such as CloudNativePG or Zalando Postgres Operator), full control over configuration, extensions, and tuning is achieved. The operational burden is higher: backups, monitoring, failover, and upgrades are the team's responsibility.

| Criterion | Managed Database | Self-Hosted (K8s Operator) |
|---|---|---|
| Backup and recovery | Automatic (daily, PITR) | To be configured |
| Failover | Automatic | Managed by operator |
| PostGIS | Supported | Full control |
| Advanced tuning | Limited | Complete |
| Cost (4 vCPU, 8 GB) | ~$80/month | ~$48/month (cluster resources) |
| Operational overhead | Low | Medium-high |
| Custom extensions | Predefined list | Any |

For the MVP phase, the managed database is the recommended option: it reduces operational risk and allows the team to focus on pipeline development. Migration to self-hosted remains possible at any time, should specific configuration requirements emerge that the managed service does not support.
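Should the self-hosted path be chosen later, a CloudNativePG cluster declaration might look like the following sketch. The instance count, PostGIS-enabled image, and backup bucket path are illustrative assumptions; a real deployment would also need object-store credentials and connection pooling configured.

```yaml
# Indicative sketch: self-hosted PostgreSQL + PostGIS via CloudNativePG.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: maps-postgres
  namespace: maps-data
spec:
  instances: 2                                  # primary + replica; failover handled by the operator
  imageName: ghcr.io/cloudnative-pg/postgis:16  # PostGIS-enabled image
  storage:
    size: 100Gi
    storageClass: do-block-storage
  backup:
    barmanObjectStore:
      destinationPath: s3://maps-backups/postgres   # hypothetical bucket
```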

## Infrastructure sizing

### Workload profile

The project's data sources have a typical update frequency of semi-annual or annual. Prefect ETL pipelines therefore operate in sporadic batch mode: execution occurs when a source publishes an update, or during the initial bulk loading phases of the datasets. CPU spikes are concentrated in the PDF parsing phases with Docling and in PostGIS spatial transformations, but occur only occasionally. Web services (CKAN, OpenMetadata, Prefect UI) have constant but contained consumption, with concurrent users in the single digits.

### Cluster configuration

A cluster with three General Purpose nodes of 4 vCPU and 8 GB RAM each is sufficient to guarantee availability and workload distribution. The indicative workload distribution is as follows:

| Component | CPU request | Memory request | Storage | Notes |
|---|---|---|---|---|
| PostgreSQL + PostGIS | 2 vCPU | 4 GB | 100 GB (block storage) | Only if self-hosted |
| Prefect Server | 0.5 vCPU | 1 GB | 10 GB | UI and orchestration |
| Prefect Worker | 2 vCPU | 4 GB | 50 GB (Bronze files) | Burst during ETL |
| OpenMetadata | 1 vCPU | 2 GB | 20 GB | Catalogue and lineage |
| CKAN + Redis | 1 vCPU | 2 GB | 20 GB | Public catalogue |
| Ingress Controller | 0.25 vCPU | 256 MB | - | HTTPS traffic |

Under ordinary conditions the cluster requires approximately 7 vCPU and 13 GB RAM, with peaks up to 10 vCPU during ETL executions. Three nodes of 4 vCPU / 8 GB (total 12 vCPU / 24 GB) provide the margin needed for peaks and to tolerate the loss of one node without service interruption.

### Storage

DigitalOcean Volumes (block storage) provide persistent storage for the database and the Bronze layer. The database volume is sized at 100 GB with the possibility of live expansion. The Bronze layer requires approximately 50 GB for original files, also expandable. Volumes are replicated by the provider with protection against single-disk failure.
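On DOKS, these volumes are provisioned through the `do-block-storage` CSI storage class, which supports live expansion. A claim for the Bronze layer might look like the following sketch; the claim name and namespace are illustrative assumptions.

```yaml
# Indicative sketch: persistent volume for Bronze-layer original files.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: bronze-storage
  namespace: maps-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: do-block-storage   # DigitalOcean Volumes CSI class
  resources:
    requests:
      storage: 50Gi                    # expandable live by editing this value
```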

### Monthly cost estimate

| Item | Configuration | Monthly cost |
|---|---|---|
| DOKS cluster (control plane) | Managed | $12 |
| 3 General Purpose nodes | 4 vCPU / 8 GB each | $192 ($64/node) |
| Managed PostgreSQL | 4 vCPU / 8 GB, 100 GB storage | $80 |
| Block storage (150 GB) | Bronze + application data | $15 |
| Load Balancer | HTTPS Ingress | $12 |
| **Total** | | ~$311/month (~$3,700/year) |

If the database is self-hosted in the cluster, the Managed PostgreSQL cost ($80/month) is eliminated, reducing the total to approximately $231/month (~$2,770/year), at the cost of greater operational effort.

## Service topology

```mermaid
graph TB
    subgraph "Internet"
        U[Users / API Client]
    end

    subgraph "DigitalOcean"
        LB["Load Balancer<br>HTTPS Termination"]
        subgraph "DOKS Cluster (3 nodes)"
            subgraph "Namespace: maps-app"
                PS[Prefect Server]
                PW[Prefect Worker]
                OM[OpenMetadata]
                CK[CKAN]
                RD[Redis]
            end
            subgraph "Namespace: maps-data"
                PG["PostgreSQL + PostGIS<br>(if self-hosted)"]
                BZ["Bronze Storage<br>Persistent Volume"]
            end
            subgraph "Namespace: ingress"
                IC[Ingress Controller]
            end
        end
        MDB["Managed PostgreSQL<br>(if managed)"]
    end

    U --> LB
    LB --> IC
    IC --> PS
    IC --> OM
    IC --> CK
    PS --> PW
    PW --> PG
    PW --> MDB
    PW --> BZ
    OM --> PG
    OM --> MDB
    CK --> RD
    CK --> PG
    CK --> MDB
```

Services are organised in distinct Kubernetes namespaces to separate application workloads from persistent data. The Ingress Controller handles incoming HTTPS traffic and routing to exposed services (Prefect UI, OpenMetadata, CKAN). Internal services (Prefect Worker, Redis) are not exposed externally and communicate exclusively over the cluster's internal network.

## Security

### Network isolation

Kubernetes Network Policies restrict communication between pods: application services can reach the database but do not communicate with each other unless required. The Ingress Controller is the sole external entry point, with TLS termination and certificates managed automatically via cert-manager and Let's Encrypt.
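As an illustration, a policy admitting only application-namespace traffic to a self-hosted database pod might be sketched as follows. The `app: postgres` pod label is a hypothetical assumption; the `kubernetes.io/metadata.name` label is set automatically on namespaces by Kubernetes.

```yaml
# Indicative sketch: only pods in maps-app may reach PostgreSQL on 5432.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-postgres
  namespace: maps-data
spec:
  podSelector:
    matchLabels:
      app: postgres              # hypothetical database pod label
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: maps-app
      ports:
        - protocol: TCP
          port: 5432
```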

### Database access control

PostgreSQL implements a three-tier authorisation model: a maps_writer role with write permissions on the Bronze and Silver schemas, a maps_reader role with read-only access to the Gold layer, and a maps_api role with access limited to the tables exposed by web applications. This separation ensures that ETL pipelines cannot alter published data, and that public-facing services cannot access raw or intermediate data.
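A minimal sketch of this model, assuming the Bronze/Silver/Gold layers map to PostgreSQL schemas named `bronze`, `silver`, and `gold` (the exposed table name is hypothetical):

```sql
-- Sketch of the three-tier role model; schema and table names are assumptions.
-- Passwords are omitted: credentials are provisioned via the secrets manager.

-- ETL pipelines: write access to raw and intermediate layers only
CREATE ROLE maps_writer LOGIN;
GRANT USAGE, CREATE ON SCHEMA bronze, silver TO maps_writer;

-- Analytical consumers: read-only access to the published layer
CREATE ROLE maps_reader LOGIN;
GRANT USAGE ON SCHEMA gold TO maps_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA gold TO maps_reader;
ALTER DEFAULT PRIVILEGES IN SCHEMA gold
    GRANT SELECT ON TABLES TO maps_reader;

-- Web applications: limited to explicitly exposed tables
CREATE ROLE maps_api LOGIN;
GRANT USAGE ON SCHEMA gold TO maps_api;
GRANT SELECT ON gold.dataset_catalogue TO maps_api;  -- hypothetical table
```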

### Secrets management

Credentials are managed via a secret management system external to the cluster (AWS Secrets Manager or equivalent), which acts as the source of truth for all infrastructure secrets. Secrets are injected into Kubernetes pods via External Secrets Operator, which automatically synchronises values to Kubernetes Secrets. API keys for external services are configured as Prefect Secret Blocks, also fed by the same mechanism. No credentials are stored in Kubernetes manifests or in the code repository.
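As a sketch of the synchronisation step, an `ExternalSecret` resource maps an entry in the external store to a Kubernetes Secret. The store name and remote key path below are hypothetical and depend on how the external manager is organised.

```yaml
# Indicative sketch: External Secrets Operator pulling database credentials.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: postgres-credentials
  namespace: maps-app
spec:
  refreshInterval: 1h                # re-sync period from the external store
  secretStoreRef:
    name: aws-secrets-manager        # hypothetical ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: postgres-credentials       # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: maps/postgres           # hypothetical path in the external store
        property: password
```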