Deliverable D2.1.2: Cloud Architecture Solution Design

3.1 Approach: Self-Hosted on Existing Infrastructure

Deployment Context

The MAPS Data Lake will be deployed on the existing DEPP/Openpolis infrastructure:

  • Server: op-linkurious (8 CPU, 31GB RAM, 621GB disk)
  • Network: Traefik reverse proxy on the gw network
  • Domain: *.maps.deppsviluppo.org

Rationale for Self-Hosting

  1. Predictable costs: no per-query pricing (BigQuery) and no per-GB storage pricing (S3/GCS)
  2. Full control: independence from vendor lock-in
  3. Compliance: data stays on-premise
  4. Appropriate scale: ~10⁴ rows × ~10³ columns do not require cloud-scale infrastructure (see the estimate below)
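
As a rough order-of-magnitude check (assuming ~8 bytes per numeric cell), 10⁴ rows × 10³ columns is about 10⁷ values, i.e. on the order of 100 MB per fully populated table before indexes: comfortably within reach of a single PostgreSQL instance.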

3.2 Container Architecture

Docker Compose Multi-Service

yaml
version: '3.8'

services:
  # Primary storage
  postgres:
    image: postgis/postgis:17-3.5
    container_name: maps-postgres
    volumes:
      - postgres-data:/var/lib/postgresql/data
      - ./init-scripts:/docker-entrypoint-initdb.d
    networks:
      - maps-internal
    environment:
      POSTGRES_DB: maps_db
      POSTGRES_USER: maps
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 8G

  # Orchestration
  prefect-server:
    image: prefecthq/prefect:3-python3.11
    container_name: maps-prefect-server
    command: prefect server start
    networks:
      - maps-internal
      - gw  # Traefik
    environment:
      PREFECT_SERVER_API_HOST: 0.0.0.0
      # Prefect's API database connection goes through SQLAlchemy async, hence the asyncpg driver
      PREFECT_API_DATABASE_CONNECTION_URL: postgresql+asyncpg://prefect:${PREFECT_PASSWORD}@maps-postgres:5432/prefect
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.maps-prefect.rule=Host(`prefect.maps.deppsviluppo.org`)"

  # Worker pools (multi-pool architecture)
  worker-istat:
    build: ./prefect/flows/istat/
    # no fixed container_name: the service runs with deploy.replicas > 1
    command: prefect worker start --pool istat-pool --type process
    volumes:
      - ./prefect/flows/istat:/flows:ro
      - ./shared-data:/data:rw
    networks:
      - maps-internal
    environment:
      PREFECT_API_URL: http://maps-prefect-server:4200/api
    deploy:
      replicas: 2
      resources:
        limits:
          cpus: '1'
          memory: 1G

  worker-pdf:
    build: ./prefect/flows/pdf-extraction/
    container_name: maps-worker-pdf
    command: prefect worker start --pool pdf-pool --type process
    volumes:
      - ./prefect/flows/pdf-extraction:/flows:ro
      - ./shared-data:/data:rw
    networks:
      - maps-internal
    environment:
      PREFECT_API_URL: http://maps-prefect-server:4200/api
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 4G

  # Data catalog
  openmetadata:
    image: openmetadata/server:latest
    container_name: maps-openmetadata
    networks:
      - maps-internal
      - gw
    environment:
      DB_HOST: maps-postgres
      DB_PORT: 5432
      DB_USER: openmetadata
      DB_PASSWORD: ${OPENMETADATA_PASSWORD}
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.maps-metadata.rule=Host(`metadata.maps.deppsviluppo.org`)"

networks:
  maps-internal:
    driver: bridge
  gw:
    external: true

volumes:
  postgres-data:
  # ./shared-data is bind-mounted directly by the worker services, so no named volume is declared for it
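
A minimal bring-up sketch for the stack above (assuming the compose file is saved as docker-compose.yml next to the .env file described in section 3.4):

bash
# Start storage and orchestration first, then workers and the catalog
docker compose up -d postgres prefect-server
docker compose up -d worker-istat worker-pdf openmetadata

# Verify that the work pools registered by the workers are visible to the server
docker compose exec prefect-server prefect work-pool ls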

3.3 Resource Allocation

Service Sizing

Service             CPU     RAM    Storage   Rationale
PostgreSQL          4 core  8GB    200GB     Main workload, PostGIS operations
Prefect Server      1 core  2GB    10GB      Lightweight orchestrator
Worker ISTAT (×2)   1 core  1GB    -         CSV/Excel ingestion (limits per replica)
Worker PDF          4 core  4GB    -         Docling ML models
OpenMetadata        2 core  4GB    50GB      Metadata catalog
TOTAL               8 core  20GB   260GB     Fits on op-linkurious (8/31/621)

Note: the per-container CPU limits sum to 13 cores, so CPU is oversubscribed against the host's 8 physical cores (the TOTAL row reflects the physical ceiling, on the assumption that not all services peak at once); RAM (20GB of 31GB) and storage (260GB of 621GB) fit without oversubscription.
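
To check actual usage against these limits at runtime, a one-shot snapshot can be taken with docker stats (sketch; the names are the container_name values from the compose file, while the replicated ISTAT workers get generated names):

bash
# Point-in-time CPU/memory usage per container, to compare against the configured limits
docker stats --no-stream maps-postgres maps-prefect-server maps-worker-pdf maps-openmetadata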

Vertical/Horizontal Scalability

Vertical (increasing the resources of a single service):

  • PostgreSQL: up to 8 cores / 16GB (if needed)
  • Worker PDF: up to 6 cores / 6GB (for heavy Docling workloads)

Horizontal (replicating services):

  • Worker ISTAT: scale up to 4 replicas (docker compose up -d --scale worker-istat=4, see the check below)
  • Worker PDF: NO scaling (ML models are memory-intensive)
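
The scale command above can be followed by a quick check that the replicas are actually running (sketch):

bash
docker compose up -d --scale worker-istat=4
docker compose ps worker-istat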

3.4 Networking and Security

Traefik Reverse Proxy

yaml
# routing labels on each exposed service
traefik.http.routers.{service}.rule=Host(`{service}.maps.deppsviluppo.org`)
traefik.http.routers.{service}.tls=true
traefik.http.routers.{service}.tls.certresolver=letsencrypt
# container port Traefik forwards to (4200 for Prefect, 8585 for OpenMetadata)
traefik.http.services.{service}.loadbalancer.server.port={port}

DNS Configuration (AWS Route53)

bash
# Script dns-setup.sh
# Double quotes (not single quotes) so that ${SERVER_IP} is actually expanded
aws route53 change-resource-record-sets \
  --hosted-zone-id "${HOSTED_ZONE_ID}" \
  --change-batch "$(cat <<EOF
{
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "prefect.maps.deppsviluppo.org",
      "Type": "A",
      "TTL": 300,
      "ResourceRecords": [{"Value": "${SERVER_IP}"}]
    }
  }]
}
EOF
)"
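
Once the record exists, propagation can be checked with dig (sketch):

bash
dig +short prefect.maps.deppsviluppo.org A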

Firewall Rules

bash
# Ports exposed on op-linkurious
80/tcp   - HTTP (redirects to HTTPS)
443/tcp  - HTTPS (Traefik)
5432/tcp - PostgreSQL (internal network only)
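
A minimal sketch of these rules with ufw (assumptions: the actual firewall tooling on op-linkurious is not specified here, and 192.168.0.0/16 stands in for the internal subnet):

bash
ufw default deny incoming
ufw allow 80/tcp
ufw allow 443/tcp
# PostgreSQL only from the internal network (hypothetical subnet)
ufw allow from 192.168.0.0/16 to any port 5432 proto tcp
ufw enable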

Secrets Management

bash
# Generate the .env file (NOT committed to git) and restrict its permissions
cat > .env <<EOF
POSTGRES_PASSWORD=$(openssl rand -base64 32)
PREFECT_PASSWORD=$(openssl rand -base64 32)
OPENMETADATA_PASSWORD=$(openssl rand -base64 32)
EOF
chmod 600 .env

3.5 Backup and Disaster Recovery

Strategy

PostgreSQL:

bash
#!/bin/bash
# Script backup.sh: timestamped, compressed dump of maps_db
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
docker exec maps-postgres pg_dump -U maps maps_db | \
  gzip > /backup/maps_db_${TIMESTAMP}.sql.gz

# Retention: 7 daily, 4 weekly, 3 monthly
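
A daily schedule consistent with the 24h RPO stated below might look like this (sketch; the install path for backup.sh is an assumption):

bash
# /etc/cron.d/maps-backup (assumed script location)
0 2 * * * root /opt/maps/scripts/backup.sh >> /var/log/maps-backup.log 2>&1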

Shared Data (Bronze layer):

bash
# Incremental rsync of the Bronze layer to the backup server
rsync -avz --progress /data/bronze/ backup-server:/backup/bronze/

RPO/RTO:

  • Recovery Point Objective: 24h (daily backup)
  • Recovery Time Objective: 4h (manual restore)
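
For reference, a manual restore of the most recent dump is essentially the inverse of backup.sh (sketch; the dump filename is a placeholder):

bash
gunzip -c /backup/maps_db_YYYYMMDD_HHMMSS.sql.gz | \
  docker exec -i maps-postgres psql -U maps -d maps_db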

[WIP] This chapter will be completed with:

  • Detailed architecture diagrams
  • Security hardening checklist
  • Monitoring stack (Prometheus, Grafana)
  • CI/CD pipeline

Next chapter: Hosting Infrastructure Specifications