# Disaster Recovery Drill — Runbook

> **Cadence**: quarterly (Q1/Q2/Q3/Q4) + post-incident review.
> **Owner**: Platform Engineering team.
> **Audience**: SRE, ops on-call, DBA.

Procedure di backup/restore + drill periodici per scenario **single-region**. Scope: 3 pilastri — **dati**, **configurazione**, **operatività post-incident**.

> Multi-region (DR Tier 2-4) is documented in [CLAUDE.md "Disaster Recovery Tiers (DR SKU)"](../../CLAUDE.md#disaster-recovery-tiers-dr-sku) but not implemented in v1.x. Default deployment = DR Tier 0/1.

---

## Objectives (internal SLAs, Tier 0/1)

| Metric | Value | Notes |
| --- | --- | --- |
| **RPO** (Recovery Point Objective) | **4 hours** | Max tolerated data loss |
| **RTO** (Recovery Time Objective) | **4 hours** | Max downtime for recovery |
| **Data retention** | 10 years | GDPR + Italian Civil Code art. 2220 (accounting records) |
| **Backup retention** | 30 days hot, 365 days cold | Rotation policy |
| **Target uptime** | 99.5% | ~44h outage/year acceptable |

For Tier 2-4 (warm/hot standby, active-active) see the feature matrix in CLAUDE.md.

---

## Asset inventory

| Asset | Type | Backup method | Effective RPO | Dedicated runbook |
|---|---|---|---|---|
| **HANA HDI** (BTP CF) | Structured DB | BTP auto-backup | 1h | §1 below |
| **PostgreSQL** (Kyma in-cluster) | Structured DB | `cronjob-pg-backup` | 24h (default) | [POSTGRESQL_BACKUP_RESTORE.md](POSTGRESQL_BACKUP_RESTORE.md) |
| **PVC SQLite** (mocked profile) | In-cluster DB | `cronjob-pvc-backup.yaml` | 24h | §4 below |
| **FatturaPA XML** | Blob | S/4 source + Conservatore | 0h | [ARCHIVE_MIGRATION.md](ARCHIVE_MIGRATION.md) |
| **SystemParameters config** | DB row | `exportConfiguration` action + Git | 1h | §3 below |
| **CIAS service key** | Credential | Annual rotation + vault | 365 days | [TOKEN_ROTATION.md](TOKEN_ROTATION.md) |
| **ArchiveLink CS cert** | TLS cert | Quarterly rotation | 90 days | [ARCHIVELINK_CERT_ROTATION.md](ARCHIVELINK_CERT_ROTATION.md) |
| **Kyma APIRule TLS** | TLS cert | Kyma auto-renewal | 90 days | [KYMA_APIRULE_CERT_ROTATION.md](KYMA_APIRULE_CERT_ROTATION.md) |
| **Docker images** | Artifact | GHCR | 0h | n/a (immutable) |
| **Source code** | Git repo | GitHub | 0h | n/a |

---

## 1. Backup HANA HDI (BTP Cloud Foundry)

The `hana-hdi` service on BTP includes SAP-managed automatic backup. The customer **cannot** perform a selective restore. For point-in-time backups, a weekly `EXPORT` via the HANA client is recommended.

```bash
# cf CLI v8 prints header lines before the JSON payload
cf service-key NOVAInvoiceSuite-db db-key | tail -n +3 | jq -r .credentials > /tmp/hana-cred.json

# Weekly point-in-time export — EXPORT writes into a directory, not a single file
/usr/sap/hdbclient/hdbsql -U NOVA -d <database> \
  "EXPORT \"NOVA\".\"*\" AS BINARY INTO '/tmp/nova-$(date +%F)' \
   WITH REPLACE NO DEPENDENCIES THREADS 4"
```

**Enterprise alternative**: SAP HANA Cloud Data Lake, or the managed backup-export schedule (requires the `hana-cloud` plan instead of `hdi-shared`).

### Restore HANA HDI

**Scenario**: user data corruption (not a cluster failure).

```bash
# 1. Open an SAP ticket for point-in-time restore
#    (SAP Help Portal → Open Incident → Component: BC-DB-HDB-HDI)
#    Provide: subaccount GUID, HDI instance name, target timestamp
#    SAP SLA: restore completed within 4h of the ticket (EU business hours)

# 2. Before the restore, stop the app to prevent inconsistent writes
cf stop NOVAInvoiceSuite-srv

# 3. Wait for restore confirmation from SAP; verify the schema
cds deploy --to hana --profile production --dry-run

# 4. Smoke test
curl -sf https://<app-url>/health
curl -sf "https://<app-url>/odata/v4/orchestrator/Invoices?\$top=1" \
  -H "Authorization: Bearer $TOKEN"

# 5. Restart
cf start NOVAInvoiceSuite-srv
```
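The smoke test in step 4 can be wrapped in a small retry loop so the operator does not have to poll `/health` by hand. A sketch — the function name and the 5-minute default timeout are ours:

```shell
# wait_healthy URL [TIMEOUT_S] — poll until the endpoint answers 2xx, or
# return non-zero once the timeout (default 300s) expires
wait_healthy() {
  local deadline=$(( $(date +%s) + ${2:-300} ))
  until curl -sf "$1" > /dev/null; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 5
  done
}

# Usage after `cf start`:
# wait_healthy "https://<app-url>/health" 300 && echo "srv is back"
```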

---

## 2. Backup PostgreSQL (Kyma)

For the full procedure (pg_dump, pg_restore, S3 upload, K8s CronJob) see the dedicated runbook: [**POSTGRESQL_BACKUP_RESTORE.md**](POSTGRESQL_BACKUP_RESTORE.md).

Quick reference:

```bash
# Manual backup — note: no -t flag, a TTY would corrupt the binary dump
kubectl -n nova-invoice-suite exec postgresql-0 -- \
  pg_dump -U nova -F c -Z 9 nova_invoice > nova-$(date +%F).dump

# Upload offsite
aws s3 cp nova-$(date +%F).dump s3://nova-backup/pg/
```
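Before trusting a dump offsite, a quick verification pass is cheap insurance: `pg_restore --list` confirms the archive's table of contents is readable, and a recorded checksum lets a future restore detect bit-rot on S3. A sketch — helper names are ours:

```shell
# checksum_dump FILE — write FILE.sha256 next to the dump
checksum_dump() { sha256sum "$1" > "$1.sha256"; }

# verify_dump FILE — fail if pg_restore cannot read the archive's table of
# contents, then record the checksum
verify_dump() { pg_restore --list "$1" > /dev/null && checksum_dump "$1"; }

# Usage:
# verify_dump nova-$(date +%F).dump
# aws s3 cp nova-$(date +%F).dump.sha256 s3://nova-backup/pg/
```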

---

## 3. Configuration backup (GitOps)

Mutable runtime configuration (`SystemParameters`, `ApprovalRule`, `MatchToleranceConfiguration`, custom fields, process templates) can be exported via the `exportConfiguration` action.

```bash
# Full JSON dump of the configuration
curl -sf -H "Authorization: Bearer $TOKEN" \
  -X POST https://nova-invoice-suite.<cluster>/odata/v4/orchestrator/exportConfiguration \
  > config-$(date +%F).json

# Commit to the backup repo
git -C ../nova-config-backup add config-$(date +%F).json
git -C ../nova-config-backup commit -m "nova config $(date +%F)"
git -C ../nova-config-backup push
```

`importConfiguration` is **idempotent** (upsert on natural keys). It writes a `CONFIG_IMPORT` audit entry that includes the diff.
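Before re-importing an old snapshot, it helps to diff it against the current export; `jq -S` sorts keys so only real changes show up. A sketch — the helper name is ours:

```shell
# config_diff OLD NEW — unified diff of two exportConfiguration dumps, with
# keys sorted so ordering differences do not show up as changes
config_diff() { diff -u <(jq -S . "$1") <(jq -S . "$2"); }

# Usage (exit 0 = identical, 1 = differences found):
# config_diff ../nova-config-backup/config-<prev-date>.json config-$(date +%F).json
```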

### Configuration restore

```bash
curl -sf -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -X POST https://<app-url>/odata/v4/orchestrator/importConfiguration \
  -d @config-$DATE.json
```

---

## 4. Restore PVC (SQLite mocked profile)

```bash
# The pvc-backup cronjob writes tar.gz archives to /var/backups/nova/
kubectl -n nova-invoice-suite cp /local/path/nova-db-$DATE.tar.gz \
  <srv-pod>:/tmp/restore.tar.gz

kubectl -n nova-invoice-suite exec <srv-pod> -- \
  sh -c "cd / && tar xzf /tmp/restore.tar.gz && chown -R 1000:1000 /data"

kubectl -n nova-invoice-suite rollout restart deploy/nova-invoice-suite-srv
```
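Before routing traffic back, sanity-check the restored SQLite file. A sketch — the in-pod DB path `/data/nova.sqlite` is an assumption, adjust it to the actual PVC layout:

```shell
# sqlite_ok DBFILE — succeeds only if PRAGMA integrity_check reports "ok"
sqlite_ok() { [ "$(sqlite3 "$1" 'PRAGMA integrity_check;')" = "ok" ]; }

# Usage inside the pod (DB path is an assumption):
# kubectl -n nova-invoice-suite exec <srv-pod> -- \
#   sh -c 'sqlite3 /data/nova.sqlite "PRAGMA integrity_check;"'
```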

---

## 5. Post-deploy DBA tuning (HANA)

**Candidate** indexes to evaluate with HANA Plan Cache profiling after go-live (do not apply them blindly — indexes degrade the write throughput of the billing snapshot):

```sql
-- Monitoring: top-N slow queries
SELECT TOP 20
  STATEMENT_HASH, AVG_EXECUTION_TIME, EXECUTION_COUNT,
  SUBSTR(STATEMENT_STRING, 1, 200) AS QUERY_SNIPPET
FROM M_SQL_PLAN_CACHE
WHERE SCHEMA_NAME = CURRENT_SCHEMA
  AND AVG_EXECUTION_TIME > 100000
ORDER BY AVG_EXECUTION_TIME * EXECUTION_COUNT DESC;

-- If joins on (CompanyCode, InvoiceId) are slow
CREATE INDEX IDX_LINEITEM_INV      ON "sap_passive_invoice_InvoiceLineItem"      ("CompanyCode","InvoiceId");
CREATE INDEX IDX_WFITEM_INV        ON "sap_passive_invoice_InvoiceWorkflowItem"  ("CompanyCode","InvoiceId");
CREATE INDEX IDX_AUDIT_INV         ON "sap_passive_invoice_AuditLogEntry"        ("CompanyCode","InvoiceId");
CREATE INDEX IDX_PSE_INV           ON "sap_passive_invoice_ProcessStepExecution" ("CompanyCode","InvoiceId");
CREATE INDEX IDX_3WM_INV           ON "sap_passive_invoice_ThreeWayMatchDetail"  ("CompanyCode","InvoiceId");
CREATE INDEX IDX_BPCHECK_INV       ON "sap_passive_invoice_BPResolutionCheck"    ("CompanyCode","InvoiceId");
CREATE INDEX IDX_FISCAL_INV        ON "sap_passive_invoice_FiscalValidationCheck" ("CompanyCode","InvoiceId");
CREATE INDEX IDX_COSTALLOC_INV     ON "sap_passive_invoice_InvoiceCostAllocation" ("CompanyCode","InvoiceId");
CREATE INDEX IDX_INVOICE_STATUS    ON "sap_passive_invoice_GuidEdocInvoice"      ("ProcessingStatus");
CREATE INDEX IDX_INVOICE_CC_STATUS ON "sap_passive_invoice_GuidEdocInvoice"      ("CompanyCode","ProcessingStatus");
CREATE INDEX IDX_CFV_ENTITY        ON "sap_passive_invoice_CustomFieldValue"     ("EntityName","EntityKey");

-- Verify impact (wait for 24h of representative traffic)
SELECT TABLE_NAME, INDEX_NAME, LAST_ACCESS_TIME
FROM M_CS_INDEX_STATISTICS
WHERE TABLE_NAME LIKE 'sap_passive_invoice_%'
  AND INDEX_NAME LIKE 'IDX_%';
```

**Recommended procedure**: add ONE index at a time, measure for 1 week, iterate. Tooling: SAP HANA Cockpit → *Performance → SQL Plan Cache*.

PostgreSQL: the equivalent indexes are already in [db/migrations/pg/002_invoice_children_indices.sql](../../db/migrations/pg/002_invoice_children_indices.sql).

---

## 6. Incident scenarios

### Scenario A — Single-invoice data corruption

1. Identify the GUID + CompanyCode
2. Query the audit log: `SELECT * FROM AuditLogEntry WHERE InvoiceId = ?`
3. If a previous CSV/export is available: manual `UPDATE ... SET` with an audit entry
4. Otherwise: re-ingest from the archive via the `reIngestFromArchive` action
5. Document in an internal ticket

**Downtime**: none. **RPO**: depends on the last valid audit activity.
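Step 2's audit query can also be run over OData, without direct DB access — a sketch; it assumes `AuditLogEntry` is exposed on the orchestrator service (the fields mirror the SQL above), and the helper name is ours:

```shell
# audit_filter GUID COMPANYCODE — build the OData $filter expression for step 2
audit_filter() {
  printf "InvoiceId eq %s and CompanyCode eq '%s'" "$1" "$2"
}

# Usage (TOKEN and <app-url> as in the smoke tests above; -G + --data-urlencode
# takes care of percent-encoding the spaces):
# curl -sfG -H "Authorization: Bearer $TOKEN" \
#   --data-urlencode "\$filter=$(audit_filter <guid> 1000)" \
#   "https://<app-url>/odata/v4/orchestrator/AuditLogEntry"
```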

### Scenario B — nova-srv pod crash (rolling)

K8s automatically restarts the pod; the `readinessProbe` on `/health` keeps traffic away during boot. SLA: ~30-60s perceived downtime.

Action needed only if the crash repeats:
```bash
kubectl -n nova-invoice-suite logs -l app=nova-invoice-suite --tail=200 --previous
```

### Scenario C — HANA HDI container corrupt

1. Open an SAP ticket for `BC-DB-HDB-HDI` — priority *High* in prod
2. Stop srv (`cf stop`) to avoid inconsistent writes
3. Follow the §1 "Restore HANA HDI" procedure
4. Restart srv and monitor Alert Notification for 4h

**Expected downtime**: 2-4h. **RPO**: last BTP auto-backup (~1h).

### Scenario D — BTP subaccount lost (catastrophic)

The BTP subaccount has a 99.99% SAP-managed SLA. In case of loss:

1. New BTP subaccount via [terraform/](../../terraform/) — see [DEPLOYMENT_GUIDE §2-bis](../DEPLOYMENT_GUIDE.md#2-bis-infrastructure-as-code-terraform--opzionale-ma-raccomandato)
2. `./scripts/cias-bootstrap.sh` → re-provision destinations
3. `mbt build && cf deploy` (CF) or `kubectl apply -f k8s/` (Kyma) → redeploy the app
4. `cds deploy --to hana` → schema
5. Restore DB from the latest SAP backup (ticket `BC-DB-HDB-HDI`)
6. Restore config via `importConfiguration`
7. Re-subscribe Event Mesh topics (`SAP_COM_0594` on S/4)

**Expected downtime**: 8-24h. **RPO**: up to 4h. This scenario is outside the user SLA — communicate it to the business before go-live.

### Scenario E — Expired CIAS cert (edge case)

If the scheduled rotation fails and the cert expires:

1. Alert Notification sends the `sub-cert-7d` email
2. Manual: `THRESHOLD_DAYS=400 ./scripts/rotate-cias-cert.sh`
3. If CIAS no longer responds (cert already dead): regenerate the binding manually
   ```bash
   btp delete services/binding NOVAInvoiceSuite-cias-api-key-v1 --confirm
   btp create services/binding --name NOVAInvoiceSuite-cias-api-key-v2 \
     --instance-name NOVAInvoiceSuite-cias-api \
     --parameters '{"xsuaa":{"credential-type":"x509"}}'
   ```
4. Update the 12 destinations manually (via Cockpit), or re-run `terraform apply` with the `destinations` module

**Downtime**: S/4 calls only (ingestion, posting). The UI keeps working on local data.
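To gauge how much runway is left before (or after) rotation, the binding's PEM can be checked with openssl. A sketch using GNU `date`; the helper name and the `jq` path into the binding JSON are assumptions:

```shell
# cert_days_left CERT_PEM — print whole days until notAfter (negative = expired)
cert_days_left() {
  local end
  end=$(openssl x509 -enddate -noout -in "$1" | cut -d= -f2)
  echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Usage — extract the cert from the binding JSON first (JSON path is an assumption):
# jq -r .credentials.certificate binding.json > cias.pem && cert_days_left cias.pem
```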

### Scenario F — Exception flood (operational incident)

More than 1000 InvoiceException records accumulating in 1h typically indicate a misconfigured BP resolution AI, a wrong fiscal validation rule, or an overly aggressive duplicate-detection threshold.

See the dedicated runbook: [**EXCEPTION_FLOOD_RECOVERY.md**](EXCEPTION_FLOOD_RECOVERY.md).

---

## Quarterly drills — procedure

Exercise in the **TEST** environment (never prod) at least once per quarter:

### Q1 — DB restore

**PostgreSQL test** (Kyma):
1. Pre-drill DB snapshot: `pg_dump -F c > pre-drill-$(date +%F).dump`
2. Drop the test schema (NOT prod): `psql -c "DROP SCHEMA nova CASCADE;"`
3. Restore: `pg_restore -F c -d nova_invoice pre-drill-*.dump`
4. Smoke test: `curl /health` + `Invoices?$top=1`
5. Measure: restore time + integrity (row counts pre/post)
6. **Target**: <2h total
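Step 5's integrity check can be scripted. A sketch — `n_live_tup` is a statistics estimate, so for the key tables an exact `SELECT count(*)` is more reliable; function names and the `nova_invoice` connection are assumptions:

```shell
# count_rows OUTFILE — one "table|rows" line per user table (estimate from
# pg_stat_user_tables; use SELECT count(*) for exact numbers on key tables)
count_rows() {
  psql -At -d nova_invoice -c \
    "SELECT relname || '|' || n_live_tup FROM pg_stat_user_tables ORDER BY relname" > "$1"
}

# counts_match PRE POST — non-zero exit + diff output if anything changed
counts_match() { diff -u "$1" "$2"; }

# Usage: count_rows /tmp/pre.counts → drop + restore → count_rows /tmp/post.counts
# counts_match /tmp/pre.counts /tmp/post.counts && echo "row counts match"
```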

**HANA HDI test** (BTP CF):
- Open an SAP ticket `BC-DB-HDB-HDI`, priority Low, scope "drill"
- Measure SAP response time + restore time
- **Target**: <4h total (SAP SLA)

### Q2 — CIAS cert rotation

1. Force it with `THRESHOLD_DAYS=400 ./scripts/rotate-cias-cert.sh` (test env)
2. Verify the new binding is active: `btp list services/bindings | grep cias-api-key`
3. Smoke-test S/4 connectivity post-rotation: `npm run test:integration -- --testPathPattern=s4`
4. Document in `docs/dr-drills-log.md`

### Q3 — Configuration import

1. Retrieve `config-<6-months-ago>.json` from the Git backup repo
2. Run `importConfiguration` on the test env
3. Verify: SystemParameters count, ApprovalRule count, ProcessTemplate count
4. Smoke test: run the `analyzeRisk` action on a sample invoice (verifies that the AI risk weights were restored)

### Q4 — End-to-end DR drill

1. "Subaccount lost" simulation — Scenario D end-to-end on the test env
2. Measure the time for: terraform apply + redeploy + DB restore + config restore
3. **Total target**: <8h
4. Identify bottlenecks + update this runbook

### Record the results

Append to [docs/dr-drills-log.md](../dr-drills-log.md) (create it at the first drill):

```markdown
## 2026-Q1 Drill — Restore PostgreSQL
- Date: 2026-04-15
- Operator: <name>
- Environment: kyma-test
- Duration: 1h 32min (target <2h ✓)
- Issues: pg_restore warning on sap_*_drafts triggers (non-blocking)
- Actions: add the --disable-triggers flag to the script + retest
- Sign-off: <approver>
```

---

## Checklist go-live (pre-production)

- [ ] BTP HANA backup active (Cockpit → Service Instance → Monitoring)
- [ ] CronJob `nova-pvc-backup` scheduled and tested (Kyma)
- [ ] CronJob `cronjob-pg-backup` scheduled (Kyma + PostgreSQL)
- [ ] Alert Notification subscriptions created (`scripts/setup-alert-notification.sh`)
- [ ] Drill scheduled: monthly restore on the test env
- [ ] Runbook read by 2 people on the ops team (knowledge redundancy)
- [ ] CIAS + BTP credentials stored in the company password vault
- [ ] `exportConfiguration` scheduled nightly (cron job or GitHub Actions)
- [ ] SAP Support contact registered (ticket component mapping below)
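The nightly `exportConfiguration` item can be a plain crontab entry — a sketch; the wrapper script path is an assumption and should contain the curl + git commands from §3:

```cron
# m h dom mon dow  command — nightly config export at 02:30
30 2 * * * /opt/nova/bin/export-config.sh >> /var/log/nova-config-export.log 2>&1
```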

---

## Escalation contacts

| Component | SAP Component | Internal contact |
|---|---|---|
| HANA HDI | `BC-DB-HDB-HDI` | Platform Engineering |
| XSUAA | `BC-CP-CF-SEC-IAM` | Security |
| CIAS | `BC-INS-CIT-RT` | Platform Engineering |
| Kyma | `BC-CP-KYMA` | Platform Engineering |
| Destination Service | `BC-CP-CF-DEST` | Platform Engineering |
| Event Mesh | `BC-CP-MES` | Integration |
| App level (CAP) | — | NOVA Dev Team |

---

## References

- [POSTGRESQL_BACKUP_RESTORE.md](POSTGRESQL_BACKUP_RESTORE.md) — detailed PG backup/restore
- [KYMA_APIRULE_CERT_ROTATION.md](KYMA_APIRULE_CERT_ROTATION.md) — APIRule TLS rotation
- [TOKEN_ROTATION.md](TOKEN_ROTATION.md) — quarterly JOB/BILLING/MCP token rotation
- [ARCHIVELINK_CERT_ROTATION.md](ARCHIVELINK_CERT_ROTATION.md) — ArchiveLink CS cert
- [EXCEPTION_FLOOD_RECOVERY.md](EXCEPTION_FLOOD_RECOVERY.md) — Scenario F operational runbook
- [ARCHIVE_MIGRATION.md](ARCHIVE_MIGRATION.md) — end-to-end DMS migration
- [CLAUDE.md "Disaster Recovery Tiers"](../../CLAUDE.md#disaster-recovery-tiers-dr-sku) — DR Tier 0-4 design
