Servers
—/4
SKY · RAIN · SUN · WIND
CIS Compliance
100%
193/193 controls
Cloud VPN
ESTABLISHED
130.211.194.72 ↔ 38.140.146.68
DNS Serial
2026031002
DNSSEC signed · 134 records
DHCP Hosts
112
Reservations active
Failover
NORMAL
SKY primary · RAIN secondary
On-Prem — WDC Infrastructure
DNS Queries (7 days)
Server Uptime %
DHCP Leases Active
SKY — 192.168.120.1
Primary DNS/DHCP
RAIN — 192.168.120.2
Secondary DNS/DHCP
SUN — 192.168.120.3
Monitoring · Prometheus + Grafana
WIND — 192.168.120.4
Logging · ELK Stack
Cloud — GCP (us-central1)
VPN Tunnel ESTABLISHED
WDC 38.140.146.68 ↔ GCP 130.211.194.72 · IKEv2 · AES-256
192.168.120.0/23 + 192.168.124.0/24 ↔ 172.16.0.0/24
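The traffic selectors above decide which addresses may cross the tunnel. A minimal Python sketch of that membership check, using only the standard `ipaddress` module; the helper name `side_of_tunnel` is hypothetical and not part of the deployment:

```python
import ipaddress

# Traffic selectors copied from the tunnel summary above.
ON_PREM = [
    ipaddress.ip_network("192.168.120.0/23"),  # production
    ipaddress.ip_network("192.168.124.0/24"),  # management
]
GCP = ipaddress.ip_network("172.16.0.0/24")


def side_of_tunnel(addr: str) -> str:
    """Classify an address as on-prem, gcp, or outside the tunnel selectors."""
    ip = ipaddress.ip_address(addr)
    if any(ip in net for net in ON_PREM):
        return "on-prem"
    if ip in GCP:
        return "gcp"
    return "outside"


# The /23 spans 192.168.120.0 through 192.168.121.255.
print(side_of_tunnel("192.168.121.50"))
```

Addresses that fall in neither set, such as 192.168.122.1, are not covered by the selectors and would not be routed over the VPN.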
MkDocs Portal
gpus-mkdocs-portal · us-central1
Running
HTTPS ✓ · Scale-to-zero ✓ · Public ✓
Status Site
gpus-status-site · us-central1
Running
HTTPS ✓ · Scale-to-zero ✓ · greenpeace.us ✓
Security Monitoring
AIDE · auditd · Fail2ban · SELinux · Firewall
| Server | AIDE | auditd | Fail2ban | SELinux | Firewall |
|---|---|---|---|---|---|
| SKY | ✓ Clean | Immutable | sshd active | Enforcing | drop |
| RAIN | ✓ Clean | Immutable | sshd active | Enforcing | drop |
| SUN | ✓ Clean | Immutable | — | Enforcing | drop |
| WIND | ✓ Clean | Immutable | — | Enforcing | drop |
98
Security Posture Score
CIS compliance · monitoring coverage · backup health · threat activity
CIS Compliance
100%
193/193 on-prem
Uptime (30d)
99.9%
All services
Open Incidents
0
Last: none
Assets Monitored
129
of 131 (98.5%)
Backups
OK
Daily + GCS offsite
Threats
0
Active
CIS Compliance — Per Server
SKY
47/47
100% ✓
RAIN
47/47
100% ✓
SUN
48/48
100% ✓
WIND
51/51
100% ✓
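The 193/193 headline figure is the sum of the per-server counts above. A short Python sketch of that roll-up, assuming each tile is recorded as a (passed, total) pair; the function name is illustrative only:

```python
# Per-server CIS results as (passed, total), copied from the tiles above.
CIS_RESULTS = {
    "SKY": (47, 47),
    "RAIN": (47, 47),
    "SUN": (48, 48),
    "WIND": (51, 51),
}


def cis_summary(results):
    """Return (passed, total, percent) across all servers."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed, total, round(100 * passed / total, 1)


passed, total, percent = cis_summary(CIS_RESULTS)
print(f"{passed}/{total} controls ({percent}%)")
```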
GCP Cloud Controls
🔐
Data Encryption
VPN AES-256 · GCS encryption at rest · Cloud Run HTTPS
CIS 3.11 PCI 4.1
🔥
VPC Firewall
Default deny-all · VPN + internal rules only
CIS 4.4 PCI 1.2.2
📊
Audit Logging
VPC Flow Logs · Cloud Audit Logs automatic
CIS 8.3 NIST AU-6
🔄
Data Recovery
GCS Nearline · 90-day retention · versioning
CIS 11.1 NIST CP-9
🔀
Network Segmentation
VPC 172.16.0.0/24 · VPN-only from on-prem
CIS 12.4 NIST SC-7
🛡
Transmission Security
IKEv2 · AES-256 · SHA-256 · DH14
NIST SC-8 PCI 1.5.1
Risk Register
Disaster Recovery Plan drafted (GPUS-DRP-001 v1.0) — formal sign-off pending
IT
Incident Response Plan drafted (GPUS-IRP-001 v1.0) — formal sign-off pending
IT
SSO not implemented — 42 apps need Okta
IT
Backup pipeline to GCS not configured
IT
Data classification not started — 7 payment + 42 supporter-data apps
IT/Legal
Active Threats
0
Open Incidents
0
Fail2ban Bans (30d)
12
AIDE Violations
0
Critical CVEs
0
Open Risks
3
Threat Detection
Fail2ban Blocks (30d)
Firewall Drops (7d)
AIDE Scans (30d)
Fail2ban — SSH Intrusion Prevention
12
Blocked IPs in 30 days — SKY + RAIN
Block
SKY · 2026-03-09
103.145.xx.xx — 47 failed SSH → banned 24h
Block
SKY · 2026-03-08
185.220.xx.xx — 31 failed SSH → banned 24h
Info
9 additional IPs — automated scanners / Tor exits
AIDE — File Integrity
0
Clean
SKY — baseline 2026-03-10 · no changes
Clean
RAIN — baseline 2026-03-10 · no changes
Clean
SUN — baseline 2026-03-10 · no changes
Clean
WIND — baseline 2026-03-10 · no changes
auditd — Suspicious Activity
0
Clean
No unauthorized sudo attempts (30d)
Clean
No unauthorized DNSSEC key/zone access
Clean
No anomalous processes
Mode: Immutable (-e 2) all servers
Firewall — Dropped Connections
~340
Normal
~280 production → management port attempts (denied by design)
Normal
~45 unknown IPs — broadcast/multicast
Normal
~15 guest WiFi → internal (SSID isolation working)
Vulnerability Management
OS Patch Status — Rocky Linux 8.10
SKY
✓
RAIN
✓
SUN
✓
WIND
✓
dnf-automatic security updates · CIS 7.1
Service CVEs
| Service | Ver | CVEs | Status |
|---|---|---|---|
| BIND | 9.11 | 0 | ✓ |
| ISC DHCP | 4.3.6 | 0 | ✓ |
| Elasticsearch | 8.x | 0 | ✓ |
| Prometheus | 2.x | 0 | ✓ |
| OpenSSH | 8.0p1 | 0 | ✓ |
Attack Surface
Production — 192.168.120.0/23
| Server | Ports | Restricted To | Auth |
|---|---|---|---|
| SKY | 53,67/68,647,953,9100,9119 | Clients + RAIN + SUN | DNSSEC |
| RAIN | 53,67/68,647,9100,9119 | Clients + SKY + SUN | DNSSEC |
| SUN | 9100 | Localhost + mgmt | — |
| WIND | 5140 | SKY + RAIN only | — |
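The table above works as a per-server allow-list for the production network. A hedged Python sketch of how a scan result could be compared against it; `unexpected_ports` is a hypothetical helper, and the observed port list is assumed to come from whatever scanner is in use:

```python
# Allowed production-network ports per server, copied from the table above
# (the "67/68" entry is expanded into two ports).
ALLOWED_PORTS = {
    "SKY": {53, 67, 68, 647, 953, 9100, 9119},
    "RAIN": {53, 67, 68, 647, 9100, 9119},
    "SUN": {9100},
    "WIND": {5140},
}


def unexpected_ports(server, observed):
    """Return observed ports that the table does not list, sorted ascending."""
    return sorted(set(observed) - ALLOWED_PORTS[server])


# Example: a scan of SUN that also finds 8080 open.
print(unexpected_ports("SUN", [9100, 8080]))
```

An empty result means the observed surface matches the documented one; anything else is a candidate for a P4 config-drift ticket.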
Incidents — Last 90 Days
| Date | Severity | Description | Detection | Status |
|---|---|---|---|---|
| ✓ No security incidents in the last 90 days | | | | |
Threat Intelligence
BIND / DNS
Patched
CVE-2023-50387 (KeyTrap DNSSEC) — patched via Rocky 8.10
N/A
CVE-2024-1737 — affects BIND 9.18+, not 9.11
ELK / Kibana
Mitigated
CVE-2024-37288 (Kibana RCE) — localhost + mgmt network only
OS / SSH
Mitigated
CVE-2024-6387 (regreSSHion) — key-only auth + Fail2ban + mgmt only
Patched
CVE-2024-1086 (kernel LPE) — Rocky 8.10 kernel patched
Defense-in-Depth
L1 — Perimeter
NAT/Firewall
✓
VPN
✓
SSID Isolation
✓
L2 — Network
firewalld drop
✓
Zone Sep
✓
IPv6 off
✓
L3 — Host
SELinux
✓
SSH hardened
✓
CIS L2
✓
L4 — Detection
AIDE
✓
auditd
✓
ELK
✓
L5 — Data
DNSSEC
✓
TLS
✓
Backups
90%
CIS Controls v8
100%
193/193 on-prem + 8 GCP
PCI-DSS v4.0
94%
47/50 requirements met
NIST CSF
96%
Identify · Protect · Detect · Respond · Recover
NIST SP 800-53
92%
Key controls mapped
Last Audit
2026-03-10
All servers verified
Gaps
3
DRP · IRP · Backup pipeline
CIS Controls v8 — Implementation Status
On-Premises Infrastructure — SKY / RAIN / SUN / WIND
| CIS # | Control | SKY | RAIN | SUN | WIND | Implementation |
|---|---|---|---|---|---|---|
| 1.1 | Asset Inventory | ✓ | ✓ | ✓ | ✓ | DHCP lease tracking, DNS records, Kibana dhcp-leases-* index |
| 1.2 | Software Inventory | ✓ | ✓ | ✓ | ✓ | Minimal RPM install, dnf history tracked |
| 2.2 | Authorized Software | ✓ | ✓ | ✓ | ✓ | Server base only, no GUI, no unnecessary packages |
| 3.11 | Data Encryption | ✓ | ✓ | ✓ | ✓ | DNSSEC, Webmin TLS, SSH key auth, VPN AES-256 |
| 3.14 | Sensitive Data | ✓ | ✓ | ✓ | ✓ | DNSSEC keys chmod 600, ES on dedicated partition |
| 4.1 | Secure Configuration | ✓ | ✓ | ✓ | ✓ | CIS Benchmark Rocky Linux 8 Level 2 applied |
| 4.4 | Firewall | ✓ | ✓ | ✓ | ✓ | firewalld default drop zone, explicit rich-rules only |
| 5.1 | Account Inventory | ✓ | ✓ | ✓ | ✓ | dnsadmin / monitadmin only, service accounts nologin |
| 5.2 | Privileged Access | ✓ | ✓ | ✓ | ✓ | sudo with logging, SSH no root, key-only |
| 5.4 | Password Policy | ✓ | ✓ | ✓ | ✓ | 14-char min, 90-day max, lockout after 5 |
| 6.1 | Access Control | ✓ | ✓ | ✓ | ✓ | SELinux enforcing, BIND chroot, MAC filtering |
| 7.1 | Vulnerability Mgmt | ✓ | ✓ | ✓ | ✓ | dnf-automatic security updates enabled |
| 8.2 | Audit Log Mgmt | ✓ | ✓ | ✓ | ✓ | auditd immutable (-e 2), DNS/DHCP/auth rules |
| 8.3 | Log Storage | ✓ | ✓ | ✓ | ✓ | Dedicated /var/log + /var/log/audit on sdb |
| 8.5 | Log Analysis | ✓ | ✓ | ✓ | ✓ | Kibana dashboards, Grafana panels |
| 8.9 | Centralized Logging | ✓ | ✓ | — | ✓ | rsyslog → WIND:5140 → Logstash → ES → Kibana |
| 10.1 | Malware Defenses | ✓ | ✓ | ✓ | ✓ | AIDE daily file integrity monitoring |
| 11.1 | Data Recovery | ✓ | ✓ | ✓ | ✓ | Daily cron backups to /backup + GCS (planned) |
| 12.1 | Network Security | ✓ | ✓ | ✓ | ✓ | Prod/mgmt separation, firewalld drop, IPv6 disabled |
| 12.4 | Network Segmentation | ✓ | ✓ | ✓ | ✓ | 120.0/23 prod, 124.0/24 mgmt, 172.16.0.0/24 GCP |
| 13.1 | Threat Detection | ✓ | ✓ | ✓ | ✓ | Fail2ban, AIDE alerts, Prometheus alerting |
GCP Cloud Infrastructure — gpus-infra
| CIS # | Control | Status | Implementation |
|---|---|---|---|
| 3.11 | Data Encryption | ✓ | VPN AES-256, GCS encryption at rest, Cloud Run HTTPS |
| 4.4 | Firewall | ✓ | VPC deny-all default, explicit VPN + internal rules |
| 8.3 | Log Storage | ✓ | VPC Flow Logs, Cloud Audit Logs automatic |
| 11.1 | Data Recovery | ✓ | GCS backups Nearline, 90-day retention, versioning |
| 12.4 | Network Segmentation | ✓ | Separate VPC 172.16.0.0/24, VPN-only from on-prem |
PCI-DSS v4.0 — Compliance Matrix
Payment Card Industry Data Security Standard
| Req | Sub | Description | Status | Implementation |
|---|---|---|---|---|
| 1 | 1.1.1 | Network security controls defined | ✓ | firewalld default-drop zone all servers |
| 1 | 1.2.1 | Inbound/outbound restricted | ✓ | Rich rules per service/source |
| 1 | 1.2.2 | All other traffic denied | ✓ | Zone=drop, no implicit permits |
| 1 | 1.3.1 | Inbound to CDE restricted | ✓ | DNS/DHCP/SSH from internal only |
| 1 | 1.4.1 | NSC between zones | ✓ | DHCP failover 647 SKY↔RAIN only |
| 1 | 1.5.1 | Remote access secured | ✓ | SSH key auth, no root, AllowUsers, VPN |
| 2 | 2.2.1 | Securely configured | ✓ | CIS Benchmark Level 2 applied |
| 2 | 2.2.2 | Vendor defaults changed | ✓ | Root locked, all defaults changed |
| 2 | 2.2.3 | Unnecessary services removed | ✓ | telnet/ftp/rsh/avahi/cups masked |
| 2 | 2.2.4 | Insecure protocols disabled | ✓ | SSHv2 only, no FTP/Telnet/rsh |
| 4 | 4.1 | Strong cryptography for transmission | ✓ | DNSSEC, TLS Webmin, VPN AES-256 |
| 5 | 5.2 | Anti-malware mechanisms | ✓ | AIDE file integrity daily scan |
| 7 | 7.1 | Access limited to need | ✓ | Dedicated admin accounts, nologin service accounts |
| 8 | 8.3.6 | Password complexity | ✓ | 14-char min, 90-day max, lockout after 5 |
| 10 | 10.2 | Audit trails | ✓ | auditd immutable mode, DNS/DHCP/auth rules |
| 10 | 10.3 | Audit trail protection | ✓ | Centralized to WIND, 90-day retention |
| 10 | 10.7 | Log retention | ✓ | Dedicated log partitions on sdb |
| 11 | 11.5.1 | File integrity monitoring | ✓ | AIDE daily scans on all 4 servers |
| 12 | 12.1.1 | Security policy established | ◐ | Policy drafted in status site — formal sign-off pending |
| 12 | 12.5.1 | Asset inventory | ✓ | IAR: 129 hosts tracked in wdchostregistry.csv |
| 12 | 12.10.1 | IR plan | ◐ | IRP drafted in status site — formal sign-off pending |
NIST Cybersecurity Framework — Function Coverage
Identify
100%
Asset mgmt · Risk assessment · Governance
Protect
100%
Access ctrl · Encryption · Hardening
Detect
100%
AIDE · auditd · ELK · Prometheus
Respond
85%
IRP drafted · formal sign-off pending
Recover
85%
DRP drafted · GCS pipeline pending
NIST SP 800-53 — Key Control Families
| Family | Control | Status | Implementation |
|---|---|---|---|
| AC-3 | Access Enforcement | ✓ | SELinux enforcing, BIND chroot, MAC DHCP filtering |
| AU-2 | Auditable Events | ✓ | auditd custom rules: DNS/DHCP changes, auth, privilege escalation |
| AU-6 | Audit Review | ✓ | Kibana dashboards, Grafana panels, centralized on WIND |
| CM-2 | Baseline Configuration | ✓ | CIS Benchmark L2, Terraform IaC for GCP |
| CM-7 | Least Functionality | ✓ | Minimal install, unnecessary services masked, IPv6 disabled |
| CP-9 | System Backup | ✓ | Daily cron backups, GCS offsite (pipeline pending) |
| IA-2 | Identification & Auth | ✓ | SSH key-only, no passwords, AllowUsers directive |
| IA-5 | Authenticator Mgmt | ✓ | 14-char min, 90-day rotation, faillock after 5 |
| SC-7 | Boundary Protection | ✓ | Prod/mgmt/GCP zone separation, VPN encrypted tunnel |
| SC-8 | Transmission Confidentiality | ✓ | IKEv2 AES-256, DNSSEC, TLS on Webmin |
| SC-20 | Secure Name Resolution | ✓ | DNSSEC zone signing + validation |
| SC-28 | Protection of Info at Rest | ✓ | Dedicated partitions, GCS encryption, key chmod 600 |
| SI-4 | System Monitoring | ✓ | Prometheus q15s, Fail2ban, AIDE, Kibana dashboards |
| SI-7 | Software Integrity | ✓ | AIDE daily file integrity scan on all 4 servers |
Compliance Gaps & Remediation
PCI 12.1.1 — Security Policy: Policy drafted in Governance tab. Requires formal review and sign-off by management.
Target: Q2 2026
PCI 12.10.1 — IR Plan: IRP drafted in Governance tab. Requires formal review, tabletop exercise, and sign-off.
Target: Q2 2026
NIST CP-9 — Offsite Backup: GCS bucket exists, VPN tunnel established. Automated backup pipeline from on-prem → GCS not yet configured.
Target: Mar 2026
SSO Integration: 42 applications identified for Okta SSO. Not yet started.
Target: Q3 2026
Data Classification: 7 payment + 42 supporter-data apps identified. Classification program not started.
Target: Q3 2026
Estimated Monthly
$62
Mar 2026 forecast
Month-to-Date
$21
10 days into billing cycle
Cost Change
NEW
First month — no prior baseline
Monthly Forecast
Cost Breakdown by Service
Cloud VPN Tunnel
$36.00
Static IP (VPN)
$7.20
Cloud Run (MkDocs)
$2.50
Cloud Run (Status)
$2.50
Cloud Storage
$8.00
Artifact Registry
$2.00
Networking (Egress)
$3.50
Other (Logging, DNS)
$0.30
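The per-service figures above add up to the $62 monthly estimate. A small Python sketch that checks the sum with `decimal.Decimal` to avoid floating-point rounding; the dictionary is simply the breakdown transcribed, and the variable names are illustrative:

```python
from decimal import Decimal

# Monthly cost per service in USD, copied from the breakdown above.
COSTS = {
    "Cloud VPN Tunnel": Decimal("36.00"),
    "Static IP (VPN)": Decimal("7.20"),
    "Cloud Run (MkDocs)": Decimal("2.50"),
    "Cloud Run (Status)": Decimal("2.50"),
    "Cloud Storage": Decimal("8.00"),
    "Artifact Registry": Decimal("2.00"),
    "Networking (Egress)": Decimal("3.50"),
    "Other (Logging, DNS)": Decimal("0.30"),
}

total = sum(COSTS.values(), Decimal("0"))
largest = max(COSTS, key=COSTS.get)
share = (COSTS[largest] / total * 100).quantize(Decimal("0.1"))

print(f"Total: ${total}")
print(f"Largest item: {largest} ({share}% of total)")
```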
Cost Details
| Resource | SKU | Unit | Qty | Rate | Monthly |
|---|---|---|---|---|---|
| Cloud VPN Tunnel | gpus-vpn-tunnel-wdc | hr | 730 | $0.049 | $36.00 |
| Static IP | gpus-vpn-ip | hr | 730 | $0.010 | $7.20 |
| Cloud Run — MkDocs | gpus-mkdocs-portal | req | ~500 | $0.40/M | $2.50 |
| Cloud Run — Status | gpus-status-site | req | ~500 | $0.40/M | $2.50 |
| GCS — Backups | gpus-infra-backups-wdc | GB | ~50 | $0.01 | $5.00 |
| GCS — TF State | gpus-infra-tf-state | GB | 0.01 | $0.02 | $0.01 |
| Artifact Registry | gpus-images | GB | ~2 | $0.10 | $2.00 |
| VPC Flow Logs | gpus-vpc | GB | ~1 | $0.50 | $0.50 |
| Total Estimated | | | | | $62.00 |
Cost Optimization Notes
✓ Cloud Run scales to zero — no charge when idle (most of the time)
✓ Nearline storage — 50% cheaper than Standard for backup data
✓ Single VPN tunnel — upgrade to HA VPN ($72/mo) if uptime SLA needed
⚠ VPN is the biggest cost — $36/mo fixed regardless of traffic
ℹ Billing alert — set budget alert at $75/mo in GCP Console → Billing → Budgets
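To see how the $75 budget alert relates to the figures on this page, the following Python sketch compares two numbers against it: a simple linear run-rate projection from the month-to-date spend, and the monthly estimate with the single tunnel swapped for HA VPN. The linear projection is an assumption made for illustration and is not how GCP computes its own forecast.

```python
import calendar

BUDGET_ALERT = 75.00  # USD per month, from the billing alert note above


def run_rate_projection(month_to_date, days_elapsed, year, month):
    """Project month-end spend by extending the average daily spend so far."""
    days_in_month = calendar.monthrange(year, month)[1]
    return round(month_to_date / days_elapsed * days_in_month, 2)


# $21 spent after 10 days of March 2026.
projected = run_rate_projection(21.00, 10, 2026, 3)

# Replace the $36 single tunnel with the $72 HA VPN in the $62 estimate.
ha_vpn_forecast = 62.00 - 36.00 + 72.00

for label, amount in [("Run-rate", projected), ("With HA VPN", ha_vpn_forecast)]:
    status = "over budget alert" if amount > BUDGET_ALERT else "within budget alert"
    print(f"{label}: ${amount:.2f} ({status})")
```

Under these assumptions the current setup stays below the alert, while an HA VPN upgrade would require raising the threshold.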
Security Policy
Incident Response Plan
Disaster Recovery Plan
Information Security Policy
GPUS-POL-001 · v1.0 · Effective: 2026-03-10 · Owner: IT Department · Classification: INTERNAL
1. Access Control
All access to infrastructure systems follows the principle of least privilege. Administrative access is restricted to named accounts over the management network (192.168.124.0/24) using SSH key-based authentication only. Root login is disabled on all servers. Service accounts are set to nologin.
| Server | Admin Account | Auth Method | Network |
|---|---|---|---|
| SKY / RAIN | dnsadmin | SSH key-only | 192.168.124.0/24 |
| SUN / WIND | monitadmin | SSH key-only | 192.168.124.0/24 |
| GCP | [email protected] | OAuth + IAM | IAM roles |
2. Change Management
All configuration changes require: (1) backup of affected files, (2) validation before deployment, (3) AIDE baseline update after change, (4) entry in /var/log/asset-inventory.log, (5) DNSSEC re-signing if zone files changed. GCP changes must go through Terraform — no manual console changes.
3. Availability & Redundancy
DNS and DHCP services run in primary/secondary failover (SKY/RAIN). DHCP failover is automatic. DNS zones replicate from SKY to RAIN via AXFR zone transfers. Monitoring (SUN) and logging (WIND) are single-instance with daily backups. GCP services use Cloud Run with auto-scaling.
4. Logging & Monitoring
All servers forward logs to WIND via rsyslog (TCP:5140). Elasticsearch retains logs for 90 days with daily index rotation. Prometheus scrapes metrics every 15 seconds. AIDE runs daily integrity scans. auditd runs in immutable mode. VPC Flow Logs enabled in GCP.
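The 90-day retention with daily index rotation can be illustrated with a short Python sketch that picks out indices older than the retention window. It assumes index names end in a `-YYYY.MM.DD` date suffix (a common Logstash convention, not confirmed by this page), and the function name is hypothetical:

```python
from datetime import date, datetime


def expired_indices(names, today, retention_days=90):
    """Return index names whose date suffix is older than the retention window."""
    expired = []
    for name in names:
        _, _, suffix = name.rpartition("-")
        try:
            day = datetime.strptime(suffix, "%Y.%m.%d").date()
        except ValueError:
            continue  # no parseable date suffix, so leave the index alone
        if (today - day).days > retention_days:
            expired.append(name)
    return expired


indices = [
    "auth-logs-2025.12.11",  # 91 days old on 2026-03-12
    "auth-logs-2025.12.12",  # exactly 90 days old, still retained
    "dhcp-leases-2026.03.10",
    ".kibana",
]
print(expired_indices(indices, date(2026, 3, 12)))
```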
5. Password Policy
| Parameter | Value | CIS Control |
|---|---|---|
| Minimum length | 14 characters | CIS 5.4 |
| Maximum age | 90 days | CIS 5.4 |
| Lockout threshold | 5 failed attempts | CIS 5.4 |
| Lockout duration | 15 minutes | CIS 5.4 |
| Password history | 5 remembered | CIS 5.4 |
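The policy values in the table can be expressed as a few checks. The Python sketch below is a simplified model for illustration, assuming the lockout begins at the fifth recorded failure; it is not how `pam_faillock` or any real PAM module is implemented, and all names are hypothetical:

```python
from datetime import datetime, timedelta

MIN_LENGTH = 14                           # characters
MAX_AGE_DAYS = 90
LOCKOUT_THRESHOLD = 5                     # failed attempts
LOCKOUT_DURATION = timedelta(minutes=15)


def password_ok(length, age_days):
    """True if a password meets the minimum length and maximum age."""
    return length >= MIN_LENGTH and age_days <= MAX_AGE_DAYS


def is_locked(failures, now):
    """True if the account is locked at `now`.

    `failures` holds the timestamps of failed logins since the last success.
    The lock starts at the fifth failure and lasts for LOCKOUT_DURATION.
    """
    if len(failures) < LOCKOUT_THRESHOLD:
        return False
    locked_at = sorted(failures)[LOCKOUT_THRESHOLD - 1]
    return now < locked_at + LOCKOUT_DURATION


t0 = datetime(2026, 3, 10, 9, 0)
failures = [t0 + timedelta(minutes=i) for i in range(5)]
print(is_locked(failures, t0 + timedelta(minutes=10)))
```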
Incident Response Plan
GPUS-IRP-001 · v1.0 · Effective: 2026-03-10 · Owner: IT Department · Classification: INTERNAL
1. Incident Classification
| Severity | Description | Response Time | Escalation | Examples |
|---|---|---|---|---|
| P1 Critical | Service outage or active breach | 15 min | IT Manager → CISO | Both DNS down, ransomware, data exfil |
| P2 High | Degraded service or confirmed intrusion attempt | 1 hr | IT Manager | Single DNS down, AIDE alert, Fail2ban flood |
| P3 Medium | Anomaly requiring investigation | 4 hr | IT Team | Unusual audit events, DNS query spike |
| P4 Low | Minor issue, no impact | 24 hr | IT Team | Config drift, routine Fail2ban bans |
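The classification table maps each severity to a response time and an escalation path. A minimal Python sketch of that lookup, useful for stamping a response deadline on a new ticket; the function name and the ticket workflow are assumptions:

```python
from datetime import datetime, timedelta

# Response times and escalation paths copied from the table above.
RESPONSE_TIME = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
    "P3": timedelta(hours=4),
    "P4": timedelta(hours=24),
}
ESCALATION = {
    "P1": "IT Manager → CISO",
    "P2": "IT Manager",
    "P3": "IT Team",
    "P4": "IT Team",
}


def response_deadline(severity, detected_at):
    """Return the latest acceptable first-response time for an incident."""
    return detected_at + RESPONSE_TIME[severity]


detected = datetime(2026, 3, 10, 11, 0)
for sev in ("P1", "P2", "P3", "P4"):
    print(sev, response_deadline(sev, detected), ESCALATION[sev])
```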
2. Phase 1 — Detection
Detection sources: AIDE file integrity alerts, Fail2ban ban events, auditd rule triggers, Prometheus alert rules, Kibana dashboards, GCP Cloud Audit Logs.
## Check all detection sources
# AIDE
sudo aide --check
# Fail2ban
sudo fail2ban-client status sshd
# auditd — recent security events (one key per search)
sudo ausearch -ts recent -k dns-zone-change
sudo ausearch -ts recent -k dhcp-config
# Prometheus alerts
curl -s http://192.168.120.3:9090/api/v1/alerts | python3 -m json.tool
# Kibana — auth failures
# Open http://192.168.124.4:5601 → auth-logs-* index
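The Prometheus check above pretty-prints the whole response. As a rough sketch, the same JSON can be filtered down to the alerts that are actually firing. The payload shape follows the Prometheus HTTP API (`status`, `data.alerts[].labels`, `data.alerts[].state`); the sample alert names are invented for illustration.

```python
import json


def firing_alerts(payload):
    """Return the alertname of every alert in the 'firing' state."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError("Prometheus API did not return status=success")
    return [
        alert["labels"].get("alertname", "<unnamed>")
        for alert in body["data"]["alerts"]
        if alert.get("state") == "firing"
    ]


# Invented sample response; real output comes from /api/v1/alerts.
SAMPLE = """
{"status": "success", "data": {"alerts": [
  {"labels": {"alertname": "ExampleDown", "instance": "192.168.120.1:9100"},
   "state": "firing"},
  {"labels": {"alertname": "ExampleDiskFull"}, "state": "pending"}
]}}
"""
print(firing_alerts(SAMPLE))
```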
3. Phase 2 — Containment
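## Preserve evidence BEFORE containment or remediation (example: SKY)
INCIDENT_DIR="/var/log/incident/$(date +%F)"
sudo mkdir -p "$INCIDENT_DIR"
# Expand the glob as root, since /var/named may not be readable by the admin user
sudo sh -c "cp /var/named/wdc.us.gl3.db* '$INCIDENT_DIR'/"
# Redirections run as the calling user, so write root-owned files through tee
sudo ausearch -ts today | sudo tee "$INCIDENT_DIR/audit.txt" > /dev/null
sudo aide --check 2>&1 | sudo tee "$INCIDENT_DIR/aide.txt" > /dev/null
## Isolate compromised server
# Option A: Block all traffic except failover peer (RAIN)
# Remove each existing rich rule from the runtime configuration
sudo firewall-cmd --zone=drop --list-rich-rules | while IFS= read -r rule; do
  sudo firewall-cmd --zone=drop --remove-rich-rule="$rule"
done
sudo firewall-cmd --zone=drop --add-rich-rule='rule family="ipv4" source address="192.168.120.2" accept'
# Option B: Shut down (RAIN takes over DNS/DHCP automatically)
sudo shutdown -h now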
4. Phase 3 — Eradication
Identify root cause from logs. Remove malicious artifacts. Restore from known-good backup if files were modified. Re-apply CIS hardening if configuration was altered.
5. Phase 4 — Recovery
## Restore from backup
BACKUP_DATE="YYYY-MM-DD"
tar xzf /backup/dns-dhcp/dns-backup-${BACKUP_DATE}.tar.gz -C /tmp
named-checkzone wdc.us.gl3 /tmp/zones/wdc.us.gl3.db
sudo cp /tmp/zones/* /var/named/
sudo cp /tmp/dhcpd.conf /etc/dhcp/dhcpd.conf
## Re-sign DNSSEC
cd /var/named
sudo dnssec-signzone -A -3 $(head -c 500 /dev/urandom | sha1sum | cut -b 1-16) \
-N INCREMENT -o wdc.us.gl3 -t wdc.us.gl3.db
sudo rndc reload
sudo systemctl restart dhcpd
## Re-baseline AIDE
sudo aide --update && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz
6. Phase 5 — Post-Incident
The post-incident report is due within 72 hours and covers root cause analysis, timeline, affected systems, remediation actions, lessons learned, and process improvements. All evidence is preserved in /var/log/incident/.
Contacts
| Role | Contact | Escalation |
|---|---|---|
| IT Infrastructure Lead | Rajesh Chhetry | First responder for all incidents |
| IT Manager | — | P1/P2 escalation within 15min/1hr |
| CISO | — | P1 escalation, breach notification |
Disaster Recovery Plan
GPUS-DRP-001 · v1.0 · Effective: 2026-03-10 · Owner: IT Department · Classification: INTERNAL
1. Recovery Objectives
| System | RTO | RPO | Recovery Method |
|---|---|---|---|
| DNS (SKY/RAIN) | 5 min | 0 (real-time failover) | Automatic — RAIN takes over |
| DHCP (SKY/RAIN) | 30 sec | 0 (real-time failover) | Automatic — failover peer |
| Monitoring (SUN) | 1 hr | 15 sec (scrape interval) | ESXi snapshot restore |
| Logging (WIND) | 1 hr | 24 hr (daily backup) | ESXi snapshot + backup restore |
| Cloud VPN | 15 min | N/A | Terraform redeploy |
| Cloud Run | 5 min | N/A | Auto-healing by GCP |
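The RTO column can serve as a pass/fail threshold when reviewing an outage or a DR test. A small Python sketch under that assumption; the measured downtime values would come from monitoring, and `meets_rto` is a hypothetical helper:

```python
from datetime import timedelta

# Recovery time objectives copied from the table above.
RTO = {
    "DNS": timedelta(minutes=5),
    "DHCP": timedelta(seconds=30),
    "Monitoring": timedelta(hours=1),
    "Logging": timedelta(hours=1),
    "Cloud VPN": timedelta(minutes=15),
    "Cloud Run": timedelta(minutes=5),
}


def meets_rto(system, downtime):
    """True if the measured downtime is within the system's RTO."""
    return downtime <= RTO[system]


# Example: a 45-second DHCP interruption exceeds the 30-second objective.
print(meets_rto("DHCP", timedelta(seconds=45)))
```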
2. Disaster Scenarios
Scenario 1: Single server failure (SKY or RAIN)
Impact: Minimal — failover is automatic. RAIN serves DNS/DHCP if SKY is down and vice versa. Restore failed server from ESXi snapshot within 1 hour.
Scenario 2: Both DNS/DHCP servers down
## Emergency: Deploy from backup on any Rocky Linux 8 box
tar xzf /backup/dns-dhcp/dns-backup-LATEST.tar.gz -C /tmp
dnf install -y bind dhcp-server
cp /tmp/zones/* /var/named/
cp /tmp/dhcpd.conf /etc/dhcp/
cp /tmp/named.conf /etc/
systemctl start named dhcpd
Scenario 3: ESXi host failure
All 4 VMs lost. Rebuild from backups on replacement ESXi host. Total rebuild time: ~4 hours following the deployment guides (sky-rain + sun-wind docs).
Scenario 4: WDC site loss (fire, flood)
GCP services remain operational. Backups in GCS bucket. DNS can be redirected at Hover. Rebuild on-prem at DR site using GCS backups + Terraform + deployment guides.
Scenario 5: Cloud VPN tunnel down
## Check tunnel status
gcloud compute vpn-tunnels describe gpus-vpn-tunnel-wdc --region=us-central1
## If ESTABLISHED lost — check Meraki side first
# Meraki Dashboard → Security & SD-WAN → VPN Status
## Redeploy VPN via Terraform if needed
cd ~/terraform/gpus-infra/terraform
terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel
3. Backup Schedule
| Data | Frequency | Location | Retention |
|---|---|---|---|
| DNS zone files | Daily cron | /backup + GCS (planned) | 90 days |
| DHCP config + leases | Daily cron | /backup + GCS (planned) | 90 days |
| ES snapshots | Daily | /backup + GCS (planned) | 90 days |
| Prometheus TSDB | Daily | /backup | 90 days |
| ESXi VM snapshots | Weekly | Local datastore | 4 snapshots |
| Terraform state | Every apply | GCS (gpus-infra-tf-state) | 5 versions |
4. DR Testing Schedule
Quarterly DR tests: Q1 (DNS failover), Q2 (full server restore from backup), Q3 (site failover simulation), Q4 (full tabletop exercise). Results documented and reviewed by IT Manager.
Change Log
| Timestamp | Change | Tag |
|---|---|---|
| 2026-03-12 22:00 | Custom domain mapped: status.greenpeace.org → Cloud Run · SSL auto-provisioned | GCP |
| 2026-03-12 21:00 | GCS backup pipeline LIVE — all 4 servers → gs://gpus-infra-backups-wdc · daily 02:00 | BACKUP |
| 2026-03-12 20:00 | Terraform state migrated to GCS (gpus-infra-tf-state) · 20 resources | GCP |
| 2026-03-12 15:55 | Status backend v10 live — all 4 servers returning real-time data via SSH over VPN | GCP |
| 2026-03-12 15:30 | SUN firewall: TCP 9090 (Prometheus) opened for 10.8.0.0/28 · WIND: TCP 9200 (ES) opened | CONFIG |
| 2026-03-12 14:00 | VPN traffic verified — all 4 servers reachable from GCP · Meraki routes + firewall rules fixed | GCP |
| 2026-03-10 13:51 | chronyd fix: denyall → deny all on SUN + WIND · AIDE re-baselined | CONFIG |
| 2026-03-10 13:38 | SUN + WIND rebooted — auditd immutable active · CIS 48/48 + 51/51 | CIS |
| 2026-03-10 11:15 | Cloud VPN ESTABLISHED — 130.211.194.72 ↔ 38.140.146.68 | GCP |
| 2026-03-10 11:00 | GCP infra deployed — VPC, VPN, Cloud Run ×2, GCS ×2, Artifact Registry (19 resources) | GCP |
| 2026-03-10 10:55 | RAIN DHCP updated — 112 reservations · failover normal · AIDE re-baselined | DHCP |
| 2026-03-10 10:49 | SKY DNS/DHCP bulk update — 112 workstations · serial 2026031002 · DNSSEC signed | DNS, DHCP |
| 2026-03-10 10:30 | GCP project gpus-infra created · billing linked · APIs enabled | GCP |
AIDE Baselines
| Server | Baseline | Reason | Status |
|---|---|---|---|
| SKY | 2026-03-10 10:49 | DNS/DHCP bulk update | ✓ |
| RAIN | 2026-03-10 10:55 | DHCP update | ✓ |
| SUN | 2026-03-10 13:51 | chronyd fix + reboot | ✓ |
| WIND | 2026-03-10 13:52 | chronyd fix + reboot | ✓ |
DNSSEC History
| Date | Serial | Sigs | KSK | ZSK |
|---|---|---|---|---|
| 2026-03-10 | 2026031002 | 280 | +008+37075 | +008+06660 |
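The serial 2026031002 appears to follow the common YYYYMMDDNN convention (date stamp plus a two-digit revision). Assuming that convention, and assuming revisions start at 01 on a new day, the next serial could be derived as in this Python sketch; `next_serial` is a hypothetical helper, not a tool used on these servers:

```python
from datetime import date


def next_serial(current, today):
    """Return the next YYYYMMDDNN zone serial after `current`.

    A new day starts at revision 01 (assumed convention); otherwise the
    revision is incremented, up to a maximum of 99 per date stamp.
    """
    stamp = int(today.strftime("%Y%m%d"))
    if current // 100 < stamp:
        return stamp * 100 + 1
    if current % 100 == 99:
        raise ValueError("no revisions left for this date stamp")
    return current + 1


print(next_serial(2026031002, date(2026, 3, 10)))
print(next_serial(2026031002, date(2026, 3, 12)))
```

Whatever scheme is used, the serial must only ever increase so that the secondary picks up the new zone.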
Terraform History
| Date | Action | Resources | Project |
|---|---|---|---|
| 2026-03-10 | Initial deploy | 19 created | gpus-infra |
Document Versions
| Document | Version | Updated |
|---|---|---|
| sky-rain-dns-dhcp-infrastructure.md | v2.6 | 2026-03-12 |
| sun-wind-monitoring-logging.md | v1.3 | 2026-03-12 |
| wdc-infrastructure-architecture-overview.md | v1.4 | 2026-03-12 |
| wdchostregistry.csv (IAR) | v2.2 | 2026-03-12 |
| gpus-it-architecture.html | v2.4 | 2026-03-12 |
| gcp-cloud-infrastructure.md | v2.0 | 2026-03-12 |
GCS Backups
—/4
Servers with last backup <25h
Last Run
—
Most recent GCS backup
GCS Bucket
gpus-infra-backups-wdc
us-central1 · 90-day retention
Schedule
02:00
Daily cron — all servers
TF State
GCS
gpus-infra-tf-state · migrated 2026-03-12
Cloud Backup — GCS (gs://gpus-infra-backups-wdc)
On-Prem → Google Cloud Storage · Daily 02:00 · /usr/local/bin/gpus-backup.sh
| Server | GCS Path | Contents | Last Backup | Status | Size |
|---|---|---|---|---|---|
| SKY | …/sky/ | named + dhcp | 2026-03-12 | ✓ Success | ~900KB |
| RAIN | …/rain/ | named + dhcp | 2026-03-12 | ✓ Success | ~2.7MB |
| SUN | …/sun/ | prometheus + grafana | 2026-03-12 | ✓ Success | ~22MB |
| WIND | …/wind/ | elasticsearch + logstash + kibana | 2026-03-12 | ✓ Success | ~2.2MB |
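The "Servers with last backup <25h" tile can be thought of as a freshness count over the table above. A hedged Python sketch of that calculation; the table gives only dates, so the 02:00 timestamps are assumed from the cron schedule, and `fresh_servers` is a hypothetical name:

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=25)  # threshold from the GCS Backups tile

# Last successful backup per server; 02:00 is assumed from the daily schedule.
LAST_BACKUP = {
    "SKY": datetime(2026, 3, 12, 2, 0),
    "RAIN": datetime(2026, 3, 12, 2, 0),
    "SUN": datetime(2026, 3, 12, 2, 0),
    "WIND": datetime(2026, 3, 12, 2, 0),
}


def fresh_servers(last_backup, now, max_age=MAX_AGE):
    """Return the servers whose last backup is younger than max_age."""
    return sorted(name for name, ts in last_backup.items() if now - ts < max_age)


now = datetime(2026, 3, 12, 22, 0)
print(f"{len(fresh_servers(LAST_BACKUP, now))}/{len(LAST_BACKUP)}")
```

The 25-hour window leaves one hour of slack over the daily schedule, so a run that finishes slightly late does not flip the tile.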
Local Backup — NAS (/backup on each server)
On-Prem Local Storage · Daily cron · 30-day retention
| Server | Script | Path | Contents | Retention | Status |
|---|---|---|---|---|---|
| SKY | /etc/cron.daily/dns-dhcp-backup | /backup/dns-dhcp/ | zones, DNSSEC keys, dhcpd.conf, leases, AIDE db | 30 days | ✓ Active |
| RAIN | /etc/cron.daily/dns-dhcp-backup | /backup/dns-dhcp/ | zones, DNSSEC keys, dhcpd.conf, leases, AIDE db | 30 days | ✓ Active |
| SUN | /etc/cron.daily/mon-backup | /backup/monitoring/ | prometheus.yml, grafana.ini, dashboards, AIDE db | 30 days | ✓ Active |
| WIND | /etc/cron.daily/log-backup | /backup/logging/ | elasticsearch.yml, logstash pipelines, kibana.yml | 30 days | ✓ Active |
ESXi Snapshots
VMware ESXi · Automated via FIRE hypervisor cron · 7-day retention
| VM | Snapshot Time | Retention | Hypervisor | Status |
|---|---|---|---|---|
| SKY | 18:00 daily | 7 snapshots | FIRE | ✓ Active |
| RAIN | 19:00 daily | 7 snapshots | FIRE | ✓ Active |
| SUN | 20:00 daily | 7 snapshots | FIRE | ✓ Active |
| WIND | 21:00 daily | 7 snapshots | FIRE | ✓ Active |
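A retention of 7 snapshots per VM means the oldest snapshot is dropped once an eighth exists. The Python sketch below illustrates that pruning rule only; it does not call any ESXi API, and the function name is hypothetical:

```python
from datetime import datetime, timedelta


def snapshots_to_delete(snapshots, keep=7):
    """Return the snapshot timestamps that fall outside the newest `keep`."""
    ordered = sorted(snapshots, reverse=True)  # newest first
    return ordered[keep:]


# Nine daily 18:00 snapshots for one VM, 2026-03-04 through 2026-03-12.
start = datetime(2026, 3, 4, 18, 0)
snaps = [start + timedelta(days=i) for i in range(9)]

for ts in snapshots_to_delete(snaps):
    print("delete", ts.isoformat())
```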
Backup History Log
Manual log — update after each backup cycle
| Date | Server | Type | Result | Size | Notes |
|---|---|---|---|---|---|
| 2026-03-12 | SKY | GCS | ✓ Success | ~900KB | Pipeline installed — first run |
| 2026-03-12 | RAIN | GCS | ✓ Success | ~2.7MB | Pipeline installed — first run |
| 2026-03-12 | SUN | GCS | ✓ Success | ~22MB | Pipeline installed — first run |
| 2026-03-12 | WIND | GCS | ✓ Success | ~2.2MB | Pipeline installed — first run |
Terraform State
| Item | Value | Status |
|---|---|---|
| Backend | gcs | ✓ Migrated 2026-03-12 |
| Bucket | gpus-infra-tf-state | ✓ Active |
| Prefix | terraform/state | ✓ 20 resources |
| Versioning | Enabled — 5 versions | ✓ |