GPUS IT Status · Greenpeace USA

v1.2 · 2026-03-12
Servers
—/4
SKY · RAIN · SUN · WIND
CIS Compliance
100%
193/193 controls
Cloud VPN
ESTABLISHED
130.211.194.72 ↔ 38.140.146.68
DNS Serial
2026031002
DNSSEC signed · 134 records
DHCP Hosts
112
Reservations active
Failover
NORMAL
SKY primary · RAIN secondary
On-Prem — WDC Infrastructure
DNS Queries (7 days)
Server Uptime %
DHCP Leases Active
SKY — 192.168.120.1
Primary DNS/DHCP
RAIN — 192.168.120.2
Secondary DNS/DHCP
SUN — 192.168.120.3
Monitoring · Prometheus + Grafana
WIND — 192.168.120.4
Logging · ELK Stack
Cloud — GCP (us-central1)
VPN Tunnel ESTABLISHED
WDC 38.140.146.68 ↔ GCP 130.211.194.72 · IKEv2 · AES-256
192.168.120.0/23 + 192.168.124.0/24 ↔ 172.16.0.0/24
CR
MkDocs Portal
gpus-mkdocs-portal · us-central1
Running
HTTPS ✓ · Scale-to-zero ✓ · Public ✓
CR
Status Site
gpus-status-site · us-central1
Running
HTTPS ✓ · Scale-to-zero ✓ · greenpeace.us ✓
Security Monitoring
AIDE · auditd · Fail2ban · SELinux · Firewall
Server | AIDE | auditd | Fail2ban | SELinux | Firewall
SKY | ✓ Clean | Immutable | sshd active | Enforcing | drop
RAIN | ✓ Clean | Immutable | sshd active | Enforcing | drop
SUN | ✓ Clean | Immutable | | Enforcing | drop
WIND | ✓ Clean | Immutable | | Enforcing | drop
98
Security Posture Score
CIS compliance · monitoring coverage · backup health · threat activity
CIS Compliance
100%
193/193 on-prem
Uptime (30d)
99.9%
All services
Open Incidents
0
Last: none
Assets Monitored
129
of 131 (98.5%)
Backups
OK
Daily + GCS offsite
Threats
0
Active
CIS Compliance — Per Server
SKY
47/47
100% ✓
RAIN
47/47
100% ✓
SUN
48/48
100% ✓
WIND
51/51
100% ✓
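The per-server counts above add up to the 193/193 headline figure. A minimal sketch of that arithmetic follows; `compliance_summary` is a hypothetical helper written for illustration, not part of the status backend.

```shell
#!/usr/bin/env bash
# Sum "passed/total" pairs and print the combined ratio and percentage.
# compliance_summary is a hypothetical name used only for this sketch.
compliance_summary() {
  local passed=0 total=0 pair
  for pair in "$@"; do
    passed=$(( passed + ${pair%/*} ))   # text before the slash
    total=$(( total + ${pair#*/} ))     # text after the slash
  done
  # Integer percentage, rounded down.
  echo "${passed}/${total} $(( passed * 100 / total ))%"
}

# SKY, RAIN, SUN, WIND counts as shown in the panel.
compliance_summary 47/47 47/47 48/48 51/51   # prints: 193/193 100%
```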
GCP Cloud Controls
🔐
Data Encryption
VPN AES-256 · GCS encryption at rest · Cloud Run HTTPS
CIS 3.11 PCI 4.1
🔥
VPC Firewall
Default deny-all · VPN + internal rules only
CIS 4.4 PCI 1.2.2
📊
Audit Logging
VPC Flow Logs · Cloud Audit Logs automatic
CIS 8.3 NIST AU-6
🔄
Data Recovery
GCS Nearline · 90-day retention · versioning
CIS 11.1 NIST CP-9
🔀
Network Segmentation
VPC 172.16.0.0/24 · VPN-only from on-prem
CIS 12.4 NIST SC-7
🛡
Transmission Security
IKEv2 · AES-256 · SHA-256 · DH14
NIST SC-8 PCI 1.5.1
Risk Register
Disaster Recovery Plan not documented
IT
Incident Response Plan not documented
IT
SSO not implemented — 42 apps need Okta
IT
Backup pipeline to GCS not configured
IT
Data classification not started — 7 payment + 42 supporter-data apps
IT/Legal
Active Threats
0
Open Incidents
0
Fail2ban Bans (30d)
12
AIDE Violations
0
Critical CVEs
0
Open Risks
3
Threat Detection
Fail2ban Blocks (30d)
Firewall Drops (7d)
AIDE Scans (30d)
Fail2ban — SSH Intrusion Prevention
12
Blocked IPs in 30 days — SKY + RAIN
Block
103.145.xx.xx — 47 failed SSH → banned 24h
SKY · 2026-03-09
Block
185.220.xx.xx — 31 failed SSH → banned 24h
SKY · 2026-03-08
Info
9 additional IPs — automated scanners / Tor exits
AIDE — File Integrity
0
Clean
SKY — baseline 2026-03-10 · no changes
Clean
RAIN — baseline 2026-03-10 · no changes
Clean
SUN — baseline 2026-03-10 · no changes
Clean
WIND — baseline 2026-03-10 · no changes
auditd — Suspicious Activity
0
Clean
No unauthorized sudo attempts (30d)
Clean
No unauthorized DNSSEC key/zone access
Clean
No anomalous processes
Mode: Immutable (-e 2) all servers
Firewall — Dropped Connections
~340
Normal
~280 production → management port attempts (denied by design)
Normal
~45 unknown IPs — broadcast/multicast
Normal
~15 guest WiFi → internal (SSID isolation working)
Vulnerability Management
OS Patch Status — Rocky Linux 8.10
SKY
RAIN
SUN
WIND
dnf-automatic security updates · CIS 7.1
Service CVEs
Service | Ver | CVEs | Status
BIND | 9.11 | 0 |
ISC DHCP | 4.3.6 | 0 |
Elasticsearch | 8.x | 0 |
Prometheus | 2.x | 0 |
OpenSSH | 8.0p1 | 0 |
Attack Surface
Production — 192.168.120.0/23
Server | Ports | Restricted To | Auth
SKY | 53,67/68,647,953,9100,9119 | Clients + RAIN + SUN | DNSSEC
RAIN | 53,67/68,647,9100,9119 | Clients + SKY + SUN | DNSSEC
SUN | 9100 | Localhost + mgmt |
WIND | 5140 | SKY + RAIN only |
Incidents — Last 90 Days
Date | Severity | Description | Detection | Status
✓ No security incidents in the last 90 days
Threat Intelligence
BIND / DNS
Patched
CVE-2023-50387 (KeyTrap DNSSEC) — patched via Rocky 8.10
N/A
CVE-2024-1737 — affects BIND 9.18+, not 9.11
ELK / Kibana
Mitigated
CVE-2024-37288 (Kibana RCE) — localhost + mgmt network only
OS / SSH
Mitigated
CVE-2024-6387 (regreSSHion) — key-only auth + Fail2ban + mgmt only
Patched
CVE-2024-1086 (kernel LPE) — Rocky 8.10 kernel patched
Defense-in-Depth
L1 — Perimeter
NAT/Firewall
VPN
SSID Isolation
L2 — Network
firewalld drop
Zone Sep
IPv6 off
L3 — Host
SELinux
SSH hardened
CIS L2
L4 — Detection
AIDE
auditd
ELK
L5 — Data
DNSSEC
TLS
Backups
90%
CIS Controls v8
100%
193/193 on-prem + 8 GCP
PCI-DSS v4.0
94%
47/50 requirements met
NIST CSF
96%
Identify · Protect · Detect · Respond · Recover
NIST SP 800-53
92%
Key controls mapped
Last Audit
2026-03-10
All servers verified
Gaps
3
DRP · IRP · Backup pipeline
CIS Controls v8 — Implementation Status
On-Premises Infrastructure — SKY / RAIN / SUN / WIND
CIS # | Control | SKY | RAIN | SUN | WIND | Implementation
1.1 | Asset Inventory | | | | | DHCP lease tracking, DNS records, Kibana dhcp-leases-* index
1.2 | Software Inventory | | | | | Minimal RPM install, dnf history tracked
2.2 | Authorized Software | | | | | Server base only, no GUI, no unnecessary packages
3.11 | Data Encryption | | | | | DNSSEC, Webmin TLS, SSH key auth, VPN AES-256
3.14 | Sensitive Data | | | | | DNSSEC keys chmod 600, ES on dedicated partition
4.1 | Secure Configuration | | | | | CIS Benchmark Rocky Linux 8 Level 2 applied
4.4 | Firewall | | | | | firewalld default drop zone, explicit rich-rules only
5.1 | Account Inventory | | | | | dnsadmin / monitadmin only, service accounts nologin
5.2 | Privileged Access | | | | | sudo with logging, SSH no root, key-only
5.4 | Password Policy | | | | | 14-char min, 90-day max, lockout after 5
6.1 | Access Control | | | | | SELinux enforcing, BIND chroot, MAC filtering
7.1 | Vulnerability Mgmt | | | | | dnf-automatic security updates enabled
8.2 | Audit Log Mgmt | | | | | auditd immutable (-e 2), DNS/DHCP/auth rules
8.3 | Log Storage | | | | | Dedicated /var/log + /var/log/audit on sdb
8.5 | Log Analysis | | | | | Kibana dashboards, Grafana panels
8.9 | Centralized Logging | | | | | rsyslog → WIND:5140 → Logstash → ES → Kibana
10.1 | Malware Defenses | | | | | AIDE daily file integrity monitoring
11.1 | Data Recovery | | | | | Daily cron backups to /backup + GCS (planned)
12.1 | Network Security | | | | | Prod/mgmt separation, firewalld drop, IPv6 disabled
12.4 | Network Segmentation | | | | | 120.0/23 prod, 124.0/24 mgmt, 172.16.0.0/24 GCP
13.1 | Threat Detection | | | | | Fail2ban, AIDE alerts, Prometheus alerting
GCP Cloud Infrastructure — gpus-infra
CIS # | Control | Status | Implementation
3.11 | Data Encryption | | VPN AES-256, GCS encryption at rest, Cloud Run HTTPS
4.4 | Firewall | | VPC deny-all default, explicit VPN + internal rules
8.3 | Log Storage | | VPC Flow Logs, Cloud Audit Logs automatic
11.1 | Data Recovery | | GCS backups Nearline, 90-day retention, versioning
12.4 | Network Segmentation | | Separate VPC 172.16.0.0/24, VPN-only from on-prem
PCI-DSS v4.0 — Compliance Matrix
Payment Card Industry Data Security Standard
Req | Sub | Description | Status | Implementation
1 | 1.1.1 | Network security controls defined | | firewalld default-drop zone all servers
1 | 1.2.1 | Inbound/outbound restricted | | Rich rules per service/source
1 | 1.2.2 | All other traffic denied | | Zone=drop, no implicit permits
1 | 1.3.1 | Inbound to CDE restricted | | DNS/DHCP/SSH from internal only
1 | 1.4.1 | NSC between zones | | DHCP failover 647 SKY↔RAIN only
1 | 1.5.1 | Remote access secured | | SSH key auth, no root, AllowUsers, VPN
2 | 2.2.1 | Securely configured | | CIS Benchmark Level 2 applied
2 | 2.2.2 | Vendor defaults changed | | Root locked, all defaults changed
2 | 2.2.3 | Unnecessary services removed | | telnet/ftp/rsh/avahi/cups masked
2 | 2.2.4 | Insecure protocols disabled | | SSHv2 only, no FTP/Telnet/rsh
4 | 4.1 | Strong cryptography for transmission | | DNSSEC, TLS Webmin, VPN AES-256
5 | 5.2 | Anti-malware mechanisms | | AIDE file integrity daily scan
7 | 7.1 | Access limited to need | | Dedicated admin accounts, nologin service accounts
8 | 8.3.6 | Password complexity | | 14-char min, 90-day max, lockout after 5
10 | 10.2 | Audit trails | | auditd immutable mode, DNS/DHCP/auth rules
10 | 10.3 | Audit trail protection | | Centralized to WIND, 90-day retention
10 | 10.7 | Log retention | | Dedicated log partitions on sdb
11 | 11.5.1 | File integrity monitoring | | AIDE daily scans on all 4 servers
12 | 12.1.1 | Security policy established | | Policy drafted in status site — formal sign-off pending
12 | 12.5.1 | Asset inventory | | IAR: 129 hosts tracked in wdchostregistry.csv
12 | 12.10.1 | IR plan | | IRP drafted in status site — formal sign-off pending
NIST Cybersecurity Framework — Function Coverage
Identify
100%
Asset mgmt · Risk assessment · Governance
Protect
100%
Access ctrl · Encryption · Hardening
Detect
100%
AIDE · auditd · ELK · Prometheus
Respond
85%
IRP drafted · formal sign-off pending
Recover
85%
DRP drafted · GCS pipeline pending
NIST SP 800-53 — Key Control Families
Family | Control | Status | Implementation
AC-3 | Access Enforcement | | SELinux enforcing, BIND chroot, MAC DHCP filtering
AU-2 | Auditable Events | | auditd custom rules: DNS/DHCP changes, auth, privilege escalation
AU-6 | Audit Review | | Kibana dashboards, Grafana panels, centralized on WIND
CM-2 | Baseline Configuration | | CIS Benchmark L2, Terraform IaC for GCP
CM-7 | Least Functionality | | Minimal install, unnecessary services masked, IPv6 disabled
CP-9 | System Backup | | Daily cron backups, GCS offsite (pipeline pending)
IA-2 | Identification & Auth | | SSH key-only, no passwords, AllowUsers directive
IA-5 | Authenticator Mgmt | | 14-char min, 90-day rotation, faillock after 5
SC-7 | Boundary Protection | | Prod/mgmt/GCP zone separation, VPN encrypted tunnel
SC-8 | Transmission Confidentiality | | IKEv2 AES-256, DNSSEC, TLS on Webmin
SC-20 | Secure Name Resolution | | DNSSEC zone signing + validation
SC-28 | Protection of Info at Rest | | Dedicated partitions, GCS encryption, key chmod 600
SI-4 | System Monitoring | | Prometheus q15s, Fail2ban, AIDE, Kibana dashboards
SI-7 | Software Integrity | | AIDE daily file integrity scan on all 4 servers
Compliance Gaps & Remediation
PCI 12.1.1 — Security Policy: Policy drafted in Governance tab. Requires formal review and sign-off by management.
Target: Q2 2026
PCI 12.10.1 — IR Plan: IRP drafted in Governance tab. Requires formal review, tabletop exercise, and sign-off.
Target: Q2 2026
NIST CP-9 — Offsite Backup: GCS bucket exists, VPN tunnel established. Automated backup pipeline from on-prem → GCS not yet configured.
Target: Mar 2026
SSO Integration: 42 applications identified for Okta SSO. Not yet started.
Target: Q3 2026
Data Classification: 7 payment + 42 supporter-data apps identified. Classification program not started.
Target: Q3 2026
Estimated Monthly
$62
Mar 2026 forecast
Month-to-Date
$21
10 days into billing cycle
Cost Change
NEW
First month — no prior baseline
Monthly Forecast
Cost Breakdown by Service
Cloud VPN Tunnel
$36.00
Static IP (VPN)
$7.20
Cloud Run (MkDocs)
$2.50
Cloud Run (Status)
$2.50
Cloud Storage
$8.00
Artifact Registry
$2.00
Networking (Egress)
$3.50
Other (Logging, DNS)
$0.30
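The eight service figures above can be totalled to check them against the $62 monthly estimate. A minimal sketch using awk for the decimal sum follows; `sum_usd` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Add up dollar amounts given as arguments and print the total with two decimals.
# sum_usd is a hypothetical name used only for this sketch.
sum_usd() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f\n", s }'
}

# VPN tunnel, static IP, Cloud Run x2, Cloud Storage, Artifact Registry, egress, other.
sum_usd 36.00 7.20 2.50 2.50 8.00 2.00 3.50 0.30   # prints: 62.00
```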
Cost Details
Resource | SKU | Unit | Qty | Rate | Monthly
Cloud VPN Tunnel | gpus-vpn-tunnel-wdc | hr | 730 | $0.049 | $36.00
Static IP | gpus-vpn-ip | hr | 730 | $0.010 | $7.20
Cloud Run — MkDocs | gpus-mkdocs-portal | req | ~500 | $0.40/M | $2.50
Cloud Run — Status | gpus-status-site | req | ~500 | $0.40/M | $2.50
GCS — Backups | gpus-infra-backups-wdc | GB | ~50 | $0.01 | $5.00
GCS — TF State | gpus-infra-tf-state | GB | 0.01 | $0.02 | $0.01
Artifact Registry | gpus-images | GB | ~2 | $0.10 | $2.00
VPC Flow Logs | gpus-vpc | GB | ~1 | $0.50 | $0.50
Total Estimated | | | | | $62.00
Cost Optimization Notes
✓ Cloud Run scales to zero — no charge when idle (most of the time)
✓ Nearline storage — 50% cheaper than Standard for backup data
✓ Single VPN tunnel — upgrade to HA VPN ($72/mo) if uptime SLA needed
⚠ VPN is the biggest cost — $36/mo fixed regardless of traffic
ℹ Billing alert — set budget alert at $75/mo in GCP Console → Billing → Budgets
Information Security Policy
GPUS-POL-001 · v1.0 · Effective: 2026-03-10 · Owner: IT Department · Classification: INTERNAL
1. Access Control
All access to infrastructure systems follows the principle of least privilege. Administrative access is restricted to named accounts over the management network (192.168.124.0/24) using SSH key-based authentication only. Root login is disabled on all servers. Service accounts are set to nologin.
Server | Admin Account | Auth Method | Network
SKY / RAIN | dnsadmin | SSH key-only | 192.168.124.0/24
SUN / WIND | monitadmin | SSH key-only | 192.168.124.0/24
GCP | [email protected] | OAuth + IAM | IAM roles
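The policy restricts administrative access to the management network (192.168.124.0/24). A minimal sketch of how a source address could be tested against that range in plain shell arithmetic follows. The helper names `ip_to_int` and `in_cidr` are hypothetical, and this illustrates the check only; it is not how firewalld or sshd enforce the restriction.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1                      # split the address on dots into $1..$4
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr IP NETWORK/PREFIX: succeeds when IP falls inside the network.
in_cidr() {
  local ip net prefix mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  prefix=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Example: 192.168.124.10 is a made-up admin workstation address.
if in_cidr 192.168.124.10 192.168.124.0/24; then
  echo "source is on the management network"
else
  echo "source is outside the management network"
fi
```

The same function covers the production range, for example `in_cidr 192.168.121.5 192.168.120.0/23`.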
2. Change Management
All configuration changes require: (1) backup of affected files, (2) validation before deployment, (3) AIDE baseline update after change, (4) entry in /var/log/asset-inventory.log, (5) DNSSEC re-signing if zone files changed. GCP changes must go through Terraform — no manual console changes.
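Steps (1) and (4) of the change procedure lend themselves to a small wrapper. The sketch below is one possible shape, assuming a hypothetical function `record_change` and caller-supplied paths; validation, the AIDE baseline update, and DNSSEC re-signing (steps 2, 3, and 5) are service-specific and left to the operator.

```shell
#!/usr/bin/env bash
# record_change FILE BACKUP_DIR LOG_FILE NOTE
# Hypothetical helper: backs up FILE with a timestamp suffix, then appends
# a one-line entry to LOG_FILE. Returns non-zero if the backup fails.
record_change() {
  local file=$1 backup_dir=$2 log_file=$3 note=$4
  local stamp
  stamp=$(date +%Y%m%d-%H%M%S)
  mkdir -p "$backup_dir" || return 1
  # Step 1: back up the affected file, preserving mode and timestamps.
  cp -p "$file" "$backup_dir/$(basename "$file").$stamp" || return 1
  # Step 4: log the change.
  printf '%s %s %s\n' "$(date '+%F %T')" "$file" "$note" >> "$log_file"
}

# Example call (run as root; the backup directory is an assumed location):
# record_change /etc/dhcp/dhcpd.conf /backup/changes /var/log/asset-inventory.log "add reservation"
```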
3. Availability & Redundancy
DNS and DHCP services run in primary/secondary failover (SKY/RAIN). DHCP failover is automatic. DNS zone transfers use AXFR. Monitoring (SUN) and logging (WIND) are single-instance with daily backups. GCP services use Cloud Run with auto-scaling.
4. Logging & Monitoring
All servers forward logs to WIND via rsyslog (TCP:5140). Elasticsearch retains logs for 90 days with daily index rotation. Prometheus scrapes metrics every 15 seconds. AIDE runs daily integrity scans. auditd runs in immutable mode. VPC Flow Logs enabled in GCP.
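The 90-day retention rule amounts to a date comparison per daily index. A minimal sketch follows, assuming GNU date (as shipped on Rocky Linux) and a hypothetical helper `is_expired`. How Elasticsearch actually deletes old indices on WIND is not described here, so this only illustrates the age test.

```shell
#!/usr/bin/env bash
# is_expired INDEX_DATE TODAY [RETENTION_DAYS]
# Dates are YYYY-MM-DD. Succeeds when the index is older than the retention
# window (default 90 days). Requires GNU date for the -d option.
is_expired() {
  local idx today keep age_days
  idx=$(date -u -d "$1" +%s) || return 2
  today=$(date -u -d "$2" +%s) || return 2
  keep=${3:-90}
  age_days=$(( (today - idx) / 86400 ))
  [ "$age_days" -gt "$keep" ]
}

if is_expired 2025-12-01 2026-03-12; then
  echo "index from 2025-12-01 is past retention"
fi
```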
5. Password Policy
Parameter | Value | CIS Control
Minimum length | 14 characters | CIS 5.4
Maximum age | 90 days | CIS 5.4
Lockout threshold | 5 failed attempts | CIS 5.4
Lockout duration | 15 minutes | CIS 5.4
Password history | 5 remembered | CIS 5.4
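The policy values can be expressed as bounds and checked mechanically. The sketch below reads "key value" pairs on stdin and reports anything weaker than the policy; the key names and the helper `check_password_policy` are assumptions made for illustration and do not correspond to a specific PAM or login.defs file format.

```shell
#!/usr/bin/env bash
# check_password_policy: reads "key value" lines on stdin, prints one line per
# violation, and exits non-zero if any were found. Stricter values pass.
# Key names are hypothetical and chosen to mirror the policy parameters.
check_password_policy() {
  awk '
    $1 == "min_length"        && $2 < 14 { print "min_length below 14";       bad = 1 }
    $1 == "max_age_days"      && $2 > 90 { print "max_age_days above 90";     bad = 1 }
    $1 == "lockout_threshold" && $2 > 5  { print "lockout_threshold above 5"; bad = 1 }
    $1 == "lockout_minutes"   && $2 < 15 { print "lockout_minutes below 15";  bad = 1 }
    $1 == "history"           && $2 < 5  { print "history below 5";           bad = 1 }
    END { exit bad }
  '
}

# Values matching the policy produce no output.
printf '%s\n' "min_length 14" "max_age_days 90" "lockout_threshold 5" \
  "lockout_minutes 15" "history 5" | check_password_policy
```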
Incident Response Plan
GPUS-IRP-001 · v1.0 · Effective: 2026-03-10 · Owner: IT Department · Classification: INTERNAL
1. Incident Classification
Severity | Description | Response Time | Escalation | Examples
P1 Critical | Service outage or active breach | 15 min | IT Manager → CISO | Both DNS down, ransomware, data exfil
P2 High | Degraded service or confirmed intrusion attempt | 1 hr | IT Manager | Single DNS down, AIDE alert, Fail2ban flood
P3 Medium | Anomaly requiring investigation | 4 hr | IT Team | Unusual audit events, DNS query spike
P4 Low | Minor issue, no impact | 24 hr | IT Team | Config drift, routine Fail2ban bans
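The response times in the classification table translate directly into deadlines. A minimal sketch follows, with hypothetical helpers `response_minutes` and `response_due` that map a severity code to its response window and compute the latest acceptable response time from a detection timestamp.

```shell
#!/usr/bin/env bash
# response_minutes SEVERITY: prints the response window in minutes.
response_minutes() {
  case "$1" in
    P1) echo 15 ;;
    P2) echo 60 ;;
    P3) echo 240 ;;
    P4) echo 1440 ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# response_due DETECTED_EPOCH SEVERITY: prints the deadline as epoch seconds.
response_due() {
  local mins
  mins=$(response_minutes "$2") || return 1
  echo $(( $1 + mins * 60 ))
}

response_minutes P1   # prints: 15
```

For a live incident, `response_due "$(date +%s)" P2` gives the deadline one hour from now.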
2. Phase 1 — Detection
Detection sources: AIDE file integrity alerts, Fail2ban ban events, auditd rule triggers, Prometheus alert rules, Kibana dashboards, GCP Cloud Audit Logs.
## Check all detection sources
# AIDE
sudo aide --check
# Fail2ban
sudo fail2ban-client status sshd
# auditd — recent security events (one key per query)
sudo ausearch -ts recent -k dns-zone-change
sudo ausearch -ts recent -k dhcp-config
# Prometheus alerts
curl -s http://192.168.120.3:9090/api/v1/alerts | python3 -m json.tool
# Kibana — auth failures
# Open http://192.168.124.4:5601 → auth-logs-* index
3. Phase 2 — Containment
## Isolate compromised server (example: SKY)
# Option A: Block all traffic except failover
# (rich rules are listed, then removed one at a time)
sudo firewall-cmd --zone=drop --list-rich-rules | while IFS= read -r rule; do
  [ -n "$rule" ] && sudo firewall-cmd --zone=drop --remove-rich-rule="$rule"
done
sudo firewall-cmd --zone=drop --add-rich-rule='rule family="ipv4" source address="192.168.120.2" accept'
# Option B: Shut down (RAIN takes over DNS/DHCP automatically)
sudo shutdown -h now
# Preserve evidence BEFORE remediation
# (directory creation and output files need root, so writes go through sudo)
INCIDENT_DIR=/var/log/incident/$(date +%F)
sudo mkdir -p "$INCIDENT_DIR"
sudo sh -c 'cp /var/named/wdc.us.gl3.db* "$1"/' _ "$INCIDENT_DIR"
sudo ausearch -ts today | sudo tee "$INCIDENT_DIR/audit.txt" > /dev/null
sudo aide --check 2>&1 | sudo tee "$INCIDENT_DIR/aide.txt" > /dev/null
4. Phase 3 — Eradication
Identify root cause from logs. Remove malicious artifacts. Restore from known-good backup if files were modified. Re-apply CIS hardening if configuration was altered.
5. Phase 4 — Recovery
## Restore from backup
BACKUP_DATE="YYYY-MM-DD"
tar xzf /backup/dns-dhcp/dns-backup-${BACKUP_DATE}.tar.gz -C /tmp
named-checkzone wdc.us.gl3 /tmp/zones/wdc.us.gl3.db
sudo cp /tmp/zones/* /var/named/
sudo cp /tmp/dhcpd.conf /etc/dhcp/dhcpd.conf
## Re-sign DNSSEC
cd /var/named
sudo dnssec-signzone -A -3 $(head -c 500 /dev/urandom | sha1sum | cut -b 1-16) \
  -N INCREMENT -o wdc.us.gl3 -t wdc.us.gl3.db
sudo rndc reload
sudo systemctl restart dhcpd
## Re-baseline AIDE
sudo aide --update && sudo mv /var/lib/aide/aide.db.new.gz /var/lib/aide/aide.db.gz
6. Phase 5 — Post-Incident
A post-incident report is due within 72 hours. It covers root cause analysis, timeline, affected systems, remediation actions, lessons learned, and process improvements. All evidence is preserved in /var/log/incident/.
Contacts
Role | Contact | Escalation
IT Infrastructure Lead | Rajesh Chhetry | First responder for all incidents
IT Manager | | P1/P2 escalation within 15 min / 1 hr
CISO | | P1 escalation, breach notification
Disaster Recovery Plan
GPUS-DRP-001 · v1.0 · Effective: 2026-03-10 · Owner: IT Department · Classification: INTERNAL
1. Recovery Objectives
System | RTO | RPO | Recovery Method
DNS (SKY/RAIN) | 5 min | 0 (real-time failover) | Automatic — RAIN takes over
DHCP (SKY/RAIN) | 30 sec | 0 (real-time failover) | Automatic — failover peer
Monitoring (SUN) | 1 hr | 15 sec (scrape interval) | ESXi snapshot restore
Logging (WIND) | 1 hr | 24 hr (daily backup) | ESXi snapshot + backup restore
Cloud VPN | 15 min | N/A | Terraform redeploy
Cloud Run | 5 min | N/A | Auto-healing by GCP
2. Disaster Scenarios
Scenario 1: Single server failure (SKY or RAIN)
Impact: Minimal — failover is automatic. RAIN serves DNS/DHCP if SKY is down and vice versa. Restore failed server from ESXi snapshot within 1 hour.
Scenario 2: Both DNS/DHCP servers down
## Emergency: Deploy from backup on any Rocky Linux 8 box
tar xzf /backup/dns-dhcp/dns-backup-LATEST.tar.gz -C /tmp
dnf install -y bind dhcp-server
cp /tmp/zones/* /var/named/
cp /tmp/dhcpd.conf /etc/dhcp/
cp /tmp/named.conf /etc/
systemctl start named dhcpd
Scenario 3: ESXi host failure
All 4 VMs lost. Rebuild from backups on replacement ESXi host. Total rebuild time: ~4 hours following the deployment guides (sky-rain + sun-wind docs).
Scenario 4: WDC site loss (fire, flood)
GCP services remain operational. Backups in GCS bucket. DNS can be redirected at Hover. Rebuild on-prem at DR site using GCS backups + Terraform + deployment guides.
Scenario 5: Cloud VPN tunnel down
## Check tunnel status
gcloud compute vpn-tunnels describe gpus-vpn-tunnel-wdc --region=us-central1
## If ESTABLISHED lost — check Meraki side first
# Meraki Dashboard → Security & SD-WAN → VPN Status
## Redeploy VPN via Terraform if needed
cd ~/terraform/gpus-infra/terraform
terraform apply -target=google_compute_vpn_tunnel.wdc_tunnel
3. Backup Schedule
Data | Frequency | Location | Retention
DNS zone files | Daily cron | /backup + GCS (planned) | 90 days
DHCP config + leases | Daily cron | /backup + GCS (planned) | 90 days
ES snapshots | Daily | /backup + GCS (planned) | 90 days
Prometheus TSDB | Daily | /backup | 90 days
ESXi VM snapshots | Weekly | Local datastore | 4 snapshots
Terraform state | Every apply | GCS (gpus-infra-tf-state) | 5 versions
4. DR Testing Schedule
Quarterly DR tests: Q1 (DNS failover), Q2 (full server restore from backup), Q3 (site failover simulation), Q4 (full tabletop exercise). Results documented and reviewed by IT Manager.
Change Log
2026-03-12 22:00
Custom domain mapped: status.greenpeace.org → Cloud Run · SSL auto-provisioned [GCP]
2026-03-12 21:00
GCS backup pipeline LIVE — all 4 servers → gs://gpus-infra-backups-wdc · daily 02:00 [BACKUP]
2026-03-12 20:00
Terraform state migrated to GCS (gpus-infra-tf-state) · 20 resources [GCP]
2026-03-12 15:55
Status backend v10 live — all 4 servers returning real-time data via SSH over VPN [GCP]
2026-03-12 15:30
SUN firewall: TCP 9090 (Prometheus) opened for 10.8.0.0/28 · WIND: TCP 9200 (ES) opened [CONFIG]
2026-03-12 14:00
VPN traffic verified — all 4 servers reachable from GCP · Meraki routes + firewall rules fixed [GCP]
2026-03-10 13:51
chronyd fix: denyall → deny all on SUN + WIND · AIDE re-baselined [CONFIG]
2026-03-10 13:38
SUN + WIND rebooted — auditd immutable active · CIS 48/48 + 51/51 [CIS]
2026-03-10 11:15
Cloud VPN ESTABLISHED — 130.211.194.72 ↔ 38.140.146.68 [GCP]
2026-03-10 11:00
GCP infra deployed — VPC, VPN, Cloud Run ×2, GCS ×2, Artifact Registry (19 resources) [GCP]
2026-03-10 10:55
RAIN DHCP updated — 112 reservations · failover normal · AIDE re-baselined [DHCP]
2026-03-10 10:49
SKY DNS/DHCP bulk update — 112 workstations · serial 2026031002 · DNSSEC signed [DNS] [DHCP]
2026-03-10 10:30
GCP project gpus-infra created · billing linked · APIs enabled [GCP]
AIDE Baselines
Server | Baseline | Reason | Status
SKY | 2026-03-10 10:49 | DNS/DHCP bulk update |
RAIN | 2026-03-10 10:55 | DHCP update |
SUN | 2026-03-10 13:51 | chronyd fix + reboot |
WIND | 2026-03-10 13:52 | chronyd fix + reboot |
DNSSEC History
Date | Serial | Sigs | KSK | ZSK
2026-03-10 | 2026031002 | 280 | +008+37075 | +008+06660
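The serial 2026031002 looks like the common YYYYMMDDNN convention (date plus a two-digit revision). Assuming that convention holds for this zone, the next serial can be derived as sketched below. `next_serial` is a hypothetical helper, and it does not handle more than 99 revisions in one day.

```shell
#!/usr/bin/env bash
# next_serial CURRENT_SERIAL TODAY
# CURRENT_SERIAL is YYYYMMDDNN, TODAY is YYYYMMDD.
# Same day (or a serial already dated ahead): increment by one.
# Otherwise: start the new day at revision 00.
next_serial() {
  local current=$1 today=$2
  if [ "${current%??}" -ge "$today" ]; then
    echo $(( current + 1 ))
  else
    echo "${today}00"
  fi
}

next_serial 2026031002 20260310   # prints: 2026031003
next_serial 2026031002 20260312   # prints: 2026031200
```

In practice `today` would come from `date +%Y%m%d`.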
Terraform History
Date | Action | Resources | Project
2026-03-10 | Initial deploy | 19 created | gpus-infra
Document Versions
Document | Version | Updated
sky-rain-dns-dhcp-infrastructure.md | v2.6 | 2026-03-12
sun-wind-monitoring-logging.md | v1.3 | 2026-03-12
wdc-infrastructure-architecture-overview.md | v1.4 | 2026-03-12
wdchostregistry.csv (IAR) | v2.2 | 2026-03-12
gpus-it-architecture.html | v2.4 | 2026-03-12
gcp-cloud-infrastructure.md | v2.0 | 2026-03-12
GCS Backups
—/4
Servers with last backup <25h
Last Run
Most recent GCS backup
GCS Bucket
gpus-infra-backups-wdc
us-central1 · 90-day retention
Schedule
02:00
Daily cron — all servers
TF State
GCS
gpus-infra-tf-state · migrated 2026-03-12
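The "servers with last backup <25h" figure is a freshness count over the four servers. A minimal sketch of that calculation follows, using hypothetical helpers `backup_is_fresh` and `count_fresh` that work on epoch timestamps. How the status backend obtains the last-backup times is not shown here.

```shell
#!/usr/bin/env bash
# backup_is_fresh LAST_BACKUP_EPOCH NOW_EPOCH [MAX_AGE_HOURS]
# Succeeds when the backup is younger than the limit (default 25 hours).
backup_is_fresh() {
  local max_hours=${3:-25}
  [ $(( $2 - $1 )) -lt $(( max_hours * 3600 )) ]
}

# count_fresh NOW_EPOCH LAST_BACKUP_EPOCH...
# Prints "fresh/total" for the given last-backup timestamps.
count_fresh() {
  local now=$1 fresh=0 ts
  shift
  for ts in "$@"; do
    if backup_is_fresh "$ts" "$now"; then
      fresh=$(( fresh + 1 ))
    fi
  done
  echo "${fresh}/$#"
}

# Example with made-up timestamps: three backups within 25 h, one older.
count_fresh 200000 199000 150000 200000 100000   # prints: 3/4
```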
Cloud Backup — GCS (gs://gpus-infra-backups-wdc)
On-Prem → Google Cloud Storage · Daily 02:00 · /usr/local/bin/gpus-backup.sh
Server | GCS Path | Contents | Last Backup | Status | Size
SKY | …/sky/ | named + dhcp | 2026-03-12 | ✓ Success | ~900KB
RAIN | …/rain/ | named + dhcp | 2026-03-12 | ✓ Success | ~2.7MB
SUN | …/sun/ | prometheus + grafana | 2026-03-12 | ✓ Success | ~22MB
WIND | …/wind/ | elasticsearch + logstash + kibana | 2026-03-12 | ✓ Success | ~2.2MB
Local Backup — NAS (/backup on each server)
On-Prem Local Storage · Daily cron · 30-day retention
Server | Script | Path | Contents | Retention | Status
SKY | /etc/cron.daily/dns-dhcp-backup | /backup/dns-dhcp/ | zones, DNSSEC keys, dhcpd.conf, leases, AIDE db | 30 days | ✓ Active
RAIN | /etc/cron.daily/dns-dhcp-backup | /backup/dns-dhcp/ | zones, DNSSEC keys, dhcpd.conf, leases, AIDE db | 30 days | ✓ Active
SUN | /etc/cron.daily/mon-backup | /backup/monitoring/ | prometheus.yml, grafana.ini, dashboards, AIDE db | 30 days | ✓ Active
WIND | /etc/cron.daily/log-backup | /backup/logging/ | elasticsearch.yml, logstash pipelines, kibana.yml | 30 days | ✓ Active
ESXi Snapshots
VMware ESXi · Automated via FIRE hypervisor cron · 7-day retention
VM | Snapshot Time | Retention | Hypervisor | Status
SKY | 18:00 daily | 7 snapshots | FIRE | ✓ Active
RAIN | 19:00 daily | 7 snapshots | FIRE | ✓ Active
SUN | 20:00 daily | 7 snapshots | FIRE | ✓ Active
WIND | 21:00 daily | 7 snapshots | FIRE | ✓ Active
Backup History Log
Manual log — update after each backup cycle
Date | Server | Type | Result | Size | Notes
2026-03-12 | SKY | GCS | ✓ Success | ~900KB | Pipeline installed — first run
2026-03-12 | RAIN | GCS | ✓ Success | ~2.7MB | Pipeline installed — first run
2026-03-12 | SUN | GCS | ✓ Success | ~22MB | Pipeline installed — first run
2026-03-12 | WIND | GCS | ✓ Success | ~2.2MB | Pipeline installed — first run
Terraform State
Item | Value | Status
Backend | gcs | ✓ Migrated 2026-03-12
Bucket | gpus-infra-tf-state | ✓ Active
Prefix | terraform/state | ✓ 20 resources
Versioning | Enabled — 5 versions |