Estimate Infrastructure Resilience — Without Production Fault Injection
Explore your system's availability ceiling — research prototype, without touching production.
Existing chaos tools inject real faults into your infrastructure.
FaultRay uses model-based simulation.
FaultRay addresses the real operational gaps that keep engineering teams up at night.
Manual, person-dependent infrastructure management
FaultRay automates topology scanning and resilience analysis — no tribal knowledge required.
IaC (Terraform) not yet adopted
FaultRay auto-scans your existing AWS / GCP / Azure infrastructure without requiring any IaC setup.
Post-IPO availability & audit requirements
Explore DORA / SOC 2 alignment with research-prototype evidence drafts. Not audit-certified — independent legal review required.
Ad-hoc incident response with no runbooks
FaultRay auto-generates runbooks and remediation scripts from simulation findings.
AI governance readiness not started
Research-prototype mappings to METI / ISO 42001 requirements. Not audit-certified.
Every feature maps to a concrete business outcome — not a checklist.
Five core engines (Cascade, Dynamic, Ops, What-If, Capacity) cover network, process, resource, dependency, latency, blast radius, and SLA contract scenarios — powered by Monte Carlo, Markov chains, and queuing theory.
From single-node failures to cascading multi-region outages. Scenarios are generated from your declared topology YAML; a typical 10-component topology yields roughly 2,000 unique scenarios (illustrative).
Decomposes your availability ceiling estimate into five independent layers: Hardware, Software, Theoretical, Operational, and External SLA. Model-based; accuracy depends on topology fidelity.
Claude-assisted root cause analysis and improvement suggestions ranked by estimated impact and cost. Outputs are suggestions for engineering review, not final prescriptions.
Generate research-prototype Digital Operational Resilience Act evidence drafts. NOT validated for regulatory audit — independent legal and technical review required before any compliance use.
Automatically incorporate CVE data and NVD feeds to simulate vulnerability-triggered cascading failures.
Live performance metrics, trace correlation, and anomaly detection — integrated directly into your resilience dashboard with 35+ monitoring views.
Research-prototype mappings to Japan's METI AI guidelines and ISO 42001 requirements. NOT audit-certified — independent legal review required for any compliance claim.
Auto-generate runbooks, remediation scripts, and Terraform patches from simulation findings. Reduce mean time to repair from hours to minutes.
Full-featured web dashboard with topology editor, scenario explorer, N-layer drill-down, heatmap, DORA research drafts, and executive summaries.
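Of the techniques named above, Monte Carlo simulation is the easiest to illustrate in a few lines. The sketch below is not FaultRay's engine. It assumes a hypothetical service that stays up as long as at least one of its independent replicas is up, and the function name and parameters are invented for illustration.

```python
import random


def simulate_availability(replica_up_prob, replicas, trials=100_000, seed=0):
    """Estimate the probability that at least one replica is up.

    Assumption: replicas fail independently, which real systems often violate.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    up = 0
    for _ in range(trials):
        # The service survives this trial if any single replica is up.
        if any(rng.random() < replica_up_prob for _ in range(replicas)):
            up += 1
    return up / trials


estimate = simulate_availability(replica_up_prob=0.9, replicas=3)
analytic = 1 - (1 - 0.9) ** 3  # closed form for independent replicas
print(f"simulated={estimate:.4f} analytic={analytic:.4f}")
```

For a case this simple the closed form is available, so the simulation only serves as a cross-check. Sampling becomes useful once dependencies and correlated failures make the closed form impractical.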
The only chaos engineering tool that models AI agents (LLM endpoints, tool services, orchestrators) as first-class failure nodes. Simulate how infrastructure outages cascade into hallucinations before they hit production.
Trace how infrastructure failures (database down, cache miss) cascade into agent hallucinations. Expose silent degradation that looks healthy but produces wrong results.
Three pillars for agent resilience: simulate chaos scenarios, assess deployment risk with blast-radius analysis, and generate monitoring rules automatically.
Model AI Agents, LLM Endpoints, Tool Services, and Agent Orchestrators as first-class nodes in your dependency graph alongside traditional infrastructure.
Hallucination, context overflow, LLM rate limiting, token exhaustion, tool failure, agent loops, prompt injection, confidence miscalibration, CoT collapse, and output amplification.
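To make the dependency-graph idea above concrete, here is a minimal sketch of a blast-radius calculation. It is an illustration under stated assumptions, not FaultRay's implementation: the graph, the node names, and the `blast_radius` function are all hypothetical.

```python
from collections import deque

# Hypothetical dependency graph: each node maps to the nodes it depends on.
DEPENDS_ON = {
    "api-gateway": ["support-agent"],
    "support-agent": ["llm-endpoint", "search-tool"],
    "search-tool": ["database"],
    "llm-endpoint": [],
    "database": [],
}


def blast_radius(depends_on, failed):
    """Return every node that transitively depends on the failed node."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = {node: [] for node in depends_on}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    affected.discard(failed)  # only relevant if the graph contains a cycle
    return affected


print(sorted(blast_radius(DEPENDS_ON, "database")))
# prints ['api-gateway', 'search-tool', 'support-agent']
```

Treating the agent and its LLM endpoint as ordinary graph nodes is what lets a database outage show up in the agent's blast radius.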
$ faultray agent assess infra.yaml
Agent Risk Assessment
support-agent Risk: 4.2/10 (MEDIUM) Blast radius: 3 components
Recommendations: Add fallback LLM, enable hallucination circuit breaker
$ faultray agent scenarios infra.yaml
Generated 12 agent-specific chaos scenarios
$ faultray agent monitor infra.yaml
14 monitoring rules generated (context_window, hallucination_rate, ...)
FaultRay takes a fundamentally different approach
| | FaultRay (Recommended) | Gremlin | Steadybit | AWS FIS |
|---|---|---|---|---|
| Approach | Mathematical Simulation | Real Fault Injection | Real Fault Injection | Real Fault Injection |
| Production Risk | Zero | High | Medium | High |
| Setup Time | 5 minutes | Days | Hours | Hours |
| Scenarios | 2,000+ auto-generated | Manual configuration | Template-based | AWS services only |
| Availability Proof | N-Layer Mathematical | No | No | No |
| AI Agent Modeling | 10-mode taxonomy | No | No | No |
| Starting Cost | Free / OSS | $10,000+/yr | $5,000+/yr | Pay per use |
Decomposes your availability ceiling estimate into five independent constraint layers (model-based; accuracy depends on topology fidelity)
Hardware: constrained by physical components such as disk MTBF, network gear, power systems, and failover promotion time
Software: your actual ceiling, set by deploy pipelines, config errors, dependency failures, and human error rate
Theoretical: the irreducible physical noise floor from network packet loss, GC pauses, and kernel scheduling jitter
Operational: incident response time, on-call coverage, runbook completeness, and automation level
External SLA: the hard ceiling imposed by third-party service availability (AWS, GCP, Stripe, etc.)
A_system = min(A_hw, A_sw, A_theoretical, A_ops, A_external)
Most teams chase hardware nines while their software layer caps availability at 4 nines. FaultRay reveals exactly where your bottleneck lives so you invest in the right layer.
The N-Layer model is extensible: add Geographic, Economic, or custom domain-specific constraint layers to match your organization's unique availability boundaries.
faultray analyze --topology infra.yaml --output n-layer
When a simulation predicts availability exceeding any layer ceiling, FaultRay flags it and identifies the binding constraint layer as the target for infrastructure improvement.
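The min-of-layers rule can be worked through by hand. The following is a small sketch, assuming made-up per-layer ceilings. The numbers and the helper names `binding_layer` and `downtime_minutes_per_year` are hypothetical and are not FaultRay output or API.

```python
MINUTES_PER_YEAR = 365 * 24 * 60

# Hypothetical per-layer availability ceilings (fraction of time available).
layers = {
    "hardware": 0.999999,
    "software": 0.9999,
    "theoretical": 0.9999999,
    "operational": 0.99999,
    "external": 0.99995,
}


def binding_layer(layers):
    """Return (name, ceiling) of the layer that limits A_system."""
    name = min(layers, key=layers.get)
    return name, layers[name]


def downtime_minutes_per_year(availability):
    """Convert an availability fraction into expected downtime per year."""
    return (1 - availability) * MINUTES_PER_YEAR


name, a_system = binding_layer(layers)
print(f"binding layer: {name}, A_system = {a_system}")
print(f"downtime: {downtime_minutes_per_year(a_system):.1f} minutes/year")
```

With these inputs the software layer is the binding constraint, so improving the hardware layer further would not move A_system at all.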
From zero to availability proof in 3 steps
$ pip install faultray
topology:
  name: my-saas-platform
  regions:
    - name: us-east-1
      zones: [a, b, c]
  services:
    - name: api-gateway
      replicas: 3
      dependencies: [auth, database]
$ faultray run --topology infra.yaml
Running 2,048 scenarios across multiple engines... (illustrative)
Completed in 8.3s | Pass: 2,043 | Fail: 5
$ faultray report --format html
Report saved: report.html
$ faultray dashboard
Dashboard running at http://localhost:8550
Why resilience engineering is urgent right now
Google DORA State of DevOps Report 2024
“Top performers deploy 4× faster and have 10× lower change failure rates than low performers (Google DORA cohort finding; not attributable to any specific tool)”
FaultRay offers one way to explore reliability bottlenecks before deploy — research prototype, and outcomes depend heavily on your engineering practices, not just tooling.
IBM Cost of a Data Breach 2024
“Average cost of a data breach reached $4.88M in 2024 — highest ever recorded”
Infrastructure failures are a breach vector. Simulate weaknesses before production breaks.
EU Digital Operational Resilience Act (DORA)
“DORA has applied to EU financial entities since 17 January 2025. Compliance work calls for qualified auditors and independent legal review.”
FaultRay is a research prototype that explores DORA-aligned evidence patterns for internal design review. NOT a certified compliance tool — engage qualified auditors for actual DORA audits.
See FaultRay in action with your own infrastructure. Our team will walk you through a live simulation in 30 minutes.
“We ran FaultRay against our payment pipeline topology before a Black Friday push. It surfaced a single-point-of-failure in our auth service that our team had missed for 18 months.”
Aspirational scenario
Series B FinTech (illustrative)
“FaultRay's research-prototype evidence drafts gave our team a starting point for internal resilience design review. We still engaged qualified auditors and independent legal review for actual compliance work.”
Aspirational scenario
EU-based engineering team (illustrative)
“We use FaultRay's N-Layer model in architecture reviews. It gives us a shared language between engineers and the CTO for discussing reliability trade-offs.”
Aspirational scenario
B2B SaaS team (illustrative)
Different roles, same outcome: infrastructure you can prove is reliable.
SRE / Platform Engineer
"We don't know our blast radius until production explodes."
Map every failure path before it happens. Generate SLA evidence your leadership will trust.
Start Free →
Engineering Manager
"Proving reliability to the board takes weeks of manual work."
DORA evidence drafts from simulations (research prototype — not for audit). Resilience dashboards generated for internal review in minutes.
See a Demo →
CISO / CTO
"How do I know our vendors won't take us down?"
Supply chain risk simulation. Third-party failure impact quantified before contract signing.
Get a Quote →
Estimate your annual savings with FaultRay
Illustrative formula: Annual loss = Monthly revenue × (incident hours/720) × incidents × 12. The 70% FaultRay-effect reduction is an assumption for ROI modeling — actual impact depends on your deployment and is not guaranteed.
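The illustrative formula can be written out as a short calculation. This is a sketch only: it reads "incidents" as incidents per month (an assumption implied by the ×12 factor), and the function names and sample inputs are hypothetical.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, matching the formula above


def annual_loss(monthly_revenue, hours_per_incident, incidents_per_month):
    """Annual loss = monthly revenue x (incident hours / 720) x incidents x 12."""
    loss_per_incident = monthly_revenue * (hours_per_incident / HOURS_PER_MONTH)
    return loss_per_incident * incidents_per_month * 12


def estimated_savings(loss, reduction=0.70):
    """Apply the assumed 70% reduction; this is a modeling input, not a result."""
    return loss * reduction


# Hypothetical inputs: 720,000 monthly revenue, 2-hour incidents, 1.5 per month.
loss = annual_loss(monthly_revenue=720_000, hours_per_incident=2, incidents_per_month=1.5)
print(f"annual loss: {loss:,.0f}")
print(f"estimated savings at 70%: {estimated_savings(loss):,.0f}")
```

Because the reduction factor is a parameter, it is easy to rerun the estimate with a more conservative value than 70%.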
Start free. Scale as you grow.
Perfect for individual engineers exploring chaos engineering.
For teams that need DORA evidence drafts (research prototype) and higher limits.
For enterprises needing unlimited access, SSO, and dedicated support.
| Feature | Free | Pro | Business |
|---|---|---|---|
| Simulations / month | 5 | 100 | Unlimited |
| Components | 5 | 50 | Unlimited |
| Simulation engines | 100+ | 100+ | 100+ |
| N-Layer Model | | | |
| Report export | Markdown | PDF + MD | PDF + MD + JSON |
| AI-assisted analysis | | | |
| Custom SSO / SAML | | | |
| Support | Community | Email (24h) | Dedicated (1h) |