Open Source · Apache License 2.0

FaultRay

Estimate Infrastructure Resilience — Without Production Fault Injection

Explore your system's availability ceiling — research prototype, without touching production.

2,000+ Scenarios

DORA Mappings (research)

35+ Dashboard Pages

32,000+ Tests

Languages

State of DevOps: Elite performers deploy 182x more frequently · IBM: avg data breach costs $4.88M


terminal

$ pip install faultray

Successfully installed faultray-11.0.0

$ faultray demo

Running 2,048 scenarios across multiple engines... (illustrative)

Availability Ceiling: 99.9911% (4.05 nines)

N-Layer Analysis: Software=4.00 | Hardware=5.91 | Theoretical=6.65 | Ops=5.20 | External=4.85

Report saved: faultray-report.html

The Problem

Existing chaos tools inject real faults into your infrastructure.

Requires production or staging environment
Risk of real outages during testing
Expensive infrastructure costs
Complex setup and teardown
No mathematical availability proof

The Solution

FaultRay uses model-based simulation.

No production fault injection — simulation in memory
Low cost — runs on a laptop
Results in seconds
Works from declared YAML topology alone
Model-based availability ceiling estimate (depends on topology fidelity)

Challenges We Solve

FaultRay addresses the real operational gaps that keep engineering teams up at night.

Manual, person-dependent infrastructure management

FaultRay automates topology scanning and resilience analysis — no tribal knowledge required.

IaC (Terraform) not yet adopted

FaultRay auto-scans your existing AWS / GCP / Azure infrastructure without requiring any IaC setup.

Post-IPO availability & audit requirements

Explore DORA / SOC 2 alignment with research-prototype evidence drafts. Not audit-certified — independent legal review required.

Ad-hoc incident response with no runbooks

FaultRay auto-generates runbooks and remediation scripts from simulation findings.

AI governance readiness not started

Research-prototype mappings to METI / ISO 42001 requirements. Not audit-certified.

Estimate Resilience Early — Without Production Impact

Every feature maps to a concrete business outcome — not a checklist.

Core

Multiple Simulation Engines

Five core engines (Cascade, Dynamic, Ops, What-If, Capacity) cover network, process, resource, dependency, latency, blast radius, and SLA contract scenarios — powered by Monte Carlo, Markov chains, and queuing theory.
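As a rough illustration of the Markov-chain side of this, the sketch below computes the steady-state availability of a simple two-state (up/down) model and the effect of running independent replicas in parallel. The function names and the MTBF/MTTR figures are hypothetical and are not FaultRay's API or output.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run availability of a two-state (up/down) Markov model."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(single: float, replicas: int) -> float:
    """Availability of N independent replicas when one survivor is enough."""
    return 1.0 - (1.0 - single) ** replicas


# Hypothetical inputs: one failure every 999 hours, one hour to repair.
single = steady_state_availability(mtbf_hours=999.0, mttr_hours=1.0)
cluster = parallel_availability(single, replicas=3)
print(f"single={single:.6f} cluster={cluster:.9f}")
```

The parallel formula assumes replica failures are independent, which real deployments often violate (shared zones, shared config), so treat the result as an upper bound.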

Core

Thousands of Auto-Generated Scenarios

From single-node failures to cascading multi-region outages. Scenarios are generated from your declared topology YAML; a typical 10-component topology yields roughly 2,000 unique scenarios (illustrative).
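To give a feel for how scenario counts grow with topology size, here is a minimal sketch that enumerates every set of up to three simultaneously failing components. It is an assumption about one plausible generation strategy; how FaultRay actually builds its scenarios is not described on this page, and the component names are placeholders.

```python
from itertools import combinations


def enumerate_failure_sets(components, max_simultaneous):
    """Return every set of 1..max_simultaneous components failing together."""
    scenarios = []
    for k in range(1, max_simultaneous + 1):
        scenarios.extend(combinations(components, k))
    return scenarios


# Ten placeholder components, up to three failing at once.
components = [f"svc-{i}" for i in range(10)]
scenarios = enumerate_failure_sets(components, max_simultaneous=3)
print(len(scenarios))  # 10 + 45 + 120 = 175
```

Crossing each failure set with even a dozen fault types would push the total past 2,000, which is the order of magnitude quoted above.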

Core

N-Layer (5-Layer) Availability Model

Decomposes your availability ceiling estimate into five independent layers: Hardware, Software, Theoretical, Operational, and External SLA. Model-based; accuracy depends on topology fidelity.

AI-Assisted Analysis

Claude-assisted root cause analysis and improvement suggestions ranked by estimated impact and cost. Outputs are suggestions for engineering review, not final prescriptions.

DORA Evidence Drafts (Research Prototype)

Generate research-prototype Digital Operational Resilience Act evidence drafts. NOT validated for regulatory audit — independent legal and technical review required before any compliance use.

Security Feed Integration

Automatically incorporate CVE data and NVD feeds to simulate vulnerability-triggered cascading failures.

APM Monitoring (Real-Time)

Live performance metrics, trace correlation, and anomaly detection — integrated directly into your resilience dashboard with 35+ monitoring views.

AI Governance (METI / ISO 42001)

Research-prototype mappings to Japan's METI AI guidelines and ISO 42001 requirements. NOT audit-certified — independent legal review required for any compliance claim.

Autonomous Remediation

Auto-generate runbooks, remediation scripts, and Terraform patches from simulation findings. Reduce mean time to repair from hours to minutes.

35+ Dashboard Pages

Full-featured web dashboard with topology editor, scenario explorer, N-layer drill-down, heatmap, DORA research drafts, and executive summaries.


New in v11.0 — industry-first

AI Agent Resilience Simulation

The only chaos engineering tool that models AI agents (LLM endpoints, tool services, orchestrators) as first-class failure nodes. Simulate how infrastructure outages cascade into hallucinations before they hit production.

Cross-Layer Analysis

Trace how infrastructure failures (database down, cache miss) cascade into agent hallucinations. Expose silent degradation that looks healthy but produces wrong results.
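One way to picture this kind of cross-layer tracing is a breadth-first walk over the dependency graph, starting from the failed component and collecting everything that depends on it directly or transitively. The sketch below is illustrative only; the graph, the node names, and the `blast_radius` helper are hypothetical and not taken from FaultRay.

```python
from collections import deque


def blast_radius(depends_on, failed):
    """Return every component that transitively depends on `failed`."""
    # Invert the edges so we can ask "who depends on this node?"
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical topology: an agent sitting on top of ordinary infrastructure.
depends_on = {
    "support-agent": ["llm-endpoint", "search-tool"],
    "search-tool": ["cache", "database"],
    "llm-endpoint": [],
    "cache": [],
    "database": [],
}
print(sorted(blast_radius(depends_on, "database")))
```

Here a database outage reaches the agent through its search tool, which is the path along which stale or missing data could surface as ungrounded output.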

PREDICT · ADOPT · MANAGE

Three pillars for agent resilience: simulate chaos scenarios, assess deployment risk with blast-radius analysis, and generate monitoring rules automatically.

4 New Component Types

Model AI Agents, LLM Endpoints, Tool Services, and Agent Orchestrators as first-class nodes in your dependency graph alongside traditional infrastructure.

10 Agent Failure Modes

Hallucination, context overflow, LLM rate limiting, token exhaustion, tool failure, agent loops, prompt injection, confidence miscalibration, CoT collapse, and output amplification.

10-Mode AI Agent Failure Taxonomy

Mode	Status	Description
Hallucination	Degraded	Ungrounded output from degraded data sources
Context Overflow	Down	Token limit exceeded, agent cannot process input
LLM Rate Limit	Overloaded	API throttling from provider-side limits
Token Exhaustion	Down	Budget depleted, no tokens remaining
Tool Failure	Degraded	External tool or API unavailable
Agent Loop	Down	Infinite iteration, agent stuck in cycle
Prompt Injection	Degraded	Adversarial input manipulates agent behavior
Confidence Miscalibration	Degraded	Unreliable confidence scores
CoT Collapse	Degraded	Chain-of-thought reasoning failure
Output Amplification	Degraded	Upstream error propagation to downstream agents
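The taxonomy can be captured as a small lookup table, which makes it easy to separate hard failures from the quieter degraded modes. The sketch below is one possible encoding; the identifier spellings and the `Status` enum are assumptions for illustration, not FaultRay's internal names.

```python
from enum import Enum


class Status(Enum):
    DEGRADED = "degraded"
    DOWN = "down"
    OVERLOADED = "overloaded"


# The ten modes and their status as listed in the taxonomy above.
AGENT_FAILURE_MODES = {
    "hallucination": Status.DEGRADED,
    "context_overflow": Status.DOWN,
    "llm_rate_limit": Status.OVERLOADED,
    "token_exhaustion": Status.DOWN,
    "tool_failure": Status.DEGRADED,
    "agent_loop": Status.DOWN,
    "prompt_injection": Status.DEGRADED,
    "confidence_miscalibration": Status.DEGRADED,
    "cot_collapse": Status.DEGRADED,
    "output_amplification": Status.DEGRADED,
}


def modes_with_status(status):
    """Return the failure modes that map to the given status, sorted by name."""
    return sorted(m for m, s in AGENT_FAILURE_MODES.items() if s is status)


print(modes_with_status(Status.DOWN))
```

Six of the ten modes are "degraded" rather than "down", which is why silent degradation gets so much attention in the agent model.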

Terminal

$ faultray agent assess infra.yaml
Agent Risk Assessment
  support-agent    Risk: 4.2/10 (MEDIUM)  Blast radius: 3 components
  Recommendations: Add fallback LLM, enable hallucination circuit breaker

$ faultray agent scenarios infra.yaml
  Generated 12 agent-specific chaos scenarios

$ faultray agent monitor infra.yaml
  14 monitoring rules generated (context_window, hallucination_rate, ...)

How We Compare

FaultRay takes a fundamentally different approach

Criterion	FaultRay	Gremlin	Steadybit	AWS FIS
Approach	Mathematical Simulation	Real Fault Injection	Real Fault Injection	Real Fault Injection
Production Risk	Zero	High	Medium	High
Setup Time	5 minutes	Days	Hours	Hours
Scenarios	2,000+ auto-generated	Manual configuration	Template-based	AWS services only
Availability Proof	N-Layer Mathematical	No	No	No
AI Agent Modeling	10-mode taxonomy	No	No	No
Starting Cost	Free / OSS	$10,000+/yr	$5,000+/yr	Pay per use

N-Layer (5-Layer) Availability Limit Model

Decomposes your availability ceiling estimate into five independent constraint layers (model-based; accuracy depends on topology fidelity)

Layer 1: Hardware Limit

5.91 nines (99.99988%)

Constrained by physical components: disk MTBF, network gear, power systems, failover promotion time

Layer 2: Software Limit

4.00 nines (99.99%)

Your actual ceiling: deploy pipelines, config errors, dependency failures, human error rate

Layer 3: Theoretical Limit

6.65 nines (99.999978%)

Irreducible physical noise floor: network packet loss, GC pauses, kernel scheduling jitter

Layer 4: Operational Limit

5.20 nines (99.99937%)

Incident response time, on-call coverage, runbook completeness, automation level

Layer 5: External SLA

4.85 nines (99.9986%)

Hard ceiling imposed by third-party service availability (AWS, GCP, Stripe, etc.)

A_system = min(A_hw, A_sw, A_theoretical, A_ops, A_external)
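The formula can be tried out directly with the five layer values shown above. This is a minimal sketch: the helper names are hypothetical, and taking the minimum over "nines" is equivalent to taking it over availability because the conversion is monotonic.

```python
import math


def nines_to_availability(nines: float) -> float:
    """Convert a count of nines (e.g. 4.0) into availability (e.g. 0.9999)."""
    return 1.0 - 10.0 ** (-nines)


def availability_to_nines(availability: float) -> float:
    """Convert availability (e.g. 0.9999) into a count of nines (e.g. 4.0)."""
    return -math.log10(1.0 - availability)


# Layer ceilings in nines, copied from the five layers above.
layers = {
    "hardware": 5.91,
    "software": 4.00,
    "theoretical": 6.65,
    "operational": 5.20,
    "external": 4.85,
}

# A_system = min(...): the lowest layer sets the system ceiling.
binding = min(layers, key=layers.get)
a_system = nines_to_availability(layers[binding])
print(binding, f"{a_system:.4%}")
```

With these inputs the software layer is the minimum, so no amount of extra hardware redundancy raises the ceiling above 4 nines.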

Why This Matters

Most teams chase hardware nines while their software layer caps availability at 4 nines. FaultRay reveals exactly where your bottleneck lives so you invest in the right layer.

The N-Layer model is extensible: add Geographic, Economic, or custom domain-specific constraint layers to match your organization's unique availability boundaries.

faultray analyze --topology infra.yaml --output n-layer

Binding Constraint Detection

When a simulation predicts availability exceeding any layer ceiling, FaultRay flags it and identifies the binding constraint layer as the target for infrastructure improvement.
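A check of this kind could look roughly like the sketch below, which compares a predicted figure against each layer ceiling and reports the lowest layer as the binding constraint. The function name, return shape, and inputs are assumptions for illustration; FaultRay's actual detection logic is not documented here.

```python
def check_against_ceilings(predicted_nines, layer_ceilings):
    """Flag a prediction that exceeds any layer ceiling and name the binding layer."""
    exceeded = [
        name for name, ceiling in layer_ceilings.items()
        if predicted_nines > ceiling
    ]
    binding = min(layer_ceilings, key=layer_ceilings.get)
    return {"flagged": bool(exceeded), "exceeded": exceeded, "binding": binding}


# Layer ceilings in nines, as listed for the five layers on this page.
ceilings = {
    "hardware": 5.91,
    "software": 4.00,
    "theoretical": 6.65,
    "operational": 5.20,
    "external": 4.85,
}

# A simulated 4.05 nines is above the software ceiling of 4.00.
result = check_against_ceilings(4.05, ceilings)
print(result)
```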

Quick Start

From zero to an availability estimate in 3 steps

Install

Terminal

$ pip install faultray

Define Your Topology

infra.yaml

topology:
  name: my-saas-platform
  regions:
    - name: us-east-1
      zones: [a, b, c]
  services:
    - name: api-gateway
      replicas: 3
      dependencies: [auth, database]
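Before running a simulation it can help to sanity-check the topology. The sketch below mirrors the YAML above as a plain Python dict (to stay dependency-free) and lists dependencies that are referenced but never declared as services. Whether FaultRay requires every dependency to be declared is an assumption, and both helper functions are hypothetical.

```python
# The infra.yaml example above, written as a plain dict.
topology = {
    "name": "my-saas-platform",
    "regions": [{"name": "us-east-1", "zones": ["a", "b", "c"]}],
    "services": [
        {"name": "api-gateway", "replicas": 3, "dependencies": ["auth", "database"]},
    ],
}


def undeclared_dependencies(topology):
    """Return dependency names that no service entry declares."""
    declared = {svc["name"] for svc in topology["services"]}
    missing = set()
    for svc in topology["services"]:
        missing.update(d for d in svc.get("dependencies", []) if d not in declared)
    return sorted(missing)


def single_replica_services(topology):
    """Return services running fewer than two replicas (candidate SPOFs)."""
    return sorted(
        svc["name"] for svc in topology["services"] if svc.get("replicas", 1) < 2
    )


print(undeclared_dependencies(topology))
print(single_replica_services(topology))
```

In this example `auth` and `database` are referenced but not declared, so a fuller topology would add entries for them with their own replica counts.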

Run Analysis

Terminal

$ faultray run --topology infra.yaml
Running 2,048 scenarios across multiple engines... (illustrative)
Completed in 8.3s | Pass: 2,043 | Fail: 5

$ faultray report --format html
Report saved: report.html

$ faultray dashboard
Dashboard running at http://localhost:8550

[Dashboard preview: localhost:8550/dashboard, with Overview, Scenarios, N-Layer, and Reports tabs. Availability 99.99%; Scenarios 2,048; Passed 2,043; Failed 5. Availability by layer (nines): Hardware 5.91, Software 4.00, Theoretical 6.65, Operational 5.20, External 4.85.]

Why resilience engineering is urgent right now

DevOps Research

Google DORA State of DevOps Report 2024

“Top performers deploy 4× faster and have 10× lower change failure rates than low performers (Google DORA cohort finding; not attributable to any specific tool)”

FaultRay offers one way to explore reliability bottlenecks before deploy — research prototype, and outcomes depend heavily on your engineering practices, not just tooling.

Risk Management

IBM Cost of a Data Breach 2024

“Average cost of a data breach reached $4.88M in 2024 — highest ever recorded”

Infrastructure failures are a breach vector. Simulate weaknesses before production breaks.

Context

EU Digital Operational Resilience Act (DORA)

“DORA has applied to EU financial entities since 17 January 2025. Compliance work requires qualified auditors and independent legal review.”

FaultRay is a research prototype that explores DORA-aligned evidence patterns for internal design review. NOT a certified compliance tool — engage qualified auditors for actual DORA audits.

Ready to estimate your availability ceiling?

See FaultRay in action with your own infrastructure. Our team will walk you through a live simulation in 30 minutes.


Trusted by the Developer Community

GitHub Stars

Apache 2.0

Open Source

“We ran FaultRay against our payment pipeline topology before a Black Friday push. It surfaced a single-point-of-failure in our auth service that our team had missed for 18 months.”

SPOF surfaced in < 30s

Aspirational scenario

Series B FinTech (illustrative)

“FaultRay's research-prototype evidence drafts gave our team a starting point for internal resilience design review. We still engaged qualified auditors and independent legal review for actual compliance work.”

Research prototype, not audit tool

Aspirational scenario

EU-based engineering team (illustrative)

“We use FaultRay's N-Layer model in architecture reviews. It gives us a shared language between engineers and the CTO for discussing reliability trade-offs.”

Shared reliability language

Aspirational scenario

B2B SaaS team (illustrative)

Currently in beta (early access). We'd love your feedback and feature requests.


Built for your team, not theirs

Different roles, same outcome: infrastructure you can prove is reliable.

⚙️

SRE / Platform Engineer

"We don't know our blast radius until production explodes."

Map every failure path before it happens. Generate SLA evidence your leadership will trust.


📊

Engineering Manager

"Proving reliability to the board takes weeks of manual work."

DORA evidence drafts from simulations (research prototype — not for audit). Resilience dashboards generated for internal review in minutes.


🛡

CISO / CTO

"How do I know our vendors won't take us down?"

Supply chain risk simulation. Third-party failure impact quantified before contract signing.


ROI

ROI Calculator

Estimate your annual savings with FaultRay

Monthly Revenue: $500K (range: $50K to $5M)

Incidents per Year: 6 (range: 1 to 50)

Avg. Incident Duration: 4 hours (range: 0.5 to 24 hours)

Estimated Annual Loss: $200K

Estimated Savings with FaultRay: $140K

Annual Cost (Pro plan): $3.6K

ROI: +3,802%

Illustrative formula: Annual loss = Monthly revenue × (incident hours/720) × incidents × 12. The 70% FaultRay-effect reduction is an assumption for ROI modeling — actual impact depends on your deployment and is not guaranteed.
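The illustrative formula can be reproduced in a few lines, using the calculator's default inputs. This sketch follows the formula exactly as stated, including the factor of 12. The ROI definition, (savings minus cost) divided by cost, is an assumption, and the annual cost is passed in as the rounded 3.6 K shown on the page, so the ROI computed here may differ slightly from the displayed figure.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, as used in the stated formula


def annual_loss(monthly_revenue, incident_hours, incidents_per_year):
    """Annual loss per the illustrative formula on this page."""
    return monthly_revenue * (incident_hours / HOURS_PER_MONTH) * incidents_per_year * 12


def roi_percent(savings, annual_cost):
    """Assumed ROI definition: net savings relative to cost, in percent."""
    return (savings - annual_cost) / annual_cost * 100


# Calculator defaults: $500K monthly revenue, 6 incidents/year, 4 hours each.
loss = annual_loss(monthly_revenue=500_000, incident_hours=4, incidents_per_year=6)
savings = loss * 0.70  # the 70% reduction is the page's modeling assumption
roi = roi_percent(savings, annual_cost=3_600)
print(f"loss=${loss:,.0f} savings=${savings:,.0f} roi={roi:,.0f}%")
```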

Pricing

Start free. Scale as you grow.

Free

Perfect for individual engineers exploring chaos engineering.

5 simulations / month
Up to 5 components
Multiple simulation engines
N-Layer Availability Model
HTML reports
Community support
AI-assisted analysis
Custom SSO
Priority support


Feature Comparison

Feature	Free	Pro	Business
Simulations / month	5	100	Unlimited
Components	5	50	Unlimited
Simulation engines	100+	100+	100+
N-Layer Model
Report export	Markdown	PDF + MD	PDF + MD + JSON
AI-assisted analysis
Custom SSO / SAML
Support	Community	Email (24h)	Dedicated (1h)
