Circuit Breaker Protection¶
Overview¶
The Karpenter IBM Cloud Provider includes a circuit breaker implementation to prevent scale-up storms and protect against cascading failures during node provisioning. The circuit breaker is configurable through Helm values, environment variables, and ConfigMaps.
Problem Statement¶
In production environments, bootstrap failures or API issues can cause Karpenter to continuously create new instances that fail to join the cluster.
Circuit Breaker Implementation¶
Core Components¶
The circuit breaker is implemented in /pkg/cloudprovider/circuitbreaker.go
and provides:
- State Management: CLOSED → OPEN → HALF_OPEN transitions
- Rate Limiting: Maximum instances per minute
- Concurrency Control: Maximum concurrent provisioning operations
- Failure Detection: Configurable failure thresholds
- Automatic Recovery: Time-based recovery with testing
Configuration¶
type CircuitBreakerConfig struct {
FailureThreshold int // 3 consecutive failures
FailureWindow time.Duration // Within 5 minutes
RecoveryTimeout time.Duration // Wait 15 minutes before retry
HalfOpenMaxRequests int // Allow 2 test requests
RateLimitPerMinute int // Max 2 instances/minute
MaxConcurrentInstances int // Max 5 concurrent provisions
}
Default Configuration¶
func DefaultCircuitBreakerConfig() *CircuitBreakerConfig {
return &CircuitBreakerConfig{
FailureThreshold: 3, // 3 consecutive failures
FailureWindow: 5 * time.Minute, // Within 5 minutes
RecoveryTimeout: 15 * time.Minute, // Wait 15 minutes before retry
HalfOpenMaxRequests: 2, // Allow 2 test requests
RateLimitPerMinute: 2, // Max 2 instances/minute
MaxConcurrentInstances: 5, // Max 5 concurrent provisions
}
}
Configuration Options¶
Helm Values¶
Configure the circuit breaker in your values.yaml
:
circuitBreaker:
enabled: true
# Use a preset configuration
preset: "balanced" # Options: conservative, balanced, aggressive, demo, custom
# Custom configuration (when preset: "custom")
config:
failureThreshold: 3
failureWindow: "5m"
recoveryTimeout: "15m"
halfOpenMaxRequests: 2
rateLimitPerMinute: 10
maxConcurrentInstances: 5
Environment Variables¶
Configure using environment variables:
export CIRCUIT_BREAKER_ENABLED=true
export CIRCUIT_BREAKER_FAILURE_THRESHOLD=3
export CIRCUIT_BREAKER_FAILURE_WINDOW=5m
export CIRCUIT_BREAKER_RECOVERY_TIMEOUT=15m
export CIRCUIT_BREAKER_HALF_OPEN_MAX_REQUESTS=2
export CIRCUIT_BREAKER_RATE_LIMIT_PER_MINUTE=10
export CIRCUIT_BREAKER_MAX_CONCURRENT_INSTANCES=5
Available Presets¶
Conservative (Production)¶
- Failure Threshold: 2 - Rate Limit: 2/minute - Recovery: 20 minutes - Best for production with strict SLAsBalanced (Default)¶
- Failure Threshold: 3 - Rate Limit: 5/minute - Recovery: 15 minutes - Good balance for most workloadsAggressive¶
- Failure Threshold: 5 - Rate Limit: 10/minute - Recovery: 5 minutes - Higher throughput for developmentDemo¶
- Failure Threshold: 10 - Rate Limit: 20/minute - Recovery: 1 minute - Optimized for demonstrationsCircuit Breaker States¶
CLOSED (Normal Operation)¶
- All provisioning requests allowed
- Failures are tracked but don't block requests
- Rate limiting and concurrency limits still apply
OPEN (Failing Fast)¶
- All provisioning requests blocked
- Returns
CircuitBreakerError
immediately - Automatically transitions to HALF_OPEN after
RecoveryTimeout
HALF_OPEN (Testing Recovery)¶
- Limited number of test requests allowed (
HalfOpenMaxRequests
) - Success → transitions to CLOSED
- Failure → transitions back to OPEN
Protection Mechanisms¶
1. Rate Limiting¶
// Prevents more than N instances per minute
if cb.instancesThisMinute >= cb.config.RateLimitPerMinute {
return &RateLimitError{
Limit: cb.config.RateLimitPerMinute,
Current: cb.instancesThisMinute,
TimeToReset: time.Minute - time.Since(cb.lastMinuteReset),
}
}
2. Concurrency Control¶
// Prevents too many simultaneous provisioning operations
if cb.concurrentInstances >= cb.config.MaxConcurrentInstances {
return &ConcurrencyLimitError{
Limit: cb.config.MaxConcurrentInstances,
Current: cb.concurrentInstances,
}
}
3. Failure Detection¶
// Opens circuit after threshold failures within window
if recentFailures >= cb.config.FailureThreshold {
cb.transitionToOpen()
}
Integration with CloudProvider¶
The circuit breaker is integrated into the main provisioning flow:
// Check circuit breaker before provisioning
if err := c.circuitBreaker.CanProvision(ctx, nodeClass.Name, nodeClass.Spec.Region, 0); err != nil {
return nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("circuit breaker blocked provisioning: %w", err))
}
// Record success/failure after provisioning
if err != nil {
c.circuitBreaker.RecordFailure(nodeClass.Name, nodeClass.Spec.Region, err)
} else {
c.circuitBreaker.RecordSuccess(nodeClass.Name, nodeClass.Spec.Region)
}
Error Types¶
CircuitBreakerError¶
type CircuitBreakerError struct {
State CircuitBreakerState
Message string
TimeToWait time.Duration
}
RateLimitError¶
ConcurrencyLimitError¶
Monitoring and Observability¶
Circuit Breaker Status¶
type CircuitBreakerStatus struct {
State CircuitBreakerState
RecentFailures int
FailureThreshold int
InstancesThisMinute int
RateLimit int
ConcurrentInstances int
MaxConcurrent int
LastStateChange time.Time
TimeToRecovery time.Duration
}
Logging¶
The circuit breaker provides logging for: - State transitions - Provisioning attempts (allowed/blocked) - Success/failure recording - Rate limit and concurrency violations