
Troubleshooting Guide

This guide helps diagnose and resolve common issues with the Karpenter IBM Cloud Provider.

Quick Diagnostics

Check Controller Status

# Check if Karpenter controller is running
kubectl get pods -n karpenter

# Check controller logs
kubectl logs -n karpenter deployment/karpenter -f

# Check controller startup messages
kubectl logs -n karpenter deployment/karpenter | grep "Starting Controller"
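If the controller is running but nodes are not provisioning, recent errors and namespace events usually narrow things down. A quick sketch (the exact log wording varies between releases):

# Scan the last hour of logs for errors
kubectl logs -n karpenter deployment/karpenter --since=1h | grep -iE "error|failed"

# Check recent events in the namespace
kubectl get events -n karpenter --sort-by='.lastTimestamp'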

Common Issues

Authentication Issues

Failed to authenticate with IBM Cloud API

Failed to create VPC client: authentication failed
Error: {"errorMessage":"Unauthorized","errorCode":"401"}

Solution:

  1. Verify API keys are correctly set
  2. Check Service ID permissions
  3. Update Kubernetes secret:
    kubectl create secret generic karpenter-ibm-credentials \
      --from-literal=api-key="$IBM_API_KEY" \
      --from-literal=vpc-api-key="$VPC_API_KEY" \
      --namespace karpenter --dry-run=client -o yaml | kubectl apply -f -
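You can also validate the key outside the cluster before rotating the secret. A minimal sketch, assuming the ibmcloud CLI with the vpc-infrastructure plugin is installed; us-south is only an example region:

# Log in with the same key the controller uses
ibmcloud login --apikey "$IBM_API_KEY" -r us-south

# Confirm the targeted account, then verify VPC read access
ibmcloud target
ibmcloud is vpcs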
    

Instance Provisioning Issues

No suitable subnets found

Diagnosis:

# Check available subnets
ibmcloud is subnets --output json

# Check subnet capacity
ibmcloud is subnet SUBNET_ID --output json

Solutions:

  • Verify subnet exists in specified zone
  • Ensure the subnet has available IP addresses (see the check after this list)
  • Consider using auto-subnet selection
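To check zone placement and free addresses in one pass, each subnet in the VPC API carries an available_ipv4_address_count field; a sketch:

ibmcloud is subnets --output json | \
  jq '.[] | {name: .name, zone: .zone.name, available_ips: .available_ipv4_address_count}'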

Node Registration Issues

Nodes not joining cluster after provisioning

This is often caused by a chain of issues. Work through this systematic checklist:

1. Verify Instance Creation

# Check if instances are being created
ibmcloud is instances --output json | jq '.[] | select(.name | contains("nodepool"))'

# Check NodeClaim status
kubectl get nodeclaims -o wide
kubectl describe nodeclaim NODECLAIM_NAME

Expected: the instance status is running and the NodeClaim shows Launched: True
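If the NodeClaim is stuck before Launched: True, its events usually explain why. One way to pull them (events expire after their retention TTL, typically one hour):

kubectl get events -A --field-selector involvedObject.kind=NodeClaim --sort-by='.lastTimestamp'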

2. Check Network Connectivity (Most Common Issue)

Step 2a: Verify Subnet Placement

# Find which subnet your cluster nodes are in
kubectl get nodes -o wide  # Note the INTERNAL-IP range

# Check if Karpenter nodes are in the same subnet
ibmcloud is instance INSTANCE_ID --output json | jq '.primary_network_interface.subnet'

# If different subnets, nodes may be network-isolated!
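If the instance did land in a different subnet, comparing CIDR ranges confirms whether it is where you expected. A sketch using the ipv4_cidr_block field from the VPC API:

# Run once per subnet and compare the ranges
ibmcloud is subnet SUBNET_ID --output json | jq '.ipv4_cidr_block'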

Step 2b: Verify API Server Endpoint Configuration

# Find the INTERNAL API endpoint (not external!)
kubectl get endpoints kubernetes -o yaml
# OR
kubectl get endpointslice -n default -l kubernetes.io/service-name=kubernetes

# Check what's configured in IBMNodeClass
kubectl get ibmnodeclass YOUR-NODECLASS -o yaml | grep apiServerEndpoint

# Update if using external IP instead of internal
kubectl patch ibmnodeclass YOUR-NODECLASS --type='merge' \
  -p='{"spec":{"apiServerEndpoint":"https://INTERNAL-IP:6443"}}'

Step 2c: Test Connectivity from Node

# Attach a floating IP for debugging (the --nic flag name may differ between CLI versions)
ibmcloud is floating-ip-reserve debug-fip --nic NIC_ID

# Then SSH and test
ssh -i ~/.ssh/YOUR_PRIVATE_KEY root@FLOATING_IP

# Test network layers
ping INTERNAL_API_IP                          # Test ICMP
telnet INTERNAL_API_IP 6443                   # Test TCP
curl -k https://INTERNAL_API_IP:6443/healthz  # Test HTTPS
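telnet is often missing on minimal images; netcat is a drop-in substitute for the TCP check:

nc -vz -w 5 INTERNAL_API_IP 6443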

3. Verify Security Groups

Security Group Requirements

Both worker and control plane security groups need proper rules for bidirectional communication.

Required Security Group Rules:

# Check current security groups on instance
ibmcloud is instance INSTANCE_ID --output json | \
  jq '.network_interfaces[0].security_groups'

# Worker Node Security Group needs:
# Outbound rules
- TCP 6443 to control plane subnet (Kubernetes API)
- TCP 10250 to all nodes (Kubelet)
- TCP/UDP 53 to 0.0.0.0/0 (DNS)
- TCP 80,443 to 0.0.0.0/0 (Package downloads)

# Inbound rules
- TCP 6443 from control plane (API server callbacks)
- TCP 10250 from all nodes (Kubelet peer communication)

# Add missing rules example:
ibmcloud is security-group-rule-add WORKER_SG_ID \
  outbound tcp --port-min 6443 --port-max 6443 \
  --remote CONTROL_PLANE_SUBNET_CIDR

ibmcloud is security-group-rule-add WORKER_SG_ID \
  inbound tcp --port-min 6443 --port-max 6443 \
  --remote CONTROL_PLANE_SUBNET_CIDR
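To compare what is actually configured against the requirements above, dump the group's current rules (the subcommand takes the security group ID):

ibmcloud is security-group-rules WORKER_SG_ID --output json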

4. Debug Bootstrap Process

Check Cloud-Init Status:

# SSH to node (after attaching floating IP)
ssh -i ~/.ssh/YOUR_PRIVATE_KEY root@FLOATING_IP

# Check cloud-init progress
sudo cloud-init status --long

# View bootstrap logs
sudo tail -100 /var/log/cloud-init.log
sudo tail -100 /var/log/cloud-init-output.log
sudo cat /var/log/karpenter-bootstrap.log

# Check if kubelet was installed
sudo systemctl status kubelet
sudo journalctl -u kubelet --no-pager -n 50

Common Bootstrap Issues:

  • Package repository access blocked (check security groups for HTTP/HTTPS)
  • CNI conflicts (check for pre-existing CNI configurations)
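Both can be checked directly from the node. A minimal sketch; the repository host depends on your bootstrap script, so pkgs.k8s.io is only an example:

# Can the node reach the package repository over HTTPS?
curl -sI https://pkgs.k8s.io >/dev/null && echo reachable || echo blocked

# Are there leftover CNI configurations that could conflict?
ls -la /etc/cni/net.d/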

5. Verify IBMNodeClass Configuration

# Check for common configuration issues
kubectl get ibmnodeclass YOUR-NODECLASS -o yaml

# Key fields to verify:
# - apiServerEndpoint: Must be INTERNAL cluster endpoint
# - bootstrapMode: Should be "cloud-init" for VPC
# - securityGroups: Must include proper security group IDs
# - sshKeys: Must use SSH key IDs (r010-xxx format), not names
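Because sshKeys must be IDs rather than names, look up the ID for each key in the region; a sketch:

ibmcloud is keys --output json | jq '.[] | {name: .name, id: .id}'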

6. Check Resource Group Configuration

# Verify instances are created in correct resource group
ibmcloud is instances --output json | \
  jq '.[] | select(.name | contains("nodepool")) |
  {name: .name, resource_group: .resource_group.id}'

# Should match the resource group in IBMNodeClass
kubectl get ibmnodeclass YOUR-NODECLASS -o yaml | grep resourceGroupID

Security Group Configuration

Kubernetes API Server Access

Common Issue: Security groups blocking API server communication (TCP 6443)

Symptoms:

# From worker node:
ping API_SERVER_IP              # ✅ SUCCESS
curl https://API_SERVER_IP:6443 # ❌ TIMEOUT

Required Security Group Rules:

Worker Node Security Group:

# Allow outbound to API server
ibmcloud is security-group-rule-add WORKER_SG_ID \
  outbound tcp --port-min 6443 --port-max 6443 \
  --remote CONTROL_PLANE_SUBNET_CIDR

# Allow inbound for return traffic
ibmcloud is security-group-rule-add WORKER_SG_ID \
  inbound tcp --port-min 6443 --port-max 6443 \
  --remote CONTROL_PLANE_SUBNET_CIDR

Control Plane Security Group:

# Allow inbound from workers
ibmcloud is security-group-rule-add CONTROL_PLANE_SG_ID \
  inbound tcp --port-min 6443 --port-max 6443 \
  --remote WORKER_SUBNET_CIDR

Debug connectivity:

# Test layer by layer
ping API_SERVER_IP                    # ICMP connectivity
telnet API_SERVER_IP 6443            # TCP connectivity
curl -k https://API_SERVER_IP:6443/healthz  # Application layer

Debug Mode

Enable debug logging for detailed information:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter
  namespace: karpenter
spec:
  template:
    spec:
      containers:
      - name: controller
        env:
        - name: LOG_LEVEL
          value: debug
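Editing the Deployment by hand is overwritten on the next Helm upgrade. For a quick test, kubectl set env achieves the same result (assuming the LOG_LEVEL variable shown above); make the change permanent in your Helm values:

kubectl set env -n karpenter deployment/karpenter LOG_LEVEL=debug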

Getting Help