---
title: Health Checks
sidebar_position: 25
---

# Health Checks

Health checks enable orchestration platforms like Kubernetes to monitor the operational status of your Agent Mesh components. By exposing standardized health endpoints, your agents, gateways, and platform services can signal when they're ready to receive traffic, allowing for graceful deployments, automatic recovery from failures, and intelligent load balancing.

Agent Mesh inherits health check functionality from solace-ai-connector and extends it with broker connectivity checks, database connectivity checks, and custom health check support. For the underlying implementation details, see [solace-ai-connector Health Checks](https://github.com/SolaceLabs/solace-ai-connector/blob/main/docs/health_checks.md).

## Health Check Endpoints

Each Agent Mesh application exposes three HTTP health check endpoints:

| Endpoint | Purpose | Kubernetes Probe |
|----------|---------|------------------|
| `/startup` | One-time gate for initialization - once successful, latches to 200 forever | Startup probe |
| `/readyz` | Validates if the system is ready to process messages | Readiness probe |
| `/healthz` | Confirms the process is alive and responsive | Liveness probe |

All endpoints return:
- **HTTP 200** when healthy
- **HTTP 503** when unhealthy

:::note Understanding the Three Probes
- **Startup probe**: Runs during initialization. Once it succeeds, Kubernetes stops checking it. This prevents liveness probes from killing slow-starting applications.
- **Readiness probe**: Runs continuously. When it fails, Kubernetes removes the pod from service endpoints but keeps it running. When it recovers, traffic resumes.
- **Liveness probe**: Runs continuously. When it fails repeatedly, Kubernetes restarts the container.
:::

## Enabling Health Checks

Add the `health_check` section at the top level of your YAML configuration (outside the `apps:` block). You only need to add this to one configuration file for the health check server to run in the container:

```yaml
health_check:
  enabled: true
  port: 8080  # Default port

apps:
  - name: my-agent-app
    # ... app configuration ...
```

## Built-in Health Checks

### Broker Connection

Agent Mesh automatically monitors the connection to the Solace event broker. The health check returns healthy only when the broker connection status is `CONNECTED`.

When running in **dev mode** (using the DevBroker for local development), broker health checks always return healthy because there's no real broker connection to monitor.

### Database Connectivity

For components using SQL-based session services, Agent Mesh verifies database connectivity against each configured database. The health check fails if any database is unreachable or the query times out (configurable via `database_timeout_seconds`).

You can configure the database health check timeout in your app configuration:

```yaml
apps:
  - name: my-agent-app
    # ... other app config ...
    health_check:
      database_timeout_seconds: 5.0  # Default: 5 seconds
```

:::note
Database health checks only apply to components with SQL-based session services configured. If no databases are configured, this check automatically passes.
:::

## Custom Health Checks

For application-specific health requirements, you can define custom health check functions that run alongside the built-in checks. This is useful for verifying external service availability, checking model readiness, or implementing business-specific health criteria.
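
As an illustration of the "external service availability" case, here is a minimal sketch of a readiness check that relies only on the Python standard library. It follows the `(app) -> bool` shape documented under Writing Custom Health Check Functions. The host, port, and the `tcp_reachable` helper are hypothetical names chosen for this example and are not part of Agent Mesh.

```python
import logging
import socket

log = logging.getLogger(__name__)

# Hypothetical dependency endpoint (assumption): replace with a service
# your agent actually relies on.
SERVICE_HOST = "inventory.internal.example"
SERVICE_PORT = 443


def tcp_reachable(host: str, port: int, timeout_seconds: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection raises OSError (including timeouts and DNS
        # failures) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError as exc:
        log.warning("Dependency %s:%s is unreachable: %s", host, port, exc)
        return False


def check_ready(app) -> bool:
    """Custom readiness check: healthy only if the dependency accepts connections."""
    return tcp_reachable(SERVICE_HOST, SERVICE_PORT)
```

A function like this would be referenced from the app's `health_check` section using the `module.path:function_name` format, for example `my_agent.health:check_ready`. Keeping the timeout short is a deliberate choice here, so that a slow dependency cannot stall the check itself.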

### Configuration

Add custom health checks to your application configuration under the app's `health_check` section:

```yaml
apps:
  - name: my-agent-app
    # ... other app config ...
    health_check:
      custom_startup_check: my_agent.health:check_startup
      custom_ready_check: my_agent.health:check_ready
```

The format is `module.path:function_name`, where:

- `module.path` is the Python module path (e.g., `my_agent.health`)
- `function_name` is the function to call (e.g., `check_ready`)

### Writing Custom Health Check Functions

Custom health check functions receive the application instance and must return a boolean:

```python
import logging

log = logging.getLogger(__name__)


def check_startup(app) -> bool:
    """
    Custom startup check - verify external ML service is available.

    Args:
        app: The application instance, providing access to:
            - app.app_info: Application configuration
            - app.flows: All configured flows and components

    Returns:
        True if healthy, False if unhealthy
    """
    try:
        # Example: Check if an external ML service SDK can connect
        from my_ml_service import MLServiceClient
        client = MLServiceClient()
        return client.is_healthy()
    except Exception as e:
        log.warning("ML service health check failed: %s", e)
        return False


def check_ready(app) -> bool:
    """
    Custom readiness check - verify external payment service is reachable.

    Returns:
        True if healthy, False if unhealthy
    """
    try:
        # Example: Check if an external payment service SDK can connect
        from my_payment_service import PaymentClient
        client = PaymentClient()
        return client.ping()
    except Exception as e:
        log.warning("Payment service health check failed: %s", e)
        return False
```

:::warning Return Type
Custom health check functions must return a boolean (`True` or `False`). Non-boolean return values are treated as unhealthy, and exceptions are caught and logged as failures.
:::

## Kubernetes Integration

Configure Kubernetes probes in your deployment manifest to use the health check endpoints:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-agent
spec:
  template:
    spec:
      containers:
        - name: agent
          ports:
            - containerPort: 8080
              name: health
          startupProbe:
            httpGet:
              path: /startup
              port: health
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            periodSeconds: 30
            failureThreshold: 3
```

:::tip Probe Configuration

- **startupProbe**: Use a higher `failureThreshold` to allow time for initial model loading or database migrations
- **readinessProbe**: Use a shorter `periodSeconds` to quickly detect and recover from transient issues
- **livenessProbe**: Use a longer `periodSeconds` and higher `failureThreshold` to avoid unnecessary restarts during temporary issues

:::

## Health Check Flow

The `/healthz` (liveness) endpoint simply returns HTTP 200 if the health check server is running. It does not perform any additional checks.

The `/startup` and `/readyz` endpoints evaluate the following checks:

```text
┌─────────────────────────────────────────────────────────────┐
│  Health Check Request (/startup or /readyz)                 │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────────┐
                    │  Broker Connected?  │
                    │  (or dev_mode?)     │
                    └─────────────────────┘
                         │           │
                        Yes          No ──────► HTTP 503
                         │
                         ▼
                    ┌─────────────────────┐
                    │ Database Connected? │
                    │  (if configured)    │
                    └─────────────────────┘
                         │           │
                        Yes          No ──────► HTTP 503
                         │
                         ▼
                    ┌─────────────────────┐
                    │  Custom Check OK?   │
                    │  (if configured)    │
                    └─────────────────────┘
                         │           │
                        Yes          No ──────► HTTP 503
                         │
                         ▼
                      HTTP 200
```

## Configuration Reference

### Global Health Check Options

Configure the health check server at the top level of your YAML configuration:

```yaml
health_check:
  enabled: true
  port: 8080
  liveness_path: /healthz
  readiness_path: /readyz
  startup_path: /startup
  readiness_check_period_seconds: 5
  startup_check_period_seconds: 5
```

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `enabled` | boolean | `false` | Enable health check endpoints |
| `port` | integer | `8080` | Port for health check HTTP server |
| `liveness_path` | string | `/healthz` | URL path for liveness probe endpoint |
| `readiness_path` | string | `/readyz` | URL path for readiness probe endpoint |
| `startup_path` | string | `/startup` | URL path for startup probe endpoint |
| `readiness_check_period_seconds` | integer | `5` | Interval in seconds for internal readiness monitoring |
| `startup_check_period_seconds` | integer | `5` | Interval in seconds for internal startup monitoring |

:::tip Custom Endpoint Paths
If your infrastructure requires different endpoint paths (e.g., to avoid conflicts with other services), you can customize them using `liveness_path`, `readiness_path`, and `startup_path`. Remember to update your Kubernetes probe configurations to match.
:::

### App-specific Health Check Options

Configure custom health checks per application under each app's `health_check` section:

```yaml
apps:
  - name: my-agent-app
    # ... other app config ...
    health_check:
      database_timeout_seconds: 5.0
      custom_startup_check: my_agent.health:check_startup
      custom_ready_check: my_agent.health:check_ready
```

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `database_timeout_seconds` | float | `5.0` | Timeout for database connectivity checks |
| `custom_startup_check` | string | - | Module path for custom startup check (`module:function`) |
| `custom_ready_check` | string | - | Module path for custom readiness check (`module:function`) |

## Troubleshooting

### Health Check Returns 503

If your health check is returning 503, check the following:

1. **Broker connection**: Verify the Solace broker is reachable and credentials are correct

   ```bash
   # Check agent logs for connection status
   grep -i "connection" /path/to/agent.log
   ```

2. **Database connectivity**: Ensure databases are accessible and responding within the timeout period

3. **Custom health check**: Review logs for custom check failures

   ```bash
   grep -i "custom health check" /path/to/agent.log
   ```

### Health Check Times Out

If health checks are timing out:

1. **Database timeout**: Increase the timeout in your app configuration

   ```yaml
   apps:
     - name: my-agent-app
       health_check:
         database_timeout_seconds: 10.0
   ```

2. **Network issues**: Check network connectivity between the agent and dependent services

3. **Resource constraints**: Ensure the container has adequate CPU and memory

### Dev Mode Always Returns Healthy

When running with `dev_mode: true`, broker health checks always return healthy. This is expected behavior for local development. For production deployments, ensure `dev_mode` is disabled:

```yaml
broker:
  dev_mode: false
  # ... other broker configuration
```

## Related Documentation

- [solace-ai-connector Health Checks](https://github.com/SolaceLabs/solace-ai-connector/blob/main/docs/health_checks.md) - Underlying health check implementation
- [Kubernetes Deployment Guide](./kubernetes/kubernetes-deployment-guide.md) - Detailed Kubernetes deployment instructions
- [Logging Configuration](./logging.md) - Configure logging for health check debugging
- [Monitoring Your Agent Mesh](./observability.md) - Comprehensive observability features