---
title: Health Checks
sidebar_position: 25
---

# Health Checks

Health checks enable orchestration platforms like Kubernetes to monitor the operational status of your Agent Mesh components. By exposing standardized health endpoints, your agents, gateways, and platform services can signal when they're ready to receive traffic, allowing for graceful deployments, automatic recovery from failures, and intelligent load balancing.

Agent Mesh inherits health check functionality from solace-ai-connector and extends it with broker connectivity checks, database connectivity checks, and custom health check support. For the underlying implementation details, see [solace-ai-connector Health Checks](https://github.com/SolaceLabs/solace-ai-connector/blob/main/docs/health_checks.md).

## Health Check Endpoints

Each Agent Mesh application exposes three HTTP health check endpoints:

| Endpoint | Purpose | Kubernetes Probe |
|----------|---------|------------------|
| `/startup` | One-time gate for initialization - once successful, latches to 200 forever | Startup probe |
| `/readyz` | Validates if the system is ready to process messages | Readiness probe |
| `/healthz` | Confirms the process is alive and responsive | Liveness probe |

All endpoints return:
- **HTTP 200** when healthy
- **HTTP 503** when unhealthy

:::note Understanding the Three Probes
- **Startup probe**: Runs during initialization. Once it succeeds, Kubernetes stops checking it. This prevents liveness probes from killing slow-starting applications.
- **Readiness probe**: Runs continuously. When it fails, Kubernetes removes the pod from service endpoints but keeps it running. When it recovers, traffic resumes.
- **Liveness probe**: Runs continuously. When it fails repeatedly, Kubernetes restarts the container.
:::

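
The latching behavior of `/startup` can be pictured with a small sketch. This is an illustration only, not the Agent Mesh or solace-ai-connector implementation; the `StartupLatch` class and the `check` callable it wraps are hypothetical names.

```python
from typing import Callable


class StartupLatch:
    """Hypothetical illustration of a startup gate that latches once healthy."""

    def __init__(self, check: Callable[[], bool]) -> None:
        self._check = check      # underlying startup check (assumed callable)
        self._latched = False    # becomes True after the first success

    def status_code(self) -> int:
        """Return 200 once the check has succeeded at least once, else 503."""
        if not self._latched and self._check() is True:
            self._latched = True
        return 200 if self._latched else 503


# Simulate a dependency that is down, comes up, then goes down again.
results = iter([False, True, False])
latch = StartupLatch(lambda: next(results))
codes = [latch.status_code() for _ in range(3)]
print(codes)  # [503, 200, 200]
```

After the first success the underlying check is no longer consulted, which is why a later outage does not flip the startup endpoint back to 503. Ongoing failures are the job of the readiness endpoint.
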
## Enabling Health Checks

Add the `health_check` section at the top level of your YAML configuration (outside the `apps:` block). You only need to add this to one configuration file for the health check server to run in the container:

```yaml
health_check:
  enabled: true
  port: 8080  # Default port

apps:
  - name: my-agent-app
    # ... app configuration ...
```

## Built-in Health Checks

### Broker Connection

Agent Mesh automatically monitors the connection to the Solace event broker. The health check returns healthy only when the broker connection status is `CONNECTED`.

When running in **dev mode** (using the DevBroker for local development), broker health checks always return healthy because there's no real broker connection to monitor.

### Database Connectivity

For components using SQL-based session services, Agent Mesh verifies database connectivity against each configured database. The health check fails if any database is unreachable or the query times out (configurable via `database_timeout_seconds`).

You can configure the database health check timeout in your app configuration:

```yaml
apps:
  - name: my-agent-app
    # ... other app config ...
    health_check:
      database_timeout_seconds: 5.0  # Default: 5 seconds
```

:::note
Database health checks only apply to components with SQL-based session services configured. If no databases are configured, this check automatically passes.
:::

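
As a rough picture of what a timeout-bounded connectivity check looks like, the sketch below runs a trivial query against each database and gives up after `timeout_seconds`. It is not the Agent Mesh implementation: the `databases_healthy` and `_ping` helpers are hypothetical, and an in-memory SQLite database stands in for a real session store so the example is self-contained.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict


def _ping(connect: Callable[[], sqlite3.Connection]) -> None:
    """Open a connection, run a trivial query, and close it again."""
    conn = connect()
    try:
        conn.execute("SELECT 1").fetchone()
    finally:
        conn.close()


def databases_healthy(connectors: Dict[str, Callable[[], sqlite3.Connection]],
                      timeout_seconds: float = 5.0) -> bool:
    """Return True only if every database answers within the timeout."""
    for name, connect in connectors.items():
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            pool.submit(_ping, connect).result(timeout=timeout_seconds)
        except FutureTimeout:
            print(f"database {name!r} timed out")
            return False
        except Exception as exc:  # unreachable database, bad credentials, ...
            print(f"database {name!r} failed: {exc}")
            return False
        finally:
            pool.shutdown(wait=False)  # do not block on a hung query
    return True  # also reached when no databases are configured


print(databases_healthy({}))                                                # True
print(databases_healthy({"sessions": lambda: sqlite3.connect(":memory:")}))  # True
```

The sketch mirrors the two documented behaviors: a single failing or slow database makes the whole check fail, and an empty set of databases passes.
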
## Custom Health Checks

For application-specific health requirements, you can define custom health check functions that run alongside the built-in checks. This is useful for verifying external service availability, checking model readiness, or implementing business-specific health criteria.

### Configuration

Add custom health checks to your application configuration under the app's `health_check` section:

```yaml
apps:
  - name: my-agent-app
    # ... other app config ...
    health_check:
      custom_startup_check: my_agent.health:check_startup
      custom_ready_check: my_agent.health:check_ready
```

The format is `module.path:function_name`, where:

- `module.path` is the Python module path (e.g., `my_agent.health`)
- `function_name` is the function to call (e.g., `check_ready`)

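
A `module.path:function_name` string of this kind is conventionally resolved with `importlib`. The sketch below shows that general pattern; `resolve_check` is a hypothetical helper, not the loader Agent Mesh uses, and it can be handy for verifying that a configured value is importable before deploying.

```python
import importlib
from typing import Callable


def resolve_check(spec: str) -> Callable[..., bool]:
    """Resolve a 'module.path:function_name' string to a callable."""
    module_path, sep, function_name = spec.partition(":")
    if not sep or not module_path or not function_name:
        raise ValueError(f"expected 'module.path:function_name', got {spec!r}")
    module = importlib.import_module(module_path)  # ImportError if not importable
    func = getattr(module, function_name)          # AttributeError if missing
    if not callable(func):
        raise TypeError(f"{spec!r} does not refer to a callable")
    return func


# Demonstrated with a standard-library function so the example runs anywhere.
exists = resolve_check("os.path:exists")
print(exists("."))  # True
```

For a value such as `my_agent.health:check_ready` to resolve, the `my_agent` package has to be importable from the Python environment inside the container.
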
### Writing Custom Health Check Functions

Custom health check functions receive the application instance and must return a boolean:

```python
import logging

log = logging.getLogger(__name__)

def check_startup(app) -> bool:
    """
    Custom startup check - verify external ML service is available.

    Args:
        app: The application instance, providing access to:
             - app.app_info: Application configuration
             - app.flows: All configured flows and components

    Returns:
        True if healthy, False if unhealthy
    """
    try:
        # Example: Check if an external ML service SDK can connect
        from my_ml_service import MLServiceClient
        client = MLServiceClient()
        return client.is_healthy()
    except Exception as e:
        log.warning("ML service health check failed: %s", e)
        return False


def check_ready(app) -> bool:
    """
    Custom readiness check - verify external payment service is reachable.

    Returns:
        True if healthy, False if unhealthy
    """
    try:
        # Example: Check if an external payment service SDK can connect
        from my_payment_service import PaymentClient
        client = PaymentClient()
        return client.ping()
    except Exception as e:
        log.warning("Payment service health check failed: %s", e)
        return False
```

:::warning Return Type
Custom health check functions must return a boolean (`True` or `False`). Non-boolean return values are treated as unhealthy, and exceptions are caught and logged as failures.
:::

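
The return-type rule can be made concrete with a small wrapper. This is a sketch of the documented behavior under the assumption that only a real `bool` counts; `run_custom_check` is a hypothetical name, not part of the Agent Mesh API.

```python
import logging
from typing import Any, Callable

log = logging.getLogger(__name__)


def run_custom_check(func: Callable[[Any], Any], app: Any) -> bool:
    """Hypothetical wrapper: exceptions and non-boolean results are unhealthy."""
    try:
        result = func(app)
    except Exception as exc:
        log.warning("Custom health check raised: %s", exc)
        return False
    if not isinstance(result, bool):
        log.warning("Custom health check returned non-boolean %r", result)
        return False
    return result


def raises(app: Any) -> bool:
    raise RuntimeError("boom")


print(run_custom_check(lambda app: True, None))  # True
print(run_custom_check(lambda app: "ok", None))  # False
print(run_custom_check(lambda app: 1, None))     # False
print(run_custom_check(raises, None))            # False
```

Truthy values such as `"ok"` or `1` do not pass here, so a check that forwards an SDK's return value should make sure that value is a genuine boolean.
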
## Kubernetes Integration

Configure Kubernetes probes in your deployment manifest to use the health check endpoints:

```yaml
# Excerpt: selector, pod labels, and container image are omitted for brevity
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-agent
spec:
  template:
    spec:
      containers:
        - name: agent
          ports:
            - containerPort: 8080
              name: health
          startupProbe:
            httpGet:
              path: /startup
              port: health
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            periodSeconds: 30
            failureThreshold: 3
```

:::tip Probe Configuration

- **startupProbe**: Use a higher `failureThreshold` to allow time for initial model loading or database migrations
- **readinessProbe**: Use a shorter `periodSeconds` to quickly detect and recover from transient issues
- **livenessProbe**: Use a longer `periodSeconds` and higher `failureThreshold` to avoid unnecessary restarts during temporary issues

:::

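
To see what the probe values in the manifest mean in wall-clock terms, the arithmetic can be sketched as follows. The two helper functions are hypothetical, and the results are approximations that ignore each probe's `timeoutSeconds` and scheduling jitter.

```python
def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container has to start before Kubernetes restarts it."""
    return initial_delay + period * failure_threshold


def detection_window_seconds(period: int, failure_threshold: int) -> int:
    """Approximate time for a running probe to register enough consecutive failures."""
    return period * failure_threshold


print(startup_budget_seconds(5, 5, 30))  # 155 -> startup must finish within ~2.5 minutes
print(detection_window_seconds(10, 3))   # 30  -> pod removed from service after ~30 s
print(detection_window_seconds(30, 3))   # 90  -> container restarted after ~90 s
```

If initialization (for example, model loading or database migrations) can take longer than the startup budget, raise `failureThreshold` on the startup probe rather than loosening the liveness probe.
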
## Health Check Flow

The `/healthz` (liveness) endpoint simply returns HTTP 200 if the health check server is running. It does not perform any additional checks.

The `/startup` and `/readyz` endpoints evaluate the following checks:

```text
┌─────────────────────────────────────────────────────────────┐
│          Health Check Request (/startup or /readyz)         │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │  Broker Connected?  │
                    │  (or dev_mode?)     │
                    └─────────────────────┘
                         │           │
                        Yes          No ──────► HTTP 503
                         │
                         ▼
                    ┌─────────────────────┐
                    │ Database Connected? │
                    │ (if configured)     │
                    └─────────────────────┘
                         │           │
                        Yes          No ──────► HTTP 503
                         │
                         ▼
                    ┌─────────────────────┐
                    │  Custom Check OK?   │
                    │  (if configured)    │
                    └─────────────────────┘
                         │           │
                        Yes          No ──────► HTTP 503
                         │
                         ▼
                     HTTP 200
```

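
The same decision sequence can be written as a short function. This is a sketch of the flow as documented here, not the actual implementation; `evaluate` and its parameters are hypothetical, with `None` standing for "not configured".

```python
from typing import Optional


def evaluate(broker_connected: bool,
             dev_mode: bool = False,
             databases_ok: Optional[bool] = None,
             custom_ok: Optional[bool] = None) -> int:
    """Return the HTTP status for /startup or /readyz."""
    if not (dev_mode or broker_connected):
        return 503  # broker check failed
    if databases_ok is False:
        return 503  # a configured database is unreachable or timed out
    if custom_ok is False:
        return 503  # the configured custom check reported unhealthy
    return 200


print(evaluate(broker_connected=True))                       # 200
print(evaluate(broker_connected=False))                      # 503
print(evaluate(broker_connected=False, dev_mode=True))       # 200
print(evaluate(broker_connected=True, databases_ok=False))   # 503
print(evaluate(broker_connected=True, databases_ok=True, custom_ok=False))  # 503
```

Each check short-circuits to 503, so a failing broker connection is reported before database or custom checks are considered.
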
## Configuration Reference

### Global Health Check Options

Configure the health check server at the top level of your YAML configuration:

```yaml
health_check:
  enabled: true
  port: 8080
  liveness_path: /healthz
  readiness_path: /readyz
  startup_path: /startup
  readiness_check_period_seconds: 5
  startup_check_period_seconds: 5
```

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `enabled` | boolean | `false` | Enable health check endpoints |
| `port` | integer | `8080` | Port for health check HTTP server |
| `liveness_path` | string | `/healthz` | URL path for liveness probe endpoint |
| `readiness_path` | string | `/readyz` | URL path for readiness probe endpoint |
| `startup_path` | string | `/startup` | URL path for startup probe endpoint |
| `readiness_check_period_seconds` | integer | `5` | Interval in seconds for internal readiness monitoring |
| `startup_check_period_seconds` | integer | `5` | Interval in seconds for internal startup monitoring |

:::tip Custom Endpoint Paths
If your infrastructure requires different endpoint paths (e.g., to avoid conflicts with other services), you can customize them using `liveness_path`, `readiness_path`, and `startup_path`. Remember to update your Kubernetes probe configurations to match.
:::

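
To illustrate how the documented defaults combine with a partial configuration, here is a small sketch. `HealthCheckSettings` is a hypothetical container that mirrors the table above; rejecting unknown keys is a choice made for this example to catch typos, not a statement about how Agent Mesh validates its configuration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class HealthCheckSettings:
    """Hypothetical container mirroring the documented option defaults."""
    enabled: bool = False
    port: int = 8080
    liveness_path: str = "/healthz"
    readiness_path: str = "/readyz"
    startup_path: str = "/startup"
    readiness_check_period_seconds: int = 5
    startup_check_period_seconds: int = 5

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "HealthCheckSettings":
        """Apply defaults to any option missing from the parsed YAML mapping."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise ValueError(f"unknown health_check options: {sorted(unknown)}")
        return cls(**raw)


settings = HealthCheckSettings.from_dict({"enabled": True, "liveness_path": "/livez"})
print(settings.port, settings.liveness_path, settings.readiness_path)
# 8080 /livez /readyz
```

Only the overridden path changes; in this example the liveness probe in the Kubernetes manifest would need to point at `/livez` as well.
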
### App-specific Health Check Options

Configure custom health checks per application under each app's `health_check` section:

```yaml
apps:
  - name: my-agent-app
    # ... other app config ...
    health_check:
      database_timeout_seconds: 5.0
      custom_startup_check: my_agent.health:check_startup
      custom_ready_check: my_agent.health:check_ready
```

| Option | Type | Default | Description |
| ------ | ---- | ------- | ----------- |
| `database_timeout_seconds` | float | `5.0` | Timeout for database connectivity checks |
| `custom_startup_check` | string | - | Module path for custom startup check (`module:function`) |
| `custom_ready_check` | string | - | Module path for custom readiness check (`module:function`) |

## Troubleshooting

### Health Check Returns 503

If your health check is returning 503, check the following:

1. **Broker connection**: Verify the Solace broker is reachable and credentials are correct

   ```bash
   # Check agent logs for connection status
   grep -i "connection" /path/to/agent.log
   ```

2. **Database connectivity**: Ensure databases are accessible and responding within the timeout period

3. **Custom health check**: Review logs for custom check failures

   ```bash
   grep -i "custom health check" /path/to/agent.log
   ```

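
Before digging through logs, it can help to query all three endpoints and compare their results. The script below is a minimal sketch using only the standard library; the base URL `http://localhost:8080` assumes the default port and that the script runs where the health check server is reachable (for example, inside the container or through a port-forward).

```python
import urllib.error
import urllib.request
from typing import Dict, Optional

ENDPOINTS = ("/startup", "/readyz", "/healthz")  # default paths


def probe(base_url: str, path: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for one endpoint, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 when a check is failing
    except (urllib.error.URLError, OSError):
        return None      # server not running, wrong port, or timeout


def describe(status: Optional[int]) -> str:
    """Turn a status code into a short human-readable label."""
    if status is None:
        return "unreachable"
    return "healthy" if status == 200 else f"unhealthy ({status})"


def probe_all(base_url: str) -> Dict[str, str]:
    """Probe every endpoint and map its path to a label."""
    return {path: describe(probe(base_url, path)) for path in ENDPOINTS}


if __name__ == "__main__":
    # Assumed address: the default port 8080 on the local machine.
    for path, result in probe_all("http://localhost:8080").items():
        print(f"{path}: {result}")
```

If `/healthz` is healthy while `/readyz` reports 503, the process is running and one of the broker, database, or custom checks is failing. If every endpoint is unreachable, confirm that `health_check.enabled` is `true` and that the port matches.
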
### Health Check Times Out

If health checks are timing out:

1. **Database timeout**: Increase the timeout in your app configuration

   ```yaml
   apps:
     - name: my-agent-app
       health_check:
         database_timeout_seconds: 10.0
   ```

2. **Network issues**: Check network connectivity between the agent and dependent services

3. **Resource constraints**: Ensure the container has adequate CPU and memory

### Dev Mode Always Returns Healthy

When running with `dev_mode: true`, broker health checks always return healthy. This is expected behavior for local development. For production deployments, ensure `dev_mode` is disabled:

```yaml
broker:
  dev_mode: false
  # ... other broker configuration
```

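
A pre-deployment check can flag configurations that still have dev mode enabled. The sketch below walks an already parsed configuration (for example, the result of loading the YAML file) and reports every `broker` mapping with `dev_mode: true`. `find_dev_mode` is a hypothetical helper, the placement of `broker` under each app is an assumption, and only a literal boolean `true` is detected, not values substituted from environment variables.

```python
from typing import Any, Iterator


def find_dev_mode(node: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every `broker` mapping whose dev_mode is enabled."""
    if isinstance(node, dict):
        broker = node.get("broker")
        if isinstance(broker, dict) and broker.get("dev_mode") is True:
            yield f"{path}/broker"
        for key, value in node.items():
            yield from find_dev_mode(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_dev_mode(value, f"{path}/{index}")


# Hypothetical parsed configuration used only for this example.
config = {
    "apps": [
        {"name": "my-agent-app", "broker": {"dev_mode": True}},
        {"name": "other-app", "broker": {"dev_mode": False}},
    ]
}
print(list(find_dev_mode(config)))  # ['/apps/0/broker']
```

An empty result means no broker section in the parsed configuration has dev mode switched on.
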
## Related Documentation

- [solace-ai-connector Health Checks](https://github.com/SolaceLabs/solace-ai-connector/blob/main/docs/health_checks.md) - Underlying health check implementation
- [Kubernetes Deployment Guide](./kubernetes/kubernetes-deployment-guide.md) - Detailed Kubernetes deployment instructions
- [Logging Configuration](./logging.md) - Configure logging for health check debugging
- [Monitoring Your Agent Mesh](./observability.md) - Comprehensive observability features