README.md
# MLflow AI Gateway Benchmark

Measures the **proxy overhead** of the MLflow tracking-server-backed AI Gateway under
concurrent load. A fake OpenAI server simulates the upstream provider at a fixed latency,
so results reflect pure MLflow processing time rather than provider variance.

## Prerequisites

- Python 3.10+ with [`uv`](https://docs.astral.sh/uv/) — all scripts must be run via `uv run`, which installs dependencies automatically from each script's inline metadata
- Docker (required for `--database postgres` and multi-instance mode)

## Quick start

```bash
cd dev/benchmarks/gateway

# 4 instances behind nginx (default, requires Docker)
uv run run.py

# Single instance, SQLite (no Docker needed)
uv run run.py --instances 1

# Single instance, PostgreSQL
uv run run.py --instances 1 --database postgres

# Scale up
uv run run.py --instances 8 --workers 8

# Benchmark an existing endpoint directly (skips all setup)
uv run run.py --url http://your-server/gateway/my-endpoint/mlflow/invocations

# Basic-auth enabled (starts MLflow with --app-name=basic-auth,
# sends Authorization: Basic on every request)
uv run run.py --instances 1 --auth
```

## What is measured

Latency is measured **client-side** using `time.perf_counter()` around each `aiohttp` request.
Each sample covers the full round-trip: client serialization → loopback → full server processing → response deserialization. Only HTTP 200 responses count toward latency stats; errors are tracked separately.

Connection pooling and HTTP keep-alive are enabled, so TCP handshake cost is amortized after the warmup phase. A minimal sketch of the timing loop appears after the architecture diagrams below.

### What is NOT measured

| Factor             | In this benchmark                              | In production               |
| ------------------ | ---------------------------------------------- | --------------------------- |
| Network latency    | ~0 ms (loopback)                               | 1–100 ms per hop            |
| TLS/SSL            | None (plain HTTP)                              | ~5–20 ms per new connection |
| Provider inference | Fixed fake delay (`--fake-delay-ms`)           | Variable (50 ms – 60 s+)    |
| Authentication     | Off by default; basic-auth opt-in via `--auth` | Token validation, RBAC      |

## What MLflow does per request

Each invocation through the tracking-server gateway runs these steps:

```
1. Config resolution (DB-backed, cached after first hit)
2. Secret decryption (cached, 60 s TTL)
3. Provider instantiation
4. Tracing (if usage_tracking=True)
5. HTTP call to LLM API
```

Steps 1 (config resolution) and 4 (tracing) have historically been the dominant bottlenecks.
Config caching (enabled by default) eliminates most of step 1's cost. Tracing overhead
depends on the span processor in use.

## Architecture

### Single instance (`--instances 1`)

```
benchmark.py ──aiohttp──▶ MLflow server (:5731) ──▶ fake_server.py (:9137)
                                 │
                      SQLite or PostgreSQL
```

### Multi-instance (`--instances N`, default)

```
benchmark.py ──aiohttp──▶ nginx LB (:5731) ──round-robin──▶ MLflow :5800
                                                            MLflow :5801
                                                            MLflow :580N
                                                                 │
                                                    fake_server.py (:9137)
                                                    PostgreSQL (Docker)
```

MLflow instances are started **sequentially** (instance 0 first) so that the first instance
can initialize the DB schema before the others join. All instances share one PostgreSQL database.
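The client-side measurement described under "What is measured" reduces to wrapping each request in `time.perf_counter()` and keeping only HTTP 200 samples. Below is a minimal sketch of that loop, assuming a gateway listening on the default port; the endpoint name and chat-style payload are illustrative, not copied from `benchmark.py`:

```python
# Minimal latency probe in the spirit of benchmark.py (not the real client).
import asyncio
import time

import aiohttp

URL = "http://localhost:5731/gateway/my-endpoint/mlflow/invocations"  # illustrative
PAYLOAD = {"messages": [{"role": "user", "content": "hello"}]}  # illustrative body


async def timed_request(session: aiohttp.ClientSession) -> float | None:
    """One sample: full round-trip latency in ms, or None on a non-200 response."""
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()  # include response deserialization in the sample
        elapsed_ms = (time.perf_counter() - start) * 1000
        return elapsed_ms if resp.status == 200 else None


async def main() -> None:
    # A single ClientSession pools connections, so HTTP keep-alive
    # amortizes TCP handshakes across samples.
    async with aiohttp.ClientSession() as session:
        samples = [await timed_request(session) for _ in range(20)]
    ok = sorted(s for s in samples if s is not None)
    if ok:
        print(f"{len(ok)}/{len(samples)} OK, p50 ≈ {ok[len(ok) // 2]:.1f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```

Unlike the real client, this sketch issues requests one at a time; `benchmark.py` fans out up to `--max-concurrent` requests concurrently.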
## Options

| Flag                          | Default        | Description                                                              |
| ----------------------------- | -------------- | ------------------------------------------------------------------------ |
| `--url URL`                   | —              | Benchmark this URL directly, skipping all setup                          |
| `--instances N`               | 4              | Number of MLflow instances. Use 1 for single-instance (no nginx, optional SQLite) |
| `--workers N`                 | 4              | MLflow worker processes per instance                                     |
| `--database sqlite\|postgres` | `sqlite`       | Database to use — only applies when `--instances 1`                      |
| `--no-usage-tracking`         | —              | Disable usage tracking (tracing) on the endpoint                         |
| `--port N`                    | 5731           | Port to benchmark (MLflow port in single mode, nginx LB port in multi mode) |
| `--base-port N`               | 5800           | First MLflow instance port in multi mode (the rest are +1, +2, …)        |
| `--fake-server-port N`        | 9137           | Fake OpenAI server port                                                  |
| `--requests N`                | 2000           | Requests per run                                                         |
| `--max-concurrent N`          | 50             | Max concurrent requests                                                  |
| `--runs N`                    | 3              | Number of benchmark runs                                                 |
| `--fake-delay-ms N`           | 50             | Simulated provider latency in ms                                         |
| `--min-rps N`                 | —              | Fail (exit 1) if average throughput falls below N req/s (CI threshold)   |
| `--max-p50-ms N`              | —              | Fail (exit 1) if average P50 latency exceeds N ms (CI threshold)         |
| `--max-p99-ms N`              | —              | Fail (exit 1) if average P99 latency exceeds N ms (CI threshold)         |
| `--auth`                      | off            | Start MLflow with `--app-name=basic-auth`; send Basic auth on requests   |
| `--auth-username USER`        | `admin`        | Basic-auth username (matches `mlflow/server/auth/basic_auth.ini`)        |
| `--auth-password PASS`        | `password1234` | Basic-auth password (matches `mlflow/server/auth/basic_auth.ini`)        |

These flags can also be set via environment variables:
`INSTANCES`, `WORKERS_PER_INSTANCE`, `REQUESTS`, `MAX_CONCURRENT`, `RUNS`,
`FAKE_RESPONSE_DELAY_MS`, `MLFLOW_PORT`, `BASE_PORT`, `FAKE_SERVER_PORT`,
`AUTH`, `AUTH_USERNAME`, `AUTH_PASSWORD`.

To avoid conflicts with a local PostgreSQL instance, override the port via `GATEWAY_BENCH_POSTGRES_PORT` (default: 5432).

## Known limitations

- **Loopback only** — all processes run on the same machine. Results don't include real
  network latency between client, gateway, and provider.
- **No TLS** — MLflow is started with `--disable-security-middleware`. Production deployments
  add TLS termination overhead.
- **Fixed provider latency** — `fake_server.py` always responds in exactly `--fake-delay-ms`.
  Real providers have high variance (P99 often 5–10× P50).
- **Basic-auth is opt-in, no RBAC** — `--auth` enables `basic-auth` with the default
  admin user (full permissions), which measures the cost of HTTP Basic authentication
  and user lookup but not fine-grained RBAC checks against non-admin users.
- **Single-machine resource contention** — in multi-instance mode, all MLflow instances, nginx,
  PostgreSQL, and the benchmark client share CPU/memory. On a server with dedicated resources
  per instance, throughput will be higher.

## Files

| File             | Purpose                                                                         |
| ---------------- | ------------------------------------------------------------------------------- |
| `run.py`         | Main entry point — orchestrates servers, Docker, endpoint setup, and benchmark  |
| `benchmark.py`   | Async HTTP benchmark client (standalone or imported by `run.py`)                |
| `fake_server.py` | Fake OpenAI-compatible server for controlled latency simulation                 |
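To make the controlled-latency idea concrete: a fixed-delay OpenAI-compatible endpoint needs little more than a sleep before a canned chat-completions response. The sketch below illustrates the technique and is not the actual `fake_server.py`; the route and response fields simply follow the standard OpenAI chat-completions shape:

```python
# Minimal fixed-latency fake provider (illustrative, not fake_server.py itself).
import asyncio

from aiohttp import web

DELAY_MS = 50  # the real server takes this from --fake-delay-ms


async def chat_completions(request: web.Request) -> web.Response:
    await asyncio.sleep(DELAY_MS / 1000)  # simulate provider inference time
    return web.json_response({
        "id": "chatcmpl-fake",
        "object": "chat.completion",
        "model": "fake-model",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "ok"},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
    })


app = web.Application()
app.router.add_post("/v1/chat/completions", chat_completions)

if __name__ == "__main__":
    web.run_app(app, port=9137)  # matches the --fake-server-port default
```

Because the sleep is the only variable cost in such a server, any latency the benchmark reports beyond `--fake-delay-ms` is attributable to MLflow, nginx, and the client itself.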