README.md
# MLflow AI Gateway Benchmark

Measures the **proxy overhead** of the MLflow tracking-server-backed AI Gateway under
concurrent load. A fake OpenAI server simulates the upstream provider at a fixed latency,
so results reflect pure MLflow processing time rather than provider variance.

## Prerequisites

- Python 3.10+ with [`uv`](https://docs.astral.sh/uv/) — all scripts must be run via `uv run`, which installs dependencies automatically from each script's inline metadata
- Docker (required for `--database postgres` and multi-instance mode)

## Quick start

```bash
cd dev/benchmarks/gateway

# 4 instances behind nginx (default, requires Docker)
uv run run.py

# Single instance, SQLite (no Docker needed)
uv run run.py --instances 1

# Single instance, PostgreSQL
uv run run.py --instances 1 --database postgres

# Scale up
uv run run.py --instances 8 --workers 8

# Benchmark an existing endpoint directly (skips all setup)
uv run run.py --url http://your-server/gateway/my-endpoint/mlflow/invocations

# Basic-auth enabled (starts MLflow with --app-name=basic-auth,
# sends Authorization: Basic on every request)
uv run run.py --instances 1 --auth
```

## What is measured

Latency is measured **client-side** using `time.perf_counter()` around each `aiohttp` request.
Each sample covers the full round-trip: client serialization → loopback → full server processing → response deserialization. Only HTTP 200 responses count toward latency stats; errors are tracked separately.

Connection pooling and HTTP keep-alive are enabled, so TCP handshake cost is amortized after the warmup phase. A minimal sketch of the timing loop appears after the architecture diagrams below.

### What is NOT measured

| Factor             | In this benchmark                              | In production               |
| ------------------ | ---------------------------------------------- | --------------------------- |
| Network latency    | ~0 ms (loopback)                               | 1–100 ms per hop            |
| TLS/SSL            | None (plain HTTP)                              | ~5–20 ms per new connection |
| Provider inference | Fixed fake delay (`--fake-delay-ms`)           | Variable (50 ms – 60 s+)    |
| Authentication     | Off by default; basic-auth opt-in via `--auth` | Token validation, RBAC      |

## What MLflow does per request

Each invocation through the tracking-server gateway runs these steps:

```
1. Config resolution (DB-backed, cached after first hit)
2. Secret decryption (cached, 60 s TTL)
3. Provider instantiation
4. Tracing (if usage_tracking=True)
5. HTTP call to LLM API
```

Steps 1 (config resolution) and 4 (tracing) have historically been the dominant bottlenecks.
Config caching (enabled by default) eliminates most of step 1's cost. Tracing overhead
depends on the span processor in use.

## Architecture

### Single instance (`--instances 1`)

```
benchmark.py ──aiohttp──▶ MLflow server (:5731) ──▶ fake_server.py (:9137)
                                 │
                      SQLite or PostgreSQL
```

### Multi-instance (`--instances N`, default)

```
benchmark.py ──aiohttp──▶ nginx LB (:5731) ──round-robin──▶ MLflow :5800
                                                            MLflow :5801
                                                            MLflow :580N
                                                                 │
                                                    fake_server.py (:9137)
                                                    PostgreSQL (Docker)
```

MLflow instances are started **sequentially** (instance 0 first) so that the first instance
can initialize the DB schema before the others join. All instances share one PostgreSQL database.
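The client-side measurement described under "What is measured" reduces to wrapping each request in `time.perf_counter()` and keeping only HTTP 200 samples. Below is a minimal sketch of that loop, assuming a gateway listening on the default port; the endpoint name and chat-style payload are illustrative, not copied from `benchmark.py`:

```python
# Minimal latency probe in the spirit of benchmark.py (not the real client).
import asyncio
import time

import aiohttp

URL = "http://localhost:5731/gateway/my-endpoint/mlflow/invocations"  # illustrative
PAYLOAD = {"messages": [{"role": "user", "content": "hello"}]}  # illustrative body


async def timed_request(session: aiohttp.ClientSession) -> float | None:
    """One sample: full round-trip latency in ms, or None on a non-200 response."""
    start = time.perf_counter()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.read()  # include response deserialization in the sample
        elapsed_ms = (time.perf_counter() - start) * 1000
        return elapsed_ms if resp.status == 200 else None


async def main() -> None:
    # A single ClientSession pools connections, so HTTP keep-alive
    # amortizes TCP handshakes across samples.
    async with aiohttp.ClientSession() as session:
        samples = [await timed_request(session) for _ in range(20)]
    ok = sorted(s for s in samples if s is not None)
    if ok:
        print(f"{len(ok)}/{len(samples)} OK, p50 ≈ {ok[len(ok) // 2]:.1f} ms")


if __name__ == "__main__":
    asyncio.run(main())
```

Unlike the real client, this sketch issues requests one at a time; `benchmark.py` fans out up to `--max-concurrent` requests concurrently.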
## Options

| Flag                          | Default        | Description                                                              |
| ----------------------------- | -------------- | ------------------------------------------------------------------------ |
| `--url URL`                   | —              | Benchmark this URL directly, skipping all setup                          |
| `--instances N`               | 4              | Number of MLflow instances. Use 1 for single-instance (no nginx, optional SQLite) |
| `--workers N`                 | 4              | MLflow worker processes per instance                                     |
| `--database sqlite\|postgres` | `sqlite`       | Database to use — only applies when `--instances 1`                      |
| `--no-usage-tracking`         | —              | Disable usage tracking (tracing) on the endpoint                         |
| `--port N`                    | 5731           | Port to benchmark (MLflow port in single mode, nginx LB port in multi mode) |
| `--base-port N`               | 5800           | First MLflow instance port in multi mode (the rest are +1, +2, …)        |
| `--fake-server-port N`        | 9137           | Fake OpenAI server port                                                  |
| `--requests N`                | 2000           | Requests per run                                                         |
| `--max-concurrent N`          | 50             | Max concurrent requests                                                  |
| `--runs N`                    | 3              | Number of benchmark runs                                                 |
| `--fake-delay-ms N`           | 50             | Simulated provider latency in ms                                         |
| `--min-rps N`                 | —              | Fail (exit 1) if average throughput falls below N req/s (CI threshold)   |
| `--max-p50-ms N`              | —              | Fail (exit 1) if average P50 latency exceeds N ms (CI threshold)         |
| `--max-p99-ms N`              | —              | Fail (exit 1) if average P99 latency exceeds N ms (CI threshold)         |
| `--auth`                      | off            | Start MLflow with `--app-name=basic-auth`; send Basic auth on requests   |
| `--auth-username USER`        | `admin`        | Basic-auth username (matches `mlflow/server/auth/basic_auth.ini`)        |
| `--auth-password PASS`        | `password1234` | Basic-auth password (matches `mlflow/server/auth/basic_auth.ini`)        |

These flags can also be set via environment variables:
`INSTANCES`, `WORKERS_PER_INSTANCE`, `REQUESTS`, `MAX_CONCURRENT`, `RUNS`,
`FAKE_RESPONSE_DELAY_MS`, `MLFLOW_PORT`, `BASE_PORT`, `FAKE_SERVER_PORT`,
`AUTH`, `AUTH_USERNAME`, `AUTH_PASSWORD`.

To avoid conflicts with a local PostgreSQL instance, override the port via `GATEWAY_BENCH_POSTGRES_PORT` (default: 5432).

## Known limitations

- **Loopback only** — all processes run on the same machine. Results don't include real
  network latency between client, gateway, and provider.
- **No TLS** — MLflow is started with `--disable-security-middleware`. Production deployments
  add TLS termination overhead.
- **Fixed provider latency** — `fake_server.py` always responds in exactly `--fake-delay-ms`.
  Real providers have high variance (P99 often 5–10× P50).
- **Basic-auth is opt-in, no RBAC** — `--auth` enables `basic-auth` with the default
  admin user (full permissions), which measures the cost of HTTP Basic authentication
  and user lookup but not fine-grained RBAC checks against non-admin users.
- **Single-machine resource contention** — in multi-instance mode, all MLflow instances, nginx,
  PostgreSQL, and the benchmark client share CPU/memory. On a server with dedicated resources
  per instance, throughput will be higher.

## Files

| File             | Purpose                                                                         |
| ---------------- | ------------------------------------------------------------------------------- |
| `run.py`         | Main entry point — orchestrates servers, Docker, endpoint setup, and benchmark  |
| `benchmark.py`   | Async HTTP benchmark client (standalone or imported by `run.py`)                |
| `fake_server.py` | Fake OpenAI-compatible server for controlled latency simulation                 |
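To make the controlled-latency idea concrete: a fixed-delay OpenAI-compatible endpoint needs little more than a sleep before a canned chat-completions response. The sketch below illustrates the technique and is not the actual `fake_server.py`; the route and response fields simply follow the standard OpenAI chat-completions shape:

```python
# Minimal fixed-latency fake provider (illustrative, not fake_server.py itself).
import asyncio

from aiohttp import web

DELAY_MS = 50  # the real server takes this from --fake-delay-ms


async def chat_completions(request: web.Request) -> web.Response:
    await asyncio.sleep(DELAY_MS / 1000)  # simulate provider inference time
    return web.json_response({
        "id": "chatcmpl-fake",
        "object": "chat.completion",
        "model": "fake-model",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "ok"},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
    })


app = web.Application()
app.router.add_post("/v1/chat/completions", chat_completions)

if __name__ == "__main__":
    web.run_app(app, port=9137)  # matches the --fake-server-port default
```

Because the sleep is the only variable cost in such a server, any latency the benchmark reports beyond `--fake-delay-ms` is attributable to MLflow, nginx, and the client itself.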