# ARGUS-AI

**Production-Grade LLM Observability in 3 Lines of Code**

[PyPI](https://pypi.org/project/argus-ai/)
[Python](https://python.org)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[CI](https://github.com/anilatambharii/argus-ai/actions)

ARGUS-AI is the **G-ARVIS scoring engine** for monitoring LLM application quality in production. It evaluates every LLM response across six orthogonal dimensions: **G**roundedness, **A**ccuracy, **R**eliability, **V**ariance, **I**nference Cost, and **S**afety.

Your LLM app is degrading right now. You just can't see it yet.

```python
import argus_ai

argus = argus_ai.init()
result = argus.evaluate(prompt=prompt, response=response, context=context)
```

That's it. Every LLM call now has a quality score.

---

## Why ARGUS

LLM outputs degrade silently. Models update, prompts drift, context windows overflow, and costs creep. Traditional monitoring catches latency and errors. It does not catch a model that starts hallucinating 12% more after a provider update, or a prompt that silently loses grounding when context exceeds 80K tokens.

G-ARVIS catches it.

| Dimension | What It Measures | Why It Matters |
|-----------|------------------|----------------|
| **Groundedness** | Is the response grounded in the provided context? | Hallucination detection |
| **Accuracy** | Does it match ground truth / internal consistency? | Factual correctness |
| **Reliability** | Format consistency, completeness, latency SLA | Structural quality |
| **Variance** | Output determinism and confidence signals | Consistency across runs |
| **Inference Cost** | Token efficiency, cost-per-word, latency-to-value | Budget control |
| **Safety** | PII leakage, toxicity, injection, harmful content | Compliance and trust |

Plus three **agentic evaluation metrics** for autonomous workflows:

| Metric | Formula | What It Measures |
|--------|---------|------------------|
| **ASF** (Agent Stability Factor) | completion × (1 − failure_rate) × consistency | Workflow completion reliability |
| **ERR** (Error Recovery Rate) | recovered / failed | Self-healing capability |
| **CPCS** (Cost Per Completed Step) | total_cost / completed_steps | Economic efficiency per step |

---

## Install

```bash
pip install argus-ai
```

With provider integrations:

```bash
pip install "argus-ai[anthropic]"      # Anthropic Claude wrapper
pip install "argus-ai[openai]"         # OpenAI wrapper
pip install "argus-ai[prometheus]"     # Prometheus export
pip install "argus-ai[opentelemetry]"  # OTEL export
pip install "argus-ai[all]"            # Everything
```

---

## Quick Start

### Basic Evaluation

```python
import argus_ai

argus = argus_ai.init(profile="enterprise")

result = argus.evaluate(
    prompt="What causes climate change?",
    response="Greenhouse gas emissions from fossil fuels are the primary driver.",
    context="Climate change is driven by human activities releasing greenhouse gases.",
    model_name="claude-sonnet-4",
    latency_ms=1200.0,
    input_tokens=45,
    output_tokens=30,
    cost_usd=0.002,
)

print(f"Composite: {result.garvis_composite:.3f}")  # 0.847
print(f"Passing: {result.passing}")                 # True
print(f"Safety: {result.safety:.3f}")               # 0.950
```

### Agentic Workflow Evaluation

```python
import argus_ai
from argus_ai.types import AgenticEvalRequest

argus = argus_ai.init(profile="agentic")

workflow = AgenticEvalRequest(
    prompt="Research competitors and generate report",
    response="Report generated with 5 competitor analyses.",
    steps_planned=8,
    steps_completed=7,
    steps_failed=2,
    steps_recovered=1,
    retries=3,
    total_cost_usd=0.45,
)

result, agentic_metrics = argus.evaluate_agentic(workflow)

for m in agentic_metrics:
    print(f"{m.name}: {m.score:.3f}")
# AgentStabilityFactor: 0.612
# ErrorRecoveryRate: 0.600
# CostPerCompletedStep: 0.357
```

### Drop-In Provider Wrappers

**Anthropic Claude:**

```python
import argus_ai
from argus_ai.integrations.anthropic import InstrumentedAnthropic

argus = argus_ai.init()
client = InstrumentedAnthropic(argus=argus)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers"}],
)

# G-ARVIS score automatically attached
print(response._argus_score.garvis_composite)
```

**OpenAI:**

```python
import argus_ai
from argus_ai.integrations.openai import InstrumentedOpenAI

client = InstrumentedOpenAI(argus=argus_ai.init())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers"}],
)

print(response._argus_score.garvis_composite)
```

### Threshold Monitoring with Alerting

```python
import argus_ai
from argus_ai.monitoring.thresholds import ThresholdConfig
from argus_ai.monitoring.alerts import AlertRule, AlertSeverity

config = ThresholdConfig(
    composite_min=0.80,
    safety_min=0.90,
    window_size=100,
    breach_ratio=0.15,
)

rules = [
    AlertRule(
        dimension="safety",
        threshold=0.85,
        severity=AlertSeverity.CRITICAL,
        message="Safety below critical threshold",
    ),
]

argus = argus_ai.init(
    profile="healthcare",
    thresholds=config,
    alert_rules=rules,
    exporters=["console", "prometheus"],
    on_alert=lambda msg, result: pagerduty.trigger(msg),
)
```

### Decorator Instrumentation

```python
import argus_ai
from argus_ai.sdk.decorators import argus_evaluate

argus = argus_ai.init()

@argus_evaluate(argus, model_name="claude-sonnet-4")
def generate(prompt: str) -> str:
    return anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
```

---

## Weight Profiles

G-ARVIS weights are configurable per deployment scenario:

| Profile | G | A | R | V | I | S | Best For |
|---------|------|------|------|------|------|------|----------|
| `enterprise` | 0.20 | 0.20 | 0.15 | 0.15 | 0.10 | 0.20 | General production |
| `healthcare` | 0.25 | 0.25 | 0.15 | 0.10 | 0.05 | 0.20 | HIPAA workloads |
| `finance` | 0.20 | 0.25 | 0.20 | 0.10 | 0.05 | 0.20 | SOX/GAAP compliance |
| `consumer` | 0.15 | 0.15 | 0.20 | 0.15 | 0.20 | 0.15 | Cost-sensitive apps |
| `agentic` | 0.15 | 0.15 | 0.25 | 0.20 | 0.10 | 0.15 | Autonomous agents |

Custom weights:

```python
import argus_ai
from argus_ai.scoring.garvis import GarvisWeights

argus = argus_ai.init(weights=GarvisWeights(safety=0.40, accuracy=0.30))
```

---

## Metrics Export

### Prometheus

```python
argus = argus_ai.init(exporters=["prometheus"])
```

Exposes: `argus_garvis_composite`, `argus_garvis_{dimension}`, `argus_evaluation_duration_ms`, `argus_alerts_total`

### OpenTelemetry

```python
argus = argus_ai.init(exporters=["opentelemetry"])
```

Compatible with Datadog, New Relic, Honeycomb, Grafana Cloud, and any OTLP backend.
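As a sanity check on the gauges above, `argus_garvis_composite` is presumably the profile-weighted average of the per-dimension `argus_garvis_{dimension}` values (each row of the Weight Profiles table sums to 1.0). The sketch below assumes a plain weighted sum and uses hypothetical dimension scores; the actual aggregation in `argus_ai.scoring.garvis` may clamp or normalize differently:

```python
# Illustrative only: derive a composite from per-dimension scores and a
# weight profile, assuming a simple weighted sum. Weights are the
# `enterprise` row of the Weight Profiles table; scores are made up.
ENTERPRISE_WEIGHTS = {
    "groundedness": 0.20, "accuracy": 0.20, "reliability": 0.15,
    "variance": 0.15, "inference_cost": 0.10, "safety": 0.20,
}

def composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of dimension scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(scores[dim] * w for dim, w in weights.items())

scores = {
    "groundedness": 0.90, "accuracy": 0.85, "reliability": 0.80,
    "variance": 0.75, "inference_cost": 0.70, "safety": 0.95,
}
print(round(composite(scores, ENTERPRISE_WEIGHTS), 4))  # 0.8425
```

Because the weights sum to 1, a profile change (say, `healthcare`'s heavier Groundedness/Accuracy) shifts the composite without rescaling it, so alert thresholds stay comparable across profiles.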
---

## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for the full system design and the open-core split.

```
argus-ai (Open Source)           ARGUS Platform (Commercial)
├── G-ARVIS Scorer               ├── Autonomous Correction Loop
├── 3-Line SDK                   ├── Prompt Optimizer
├── Threshold Monitor            ├── LLM-as-Judge Evaluation
├── ASF/ERR/CPCS Metrics         ├── Multi-Run Variance Analysis
├── Prometheus/OTEL Export       ├── Dashboard UI
└── Anthropic/OpenAI Wrappers    └── SOC2/HIPAA Compliance
```

---

## Performance

G-ARVIS heuristic scoring runs in **under 5 ms** per evaluation, with no external service calls at runtime.

| Benchmark | Value |
|-----------|-------|
| Single evaluation | < 3 ms |
| Batch (1,000 requests) | < 2.5 s |
| Memory overhead | < 5 MB |
| Core dependencies | 3 (pydantic, numpy, structlog) |

---

## Roadmap

- [ ] LiteLLM integration
- [ ] LangChain callback handler
- [ ] Grafana dashboard templates
- [ ] Custom scorer plugin API
- [ ] Async evaluation support
- [ ] CLI tool for offline batch scoring

---

## Contributing

Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

```bash
git clone https://github.com/anilatambharii/argus-ai.git
cd argus-ai
pip install -e ".[dev]"
pytest tests/ -v
```

---

## About

Built by [Anil Prasad](https://anilsprasad.com) at [Ambharii Labs](https://ambharii.com).

The G-ARVIS framework was published in ["Field Notes: Production AI"](https://www.linkedin.com/newsletters/field-notes-production-ai) on LinkedIn.

ARGUS: Autonomous Runtime Guardian for Unified Systems.

---

## License

Apache 2.0. See [LICENSE](LICENSE).