# ARGUS-AI

**Production-Grade LLM Observability in 3 Lines of Code**

[![PyPI version](https://img.shields.io/pypi/v/argus-ai.svg)](https://pypi.org/project/argus-ai/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://python.org)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![CI](https://github.com/anilatambharii/argus-ai/actions/workflows/ci.yml/badge.svg)](https://github.com/anilatambharii/argus-ai/actions)

ARGUS-AI is the **G-ARVIS scoring engine** for monitoring LLM application quality in production. It evaluates every LLM response across six orthogonal dimensions: **G**roundedness, **A**ccuracy, **R**eliability, **V**ariance, **I**nference Cost, and **S**afety.

Your LLM app is degrading right now. You just can't see it yet.

```python
import argus_ai

argus = argus_ai.init()
result = argus.evaluate(prompt=prompt, response=response, context=context)
```

That's it. Every LLM call now has a quality score.

---

## Why ARGUS

LLM outputs degrade silently. Models update, prompts drift, context windows overflow, and costs creep. Traditional monitoring catches latency and errors. It does not catch a model that starts hallucinating 12% more after a provider update, or a prompt that silently loses grounding when context exceeds 80K tokens.

G-ARVIS catches it.

| Dimension | What It Measures | Why It Matters |
|-----------|-----------------|----------------|
| **Groundedness** | Is the response grounded in provided context? | Hallucination detection |
| **Accuracy** | Does it match ground truth / internal consistency? | Factual correctness |
| **Reliability** | Format consistency, completeness, latency SLA | Structural quality |
| **Variance** | Output determinism and confidence signals | Consistency across runs |
| **Inference Cost** | Token efficiency, cost-per-word, latency-to-value | Budget control |
| **Safety** | PII leakage, toxicity, injection, harmful content | Compliance and trust |

Plus three **agentic evaluation metrics** for autonomous workflows:

| Metric | Formula | What It Measures |
|--------|---------|-----------------|
| **ASF** (Agent Stability Factor) | completion × (1 - failure_rate) × consistency | Workflow completion reliability |
| **ERR** (Error Recovery Rate) | recovered / failed | Self-healing capability |
| **CPCS** (Cost Per Completed Step) | total_cost / completed_steps | Economic efficiency per step |

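With hypothetical numbers (8 steps planned, 7 completed, 2 failed, 1 recovered, $0.45 total spend), reading `completion` and `failure_rate` relative to steps planned and assuming a consistency signal of 0.933, the arithmetic works out to:

```python
completion = 7 / 8                      # 0.875
asf = completion * (1 - 2 / 8) * 0.933  # ≈ 0.612
err = 1 / 2                             # 0.500 (1 recovered of 2 failed)
cpcs = 0.45 / 7                         # ≈ $0.064 per completed step
```
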
---

## Install

```bash
pip install argus-ai
```

With provider integrations:

```bash
pip install argus-ai[anthropic]    # Anthropic Claude wrapper
pip install argus-ai[openai]       # OpenAI wrapper
pip install argus-ai[prometheus]   # Prometheus export
pip install argus-ai[opentelemetry] # OTEL export
pip install argus-ai[all]          # Everything
```

---

## Quick Start

### Basic Evaluation

```python
import argus_ai

argus = argus_ai.init(profile="enterprise")

result = argus.evaluate(
    prompt="What causes climate change?",
    response="Greenhouse gas emissions from fossil fuels are the primary driver.",
    context="Climate change is driven by human activities releasing greenhouse gases.",
    model_name="claude-sonnet-4",
    latency_ms=1200.0,
    input_tokens=45,
    output_tokens=30,
    cost_usd=0.002,
)

print(f"Composite: {result.garvis_composite:.3f}")  # 0.847
print(f"Passing:   {result.passing}")                # True
print(f"Safety:    {result.safety:.3f}")             # 0.950
```

### Agentic Workflow Evaluation

```python
import argus_ai
from argus_ai.types import AgenticEvalRequest

argus = argus_ai.init(profile="agentic")

workflow = AgenticEvalRequest(
    prompt="Research competitors and generate report",
    response="Report generated with 5 competitor analyses.",
    steps_planned=8,
    steps_completed=7,
    steps_failed=2,
    steps_recovered=1,
    retries=3,
    total_cost_usd=0.45,
)

result, agentic_metrics = argus.evaluate_agentic(workflow)

for m in agentic_metrics:
    print(f"{m.name}: {m.score:.3f}")
# AgentStabilityFactor: 0.612
# ErrorRecoveryRate: 0.500
# CostPerCompletedStep: 0.064
```

### Drop-In Provider Wrappers

**Anthropic Claude:**

```python
import argus_ai
from argus_ai.integrations.anthropic import InstrumentedAnthropic

argus = argus_ai.init()
client = InstrumentedAnthropic(argus=argus)

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain transformers"}],
)

# G-ARVIS score automatically attached
print(response._argus_score.garvis_composite)
```

**OpenAI:**

```python
import argus_ai
from argus_ai.integrations.openai import InstrumentedOpenAI

client = InstrumentedOpenAI(argus=argus_ai.init())

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers"}],
)

print(response._argus_score.garvis_composite)
```

### Threshold Monitoring with Alerting

```python
import argus_ai
from argus_ai.monitoring.thresholds import ThresholdConfig
from argus_ai.monitoring.alerts import AlertRule, AlertSeverity

config = ThresholdConfig(
    composite_min=0.80,
    safety_min=0.90,
    window_size=100,
    breach_ratio=0.15,
)

rules = [
    AlertRule(
        dimension="safety",
        threshold=0.85,
        severity=AlertSeverity.CRITICAL,
        message="Safety below critical threshold",
    ),
]

argus = argus_ai.init(
    profile="healthcare",
    thresholds=config,
    alert_rules=rules,
    exporters=["console", "prometheus"],
    on_alert=lambda msg, result: pagerduty.trigger(msg),  # pagerduty: your paging client
)
```

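The `on_alert` callback receives the alert message and the triggering evaluation result, so breaches can be routed anywhere. A minimal sketch of a named handler that logs through structlog (already a core dependency); the handler name and logged fields are our choice, not a prescribed API:

```python
import structlog

log = structlog.get_logger()

def route_alert(msg: str, result) -> None:
    # Emit a structured record with enough context to triage the breach later.
    log.warning(
        "argus_alert",
        message=msg,
        composite=result.garvis_composite,
        safety=result.safety,
    )

argus = argus_ai.init(thresholds=config, alert_rules=rules, on_alert=route_alert)
```
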
### Decorator Instrumentation

```python
import argus_ai
from argus_ai.sdk.decorators import argus_evaluate

argus = argus_ai.init()

@argus_evaluate(argus, model_name="claude-sonnet-4")
def generate(prompt: str) -> str:
    # anthropic_client is your existing Anthropic client instance.
    return anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
```

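Every call to `generate()` now runs through the evaluator automatically; nothing changes at the call site.
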
---

## Weight Profiles

G-ARVIS weights are configurable per deployment scenario:

| Profile | G | A | R | V | I | S | Best For |
|---------|---|---|---|---|---|---|----------|
| `enterprise` | 0.20 | 0.20 | 0.15 | 0.15 | 0.10 | 0.20 | General production |
| `healthcare` | 0.25 | 0.25 | 0.15 | 0.10 | 0.05 | 0.20 | HIPAA workloads |
| `finance` | 0.20 | 0.25 | 0.20 | 0.10 | 0.05 | 0.20 | SOX/GAAP compliance |
| `consumer` | 0.15 | 0.15 | 0.20 | 0.15 | 0.20 | 0.15 | Cost-sensitive apps |
| `agentic` | 0.15 | 0.15 | 0.25 | 0.20 | 0.10 | 0.15 | Autonomous agents |

Custom weights:

```python
import argus_ai
from argus_ai.scoring.garvis import GarvisWeights

argus = argus_ai.init(weights=GarvisWeights(safety=0.40, accuracy=0.30))
```

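The composite is a weighted combination of the six dimension scores. As a back-of-the-envelope check against the `enterprise` row above, assuming a plain weighted sum (our arithmetic, not library code):

```python
# Hypothetical dimension scores for a single evaluation.
scores  = {"G": 0.90, "A": 0.85, "R": 0.80, "V": 0.80, "I": 0.70, "S": 0.95}
weights = {"G": 0.20, "A": 0.20, "R": 0.15, "V": 0.15, "I": 0.10, "S": 0.20}  # enterprise

composite = sum(weights[d] * scores[d] for d in scores)
print(f"{composite:.3f}")  # 0.850
```
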
---

## Metrics Export

### Prometheus

```python
argus = argus_ai.init(exporters=["prometheus"])
```

Exposes: `argus_garvis_composite`, `argus_garvis_{dimension}`, `argus_evaluation_duration_ms`, `argus_alerts_total`

### OpenTelemetry

```python
argus = argus_ai.init(exporters=["opentelemetry"])
```

Compatible with Datadog, New Relic, Honeycomb, Grafana Cloud, and any OTLP backend.

---

## Architecture

See [ARCHITECTURE.md](ARCHITECTURE.md) for the full system design and open-core split.

```
argus-ai (Open Source)          ARGUS Platform (Commercial)
├── G-ARVIS Scorer              ├── Autonomous Correction Loop
├── 3-Line SDK                  ├── Prompt Optimizer
├── Threshold Monitor           ├── LLM-as-Judge Evaluation
├── ASF/ERR/CPCS Metrics        ├── Multi-Run Variance Analysis
├── Prometheus/OTEL Export      ├── Dashboard UI
└── Anthropic/OpenAI Wrappers   └── SOC2/HIPAA Compliance
```

---

## Performance

G-ARVIS heuristic scoring runs in **sub-5ms** per evaluation and makes no external API calls at runtime.

| Benchmark | Value |
|-----------|-------|
| Single evaluation | < 3ms |
| Batch (1000 requests) | < 2.5s |
| Memory overhead | < 5MB |
| Dependencies (core) | 3 (pydantic, numpy, structlog) |

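To sanity-check the single-evaluation figure on your own hardware, here is a minimal timing loop (our sketch, not a shipped benchmark), reusing the `evaluate()` call from Quick Start:

```python
import time

import argus_ai

argus = argus_ai.init()

start = time.perf_counter()
for _ in range(1_000):
    argus.evaluate(
        prompt="What causes climate change?",
        response="Greenhouse gas emissions are the primary driver.",
        context="Climate change is driven by human activities.",
    )
elapsed_ms = (time.perf_counter() - start) * 1_000
print(f"{elapsed_ms / 1_000:.2f} ms per evaluation")
```
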
---

## Roadmap

- [ ] LiteLLM integration
- [ ] LangChain callback handler
- [ ] Grafana dashboard templates
- [ ] Custom scorer plugin API
- [ ] Async evaluation support
- [ ] CLI tool for offline batch scoring

---

## Contributing

Contributions welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

```bash
git clone https://github.com/anilatambharii/argus-ai.git
cd argus-ai
pip install -e ".[dev]"
pytest tests/ -v
```

---

## About

Built by [Anil Prasad](https://anilsprasad.com) at [Ambharii Labs](https://ambharii.com).

G-ARVIS framework published in ["Field Notes: Production AI"](https://www.linkedin.com/newsletters/field-notes-production-ai) on LinkedIn.

ARGUS: Autonomous Runtime Guardian for Unified Systems.

---

## License

Apache 2.0. See [LICENSE](LICENSE).