Layer 3: Agentic Investigation
===============================

PyOD 3's ``od-expert`` skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality assessment, adaptive escalation, and iteration on user feedback, all without requiring the user to be an OD expert.

.. figure:: ../figs/agentic-demo.png
   :alt: PyOD 3 agentic investigation demo on a diabetes screening dataset
   :align: center
   :width: 720

   4-turn agentic conversation on a diabetes screening dataset
   (768 patients, 8 clinical features; shipped with PyOD as
   ``examples/data/pima.csv``). The dark callouts alongside the agent's
   turns show the ``od-expert`` skill's live decision-making: modality
   triage, top-10 pitfall checks, and the 11 adaptive escalation triggers.

* **Runnable script**: `agentic_example.py <https://github.com/yzhao062/pyod/blob/development/examples/agentic_example.py>`_
* **Interactive demo**: `agentic_demo.html <https://htmlpreview.github.io/?https://github.com/yzhao062/pyod/blob/development/examples/agentic_demo.html>`_ (open in a browser for the full visual walkthrough with skill decision panels)

----

What the ``od-expert`` skill encodes
------------------------------------

In v3.2.0, the skill grew from a 78-line API reference to roughly 1000 lines of expert content split across an always-loaded ``SKILL.md`` and six on-demand reference files. The content encodes:

* **A master decision tree** that routes the user's data to one of five modalities (tabular, time series, graph, text, image) based on observable properties.
* **Top-10 critical pitfalls**, always in the agent's working memory: unscaled features for distance-based detectors, contamination mismatches, deep learning on tiny data, missing optional extras like ``pyod[graph]``, raw-score reporting, single-detector runs, and so on. The agent walks each one before calling ``engine.run``.
* **11 adaptive escalation triggers** that decide when the agent proceeds autonomously vs. when it pauses to ask the user. Examples: modality ambiguity (T1), contamination uncertainty (T2), detector disagreement post-run (T3), labels mentioned but not provided (T5), high-stakes domain hints (T8 -- medical, fraud, safety). If none of the triggers fire, the agent runs end-to-end without interruption.
* **On-demand reference files** for each modality (``references/tabular.md``, ``references/time_series.md``, ``references/graph.md``, ``references/text_image.md``, plus ``references/workflow.md`` and ``references/pitfalls.md``). The agent loads them when the modality decision routes there.
* **A KB-derived detector list** for each modality, refreshed from ``pyod.utils.knowledge`` by ``scripts/regen_skill.py`` at build time. Every detector name in the skill is mechanically validated against the live KB by a CI safety-net test (``pyod/test/test_skill_kb_consistency.py``), so drift fails the build; the pattern is sketched below.
* **An API safety net** (``pyod/test/test_skill_api_refs.py``) that walks ``ADEngine`` and ``InvestigationState`` via a live dry run and validates every ``state.X`` / ``state.X['key']`` / ``engine.X`` reference in the skill content. Added in v3.2.1 after a regression that shipped invented API names in v3.2.0.
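
The drift check follows a simple pattern: extract every detector-like literal from the generated skill text, then assert that each one exists in the live KB. Below is a minimal sketch, not the real ``pyod/test/test_skill_kb_consistency.py``: the hardcoded ``KNOWN_DETECTORS`` set stands in for whatever ``pyod.utils.knowledge`` actually exposes, and the real test is more careful about which literals count as detector names.

.. code-block:: python

   # Illustrative sketch of the CI drift check -- NOT the real
   # pyod/test/test_skill_kb_consistency.py. KNOWN_DETECTORS stands in
   # for the detector names the live KB (pyod.utils.knowledge) provides.
   import re
   from pathlib import Path

   KNOWN_DETECTORS = {"KNN", "IForest", "LOF", "ECOD", "XGBOD"}  # illustrative subset

   def test_skill_detector_names_exist():
       skill_text = Path("skills/od-expert/SKILL.md").read_text()
       # Assumes detector names appear as ``Name`` literals starting with an
       # uppercase letter; the real test filters non-detector literals too.
       mentioned = set(re.findall(r"``([A-Z][A-Za-z0-9]*)``", skill_text))
       unknown = mentioned - KNOWN_DETECTORS
       assert not unknown, f"skill references names missing from the KB: {unknown}"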

See :doc:`the skill maintenance methodology guide <../skill_maintenance>` for the full pattern and for how to add a new skill.

----

How It Works
------------

When a user asks about anomalies in their data, PyOD's ``od-expert`` skill auto-activates based on intent keywords. The agent then:

1. **Walks the master decision tree** -- timestamps, graph structure, text/image, or tabular? Load the matching ``references/<modality>.md``.
2. **Walks the top-10 pitfall checklist** -- is any pitfall active for this data? Example: a feature scale ratio > 100 triggers Pitfall 1 (unscaled features for distance-based detectors), and the agent recommends a pre-scaling step or flags it in the report.
3. **Walks the 11 escalation triggers** -- does anything about the request call for a pause? Example: "medical screening" fires Trigger 8 (high-stakes domain), and the agent commits to dual-detector validation and a confidence caveat.
4. **Selects detectors** -- calls ``engine.plan(state)`` to pick the top-3 from PyOD's 61-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). Each plan entry in ``state.plans`` has ``detector_name``, ``confidence``, ``reason``, ``evidence``.
5. **Runs in parallel** -- executes all selected detectors and builds a rank-normalized consensus in ``state.consensus``.
6. **Re-walks a subset of triggers post-run** -- detector disagreement (T3), weak quality (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates.
7. **Generates a report** -- Markdown or JSON, always including a "what I assumed and why" block that lists the contamination rate, the detectors used, the best detector, and any caveats the trigger/pitfall walk surfaced.

The agent's decisions at each of these steps are visible in the interactive demo's dark "od-expert" panels.

----

Activation Paths
----------------

PyOD 3 reaches agents through three paths. Pick whichever matches your stack:

**Claude Code / Claude Desktop / Codex**
   The ``od-expert`` skill ships as package data inside the pyod wheel.
   Two install modes are supported:

   .. code-block:: bash

      pip install pyod

      # Claude Code / Claude Desktop: user-global install
      pyod install skill            # installs to ~/.claude/skills/od-expert/

      # Codex: project-local install (Codex reads ./skills/, has no user-global dir)
      pyod install skill --project  # installs to ./skills/od-expert/

   After installing, run ``pyod info`` to confirm the skill is detected
   and to see which agent stacks are present (``pyod info`` checks both
   ``~/.claude/`` and ``~/.codex/`` and recommends the right install
   command). The legacy ``pyod-install-skill`` command from v3.0.0 still
   works as an alias for ``pyod install skill``.

**MCP-compatible agents**
   Install the optional MCP extra, then launch the server:

   .. code-block:: bash

      pip install pyod[mcp]
      pyod mcp serve   # alias for: python -m pyod.mcp_server

   Any MCP-compatible LLM picks up the tools based on intent.

**Python apps / custom agents**
   Import and call directly: ``from pyod.utils.ad_engine import ADEngine``. The agent calls ``engine.investigate(data)`` and follows ``state.next_action`` at each step. A minimal driver loop is sketched below.
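
The sketch below shows what that driver loop can look like. The session calls (``investigate``, ``iterate``, ``report``) and the ``state.next_action`` fields are the API documented in this page; the dispatch logic around them, including feeding ``proposed_change`` back into ``iterate``, is illustrative.

.. code-block:: python

   # Sketch of a custom-agent driver loop; the session calls are the
   # documented API, the dispatch logic around them is illustrative.
   import numpy as np
   from pyod.utils.ad_engine import ADEngine

   X = np.random.randn(500, 8)  # stand-in for your data

   engine = ADEngine()
   state = engine.investigate(X)  # one-shot: start + plan + run + analyze

   for _ in range(5):  # bound the number of refinement rounds
       action = state.next_action['action']
       if action == 'report_to_user':
           print(engine.report(state, format='text'))
           break
       if action == 'confirm_with_user':
           # A real agent would relay reason/summary and wait for the user.
           print(state.next_action['reason'], state.next_action['summary'])
           break
       # action == 'iterate': feed the engine's own proposal back in
       # (assumed shape; see the structured-feedback examples below).
       engine.iterate(state, state.next_action.get('proposed_change',
                                                   {"action": "rerun"}))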

----

Conversation Walkthrough
------------------------

**Dataset:** Diabetes screening dataset, 768 patients with 8 clinical features (glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age, pregnancies). Shipped with PyOD at ``examples/data/pima.csv``.

Why this dataset? It exercises the skill's machinery: the feature scale ratio is ~347x (Pitfall 1 fires), and the user's "medical screening" context fires Trigger 8 (high-stakes domain). On a happy-path dataset the side callouts would be mostly empty; on this data, they catch real problems.

**Turn 1.** User asks for screening:

.. code-block:: text

   User: I have a diabetes screening dataset -- 768
   patients with 8 clinical features (glucose,
   blood pressure, BMI, insulin, age, etc.). Find
   unusual patient profiles that might need follow-up.
   This is a medical screening context.

   [od-expert skill: pre-run decisions]
     Master tree    -> tabular (no timestamps, no graph, no text/image)
     Pitfall #1     -> FIRES: std ratio 347x, note as report caveat
     Pitfalls #2-10 -> all pass
     Trigger T8     -> FIRES: medical screening; dual-detector validation
     Other triggers -> not applicable
     Plan           -> KNN (0.85), IForest (0.85), LOF (0.80)

   Agent: Profiled as tabular, low-dim, n=768 x 8. Running
   KNN + IForest + LOF from the ADBench top-5 for
   low-dim small datasets. Scale mismatch noted for
   the final report.

   Results: 62 flagged (8.1%), separation 0.96,
   agreement 0.59, quality HIGH (0.79). Top case:
   patient #13. KNN strongest individually.

Behind the scenes:

.. code-block:: python

   import numpy as np
   from pyod.utils.ad_engine import ADEngine

   np.random.seed(42)
   data = np.genfromtxt('examples/data/pima.csv',
                        delimiter=',', skip_header=1)
   X = data[:, :-1]

   engine = ADEngine()
   state = engine.investigate(X)
   # state.plans[0]['detector_name'] == 'KNN'
   # state.analysis['best_detector'] == 'KNN'
   # state.analysis['consensus_analysis']['n_anomalies'] == 62
   # state.analysis['consensus_analysis']['anomaly_ratio'] == 0.081
   # state.quality['separation'] == 0.956
   # state.quality['agreement'] == 0.592
   # state.quality['stability'] == 0.814
   # state.quality['overall'] == 0.787
   # state.quality['verdict'] == 'high'

**Turn 2.** User asks what is driving the flags:

.. code-block:: text

   User: What clinical features are driving the flags?

   Agent: Analyzing patient #13 via KNN (best detector).
   engine.explain_findings returns absolute z-score
   magnitudes (|z|); raw values are cross-checked
   manually to show direction:
     insulin      |z|=6.65  (846 uU/mL)
     age          |z|=2.19  (59 yrs)
     glucose      |z|=2.13  (189 mg/dL)
     pregnancies  |z|=0.84  (1)
   Metabolic-syndrome profile: extreme insulin,
   elevated glucose, older age.

.. code-block:: python

   best_idx = state.analysis['best_detector_index']
   best_result = state.results[best_idx]
   explanations = engine.explain_findings(best_result, X=X, top_k=5)
   # explanations[0]['contributing_features'][0] ==
   #   {'feature': 4, 'z_score': 6.65}  # insulin (absolute magnitude)

**Turn 3.** User reveals labels exist. Trigger 5 fires:

.. code-block:: text

   User: I should mention -- I actually have diabetes
   diagnosis labels for the full cohort.

   [od-expert skill: new turn, re-evaluating triggers]
     Trigger T5         -> FIRES: labels mentioned but the session
                           is on the unsupervised path
     Recommended switch -> drop out of the ADEngine session and
                           call XGBOD directly (pyod[xgboost])

   Agent: That changes the plan. The ADEngine session API is
   unsupervised in v3.2.x. For labelled data, supervised
   detection (XGBOD) typically lifts AUC 10-20 points.
   I would switch to the classic fit / predict path.

.. code-block:: python

   # The ADEngine session is unsupervised in v3.2.x; for labels,
   # use the classic fit / predict path directly:
   from pyod.models.xgbod import XGBOD

   clf = XGBOD()
   clf.fit(X, y_labels)               # supervised training; y_labels holds
                                      # the diabetes diagnosis labels
   scores = clf.decision_function(X)  # anomaly scores
   labels_pred = clf.predict(X)       # binary labels

**Turn 4.** User asks for the unsupervised report for now:

.. code-block:: python

   report = engine.report(state, format='text')
   # Includes the quality bars, the selected detectors, the
   # best detector, and an explicit "assumptions and caveats"
   # block citing the scale mismatch (Pitfall 1), the
   # observed anomaly ratio, the high-stakes caveat (Trigger
   # 8), and the label-availability note (Trigger 5).

   # Or JSON for programmatic consumption:
   report_dict = engine.report(state, format='json')

----

The Session API
---------------

The agentic workflow is built on :class:`~pyod.utils.ad_engine.ADEngine`'s session API. Each method advances the investigation and sets ``state.next_action`` to guide the agent:

========================== ========================================================
Method                     Purpose
========================== ========================================================
``start(X)``               Profile data, return ``InvestigationState``
``plan(state)``            Select top-N detectors, populate ``state.plans``
``run(state)``             Execute all detectors, fill ``state.consensus``
``analyze(state)``         Populate ``state.quality`` and ``state.analysis``
``iterate(state, fb)``     Adjust plan based on structured or NL feedback
``report(state)``          Generate Markdown or JSON report
``investigate(X)``         One-shot: ``start`` + ``plan`` + ``run`` + ``analyze``
========================== ========================================================

Key ``state`` fields the agent reads:

* ``state.profile`` -- dict with ``data_type``, ``n_samples``, ``n_features``, ``has_nan``, ``dtype``, ``dimensionality_class``.
* ``state.plans`` -- list of plan dicts with ``detector_name``, ``confidence``, ``reason``, ``evidence``, ``alternatives``.
* ``state.consensus`` -- dict with ``scores``, ``labels``, ``n_detectors``, ``agreement``.
* ``state.quality`` -- dict with ``separation``, ``agreement``, ``stability``, ``overall``, ``verdict``, ``explanation``.
* ``state.analysis`` -- dict with ``consensus_analysis`` (containing ``n_anomalies``, ``anomaly_ratio``, ``top_anomalies``, etc.), ``best_detector``, ``best_detector_index``, ``per_detector_analysis``.
* ``state.next_action`` -- dict with ``action`` in {``report_to_user``, ``iterate``, ``confirm_with_user``}, plus ``reason``, ``summary``, and sometimes ``suggestion`` / ``proposed_change``.
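
The one-shot ``investigate`` used throughout the walkthrough is shorthand for the first four methods run in sequence; stepping through them individually gives an agent a checkpoint after each phase. Here is a sketch using only the methods and fields listed above -- it assumes, consistent with the ``iterate`` examples below, that each call updates ``state`` in place.

.. code-block:: python

   # Step-by-step equivalent of engine.investigate(X), with a checkpoint
   # after each phase. Assumes in-place state updates, matching the
   # iterate() examples below.
   import numpy as np
   from pyod.utils.ad_engine import ADEngine

   X = np.random.randn(500, 8)  # stand-in for your data
   engine = ADEngine()

   state = engine.start(X)      # profile the data
   print(state.profile['data_type'], state.profile['n_samples'])

   engine.plan(state)           # benchmark-backed detector selection
   for p in state.plans:
       print(p['detector_name'], p['confidence'], p['reason'])

   engine.run(state)            # parallel execution + rank-normalized consensus
   print(state.consensus['n_detectors'], state.consensus['agreement'])

   engine.analyze(state)        # quality scoring + per-detector analysis
   print(state.quality['verdict'], state.analysis['best_detector'])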

Feedback to ``iterate(state, fb)`` can be structured (dict) or natural language (string):

.. code-block:: python

   # Structured (executes immediately)
   engine.iterate(state, {"action": "adjust_contamination", "value": 0.05})
   engine.iterate(state, {"action": "exclude", "detectors": ["IForest"]})
   engine.iterate(state, {"action": "include", "detectors": ["ECOD"]})
   engine.iterate(state, {"action": "rerun"})

   # Natural language (parsed to an action, may need confirmation)
   engine.iterate(state, "too many false positives")
   engine.iterate(state, "try without IForest")

----

Why This Is Different
---------------------

Without PyOD 3, an AI agent wrapping a library like scikit-learn would:

1. Pick one detector (probably the wrong one)
2. Run it once with default parameters
3. Return raw scores without quality assessment
4. Rely on the LLM to interpret results from first principles

With PyOD 3 and the v3.2.0 ``od-expert`` skill, the agent:

1. Walks a master decision tree to pick the right modality and detector family.
2. Walks a top-10 pitfall checklist before running, catching problems like scale mismatch or DL on tiny data.
3. Walks 11 adaptive escalation triggers to decide when to pause and ask vs. run autonomously.
4. Selects detectors via benchmark-backed routing (ADBench, TSB-AD, BOND).
5. Runs the top-3 in parallel and builds a rank-normalized consensus.
6. Re-checks quality-related triggers post-run and hedges the report accordingly.
7. Always reports the assumptions and caveats, including the scale mismatch, contamination, and any triggered escalations.

The agent becomes an OD expert through the library, not despite it.