Layer 3: Agentic Investigation
===============================

PyOD 3's ``od-expert`` skill lets any AI agent drive a full anomaly detection investigation through natural conversation. The agent handles benchmark-backed detector selection, multi-detector consensus, quality assessment, adaptive escalation, and iteration on user feedback, all without requiring the user to be an OD expert.

.. figure:: ../figs/agentic-demo.png
   :alt: PyOD 3 agentic investigation demo on a diabetes screening dataset
   :align: center
   :width: 720

   4-turn agentic conversation on a diabetes screening dataset
   (768 patients, 8 clinical features; shipped with PyOD as
   ``examples/data/pima.csv``). The dark callouts alongside the agent's
   turns show the ``od-expert`` skill's live decision-making: modality
   triage, top-10 pitfall checks, and the 11 adaptive escalation triggers.

* **Runnable script**: `agentic_example.py <https://github.com/yzhao062/pyod/blob/development/examples/agentic_example.py>`_
* **Interactive demo**: `agentic_demo.html <https://htmlpreview.github.io/?https://github.com/yzhao062/pyod/blob/development/examples/agentic_demo.html>`_ (open in a browser for the full visual walkthrough with skill decision panels)

----

What the ``od-expert`` skill encodes
------------------------------------

In v3.2.0, the skill grew from a 78-line API reference to roughly 1000 lines of expert content split across an always-loaded ``SKILL.md`` and six on-demand reference files. The content encodes:

* **A master decision tree** that routes the user's data to one of five modalities (tabular, time series, graph, text, image) based on observable properties.
* **Top-10 critical pitfalls**, always in the agent's working memory: unscaled features for distance-based detectors, contamination mismatches, deep learning on tiny data, missing optional extras like ``pyod[graph]``, raw-score reporting, single-detector runs, and so on. The agent walks each one before calling ``engine.run``.
* **11 adaptive escalation triggers** that decide when the agent proceeds autonomously vs. when it pauses to ask the user. Examples: modality ambiguity (T1), contamination uncertainty (T2), detector disagreement post-run (T3), high-stakes domain hints (T8 -- medical, fraud, safety), labels mentioned but not provided (T5). If none of the triggers fire, the agent runs end-to-end without interruption.
* **On-demand reference files** for each modality (``references/tabular.md``, ``references/time_series.md``, ``references/graph.md``, ``references/text_image.md``, plus ``references/workflow.md`` and ``references/pitfalls.md``). The agent loads them when the modality decision routes there.
* **A KB-derived detector list** for each modality, refreshed from ``pyod.utils.knowledge`` by ``scripts/regen_skill.py`` at build time. Every detector name in the skill is mechanically validated against the live KB by a CI safety net test (``pyod/test/test_skill_kb_consistency.py``) so drift fails the build.
* **An API safety net** (``pyod/test/test_skill_api_refs.py``) that walks ``ADEngine`` and ``InvestigationState`` via a live dry run and validates every ``state.X`` / ``state.X['key']`` / ``engine.X`` reference in the skill content. Added in v3.2.1 after a regression that shipped invented API names in v3.2.0.
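
Because the routing step maps one-to-one onto the reference filenames, an agent-side harness can resolve the on-demand file in a couple of lines. A minimal sketch, assuming a user-global install (the ``~/.claude/skills/od-expert/`` path from "Activation Paths" below); ``load_reference`` is a hypothetical helper, not part of the skill itself:

.. code-block:: python

    from pathlib import Path

    # User-global install location (see "Activation Paths" below).
    SKILL_DIR = Path.home() / ".claude" / "skills" / "od-expert"

    def load_reference(modality: str) -> str:
        """Read the on-demand reference file the decision tree routed to."""
        return (SKILL_DIR / "references" / f"{modality}.md").read_text()

    notes = load_reference("tabular")   # after the master tree picks 'tabular'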

See :doc:`the skill maintenance methodology guide <../skill_maintenance>` for the full pattern and for how to add a new skill.

----

How It Works
------------

When a user asks about anomalies in their data, PyOD's ``od-expert`` skill auto-activates based on intent keywords. The agent then:

1. **Walks the master decision tree** -- timestamps, graph structure, text/image, or tabular? Load the matching ``references/<modality>.md``.
2. **Walks the top-10 pitfall checklist** -- is any pitfall active for this data? Example: feature scale ratio > 100 triggers Pitfall 1 (unscaled features for distance-based detectors) and the agent recommends a pre-scaling step or flags it in the report.
3. **Walks the 11 escalation triggers** -- does anything about the request call for a pause? Example: "medical screening" fires Trigger 8 (high-stakes domain) and the agent commits to dual-detector validation and a confidence caveat (see the sketch after this list).
4. **Selects detectors** -- calls ``engine.plan(state)`` to pick the top-3 from PyOD's 61-detector catalog based on benchmark evidence (ADBench, TSB-AD, BOND). Each plan entry in ``state.plans`` has ``detector_name``, ``confidence``, ``reason``, ``evidence``.
5. **Runs in parallel** -- executes all selected detectors and builds a rank-normalized consensus in ``state.consensus``.
6. **Re-walks a subset of triggers post-run** -- detector disagreement (T3), weak quality (T4), suspiciously clean results (T10). If any fire, the agent hedges the report or iterates.
7. **Generates a report** -- Markdown or JSON, always including a "what I assumed and why" block that lists the contamination rate, the detectors used, the best detector, and any caveats the trigger/pitfall walk surfaced.
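
To make steps 2-3 concrete, here is a minimal, illustrative sketch of a trigger walk. The predicates below are hypothetical simplifications; only the trigger IDs and descriptions come from the skill content described above:

.. code-block:: python

    # Hypothetical, simplified trigger table; the real rules live in SKILL.md.
    TRIGGERS = {
        'T5': ('labels mentioned but not provided',
               lambda req: 'label' in req or 'diagnosis' in req),
        'T8': ('high-stakes domain',
               lambda req: any(h in req for h in ('medical', 'fraud', 'safety'))),
    }

    def walk_triggers(request_text):
        """Return the (id, description) pairs that fire for this request."""
        req = request_text.lower()
        return [(tid, desc) for tid, (desc, pred) in sorted(TRIGGERS.items())
                if pred(req)]

    fired = walk_triggers("Find unusual patient profiles. "
                          "This is a medical screening context.")
    # [('T8', 'high-stakes domain')] -> commit to dual-detector validation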

The agent's decisions at each of these steps are visible in the interactive demo's dark "od-expert" panels.

----

Activation Paths
----------------

PyOD 3 reaches agents through three paths. Pick whichever matches your stack:

**Claude Code / Claude Desktop / Codex**
    The ``od-expert`` skill ships as package data inside the pyod wheel.
    Two install modes are supported:

    .. code-block:: bash

        pip install pyod

        # Claude Code / Claude Desktop: user-global install
        pyod install skill              # installs to ~/.claude/skills/od-expert/

        # Codex: project-local install (Codex reads ./skills/, has no user-global dir)
        pyod install skill --project    # installs to ./skills/od-expert/

    After installing, run ``pyod info`` to confirm the skill is detected
    and to see which agent stacks are present (``pyod info`` checks both
    ``~/.claude/`` and ``~/.codex/`` and recommends the right install
    command). The legacy ``pyod-install-skill`` command from v3.0.0 still
    works as an alias for ``pyod install skill``.

**MCP-compatible agents**
    Install the optional MCP extra, then launch the server:

    .. code-block:: bash

        pip install pyod[mcp]
        pyod mcp serve              # alias for ``python -m pyod.mcp_server``

    Any MCP-compatible LLM picks up the tools based on intent.

**Python apps / custom agents**
    Import and call directly: ``from pyod.utils.ad_engine import ADEngine``. The agent calls ``engine.investigate(data)`` and follows ``state.next_action`` at each step.
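
    A minimal embedding sketch, assuming only the session API documented
    under "The Session API" below (the keys on ``state.next_action`` are
    listed there):

    .. code-block:: python

        import numpy as np
        from pyod.utils.ad_engine import ADEngine

        data = np.random.randn(500, 8)   # placeholder for the app's feature matrix

        engine = ADEngine()
        state = engine.investigate(data)

        action = state.next_action['action']
        if action == 'confirm_with_user':
            print(state.next_action['reason'])        # surface the question first
        elif action == 'iterate':
            engine.iterate(state, state.next_action.get('proposed_change',
                                                        {'action': 'rerun'}))
        print(engine.report(state, format='text'))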

----

Conversation Walkthrough
------------------------

**Dataset:** Diabetes screening dataset, 768 patients with 8 clinical features (glucose, blood pressure, skin thickness, insulin, BMI, pedigree, age, pregnancies). Shipped with PyOD at ``examples/data/pima.csv``.

Why this dataset? It exercises the skill's machinery: the feature scale ratio is ~347x (Pitfall 1 fires), and the user's "medical screening" context fires Trigger 8 (high-stakes domain). On a happy-path dataset the side callouts would be mostly empty; on this data, they catch real problems.
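
The scale-ratio figure is easy to verify yourself. A plausible reading of the check (the exact statistic the skill applies is not spelled out here) is the ratio of the largest to the smallest per-feature standard deviation:

.. code-block:: python

    import numpy as np

    data = np.genfromtxt('examples/data/pima.csv',
                         delimiter=',', skip_header=1)
    X = data[:, :-1]

    stds = X.std(axis=0)
    print(f"scale ratio: {stds.max() / stds.min():.0f}x")  # ~347x, well past the 100x rule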

**Turn 1.** User asks for screening:

.. code-block:: text

    User: I have a diabetes screening dataset
          -- 768 patients with 8 clinical features (glucose,
          blood pressure, BMI, insulin, age, etc.). Find
          unusual patient profiles that might need follow-up.
          This is a medical screening context.

    [od-expert skill: pre-run decisions]
      Master tree     -> tabular (no timestamps, no graph, no text/image)
      Pitfall #1      -> FIRES: std ratio 347x, note as report caveat
      Pitfalls #2-10  -> all pass
      Trigger T8      -> FIRES: medical screening; dual-detector validation
      Other triggers  -> not applicable
      Plan            -> KNN (0.85), IForest (0.85), LOF (0.80)

    Agent: Profiled as tabular, low-dim, n=768 x 8. Running
           KNN + IForest + LOF from the ADBench top-5 for
           low-dim small datasets. Scale mismatch noted for
           the final report.

           Results: 62 flagged (8.1%), separation 0.96,
           agreement 0.59, quality HIGH (0.79). Top case:
           patient #13. KNN strongest individually.

Behind the scenes:

.. code-block:: python

    import numpy as np
    from pyod.utils.ad_engine import ADEngine

    np.random.seed(42)
    data = np.genfromtxt('examples/data/pima.csv',
                         delimiter=',', skip_header=1)
    X = data[:, :-1]

    engine = ADEngine()
    state = engine.investigate(X)
    # state.plans[0]['detector_name']          == 'KNN'
    # state.analysis['best_detector']          == 'KNN'
    # state.analysis['consensus_analysis']['n_anomalies']    == 62
    # state.analysis['consensus_analysis']['anomaly_ratio']  == 0.081
    # state.quality['separation']              == 0.956
    # state.quality['agreement']               == 0.592
    # state.quality['stability']               == 0.814
    # state.quality['overall']                 == 0.787
    # state.quality['verdict']                 == 'high'

**Turn 2.** User asks what is driving the flags:

.. code-block:: text

    User: What clinical features are driving the flags?

    Agent: Analyzing patient #13 via KNN (best detector).
           engine.explain_findings returns absolute z-score
           magnitudes (|z|); raw values are cross-checked
           manually to show direction:
             insulin       |z|=6.65  (846 uU/mL)
             age           |z|=2.19  (59 yrs)
             glucose       |z|=2.13  (189 mg/dL)
             pregnancies   |z|=0.84  (1)
           Metabolic-syndrome profile: extreme insulin,
           elevated glucose, older age.

.. code-block:: python

    best_idx = state.analysis['best_detector_index']
    best_result = state.results[best_idx]
    explanations = engine.explain_findings(best_result, X=X, top_k=5)
    # explanations[0]['contributing_features'][0] ==
    #   {'feature': 4, 'z_score': 6.65}  # insulin (absolute magnitude)
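
The manual direction check the agent mentions is plain NumPy on the same session's ``X``. A sketch, reading "patient #13" as row index 13 (an assumption) and using feature 4 (insulin) from the explanation output above:

.. code-block:: python

    feature = 4                               # insulin, per explanations[0]
    patient = 13                              # top case from Turn 1 (assumed row index)
    raw = X[patient, feature]
    signed_z = (raw - X[:, feature].mean()) / X[:, feature].std()
    print(f"insulin: {raw:.0f} uU/mL, z = {signed_z:+.2f}")  # sign gives direction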

**Turn 3.** User reveals labels exist. Trigger 5 fires:

.. code-block:: text

    User: I should mention -- I actually have diabetes
          diagnosis labels for the full cohort.

    [od-expert skill: new turn, re-evaluating triggers]
      Trigger T5 -> FIRES: labels mentioned but the session
                    is on the unsupervised path
      Recommended switch -> drop out of ADEngine session and
                            call XGBOD directly (pyod[xgboost])

    Agent: That changes the plan. The ADEngine session API is
           unsupervised in v3.2.x. For labelled data, supervised
           detection (XGBOD) typically lifts AUC 10-20 points.
           I would switch to the classic fit / predict path.
.. code-block:: python

    # ADEngine session is unsupervised in v3.2.x; for labels,
    # use the classic fit / predict path directly:
    from pyod.models.xgbod import XGBOD

    clf = XGBOD()
    # y_labels: the diagnosis labels from Turn 3, one 0/1 per patient
    clf.fit(X, y_labels)               # supervised training
    scores = clf.decision_function(X)  # anomaly scores
    labels_pred = clf.predict(X)       # binary labels

**Turn 4.** User asks for the unsupervised report for now:

.. code-block:: python

    report = engine.report(state, format='text')
    # Includes the quality bars, the selected detectors, the
    # best detector, and an explicit "assumptions and caveats"
    # block citing the scale mismatch (Pitfall 1), the
    # observed anomaly ratio, the high-stakes caveat (Trigger
    # 8), and the label-availability note (Trigger 5).

    # Or JSON for programmatic consumption:
    report_dict = engine.report(state, format='json')

----

The Session API
---------------

The agentic workflow is built on :class:`~pyod.utils.ad_engine.ADEngine`'s session API. Each method advances the investigation and sets ``state.next_action`` to guide the agent:

========================== ========================================================
Method                     Purpose
========================== ========================================================
``start(X)``               Profile data, return ``InvestigationState``
``plan(state)``            Select top-N detectors, populate ``state.plans``
``run(state)``             Execute all detectors, fill ``state.consensus``
``analyze(state)``         Populate ``state.quality`` and ``state.analysis``
``iterate(state, fb)``     Adjust plan based on structured or NL feedback
``report(state)``          Generate markdown or JSON report
``investigate(X)``         One-shot: ``start`` + ``plan`` + ``run`` + ``analyze``
========================== ========================================================
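
The one-shot ``investigate(X)`` is the first four rows run in sequence. A minimal step-by-step session, assuming (as the table implies) that each method populates the state it is passed:

.. code-block:: python

    import numpy as np
    from pyod.utils.ad_engine import ADEngine

    X = np.random.randn(500, 8)   # placeholder feature matrix

    engine = ADEngine()
    state = engine.start(X)       # profile -> state.profile
    engine.plan(state)            # benchmark-backed selection -> state.plans
    engine.run(state)             # parallel execution -> state.consensus
    engine.analyze(state)         # quality scoring -> state.quality, state.analysis
    print(engine.report(state, format='text'))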

Key ``state`` fields the agent reads:

* ``state.profile`` -- dict with ``data_type``, ``n_samples``, ``n_features``, ``has_nan``, ``dtype``, ``dimensionality_class``.
* ``state.plans`` -- list of plan dicts with ``detector_name``, ``confidence``, ``reason``, ``evidence``, ``alternatives``.
* ``state.consensus`` -- dict with ``scores``, ``labels``, ``n_detectors``, ``agreement``.
* ``state.quality`` -- dict with ``separation``, ``agreement``, ``stability``, ``overall``, ``verdict``, ``explanation``.
* ``state.analysis`` -- dict with ``consensus_analysis`` (containing ``n_anomalies``, ``anomaly_ratio``, ``top_anomalies``, etc.), ``best_detector``, ``best_detector_index``, ``per_detector_analysis``.
* ``state.next_action`` -- dict with ``action`` in {``report_to_user``, ``iterate``, ``confirm_with_user``}, plus ``reason``, ``summary``, and sometimes ``suggestion`` / ``proposed_change``.

Feedback to ``iterate(state, fb)`` can be structured (dict) or natural language (string):

.. code-block:: python

    # Structured (executes immediately)
    engine.iterate(state, {"action": "adjust_contamination", "value": 0.05})
    engine.iterate(state, {"action": "exclude", "detectors": ["IForest"]})
    engine.iterate(state, {"action": "include", "detectors": ["ECOD"]})
    engine.iterate(state, {"action": "rerun"})

    # Natural language (parsed to action, may need confirmation)
    engine.iterate(state, "too many false positives")
    engine.iterate(state, "try without IForest")

----

Why This Is Different
---------------------

Without PyOD 3, an AI agent wrapping a library like scikit-learn would:

1. Pick one detector (probably the wrong one)
2. Run it once with default parameters
3. Return raw scores without quality assessment
4. Rely on the LLM to interpret results from first principles

With PyOD 3 and the v3.2.0 ``od-expert`` skill, the agent:

1. Walks a master decision tree to pick the right modality and detector family.
2. Walks a top-10 pitfall checklist before running, catching problems like scale mismatch or DL on tiny data.
3. Walks 11 adaptive escalation triggers to decide when to pause and ask vs. run autonomously.
4. Selects detectors via benchmark-backed routing (ADBench, TSB-AD, BOND).
5. Runs top-3 in parallel and builds rank-normalized consensus.
6. Re-checks quality-related triggers post-run and hedges the report accordingly.
7. Always reports the assumptions and caveats, including the scale mismatch, contamination, and any triggered escalations.

The agent becomes an OD expert through the library, not despite it.