/ docs / superpowers / specs / 2026-04-12-v3-agentic-design.md
2026-04-12-v3-agentic-design.md
  1  # PyOD V3: Agentic Anomaly Detection Design
  2  
  3  **Date:** 2026-04-12
  4  **Status:** Draft (v6 -- Round 5 review fixes)
  5  
  6  ---
  7  
  8  ## 1. Vision
  9  
 10  PyOD V3 thesis: **any AI agent can do expert-level anomaly detection through PyOD, without knowing OD.**
 11  
 12  The goal is not multi-modal coverage (already shipped). The goal is making the anomaly detection workflow **agentic** — ADEngine guides any agent (LLM or application) through the expert OD workflow step by step, so the agent never has to know which algorithm to pick, how to interpret scores, or when to iterate.
 13  
 14  **What "agentic" means concretely:**
 15  - Understand the data modality (tabular, TS, graph, text, image)
 16  - Choose the right algorithm(s) based on benchmark evidence
 17  - Run detection, compare multiple detectors, assess result quality
 18  - Discuss results with users, explain findings
 19  - Iterate based on feedback (too many false positives, missed anomalies, try something different)
 20  - Generate reports
 21  
 22  **What it does NOT mean:**
 23  - No changes to Layer 1 (BaseDetector models, direct `fit`/`predict` API)
 24  - No mandatory dependency on LLMs or agents
 25  - No persistent state across Python sessions (stateless library)
 26  
 27  ---
 28  
 29  ## 2. Architecture: Three Layers
 30  
 31  ```
 32  Layer 3:  Skill (od-expert)               ← agent conversation layer
 33  Layer 2:  ADEngine (workflow + intelligence) ← agentic orchestration
 34  Layer 1:  BaseDetector models (fit/predict)  ← direct Python API (unchanged)
 35  ```
 36  
 37  Each layer is independently useful:
 38  - **Layer 1** users: `from pyod.models.iforest import IForest; clf.fit(X)` — no changes, no new requirements
 39  - **Layer 2** users: `engine = ADEngine(); result = engine.investigate(data)` — intelligent orchestration
 40  - **Layer 3** users: agent follows od-expert skill, which calls ADEngine — full conversational workflow
 41  
 42  Intelligence lives in **Layer 2** (portable across agents). Conversation flow lives in **Layer 3** (agent-specific). Layer 1 is untouched.
 43  
 44  ---
 45  
 46  ## 3. Current State (ADEngine Tier A + B)
 47  
 48  ADEngine today has lifecycle methods, but they are **disconnected building blocks**. An agent must know which to call, in what order, and how to interpret results:
 49  
 50  ```python
 51  profile = engine.profile_data(data)           # agent decides what to do
 52  plan = engine.plan_detection(profile)          # returns 1 detector
 53  result = engine.run_detection(data, plan)      # agent must run this
 54  analysis = engine.analyze_results(result, X=data)  # agent decides if good
 55  # ... agent parses analysis, decides whether to iterate, how, etc.
 56  ```
 57  
 58  **Problems:**
 59  1. Routes to 1 detector — experts run 2-3 and compare
 60  2. No quality assessment — agent cannot tell if results are trustworthy
 61  3. No guided iteration — `suggest_next_step` returns text, not an executable action
 62  4. No workflow enforcement — agent can call methods in any order, skip steps
 63  5. No session context — each call is stateless, agent must carry all context
 64  
 65  ---
 66  
 67  ## 4. V3 Design: Workflow Engine
 68  
 69  ### 4.1 Session-Based State Machine
 70  
 71  ADEngine manages an **investigation session** that tracks the workflow state:
 72  
 73  ```
 74  START → PROFILED → PLANNED → DETECTED → ANALYZED → [ITERATE → PLANNED | DONE]
 75  ```
 76  
 77  Each step returns the updated `InvestigationState` with a typed `next_action` that tells the agent what to do next. The agent (or skill) follows `next_action` without needing OD knowledge.
 78  
 79  **Decision (Codex finding #1):** The session execution method is named `run(state)`, not `detect(state)`. The existing `detect(X_train, ...)` one-shot method is unchanged. No naming conflict.
 80  
 81  ### 4.2 New API: Investigation Session
 82  
 83  ```python
 84  class ADEngine:
 85      # --- Existing methods (unchanged, still work independently) ---
 86      def profile_data(self, X, data_type=None): ...
 87      def plan_detection(self, profile, ...): ...
 88      def run_detection(self, X_train, plan, ...): ...
 89      def analyze_results(self, result, ...): ...
 90      # ... all existing methods preserved ...
 91  
 92      # --- V3: Session-based workflow ---
 93  
 94      def start(self, X, data_type=None):
 95          """Start an investigation session.
 96  
 97          Profiles the data and returns an InvestigationState with
 98          the profile and recommended next action.
 99  
100          Parameters
101          ----------
102          X : array-like, Data, list, or dict
103              Input data (any modality).
104          data_type : str or None
105              Explicit type override.
106  
107          Returns
108          -------
109          state : InvestigationState
110          """
111  
112      def plan(self, state, priority='balanced', constraints=None):
113          """Plan detection for the investigation.
114  
115          Selects top-N candidate detectors based on profile and
116          benchmark evidence. Updates state with detection plan.
117  
118          Parameters
119          ----------
120          state : InvestigationState
121          priority : str
122              'speed', 'accuracy', or 'balanced'.
123          constraints : dict or None
124              {'exclude_detectors': [...], 'max_detectors': int}
125  
126          Returns
127          -------
128          state : InvestigationState
129          """
130  
131      def run(self, state):
132          """Run detection with all planned detectors.
133  
134          Calls ``run_detection()`` for each plan in ``state.plans``.
135          Collects per-detector results, computes consensus scores
136          via rank normalization, and measures detector agreement.
137          If a detector errors, it is recorded in ``state.results``
138          with ``status='error'`` and excluded from consensus.
139  
140          Parameters
141          ----------
142          state : InvestigationState
143              Must be in phase ``'planned'``.
144  
145          Returns
146          -------
147          state : InvestigationState
148              Phase set to ``'detected'``.
149          """
150  
151      def analyze(self, state):
152          """Analyze detection results.
153  
154          Assesses result quality (score distribution, detector
155          agreement, contamination stability), identifies top
156          anomalies, computes feature importance, and generates
157          a human-readable summary. Updates state with analysis.
158  
159          Parameters
160          ----------
161          state : InvestigationState
162  
163          Returns
164          -------
165          state : InvestigationState
166          """
167  
168      def iterate(self, state, feedback):
169          """Iterate based on user/agent feedback.
170  
171          **Structured feedback (primary, executes immediately):**
172          - ``{"action": "adjust_contamination", "value": 0.05}``
173          - ``{"action": "exclude", "detectors": ["IForest"]}``
174          - ``{"action": "include", "detectors": ["ECOD"]}``
175          - ``{"action": "rerun"}``  (same plan, different random seed)
176  
177          **Natural-language feedback (best-effort, needs confirmation):**
178          - ``"too many false positives"``
179          - ``"try without IForest"``
180          When NL feedback is provided, the engine parses it into a
181          proposed structured action with a confidence score. If
182          confidence < 0.8, ``next_action`` is set to
183          ``'confirm_with_user'`` with the proposed change for the
184          agent to confirm before executing.
185  
186          Parameters
187          ----------
188          state : InvestigationState
189              Must be in phase ``'analyzed'``.
190          feedback : str or dict
191              Structured dict (executed immediately) or NL string
192              (parsed to proposed action, may need confirmation).
193  
194          Returns
195          -------
196          state : InvestigationState
197              Phase reset to ``'planned'`` if action is clear,
198              or kept at ``'analyzed'`` with ``next_action =
199              'confirm_with_user'`` if ambiguous.
200          """
201  
202      def report(self, state, format='text'):
203          """Generate a report for the investigation.
204  
205          Parameters
206          ----------
207          state : InvestigationState
208          format : str
209              'text' or 'json'.
210  
211          Returns
212          -------
213          report : str or dict
214          """
215  
216      def investigate(self, X, data_type=None, priority='balanced'):
217          """One-shot expert investigation (convenience).
218  
219          Runs the full workflow: start → plan → run → analyze.
220          Returns an InvestigationState ready for user review,
221          iteration, or reporting.
222  
223          Parameters
224          ----------
225          X : array-like
226              Input data.
227          data_type : str or None
228          priority : str
229  
230          Returns
231          -------
232          state : InvestigationState
233          """
234  ```
235  
236  ### 4.3 InvestigationState
237  
238  A typed dataclass with closed enums and defined schemas.
239  
240  **Phase enum (closed):**
241  
242  ```python
243  PHASES = ('profiled', 'planned', 'detected', 'analyzed')
244  ```
245  
246  **ActionType enum (closed):**
247  
248  ```python
249  ACTION_TYPES = (
250      'plan',              # engine recommends planning next
251      'run',               # engine recommends running detection
252      'analyze',           # engine recommends analyzing results
253      'report_to_user',    # results ready for user review
254      'confirm_with_user', # engine needs user confirmation before acting
255      'iterate',           # engine suggests trying a different approach
256      'done',              # investigation complete
257  )
258  ```
259  
260  **State dataclass:**
261  
262  ```python
263  @dataclass
264  class InvestigationState:
265      # --- Workflow tracking ---
266      phase: str                    # one of PHASES
267      iteration: int                # 0 = first run, increments on iterate()
268      history: list                 # list of HistoryEntry dicts (see schema)
269  
270      # --- Data context ---
271      data: object                  # reference to input data (not copied)
272      profile: dict                 # output of profile_data()
273  
274      # --- Detection ---
275      plans: list                   # list of DetectionPlan dicts (top-N)
276      results: list                 # list of DetectorResult dicts (see schema)
277      consensus: dict or None       # ConsensusResult dict (see schema)
278  
279      # --- Analysis ---
280      analysis: dict or None        # InvestigationAnalysis dict (see schema)
281      quality: dict or None         # QualityAssessment dict (see schema)
282  
283      # --- Workflow guidance ---
284      next_action: dict             # NextAction dict (see schema)
285  ```
286  
287  **Typed schemas:**
288  
289  ```python
290  # HistoryEntry: one per workflow step
291  HistoryEntry = {
292      'phase': str,           # phase after this step
293      'action': str,          # what was done ('start', 'plan', 'run', ...)
294      'iteration': int,       # iteration number
295      'timestamp': float,     # time.time()
296      'detail': str,          # human-readable summary
297  }
298  
299  # DetectorResult: one per detector in run()
300  # Superset of run_detection() output — raw result stored verbatim
301  # plus status/error fields for the session wrapper.
302  DetectorResult = {
303      'detector_name': str,
304      'status': str,          # 'success' | 'error' | 'skipped'
305      'error': str or None,   # error message if status='error'
306      # --- Fields from run_detection() (present when status='success') ---
307      'plan': dict,           # the DetectionPlan used
308      'scores_train': np.ndarray,  # (n_samples,)
309      'labels_train': np.ndarray,  # (n_samples,)
310      'threshold': float,
311      'n_anomalies': int,
312      'anomaly_ratio': float,
313      'detector': object,     # fitted BaseDetector
314      'runtime_seconds': float,
315      'score_summary': dict,  # mean, std, min, max, q25, q75
316  }
317  
318  # ConsensusResult: aggregated across successful detectors
319  ConsensusResult = {
320      'scores': np.ndarray,       # (n_samples,) rank-normalized mean
321      'labels': np.ndarray,       # (n_samples,) majority-vote labels
322      'n_detectors': int,         # number of successful detectors
323      'agreement': float,         # mean pairwise Spearman correlation [0,1]
324      'disagreements': list,      # sample indices where detectors disagree
325  }
326  
327  # ConsensusAnalysis: lightweight summary (NOT analyze_results() output)
328  ConsensusAnalysis = {
329      'n_anomalies': int,
330      'anomaly_ratio': float,
331      'score_distribution': dict,     # mean, std, min, max, median, q25, q75
332      'top_anomalies': list,          # top-k by consensus score
333      'summary': str,                 # generated narrative
334  }
335  
336  # InvestigationAnalysis: output of analyze()
337  InvestigationAnalysis = {
338      'consensus_analysis': ConsensusAnalysis,
339      'per_detector_analysis': list,  # positionally aligned with state.results;
340                                      # None for error/skipped entries,
341                                      # analyze_results() output for successful
342      'best_detector': str,           # name of best detector
343      'best_detector_index': int,     # index into state.results (always a
344                                      # successful entry)
345      'summary': str,                 # human-readable summary
346  }
347  # best_detector selection (deterministic fallback chain):
348  # 1. Highest finite Spearman correlation with consensus scores
349  # 2. If tied: highest plan confidence (from routing)
350  # 3. If still tied: fastest successful detector (lowest runtime)
351  # 4. If all correlations are NaN (constant scores): first successful detector
352  # Single-detector case: best_detector_index = that detector's index
353  
354  # QualityAssessment
355  QualityAssessment = {
356      'separation': float,    # score separation ratio [0, 1] (see 4.4)
357      'agreement': float,     # detector agreement [0, 1], N/A → 0.5
358      'stability': float,     # label stability [0, 1] (see 4.4)
359      'overall': float,       # mean(separation, agreement, stability)
360      'verdict': str,         # 'high' | 'medium' | 'low'
361      'explanation': str,     # human-readable quality summary
362  }
363  
364  # StructuredFeedback: typed actions for iterate()
365  # Each action has a closed set of required fields.
366  StructuredFeedback = {
367      'action': str,  # one of:
368      # 'adjust_contamination' — requires 'value': float
369      # 'exclude'             — requires 'detectors': list[str]
370      # 'include'             — requires 'detectors': list[str]
371      # 'rerun'               — no extra fields
372  }
373  
374  # NextAction: closed action type with typed payload
375  # 'action' and 'reason' are always required.
376  # Additional fields are per action type (R=required, O=optional):
377  #
378  # action='plan':              (no extra fields)
379  # action='run':               O 'adjustment': str  (present after iterate)
380  # action='analyze':           (no extra fields)
381  # action='report_to_user':    R 'summary': str
382  #                             R 'confidence': float
383  # action='confirm_with_user': O 'suggestion': str    (present for change confirmation)
384  #                             O 'proposed_change': StructuredFeedback (present for change confirmation)
385  #                             (when used for error/retry, only 'reason' is present)
386  # action='iterate':           R 'suggestion': str
387  # action='done':              (no extra fields)
388  NextAction = {
389      'action': str,          # one of ACTION_TYPES
390      'reason': str,          # always present
391  }
392  ```
393  
394  **Worked example: state after `plan()`**
395  
396  ```python
397  state.phase = 'planned'
398  state.iteration = 0
399  state.profile = {'data_type': 'tabular', 'n_samples': 1000, 'n_features': 20, ...}
400  state.plans = [
401      {'detector_name': 'IForest', 'params': {}, 'confidence': 0.85, ...},
402      {'detector_name': 'ECOD', 'params': {}, 'confidence': 0.8, ...},
403      {'detector_name': 'KNN', 'params': {}, 'confidence': 0.75, ...},
404  ]
405  state.results = []
406  state.consensus = None
407  state.next_action = {
408      'action': 'run',
409      'reason': 'Top 3 detectors selected: IForest (0.85), ECOD (0.80), KNN (0.75). Ready to run.',
410  }
411  ```
412  
413  **Worked example: state after `analyze()`**
414  
415  ```python
416  state.phase = 'analyzed'
417  state.iteration = 0
418  state.quality = {
419      'separation': 0.82,
420      'agreement': 0.91,
421      'stability': 0.88,
422      'overall': 0.87,
423      'verdict': 'high',
424      'explanation': 'Strong score separation, high detector agreement (Spearman 0.91), stable labels under contamination perturbation.',
425  }
426  state.next_action = {
427      'action': 'report_to_user',
428      'reason': 'High-quality results (0.87). 3 detectors agree on 95 of 100 anomaly labels.',
429      'summary': '100 anomalies detected (10%) with high confidence. Top anomalies at indices [42, 87, 156, ...].',
430      'confidence': 0.87,
431  }
432  ```
433  
434  ### 4.4 Key Behaviors
435  
436  **Multi-detector comparison (in `run()`):**
437  
438  `run(state)` wraps the existing `run_detection(X, plan)` method, called once per plan in `state.plans`.
439  
440  How it works:
441  1. For each plan in `state.plans`, call `self.run_detection(state.data, plan)`.
442  2. Collect results into `state.results` as `DetectorResult` dicts.
443  3. If a detector raises an exception, record `status='error'` with the error message and continue.
444  4. After all detectors finish, compute consensus from successful results.
445  
446  **Consensus preconditions:**
447  - All successful detectors must produce scores of the same length (`n_samples`). This is guaranteed because they all fit on `state.data`.
448  - Scores are rank-normalized per detector (rank / n_samples) before averaging, so different score scales are comparable.
449  - Labels are majority-voted across detectors.
450  
451  **Consensus computation:**
452  ```
453  rank_scores[i] = rankdata(scores_i) / n_samples    # per detector
454  consensus_scores = mean(rank_scores, axis=0)        # across detectors
455  consensus_labels = (vote_count > n_detectors / 2).astype(int)
456  ```
457  
458  **Fallback for single detector:** consensus = that detector's raw scores and labels, agreement = 0.5.
459  
460  **Fallback for all detectors erroring:** `state.results` is all errors, `state.consensus = None`, `next_action = 'confirm_with_user'` with explanation.
461  
462  **How `plan(state)` wraps `plan_detection()`:**
463  - Calls `self.plan_detection(state.profile, priority, constraints)` to get primary plan with up to 2 alternatives.
464  - Extracts primary + alternatives into `state.plans` (up to 3 detectors).
465  - v1 limit: `max_detectors` is capped at 3 because `plan_detection()` returns at most 1 primary + 2 alternatives. Higher values are silently capped. Future versions may extend `plan_detection()` to support larger candidate sets.
466  
467  **How `report(state)` wraps `generate_report()`:**
468  - Selects `best_idx = state.analysis['best_detector_index']`.
469  - Selects `best_result = state.results[best_idx]` (raw `run_detection()` output, fully compatible with `generate_report()`).
470  - Selects `best_analysis = state.analysis['per_detector_analysis'][best_idx]` (raw `analyze_results()` output for that detector).
471  - Calls `self.generate_report(best_result, best_analysis, format)` for the main report body. This is a direct wrapper with no contract mismatch — both inputs are exactly what the existing helpers produce.
472  - Prepends a session-level section with: consensus summary, detector agreement score, quality verdict, and disagreement highlights. This section is generated by `report()` itself, not by `generate_report()`.
473  - If format='json', constructs the best-detector section directly from `best_result` and `best_analysis` (bypasses `generate_report(format='json')` which returns a string). Returns a native Python dict:
474    ```python
475    {
476        'session': {
477            'consensus': state.consensus,
478            'quality': state.quality,
479            'comparison': {detector agreement, disagreements},
480        },
481        'best_detector': {
482            'name': best_result['detector_name'],
483            'scores': best_result['scores_train'].tolist(),
484            'labels': best_result['labels_train'].tolist(),
485            'threshold': best_result['threshold'],
486            'analysis': best_analysis,
487        },
488    }
489    ```
490  - If format='text', calls `self.generate_report(best_result, best_analysis, format='text')` for the main body (returns a string) and prepends the session-level section.
491  
492  **How `analyze()` constructs `state.analysis`:**
493  - `consensus_analysis` is a `ConsensusAnalysis` dict (see schema in 4.3), built directly by `analyze()` from the consensus scores and labels. It is NOT produced by calling `analyze_results()` (since consensus has no `plan` or `threshold`).
494  - `per_detector_analysis` is a list positionally aligned with `state.results`. For each entry:
495    - If `status='success'`: calls `self.analyze_results(result, X=state.data)` and stores the output. Fully compatible with `generate_report()`.
496    - If `status='error'` or `'skipped'`: stores `None`.
497  - This alignment means `state.analysis['per_detector_analysis'][i]` always corresponds to `state.results[i]`, regardless of error/skip entries.
498  - **All-detectors-error path:** If every detector in `state.results` has `status='error'`, then `state.analysis = None`. In this case, `state.quality` is set to all-zeros with verdict `'low'`, and `state.next_action = {'action': 'confirm_with_user', 'reason': 'All detectors failed...'}`. Calling `report(state)` when `state.analysis is None` raises `ValueError("No successful detectors to report on. Use iterate() to adjust the plan.")`.
499  
500  **Result quality assessment (in `analyze()`):**
501  
502  Three metrics, each normalized to [0, 1]. No new dependencies beyond numpy and scipy (both already required by PyOD).
503  
504  1. **Score separation** (`quality.separation`): ratio of mean anomaly score to mean inlier score, clipped to [0, 1].
505     ```
506     anomaly_mean = mean(scores[labels == 1])
507     inlier_mean  = mean(scores[labels == 0])
508     separation = clip(anomaly_mean / (inlier_mean + 1e-10) - 1, 0, 1)
509     ```
510     Values near 1.0 indicate anomalies have much higher scores (good). Near 0.0 means indistinguishable (bad).
511  
512  2. **Detector agreement** (`quality.agreement`): mean pairwise Spearman rank correlation across detectors, clipped to [0, 1].
513     ```
514     from scipy.stats import spearmanr
515     correlations = []
516     for i in range(n_detectors):
517         for j in range(i+1, n_detectors):
518             rho, _ = spearmanr(scores_i, scores_j)
519             correlations.append(max(0, rho))
520     agreement = mean(correlations)
521     ```
522     For single-detector runs: agreement = 0.5 (neutral, neither high nor low confidence).
523  
524  3. **Label stability** (`quality.stability`): Jaccard index of top-k anomaly sets when k varies by +/-20%.
525     ```
526     k = n_anomalies  # from contamination
527     k_low  = max(1, int(k * 0.8))
528     k_high = min(n_samples, int(k * 1.2))
529     top_k     = set(argsort(scores)[-k:])
530     top_k_low = set(argsort(scores)[-k_low:])
531     top_k_high = set(argsort(scores)[-k_high:])
532     stability = 0.5 * (jaccard(top_k, top_k_low) + jaccard(top_k, top_k_high))
533     ```
534     Uses the consensus scores. No re-fitting needed — just checks if the anomaly set is robust to the contamination threshold.
535  
536  4. **Overall** (`quality.overall`): `mean(separation, agreement, stability)`.
537  
538  5. **Verdict**: `'high'` if overall >= 0.7, `'medium'` if >= 0.4, `'low'` otherwise. Explanation string constructed from the three component values.
539  
540  **Edge-case fallbacks:**
541  - If all consensus labels are 0 or all are 1 (empty anomaly/inlier set): `separation = 0.0` (no separation detected).
542  - If `spearmanr()` returns NaN (constant score vector): that pair contributes 0.0 to the agreement mean. If all pairs are NaN: `agreement = 0.0`.
543  - If `k = 0` (no anomalies found): `stability = 0.0`.
544  - For single-detector runs: `agreement = 0.5` (neutral — no basis for agreement or disagreement).
545  - If `state.consensus is None` (all detectors errored): all metrics = 0.0, verdict = `'low'`, `next_action = {'action': 'confirm_with_user', 'reason': 'All detectors failed. Check data format or try different detector family.'}`.
546  - `overall = mean(separation, agreement, stability)` — no values are excluded; edge-case fallbacks above ensure all three are always defined floats.
547  
548  **Intelligent `next_action`:**
549  - After `analyze()` with high confidence: `{'action': 'report_to_user', 'summary': '...', 'confidence': 0.87}`
550  - After `analyze()` with low confidence: `{'action': 'iterate', 'reason': 'Detectors disagree (agreement=0.3); consider different algorithm family', 'suggestion': 'Exclude lowest-agreement detector and re-run'}`
551  - After `iterate()` with structured feedback: `{'action': 'run', 'reason': 'Plan adjusted: excluded IForest, added ECOD', 'adjustment': 'Excluded IForest, added ECOD'}`
552  - After `iterate()` with ambiguous NL: `{'action': 'confirm_with_user', 'reason': 'Interpreted "too many" as lower contamination', 'suggestion': 'Lower contamination from 0.1 to 0.05?', 'proposed_change': {...}}`
553  
554  **Actionable `iterate()`:**
555  
556  Two modes:
557  
558  *Structured feedback (primary):* dict with a closed set of actions:
559  - `{"action": "adjust_contamination", "value": 0.05}` → updates contamination in all plans, re-runs
560  - `{"action": "exclude", "detectors": ["IForest"]}` → removes from plans, re-plans if needed
561  - `{"action": "include", "detectors": ["ECOD"]}` → adds to plans
562  - `{"action": "rerun"}` → same plans, fresh fit
563  
564  These execute immediately and set `next_action` to `'run'`.
565  
566  *NL feedback (best-effort):* string parsed via keyword matching:
567  - "too many false positives" → proposes `adjust_contamination` with lower value
568  - "try without X" → proposes `exclude`
569  - "missed anomalies" → proposes `adjust_contamination` with higher value
570  
571  NL parsing assigns a confidence score. If confidence >= 0.8, executes immediately. If < 0.8, sets `next_action` to `'confirm_with_user'` with:
572  ```python
573  next_action = {
574      'action': 'confirm_with_user',
575      'reason': 'Interpreted "too many false positives" as lower contamination.',
576      'suggestion': 'Lower contamination from 0.1 to 0.05?',
577      'proposed_change': {'action': 'adjust_contamination', 'value': 0.05},
578  }
579  ```
580  The agent presents this to the user. On confirmation, the agent calls `iterate(state, proposed_change)` with the structured dict.
581  
582  All iterations are logged in `state.history`. The engine tracks which detector-parameter combinations have been tried and avoids repeating them.
583  
584  ---
585  
586  ## 5. Skill Integration (od-expert)
587  
588  The od-expert skill uses the session API to guide the conversation:
589  
590  ```
591  User: "Find anomalies in this sensor data"
592  Skill: state = engine.start(data)
593         state = engine.plan(state)        # next_action='run'
594         state = engine.run(state)         # next_action='analyze'
595         state = engine.analyze(state)     # next_action='report_to_user'
596  Skill: presents state.next_action['summary'] to user
597  
598  User: "Too many false positives"
599  Skill: state = engine.iterate(state, "too many false positives")
600         # NL confidence=0.6 < 0.8, so next_action='confirm_with_user'
601  Skill: "I interpreted that as: lower contamination from 0.1 to 0.05. Proceed?"
602  
603  User: "Yes"
604  Skill: state = engine.iterate(state, state.next_action['proposed_change'])
605         # Structured dict, executes immediately → next_action='run'
606         state = engine.run(state)
607         state = engine.analyze(state)
608  Skill: presents updated results
609  
610  User: "Good, give me the report"
611  Skill: report = engine.report(state)
612  ```
613  
614  The skill does not need OD knowledge. It follows `state.next_action` and translates between user and engine. When `next_action` is `'confirm_with_user'`, the skill presents the proposed change and waits for user approval before executing.
615  
616  **Skill responsibilities:**
617  - Translate user intent to `iterate()` feedback
618  - Present results in human-readable format
619  - Decide when to show code vs just results
620  - Handle data loading (file paths, formats)
621  
622  **Engine responsibilities:**
623  - All OD domain knowledge
624  - Workflow state tracking
625  - Algorithm selection, comparison, quality assessment
626  - Iteration logic (what to change based on feedback)
627  
628  ---
629  
630  ## 6. Backward Compatibility
631  
632  **No breaking changes. All additions are strictly additive.**
633  
634  - All existing ADEngine methods remain unchanged with identical signatures: `profile_data`, `plan_detection`, `run_detection`, `analyze_results`, `explain_findings`, `suggest_next_step`, `generate_report`, `detect`, `list_detectors`, `explain_detector`, `compare_detectors`, `get_benchmarks`.
635  - New session methods are additive: `start`, `plan`, `run`, `analyze`, `iterate`, `report`, `investigate`.
636  - The session execution method is named `run(state)` to avoid conflict with the existing `detect(X_train, ...)` one-shot method. Both coexist on the same class.
637  - `InvestigationState` is a new class in a new file (`pyod/utils/investigation.py`), no conflicts.
638  - Layer 1 (BaseDetector models) is untouched.
639  
640  ---
641  
642  ## 7. Scope and Non-Goals
643  
644  **In scope for V3:**
645  - `InvestigationState` typed dataclass with closed enums and schemas
646  - Session workflow methods: `start`, `plan`, `run`, `analyze`, `iterate`, `report`, `investigate`
647  - Multi-detector comparison with rank-normalized consensus scoring
648  - Result quality assessment (separation, agreement, stability — exact formulas defined)
649  - Actionable iteration: structured-first with NL best-effort + confirmation
650  - Updated od-expert skill
651  - Tests and documentation
652  
653  **Not in scope:**
654  - MCP server changes (Python-first; MCP can wrap session API later)
655  - Persistent cross-session memory (library stays stateless within Python session)
656  - New detectors (all shipped)
657  - Changes to BaseDetector or any detector classes
658  - AutoML / hyperparameter search
659  - UI or visualization
660  
661  ---
662  
663  ## 8. Implementation Feasibility
664  
665  All new code lives in `pyod/utils/ad_engine.py` (session methods, ~400-500 lines) plus a new `pyod/utils/investigation.py` for `InvestigationState` (~50-100 lines). The od-expert skill update is documentation and workflow instructions (~100 lines).
666  
667  | Component | Effort | Lines | Dependencies |
668  |-----------|--------|-------|-------------|
669  | `InvestigationState` dataclass + enums | Low | ~100 | None (new file `pyod/utils/investigation.py`) |
670  | Session methods (start, plan, run, analyze, iterate, report) | Medium | ~400 | Existing ADEngine methods |
671  | Multi-detector comparison + consensus | Medium | ~100 | Existing `run_detection`, scipy.stats.rankdata, scipy.stats.spearmanr |
672  | Result quality assessment (3 metrics) | Medium | ~80 | numpy, scipy.stats (both already required) |
673  | `investigate()` convenience | Low | ~20 | Session methods |
674  | od-expert skill update | Low | ~100 | Skill markdown |
675  | Tests | Medium | ~200 | pytest |
676  | Documentation | Low | ~50 | README, CHANGES |
677  
678  Total: ~1000 lines of new code, all additive.
679  
680  ---
681  
682  ## 9. Codex Review Resolution (Round 1)
683  
684  | # | Finding | Status | Resolution |
685  |---|---------|--------|------------|
686  | 1 | Blocker: session `detect(state)` conflicts with existing `detect(X_train, ...)` | **Resolved** | Renamed to `run(state)`. Existing `detect()` unchanged. No naming conflict. |
687  | 2 | Blocker: quality assessment under-defined (dip test not in scipy, stability vague) | **Resolved** | Replaced with 3 exact metrics: score separation (ratio), detector agreement (Spearman), label stability (Jaccard). All use numpy/scipy already required. No new dependencies. |
688  | 3 | High: `iterate()` overcommits on NL feedback | **Resolved** | Structured dict feedback is primary (executes immediately). NL feedback is best-effort with confidence score; if < 0.8, returns `'confirm_with_user'` action for agent to present to user. |
689  | 4 | Medium: `InvestigationState` schema is open-ended | **Resolved** | Added closed `PHASES` and `ACTION_TYPES` enums. Defined typed schemas for `HistoryEntry`, `DetectorResult`, `ConsensusResult`, `QualityAssessment`, `NextAction`. Added 2 worked examples (after `plan()` and after `analyze()`). |
690  | 5 | Medium: multi-detector flow underspecified against existing helpers | **Resolved** | Defined how `plan()` wraps `plan_detection()`, how `run()` wraps `run_detection()` per-plan with error handling, how `report()` wraps `generate_report()`. Defined consensus preconditions (same n_samples, rank normalization) and fallback for single/error cases. |
691  
692  ## 10. Codex Review Resolution (Round 2)
693  
694  | # | Finding | Status | Resolution |
695  |---|---------|--------|------------|
696  | 1 | Blocker: `next_action` protocol inconsistent — stale `detect`/`iterate` values outside enum | **Resolved** | Added `'iterate'` to `ACTION_TYPES` enum. Replaced all stale `detect` references with `run`. Fixed `investigate()` docstring. Updated all `next_action` examples to use only enum values. |
697  | 2 | High: `DetectorResult` mismatches `run_detection()` schema; `state.analysis` untyped | **Resolved** | `DetectorResult` is now a superset of `run_detection()` output (stores raw result verbatim). Added `InvestigationAnalysis` typed schema with `consensus_analysis`, `per_detector_analysis`, `best_detector` (selected by Spearman correlation with consensus). Updated `report()` to use typed `best_detector_index`. |
698  | 3 | High: quality metrics undefined for edge cases | **Resolved** | Added explicit fallback values: empty labels → separation=0.0, NaN Spearman → 0.0, k=0 → stability=0.0, single detector → agreement=0.5, all errors → all metrics=0.0 with `confirm_with_user`. All three metrics always produce defined floats. |
699  | 4 | Medium: `max_detectors` overstated vs `plan_detection()` | **Resolved** | Capped at 3 in v1 (matches `plan_detection()` output: 1 primary + 2 alternatives). Documented as v1 limit, silently caps higher values. |
700  
701  ## 11. Codex Review Resolution (Round 3)
702  
703  | # | Finding | Status | Resolution |
704  |---|---------|--------|------------|
705  | 1 | Medium: `best_detector` no tie-break or degenerate fallback | **Resolved** | Added deterministic fallback chain: (1) highest finite Spearman, (2) highest plan confidence, (3) fastest runtime, (4) first successful if all NaN. Single-detector case explicit. |
706  | 2 | Medium: `proposed_change` untyped, `suggestion` field scope unclear | **Resolved** | Added typed `StructuredFeedback` schema with closed action names and required fields. `NextAction` payload documented per action type: `suggestion` valid for `iterate` and `confirm_with_user`, `proposed_change` is a `StructuredFeedback`. |
707  | 3 | High (still open from R2): `report()` calls `analyze_results()` on consensus but consensus lacks `plan`/`threshold` | **Resolved** | `report()` now uses `per_detector_analysis[best_idx]` (fully compatible with `generate_report()`). `consensus_analysis` is a lightweight summary built by `analyze()` directly, not by calling `analyze_results()`. Session-level consensus info rendered in a separate section prepended to the report. |
708  
709  ## 12. Codex Review Resolution (Round 4)
710  
711  | # | Finding | Status | Resolution |
712  |---|---------|--------|------------|
713  | 1 | High: `per_detector_analysis` index misaligns with `state.results` when detectors error | **Resolved** | `per_detector_analysis` is now positionally aligned with `state.results`. Error/skipped entries get `None`. `best_detector_index` always points to a successful entry in both lists. |
714  | 2 | Medium: NextAction payload fields not marked required vs optional per action type | **Resolved** | Each action type now documents R (required) vs O (optional) fields. `confirm_with_user` has optional `suggestion`/`proposed_change` (present for change confirmation, absent for error/retry). |
715  | 3 | Medium: `consensus_analysis` schema comment still said `analyze_results() output` | **Resolved** | Defined typed `ConsensusAnalysis` schema. `InvestigationAnalysis` references it by name. Behavior section confirms it is built directly by `analyze()`, not via `analyze_results()`. |
716  
717  ## 13. Codex Review Resolution (Round 5)
718  
719  | # | Finding | Status | Resolution |
720  |---|---------|--------|------------|
721  | 1 | High: `generate_report(format='json')` returns JSON string, not dict — wrapper contract broken | **Resolved** | JSON path now bypasses `generate_report()` and constructs a native Python dict directly from `best_result` and `best_analysis`. Text path still wraps `generate_report(format='text')`. |
722  | 2 | Medium: all-detectors-error leaves `state.analysis` undefined | **Resolved** | `state.analysis = None` when all detectors error. `state.quality` set to all-zeros with verdict `'low'`. `report(state)` raises `ValueError` when `state.analysis is None`. |