# PyOD V3: Agentic Anomaly Detection Design

**Date:** 2026-04-12
**Status:** Draft (v6 -- Round 5 review fixes)

---

## 1. Vision

The PyOD V3 thesis: **any AI agent can do expert-level anomaly detection through PyOD, without knowing OD.**

The goal is not multi-modal coverage (already shipped). The goal is making the anomaly detection workflow **agentic** — ADEngine guides any agent (LLM or application) through the expert OD workflow step by step, so the agent never has to know which algorithm to pick, how to interpret scores, or when to iterate.

**What "agentic" means concretely:**
- Understand the data modality (tabular, time series, graph, text, image)
- Choose the right algorithm(s) based on benchmark evidence
- Run detection, compare multiple detectors, assess result quality
- Discuss results with users, explain findings
- Iterate based on feedback (too many false positives, missed anomalies, try something different)
- Generate reports

**What it does NOT mean:**
- No changes to Layer 1 (BaseDetector models, direct `fit`/`predict` API)
- No mandatory dependency on LLMs or agents
- No persistent state across Python sessions (stateless library)

---

## 2. Architecture: Three Layers

```
Layer 3: Skill (od-expert)                  ← agent conversation layer
Layer 2: ADEngine (workflow + intelligence) ← agentic orchestration
Layer 1: BaseDetector models (fit/predict)  ← direct Python API (unchanged)
```

Each layer is independently useful:
- **Layer 1** users: `from pyod.models.iforest import IForest; clf = IForest(); clf.fit(X)` — no changes, no new requirements
- **Layer 2** users: `engine = ADEngine(); result = engine.investigate(data)` — intelligent orchestration
- **Layer 3** users: an agent follows the od-expert skill, which calls ADEngine — full conversational workflow

Intelligence lives in **Layer 2** (portable across agents). Conversation flow lives in **Layer 3** (agent-specific). Layer 1 is untouched.

---

## 3. Current State (ADEngine Tier A + B)

ADEngine today has lifecycle methods, but they are **disconnected building blocks**. An agent must know which to call, in what order, and how to interpret the results:

```python
profile = engine.profile_data(data)                 # agent decides what to do
plan = engine.plan_detection(profile)               # returns 1 detector
result = engine.run_detection(data, plan)           # agent must run this
analysis = engine.analyze_results(result, X=data)   # agent decides if good
# ... agent parses analysis, decides whether to iterate, how, etc.
```

**Problems:**
1. Routes to 1 detector — experts run 2-3 and compare
2. No quality assessment — the agent cannot tell if results are trustworthy
3. No guided iteration — `suggest_next_step` returns text, not an executable action
4. No workflow enforcement — the agent can call methods in any order or skip steps
5. No session context — each call is stateless, so the agent must carry all context

---

## 4. V3 Design: Workflow Engine

### 4.1 Session-Based State Machine

ADEngine manages an **investigation session** that tracks the workflow state:

```
START → PROFILED → PLANNED → DETECTED → ANALYZED → [ITERATE → PLANNED | DONE]
```

Each step returns the updated `InvestigationState` with a typed `next_action` that tells the agent what to do next. The agent (or skill) follows `next_action` without needing OD knowledge.

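
To make the protocol concrete, here is a minimal caller-side sketch that drives a session purely by dispatching on `next_action`. The loop itself is illustrative, not part of the proposed API; the import path follows the file layout in Section 8, and the synthetic data is a placeholder.

```python
import numpy as np

from pyod.utils.ad_engine import ADEngine  # assumed location, per Section 8

X = np.random.randn(500, 8)  # placeholder data

engine = ADEngine()
state = engine.start(X)  # phase='profiled'; next_action recommends 'plan'

while True:
    action = state.next_action['action']
    if action == 'plan':
        state = engine.plan(state)
    elif action == 'run':
        state = engine.run(state)
    elif action == 'analyze':
        state = engine.analyze(state)
    else:
        # 'report_to_user', 'confirm_with_user', 'iterate', and 'done' all
        # require a user (or agent policy) decision, so control returns here.
        break

print(state.next_action['reason'])
```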
**Decision (Codex finding #1):** The session execution method is named `run(state)`, not `detect(state)`. The existing `detect(X_train, ...)` one-shot method is unchanged. No naming conflict.

### 4.2 New API: Investigation Session

```python
class ADEngine:
    # --- Existing methods (unchanged, still work independently) ---
    def profile_data(self, X, data_type=None): ...
    def plan_detection(self, profile, ...): ...
    def run_detection(self, X_train, plan, ...): ...
    def analyze_results(self, result, ...): ...
    # ... all existing methods preserved ...

    # --- V3: Session-based workflow ---

    def start(self, X, data_type=None):
        """Start an investigation session.

        Profiles the data and returns an InvestigationState with
        the profile and recommended next action.

        Parameters
        ----------
        X : array-like, Data, list, or dict
            Input data (any modality).
        data_type : str or None
            Explicit type override.

        Returns
        -------
        state : InvestigationState
        """

    def plan(self, state, priority='balanced', constraints=None):
        """Plan detection for the investigation.

        Selects top-N candidate detectors based on the profile and
        benchmark evidence. Updates the state with the detection plan.

        Parameters
        ----------
        state : InvestigationState
        priority : str
            'speed', 'accuracy', or 'balanced'.
        constraints : dict or None
            {'exclude_detectors': [...], 'max_detectors': int}

        Returns
        -------
        state : InvestigationState
        """

    def run(self, state):
        """Run detection with all planned detectors.

        Calls ``run_detection()`` for each plan in ``state.plans``.
        Collects per-detector results, computes consensus scores
        via rank normalization, and measures detector agreement.
        If a detector errors, it is recorded in ``state.results``
        with ``status='error'`` and excluded from the consensus.

        Parameters
        ----------
        state : InvestigationState
            Must be in phase ``'planned'``.

        Returns
        -------
        state : InvestigationState
            Phase set to ``'detected'``.
        """

    def analyze(self, state):
        """Analyze detection results.

        Assesses result quality (score distribution, detector
        agreement, contamination stability), identifies top
        anomalies, computes feature importance, and generates
        a human-readable summary. Updates the state with the analysis.

        Parameters
        ----------
        state : InvestigationState

        Returns
        -------
        state : InvestigationState
        """

    def iterate(self, state, feedback):
        """Iterate based on user/agent feedback.

        **Structured feedback (primary, executes immediately):**

        - ``{"action": "adjust_contamination", "value": 0.05}``
        - ``{"action": "exclude", "detectors": ["IForest"]}``
        - ``{"action": "include", "detectors": ["ECOD"]}``
        - ``{"action": "rerun"}`` (same plan, different random seed)

        **Natural-language feedback (best-effort, needs confirmation):**

        - ``"too many false positives"``
        - ``"try without IForest"``

        When NL feedback is provided, the engine parses it into a
        proposed structured action with a confidence score. If
        confidence < 0.8, ``next_action`` is set to
        ``'confirm_with_user'`` with the proposed change for the
        agent to confirm before executing.

        Parameters
        ----------
        state : InvestigationState
            Must be in phase ``'analyzed'``.
        feedback : str or dict
            Structured dict (executed immediately) or NL string
            (parsed to a proposed action, may need confirmation).

        Returns
        -------
        state : InvestigationState
            Phase reset to ``'planned'`` if the action is clear,
            or kept at ``'analyzed'`` with ``next_action =
            'confirm_with_user'`` if ambiguous.
        """

    def report(self, state, format='text'):
        """Generate a report for the investigation.

        Parameters
        ----------
        state : InvestigationState
        format : str
            'text' or 'json'.

        Returns
        -------
        report : str or dict
        """

    def investigate(self, X, data_type=None, priority='balanced'):
        """One-shot expert investigation (convenience).

        Runs the full workflow: start → plan → run → analyze.
        Returns an InvestigationState ready for user review,
        iteration, or reporting.

        Parameters
        ----------
        X : array-like
            Input data.
        data_type : str or None
        priority : str

        Returns
        -------
        state : InvestigationState
        """
```

### 4.3 InvestigationState

A typed dataclass with closed enums and defined schemas.

**Phase enum (closed):**

```python
PHASES = ('profiled', 'planned', 'detected', 'analyzed')
```

**ActionType enum (closed):**

```python
ACTION_TYPES = (
    'plan',               # engine recommends planning next
    'run',                # engine recommends running detection
    'analyze',            # engine recommends analyzing results
    'report_to_user',     # results ready for user review
    'confirm_with_user',  # engine needs user confirmation before acting
    'iterate',            # engine suggests trying a different approach
    'done',               # investigation complete
)
```

**State dataclass:**

```python
from dataclasses import dataclass


@dataclass
class InvestigationState:
    # --- Workflow tracking ---
    phase: str            # one of PHASES
    iteration: int        # 0 = first run, increments on iterate()
    history: list         # list of HistoryEntry dicts (see schema)

    # --- Data context ---
    data: object          # reference to input data (not copied)
    profile: dict         # output of profile_data()

    # --- Detection ---
    plans: list           # list of DetectionPlan dicts (top-N)
    results: list         # list of DetectorResult dicts (see schema)
    consensus: dict or None   # ConsensusResult dict (see schema)

    # --- Analysis ---
    analysis: dict or None    # InvestigationAnalysis dict (see schema)
    quality: dict or None     # QualityAssessment dict (see schema)

    # --- Workflow guidance ---
    next_action: dict     # NextAction dict (see schema)
```

**Typed schemas:**

```python
# HistoryEntry: one per workflow step
HistoryEntry = {
    'phase': str,        # phase after this step
    'action': str,       # what was done ('start', 'plan', 'run', ...)
    'iteration': int,    # iteration number
    'timestamp': float,  # time.time()
    'detail': str,       # human-readable summary
}

# DetectorResult: one per detector in run()
# Superset of run_detection() output — raw result stored verbatim
# plus status/error fields for the session wrapper.
DetectorResult = {
    'detector_name': str,
    'status': str,               # 'success' | 'error' | 'skipped'
    'error': str or None,        # error message if status='error'
    # --- Fields from run_detection() (present when status='success') ---
    'plan': dict,                # the DetectionPlan used
    'scores_train': np.ndarray,  # (n_samples,)
    'labels_train': np.ndarray,  # (n_samples,)
    'threshold': float,
    'n_anomalies': int,
    'anomaly_ratio': float,
    'detector': object,          # fitted BaseDetector
    'runtime_seconds': float,
    'score_summary': dict,       # mean, std, min, max, q25, q75
}

# ConsensusResult: aggregated across successful detectors
ConsensusResult = {
    'scores': np.ndarray,   # (n_samples,) rank-normalized mean
    'labels': np.ndarray,   # (n_samples,) majority-vote labels
    'n_detectors': int,     # number of successful detectors
    'agreement': float,     # mean pairwise Spearman correlation [0, 1]
    'disagreements': list,  # sample indices where detectors disagree
}

# ConsensusAnalysis: lightweight summary (NOT analyze_results() output)
ConsensusAnalysis = {
    'n_anomalies': int,
    'anomaly_ratio': float,
    'score_distribution': dict,  # mean, std, min, max, median, q25, q75
    'top_anomalies': list,       # top-k by consensus score
    'summary': str,              # generated narrative
}

# InvestigationAnalysis: output of analyze()
InvestigationAnalysis = {
    'consensus_analysis': ConsensusAnalysis,
    'per_detector_analysis': list,  # positionally aligned with state.results;
                                    # None for error/skipped entries,
                                    # analyze_results() output for successful ones
    'best_detector': str,           # name of the best detector
    'best_detector_index': int,     # index into state.results (always a
                                    # successful entry)
    'summary': str,                 # human-readable summary
}
# best_detector selection (deterministic fallback chain):
#   1. Highest finite Spearman correlation with consensus scores
#   2. If tied: highest plan confidence (from routing)
#   3. If still tied: fastest successful detector (lowest runtime)
#   4. If all correlations are NaN (constant scores): first successful detector
# Single-detector case: best_detector_index = that detector's index

# QualityAssessment
QualityAssessment = {
    'separation': float,   # score separation ratio [0, 1] (see 4.4)
    'agreement': float,    # detector agreement [0, 1], N/A → 0.5
    'stability': float,    # label stability [0, 1] (see 4.4)
    'overall': float,      # mean(separation, agreement, stability)
    'verdict': str,        # 'high' | 'medium' | 'low'
    'explanation': str,    # human-readable quality summary
}

# StructuredFeedback: typed actions for iterate()
# Each action has a closed set of required fields.
StructuredFeedback = {
    'action': str,  # one of:
                    #   'adjust_contamination' — requires 'value': float
                    #   'exclude'              — requires 'detectors': list[str]
                    #   'include'              — requires 'detectors': list[str]
                    #   'rerun'                — no extra fields
}

# NextAction: closed action type with typed payload
# 'action' and 'reason' are always required.
# Additional fields are per action type (R = required, O = optional):
#
#   action='plan':              (no extra fields)
#   action='run':               O 'adjustment': str (present after iterate)
#   action='analyze':           (no extra fields)
#   action='report_to_user':    R 'summary': str
#                               R 'confidence': float
#   action='confirm_with_user': O 'suggestion': str (present for change confirmation)
#                               O 'proposed_change': StructuredFeedback (present for change confirmation)
#                               (when used for error/retry, only 'reason' is present)
#   action='iterate':           R 'suggestion': str
#   action='done':              (no extra fields)
NextAction = {
    'action': str,  # one of ACTION_TYPES
    'reason': str,  # always present
}
```

**Worked example: state after `plan()`**

```python
state.phase = 'planned'
state.iteration = 0
state.profile = {'data_type': 'tabular', 'n_samples': 1000, 'n_features': 20, ...}
state.plans = [
    {'detector_name': 'IForest', 'params': {}, 'confidence': 0.85, ...},
    {'detector_name': 'ECOD', 'params': {}, 'confidence': 0.8, ...},
    {'detector_name': 'KNN', 'params': {}, 'confidence': 0.75, ...},
]
state.results = []
state.consensus = None
state.next_action = {
    'action': 'run',
    'reason': 'Top 3 detectors selected: IForest (0.85), ECOD (0.80), KNN (0.75). Ready to run.',
}
```

**Worked example: state after `analyze()`**

```python
state.phase = 'analyzed'
state.iteration = 0
state.quality = {
    'separation': 0.82,
    'agreement': 0.91,
    'stability': 0.88,
    'overall': 0.87,
    'verdict': 'high',
    'explanation': 'Strong score separation, high detector agreement (Spearman 0.91), stable labels under contamination perturbation.',
}
state.next_action = {
    'action': 'report_to_user',
    'reason': 'High-quality results (0.87). 3 detectors agree on 95 of 100 anomaly labels.',
    'summary': '100 anomalies detected (10%) with high confidence. Top anomalies at indices [42, 87, 156, ...].',
    'confidence': 0.87,
}
```

### 4.4 Key Behaviors

**Multi-detector comparison (in `run()`):**

`run(state)` wraps the existing `run_detection(X, plan)` method, called once per plan in `state.plans`.

How it works:
1. For each plan in `state.plans`, call `self.run_detection(state.data, plan)`.
2. Collect the results into `state.results` as `DetectorResult` dicts.
3. If a detector raises an exception, record `status='error'` with the error message and continue.
4. After all detectors finish, compute the consensus from the successful results.

**Consensus preconditions:**
- All successful detectors must produce scores of the same length (`n_samples`). This is guaranteed because they all fit on `state.data`.
- Scores are rank-normalized per detector (rank / n_samples) before averaging, so different score scales are comparable.
- Labels are majority-voted across detectors.

**Consensus computation:**

```
rank_scores[i]   = rankdata(scores_i) / n_samples     # per detector
consensus_scores = mean(rank_scores, axis=0)          # across detectors
consensus_labels = (vote_count > n_detectors / 2).astype(int)
```

**Fallback for a single detector:** consensus = that detector's raw scores and labels, agreement = 0.5.

**Fallback for all detectors erroring:** `state.results` is all errors, `state.consensus = None`, and `next_action = 'confirm_with_user'` with an explanation.

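
As a sanity check on the formulas above, here is a minimal sketch of the consensus step, assuming each successful detector's scores and labels are available as 1-D numpy arrays of equal length. The function name is illustrative, not part of the proposed API.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr


def consensus_from_scores(score_list, label_list):
    """Illustrative consensus over successful detectors.

    score_list / label_list: one (n_samples,) array per successful detector.
    """
    n_samples = len(score_list[0])
    n_detectors = len(score_list)

    # Rank-normalize each detector's scores so different scales are comparable.
    rank_scores = np.vstack([rankdata(s) / n_samples for s in score_list])
    consensus_scores = rank_scores.mean(axis=0)

    # Majority vote over the binary labels.
    vote_count = np.vstack(label_list).sum(axis=0)
    consensus_labels = (vote_count > n_detectors / 2).astype(int)

    # Agreement: mean pairwise Spearman correlation; NaN pairs count as 0,
    # a single detector gets the neutral value 0.5.
    correlations = []
    for i in range(n_detectors):
        for j in range(i + 1, n_detectors):
            rho, _ = spearmanr(score_list[i], score_list[j])
            correlations.append(0.0 if np.isnan(rho) else max(0.0, rho))
    agreement = float(np.mean(correlations)) if correlations else 0.5

    return consensus_scores, consensus_labels, agreement
```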
**How `plan(state)` wraps `plan_detection()`:**
- Calls `self.plan_detection(state.profile, priority, constraints)` to get the primary plan with up to 2 alternatives.
- Extracts primary + alternatives into `state.plans` (up to 3 detectors).
- v1 limit: `max_detectors` is capped at 3 because `plan_detection()` returns at most 1 primary + 2 alternatives. Higher values are silently capped. Future versions may extend `plan_detection()` to support larger candidate sets.

**How `report(state)` wraps `generate_report()`:**
- Selects `best_idx = state.analysis['best_detector_index']`.
- Selects `best_result = state.results[best_idx]` (raw `run_detection()` output, fully compatible with `generate_report()`).
- Selects `best_analysis = state.analysis['per_detector_analysis'][best_idx]` (raw `analyze_results()` output for that detector).
- Calls `self.generate_report(best_result, best_analysis, format)` for the main report body. This is a direct wrapper with no contract mismatch — both inputs are exactly what the existing helpers produce.
- Prepends a session-level section with: consensus summary, detector agreement score, quality verdict, and disagreement highlights. This section is generated by `report()` itself, not by `generate_report()`.
- If `format='json'`, constructs the best-detector section directly from `best_result` and `best_analysis` (bypassing `generate_report(format='json')`, which returns a string). Returns a native Python dict:

  ```python
  {
      'session': {
          'consensus': state.consensus,
          'quality': state.quality,
          'comparison': {...},  # detector agreement, disagreements
      },
      'best_detector': {
          'name': best_result['detector_name'],
          'scores': best_result['scores_train'].tolist(),
          'labels': best_result['labels_train'].tolist(),
          'threshold': best_result['threshold'],
          'analysis': best_analysis,
      },
  }
  ```

- If `format='text'`, calls `self.generate_report(best_result, best_analysis, format='text')` for the main body (returns a string) and prepends the session-level section.

**How `analyze()` constructs `state.analysis`** (a sketch follows this list)**:**
- `consensus_analysis` is a `ConsensusAnalysis` dict (see the schema in 4.3), built directly by `analyze()` from the consensus scores and labels. It is NOT produced by calling `analyze_results()`, since the consensus has no `plan` or `threshold`.
- `per_detector_analysis` is a list positionally aligned with `state.results`. For each entry:
  - If `status='success'`: calls `self.analyze_results(result, X=state.data)` and stores the output. Fully compatible with `generate_report()`.
  - If `status='error'` or `'skipped'`: stores `None`.
- This alignment means `state.analysis['per_detector_analysis'][i]` always corresponds to `state.results[i]`, regardless of error/skip entries.
- **All-detectors-error path:** If every detector in `state.results` has `status='error'`, then `state.analysis = None`. In this case, `state.quality` is set to all zeros with verdict `'low'`, and `state.next_action = {'action': 'confirm_with_user', 'reason': 'All detectors failed...'}`. Calling `report(state)` when `state.analysis is None` raises `ValueError("No successful detectors to report on. Use iterate() to adjust the plan.")`.

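
A minimal sketch of the alignment and `best_detector` selection described above. The function name is illustrative, and `analyze_one` stands in for `self.analyze_results(...)`; nothing here is a committed signature.

```python
import numpy as np
from scipy.stats import spearmanr


def select_best_detector(results, consensus_scores, analyze_one):
    """Illustrative: build per_detector_analysis aligned with results and
    pick best_detector by Spearman correlation with the consensus scores.
    """
    per_detector_analysis = []
    candidates = []  # (correlation, plan confidence, -runtime, index)

    for idx, res in enumerate(results):
        if res['status'] != 'success':
            per_detector_analysis.append(None)   # keep positional alignment
            continue
        per_detector_analysis.append(analyze_one(res))

        rho, _ = spearmanr(res['scores_train'], consensus_scores)
        confidence = res['plan'].get('confidence', 0.0)
        candidates.append((rho, confidence, -res['runtime_seconds'], idx))

    finite = [c for c in candidates if np.isfinite(c[0])]
    if finite:
        # Tie-break chain: correlation, then plan confidence, then fastest runtime.
        best_idx = max(finite, key=lambda c: (c[0], c[1], c[2]))[3]
    elif candidates:
        # All correlations NaN (constant score vectors): first successful detector.
        best_idx = candidates[0][3]
    else:
        best_idx = None  # all detectors errored; caller sets analysis = None

    return per_detector_analysis, best_idx
```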
**Result quality assessment (in `analyze()`):**

Three metrics, each normalized to [0, 1]. No new dependencies beyond numpy and scipy (both already required by PyOD). A combined sketch follows the edge-case list below.

1. **Score separation** (`quality.separation`): ratio of the mean anomaly score to the mean inlier score, shifted by -1 and clipped to [0, 1].

   ```
   anomaly_mean = mean(scores[labels == 1])
   inlier_mean  = mean(scores[labels == 0])
   separation   = clip(anomaly_mean / (inlier_mean + 1e-10) - 1, 0, 1)
   ```

   Values near 1.0 indicate anomalies have much higher scores (good). Values near 0.0 mean anomalies are indistinguishable from inliers (bad).

2. **Detector agreement** (`quality.agreement`): mean pairwise Spearman rank correlation across detectors, clipped to [0, 1].

   ```
   from scipy.stats import spearmanr
   correlations = []
   for i in range(n_detectors):
       for j in range(i + 1, n_detectors):
           rho, _ = spearmanr(scores_i, scores_j)
           correlations.append(max(0, rho))
   agreement = mean(correlations)
   ```

   For single-detector runs: agreement = 0.5 (neutral, neither high nor low confidence).

3. **Label stability** (`quality.stability`): Jaccard index of the top-k anomaly sets when k varies by ±20%.

   ```
   k          = n_anomalies  # from contamination
   k_low      = max(1, int(k * 0.8))
   k_high     = min(n_samples, int(k * 1.2))
   top_k      = set(argsort(scores)[-k:])
   top_k_low  = set(argsort(scores)[-k_low:])
   top_k_high = set(argsort(scores)[-k_high:])
   stability  = 0.5 * (jaccard(top_k, top_k_low) + jaccard(top_k, top_k_high))
   ```

   Uses the consensus scores. No re-fitting is needed — it just checks whether the anomaly set is robust to the contamination threshold.

4. **Overall** (`quality.overall`): `mean(separation, agreement, stability)`.

5. **Verdict**: `'high'` if overall >= 0.7, `'medium'` if >= 0.4, `'low'` otherwise. The explanation string is constructed from the three component values.

**Edge-case fallbacks:**
- If all consensus labels are 0 or all are 1 (empty anomaly/inlier set): `separation = 0.0` (no separation detected).
- If `spearmanr()` returns NaN (constant score vector): that pair contributes 0.0 to the agreement mean. If all pairs are NaN: `agreement = 0.0`.
- If `k = 0` (no anomalies found): `stability = 0.0`.
- For single-detector runs: `agreement = 0.5` (neutral — no basis for agreement or disagreement).
- If `state.consensus is None` (all detectors errored): all metrics = 0.0, verdict = `'low'`, `next_action = {'action': 'confirm_with_user', 'reason': 'All detectors failed. Check data format or try a different detector family.'}`.
- `overall = mean(separation, agreement, stability)` — no values are excluded; the fallbacks above ensure all three are always defined floats.

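
Putting the three metrics together, a minimal sketch of the quality assessment, assuming the consensus scores/labels and per-detector scores are numpy arrays. The function name is illustrative; the thresholds and fallbacks follow the rules above.

```python
import numpy as np
from scipy.stats import spearmanr


def assess_quality(consensus_scores, consensus_labels, detector_scores):
    """Illustrative quality assessment: separation, agreement, stability."""
    n_samples = len(consensus_scores)
    n_anom = int(consensus_labels.sum())

    # 1. Score separation: excess of anomaly mean over inlier mean, clipped to [0, 1].
    if n_anom in (0, n_samples):
        separation = 0.0                                   # empty anomaly or inlier set
    else:
        anomaly_mean = consensus_scores[consensus_labels == 1].mean()
        inlier_mean = consensus_scores[consensus_labels == 0].mean()
        separation = float(np.clip(anomaly_mean / (inlier_mean + 1e-10) - 1, 0, 1))

    # 2. Detector agreement: mean pairwise Spearman; NaN pairs count as 0.
    pairs = []
    for i in range(len(detector_scores)):
        for j in range(i + 1, len(detector_scores)):
            rho, _ = spearmanr(detector_scores[i], detector_scores[j])
            pairs.append(0.0 if np.isnan(rho) else max(0.0, rho))
    agreement = float(np.mean(pairs)) if pairs else 0.5    # single detector → neutral

    # 3. Label stability: Jaccard overlap of top-k sets under ±20% perturbation of k.
    if n_anom == 0:
        stability = 0.0
    else:
        order = np.argsort(consensus_scores)
        top = lambda n: set(order[-n:])
        jaccard = lambda a, b: len(a & b) / len(a | b)
        k_low = max(1, int(n_anom * 0.8))
        k_high = min(n_samples, int(n_anom * 1.2))
        stability = 0.5 * (jaccard(top(n_anom), top(k_low)) +
                           jaccard(top(n_anom), top(k_high)))

    overall = float(np.mean([separation, agreement, stability]))
    verdict = 'high' if overall >= 0.7 else 'medium' if overall >= 0.4 else 'low'
    return {'separation': separation, 'agreement': agreement,
            'stability': stability, 'overall': overall, 'verdict': verdict}
```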
**Intelligent `next_action`:**
- After `analyze()` with high confidence: `{'action': 'report_to_user', 'summary': '...', 'confidence': 0.87}`
- After `analyze()` with low confidence: `{'action': 'iterate', 'reason': 'Detectors disagree (agreement=0.3); consider a different algorithm family', 'suggestion': 'Exclude the lowest-agreement detector and re-run'}`
- After `iterate()` with structured feedback: `{'action': 'run', 'reason': 'Plan adjusted: excluded IForest, added ECOD', 'adjustment': 'Excluded IForest, added ECOD'}`
- After `iterate()` with ambiguous NL: `{'action': 'confirm_with_user', 'reason': 'Interpreted "too many" as lower contamination', 'suggestion': 'Lower contamination from 0.1 to 0.05?', 'proposed_change': {...}}`

**Actionable `iterate()`:**

Two modes:

*Structured feedback (primary):* a dict with a closed set of actions:
- `{"action": "adjust_contamination", "value": 0.05}` → updates contamination in all plans, re-runs
- `{"action": "exclude", "detectors": ["IForest"]}` → removes from plans, re-plans if needed
- `{"action": "include", "detectors": ["ECOD"]}` → adds to plans
- `{"action": "rerun"}` → same plans, fresh fit

These execute immediately and set `next_action` to `'run'`.

*NL feedback (best-effort):* a string parsed via keyword matching:
- "too many false positives" → proposes `adjust_contamination` with a lower value
- "try without X" → proposes `exclude`
- "missed anomalies" → proposes `adjust_contamination` with a higher value

NL parsing assigns a confidence score. If confidence >= 0.8, the action executes immediately. If < 0.8, `next_action` is set to `'confirm_with_user'` with:

```python
next_action = {
    'action': 'confirm_with_user',
    'reason': 'Interpreted "too many false positives" as lower contamination.',
    'suggestion': 'Lower contamination from 0.1 to 0.05?',
    'proposed_change': {'action': 'adjust_contamination', 'value': 0.05},
}
```

The agent presents this to the user. On confirmation, the agent calls `iterate(state, proposed_change)` with the structured dict.

All iterations are logged in `state.history`. The engine tracks which detector-parameter combinations have been tried and avoids repeating them.

---

## 5. Skill Integration (od-expert)

The od-expert skill uses the session API to guide the conversation:

```
User:  "Find anomalies in this sensor data"
Skill: state = engine.start(data)
       state = engine.plan(state)       # next_action='run'
       state = engine.run(state)        # next_action='analyze'
       state = engine.analyze(state)    # next_action='report_to_user'
Skill: presents state.next_action['summary'] to the user

User:  "Too many false positives"
Skill: state = engine.iterate(state, "too many false positives")
       # NL confidence=0.6 < 0.8, so next_action='confirm_with_user'
Skill: "I interpreted that as: lower contamination from 0.1 to 0.05. Proceed?"

User:  "Yes"
Skill: state = engine.iterate(state, state.next_action['proposed_change'])
       # Structured dict, executes immediately → next_action='run'
       state = engine.run(state)
       state = engine.analyze(state)
Skill: presents updated results

User:  "Good, give me the report"
Skill: report = engine.report(state)
```

The skill does not need OD knowledge. It follows `state.next_action` and translates between the user and the engine. When `next_action` is `'confirm_with_user'`, the skill presents the proposed change and waits for user approval before executing.

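
A sketch of the skill-side feedback handling, including the `confirm_with_user` round-trip. `handle_feedback` and `ask_user` are illustrative stand-ins for however the agent actually converses with the user; they are not part of the proposed API.

```python
def handle_feedback(engine, state, user_text, ask_user):
    """Illustrative skill-side translation of user feedback into engine calls."""
    state = engine.iterate(state, user_text)              # NL feedback, best-effort

    if state.next_action['action'] == 'confirm_with_user':
        # The engine was not confident; present its interpretation and wait.
        if ask_user(state.next_action['suggestion']):      # e.g. "Lower contamination to 0.05?"
            state = engine.iterate(state, state.next_action['proposed_change'])
        else:
            return state                                   # user declined; nothing executed

    # A clear (or confirmed) action resets the plan; re-run and re-analyze.
    if state.next_action['action'] == 'run':
        state = engine.run(state)
        state = engine.analyze(state)
    return state
```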
**Skill responsibilities:**
- Translate user intent into `iterate()` feedback
- Present results in a human-readable format
- Decide when to show code vs. just results
- Handle data loading (file paths, formats)

**Engine responsibilities:**
- All OD domain knowledge
- Workflow state tracking
- Algorithm selection, comparison, quality assessment
- Iteration logic (what to change based on feedback)

---

## 6. Backward Compatibility

**No breaking changes. All additions are strictly additive.**

- All existing ADEngine methods remain unchanged with identical signatures: `profile_data`, `plan_detection`, `run_detection`, `analyze_results`, `explain_findings`, `suggest_next_step`, `generate_report`, `detect`, `list_detectors`, `explain_detector`, `compare_detectors`, `get_benchmarks`.
- New session methods are additive: `start`, `plan`, `run`, `analyze`, `iterate`, `report`, `investigate`.
- The session execution method is named `run(state)` to avoid a conflict with the existing `detect(X_train, ...)` one-shot method. Both coexist on the same class.
- `InvestigationState` is a new class in a new file (`pyod/utils/investigation.py`); no conflicts.
- Layer 1 (BaseDetector models) is untouched.

---

## 7. Scope and Non-Goals

**In scope for V3:**
- `InvestigationState` typed dataclass with closed enums and schemas
- Session workflow methods: `start`, `plan`, `run`, `analyze`, `iterate`, `report`, `investigate`
- Multi-detector comparison with rank-normalized consensus scoring
- Result quality assessment (separation, agreement, stability — exact formulas defined)
- Actionable iteration: structured-first with NL best-effort + confirmation
- Updated od-expert skill
- Tests and documentation

**Not in scope:**
- MCP server changes (Python-first; MCP can wrap the session API later)
- Persistent cross-session memory (no persistence beyond the Python session)
- New detectors (all shipped)
- Changes to BaseDetector or any detector classes
- AutoML / hyperparameter search
- UI or visualization

---

## 8. Implementation Feasibility

All new code lives in `pyod/utils/ad_engine.py` (session methods, ~400-500 lines) plus a new `pyod/utils/investigation.py` for `InvestigationState` (~50-100 lines). The od-expert skill update is documentation and workflow instructions (~100 lines).

| Component | Effort | Lines | Dependencies |
|-----------|--------|-------|--------------|
| `InvestigationState` dataclass + enums | Low | ~100 | None (new file `pyod/utils/investigation.py`) |
| Session methods (start, plan, run, analyze, iterate, report) | Medium | ~400 | Existing ADEngine methods |
| Multi-detector comparison + consensus | Medium | ~100 | Existing `run_detection`, `scipy.stats.rankdata`, `scipy.stats.spearmanr` |
| Result quality assessment (3 metrics) | Medium | ~80 | numpy, scipy.stats (both already required) |
| `investigate()` convenience | Low | ~20 | Session methods |
| od-expert skill update | Low | ~100 | Skill markdown |
| Tests | Medium | ~200 | pytest |
| Documentation | Low | ~50 | README, CHANGES |

Total: ~1000 lines of new code, all additive.

---
## 9. Codex Review Resolution (Round 1)

| # | Finding | Status | Resolution |
|---|---------|--------|------------|
| 1 | Blocker: session `detect(state)` conflicts with existing `detect(X_train, ...)` | **Resolved** | Renamed to `run(state)`. Existing `detect()` unchanged. No naming conflict. |
| 2 | Blocker: quality assessment under-defined (dip test not in scipy, stability vague) | **Resolved** | Replaced with 3 exact metrics: score separation (ratio), detector agreement (Spearman), label stability (Jaccard). All use numpy/scipy, already required. No new dependencies. |
| 3 | High: `iterate()` overcommits on NL feedback | **Resolved** | Structured dict feedback is primary (executes immediately). NL feedback is best-effort with a confidence score; if < 0.8, returns a `'confirm_with_user'` action for the agent to present to the user. |
| 4 | Medium: `InvestigationState` schema is open-ended | **Resolved** | Added closed `PHASES` and `ACTION_TYPES` enums. Defined typed schemas for `HistoryEntry`, `DetectorResult`, `ConsensusResult`, `QualityAssessment`, `NextAction`. Added 2 worked examples (after `plan()` and after `analyze()`). |
| 5 | Medium: multi-detector flow underspecified against existing helpers | **Resolved** | Defined how `plan()` wraps `plan_detection()`, how `run()` wraps `run_detection()` per plan with error handling, and how `report()` wraps `generate_report()`. Defined consensus preconditions (same n_samples, rank normalization) and fallbacks for single-detector and error cases. |

## 10. Codex Review Resolution (Round 2)

| # | Finding | Status | Resolution |
|---|---------|--------|------------|
| 1 | Blocker: `next_action` protocol inconsistent — stale `detect`/`iterate` values outside the enum | **Resolved** | Added `'iterate'` to `ACTION_TYPES`. Replaced all stale `detect` references with `run`. Fixed the `investigate()` docstring. Updated all `next_action` examples to use only enum values. |
| 2 | High: `DetectorResult` mismatches the `run_detection()` schema; `state.analysis` untyped | **Resolved** | `DetectorResult` is now a superset of `run_detection()` output (stores the raw result verbatim). Added an `InvestigationAnalysis` typed schema with `consensus_analysis`, `per_detector_analysis`, and `best_detector` (selected by Spearman correlation with the consensus). Updated `report()` to use the typed `best_detector_index`. |
| 3 | High: quality metrics undefined for edge cases | **Resolved** | Added explicit fallback values: empty labels → separation=0.0, NaN Spearman → 0.0, k=0 → stability=0.0, single detector → agreement=0.5, all errors → all metrics=0.0 with `confirm_with_user`. All three metrics always produce defined floats. |
| 4 | Medium: `max_detectors` overstated vs `plan_detection()` | **Resolved** | Capped at 3 in v1 (matches `plan_detection()` output: 1 primary + 2 alternatives). Documented as a v1 limit; higher values are silently capped. |

## 11. Codex Review Resolution (Round 3)

| # | Finding | Status | Resolution |
|---|---------|--------|------------|
| 1 | Medium: `best_detector` has no tie-break or degenerate fallback | **Resolved** | Added a deterministic fallback chain: (1) highest finite Spearman, (2) highest plan confidence, (3) fastest runtime, (4) first successful if all NaN. Single-detector case made explicit. |
| 2 | Medium: `proposed_change` untyped, `suggestion` field scope unclear | **Resolved** | Added a typed `StructuredFeedback` schema with closed action names and required fields. `NextAction` payload documented per action type: `suggestion` valid for `iterate` and `confirm_with_user`, `proposed_change` is a `StructuredFeedback`. |
| 3 | High (still open from R2): `report()` calls `analyze_results()` on the consensus, but the consensus lacks `plan`/`threshold` | **Resolved** | `report()` now uses `per_detector_analysis[best_idx]` (fully compatible with `generate_report()`). `consensus_analysis` is a lightweight summary built by `analyze()` directly, not by calling `analyze_results()`. Session-level consensus info is rendered in a separate section prepended to the report. |

## 12. Codex Review Resolution (Round 4)

| # | Finding | Status | Resolution |
|---|---------|--------|------------|
| 1 | High: `per_detector_analysis` index misaligns with `state.results` when detectors error | **Resolved** | `per_detector_analysis` is now positionally aligned with `state.results`. Error/skipped entries get `None`. `best_detector_index` always points to a successful entry in both lists. |
| 2 | Medium: NextAction payload fields not marked required vs optional per action type | **Resolved** | Each action type now documents R (required) vs O (optional) fields. `confirm_with_user` has optional `suggestion`/`proposed_change` (present for change confirmation, absent for error/retry). |
| 3 | Medium: `consensus_analysis` schema comment still said `analyze_results() output` | **Resolved** | Defined a typed `ConsensusAnalysis` schema. `InvestigationAnalysis` references it by name. The behavior section confirms it is built directly by `analyze()`, not via `analyze_results()`. |

## 13. Codex Review Resolution (Round 5)

| # | Finding | Status | Resolution |
|---|---------|--------|------------|
| 1 | High: `generate_report(format='json')` returns a JSON string, not a dict — wrapper contract broken | **Resolved** | The JSON path now bypasses `generate_report()` and constructs a native Python dict directly from `best_result` and `best_analysis`. The text path still wraps `generate_report(format='text')`. |
| 2 | Medium: all-detectors-error leaves `state.analysis` undefined | **Resolved** | `state.analysis = None` when all detectors error. `state.quality` is set to all zeros with verdict `'low'`. `report(state)` raises `ValueError` when `state.analysis is None`. |