---
title: AE2-Applied-AI-Engineering
emoji: 🏗️
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---

# AE² — Applied AI Engineering Environment

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
[![HuggingFace Space](https://img.shields.io/badge/🤗-Space-yellow)](https://huggingface.co/spaces/sudhanshu-ssd/ae2-env)
[![Python 3.11](https://img.shields.io/badge/python-3.11-green)](https://python.org)
[![Baseline Score](https://img.shields.io/badge/baseline-0.897-brightgreen)](baseline_results.json)

> A benchmark RL environment where AI agents fix and optimize broken production Python code across 5 real-world ML engineering domains.

---

## What is AE²?

AE² (Applied AI Engineering Environment) simulates the day-to-day work of an AI/ML engineer debugging and optimizing broken production code. Unlike toy environments, every task here mirrors genuine engineering work: fixing broken data pipelines, correcting model training bugs, optimizing slow inference code, and repairing API deployment issues.

Agents interact by submitting Python code fixes. The environment executes each submission in an isolated sandbox, runs deterministic test cases, measures runtime and memory efficiency, and returns a shaped reward signal with partial credit for incremental progress.

**Why this matters:** As LLM coding agents grow more capable, rigorous benchmarks must test real engineering judgment, not just syntax completion. AE² fills this gap by grounding evaluation in the production ML engineering scenarios engineers face daily.

---

## Environment Overview

| Property | Value |
|----------|-------|
| Domains | Data Engineering, Model Ops, NLP/LLM, Deployment, Eval Analysis |
| Total Tasks | 15 (5 domains × 3 difficulty levels) |
| Difficulty Levels | EASY → MEDIUM → HARD |
| Reward Range | [-0.3, 1.0], shaped with partial credit |
| Max Steps per Episode | 10 |
| Execution | Sandboxed subprocess with 30s timeout |
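
Conceptually, sandboxed execution is a subprocess call with a hard timeout. Here is a minimal sketch of the idea — not the actual `sandbox.py`, which may add memory limits and test-harness wiring; `run_sandboxed` is a hypothetical helper:

```python
import os
import subprocess
import sys
import tempfile

def run_sandboxed(code: str, timeout_s: float = 30.0) -> tuple[bool, str]:
    """Run untrusted code in a separate Python process with a hard timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        # Success means a clean exit; combined output feeds the agent's feedback
        return proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    finally:
        os.unlink(path)
```

Running in a fresh interpreter process keeps the submission from touching server state, and the timeout bounds runaway loops.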

---

## Action & Observation Spaces

### Action
```python
class EngAction(Action):
    sol: str  # Complete Python function named 'solution'
```

### Observation
```python
class EngObservation(Observation):
    domain: str           # data_eng | model_ops | nlp_llm | deployment | eval_analysis
    difficulty: str       # EASY | MEDIUM | HARD
    task: str             # Full task description and objective
    code: str             # Current broken/partial code to fix
    done: bool            # Episode complete flag
    reward: float         # Reward for last action (-0.3 to 1.0)
    output: str           # Error message or assertion failure from last run
    tests_passed: int     # Number of test cases passed
    num_tests: int        # Total test cases
    time_taken: float     # Execution time in milliseconds
    mem_taken: float      # Memory usage in MiB
    message: str          # Human-readable feedback with error details
    num_steps_remain: int # Remaining attempts in episode
```

---

## Reward Function

Shaped to provide a dense signal across the full episode trajectory:

| Condition | Reward |
|-----------|--------|
| Syntax error | -0.3 |
| Runtime error | -0.2 |
| All tests fail | 0.0 |
| Partial pass (k/n tests) | 0.6 × (k/n) |
| All tests pass | 0.6 base |
| + Speed faster than baseline | up to +0.25 |
| + Memory below baseline | up to +0.15 |
| MEDIUM difficulty bonus | +0.05 |
| HARD difficulty bonus | +0.10 |
| Maximum | 1.0 |

The efficiency bonus rewards agents that produce optimized code, not just correct code — directly reflecting production engineering values.
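
The table above can be sketched as a small function. This is an illustrative reconstruction, not the actual `reward.py`; in particular, the assumption that efficiency and difficulty bonuses apply only on a full pass is mine:

```python
def shaped_reward(tests_passed: int, num_tests: int, *,
                  syntax_error: bool = False, runtime_error: bool = False,
                  speed_bonus: float = 0.0, mem_bonus: float = 0.0,
                  difficulty: str = "EASY") -> float:
    """Illustrative reward shaping following the table above."""
    if syntax_error:
        return -0.3
    if runtime_error:
        return -0.2
    if tests_passed == 0:
        return 0.0
    reward = 0.6 * (tests_passed / num_tests)  # 0.6 base when all tests pass
    if tests_passed == num_tests:
        reward += min(speed_bonus, 0.25)  # speed vs. baseline
        reward += min(mem_bonus, 0.15)    # memory vs. baseline
        reward += {"MEDIUM": 0.05, "HARD": 0.10}.get(difficulty, 0.0)
    return min(reward, 1.0)  # capped at the maximum
```

Note that a fully-optimized HARD solution would otherwise score 0.6 + 0.25 + 0.15 + 0.10 = 1.10, so the cap at 1.0 matters.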

---

## Tasks

### EASY
| Domain | Task | Bug Type |
|--------|------|----------|
| data_eng | Fix Currency Parser | Crash on comma-formatted numbers |
| deployment | Fix Prediction Endpoint | NameError — missing class definition |
| eval_analysis | Fix Accuracy Calculation | Wrong division operand |
| model_ops | Fix sklearn Pipeline | Missing comma in step list |
| nlp_llm | Fix Tokenizer Argument | Invalid keyword argument |

### MEDIUM
| Domain | Task | Bug Type |
|--------|------|----------|
| data_eng | Fix Null Handling | Crash on None values in list |
| deployment | Fix Pydantic Validation | Wrong type annotation |
| eval_analysis | Fix F1 Averaging | Wrong averaging strategy |
| model_ops | Fix Feature Scaling | Missing StandardScaler |
| nlp_llm | Fix Batch Padding | Missing padding parameter |
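
To give a flavor of a MEDIUM fix, consider the null-handling bug: the broken code aggregates a list that can contain `None` and crashes. A hypothetical repaired `solution` (the real task's signature and data may differ):

```python
def solution(values: list) -> float:
    """Average the numeric entries, skipping None instead of crashing in sum()."""
    clean = [v for v in values if v is not None]
    return sum(clean) / len(clean) if clean else 0.0
```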

### HARD
| Domain | Task | Challenge |
|--------|------|-----------|
| data_eng | Vectorize GroupBy | Replace O(n²) loop with pandas vectorization |
| deployment | Solve N+1 Latency | Replace 100 sequential calls with 1 bulk call |
| eval_analysis | Optimize Regex | Pre-compile pattern outside hot loop |
| model_ops | Mixed Precision Inference | Implement torch.amp.autocast |
| nlp_llm | Batched Tokenization | Replace loop with single batch call |
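
The HARD tasks reward optimization rather than repair. The regex task, for instance, asks the agent to hoist pattern compilation out of the hot loop — roughly this shape (illustrative; the actual task's pattern and signature may differ):

```python
import re

# Compiled once at module load instead of on every call inside the loop
ACC_PATTERN = re.compile(r"accuracy=(\d+\.\d+)")

def extract_accuracies(lines: list[str]) -> list[float]:
    """Scan log lines for accuracy values using the precompiled pattern."""
    return [float(m.group(1)) for line in lines if (m := ACC_PATTERN.search(line))]
```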

---

## Baseline Scores

Evaluated using `meta-llama/Llama-3.1-8B-Instruct` via the HuggingFace Router (OpenAI-compatible):

| Domain | EASY | MEDIUM | HARD |
|--------|------|--------|------|
| data_eng | 1.000 | 1.000 | 0.600 |
| deployment | 1.000 | 0.793 | 1.000 |
| eval_analysis | 1.000 | 1.000 | 1.000 |
| model_ops | 1.000 | 1.000 | 1.000 |
| nlp_llm | 1.000 | 1.000 | 1.000 |
| **Average** | **0.894** | **0.868** | **0.929** |

**Overall Baseline Score: 0.897** — Full results in [baseline_results.json](baseline_results.json).

---

## Setup & Usage

### Local Development

```bash
# Install dependencies
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

# Pre-download the model
python -c "from transformers import AutoTokenizer, AutoModel; AutoTokenizer.from_pretrained('distilbert-base-uncased'); AutoModel.from_pretrained('distilbert-base-uncased')"

# Start server
uvicorn app:app --host 0.0.0.0 --port 7860

# Set env vars and run baseline
export OPENAI_API_KEY=your_api_key
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
export AE2_URL=http://localhost:7860
python inference.py
```

### Docker

```bash
docker build -t ae2-env .
docker run -p 7860:7860 -e OPENAI_API_KEY=your_key -e AE2_URL=http://localhost:7860 ae2-env
```

### Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| `OPENAI_API_KEY` | LLM API key | Yes |
| `API_BASE_URL` | LLM endpoint | No (defaults to HF router) |
| `MODEL_NAME` | Model identifier | No (defaults to Llama-3.1-8B) |
| `AE2_URL` | Environment server URL | No (defaults to localhost:7860) |
---

## Using the Environment

```python
from client import AE2Env
from models import EngAction

with AE2Env(base_url="https://sudhanshu-ssd-ae2-env.hf.space").sync() as env:
    result = env.reset(task_id="data_eng_easy_001")
    obs = result.observation
    print(f"Task: {obs.task}")

    result = env.step(EngAction(sol="""
def solution(value: str) -> float:
    cleaned = value.replace('$', '').replace(',', '')
    return float(cleaned)
"""))
    print(f"Reward: {result.reward} | Tests: {result.observation.tests_passed}/{result.observation.num_tests}")
```

---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start episode, get initial observation |
| `/step` | POST | Submit code fix, get reward |
| `/ws` | WebSocket | Stateful connection (recommended) |
| `/tasks` | GET | List all 15 tasks with action schema |
| `/grader` | GET | Score a task+code pair directly |
| `/baseline` | POST | Return pre-computed baseline scores |
| `/health` | GET | Health check |
| `/docs` | GET | Swagger UI |
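
For raw HTTP use, `/reset` and `/step` accept JSON bodies mirroring the dataclasses above. The exact field nesting is best confirmed via `/docs`; a plausible `/step` payload, assuming the action serializes its `sol` field directly, might look like:

```json
{
  "action": {
    "sol": "def solution(value: str) -> float:\n    return float(value.replace('$', '').replace(',', ''))"
  }
}
```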

---

## Project Structure

```
ae2-env/
├── environment.py        # EngEnv — OpenEnv Environment subclass
├── models.py             # EngAction, EngObservation, EngState
├── grader.py             # Deterministic grader + compare_results
├── reward.py             # Shaped reward function
├── sandbox.py            # Isolated subprocess execution
├── task_loader.py        # Task and test loader
├── client.py             # AE2Env WebSocket client
├── inference.py          # Baseline inference script
├── openenv.yaml          # OpenEnv spec metadata
├── requirements.txt
├── Dockerfile
├── server/
│   ├── __init__.py
│   └── app.py             # FastAPI server
├── baseline_results.json  # Pre-computed baseline scores
└── tasks/                 # 15 task directories
    └── {task_id}/
        ├── task.json      # Task definition + broken code
        └── tests.json     # Test cases
```

---

## Authors

**Sudhanshu Saini** — B.Tech. AI & Data Science, CTAE Udaipur  
**Tanmay Tomar** — B.Tech. AI & Data Science, MITRC Alwar