---
title: AE2-Applied-AI-Engineering
emoji: 🏗️
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 7860
---

# AE² — Applied AI Engineering Environment

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Live Space](https://huggingface.co/spaces/sudhanshu-ssd/ae2-env) · [Python](https://python.org) · [Baseline Results](baseline_results.json)

> A benchmark RL environment where AI agents fix and optimize broken production Python code across 5 real-world ML engineering domains.

---

## What is AE²?

AE² (Applied AI Engineering Environment) simulates the real-world task of an AI/ML engineer debugging and optimizing broken production code. Unlike toy environments, every task here mirrors genuine engineering work: fixing broken data pipelines, correcting model training bugs, optimizing slow inference code, and repairing API deployment issues.

Agents interact by submitting Python code fixes. The environment executes submissions in an isolated sandbox, runs deterministic test cases, measures runtime and memory efficiency, and returns a shaped reward signal with partial credit for incremental progress.

**Why this matters:** As LLM coding agents grow more capable, rigorous benchmarks must test real engineering judgment, not just syntax completion. AE² fills this gap by grounding evaluation in production ML engineering scenarios that engineers face daily.

---

## Environment Overview

| Property | Value |
|----------|-------|
| Domains | Data Engineering, Model Ops, NLP/LLM, Deployment, Eval Analysis |
| Total Tasks | 15 (5 domains × 3 difficulty levels) |
| Difficulty Levels | EASY → MEDIUM → HARD |
| Reward Range | [-0.3, 1.0], shaped with partial credit |
| Max Steps per Episode | 10 |
| Execution | Sandboxed subprocess with 30s timeout |

---

## Action & Observation Spaces

### Action
```python
class EngAction(Action):
    sol: str  # Complete Python function named 'solution'
```

### Observation
```python
class EngObservation(Observation):
    domain: str            # data_eng | model_ops | nlp_llm | deployment | eval_analysis
    difficulty: str        # EASY | MEDIUM | HARD
    task: str              # Full task description and objective
    code: str              # Current broken/partial code to fix
    done: bool             # Episode complete flag
    reward: float          # Reward for last action (-0.3 to 1.0)
    output: str            # Error message or assertion failure from last run
    tests_passed: int      # Number of test cases passed
    num_tests: int         # Total test cases
    time_taken: float      # Execution time in milliseconds
    mem_taken: float       # Memory usage in MiB
    message: str           # Human-readable feedback with error details
    num_steps_remain: int  # Remaining attempts in episode
```

---

## Reward Function

Shaped to provide a dense signal across the full episode trajectory:

| Condition | Reward |
|-----------|--------|
| Syntax error | -0.3 |
| Runtime error | -0.2 |
| All tests fail | 0.0 |
| Partial pass (k/n tests) | 0.6 × (k/n) |
| All tests pass | 0.6 base |
| + Speed faster than baseline | up to +0.25 |
| + Memory below baseline | up to +0.15 |
| MEDIUM difficulty bonus | +0.05 |
| HARD difficulty bonus | +0.10 |
| Maximum | 1.0 |

The efficiency bonus rewards agents that produce optimized code, not just correct code — directly reflecting production engineering values.
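To make the shaping concrete, here is a minimal sketch of how these rules could compose. The canonical implementation lives in `reward.py`; the proportional scaling of the efficiency bonuses, when the difficulty bonus applies, and the final cap are assumptions in this sketch, not confirmed behavior:

```python
def shaped_reward(
    syntax_error: bool,
    runtime_error: bool,
    tests_passed: int,
    num_tests: int,
    time_taken: float,     # agent runtime in ms
    mem_taken: float,      # agent memory in MiB
    baseline_time: float,  # reference-solution runtime in ms
    baseline_mem: float,   # reference-solution memory in MiB
    difficulty: str,       # EASY | MEDIUM | HARD
) -> float:
    """Illustrative sketch of the reward table above; see reward.py for the real logic."""
    if syntax_error:
        return -0.3
    if runtime_error:
        return -0.2
    if tests_passed == 0:
        return 0.0
    if tests_passed < num_tests:
        return 0.6 * tests_passed / num_tests  # partial credit
    reward = 0.6  # base credit: all tests pass
    # Efficiency bonuses (assumed proportional to the improvement over baseline).
    if time_taken < baseline_time:
        reward += 0.25 * min(1.0, (baseline_time - time_taken) / baseline_time)
    if mem_taken < baseline_mem:
        reward += 0.15 * min(1.0, (baseline_mem - mem_taken) / baseline_mem)
    # Difficulty bonus (assumed to apply only once the solution is fully correct).
    reward += {"MEDIUM": 0.05, "HARD": 0.10}.get(difficulty, 0.0)
    return min(reward, 1.0)  # assumed cap at the documented maximum
```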
---

## Tasks

### EASY
| Domain | Task | Bug Type |
|--------|------|----------|
| data_eng | Fix Currency Parser | Crash on comma-formatted numbers |
| deployment | Fix Prediction Endpoint | NameError — missing class definition |
| eval_analysis | Fix Accuracy Calculation | Wrong division operand |
| model_ops | Fix sklearn Pipeline | Missing comma in step list |
| nlp_llm | Fix Tokenizer Argument | Invalid keyword argument |

### MEDIUM
| Domain | Task | Bug Type |
|--------|------|----------|
| data_eng | Fix Null Handling | Crash on None values in list |
| deployment | Fix Pydantic Validation | Wrong type annotation |
| eval_analysis | Fix F1 Averaging | Wrong averaging strategy |
| model_ops | Fix Feature Scaling | Missing StandardScaler |
| nlp_llm | Fix Batch Padding | Missing padding parameter |

### HARD
| Domain | Task | Challenge |
|--------|------|-----------|
| data_eng | Vectorize GroupBy | Replace O(n²) loop with pandas vectorization |
| deployment | Solve N+1 Latency | Replace 100 sequential calls with 1 bulk call |
| eval_analysis | Optimize Regex | Pre-compile pattern outside hot loop |
| model_ops | Mixed Precision Inference | Implement torch.amp.autocast |
| nlp_llm | Batched Tokenization | Replace loop with single batch call |

---

## Baseline Scores

Evaluated using `meta-llama/Llama-3.1-8B-Instruct` via the Hugging Face Router (OpenAI-compatible):

| Domain | EASY | MEDIUM | HARD |
|--------|------|--------|------|
| data_eng | 1.000 | 1.000 | 0.600 |
| deployment | 1.000 | 0.793 | 1.000 |
| eval_analysis | 1.000 | 1.000 | 1.000 |
| model_ops | 1.000 | 1.000 | 1.000 |
| nlp_llm | 1.000 | 1.000 | 1.000 |
| **Average** | **0.894** | **0.868** | **0.929** |

**Overall Baseline Score: 0.897** — Full results in [baseline_results.json](baseline_results.json).

---

## Setup & Usage

### Local Development

```bash
# Install dependencies (CPU-only torch wheel)
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt

# Pre-download the model
python -c "from transformers import AutoTokenizer, AutoModel; AutoTokenizer.from_pretrained('distilbert-base-uncased'); AutoModel.from_pretrained('distilbert-base-uncased')"

# Start server
uvicorn app:app --host 0.0.0.0 --port 7860

# Set env vars and run the baseline
export OPENAI_API_KEY=your_api_key
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
export AE2_URL=http://localhost:7860
python inference.py
```

### Docker

```bash
docker build -t ae2-env .
docker run -p 7860:7860 -e OPENAI_API_KEY=your_key -e AE2_URL=http://localhost:7860 ae2-env
```

### Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| `OPENAI_API_KEY` | LLM API key | Yes |
| `API_BASE_URL` | LLM endpoint | No (defaults to HF router) |
| `MODEL_NAME` | Model identifier | No (defaults to Llama-3.1-8B) |
| `AE2_URL` | Environment server URL | No (defaults to localhost:7860) |
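For reference, here is a minimal sketch of how a baseline client might consume these variables, assuming the `openai` Python package (the README only states the endpoint is OpenAI-compatible; the actual baseline logic lives in `inference.py`):

```python
import os

from openai import OpenAI  # any OpenAI-compatible client works against the router

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
)
model = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")

# Ask the model for a repaired `solution` function for one task observation.
response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "Reply with only a complete Python function named `solution`."},
        {"role": "user", "content": "<obs.task and obs.code from the environment>"},
    ],
)
print(response.choices[0].message.content)
```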
---

## Using the Environment

```python
from client import AE2Env
from models import EngAction

with AE2Env(base_url="https://sudhanshu-ssd-ae2-env.hf.space").sync() as env:
    result = env.reset(task_id="data_eng_easy_001")
    obs = result.observation
    print(f"Task: {obs.task}")

    result = env.step(EngAction(sol="""
def solution(value: str) -> float:
    cleaned = value.replace('$', '').replace(',', '')
    return float(cleaned)
"""))
    print(f"Reward: {result.reward} | Tests: {result.observation.tests_passed}/{result.observation.num_tests}")
```

---

## API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start an episode, get the initial observation |
| `/step` | POST | Submit a code fix, get a reward |
| `/ws` | WebSocket | Stateful connection (recommended) |
| `/tasks` | GET | List all 15 tasks with the action schema |
| `/grader` | GET | Score a task+code pair directly |
| `/baseline` | POST | Return pre-computed baseline scores |
| `/health` | GET | Health check |
| `/docs` | GET | Swagger UI |

---

## Project Structure

```
ae2-env/
├── environment.py        # EngEnv — OpenEnv Environment subclass
├── models.py             # EngAction, EngObservation, EngState
├── grader.py             # Deterministic grader + compare_results
├── reward.py             # Shaped reward function
├── sandbox.py            # Isolated subprocess execution
├── task_loader.py        # Task and test loader
├── client.py             # AE2Env WebSocket client
├── inference.py          # Baseline inference script
├── openenv.yaml          # OpenEnv spec metadata
├── requirements.txt
├── Dockerfile
├── server/
│   ├── __init__.py
│   └── app.py            # FastAPI server
├── baseline_results.json # Pre-computed baseline scores
└── tasks/                # 15 task directories
    └── {task_id}/
        ├── task.json     # Task definition + broken code
        └── tests.json    # Test cases
```

---

## Authors

**Sudhanshu Saini** — B.Tech. AI & Data Science, CTAE Udaipur
**Tanmay Tomar** — B.Tech. AI & Data Science, MITRC Alwar