baseline_results.json
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "results": [
    {
      "task": "This function should parse currency strings like '$1,000' into floats but crashes on comma-formatted numbers. Fix the syntax and logic so it handles all currency formats correctly. Function must be named 'solution'.",
      "domain": "data_eng",
      "difficulty": "EASY",
      "steps_taken": 1,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "This function takes a plain dict 'request_dict' and must return a dict with keys 'text', 'label', and 'threshold'. The label is 'positive' if threshold > 0.5 else 'negative'. Fix the NameError by defining the PredictionRequest class inside the function or at module level. Do NOT change the function signature. Function must be named 'solution' and must accept exactly one argument: request_dict (dict).",
      "domain": "deployment",
      "difficulty": "EASY",
      "steps_taken": 1,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "This function should calculate accuracy (correct predictions / total predictions) but has a syntax error and divides by the wrong value. Fix it to return accuracy as a float between 0.0 and 1.0. Function must be named 'solution'.",
      "domain": "eval_analysis",
      "difficulty": "EASY",
      "steps_taken": 1,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "The Scikit-learn Pipeline initialization is missing a comma between steps and has incorrect tuple nesting. Fix it so it returns a valid fitted pipeline. Function must be named 'solution'. Fix the Pipeline syntax and return ONLY int(pipe.predict(X)[0]) \u2014 the prediction for the first sample.",
      "domain": "model_ops",
      "difficulty": "EASY",
      "steps_taken": 1,
      "final_reward": 0.99,
      "grader_score": 0.75,
      "tests_passed": 3,
      "total_tests": 3,
      "success": true
    },
    {
      "task": "The function uses 'input_str' as keyword argument which is invalid. Fix it to use the correct positional argument. Return the flat list of token IDs using tokenizer(text_input)['input_ids'] directly \u2014 this returns a plain Python list without nesting. Function must be named 'solution'.",
      "domain": "nlp_llm",
      "difficulty": "EASY",
      "steps_taken": 2,
      "final_reward": 0.99,
      "grader_score": 0.75,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "This function processes a list of text samples and returns their lengths. It crashes when a None value is present in the list. Fix it to handle nulls by returning 0 for None entries. Function must be named 'solution'.",
      "domain": "data_eng",
      "difficulty": "MEDIUM",
      "steps_taken": 1,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "The Pydantic model 'InferenceRequest' is failing when the 'confidence' score is a float because it was defined as an 'int'. Fix the type hint. Function must be named 'solution'.",
      "domain": "deployment",
      "difficulty": "MEDIUM",
      "steps_taken": 2,
      "final_reward": 0.72,
      "grader_score": 0.75,
      "tests_passed": 3,
      "total_tests": 3,
      "success": true
    },
    {
      "task": "This function currently uses 'macro' averaging to calculate the F1 score. Modify the function to use 'weighted' averaging instead. Ensure you keep the existing f1_score import from sklearn.metrics. Return the result rounded to 4 decimal places.",
      "domain": "eval_analysis",
      "difficulty": "MEDIUM",
      "steps_taken": 2,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "LogisticRegression fails to converge on large-scale features. Fix by adding StandardScaler to the pipeline before the classifier. Fit on the entire dataset (X, y) without splitting. Return the integer prediction for the last sample. Function must be named 'solution'.",
      "domain": "model_ops",
      "difficulty": "MEDIUM",
      "steps_taken": 1,
      "final_reward": 0.99,
      "grader_score": 0.75,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "Enable padding so that a list of strings can be processed into a single torch tensor. Return the total number of elements (int) in the 'input_ids' tensor using .numel(). Function must be named 'solution'.",
      "domain": "nlp_llm",
      "difficulty": "MEDIUM",
      "steps_taken": 2,
      "final_reward": 0.908,
      "grader_score": 0.862,
      "tests_passed": 2,
      "total_tests": 2,
      "success": true
    },
    {
      "task": "You are given a large DataFrame of sales. Calculate the average price per category. The current 'solution' uses a for-loop which is too slow for production. Rewrite it using vectorized Pandas operations to meet the performance baseline. Function must be named 'solution'. Return the result as a Python dict: {'category': mean_price, ...}",
      "domain": "data_eng",
      "difficulty": "HARD",
      "steps_taken": 2,
      "final_reward": 0.99,
      "grader_score": 0.924,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "The 'get_user_profiles' function is hitting the database 100 times for 100 users, causing massive latency. Fix the logic to use the 'bulk_fetch_profiles' function instead of calling 'single_fetch' in a loop. Function must be named 'solution'.",
      "domain": "deployment",
      "difficulty": "HARD",
      "steps_taken": 4,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "This function parses server logs to extract error counts but recompiles the regex pattern on every iteration, making it extremely slow on large logs. Fix it by compiling the pattern once before the loop. Return total error count as integer. Function must be named 'solution'.",
      "domain": "eval_analysis",
      "difficulty": "HARD",
      "steps_taken": 2,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    },
    {
      "task": "The current inference function is slow and consumes too much memory because it uses full precision (float32) on the CPU. Optimize it by: 1. Moving the model and input to 'cuda' if available, otherwise 'cpu'. 2. Using torch.cuda.amp.autocast (or torch.amp.autocast) for mixed precision. 3. Ensure the output is moved back to CPU and converted to a float. Function must be named 'solution'.",
      "domain": "model_ops",
      "difficulty": "HARD",
      "steps_taken": 2,
      "final_reward": 0.99,
      "grader_score": 0.99,
      "tests_passed": 3,
      "total_tests": 3,
      "success": true
    },
    {
      "task": "This function tokenizes a list of texts one by one in a loop, which is extremely slow for large inputs. Rewrite it to use batched tokenization in a single call WITHOUT padding. Return a plain Python list of input_id lists (no numpy arrays, no nesting). Each inner list contains the token IDs for one text. Function must be named 'solution'.",
      "domain": "nlp_llm",
      "difficulty": "HARD",
      "steps_taken": 3,
      "final_reward": 0.99,
      "grader_score": 0.75,
      "tests_passed": 5,
      "total_tests": 5,
      "success": true
    }
  ],
  "summary": {
    "EASY": 0.894,
    "MEDIUM": 0.8684,
    "HARD": 0.9288
  },
  "overall": 0.8970666666666667
}