slotsave.feature
@llama.cpp
@slotsave
Feature: llama.cpp server slot management

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And prompt caching is enabled
    And 2 slots
    And . as slot save path
    And 2048 KV cache size
    And 42 as server seed
    And 24 max tokens to predict
    Then the server is starting
    Then the server is healthy

  Scenario: Save and Restore Slot
    # First prompt in slot 1 should be fully processed
    Given a user prompt "What is the capital of France?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
    When the slot 1 is saved with filename "slot1.bin"
    Then the server responds with status code 200
    # Since the prompt is cached, only the trailing tokens that differ should be processed
    Given a user prompt "What is the capital of Germany?"
    And a completion request with no api error
    Then 24 tokens are predicted matching (Thank|special)
    And 7 prompt tokens are processed
    # After restoring the original cache into slot 0,
    # only 1 prompt token should be processed and the output should match
    When the slot 0 is restored with filename "slot1.bin"
    Then the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And using slot id 0
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 1 prompt tokens are processed
    # Verify that slot 1 was not corrupted during the slot 0 restore: same check as above
    Given a user prompt "What is the capital of Germany?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Thank|special)
    And 1 prompt tokens are processed

  Scenario: Erase Slot
    Given a user prompt "What is the capital of France?"
    And using slot id 1
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
    When the slot 1 is erased
    Then the server responds with status code 200
    Given a user prompt "What is the capital of France?"
    And a completion request with no api error
    Then 24 tokens are predicted matching (Lily|cake)
    And 22 prompt tokens are processed
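
# Note: the save/restore/erase steps above are executed by step definitions outside this
# file. As an assumption (the exact routes and payloads live in that step code, not here),
# they roughly correspond to the server's slot management endpoints, e.g.:
#   POST /slots/1?action=save    with JSON body {"filename": "slot1.bin"}
#   POST /slots/0?action=restore with JSON body {"filename": "slot1.bin"}
#   POST /slots/1?action=erase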