server.feature
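
# Note: these scenarios are executed by Python step definitions via the behave framework;
# assuming the accompanying test harness is installed, a typical run might look like:
#   behave --tags llama.cpp server.feature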
@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And a model file test-model.gguf
    And a model alias tinyllama-2
    And BOS token is 1
    And 42 as server seed
    # KV Cache corresponds to the total amount of tokens
    # that can be stored across all independent sequences: #4130
    # see --ctx-size and #5568
    And 256 KV cache size
    And 32 as batch size
    And 2 slots
    And 64 server max tokens to predict
    And prometheus compatible metrics exposed
    Then the server is starting
    Then the server is healthy
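
  # For orientation: the Background above is roughly equivalent to launching the server
  # by hand with something like the following (binary and model paths are illustrative,
  # not taken from this file; flags assume a recent llama.cpp server build):
  #   llama-server --host localhost --port 8080 \
  #     --model stories260K.gguf --alias tinyllama-2 --seed 42 \
  #     --ctx-size 256 --batch-size 32 --parallel 2 --n-predict 64 --metrics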

  Scenario: Health
    Then the server is ready
    And all slots are idle


  Scenario Outline: Completion
    Given a prompt <prompt>
    And <n_predict> max tokens to predict
    And a completion request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And the completion is <truncated> truncated
    And <n_prompt> prompt tokens are processed
    And prometheus metrics are exposed
    And metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt                                                                    | n_predict | re_content                                  | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is                                          | 8         | (read\|going)+                              | 18       | 8           | not       |
      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids\|Anna\|forest)+ | 46       | 64          | not       |

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And a completion request with no api error
    Then 64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And the completion is truncated
    And 109 prompt tokens are processed


  Scenario Outline: OAI Compatibility
    Given a model <model>
    And a system prompt <system_prompt>
    And a user prompt <user_prompt>
    And <max_tokens> max tokens to predict
    And streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>
    And <n_prompt> prompt tokens are processed
    And the completion is <truncated> truncated

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                        | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+                     | 77       | 8           | disabled         | not       |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird\|Annabyear)+ | -1       | 64          | enabled          |           |


  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And a system prompt test
    And a user prompt test
    And a response format <response_format>
    And 10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format                                                      | n_predicted | re_content  |
      | {"type": "json_object", "schema": {"const": "42"}}                  | 5           | "42"        |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10          | \[ -300 \]  |
      | {"type": "json_object"}                                              | 10          | \{ " Jacky. |


  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
    And tokens do not begin with BOS

  Scenario: Tokenize w/ BOS
    Given adding special tokens
    When tokenizing:
    """
    What is the capital of Germany?
    """
    Then tokens begin with BOS
    Given first token is removed
    Then tokens can be detokenized

  Scenario: Models available
    Given available models
    Then 1 models are supported
    Then model 0 is identified by tinyllama-2
    Then model 0 is trained on 128 tokens context
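
# For reference, the completion and OAI-compatible scenarios above boil down to plain HTTP
# requests against the server; illustrative (not exhaustive) equivalents, using values from
# the first Examples rows, would be:
#   curl http://localhost:8080/completion -d '{"prompt": "I believe the meaning of life is", "n_predict": 8}'
#   curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' \
#     -d '{"model": "llama-2", "messages": [{"role": "system", "content": "Book"},
#                                           {"role": "user", "content": "What is the best book"}],
#          "max_tokens": 8, "stream": false}'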