# examples/server/tests/features/server.feature
@llama.cpp
@server
Feature: llama.cpp server

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file tinyllamas/stories260K.gguf from HF repo ggml-org/models
    And   a model file test-model.gguf
    And   a model alias tinyllama-2
    And   BOS token is 1
    And   42 as server seed
      # KV Cache corresponds to the total amount of tokens
      # that can be stored across all independent sequences: #4130
      # see --ctx-size and #5568
    And   256 KV cache size
    And   32 as batch size
    And   2 slots
    And   64 server max tokens to predict
    And   prometheus compatible metrics exposed
    Then  the server is starting
    Then  the server is healthy
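    # A rough sketch (not part of the test harness) of the server launch the steps
    # above imply; flag names and the model path are assumptions and may differ
    # across llama.cpp versions:
    #   import subprocess
    #   subprocess.Popen(["./server", "--host", "localhost", "--port", "8080",
    #                     "-m", "stories260K.gguf", "--alias", "tinyllama-2",
    #                     "--seed", "42", "--ctx-size", "256", "--batch-size", "32",
    #                     "--parallel", "2", "--n-predict", "64", "--metrics"])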

  Scenario: Health
    Then the server is ready
    And  all slots are idle
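    # Minimal sketch of the readiness check behind these steps, assuming the /health
    # endpoint; the idle-slots check additionally inspects a /slots style endpoint:
    #   import requests
    #   assert requests.get("http://localhost:8080/health").json()["status"] == "ok"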


  Scenario Outline: Completion
    Given a prompt <prompt>
    And   <n_predict> max tokens to predict
    And   a completion request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   the completion is <truncated> truncated
    And   <n_prompt> prompt tokens are processed
    And   prometheus metrics are exposed
    And   metric llamacpp:tokens_predicted is <n_predicted>

    Examples: Prompts
      | prompt                                                                    | n_predict | re_content                                  | n_prompt | n_predicted | truncated |
      | I believe the meaning of life is                                          | 8         | (read\|going)+                              | 18       | 8           | not       |
      | Write a joke about AI from a very long prompt which will not be truncated | 256       | (princesses\|everyone\|kids\|Anna\|forest)+ | 46       | 64          | not       |
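    # Roughly the request one row of this outline issues, assuming the /completion
    # endpoint and field names; the assertions then read response fields such as
    # tokens_predicted/truncated and scrape /metrics for llamacpp:tokens_predicted:
    #   import requests
    #   res = requests.post("http://localhost:8080/completion",
    #                       json={"prompt": "I believe the meaning of life is",
    #                             "n_predict": 8}).json()
    #   print(res["content"], res.get("truncated"))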

  Scenario: Completion prompt truncated
    Given a prompt:
    """
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
    Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    """
    And   a completion request with no api error
    Then  64 tokens are predicted matching fun|Annaks|popcorns|pictry|bowl
    And   the completion is  truncated
    And   109 prompt tokens are processed


  Scenario Outline: OAI Compatibility
    Given a model <model>
    And   a system prompt <system_prompt>
    And   a user prompt <user_prompt>
    And   <max_tokens> max tokens to predict
    And   streaming is <enable_streaming>
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>
    And   <n_prompt> prompt tokens are processed
    And   the completion is <truncated> truncated

    Examples: Prompts
      | model        | system_prompt               | user_prompt                          | max_tokens | re_content                        | n_prompt | n_predicted | enable_streaming | truncated |
      | llama-2      | Book                        | What is the best book                | 8          | (Here\|what)+                     | 77       | 8           | disabled         | not       |
      | codellama70b | You are a coding assistant. | Write the fibonacci function in c++. | 128        | (thanks\|happy\|bird\|Annabyear)+ | -1       | 64          | enabled          |           |
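    # Approximate equivalent of the first row above as an OpenAI-style request,
    # assuming the /v1/chat/completions endpoint:
    #   import requests
    #   res = requests.post("http://localhost:8080/v1/chat/completions", json={
    #       "model": "llama-2",
    #       "messages": [{"role": "system", "content": "Book"},
    #                    {"role": "user", "content": "What is the best book"}],
    #       "max_tokens": 8, "stream": False}).json()
    #   print(res["choices"][0]["message"]["content"])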


  Scenario Outline: OAI Compatibility w/ response format
    Given a model test
    And   a system prompt test
    And   a user prompt test
    And   a response format <response_format>
    And   10 max tokens to predict
    Given an OAI compatible chat completions request with no api error
    Then  <n_predicted> tokens are predicted matching <re_content>

    Examples: Prompts
      | response_format                                                     | n_predicted | re_content             |
      | {"type": "json_object", "schema": {"const": "42"}}                  | 5           | "42"                   |
      | {"type": "json_object", "schema": {"items": [{"type": "integer"}]}} | 10          | \[ -300 \]             |
      | {"type": "json_object"}                                             | 10          | \{ " Jacky.            |
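    # Same chat endpoint with the response_format field taken verbatim from the table
    # above; treat the payload shape as a sketch rather than a stable API:
    #   import requests
    #   requests.post("http://localhost:8080/v1/chat/completions", json={
    #       "model": "test",
    #       "messages": [{"role": "system", "content": "test"},
    #                    {"role": "user", "content": "test"}],
    #       "max_tokens": 10,
    #       "response_format": {"type": "json_object", "schema": {"const": "42"}}})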


  Scenario: Tokenize / Detokenize
    When tokenizing:
    """
    What is the capital of France ?
    """
    Then tokens can be detokenized
    And  tokens do not begin with BOS
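    # Sketch of the round trip, assuming /tokenize and /detokenize endpoints with
    # content/tokens fields:
    #   import requests
    #   toks = requests.post("http://localhost:8080/tokenize",
    #                        json={"content": "What is the capital of France ?"}).json()["tokens"]
    #   text = requests.post("http://localhost:8080/detokenize",
    #                        json={"tokens": toks}).json()["content"]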

  Scenario: Tokenize w/ BOS
    Given adding special tokens
    When  tokenizing:
    """
    What is the capital of Germany?
    """
    Then  tokens begin with BOS
    Given first token is removed
    Then  tokens can be detokenized
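    # Same sketch with special tokens requested, assuming an add_special flag makes
    # the server prepend BOS (token id 1 per the Background):
    #   import requests
    #   toks = requests.post("http://localhost:8080/tokenize",
    #                        json={"content": "What is the capital of Germany?",
    #                              "add_special": True}).json()["tokens"]
    #   assert toks[0] == 1  # BOS, dropped before detokenizing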

  Scenario: Models available
    Given available models
    Then  1 models are supported
    Then  model 0 is identified by tinyllama-2
    Then  model 0 is trained on 128 tokens context
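    # Sketch of the models listing, assuming an OpenAI-style /v1/models endpoint; the
    # field carrying the 128-token training context is version dependent:
    #   import requests
    #   models = requests.get("http://localhost:8080/v1/models").json()["data"]
    #   assert len(models) == 1 and models[0]["id"] == "tinyllama-2"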