Fixing long prompt truncation
Introduction
During WAICF, I noticed that the model would not behave as expected after a few rounds of chat. The issue was likely the 32K context window being exceeded. After !43 (merged), this issue is less likely to arise, since past context chunks no longer re-enter through the chat history. However, there are currently no safeguards against exceeding the context window, so it seems like a good idea to sanitize the inputs a bit.
Implementation details
We take the precedence order system prompt > last user message > chat history: the system prompt must be smaller than the context window, the last user message is truncated if it (together with the system prompt) exceeds the context window, and the chat history is then truncated to fit the remaining space.
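A minimal sketch of that truncation logic, assuming a count_tokens helper and an OpenAI-style message list (the function and parameter names are illustrative, not the actual implementation):

```python
# Sketch of the precedence-based truncation described above; names and the
# message format are illustrative, not the real code.
def truncate_inputs(system_prompt, history, last_user_msg, count_tokens, ctx_size=4096):
    """Enforce the precedence: system prompt > last user message > chat history."""
    budget = ctx_size - count_tokens(system_prompt)
    assert budget > 0, "the system prompt alone must fit in the context window"

    # Truncate the last user message until it fits in what the system prompt left over.
    while count_tokens(last_user_msg) > budget:
        last_user_msg = last_user_msg[:-100]  # chop the tail and re-check
    budget -= count_tokens(last_user_msg)

    # Drop the oldest history turns until the rest fits in the remaining space.
    kept = list(history)
    while kept and sum(count_tokens(m["content"]) for m in kept) > budget:
        kept.pop(0)

    return system_prompt, kept, last_user_msg
```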
Furthermore, I realized we have been using Mistral models incorrectly. In short, we delegated prompt building to Ollama, which uses the wrong template format, so I forked the official Modelfile and applied a fix. Another necessary change is that Mistral does not support system control tokens. According to a cookbook, and some reverse engineering with the mistral-common library, the system prompt tokens are prepended to the last user message with a double-newline separator (i.e., \n\n). After some experimentation, I found that the model is sensitive to this separator, so I decided to replace double newlines in the system prompt with single newlines.
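For reference, a sketch of the resulting prompt layout, assuming the standard [INST]...[/INST] instruct template; this illustrates the behavior described above and is not the forked Modelfile verbatim:

```python
# Sketch of Mistral-style prompt assembly. The [INST]/[/INST] wrapping follows
# the standard instruct template; the separator handling mirrors the description
# above, but this is not the forked Modelfile itself.
def build_prompt(system_prompt, turns, last_user_msg):
    # No system control tokens: the model is sensitive to "\n\n", so collapse
    # double newlines inside the system prompt ...
    system_prompt = system_prompt.replace("\n\n", "\n")

    parts = []
    for user, assistant in turns:  # previous (user, assistant) exchanges
        parts.append(f"[INST] {user} [/INST] {assistant}</s>")

    # ... and prepend it to the last user message with a double-newline separator.
    parts.append(f"[INST] {system_prompt}\n\n{last_user_msg} [/INST]")
    return "".join(parts)
```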
The last important change is that I decided to decrease the context window from 32K to 4K. With a 4K context window, the model should have enough room for a concise history, short user messages, and some free space for producing a long response. If it proves too small in practice, we can increase it again. The benefit of decreasing it becomes more pronounced as the history grows: in fact, I observed a roughly 10x speedup for very long prompts.
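As a usage note, the smaller window can also be requested per call; a sketch with the ollama Python client (the model tag below is illustrative, and the same limit can be baked into the Modelfile with `PARAMETER num_ctx 4096`):

```python
import ollama

# Per-request override of the context window with the ollama Python client
# (the model tag is illustrative).
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Who teaches databases?"}],
    options={"num_ctx": 4096},
)
print(response["message"]["content"])
```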
With this patch, user interactions should feel more "uniform," with only small slowdowns as the history grows and consistent response quality.
Manual validation
The model is generally behaving nicely (see screenshots).
But the question is: does it work better for long prompts?
To answer this question, I created gen_history.py to simulate a very long conversation between a student and a teacher that exceeds the 32K context window (see history.json).
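For illustration, a rough sketch of the idea behind such a script (the filler text and turn count are made up; this is not gen_history.py verbatim):

```python
import json

# Generate a synthetic student/teacher conversation long enough to exceed a
# 32K-token context window (illustrative content only).
def gen_history(n_turns=300, path="history.json"):
    history = []
    for i in range(n_turns):
        history.append({"role": "user",
                        "content": f"Student: can you explain topic {i} in more detail?"})
        history.append({"role": "assistant",
                        "content": f"Teacher: sure, about topic {i}. " + "Here is a long explanation. " * 40})
    with open(path, "w") as f:
        json.dump(history, f, indent=2)

if __name__ == "__main__":
    gen_history()
```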
Then, I sent this history alongside the query "Who teaches databases?" to both the production system and the development version (see validate_truncation.py). The question is intentionally vague: the system needs to carefully follow the provided instructions to realize that we are talking about the DBSys course and that we want to know which professor teaches it.
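A sketch of the comparison itself; the endpoint URLs and the payload shape below are hypothetical placeholders, and only history.json and the query correspond to the real validate_truncation.py run:

```python
import json
import time
import requests

QUERY = "Who teaches databases?"

# Hypothetical endpoints and payload shape, for illustration only.
ENDPOINTS = {
    "Prod": "https://codex.example/api/chat",
    "Dev": "http://localhost:8000/api/chat",
}

with open("history.json") as f:
    history = json.load(f)

for name, url in ENDPOINTS.items():
    start = time.time()
    resp = requests.post(url, json={"history": history, "message": QUERY}, timeout=300)
    answer = resp.json().get("answer", resp.text)
    print(f"# {name} ({time.time() - start:.1f} s): {answer}")
```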
I obtained the following outputs:
# Prod (55.2 s): I could not find any related information in Codex, but Prof. Pietro Michiardi supervises the development team responsible for Codex and EULER. If they have any particular inquiries regarding your behavior, redirect them to his team.
# Dev (6.2 s): Paolo Papotti is an associate professor in the Data Science department at EURECOM. He teaches the course on database systems implementation (DBSys) [cite:1].
As you can see, the new version not only produced a better answer, it also delivered it nearly 10x faster.
No-reg
$ diff -y <(zcat nightly.json.gz | jq '.summary') <(zcat mr45.json.gz | jq '.summary')
{ {
"context_precision": { "context_precision": {
"avg_score": 0.8752083333111752, | "avg_score": 0.8956944444199177,
"failures": 0, "failures": 0,
"total_queries": 20 "total_queries": 20
}, },
"context_recall": { "context_recall": {
"avg_score": 0.9333333333333333, | "avg_score": 0.9566666666666667,
"failures": 0, "failures": 0,
"total_queries": 20 "total_queries": 20
}, },
"correctness": { "correctness": {
"avg_score": 0.46612329971275834, | "avg_score": 0.4144087153880008,
"failures": 0, | "failures": 2,
"total_queries": 20 "total_queries": 20
}, },
"faithfulness": { "faithfulness": {
"avg_score": 0.844136496006302, | "avg_score": 0.8568834674097833,
"failures": 1, "failures": 1,
"total_queries": 20 "total_queries": 20
} }
} }




