Improving query rewriting for contextual search
Introduction
Since the WAICF, I noticed that EULER can be a little too brittle to use. For example, if you ask "Who teaches databases?", it answers generically about what it takes to teach databases, missing that the context relates to EURECOM. A better query would be "Who teaches databases at EURECOM?"
There is a natural ambiguity to language, and the LLM should be able to work with it. Contextual search, however, cannot. The query "who teaches databases?" retrieves useless text chunks, and, as a result, the LLM does its best and produces a generic response. With the query "who teaches databases at EURECOM?", the retriever delivers the correct chunks and the LLM produces the intended answer.
The current "condense_prompt.txt" is plagued by this issue. Here is the fix!
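For context, the condense step takes the chat history and the latest user message and rewrites it into a standalone query before retrieval. Here is a minimal sketch of that flow, assuming a generic LLM client (complete() and rewrite_query() are illustrative names; only condense_prompt.txt is an actual project file):

import pathlib

def complete(messages):
    # Illustrative stand-in for the LLM client actually used by EULER.
    raise NotImplementedError

def rewrite_query(history, prompt_path="condense_prompt.txt"):
    # Rewrite the last user message into a standalone, context-enriched query.
    condense_prompt = pathlib.Path(prompt_path).read_text()
    transcript = "\n".join(f"{turn['role']}: {turn['message']}" for turn in history)
    return complete([
        {"role": "system", "content": condense_prompt},
        {"role": "user", "content": f"Chat history:\n{transcript}\n\n"
                                    "Rewrite the last user message as a standalone question."},
    ])

# "Who teaches databases?" should come back as "Who teaches databases at EURECOM?",
# so the retriever can return the right chunks.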
Implementation details
Rather than rushing a solution, I decided to do something more scientific. I manually crafted a dataset of 22 chat histories with reference queries, touching many different topics and domains (any test that came to mind). Then, I tuned a judge model to evaluate good/bad examples against the reference question.
Finally, I used this judge model to iteratively improve the query rewrite prompt. The accuracy improved from 0.41 to 0.91!
The scripts are stored within the project, so if we decide to change the underlying model, we can rerun the assessment.
Check original-report.json for the old assessment and report.json for the new one.
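The assessment script boils down to the following loop (a sketch: judge() stands in for the tuned judge model, rewrite_query() is the helper sketched above, and the dataset entries follow the format of the examples below; report.json and original-report.json are the actual output files):

import json

def judge(reference, query_rewrite):
    # Illustrative stand-in for the tuned judge model: compares the rewrite
    # against the reference question and returns (assessment: bool, reason: str).
    raise NotImplementedError

def assess(dataset_path, report_path):
    with open(dataset_path) as f:
        dataset = json.load(f)  # 22 entries with test_case, history, reference

    details = []
    for example in dataset:
        query_rewrite = rewrite_query(example["history"])
        assessment, reason = judge(example["reference"], query_rewrite)
        details.append({**example, "query_rewrite": query_rewrite,
                        "assessment": assessment, "reason": reason})

    accuracy = sum(d["assessment"] for d in details) / len(details)
    report = {"summary": {"accuracy": accuracy,
                          "failed_examples": [d for d in details if not d["assessment"]]}}
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)
    return accuracy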
Here is the list of query rewrites that failed the assessment:
"summary": {
"accuracy": 0.9090909090909091,
"failed_examples": [
{
"test_case": "The user query needs to be expanded on conversational jargon.",
"reference": "In what ways do Graph Neural Networks (GNNs) differ from Convolutional Neural Networks (CNNs)?",
"query_rewrite": "What are the advantages and disadvantages of Graph Neural Networks compared to Convolutional Neural Networks?",
"reason": "The standalone question is clear, but it does not perfectly align with the reference question. The reference question asks about the differences between Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), which implies a comparison of their characteristics, advantages, or disadvantages. However, the standalone question explicitly asks for advantages and disadvantages, which is a more specific type of comparison than what the reference question implies. The reference question is more open-ended, allowing for any type of difference (not limited to advantages and disadvantages), while the standalone question narrows down the comparison to pros and cons. Therefore, the standalone question, although related, does not capture the full scope of the comparison intended by the reference question."
},
{
"test_case": "The user is asking about model deployment but needs a more specific question for actionable insights.",
"reference": "What are the main challenges when deploying deep learning models in production environments, and how can they be addressed?",
"query_rewrite": "What issues might I run into while deploying my deep learning model in a production environment?",
"reason": "The standalone question is clear and concise, but it lacks the comprehensive aspect of the reference question. The reference question not only asks about the challenges but also inquires about how they can be addressed, which provides a more complete understanding of the issue at hand. Additionally, the reference question generalizes the scenario to \"deep learning models\" rather than the user's specific model, making it more applicable and informative."
}
]
}
As you can see, even the failed rewrites are not that bad: they stay related to the reference question and only miss its exact scope.
Here is a positive example:
{
  "test_case": "Given the last user query it is not possible to infer who he is talking about. Therefore, it must be enriched from further context from the chat history.",
  "history": [
    {
      "role": "user",
      "message": "Who is the director of EURECOM?"
    },
    {
      "role": "assistant",
      "message": "Prof. David Gesbert who took over Ulrich Finder in 2022."
    },
    {
      "role": "user",
      "message": "What do you know about him?"
    }
  ],
  "reference": "What do you know about Prof. David Gesbert, the director of EURECOM?",
  "query_rewrite": "What do you know about Prof. David Gesbert, including his background and contributions to EURECOM?",
  "assessment": true,
  "reason": "The standalone question is concise and clear, and it provides more specific information about what the user wants to know about Prof. David Gesbert, such as his background and contributions to EURECOM. This additional detail enhances the clarity of the query without introducing ambiguity. The reference question and the standalone question both identify the subject of inquiry as Prof. David Gesbert, the director of EURECOM, ensuring that the context is preserved and well-defined. Therefore, the standalone question is as comprehensive and clear as the reference question."
}
Manual Tests
(Screenshot comparison: EULER Prod vs. after this package.)
I performed several other manual tests, and it generally performed well. I also checked how the queries are being rewritten, and the rewrites look consistent. However, I didn't notice much difference in interaction quality between production and this patch; the improvements are clearest on very vague questions.
No-Reg
Comparing the evaluation summary of this MR (mr44) against the nightly baseline:
$ diff -y <(zcat mr44.json.gz | jq '.summary') <(zcat nightly.json.gz | jq '.summary')
{ {
"context_precision": { "context_precision": {
"avg_score": 0.8812499999790037, | "avg_score": 0.8881249999777483,
"failures": 0, "failures": 0,
"total_queries": 20 "total_queries": 20
}, },
"context_recall": { "context_recall": {
"avg_score": 0.9202380952380953, | "avg_score": 0.8847619047619049,
"failures": 0, "failures": 0,
"total_queries": 20 "total_queries": 20
}, },
"correctness": { "correctness": {
"avg_score": 0.34919569038581966, | "avg_score": 0.38428611898835896,
"failures": 3, "failures": 3,
"total_queries": 20 "total_queries": 20
}, },
"faithfulness": { "faithfulness": {
"avg_score": 0.8255418850155692, | "avg_score": 0.8874477861319967,
"failures": 1, "failures": 1,
"total_queries": 20 "total_queries": 20
} }
} }
$ diff <(zcat mr44.json.gz | jq '.details[] | {"query": .query, "answer": .response}') <(zcat nightly.json.gz | jq '.details[] | {"query": .query, "answer": .response}')
$ echo $?
0
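
To read the side-by-side summary diff more easily, the per-metric deltas can also be pulled out directly (a Python sketch over the same two report files):

import gzip
import json

def load_summary(path):
    with gzip.open(path, "rt") as f:
        return json.load(f)["summary"]

mr44 = load_summary("mr44.json.gz")
nightly = load_summary("nightly.json.gz")
for metric in mr44:
    a, b = mr44[metric]["avg_score"], nightly[metric]["avg_score"]
    print(f"{metric}: mr44={a:.4f} nightly={b:.4f} delta={b - a:+.4f}")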

