LLM performance evaluation

Large Language Models (LLMs) present a unique challenge when it comes to performance evaluation. Unlike traditional machine learning, where outcomes are often binary, LLM outputs fall along a spectrum of correctness. Also, while your base model may excel on broad benchmarks, strong general performance doesn’t guarantee strong performance on your specific use cases.

Therefore, evaluating LLMs holistically requires a combination of methods, such as using LLMs to evaluate LLMs (i.e., auto-evaluation) and human-LLM hybrid approaches. This article walks through these methods step by step, covering how to create custom evaluation sets tailored to your application, pinpoint relevant metrics, and implement rigorous evaluation methods – both for selecting models and monitoring ongoing performance in production.

Build Targeted Evaluation Sets For Your Use Cases

To assess the performance of an LLM on a specific use case, you need to test the model on a set of examples that are representative of your target use cases. This requires building a custom evaluation set. 

  1. Start small. For testing LLM performance on your use case, you may start with as few as 10 examples. Each of these examples can be run multiple times to assess the model’s consistency and reliability.
  2. Pick challenging examples. The examples you choose should not be straightforward. They should be challenging, designed to test the model’s capacity to the fullest. This could include prompts with unexpected inputs, queries that could induce biases, or questions that require a deep understanding of the subject. It’s not about tricking the model, but rather ensuring it’s prepared for the unpredictable nature of real-world applications.
  3. Consider harnessing LLMs for building an evaluation set. It’s common practice to use a language model to build evaluation sets that assess either the same model or other language models. For example, an LLM can generate a set of Q&A pairs based on an input text, which you can use as a first batch of samples for your question-answering application.
  4. Incorporate user feedback. Whether from internal team testing or wider deployment, user feedback often reveals unforeseen challenges and real-world scenarios. Such feedback can be integrated as new challenging examples in your evaluation sets. 

In essence, building a custom evaluation set is a dynamic process, adapting and growing in tandem with your LLM project’s lifecycle. This iterative methodology ensures your model remains attuned to current, relevant challenges.
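The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed design: the `EvalExample` class, the `consistency` helper, and the `toy_model` stand-in are all hypothetical names introduced here, and `toy_model` should be replaced with a call to your own model.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EvalExample:
    """One hand-picked test case for the target use case."""
    prompt: str
    expected: str
    tags: List[str] = field(default_factory=list)  # e.g. ["typo", "user-feedback"]


def consistency(model: Callable[[str], str], example: EvalExample, runs: int = 5) -> float:
    """Run one example several times and return the share of runs that
    produced the most common answer (1.0 means fully consistent)."""
    outputs = [model(example.prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs


# Stand-in for a real model call (assumption: swap in your own client).
def toy_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else "unknown"


# Start small, and favor examples that are not straightforward.
eval_set = [
    EvalExample("What is the capital of France?", "Paris"),
    EvalExample("Capital of the country whose flag shows a red maple leaf?", "Ottawa", ["multi-hop"]),
]

# User feedback can be appended later as new challenging examples.
eval_set.append(EvalExample("capital of frnace??", "Paris", ["typo", "user-feedback"]))

for ex in eval_set:
    print(f"{ex.prompt!r}: consistency={consistency(toy_model, ex):.2f}")
```

Tagging each example makes it easy to see later which kinds of cases (typos, multi-step questions, user-reported failures) the model struggles with.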

Combine Metrics, Comparisons, and Criteria-Based Evaluation

Metrics alone are usually insufficient to evaluate LLMs. LLMs operate in a realm where there isn’t always a singular “correct” answer. Furthermore, using aggregate metrics can be misleading. A model might excel in one domain and falter in another, yet still register an impressive average score.
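To make the point about aggregate metrics concrete, here is a small sketch with made-up scores. The domain names and numbers are purely illustrative assumptions, not measurements.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-example scores (1.0 = pass, 0.0 = fail), tagged by domain.
results = [("billing", 1.0)] * 8 + [("refunds", 0.0), ("refunds", 1.0)]

# The single aggregate number looks healthy.
overall = mean(score for _, score in results)

# Breaking the same scores down by domain tells a different story.
by_domain = defaultdict(list)
for domain, score in results:
    by_domain[domain].append(score)
per_domain = {domain: mean(scores) for domain, scores in by_domain.items()}

print(f"overall: {overall:.2f}")
for domain, score in per_domain.items():
    print(f"{domain}: {score:.2f}")
```

In this toy data the overall average is 0.90, while the "refunds" slice passes only half of its examples, which is exactly the kind of weakness an average hides.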

Your evaluation criteria will depend on the distinct attributes of the particular LLM system. While accuracy and unbiasedness are common objectives, other criteria might be paramount in specific scenarios. For instance, a medical chatbot may prioritize response harmlessness, a customer support bot might emphasize maintaining a consistent friendly tone, or a web development application could require outputs in a specific format.

To streamline the process, multiple evaluation criteria can be integrated into a singular feedback function. It will take as input the text generated by an LLM and some metadata, and then output a score that indicates the quality of the text.
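One possible shape for such a feedback function is a weighted average of individual criteria. The sketch below is an assumption about how you might structure it: the criterion functions, the metadata keys (`max_words`, `banned_terms`), and the weights are all hypothetical and would be replaced by checks that matter for your application.

```python
from typing import Callable, Dict

# A criterion maps (generated text, metadata) to a score in [0, 1].
Criterion = Callable[[str, dict], float]


def within_length(text: str, meta: dict) -> float:
    """1.0 if the response stays under the word limit given in the metadata."""
    return 1.0 if len(text.split()) <= meta.get("max_words", 100) else 0.0


def avoids_banned_terms(text: str, meta: dict) -> float:
    """0.0 if any banned phrase from the metadata appears in the response."""
    banned = meta.get("banned_terms", [])
    return 0.0 if any(term.lower() in text.lower() for term in banned) else 1.0


def make_feedback_function(criteria: Dict[str, Criterion], weights: Dict[str, float]):
    """Combine several criteria into one function returning a weighted average."""
    total_weight = sum(weights[name] for name in criteria)

    def feedback(text: str, meta: dict) -> float:
        weighted = sum(weights[name] * fn(text, meta) for name, fn in criteria.items())
        return weighted / total_weight

    return feedback


feedback = make_feedback_function(
    {"length": within_length, "safe_terms": avoids_banned_terms},
    {"length": 1.0, "safe_terms": 3.0},  # weights are illustrative
)

meta = {"max_words": 20, "banned_terms": ["guaranteed cure"]}
print(feedback("Please take two tablets daily.", meta))
print(feedback("This is a guaranteed cure.", meta))
```

Keeping each criterion as its own small function means you can still report per-criterion scores alongside the combined one, which helps when the single number drops and you need to know why.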

Thus, holistic evaluation of LLM performance typically entails at least three different approaches:

  • Quantitative Metrics: When definitive correct answers exist, you can default to traditional ML evaluation methods using quantitative approaches.
  • Reference Comparisons: For instances without a clear-cut singular answer but with an available reference of acceptable responses, the model’s response can be compared and contrasted against pre-existing examples. 
  • Criteria-Based Evaluation: In the absence of a reference, the focus shifts to gauging the model’s output against the predefined criteria.
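Each of the three approaches above can be illustrated with a simple function. The choices here are assumptions made for the sake of a short example: exact match stands in for a quantitative metric, token-overlap F1 for a reference comparison, and a JSON-format check for a criteria-based test. Real systems often use richer scoring, including LLM-based judges.

```python
import json
import re
from collections import Counter
from typing import List


def exact_match(prediction: str, answer: str) -> float:
    """Quantitative metric: 1.0 only if the normalized strings are identical."""
    return float(prediction.strip().lower() == answer.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Reference comparison: F1 over the word tokens shared with one reference."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_reference_score(prediction: str, references: List[str]) -> float:
    """Score against several acceptable responses and keep the best match."""
    return max(token_f1(prediction, ref) for ref in references)


def is_json_object(prediction: str) -> float:
    """Criteria-based check with no reference: is the output a JSON object?"""
    try:
        return float(isinstance(json.loads(prediction), dict))
    except ValueError:
        return 0.0


print(exact_match(" Paris", "paris"))
print(best_reference_score("the cat", ["a dog barked", "the cat sat"]))
print(is_json_object('{"status": "ok"}'))
```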

Both reference comparisons and criteria-based evaluations can be executed either by human evaluators or through automated processes. Next, we’ll delve into the advantages and drawbacks of these distinct evaluation approaches.

Human, Auto-Evaluation, and Hybrid Approaches

Human evaluation is frequently viewed as the gold standard for evaluating machine learning applications, LLM-based systems included, but it is not always feasible due to time or technical constraints. Auto-evaluation and hybrid approaches are often used in enterprise settings to scale LLM performance evaluation.

Human Evaluation

Having human oversight on the output of LLM-based applications is essential for ensuring the accuracy and reliability of these systems. However, relying solely on this approach to evaluate LLMs may not be ideal due to the following key limitations:

  • Quality Concerns: Surprisingly, advanced models like GPT-4 often produce evaluations of superior quality compared to the average results from workers hired via Mechanical Turk. Human evaluators, unless guided by meticulous experimental designs, may not focus on the core qualities that matter most. There’s a propensity to get caught up in superficial elements; for instance, they might favor a well-formatted but erroneous response over an accurate yet plainly presented one.
  • Cost Implications: Acquiring top-tier human evaluations is expensive. The higher the quality of evaluation you seek, the steeper the associated costs.
  • Time Constraints: Collecting human evaluations is time-consuming. In the fast-paced world of LLM-based system development, where deployments can happen within mere days or weeks, developers can’t always afford to pause and await feedback. 

These constraints underscore the importance of complementing human evaluations with more efficient assessment techniques.


Auto-Evaluation

Large language models have proven adept at evaluating the performance of their counterparts. Notably, a more advanced or larger LLM can be used to assess the performance of smaller models. It’s also common to use an LLM to assess its own output. Given the mechanics of LLMs, a model might initially provide an incorrect answer. Yet, when the same model is given a carefully crafted prompt that asks it to evaluate its initial response, it effectively gets an opportunity to “reflect” or “rethink”. This procedure can increase the likelihood of the model identifying its own errors.
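A review step like this can be sketched as a prompt template plus a small parser for the verdict. Everything named here is an assumption for illustration: the wording of `REVIEW_TEMPLATE`, the "VERDICT:" convention, and the `fake_judge` stand-in, which you would replace with a call to your own model.

```python
from typing import Callable, Optional

# Illustrative prompt; the exact wording is an assumption, not a standard.
REVIEW_TEMPLATE = """You are reviewing an answer for correctness.

Question: {question}
Proposed answer: {answer}

Check the answer step by step, then finish with a final line that reads
exactly "VERDICT: CORRECT" or "VERDICT: INCORRECT"."""


def parse_verdict(review: str) -> Optional[bool]:
    """Return True/False for a clear verdict, or None if none can be found."""
    for line in reversed(review.strip().splitlines()):
        line = line.strip().upper()
        if line.startswith("VERDICT:"):
            verdict = line.split(":", 1)[1].strip()
            if verdict == "CORRECT":
                return True
            if verdict == "INCORRECT":
                return False
            return None
    return None


def review_answer(judge: Callable[[str], str], question: str, answer: str) -> Optional[bool]:
    """Ask a judge model to re-examine an answer and parse its verdict."""
    prompt = REVIEW_TEMPLATE.format(question=question, answer=answer)
    return parse_verdict(judge(prompt))


# Stand-in for a real model call so the sketch runs offline.
def fake_judge(prompt: str) -> str:
    return "17 * 3 is 51, not 41.\nVERDICT: INCORRECT"


print(review_answer(fake_judge, "What is 17 * 3?", "41"))
```

Returning `None` when no verdict can be parsed keeps malformed judge output from being silently counted as a pass or a fail.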

Using LLMs to evaluate other LLMs offers a swift and cost-effective alternative to employing human evaluators. However, this method has critical pitfalls that business and technology leaders must be prepared to address: 

  • When tasked with rating a response on a 1 to 5 scale, LLMs might exhibit a consistent bias towards a specific rating, regardless of the response’s actual quality.
  • When comparing its own output with that of other models, an LLM generally shows a preference for its own response.
  • The order in which candidate responses are presented can influence the evaluation; for example, the LLM may prefer whichever candidate answer is shown first.
  • LLMs tend to favor longer responses, even if they contain factual errors or are harder for human users to understand and use. 
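One simple way to detect the ordering effect described above is to run each pairwise comparison twice with the candidates swapped, and to flag cases where the verdict changes. This is a sketch under assumptions: the `PairwiseJudge` interface (returning "first" or "second") and both stub judges are hypothetical, and the check only detects order sensitivity rather than removing it.

```python
from typing import Callable

# Assumed interface: judge(question, first, second) returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def compare_both_orders(judge: PairwiseJudge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A" or "B" if the judge agrees with itself in both orders,
    otherwise "inconsistent"."""
    first_pass = judge(question, answer_a, answer_b)   # A is shown first
    second_pass = judge(question, answer_b, answer_a)  # B is shown first
    winner_1 = "A" if first_pass == "first" else "B"
    winner_2 = "B" if second_pass == "first" else "A"
    return winner_1 if winner_1 == winner_2 else "inconsistent"


# Stub judge with a pure position bias: always picks whatever is shown first.
def always_first(question: str, first: str, second: str) -> str:
    return "first"


# Stub judge that looks at content: prefers the answer mentioning "Paris".
def content_based(question: str, first: str, second: str) -> str:
    return "first" if "Paris" in first else "second"


question = "What is the capital of France?"
print(compare_both_orders(always_first, question, "Paris", "Lyon"))
print(compare_both_orders(content_based, question, "Paris", "Lyon"))
```

Comparisons flagged as inconsistent are natural candidates to route to human evaluators.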

Given the imperfections inherent in LLM evaluations, the strategic incorporation of manual oversight by human evaluators remains an advisable step and should not be omitted from your LLM application development process.

[Figure: Hybrid approach to LLM performance evaluation. Source: “Evaluating LLM-based Applications” presentation by Josh Tobin.]

Hybrid Approach

The prevailing approach is for developers to lean heavily on automatic evaluations facilitated by LLMs. This equips them with an immediate feedback mechanism, enabling swift model selection, fine-tuning, and experimentation with varied system prompts. The goal is to achieve an optimally performing system based on these automatic evaluations. Once the automated evaluation phase is completed, the next step typically involves a deeper dive with high-quality human evaluators to validate the auto-evaluation’s trustworthiness.
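A basic way to check how far the automatic evaluations can be trusted is to compare them with human labels on the same set of responses. The sketch below uses plain agreement rate on pass/fail labels; the labels shown are invented for illustration, and other agreement statistics may suit your data better.

```python
from typing import List, Sequence


def agreement_rate(auto_labels: Sequence[bool], human_labels: Sequence[bool]) -> float:
    """Share of examples where the automatic and human verdicts match."""
    if not auto_labels or len(auto_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == h for a, h in zip(auto_labels, human_labels))
    return matches / len(auto_labels)


def disagreements(auto_labels: Sequence[bool], human_labels: Sequence[bool]) -> List[int]:
    """Indices of examples worth a closer look because the verdicts differ."""
    return [i for i, (a, h) in enumerate(zip(auto_labels, human_labels)) if a != h]


# Hypothetical pass/fail verdicts for the same five responses.
auto = [True, True, False, True, False]
human = [True, False, False, True, True]

print(f"agreement: {agreement_rate(auto, human):.2f}")
print("review these examples:", disagreements(auto, human))
```

Reading through the disagreements is often more informative than the rate itself, since they show where the automatic evaluator's prompt or criteria need adjusting.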

Securing high-quality human evaluations can be a costly endeavor. While it isn’t pragmatic to resort to this level of scrutiny after every minor system refinement, human evaluation is an indispensable phase before transitioning an LLM system into a production environment. As noted earlier, evaluations from LLMs can manifest biases and be unreliable. 

Post-deployment, it’s crucial to gather genuine feedback from the end-users of your LLM-based applications. Feedback can be as simple as having users rate a response as useful (thumbs up) or not useful (thumbs down), but ideally should be accompanied by detailed comments highlighting the strengths and shortcomings of the model’s responses.

Foundation model updates or shifts in user queries might inadvertently degrade your application’s performance or expose latent weaknesses. Ongoing monitoring of the LLM application’s performance against your defined criteria remains critical throughout its operational life so you can quickly identify and address emerging deficiencies.
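Thumbs-up and thumbs-down ratings lend themselves to a simple rolling check. The sketch below is one possible approach, with assumed names and thresholds: the `FeedbackMonitor` class, the window size, and the minimum positive rate are all hypothetical and should be tuned to your traffic.

```python
from collections import deque
from typing import Optional


class FeedbackMonitor:
    """Track the most recent user ratings and flag a drop in the positive rate."""

    def __init__(self, window: int = 100, min_positive_rate: float = 0.8, min_samples: int = 20):
        self.ratings = deque(maxlen=window)  # oldest ratings fall out automatically
        self.min_positive_rate = min_positive_rate
        self.min_samples = min_samples

    def positive_rate(self) -> Optional[float]:
        """Share of thumbs-up ratings in the window, or None if it is empty."""
        if not self.ratings:
            return None
        return sum(self.ratings) / len(self.ratings)

    def needs_attention(self) -> bool:
        """True once enough ratings exist and the positive rate is below the floor."""
        if len(self.ratings) < self.min_samples:
            return False
        return self.positive_rate() < self.min_positive_rate

    def record(self, thumbs_up: bool) -> bool:
        """Store one rating and report whether the application needs a closer look."""
        self.ratings.append(thumbs_up)
        return self.needs_attention()


monitor = FeedbackMonitor(window=5, min_positive_rate=0.6, min_samples=5)
for rating in [True, True, True, True, True, False, False, False]:
    if monitor.record(rating):
        print(f"positive rate dropped to {monitor.positive_rate():.2f}")
```

The same pattern works for scores from an automated feedback function, so user ratings and automatic checks can be watched side by side.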

Key Takeaways

Evaluating the performance of LLM-based systems presents unique challenges, setting the task apart from conventional machine learning evaluations. In the process of evaluating an LLM system, the following critical considerations should be taken into account to inform your methodology:

  • Tailored Evaluation Sets: To derive actionable insights, it’s imperative to construct robust, application-centric evaluation sets. These sets don’t necessarily need to be large, but they should encompass a range of challenging samples.
  • Dynamic Expansion of Evaluation Challenges: As you receive feedback from users, it’s crucial to iteratively expand and refine the evaluation set to capture evolving challenges and nuances.
  • Quantitative Metrics & Qualitative Criteria: LLMs’ intricate nature often eludes straightforward quantitative metrics. It’s essential to establish a set of criteria tailored to your specific use case, allowing for a more nuanced assessment of the model’s performance.
  • Unified Feedback Function: To simplify the evaluation process, consider combining multiple criteria into a singular, coherent feedback function.
  • Hybrid Evaluation Approach: Leveraging both LLMs and high-quality human evaluators in your evaluation process offers a more comprehensive perspective and yields the most reliable and cost-effective results.
  • Continuous Real-World Monitoring: By merging user feedback with the unified feedback function, you can continuously monitor and fine-tune LLM performance, ensuring consistent alignment with real-world requirements.