Comparison Evaluators
Comparison evaluators in LangChain help compare the outputs of two different chains or LLMs. These evaluators are useful for comparative analyses, such as A/B testing between two language models or comparing different versions of the same model. They can also be used to generate preference scores for AI-assisted reinforcement learning.
These evaluators inherit from the PairwiseStringEvaluator class, which provides a comparison interface for two strings: typically the outputs from two different prompts or models, or from two versions of the same model. In essence, a comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details.
To create a custom comparison evaluator, inherit from the PairwiseStringEvaluator class and override the _evaluate_string_pairs method. If you require asynchronous evaluation, also override the _aevaluate_string_pairs method.
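As a rough illustration of the subclassing pattern just described, here is a minimal sketch of a toy evaluator that prefers whichever output has fewer words. The class name WordCountPreference, the word_counts key, and the scoring convention (1 when the first output is preferred, 0 when the second is, 0.5 for a tie) are assumptions made for this example. The fallback base class is a hypothetical stand-in that mimics only the interface described here, so the sketch can run without LangChain installed; it is not the library's implementation.

```python
from typing import Any, Optional

try:
    # Real base class, if LangChain's evaluation module is available.
    from langchain.evaluation import PairwiseStringEvaluator
except ImportError:
    # Hypothetical stand-in: mirrors only the public/private method split
    # described in the text so this sketch runs without LangChain.
    class PairwiseStringEvaluator:
        requires_input = False
        requires_reference = False

        def evaluate_string_pairs(self, *, prediction, prediction_b,
                                  reference=None, input=None, **kwargs):
            return self._evaluate_string_pairs(
                prediction=prediction, prediction_b=prediction_b,
                reference=reference, input=input, **kwargs)


class WordCountPreference(PairwiseStringEvaluator):
    """Toy evaluator (illustrative only): prefers the output with fewer words."""

    def _evaluate_string_pairs(self, *, prediction: str, prediction_b: str,
                               reference: Optional[str] = None,
                               input: Optional[str] = None,
                               **kwargs: Any) -> dict:
        words_a = len(prediction.split())
        words_b = len(prediction_b.split())
        if words_a < words_b:
            score = 1      # first output preferred
        elif words_a > words_b:
            score = 0      # second output preferred
        else:
            score = 0.5    # tie
        # Return a dictionary with the score plus any extra details.
        return {"score": score, "word_counts": (words_a, words_b)}


evaluator = WordCountPreference()
result = evaluator.evaluate_string_pairs(
    prediction="short answer",
    prediction_b="a much longer answer here",
)
print(result)
```

Callers go through the public evaluate_string_pairs method, while the subclass supplies only the underscore-prefixed implementation.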
Here's a summary of the key methods and properties of a comparison evaluator:
evaluate_string_pairs: Evaluate the output string pairs. Custom evaluators change this behavior by overriding _evaluate_string_pairs.
aevaluate_string_pairs: Asynchronously evaluate the output string pairs. Custom evaluators change this behavior by overriding _aevaluate_string_pairs.
requires_input: This property indicates whether the evaluator requires an input string.
requires_reference: This property indicates whether the evaluator requires a reference label.
The run_on_dataset evaluation method is designed to evaluate only a single model at a time, and thus, doesn't support these evaluators.
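Because of this limitation, one option is to loop over paired outputs yourself and aggregate the scores. The sketch below shows one way that might look. The helper win_rate and the stub ShorterIsBetter are hypothetical names introduced for illustration, and the code assumes each result dictionary has a "score" key where 1 means the first output is preferred and 0 means the second is.

```python
from typing import Any, Iterable, Tuple


def win_rate(evaluator: Any, pairs: Iterable[Tuple[str, str]]) -> float:
    """Average pairwise score over (output_a, output_b) pairs.

    Assumes evaluator.evaluate_string_pairs returns a dict with a numeric
    "score" key, where 1 means output_a is preferred and 0 means output_b.
    """
    scores = [
        evaluator.evaluate_string_pairs(prediction=a, prediction_b=b)["score"]
        for a, b in pairs
    ]
    if not scores:
        raise ValueError("No output pairs to evaluate")
    return sum(scores) / len(scores)


class ShorterIsBetter:
    """Hypothetical stub evaluator, used only to make the sketch runnable."""

    def evaluate_string_pairs(self, *, prediction: str, prediction_b: str,
                              **kwargs: Any) -> dict:
        return {"score": 1 if len(prediction) < len(prediction_b) else 0}


# Outputs from two models (A, B) for the same four inputs; made-up data.
paired_outputs = [("ab", "abc"), ("abcd", "a"), ("a", "bb"), ("xyz", "x")]
print(win_rate(ShorterIsBetter(), paired_outputs))
```

Any object exposing evaluate_string_pairs with these keyword arguments could be passed in place of the stub.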
Detailed information about creating custom evaluators and the available built-in comparison evaluators is provided in the following sections.
📄️ Pairwise string comparison
📄️ Pairwise embedding distance
📄️ Custom pairwise evaluator