LLM Judge Metrics¶
Use an LLM to evaluate outputs that can't be measured with simple rules.
import (
    "github.com/agentplexus/go-opik/evaluation/llm"
    "github.com/agentplexus/go-opik/integrations/openai"
)
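The Anthropic, gollm, and `evaluation.NewMetricInput` snippets below also reference packages that are not in the block above. Assuming they mirror the layout of the OpenAI integration (the exact import paths are an assumption, not taken from this page), the additional imports would look like:

```go
import (
    // Core evaluation types: MetricInput, Metric, Engine.
    "github.com/agentplexus/go-opik/evaluation"

    // Assumed to follow the same layout as integrations/openai.
    "github.com/agentplexus/go-opik/integrations/anthropic"
    "github.com/agentplexus/go-opik/integrations/gollm"
)
```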
Setting Up a Provider¶
First, create an LLM provider:
// OpenAI
provider := openai.NewProvider(
    openai.WithAPIKey("your-api-key"),
    openai.WithModel("gpt-4o"),
)

// Anthropic
provider := anthropic.NewProvider(
    anthropic.WithAPIKey("your-api-key"),
    anthropic.WithModel("claude-sonnet-4-20250514"),
)

// gollm (any provider)
provider := gollm.NewProvider(gollmClient,
    gollm.WithModel("gpt-4o"),
)
Built-in Judge Metrics¶
Answer Relevance¶
Evaluates how relevant the answer is to the question.
metric := llm.NewAnswerRelevance(provider)
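A minimal usage sketch; it assumes individual metrics expose the same `Score(ctx, input)` call shown in the G-EVAL example further down, and that `provider` and `ctx` are already set up as above:

```go
metric := llm.NewAnswerRelevance(provider)

// Pair the original question with the model's answer.
input := evaluation.NewMetricInput(
    "What is the capital of France?",
    "The capital of France is Paris.",
)

// Higher scores mean the answer stays on topic for the question.
score := metric.Score(ctx, input)
```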
Hallucination¶
Detects factual claims not supported by the context.
metric := llm.NewHallucination(provider)
Requires context in the input:
input := evaluation.NewMetricInput(question, answer).
    WithContext(relevantDocuments)
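For example, an end-to-end hallucination check might look like the sketch below. It assumes the retrieved passages are passed to `WithContext` as a slice of strings and that the metric exposes the same `Score(ctx, input)` call as G-EVAL; neither detail is spelled out in this section.

```go
question := "Where was the company founded?"
answer := "The company was founded in 1987 in Berlin."

// Retrieved passages; the []string shape is an assumption.
relevantDocuments := []string{
    "The company was founded in 1987.",
    "The company's first office opened in Munich.",
}

metric := llm.NewHallucination(provider)

input := evaluation.NewMetricInput(question, answer).
    WithContext(relevantDocuments)

// Claims unsupported by the context ("in Berlin" here) should pull the score down.
score := metric.Score(ctx, input)
```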
Factuality¶
Checks if the response is factually accurate.
metric := llm.NewFactuality(provider)
Context Recall¶
Measures how much of the information in the expected answer is covered by the retrieved context.
metric := llm.NewContextRecall(provider)
Context Precision¶
Measures how much of the retrieved context is actually relevant to answering the question.
metric := llm.NewContextPrecision(provider)
Moderation¶
Checks for harmful, inappropriate, or policy-violating content.
metric := llm.NewModeration(provider)
Coherence¶
Evaluates logical flow and consistency.
metric := llm.NewCoherence(provider)
Helpfulness¶
Measures how helpful the response is to the user.
metric := llm.NewHelpfulness(provider)
G-EVAL¶
Flexible evaluation with custom criteria and evaluation steps.
geval := llm.NewGEval(provider, "fluency and coherence")

// With custom evaluation steps
geval = geval.WithEvaluationSteps([]string{
    "Check if the response is grammatically correct",
    "Evaluate the logical flow of ideas",
    "Assess clarity and readability",
    "Check for appropriate vocabulary usage",
})

score := geval.Score(ctx, input)
Custom Judge¶
Create a judge with a custom prompt template:
prompt := `
Evaluate whether the response maintains a professional tone.
User message: {{input}}
AI response: {{output}}
Provide a score from 0.0 to 1.0 where:
- 1.0: Completely professional
- 0.5: Somewhat professional with minor issues
- 0.0: Unprofessional
Return JSON: {"score": <float>, "reason": "<explanation>"}
`
judge := llm.NewCustomJudge("tone_check", prompt, provider)
Template Variables¶
| Variable | Description |
|---|---|
| `{{input}}` | The original input/prompt |
| `{{output}}` | The LLM's response |
| `{{expected}}` | Expected/ground-truth output |
| `{{context}}` | Additional context |
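As a rough sketch of how these placeholders get filled, the fields of a `MetricInput` map onto the template variables. The names `userMessage`, `aiResponse`, `referenceAnswer`, and `supportingDocs` are illustrative, and the `Score` call shape follows the G-EVAL example above:

```go
// {{input}}    <- first argument to NewMetricInput
// {{output}}   <- second argument to NewMetricInput
// {{expected}} <- WithExpected(...)
// {{context}}  <- WithContext(...)
input := evaluation.NewMetricInput(userMessage, aiResponse).
    WithExpected(referenceAnswer).
    WithContext(supportingDocs)

judge := llm.NewCustomJudge("tone_check", prompt, provider)
score := judge.Score(ctx, input)
```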
Using Multiple Judges¶
metrics := []evaluation.Metric{
    llm.NewAnswerRelevance(provider),
    llm.NewHallucination(provider),
    llm.NewCoherence(provider),
    llm.NewHelpfulness(provider),
}

engine := evaluation.NewEngine(metrics,
    evaluation.WithConcurrency(2), // Limit concurrent LLM calls
)

input := evaluation.NewMetricInput(question, answer).
    WithExpected(expectedAnswer).
    WithContext(documents)

result := engine.EvaluateOne(ctx, input)
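The same engine can be reused across a whole dataset; a minimal sketch, where `items` and its fields are your own types rather than anything provided by the library:

```go
for _, item := range items {
    input := evaluation.NewMetricInput(item.Question, item.Answer).
        WithExpected(item.Expected).
        WithContext(item.Documents)

    // Runs all four judges against this input, never exceeding the
    // concurrency limit configured on the engine.
    result := engine.EvaluateOne(ctx, input)
    _ = result // collect or log per-metric scores as needed
}
```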
Caching Responses¶
Reduce costs by caching identical evaluations:
// Wrap provider with caching
cachedProvider := llm.NewCachingProvider(provider)
// Use cached provider for metrics
metric := llm.NewAnswerRelevance(cachedProvider)
Best Practices¶
- Choose appropriate models: use a capable model (e.g., GPT-4o or Claude Sonnet) for nuanced evaluation
- Limit concurrency: keep concurrent judge calls within your provider's rate limits
- Use caching: avoid paying twice for repeated evaluations of identical inputs
- Combine with heuristics: run cheap rule-based checks first and use LLM judges only when needed (see the sketch after this list)
- Monitor costs: LLM evaluation costs add up quickly on large datasets
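For the "combine with heuristics" point, a cheap rule-based gate can keep obviously bad outputs from ever reaching the judge. The sketch below uses a plain standard-library `strings` check rather than any particular heuristic metric, and assumes `provider`, `ctx`, `question`, and `answer` are already in scope:

```go
judge := llm.NewHelpfulness(provider)

// Rule-based pre-filter: only answers that pass a basic sanity check are
// sent to the LLM judge, so trivially bad outputs never cost an API call.
passesPrefilter := len(strings.TrimSpace(answer)) >= 10 &&
    !strings.EqualFold(strings.TrimSpace(answer), "i don't know")

if passesPrefilter {
    input := evaluation.NewMetricInput(question, answer)
    score := judge.Score(ctx, input)
    _ = score // record alongside your heuristic results
}
```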
Cost Considerations¶
Each LLM judge metric makes at least one API call per evaluated item, so costs scale with dataset size. For large datasets:
- Pre-filter with heuristic metrics
- Use caching for duplicate inputs
- Batch evaluations during off-peak hours
- Consider smaller models for simple judgments
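For example, simple, high-volume judgments can be routed to a smaller model while a stronger model is reserved for nuanced metrics. This reuses only the provider options shown earlier; the specific model names are just examples:

```go
// Smaller, cheaper model for simple judgments such as moderation.
cheapProvider := openai.NewProvider(
    openai.WithAPIKey("your-api-key"),
    openai.WithModel("gpt-4o-mini"),
)

// Stronger model reserved for nuanced judgments.
strongProvider := openai.NewProvider(
    openai.WithAPIKey("your-api-key"),
    openai.WithModel("gpt-4o"),
)

metrics := []evaluation.Metric{
    llm.NewModeration(cheapProvider),
    llm.NewHallucination(strongProvider),
    llm.NewAnswerRelevance(strongProvider),
}
```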