# Evaluation Framework
The evaluation framework provides tools for measuring LLM output quality using both rule-based heuristics and LLM-as-judge approaches.
## Architecture

```
evaluation/
├── Metric              # Interface for all metrics
├── MetricInput         # Input data for evaluation
├── ScoreResult         # Result of a metric evaluation
├── Engine              # Runs multiple metrics concurrently
├── heuristic/          # Rule-based metrics
│   ├── string.go       # String matching (equals, contains)
│   ├── parsing.go      # Format validation (JSON, XML)
│   ├── pattern.go      # Regex and format patterns
│   └── similarity.go   # Text similarity (BLEU, ROUGE)
└── llm/                # LLM-based judge metrics
    ├── provider.go     # LLM provider interface
    └── metrics.go      # Judge metrics (relevance, hallucination)
```
## Quick Example
```go
import (
	"context"
	"fmt"

	"github.com/agentplexus/go-opik/evaluation"
	"github.com/agentplexus/go-opik/evaluation/heuristic"
)

// Create metrics
metrics := []evaluation.Metric{
	heuristic.NewEquals(false),   // Case-insensitive equality
	heuristic.NewContains(false), // Substring check
	heuristic.NewIsJSON(),        // JSON validation
}

// Create engine
engine := evaluation.NewEngine(metrics,
	evaluation.WithConcurrency(4),
)

// Create input
input := evaluation.NewMetricInput("What is 2+2?", "The answer is 4.")
input = input.WithExpected("4")

// Evaluate
ctx := context.Background()
result := engine.EvaluateOne(ctx, input)
fmt.Printf("Average score: %.2f\n", result.AverageScore())
for name, score := range result.Scores {
	fmt.Printf("  %s: %.2f\n", name, score.Value)
}
```
## Metric Interface
All metrics implement this interface:
```go
type Metric interface {
	Name() string
	Score(ctx context.Context, input MetricInput) *ScoreResult
}
```
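Any type with these two methods can be passed to the engine. As an illustration only, here is a minimal sketch of a custom rule-based metric; the `MaxLength` type and its `Limit` field are invented for this example, while `evaluation.BooleanScore` is the helper described under ScoreResult below:

```go
import (
	"context"

	"github.com/agentplexus/go-opik/evaluation"
)

// MaxLength is a hypothetical custom metric: it scores 1.0 when the
// output stays within a character limit, 0.0 otherwise.
type MaxLength struct {
	Limit int
}

func (m MaxLength) Name() string { return "max_length" }

func (m MaxLength) Score(ctx context.Context, input evaluation.MetricInput) *evaluation.ScoreResult {
	// BooleanScore maps true/false to 1.0/0.0.
	return evaluation.BooleanScore(m.Name(), len(input.Output) <= m.Limit)
}
```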
## MetricInput
Contains all data needed for evaluation:
```go
type MetricInput struct {
	Input    string         // The original input/prompt
	Output   string         // The LLM's output to evaluate
	Expected string         // Expected/ground truth output
	Context  string         // Additional context
	Metadata map[string]any // Any extra data
}

// Create input
input := evaluation.NewMetricInput(prompt, llmOutput)
input = input.WithExpected(expectedOutput)
input = input.WithContext(additionalContext)
```
## ScoreResult
Contains the evaluation result:
```go
type ScoreResult struct {
	Name     string         // Metric name
	Value    float64        // Score (typically 0.0 to 1.0)
	Reason   string         // Explanation for the score
	Metadata map[string]any // Additional data
	Error    error          // Error if evaluation failed
}

// Helper constructors
score := evaluation.NewScoreResult("accuracy", 0.95)
score := evaluation.NewScoreResultWithReason("accuracy", 0.95, "Exact match found")
score := evaluation.BooleanScore("is_valid", true) // 1.0 for true, 0.0 for false
```
## Evaluation Engine
Run multiple metrics concurrently:
```go
// Create engine with options
engine := evaluation.NewEngine(metrics,
	evaluation.WithConcurrency(4), // Run 4 metrics in parallel
)

// Evaluate single input
result := engine.EvaluateOne(ctx, input)

// Evaluate multiple inputs
inputs := []evaluation.MetricInput{input1, input2, input3}
results := engine.EvaluateMany(ctx, inputs)

// Evaluate with item IDs (for datasets)
itemResults := engine.EvaluateWithIDs(ctx, map[string]evaluation.MetricInput{
	"item-1": input1,
	"item-2": input2,
})
```
## Dataset Evaluator
Evaluate entire datasets:
```go
evaluator := evaluation.NewDatasetEvaluator(engine, client)

results, err := evaluator.Evaluate(ctx, dataset,
	func(item map[string]any) string {
		// Generate output for each dataset item
		return llmClient.Complete(item["input"].(string))
	},
)
```
## Metric Categories
| Category | Description | Examples |
|---|---|---|
| Heuristic | Rule-based, deterministic | Equals, Contains, IsJSON, BLEU |
| LLM Judge | Uses LLM to evaluate | Relevance, Hallucination, Factuality |
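Both categories satisfy the same `Metric` interface, so they can run in one engine. A sketch of that mix follows; the heuristic constructors match the Quick Example above, while `llm.NewRelevance` and the `provider` value are hypothetical placeholders, since the judge-metric API in `evaluation/llm` is not shown on this page:

```go
import (
	"github.com/agentplexus/go-opik/evaluation"
	"github.com/agentplexus/go-opik/evaluation/heuristic"
	"github.com/agentplexus/go-opik/evaluation/llm"
)

// Cheap, deterministic checks first...
metrics := []evaluation.Metric{
	heuristic.NewContains(false),
	heuristic.NewIsJSON(),
	// ...then LLM-judged quality. NewRelevance and provider are
	// hypothetical names used only for illustration.
	llm.NewRelevance(provider),
}

engine := evaluation.NewEngine(metrics, evaluation.WithConcurrency(2))
```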
## Best Practices
- Combine metrics: Use multiple metrics for comprehensive evaluation
- Use heuristics first: They're faster and cheaper than LLM judges
- Set appropriate concurrency: Balance speed vs. rate limits
- Handle errors: Check `ScoreResult.Error` for failed evaluations (see the sketch below)
- Log to traces: Add scores as feedback to traces for tracking
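For the error-handling point, a minimal sketch that assumes the `result.Scores` map shown in the Quick Example:

```go
import "log"

result := engine.EvaluateOne(ctx, input)
for name, score := range result.Scores {
	if score.Error != nil {
		// A failed evaluation still yields a ScoreResult; log it and
		// skip its Value rather than treating it as a real score.
		log.Printf("metric %s failed: %v", name, score.Error)
		continue
	}
	log.Printf("metric %s scored %.2f (%s)", name, score.Value, score.Reason)
}
```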