AI & Machine Learning
Evaluation Harness
Design an automated eval harness for an AI system with scored metrics
01 Shape your prompt (7 fields)
02 Your prompt (1,012 characters). The raw prompt, unchanged.
You are an AI evaluation specialist. Design an evaluation harness for "".

## System
- Type: RAG pipeline
- Quality dimensions: Accuracy / correctness, Faithfulness / groundedness, Safety
- Scoring method: Hybrid

## Harness design
- Dataset: how to build a representative eval set (including hard/edge/adversarial cases) and how to version it.
- Metrics: a precise, reproducible definition and scoring rubric for each dimension.
- Combine deterministic checks with an LLM/human judge; calibrate the judge against human labels.
- Aggregation: per-slice scores, confidence intervals, and pass/fail thresholds.
- CI integration: a regression gate that blocks merges on score drops, with a stable seed and a flake budget.

## Deliverables
1. The harness architecture and the runnable scoring code/config.
2. The dataset schema and a starter set of example cases.
3. A report format showing scores, regressions and failure exemplars.

Use rigorous, reproducible defaults and explain trade-offs; ask only if blocked.
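
As a rough illustration of the aggregation and CI-gate items in the prompt, the sketch below computes per-slice means with a seeded percentile-bootstrap confidence interval and flags slices whose mean drops below a stored baseline. It is a minimal sketch, not the output the prompt would produce: the record fields (`slice`, `score`), the function names, the baseline format, and the `max_drop` tolerance are all assumptions made for this example.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    A fixed seed keeps the interval identical across CI runs.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return low, high


def aggregate_by_slice(results, seed=0):
    """Group per-case scores by slice and summarize each slice.

    `results` is a list of dicts with hypothetical keys "slice" and "score".
    """
    by_slice = {}
    for record in results:
        by_slice.setdefault(record["slice"], []).append(record["score"])

    report = {}
    for name, scores in sorted(by_slice.items()):
        low, high = bootstrap_ci(scores, seed=seed)
        report[name] = {
            "n": len(scores),
            "mean": mean(scores),
            "ci_low": low,
            "ci_high": high,
        }
    return report


def regression_gate(report, baseline, max_drop=0.02):
    """Return the slices that regressed against `baseline`.

    A slice fails if it is missing from the report or if its mean fell by
    more than `max_drop` (an assumed tolerance, tune it per project).
    """
    failures = []
    for name, base_mean in baseline.items():
        current = report.get(name)
        if current is None or base_mean - current["mean"] > max_drop:
            failures.append(name)
    return failures


# Hypothetical example data: two slices with 0/1 scores per case.
example_results = [
    {"slice": "easy", "score": 1.0},
    {"slice": "easy", "score": 1.0},
    {"slice": "easy", "score": 1.0},
    {"slice": "adversarial", "score": 1.0},
    {"slice": "adversarial", "score": 0.0},
    {"slice": "adversarial", "score": 1.0},
    {"slice": "adversarial", "score": 0.0},
]
example_baseline = {"easy": 1.0, "adversarial": 0.9}
example_report = aggregate_by_slice(example_results, seed=0)
failed_slices = regression_gate(example_report, example_baseline)
```

In a CI job, a non-empty `failed_slices` list would translate into a non-zero exit code so the merge is blocked. Comparing means against a fixed tolerance is the simplest choice; a stricter gate could instead require the baseline to fall outside the current confidence interval, at the cost of needing larger slices to reach a decision.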