Evaluation Harness

Design an automated eval harness for an AI system with scored metrics

Shape your prompt

7 fields

System under test (required)

What does the system do? (required)

System type (required)

Quality dimensions to score

Scoring method

Run as a CI regression gate

Constraints

Your prompt

You are an AI evaluation specialist. Design an evaluation harness for "".

## System

- Type: RAG pipeline
- Quality dimensions: Accuracy / correctness, Faithfulness / groundedness, Safety
- Scoring method: Hybrid

## Harness design
- Dataset: how to build a representative eval set (including hard/edge/adversarial cases) and how to version it.
- Metrics: a precise, reproducible definition and scoring rubric for each dimension.
- Scoring: combine deterministic checks with an LLM or human judge; calibrate the LLM judge against human labels.
- Aggregation: per-slice scores, confidence intervals, and pass/fail thresholds.
- CI integration: a regression gate that blocks merges on score drops, with a stable seed and a flake budget.
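
To make the aggregation and CI-gate items above concrete, here is a minimal sketch using only the Python standard library. It assumes per-case scores in the range 0 to 1 grouped by slice, a percentile bootstrap for the confidence interval, and a per-slice baseline mean stored from a previous run. The names `bootstrap_ci`, `gate`, and the `max_drop` threshold are hypothetical, and the flake budget is left out.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    Returns (point_estimate, lower, upper). The fixed seed keeps the
    interval identical across CI runs on the same scores.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper


def gate(current, baseline, max_drop=0.02):
    """Return the slices whose mean fell more than `max_drop` below baseline.

    `current` maps slice name -> list of per-case scores.
    `baseline` maps slice name -> stored mean from the reference run.
    `max_drop` is an assumed threshold, not a recommended value.
    """
    failures = {}
    for slice_name, scores in current.items():
        point, lower, upper = bootstrap_ci(scores)
        if baseline[slice_name] - point > max_drop:
            failures[slice_name] = (point, lower, upper)
    return failures


if __name__ == "__main__":
    current = {"standard": [1, 1, 1, 1], "adversarial": [0, 0, 1, 0]}
    baseline = {"standard": 1.0, "adversarial": 0.75}
    failed = gate(current, baseline)
    # A non-empty result would make the CI job exit non-zero and block the merge.
    print(sorted(failed))
```

Comparing the point estimate against the baseline is the simplest rule. A stricter variant could fail only when the upper bound of the interval is below the baseline, which trades slower detection for fewer false alarms on small slices.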

## Deliverables
1. The harness architecture and the runnable scoring code/config.
2. The dataset schema and a starter set of example cases.
3. A report format showing scores, regressions and failure exemplars.
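
As an illustration of the second deliverable, the sketch below shows one possible shape for a versioned dataset schema stored as JSON Lines. Every field name here (`case_id`, `query`, `reference_answer`, `context_ids`, `slice`, `dataset_version`) is an assumption chosen for a RAG pipeline, and the two cases are placeholders rather than real evaluation data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class EvalCase:
    """One evaluation case. All field names are illustrative assumptions."""
    case_id: str
    query: str
    reference_answer: str
    context_ids: List[str] = field(default_factory=list)  # documents the answer must be grounded in
    slice: str = "standard"           # e.g. "standard", "edge", "adversarial"
    dataset_version: str = "0.1.0"    # bump whenever cases are added or changed


def to_jsonl(cases):
    """Serialize cases to JSON Lines, one case per line, with stable key order."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def from_jsonl(text):
    """Parse JSON Lines back into EvalCase objects, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Placeholder cases, not real data.
    cases = [
        EvalCase("case-001", "placeholder question", "placeholder answer", ["doc-1"]),
        EvalCase("case-002", "placeholder adversarial question", "should refuse",
                 slice="adversarial"),
    ]
    print(to_jsonl(cases))
```

Keeping `dataset_version` on every row means a score report can state exactly which version of the set produced it, which matters when comparing runs across merges.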

Use rigorous, reproducible defaults and explain trade-offs; ask only if blocked.