Tag: RAGAS

  • Specialized Judges: Scaling RAG Evaluation with Prometheus-2 and PydanticAI

    Our production benchmarks utilize the Feedback Collection and Preference Collection datasets to establish the performance delta between generalist and specialized evaluators. We observed that Prometheus-2 (8x7B) achieves a Pearson correlation of $0.898$ with human-annotated ground truth, which is on par with GPT-4 ($0.882$) and significantly higher than previous iterations of small generalist models. By enforcing…