Tag: RAGAS

  • Specialized Judges: Scaling RAG Evaluation with Prometheus-2 and PydanticAI

    Our production benchmarks use the Feedback Collection and Preference Collection datasets to measure the performance gap between generalist and specialized evaluators. We observed that Prometheus-2 (8x7B) achieves a Pearson correlation of $0.898$ with human-annotated ground truth, on par with GPT-4 ($0.882$) and significantly higher than earlier small generalist models. By enforcing…
