About

What This Is

AI-Evals.io helps practitioners choose LLM evaluation frameworks through honest comparison, based on hands-on testing and documentation research, not marketing claims.

The framework comparison table shows what actually works, what’s documented but unverified, and what we haven’t tested yet. No vendor hype. No assumptions.

Who Built It

Built by Alex Guglielmone Nemi

LinkedIn · GitHub · Newsletter

Why

  • For my own benefit, and because I think evals and testing are pivotal to automation: you can’t rely on what you can’t test.
  • To help the open source community
  • To help different role families pick the right tools for their needs.
  • To ground discussions in specifics and back them with evidence.

Principles

  • Quality over assumptions: I’d rather mark a cell “unknown” than over-populate the table with unverified claims
  • Practitioner-focused: We compare based on real production needs, not theoretical features
  • Tool-agnostic: Tools come and go; an honest comparison may even help many of them improve at the same time
  • Living resource: The comparison evolves as we test more frameworks and features

Methodology

This site prioritizes practical, reproducible evals over marketing claims. Each cell in the framework comparison table carries one of the following tags:

  • Y / y (yes): proven means tested and confirmed to work; docs means mentioned in the documentation but not verified.
  • N / n (no): proven means tested and confirmed not to work; docs means the limitation is mentioned in the documentation but not verified.
  • P / p (partial): proven means tested and found to have limited support or coverage; docs means limited support is mentioned in the documentation but not verified.
  • ? (unknown): no information.

  • Evidence tags: proven means I validated it hands-on; docs means I only saw it in the documentation (see the sketch below).
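For illustration, here is a minimal sketch of how one comparison cell could be represented programmatically; the class, field names, and example feature are hypothetical, not the site’s actual schema.

```python
from dataclasses import dataclass

# Hypothetical representation of a single cell in the comparison table.
# The value is Y / N / P / ?, and the evidence level records whether the
# claim was validated hands-on ("proven") or only read in the docs ("docs").
@dataclass(frozen=True)
class FeatureStatus:
    value: str            # "Y", "N", "P", or "?"
    evidence: str | None  # "proven", "docs", or None when the value is "?"

    def is_verified(self) -> bool:
        """True only for claims tested hands-on, not merely seen in documentation."""
        return self.evidence == "proven"

# Example: a framework documents dataset versioning, but it has not been tested yet.
dataset_versioning = FeatureStatus(value="Y", evidence="docs")
print(dataset_versioning.is_verified())  # False
```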

If you spot an error or have better evidence, open a GitHub issue or send a note.

Platform Choices

  • Quarto: Separates data from presentation, making updates easy and transparent (see the sketch after this list)
  • Buttondown: Privacy-respecting newsletter platform with easy unsubscribe and data export
  • Open source: Full source code will be available on GitHub for transparency and contributions
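To make the Quarto point concrete, here is a minimal sketch of the data/presentation split, assuming a hypothetical frameworks.csv with framework, feature, value, and evidence columns (not the site’s real file): a Quarto page can execute a Python cell like this, so updating the comparison means editing only the data file.

```python
import csv

# Load the comparison data from a standalone file kept separate from the page layout.
# "frameworks.csv" and its column names are assumptions for this sketch.
with open("frameworks.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Render only hands-on-verified claims; docs-only claims stay out of this view.
for row in rows:
    if row["evidence"] == "proven":
        print(f'{row["framework"]:<20} {row["feature"]:<30} {row["value"]}')
```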

More on privacy

Roadmap

  • Expand framework coverage
  • Ensure programmatic access to the data is easy
  • Add interactive decision tree to help users choose frameworks
  • Create and share practical cookbooks for common eval patterns
  • Build regional communities around evaluation best practices

Have feedback? Reach out or join the newsletter.