About
What This Is
AI-Evals.io helps practitioners choose LLM evaluation frameworks through honest comparison, based on hands-on testing and documentation research, not marketing claims.
The framework comparison table shows what actually works, what’s documented but unverified, and what we haven’t tested yet. No vendor hype. No assumptions.
Who Built It
Built by Alex Guglielmone Nemi
LinkedIn • GitHub • Newsletter
Why
- For my own benefit, and because I think evals and testing are pivotal to automation: you can’t rely on what you can’t test.
- To help the open source community
- To help people in different roles pick the right tools for their needs.
- To ground discussions in specifics and back them with evidence.
Principles
- Quality over assumptions: I’d rather mark a cell “unknown” than over-populate the table with unverified claims
- Practitioner-focused: We compare based on real production needs, not theoretical features
- Tool-agnostic: Tools come and go; this comparison may even help many of them improve at the same time.
- Living resource: The comparison evolves as we test more frameworks and features
Methodology
This site prioritizes practical, reproducible evals over marketing claims.
- Evidence tags: proven means I validated it myself; docs means I only saw it in documentation.
Each cell in the comparison table combines a support value with one of these evidence tags (a short code sketch below illustrates the encoding):
- Y, proven: tested and proven it works. y, docs: mentioned in the docs but not proven.
- N, proven: tested and proven it does not work. n, docs: mentioned in the docs but not proven.
- P, proven: tested limited support or coverage. p, docs: limited support mentioned in the docs but not proven.
- ?, unknown: no information.
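As a purely hypothetical illustration of the legend above, here is a minimal Python sketch of how a support value plus an evidence tag could be modelled; the class names, fields, and the example framework are my own, not the site’s actual data format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Support(Enum):
    YES = "Y"       # works
    NO = "N"        # does not work
    PARTIAL = "P"   # limited support or coverage
    UNKNOWN = "?"   # no information


class Evidence(Enum):
    PROVEN = "proven"  # tested hands-on
    DOCS = "docs"      # only seen in documentation, not verified


@dataclass
class Cell:
    framework: str
    feature: str
    support: Support
    evidence: Optional[Evidence]  # None when support is UNKNOWN

    def label(self) -> str:
        """Render the cell the way the legend describes, e.g. 'Y, proven' or 'p, docs'."""
        if self.support is Support.UNKNOWN:
            return "?, unknown"
        value = self.support.value
        # Lowercase letter signals docs-only evidence, uppercase signals proven.
        if self.evidence is Evidence.DOCS:
            value = value.lower()
        return f"{value}, {self.evidence.value}"


# Hypothetical example: tracing support documented but not yet tested hands-on.
print(Cell("SomeFramework", "tracing", Support.PARTIAL, Evidence.DOCS).label())  # "p, docs"
```

Keeping the evidence next to the value is what lets the table distinguish “works” from “documented to work”.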
If you spot an error or have better evidence, open a GitHub issue or send a note.
Platform Choices
- Quarto: Separates data from presentation, making updates easy and transparent (see the sketch after this list)
- Buttondown: Privacy-respecting newsletter platform with easy unsubscribe and data export
- Open source: Full source code will be available on GitHub for transparency and contributions
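To make the Quarto point concrete: a Quarto page can render the comparison table straight from a data file, so updating a cell never means touching layout code. The file name, columns, and use of pandas below are assumptions for illustration, not the site’s actual source; this is only a minimal sketch of the pattern:

```{python}
#| echo: false
import pandas as pd

# Hypothetical data file; in this pattern the page only reads and displays it.
frameworks = pd.read_csv("data/frameworks.csv")
frameworks
```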
Roadmap
- Expand framework coverage
- Ensure programmatic access to the data is easy
- Add an interactive decision tree to help users choose frameworks
- Create and share practical cookbooks for common eval patterns
- Build regional communities around evaluation best practices
Have feedback? Reach out or join the newsletter.