About

What This Is

AI-Evals.io helps practitioners choose LLM evaluation frameworks through honest comparison, based on hands-on testing and documentation research, not marketing claims.

The framework comparison table shows what actually works, what’s documented but unverified, and what we haven’t tested yet. No vendor hype. No assumptions.

Who Built It

Built by Alex Guglielmone Nemi

LinkedIn · GitHub · Newsletter

Why

  • For my own benefit, and because I think evals and testing are pivotal to automation: you can’t rely on what you can’t test.
  • To help the open source community
  • To help different role families pick the right tools for their needs.
  • To ground discussions in specifics and back them with evidence.

Principles

  • Quality over assumptions: I’d rather mark a cell “unknown” than over-populate the table with unverified claims
  • Practitioner-focused: We compare based on real production needs, not theoretical features
  • Tool-agnostic: Tools come and go; an honest comparison may even help many of them improve at the same time
  • Living resource: The comparison evolves as we test more frameworks and features

Methodology

This site prioritizes practical, reproducible evals over marketing claims. Each cell in the framework comparison table carries one of the following tags:

  • Y / y (yes): proven means tested and confirmed to work; docs means mentioned in the documentation but not verified.
  • N / n (no): proven means tested and confirmed not to work; docs means the limitation is mentioned in the documentation but not verified.
  • P / p (partial): proven means tested and found to have limited support or coverage; docs means limited support is mentioned in the documentation but not verified.
  • ? (unknown): no information.

  • Evidence tags: proven means I validated it hands-on; docs means I only saw it in the documentation (see the sketch below).
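For illustration, here is a minimal sketch of how one comparison cell could be represented programmatically; the class, field names, and example feature are hypothetical, not the site’s actual schema.

```python
from dataclasses import dataclass

# Hypothetical representation of a single cell in the comparison table.
# The value is Y / N / P / ?, and the evidence level records whether the
# claim was validated hands-on ("proven") or only read in the docs ("docs").
@dataclass(frozen=True)
class FeatureStatus:
    value: str            # "Y", "N", "P", or "?"
    evidence: str | None  # "proven", "docs", or None when the value is "?"

    def is_verified(self) -> bool:
        """True only for claims tested hands-on, not merely seen in documentation."""
        return self.evidence == "proven"

# Example: a framework documents dataset versioning, but it has not been tested yet.
dataset_versioning = FeatureStatus(value="Y", evidence="docs")
print(dataset_versioning.is_verified())  # False
```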

If you spot an error or have better evidence, open a GitHub issue or send a note.

Platform Choices

  • Quarto: Separates data from presentation, making updates easy and transparent (see the sketch after this list)
  • Buttondown: Privacy-respecting newsletter platform with easy unsubscribe and data export
  • Open source: Full source code will be available on GitHub for transparency and contributions
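To make the Quarto point concrete, here is a minimal sketch of the data/presentation split, assuming a hypothetical frameworks.csv with framework, feature, value, and evidence columns (not the site’s real file): a Quarto page can execute a Python cell like this, so updating the comparison means editing only the data file.

```python
import csv

# Load the comparison data from a standalone file kept separate from the page layout.
# "frameworks.csv" and its column names are assumptions for this sketch.
with open("frameworks.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Render only hands-on-verified claims; docs-only claims stay out of this view.
for row in rows:
    if row["evidence"] == "proven":
        print(f'{row["framework"]:<20} {row["feature"]:<30} {row["value"]}')
```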

More on privacy

Roadmap

  • Expand framework coverage
  • Ensure programmatic access to the data is easy
  • Add interactive decision tree to help users choose frameworks
  • Create and share practical cookbooks for common eval patterns
  • Build regional communities around evaluation best practices

Have feedback? Reach out or join the newsletter.