By Alex Guglielmone Nemi, 17 years building software products and helping cross-functional teams work together.
About
Maintainer
LinkedIn • GitHub • Newsletter
What This Is
AI-Evals.io helps builders understand why evaluation matters and choose LLM eval tools through hands-on comparison and practical starting points, so workflows can be shared, maintained, and improved over time.
Who This Is For And Why
When I say “builders” I mean anyone creating workflows with LLMs. Individual experimentation might be easier than ever, but sharing, maintaining, and improving workflows - especially when collaborating with others - can quickly become labor-intensive.
This site provides practical starting points and evaluation patterns that help with cost, time, and collaboration once workflows grow beyond a single person.
Methodology
This site prioritizes practical, reproducible evals over marketing claims.
| Symbol | Meaning |
|---|---|
| Y | Yes - I tested it and it works |
| y | Yes - docs say so, I haven’t verified |
| N | No - I tested it, not supported(*) |
| n | No - docs say so, I haven’t verified(*) |
| P | Partial - I tested it, limited support |
| p | Partial - docs say so, I haven’t verified |
| ? | Unknown - no information |
- (*) In this table, “support” means built-in or out-of-the-box functionality. If a feature requires custom code or integration by the user, it is not counted as supported.
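To make the legend concrete, here is a minimal Python sketch of how these symbols could be read programmatically (for example, when consuming the comparison data). The class and field names are illustrative, not part of the site’s data model:

```python
# Illustrative mapping of the legend above; names and fields are my own, not the site's schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportStatus:
    level: str       # "yes", "no", "partial", or "unknown"
    verified: bool   # True if hands-on tested, False if taken from docs only


LEGEND = {
    "Y": SupportStatus("yes", True),
    "y": SupportStatus("yes", False),
    "N": SupportStatus("no", True),
    "n": SupportStatus("no", False),
    "P": SupportStatus("partial", True),
    "p": SupportStatus("partial", False),
    "?": SupportStatus("unknown", False),
}


def parse_symbol(symbol: str) -> SupportStatus:
    """Translate a comparison-table symbol into a structured status."""
    return LEGEND.get(symbol, SupportStatus("unknown", False))


print(parse_symbol("y"))  # SupportStatus(level='yes', verified=False)
```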
Why I built this
I’ve spent my career building products and helping cross-functional teams collaborate at the right level of complexity. One thing I’ve learned is that automation really pays off when it’s easy to change (ETC) - otherwise you can’t iterate fast or explore alternatives.
The principles that guide my thinking around testability and simplicity come from being accountable for live systems over many years. You learn how small decisions compound, how fear sets in when it’s hard to prove things quickly, how feedback loops make speed possible without losing context, and how difficult it is to roll back complexity once it takes hold. When working with others, I’ve found that progress depends on sharing those mental models - but also on having simple ways to practice them through real examples. My goal here is to smooth that path as much as possible, so people can automate the churny parts of their work and move on, instead of automating and hoping for the best - or quietly handing off the hard parts and losing control of the end-to-end system.
With LLMs, that comes down to evaluation (or “evals,” as it’s commonly called).
You trust what you can test. Evaluation is how automation stays reliable. But today, eval tools are hard to compare, hard to get started with, and often require commitment before you can even experiment. I built this site to make it easier to try things quickly, understand the tradeoffs, and keep control: comparison data and time-boxed starting points instead of premature decisions.
Evals also shouldn’t belong to a single role. Anyone working with LLM workflows shapes the outcome. We need shared mental models for what “good enough” means, without outsourcing responsibility or accountability. Evaluation thinking should start on day one.
And finally, good evaluation goes beyond correctness. Responsible automation includes operational and ethical constraints: rate limiting, bias checks, safety validation. This site treats those as first-class concerns, alongside practical implementation guidance.
Principles of comparison
- Quality over assumptions: I’d rather have “unknown” than overstate.
- Practitioner-focused: I compare based on real production needs, not theoretical features.
- Tool-agnostic: Tools come and go. Open comparison may even push several of them to improve at once.
- Evaluation-first: Test before you ship. Measure what matters.
- Living resource: The comparison evolves as I test more tools and features.
Platform choices
- Quarto: Separates data from presentation, making updates easy and transparent
- Buttondown: Privacy-respecting newsletter platform with easy unsubscribe and data export
- JSON API: All comparison data is accessible programmatically for tools and LLM agents (see the sketch after this list)
- Open source: Full source code will soon be available on GitHub for transparency and contributions
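As a rough sketch of what “accessible programmatically” can look like, here is how the comparison data could be pulled with Python’s standard library. The endpoint path is a placeholder I chose for illustration, not a documented URL, and the data shape is an assumption:

```python
# Sketch only: the endpoint path below is a placeholder; check the site for the real one.
import json
import urllib.request

COMPARISON_URL = "https://ai-evals.io/comparison.json"  # hypothetical path


def fetch_comparison(url: str = COMPARISON_URL) -> dict:
    """Download the comparison data as JSON so tools or LLM agents can filter it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    data = fetch_comparison()
    # Print a preview; downstream tools would filter by the Y/y/N/n/P/p/? statuses.
    print(json.dumps(data, indent=2)[:500])
```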
What’s next
Short term:
- Expand tool coverage with hands-on testing
- Refine the data model and how nuance is communicated without losing the value of discrete answers
- Evaluation patterns for Apache Airflow + GenAI workflows (related talk)
- Add interactive decision trees to help users choose, for example, which tool to try first
- Create practical cookbooks for common evaluation patterns
- Push security defaults with one-liners so it feels like there’s no tradeoff. If something is hard, people won’t default to it.
Long term:
- Build evaluation-first thinking into how people approach LLM automation
- Grow a community around responsible, effective evaluation practices
- Make testing as natural as prompting for anyone working with LLMs
Feedback welcome: Reach out directly (Linktree) or join Eval-Ception Discussions.