By Alex Guglielmone Nemi, 17 years building software products and helping cross-functional teams work together.
About
Maintainer
LinkedIn • GitHub • Newsletter
What This Is
AI-Evals.io helps builders understand why evaluation matters and choose LLM eval tools through hands-on comparison and practical starting points, so workflows can be shared, maintained, and improved over time.
Who This Is For And Why
When I say “builders” I mean anyone creating workflows with LLMs. Individual experimentation might be easier than ever, but sharing, maintaining, and improving workflows - especially when collaborating with others - can quickly become labor-intensive.
This site provides practical starting points and evaluation patterns that help with cost, time, and collaboration once workflows grow beyond a single person.
Methodology
This site prioritizes practical, reproducible evals over marketing claims.
| Symbol | Meaning |
|---|---|
| Y | Yes - I tested it and it works |
| y | Yes - docs say so, I haven’t verified |
| N | No - I tested it, not supported(*) |
| n | No - docs say so, I haven’t verified(*) |
| P | Partial - I tested it, limited support |
| p | Partial - docs say so, I haven’t verified |
| ? | Unknown - no information |
- (*) In this table, “support” means built-in or out-of-the-box functionality. If a feature requires custom code or integration by the user, it is not counted as supported.
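To make the legend concrete, here is a minimal Python sketch of how these symbols could be read programmatically (for example, when consuming the comparison data). The class and field names are illustrative, not part of the site’s data model:

```python
# Illustrative mapping of the legend above; names and fields are my own, not the site's schema.
from dataclasses import dataclass


@dataclass(frozen=True)
class SupportStatus:
    level: str       # "yes", "no", "partial", or "unknown"
    verified: bool   # True if hands-on tested, False if taken from docs only


LEGEND = {
    "Y": SupportStatus("yes", True),
    "y": SupportStatus("yes", False),
    "N": SupportStatus("no", True),
    "n": SupportStatus("no", False),
    "P": SupportStatus("partial", True),
    "p": SupportStatus("partial", False),
    "?": SupportStatus("unknown", False),
}


def parse_symbol(symbol: str) -> SupportStatus:
    """Translate a comparison-table symbol into a structured status."""
    return LEGEND.get(symbol, SupportStatus("unknown", False))


print(parse_symbol("y"))  # SupportStatus(level='yes', verified=False)
```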
Why I built this
I’ve spent my career building products and helping cross-functional teams collaborate at the right level of complexity. One thing I’ve learned is that automation really pays off when it’s easy to change (ETC) - otherwise you can’t iterate fast or explore alternatives.
The principles that guide my thinking around testability and simplicity come from being accountable for live systems over many years. You learn how small decisions compound, how fear sets in when it’s hard to prove things quickly, how feedback loops make speed possible without losing context, and how difficult it is to roll back complexity once it takes hold. When working with others, I’ve found that progress depends on sharing those mental models - but also on having simple ways to practice them through real examples. My goal here is to smooth that path as much as possible, so people can automate the churny parts of their work and move on, instead of automating and hoping for the best - or quietly handing off the hard parts and losing control of the end-to-end system.
With LLMs, that comes down to evaluation (or “evals,” as it’s commonly called).
You trust what you can test. Evaluation is how automation stays reliable. But today, eval tools are hard to compare, hard to get started with, and often require commitment before you can even experiment. I built this site to make it easier to try things quickly, understand the tradeoffs, and keep control: comparison data and time-boxed starting points instead of premature decisions.
Evals also shouldn’t belong to a single role. Anyone working with LLM workflows shapes the outcome. We need shared mental models for what “good enough” means, without outsourcing responsibility or accountability. Evaluation thinking should start on day one.
And finally, good evaluation goes beyond correctness. Responsible automation includes operational and ethical constraints: rate limiting, bias checks, safety validation. This site treats those as first-class concerns, alongside practical implementation guidance.
Principles of comparison
- Quality over assumptions: I’d rather have “unknown” than overstate.
- Practitioner-focused: I compare based on real production needs, not theoretical features.
- Tool-agnostic: Tools come and go. Open comparison may even push several of them to improve at once.
- Evaluation-first: Test before you ship. Measure what matters.
- Living resource: The comparison evolves as I test more tools and features.
Platform choices
- Quarto: Separates data from presentation, making updates easy and transparent
- Buttondown: Privacy-respecting newsletter platform with easy unsubscribe and data export
- JSON API: All comparison data is accessible programmatically for tools and LLM agents (see the sketch after this list)
- Open source: Full source code will soon be available on GitHub for transparency and contributions
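As a rough sketch of what “accessible programmatically” can look like, here is how the comparison data could be pulled with Python’s standard library. The endpoint path is a placeholder I chose for illustration, not a documented URL, and the data shape is an assumption:

```python
# Sketch only: the endpoint path below is a placeholder; check the site for the real one.
import json
import urllib.request

COMPARISON_URL = "https://ai-evals.io/comparison.json"  # hypothetical path


def fetch_comparison(url: str = COMPARISON_URL) -> dict:
    """Download the comparison data as JSON so tools or LLM agents can filter it."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    data = fetch_comparison()
    # Print a preview; downstream tools would filter by the Y/y/N/n/P/p/? statuses.
    print(json.dumps(data, indent=2)[:500])
```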
What’s next
Short term:
- Expand tool coverage with hands-on testing
- Refine the data model and how nuance is communicated without losing the value of discrete answers
- Evaluation patterns for Apache Airflow + GenAI workflows (related talk)
- Add interactive decision trees to help users choose, for example, which tool to try first
- Create practical cookbooks for common evaluation patterns
- Push security defaults with one-liners so it feels like there’s no tradeoff. If something is hard, people won’t default to it.
Long term:
- Build evaluation-first thinking into how people approach LLM automation
- Grow a community around responsible, effective evaluation practices
- Make testing as natural as prompting for anyone working with LLMs
Feedback welcome: Reach out directly (Linktree) or join Eval-Ception Discussions.