
Woman saying "Bet you don't know all 18 LLM benchmarks!"

Best list of LLM benchmarks – bet you don’t know all 18!

Introduction

Large language models (LLMs) are transforming how we interact with technology, but evaluating their performance can be complex.

You want to decide which LLM to use for your specific use case.
Which one should you choose?
Well, it depends. Do you need a fast LLM? A smart LLM? One that's good at coding? Good at conflict resolution? Cheap? Smart but open source? Built for RAG? Completely uncenso…?

LLM benchmarks provide a structured way to compare models across all of these dimensions.
This helps users (you!) choose the best fit for their needs! ✅

We at Dentro are always on the hunt for good LLM benchmarks.

This post explores a comprehensive list of 18 benchmarks.
We gathered them from the comment section of Simon Willison’s X post asking about credible LLM leaderboards (what a gold mine!).

Benchmark Overview

Below is a detailed look at each LLM benchmark that we found in the comment section of the X post.
We already knew about half of them, but discovered plenty of new ones!
Here’s our complete, yet admittedly incomplete, list of LLM benchmarks:

  • Aider bench: This LLM benchmark focuses on code editing, testing LLMs on their ability to follow instructions and edit code across multiple programming languages without human intervention. It uses a polyglot benchmark with 225 exercises to evaluate models like Gemini 2.5 Pro Preview, which achieved a 72.9% success rate.
  • EQ-Bench: Specializing in emotional intelligence, EQ-Bench evaluates how well LLMs can mediate conflicts in emotionally charged scenarios. It assesses models based on their adherence to professional mediation standards, with details available in its GitHub repository and a related research paper (GitHub, Paper).
  • Chatbot Arena (formerly LMSYS): A crowdsourced platform where LLMs battle it out in chatbot interactions, with user votes determining rankings based on preferences for text, images, and more. It uses the Elo rating system to rank models like GPT-4o and Qwen 32B, with over 2.8 million votes contributing to the leaderboard (a short Elo sketch follows this list).
  • SVGarena: For those who thought LLMs were only about text, SVGarena pits models against each other in generating SVG images. It’s a unique benchmark that tests creative and visual capabilities, with models competing based on user preferences for the generated SVGs.
  • LongBench: As the name suggests, LongBench tests LLMs on their ability to handle extremely long contexts, up to 2 million words, across tasks like QA, in-context learning, and dialogue understanding. It’s a single-needle-in-a-haystack style benchmark (a minimal sketch of the idea follows this list), which makes it useful for evaluating models for RAG pipelines (read more here about the 5 steps to unlock Retrieval Augmented Generation for Corporate Data).
  • Fiction.liveBench: Using the fiction.live platform, this benchmark challenges LLMs to navigate long, narrative contexts, finding multiple needles in a haystack and understanding complex relationships within stories. This makes it another great LLM benchmark for evaluating models for RAG pipelines.
  • SEAL LLM Leaderboards: These expert-driven evaluations use complex, private datasets to assess frontier LLMs, ensuring models are tested on unseen data to prevent overfitting. Behind this LLM benchmark is the well-known company Scale AI, with business prodigy Alexandr Wang at the helm.
  • Berkeley Function Calling Leaderboard (Gorilla bench): This LLM benchmark evaluates LLMs on their proficiency in calling functions or tools accurately, with different versions introducing advanced features like multi-turn interactions. It uses real-world data to measure metrics like accuracy, cost, and latency, with models like GPT-4o achieving high scores.
  • OpenRouter Rankings: Based on real-world API usage, these rankings show which LLMs are used most via the OpenRouter platform. It’s something of a lagging indicator though, as applications typically don’t switch immediately to the newest / best-performing Large Language Models.
  • Vectara Hallucination Leaderboard: Focused on factual consistency, this LLM leaderboard measures how often LLMs hallucinate when summarizing short documents. It uses the Hughes Hallucination Evaluation Model (HHEM-2.1) to evaluate models like Google Gemini-2.0-Flash-001, which achieved a 0.7% hallucination rate.
  • Kagi LLM Benchmark: With a dynamic set of questions, this benchmark tests LLMs on reasoning, coding, and instruction following, ensuring models are evaluated on novel tasks to avoid overfitting. It’s inspired by projects like Wolfram’s LLM Benchmarking and Aider’s coding leaderboard, focusing on capabilities crucial for search applications.
  • ARC AGI Leaderboard: This LLM leaderboard features visual problems that are easy for humans to solve but difficult for Large Language Models. It emphasizes fluid intelligence and efficiency, ranking systems based on their ability to solve problems with minimal resources and high adaptability.
  • SimpleBench: This LLM benchmark presents multiple-choice questions that are easy for non-specialized (high school) humans but difficult for current LLMs. It includes over 200 questions on spatio-temporal reasoning, social intelligence, and “trick questions”.
  • Convex LLM Leaderboard: Tailored for developers, this benchmark assesses LLMs on their ability to write Convex code, focusing on correctness, efficiency, and understanding of code structures. It includes seven benchmark categories; details are available on GitHub (GitHub).
  • Dubesor LLM Benchmark: A personal yet comprehensive benchmark, Dubesor bench compares AI models across a variety of tasks using a weighted rating system. It includes 83 tasks and manual testing, with results shared for transparency but noted as personal and potentially variable.
  • Artificial Analysis: Providing an independent analysis of AI models and providers, this platform uses metrics like intelligence, speed, and price to help users choose the best model for their needs. It offers detailed comparisons across models from providers like OpenAI, Meta, and Google.
  • LiveBench: Designed to be contamination-free, LiveBench releases new questions monthly from recent sources like arXiv papers and news articles. It includes 18 diverse tasks across categories like math, coding, and reasoning, with objective, automatic scoring to ensure fairness.
  • SWE-bench: For those interested in software engineering, SWE-bench tests LLMs on their ability to solve real-world GitHub issues automatically. It includes subsets like Lite, Verified, and Multimodal, with evaluation based on unit test verification.
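
For the curious: the Elo rating system mentioned for Chatbot Arena treats every user vote as a “battle” between two models, with the winner taking rating points from the loser. Below is a minimal, generic Elo update in Python; the K factor and starting ratings are illustrative assumptions, not Chatbot Arena’s actual parameters.

```python
# Minimal, generic Elo update as used by crowdsourced arenas to rank models
# from pairwise "battles". K factor and starting ratings are illustrative.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two hypothetical models start at 1000 points; model A wins one vote.
a, b = elo_update(1000.0, 1000.0, a_won=True)
print(round(a, 1), round(b, 1))  # -> 1016.0 984.0
```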
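
And here is a minimal sketch of the “needle in a haystack” idea behind long-context benchmarks like LongBench and Fiction.liveBench: bury one fact inside a long distractor text and check whether the model can retrieve it. The filler text, the needle, and the `ask_llm` call are all hypothetical placeholders, not part of any of the benchmarks above.

```python
# Minimal sketch of a single-needle-in-a-haystack test: hide one fact inside
# a long filler context and ask the model to retrieve it. `ask_llm` is a
# hypothetical placeholder for whatever LLM client you actually use.
import random

def build_haystack(needle: str, filler: str, n_sentences: int = 5000) -> str:
    """Bury the needle at a random position inside a long filler text."""
    sentences = [filler] * n_sentences
    sentences.insert(random.randrange(n_sentences), needle)
    return " ".join(sentences)

def needle_retrieved(answer: str, expected: str) -> bool:
    """Naive scoring: does the model's answer contain the expected fact?"""
    return expected.lower() in answer.lower()

needle = "The secret code for the vault is 48151."
context = build_haystack(needle, "The sky was a pleasant shade of blue that day.")
prompt = f"{context}\n\nQuestion: What is the secret code for the vault?"

# answer = ask_llm(prompt)                      # hypothetical LLM call
# print(needle_retrieved(answer, "48151"))      # True if the needle was found
```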

Imagine the person who had to write the blog post…


Comparative Analysis

This was a lot indeed! But don’t worry, we made a table for you 😉

LLM benchmarks categorized by focus area:

Focus Area | Benchmarks
Coding and Tools | Aider bench, Berkeley Function Calling Leaderboard, Convex LLM Leaderboard, SWE-bench
Reasoning and Intelligence | LongBench, Kagi LLM Benchmark, ARC AGI Leaderboard, SimpleBench
Emotional and Social | EQ-Bench
Chat and Interaction | Chatbot Arena
Visual and Creative | SVGarena, Chatbot Arena
Factual Consistency | Vectara Hallucination Leaderboard
Usage and Popularity | OpenRouter Rankings
Narrative and Context | Fiction.liveBench
Comprehensive Analysis | SEAL LLM Leaderboards, Artificial Analysis, LiveBench, Dubesor LLM Benchmark

This table highlights the diversity of the benchmark landscape, with coding and reasoning being especially prominent, reflecting the community’s priorities in LLM evaluation.


Conclusion

The 18 LLM benchmarks collectively provide a comprehensive framework for evaluating LLMs across technical, creative, and social dimensions.

Their methodologies, from crowdsourced voting to expert-driven evaluations, address challenges like contamination and overfitting, though debates persist on their generalizability.

This analysis, rooted in community insights from Simon Willison’s X post, underscores the evolving landscape of LLM benchmarking, offering valuable resources for informed decision-making in AI development.