BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks Jun 18 • 36
Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5000B tokens tokens and 11 languages May 24 • 24
CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models May 24 • 21
Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs Apr 16 • 13
Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes? Mar 5 • 4
NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates Feb 2 • 2
The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models Jan 29 • 14
A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard Jan 12 • 6
view article Article Democratization of AI, Open Source, and AI Auditing: Thoughts from the DisinfoCon Panel in Berlin By frimelle • 1 day ago • 4
Molmo Collection Artifacts for open multimodal language models. • 5 items • Updated 14 days ago • 239
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures Paper • 2406.06565 • Published Jun 3 • 6
🎭 Avatars Collection The latest AI-powered technologies usher in a new era of realistic avatars! 🚀 • 67 items • Updated 10 days ago • 73
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25 • 85
view article Article Jack of All Trades, Master of Some, a Multi-Purpose Transformer Agent Apr 22 • 78
view article Article LLM Comparison/Test: Llama 3 Instruct 70B + 8B HF/GGUF/EXL2 (20 versions tested and compared!) By wolfram • Apr 24 • 56
Granite Code Models Collection A series of code models trained by IBM licensed under Apache 2.0 license. We release both the base pretrained and instruct models. • 23 items • Updated Aug 30 • 165
view article Article Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face May 3 • 13
view article Article A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard Jan 12 • 6
view article Article The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models Jan 29 • 14
view article Article Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases Jan 31 • 3
view article Article Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem Feb 20 • 3
view article Article Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes? Mar 5 • 4
view article Article NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates Feb 2 • 2
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Paper • 2403.09029 • Published Mar 14 • 54
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits Paper • 2402.17764 • Published Feb 27 • 593
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels Paper • 2312.17090 • Published Dec 28, 2023 • 4
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling Paper • 2312.15166 • Published Dec 23, 2023 • 56
Model Merging Collection Model Merging is a very popular technique nowadays in LLM. Here is a chronological list of papers on the space that will help you get started with it! • 30 items • Updated Jun 12 • 213
NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian Paper • 2312.01314 • Published Dec 3, 2023 • 2
LLM Leaderboard best models ❤️🔥 Collection A daily uploaded list of models with best evaluations on the LLM leaderboard: • 264 items • Updated Jun 22 • 401