
Introducing LIME: A New Framework for LLM Evaluation


LIME Framework for LLM Evaluation - Imagined by AI

Large language models are designed for a broad array of tasks across domains. Popular LLM evaluation frameworks rely on benchmarks such as MMLU, ARC, HellaSwag, and TruthfulQA to evaluate and rank models on these generalized tasks and domains.



These standardized benchmarks provide valuable insight into the general quality of model outputs, and they are widely used to score, compare, and rank models, for example on the popular Hugging Face leaderboard. However, they say little about how well a model performs on the narrow, domain-specific tasks for which specialized models are fine-tuned.


To address this gap, we are excited to introduce LIME: the Language Intelligence Model Evaluation framework. LIME provides a new technique tailored to evaluating language models on domain-specific tasks for which they have been optimized.


The LIME framework is open-sourced under the GNU Affero General Public License and is available on GitHub.




How LIME Works


With LIME, evaluators can create custom datasets comprising context paragraphs and related questions that test the particular capabilities required for their domain, such as medical information retrieval or product data extraction.
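For illustration, here is a minimal sketch of what such a custom dataset might look like. The field names (context, question, ground_truth) and the JSON Lines layout are assumptions made for this example, not LIME's required schema.

```python
# Hypothetical example of a domain-specific evaluation dataset.
# Field names and structure are illustrative, not LIME's actual schema.
import json

dataset = [
    {
        "context": "The Acme X200 blender has a 1200 W motor, a 1.5 L glass jar, "
                   "and ships with a two-year warranty.",
        "question": "What is the motor power of the Acme X200?",
        "ground_truth": "1200 W",
    },
    {
        "context": "Metformin is a first-line oral medication for type 2 diabetes "
                   "and is typically taken with meals to reduce stomach upset.",
        "question": "When is metformin typically taken?",
        "ground_truth": "With meals",
    },
]

# Persist as JSON Lines so each record can be streamed during evaluation.
with open("custom_eval_set.jsonl", "w") as f:
    for record in dataset:
        f.write(json.dumps(record) + "\n")
```

Each record pairs a context paragraph with a question and the expected answer, so the same dataset can be scored by all of the evaluators described below.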


LIME incorporates three complementary evaluation techniques to provide a multi-dimensional assessment:

  1. Binary Evaluator: Scores models based on whether responses exactly match the ground truth.

  2. Bi-Encoder Evaluation: Measures the semantic similarity between the model response and the ground truth using sentence-transformer encoders.

  3. Language Model Evaluation: An AI evaluator model assesses the quality of each response against the ground truth on a 1-100 scale, reporting its assessment in a structured JSON format. (A minimal sketch of all three evaluators follows this list.)
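To make the three techniques concrete, here is a minimal sketch of how each could be implemented. The function names, the all-MiniLM-L6-v2 encoder, and the judge_fn callable (standing in for whatever evaluator model you use) are assumptions for this example rather than LIME's actual internals.

```python
# Minimal sketches of the three evaluation techniques.
# Illustrative implementations only, not LIME's actual code.
import json
from typing import Callable

from sentence_transformers import SentenceTransformer, util

# 1. Binary evaluator: 1 if the response exactly matches the ground truth, else 0.
def binary_score(response: str, ground_truth: str) -> int:
    return int(response.strip().lower() == ground_truth.strip().lower())

# 2. Bi-encoder evaluator: cosine similarity between sentence embeddings.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice

def bi_encoder_score(response: str, ground_truth: str) -> float:
    embeddings = _encoder.encode([response, ground_truth], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()

# 3. Language-model evaluator: ask a judge model for a 1-100 score in JSON.
#    `judge_fn` is a placeholder for any prompt -> completion callable.
def llm_score(response: str, ground_truth: str, judge_fn: Callable[[str], str]) -> int:
    prompt = (
        "Compare the model response with the ground truth and rate its quality "
        'from 1 to 100. Reply only with JSON of the form {"score": <int>}.\n'
        f"Ground truth: {ground_truth}\n"
        f"Model response: {response}"
    )
    return int(json.loads(judge_fn(prompt))["score"])

if __name__ == "__main__":
    pred, gold = "The motor is rated at 1200 W.", "1200 W"
    print("binary:", binary_score(pred, gold))          # 0 (not an exact match)
    print("bi-encoder:", bi_encoder_score(pred, gold))  # semantic similarity score
```

In practice, every record in the custom dataset would be scored with all three evaluators and the results aggregated per model, giving a multi-dimensional view of performance on the target domain.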


A Powerful New Tool for LLM Evaluation


We applied LIME to evaluate several language models, including our own Tiger model fine-tuned for product information extraction, on a custom dataset of 1,000 products. Tiger outperformed strong general-purpose models such as Llama on both the bi-encoder and language model evaluations designed for this specialized task.


With LIME's targeted evaluation approach, AI teams can now benchmark specialized language models far more effectively than using broad, domain-agnostic benchmarks. We invite you to explore LIME on our GitHub and try creating custom evaluations for your own use cases.


Need help evaluating models or creating custom datasets? Contact us.


Let us know what you think about LIME and how you plan to leverage it! We're excited for this new era of domain-specific language model development and deployment.

