Getting started with model evaluations

December 13, 2024

Getting started with model evaluations

When the AISI evaluation bounty programme popped onto my radar just two days before the deadline, I decided to embrace the challenge despite my limited experience with the Inspect framework suggested. This post documents my initial exploration, setting the stage for a deeper dive.

Initial Steps

At the time of this post I was enrolled on the Blue Dot AI Safety Fundamentals Alignment course, working on Technical Governance week that featured model evaluations. The course reading, a paper from Deepmind entitled Model Evaluation for Extreme Risks, outlined why model evaluation for dangerous capabilities and alignment is a critical part of technical governance. The authors advocate for extensive use of model evaluations to inform critical decision making and provide a blue print for continuous evaluation to be used pre- and post-deployment, both internally and externally to the company.
As it turned out, the AISI evaluation bounty programme was looking in particular for autonomous agent capabilities evaluations and/or agent scaffolding. The judging criteria provided ideas for evals and how the code would be judged. I pursued autonomous agent capabilities evals as that seemed slightly more achievable in the short time I had.

Structure of an Eval

The bounty document suggested first taking a look at the Inspect framework, an open source framework developed by the AISI for evaluating LLMs. I provide a snippet of code illustrating the core components of an eval task below. While this is specific to the Inspect framework, the components should be similar to other frameworks I mention later.

I was recently working on a Text2SQL proof of concept that takes natural language queries about an SQL database, generates the SQL query and returns the response after querying the database. I use this as a grounding example to help illustrate concepts.

The code sample below is from Inspect's getting started page and is sufficient to illustrate the main concepts.

@task
def theory_of_mind():
    return Task(
        dataset=example_dataset("theory_of_mind"),
        solver=[
          prompt_template(DEFAULT_PROMPT),
          generate(),
          self_critique()
        ],
        scorer=model_graded_fact()
    )

Tasks: the evaluation task is, in the case of Inspect, a wrapper object containing the components that define the task. Typically this includes a dataset, a solver and a scorer. Each task is an experiment, like something you would track in wandb, for example here is a code snippet for creating a grid of experiments to run.

Datasets will often just be [inputs, targets], for example in the text to SQL scenario introduced above, inputs could be natural language prompts such as "Query the database and return the sales values for the region" and targets could include the SQL statement expected.

In theory, the dataset could be just a few prompts with corresponding target responses in a list for testing your code. HuggingFace has a wide range of datasets available, for example the spider SQL dataset but should you need to generate your own dataset, this OpenAI tutorial has a nice example of how to generate a dataset for evaluations using GPT-4.

Solvers are how the flow of the task is managed. Here you could simply send the prompt to the LLM, or create a chain of functions as shown in the code snippet. This chain could for instance:

pre-process the inputs (say adding some SQL prompt injection text)
call the LLM to process the prompt
post-process the output, for example the code snippet above illustrates a self_critique() function that checks and potentially modifies the response before returning.

Scorers score the response with the most common scoring functions available in the framework, as in the example above (model_graded_fact()). Typically if the target is easy enough to compare to the LLM's response, for example a year like 2024 or a multiple choice answer, a scoring function that compares the target and response can be used. If the answer is more complicated like code or a paragraph of text, an LLM may be used to "grade" the answer, as in the code snippet above.

The task, therefore, is a compilation of the dataset you want to evaluate, the nature of the task itself (solver) and a scorer.

Exploring Frameworks

Coding or understanding things at the coding level always helps me gain more context so I setup a VSC project, installed the Inspect AI plugin suggested, created a .env file with my API keys and ran the getting started code.

This was all very easy to get up and running (good docs) but I hit a GPT-4 rate limit after processing 100 samples, so I adjusted my environment settings to only attempt 5 samples! Be aware this may cost you if you are not careful.
As suggested, I browsed the logs using the Inspect AI plugin for VSC which was interesting; the sample code I was running had a critique and re-write response solver() so I was submitting more calls than I had expected (not bad, just something to be aware of). I think I might write some pre-run code that could provide me with an estimate of the cost before I run the evals.

Back to the Inspect documentation feeling slightly more comfortable about the framework now but:

A little less confident about what a good eval would look like
Also a little unclear about what exactly I should provide by the deadline, assuming I came up with a worthwhile eval

I looked for more guidance on what an eval actually looks like and via a quick Google discovered:

METR, OpenAI and other key players also had their own evaluation frameworks and codebases.

Nevertheless the core eval concepts transfer (and standardisation will follow 🙃)
Model evaluation has been around for a long time but the kind of evals the AISI is interested in are specifically targeted at dangerous capabilities. The AISI or METR frameworks (and similar) may be more focused towards this direction.
METR has a helpful comparison of their framework Vivaria against Inspect
OpenAI's post on getting started with evals was most useful for understanding the nitty-gritty of eval design. It provided more context on the design process and what the deliverables were.

By chance I discovered a course from Huggingface (smol) with a chapter on evals. The page provides a walkthrough of an eval implementation with no specific focus and refers to the Evaluation Guidebook. Another useful reference to tuck away!

It turns out this is an excellent course on using small language models for your applications. "SmolLM2 is a family of compact language models available in three size: 135M, 360M, and 1.7B parameters"
The course is a work in progress as of December 2024.

Challenges Encountered

Back on the AISI bounty page I discovered the deliverables for stage 1: "In Stage 1, you will submit a design for your evaluation or proposal for an agent scaffold." So what I really needed was a good evaluation idea, not the code just yet.

How do I choose a topic? Regarding my Text2SQL use case, I wondered if anything existed for SQL prompt injection evals. SQL prompt injection is a way to use a seemingly benign prompt to execute unauthorized SQL queries on the backend server. I found some useful resources and yet more players in this space (based on a random Google search)

Promptfoo have a plugin ready to go for red-teaming but their site also provides some insight into the types of SQL prompt injections they target. They offer much more in the way of testing LLMs so I have tucked that site away for later too!
Braintrust is an LLM app development platform with a walk through an eval for Text2SQL. This is very clear and what's nice is the components are all familiar.

So where does that leave me? There is clearly already a lot of work in this area so I need to spend a little time catching up on an area I am interested in evaluating, which is largely around deception. I don't think my interest is limited to SQL per se but it might be a useful and practical entry point.

I am still a little unclear about the scale of the eval: what if I just have a few breaking prompts, is that sufficient, assuming I create the task with dataset, solver and scorer? I have more reading to do and will produce a follow-up post shortly.

Search This Blog

Emergence