Getting started with model evaluations
When the AISI evaluation bounty programme popped onto my radar just two days before the deadline, I decided to embrace the challenge despite my limited experience in LLM model evaluations. This post documents my initial exploration, setting the stage for a deeper dive.
Initial Steps
- At the time of this post I was enrolled on the Blue Dot AI Safety Fundamentals Alignment course, working on Technical Governance week that featured model evaluations. The course reading, a paper from Deepmind entitled Model Evaluation for Extreme Risks, outlined why model evaluation for dangerous capabilities and alignment is a critical part of technical governance. The authors advocate for extensive use of model evaluations to inform critical decision making and provide a blue print for continuous evaluation to be used pre- and post-deployment, both internally and externally to the company.
- As it turned out, the AISI evaluation bounty programme was looking in particular for autonomous agent capabilities evaluations and/or agent scaffolding. The judging criteria provided ideas for evals and how the code would be judged. I pursued autonomous agent capabilities evals as that seemed slightly more achievable in the short time I had.
Structure of an Eval
The code sample below is from Inspect's getting started page and is sufficient to illustrate the main concepts.
@task
def theory_of_mind():
return Task(
dataset=example_dataset("theory_of_mind"),
solver=[
prompt_template(DEFAULT_PROMPT),
generate(),
self_critique()
],
scorer=model_graded_fact()
)
Tasks: the evaluation task is, in the case of Inspect, a wrapper object containing the components that define the task. Typically this includes a dataset, a solver and a scorer. Each task is an experiment, like something you would track in wandb, for example here is a code snippet for creating a grid of experiments to run.
Datasets will often just be [inputs, targets], for example in the text to SQL scenario introduced above, inputs could be natural language prompts such as "Query the database and return the sales values for the region" and targets could include the SQL statement expected.
In theory, the dataset could be just a few prompts with corresponding target responses in a list for testing your code. HuggingFace has a wide range of datasets available, for example the spider SQL dataset but should you need to generate your own dataset, this OpenAI tutorial has a nice example of how to generate a dataset for evaluations using GPT-4.
Solvers are how the flow of the task is managed. Here you could simply send the prompt to the LLM, or create a chain of functions as shown in the code snippet. This chain could for instance:
- pre-process the inputs (say adding some SQL prompt injection text)
- call the LLM to process the prompt
- post-process the output, for example the code snippet above illustrates a self_critique() function that checks and potentially modifies the response before returning.
The task, therefore, is a compilation of the dataset you want to evaluate, the nature of the task itself (solver) and a scorer.
Exploring Frameworks
- Coding or understanding things at the coding level always helps me gain more context so I setup a VSC project, installed the Inspect AI plugin suggested, created a .env file with my API keys and ran the getting started code.
- This was all very easy to get up and running (good docs) but I hit a GPT-4 rate limit after processing 100 samples, so I adjusted my environment settings to only attempt 5 samples! Be aware this may cost you if you are not careful.
- As suggested, I browsed the logs using the Inspect AI plugin for VSC which was interesting; the sample code I was running had a critique and re-write response solver() so I was submitting more calls than I had expected (not bad, just something to be aware of). I think I might write some pre-run code that could provide me with an estimate of the cost before I run the evals.
- Back to the Inspect documentation feeling slightly more comfortable about the framework now but:
- A little less confident about what a good eval would look like
- Also a little unclear about what exactly I should provide by the deadline, assuming I came up with a worthwhile eval
- I looked for more guidance on what an eval actually looks like and via a quick Google discovered:
- METR, OpenAI and other key players also had their own evaluation frameworks and codebases.
- Nevertheless the core eval concepts transfer (and standardisation will follow 🙃)
- Model evaluation has been around for a long time but the kind of evals the AISI is interested in are specifically targeted at dangerous capabilities. The AISI or METR frameworks (and similar) may be more focused towards this direction.
- METR has a helpful comparison of their framework Vivaria against Inspect
- OpenAI's post on getting started with evals was most useful for understanding the nitty-gritty of eval design. It provided more context on the design process and what the deliverables were.
- By chance I discovered a course from Huggingface (smol) with a chapter on evals. The page provides a walkthrough of an eval implementation with no specific focus and refers to the Evaluation Guidebook. Another useful reference to tuck away!
- It turns out this is an excellent course on using small language models for your applications. "SmolLM2 is a family of compact language models available in three size: 135M, 360M, and 1.7B parameters"
- The course is a work in progress as of December 2024.
Challenges Encountered
- Promptfoo have a plugin ready to go for red-teaming but their site also provides some insight into the types of SQL prompt injections they target. They offer much more in the way of testing LLMs so I have tucked that site away for later too!
- Braintrust is an LLM app development platform with a walk through an eval for Text2SQL. This is very clear and what's nice is the components are all familiar.
Comments
Post a Comment