Blue Dot AI Alignment Course - Week 3 - RLHF and CAI

Welcome back to my weekly overview of the AI Safety Alignment course from Blue Dot! 

This week we look at training techniques for incorporating preferences into models, including Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI (CAI), which adopts a more principled approach. At the end of the last session I decided that a principled base model had to be an essential requirement before any further fine-tuning, so I was gratified to learn that CAI is already in use in Claude. During the cohort discussions I came to realise that the method matters less than determining the source of the underlying principles. That seems a considerably harder problem to resolve, especially while navigating the potential impacts of the current socio-political environment. How do we agree on a unified set of principles? Who decides them, and who ultimately gets to control them? Nevertheless the mechanism is available, and this week we take a closer look at preference reward models and fine-tuning. 

No quote this week, but an entertaining and educational video instead. 


  

RLHF


The video shared above is the first course reading for the week and explains one of the key tensions in training LLMs, viz. balancing usefulness and harmlessness. 
It turns out that LLMs are very effective and helpful ... to the point of generating harmful content, such as how to create dangerous substances or hurt people. Processes like RLHF are used to instil human preferences and values in the model: don't harm people, don't build bioweapons, don't be offensive or racist, and so on. 
RLHF is a multi-step, fairly resource-intensive process that first collects human preferences and then applies them to an LLM in the form of a reward model, effectively discouraging harmful responses by aligning the model with human behaviours. 

The first step is to train a reward model (RM, also referred to as a preference model, PM) on what we consider good and bad behaviour. The training material for the RM comes from humans providing feedback on model responses to a selection of prompts, including harmful ones. We then have a reward model that knows what we like and dislike and, when prompted, provides a reward that conveys these preferences. This reward signal is used to fine-tune LLMs to be harmless with a reinforcement learning (RL) algorithm such as PPO (Proximal Policy Optimisation, a popular choice).  
Due to the nature of RL, an RLHF fine-tuned language model can overdo the harmlessness and even produce gibberish just to collect rewards and game the system (reward hacking, a common problem in RL)! In the video linked above, the Values coach and the Coherence coach are both needed to generate content that is useful and harmless. A more detailed overview of the training process is provided below.  
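To make the reward-model step a little more concrete, below is a minimal sketch (in PyTorch) of the pairwise preference loss commonly used to train reward models: the RM scores two responses to the same prompt, and the loss pushes the human-preferred response to score higher than the rejected one. The function and variable names are my own illustrations, not code from the course materials.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss (Bradley-Terry style).
    # reward_chosen / reward_rejected are the scalar scores the RM assigns to the
    # human-preferred and human-rejected responses for the same prompt.
    # Minimising this loss pushes the preferred score above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: RM scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.7, 0.9, -0.5])
print(reward_model_loss(chosen, rejected).item())  # lower when chosen consistently outscores rejected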

RLHF Training process:

Source: Illustrating Reinforcement Learning from Human Feedback (RLHF) 

The process for training the LLM using RLHF is clearly tricky and a lot of detail is glossed over here, but there is another amazing ARENA tutorial for those who want to go deeper. Below is an overview of the steps; more detail on the training process may be found in Illustrating Reinforcement Learning from Human Feedback (RLHF).

This is at least a three-model process: the base model that we will ultimately train, another version of the base model to keep us coherent (or useful), and a third model trained from human feedback with our preferences, known as the reward model. If we need to generate synthetic datasets, more models are required; since we assume access to human-generated data for RLHF, we can review synthetic datasets another time. 
  • We start with a good base language model (LM) that is already instruction- and chat-ready. This is the useful model. 
  • The purpose of the RM is to act as the reward function for an RL process. In the RLHF setting the LLM is an agent that generates a response (action) for a prompt (state). Normally in RL the agent receives feedback from the environment in the form of a scalar reward, and an RL algorithm is applied to optimise the agent to maximise rewards. In this scenario, the RM provides that environment feedback, or scalar reward, to the LLM agent. Training the RM entails collecting human ratings of responses to prompts, which are converted into rewards. This allows the model to learn human preferences, viz. what counts as good and bad behaviour. 
  • Once the RM is trained, the base LLM can be further fine-tuned using an RL algorithm, often PPO. To prevent the model from overfitting to the reward model, a response is also generated from the original base LLM (the helpful model) and a KL term is added to the loss to stop the responses diverging too far, effectively acting as the "coherence coach" (see the sketch after this list). 
  • The final result is a helpful base LLM that is tuned to be less harmful based on some human feedback. 
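As mentioned in the list above, the RL step shapes the reward by combining the RM's score with a KL penalty against the frozen base model, which is what keeps the fine-tuned model coherent. Below is a minimal sketch of that reward shaping, assuming PyTorch tensors of per-token log-probabilities; the names (shaped_rewards, kl_coef, etc.) are illustrative rather than taken from any particular implementation.

import torch

def shaped_rewards(rm_score: torch.Tensor,
                   logprobs_policy: torch.Tensor,
                   logprobs_ref: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    # rm_score:        scalar reward from the RM for the whole response
    # logprobs_policy: log-probs of the generated tokens under the model being trained
    # logprobs_ref:    log-probs of the same tokens under the frozen base model
    # The per-token log-ratio penalises drifting away from the base model (the
    # "coherence coach"); the RM score is added on the final token, as in common RLHF setups.
    per_token_kl = logprobs_policy - logprobs_ref
    rewards = -kl_coef * per_token_kl
    rewards[-1] = rewards[-1] + rm_score
    return rewards

# Toy usage: a four-token response.
print(shaped_rewards(rm_score=torch.tensor(1.5),
                     logprobs_policy=torch.tensor([-1.0, -0.8, -1.2, -0.9]),
                     logprobs_ref=torch.tensor([-1.1, -0.7, -1.0, -0.9])))
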
Comments: 
  • RLHF is another way to fine-tune a model, albeit more heavy-handed than supervised fine-tuning. It uses RL, which has proved much more effective as a fine-tuning method in the LLM space than as a pre-training one. Andrej Karpathy has some interesting commentary on using RL with LLMs in his video and tweet about mode collapse and how it reduces model entropy quite drastically. There is clearly a lot of detail behind getting RLHF right.  
  • Some of the reading material documenting the human data-collection process gives an idea of how difficult it is to collect neutral human feedback; it is amazing that RLHF works as well as it does, which probably says something about the scale of the data and training process.   
  • During our cohort discussion we spoke about how much effort and how many resources RLHF takes. What happens when smaller players are unable to compete? Could this force them to go down the RLAIF (AI-in-the-loop) route earlier in the process? The next method is in fact a hybrid of human and AI feedback. 

CAI - Constitutional AI: 

CAI adopts an alternative process to RLHF, using both AI and human feedback to generate the RM (or preference model) for the RL fine-tuning process, a setup also known as RLAIF. While RLAIF is a more general training process, CAI is a principle-aligned one, generating preferences based on a set of principles, or constitution. Anthropic drafted a constitution from various sources and presented it to the LLM as a set of natural-language guiding principles against which to judge its own responses, aka AI feedback. The principled responses are then used to fine-tune a harmless model. A brief outline of the CAI process is provided below.

  • First, supervised learning is used to fine-tune a helpful LLM to reduce harmfulness, producing the SL-CAI model. The harmless dataset is produced without human feedback, using the constitution. 
    • Using a dataset of harmful prompts, first get the LLM to generate responses to them. 
    • Prompt the LLM to self-critique its response against the principles and to revise the response based on the critique. The prompts and revised responses are collected as a fine-tuning dataset. 
    • The base LLM is fine-tuned on this revised (harmless) dataset, along with a mixture of normal helpful data to balance the training.  
    • The end result is the SL-CAI model, which removes most harmful content just by applying its own principles. 
  • RM and RL: train an RM from AI feedback and use it to fine-tune SL-CAI with RL (a sketch of both phases follows this list). 
    • Create the RM dataset: SL-CAI generates pairs of completions for each prompt and evaluates which completion is better using the principles. 
    • The RM is trained using: 
      • AI feedback labels for harmful behaviour (from SL-CAI) 
      • Human feedback labels for helpfulness (from a general LLM) 
    • The PPO RL algorithm is then used to train the model with a KL penalty, resulting in the RL-CAI model. 
    • Results: the RL-CAI model produces better evaluation results than the SL-CAI model; it turns out to be less evasive but just as harmless, and still helpful. 
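To tie the two CAI phases together, here is a minimal Python sketch of the two data-generation loops: the critique-and-revision loop that builds the SL-CAI fine-tuning set, and the AI-feedback comparison loop that builds the preference data for the RM. The generate callable stands in for whatever model is being queried, and the prompt wording and helper names are my own illustrations, not Anthropic's actual templates.

from typing import Callable, List, Tuple

def build_sl_cai_dataset(harmful_prompts: List[str],
                         principles: List[str],
                         generate: Callable[[str], str]) -> List[Tuple[str, str]]:
    # Phase 1: critique-and-revision loop producing (prompt, revised_response) pairs
    # that are later used to fine-tune the SL-CAI model.
    dataset = []
    for prompt in harmful_prompts:
        response = generate(prompt)
        for principle in principles:
            critique = generate(
                "Critique the following response according to this principle: "
                f"{principle}\n\nResponse: {response}")
            response = generate(
                "Rewrite the response to address the critique.\n\n"
                f"Critique: {critique}\n\nResponse: {response}")
        dataset.append((prompt, response))
    return dataset

def build_ai_preference_dataset(prompts: List[str],
                                principles: List[str],
                                generate: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    # Phase 2: the SL-CAI model labels which of two sampled completions better
    # follows the constitution; the (prompt, chosen, rejected) triples train the RM.
    preferences = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)  # sample a completion pair
        verdict = generate(
            f"According to these principles: {principles}\n"
            f"Which response to '{prompt}' is better, (A) or (B)?\n"
            f"(A) {a}\n(B) {b}\nAnswer with A or B.")
        chosen, rejected = (a, b) if verdict.strip().startswith("A") else (b, a)
        preferences.append((prompt, chosen, rejected))
    return preferences

In a real run the generate callable would wrap calls to the SL-CAI (or helpful) model; the helpfulness preference data from humans is then mixed in before training the RM, as described in the list above.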

Comments: 
  • If I thought the process for RLHF was complicated, CAI feels even more complex, with many moving parts. Nevertheless I am very positive about CAI - starting from a principled position makes sense, and having a constitution to guide reasoning seems like a solid idea.  
  • Conceptually, we seem to pre-train on the "world" and then try to write patches for values and principles afterwards. Should we instead have pre-trained the model from exposure to, and interactions with, a human society that already incorporates values and principles? 
  • The case study from this week's class session is a paper that finds some of the top models are politically left-leaning, and that the initial dataset and human evaluators might exhibit some left-wing bias. It occurs to me that solving the left-wing-bias problem might be complicated if right-wing content is filtered out by moderation because it is harmful, factually incorrect, blatantly biased, etc. An example: 
    • First I asked Claude-Sonnet how to debate with Flat Earthers, and it produced a set of guidelines on how to build trust and communicate effectively. 
    • Next I told Claude-Sonnet I was a Flat Earther and asked it to give me points for a debate. It refused with an almost hard-coded answer. 
  • While I believe a principled CAI system is useful, the question of whose principles to build into the model becomes complex. Would current governments expect models to comply with their principles? Should the principles take into account local culture and value systems? How do we find a system of principles that everyone agrees with, when we can't even do this within a single society today, let alone across societies? 
