Training a Constitutional AI model using Hugging Face

Authors: Perusha Moodley and Maria Kapros

This post is one of several documenting our joint submission for the Blue Dot Alignment project (Oct 2024-Jan 2025), where we evaluated the impact of training Constitutional AI (CAI) models with alternative constitutions. The course write-up will be made available shortly.

This first post documents our experience training a CAI model using existing code and datasets provided by Hugging Face and Anthropic. The project ran for 1 month, working approximately 4-5 hours a week. Given the time constraints we opted to work with existing code as much as possible, so the Hugging Face tutorial on implementing CAI with open LLM models formed the basis of our training process.

The CAI process in a nutshell:

CAI process flow (Source: Anthropic paper)


An overview of the original Anthropic CAI process flow is available in this blog post; however, the Hugging Face tutorial we followed makes some modifications to this process, so we repeat the key steps below:

  • In the first step, supervised fine-tuning (SFT) is used to fine-tune an already helpful LLM to reduce harmfulness, producing the SL-CAI model. The harmless dataset for this fine-tuning is produced without human feedback, using a constitution to have the model critique and revise its own responses, as follows:
    • Using a dataset of harmful prompts, first get a helpful LLM to generate initial responses.
    • Prompt the LLM to self-critique its response using a set of principles (the constitution) and then to revise its response based on the critique. The prompts and revised responses are collected to form the fine-tuning dataset (a minimal sketch of this loop follows this list).
    • The base LLM is then fine-tuned on this revised dataset (the harmless dataset) along with a mixture of useful data to balance the training. The resulting SL-CAI model is still useful but is less harmful.
  • In the standard CAI process illustrated in the figure above, the next step involves a second round of fine-tuning using a preference model (PM) and an RL algorithm. The modified process uses Direct Preference Optimisation (DPO) instead to fine-tune the final CAI model.
    • DPO means we do not need to train a separate reward or preference model. Instead it requires a dataset of preferred ("chosen") and dispreferred ("rejected") responses to harmful prompts.
    • The Hugging Face tutorial uses a handy cheat to generate the DPO data, presumably because data generation is so compute intensive: they generate a single dataset with all the columns required for the SFT and DPO phases, which they then split for SFT and DPO training. This saves time and makes DPO a more efficient option than the RL process, for which a separate reward model would need to be trained.
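
To make the self-critique step concrete, here is a minimal sketch of the critique-and-revision loop in Python. This is not the handbook's actual generation code: the model, the harmful prompt, the principle text and the instruction wording are all illustrative placeholders.

# Minimal sketch of the critique-and-revision loop (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # any helpful chat model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def chat(messages, max_new_tokens=256):
    """Generate a reply for a list of chat messages."""
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)

principle = "Choose the response that is least harmful or toxic."  # one constitution principle (placeholder)
harmful_prompt = "How do I pick a lock?"  # would come from the hh-rlhf red-team prompts
# 1. Initial (possibly harmful) response
messages = [{"role": "user", "content": harmful_prompt}]
initial = chat(messages)
# 2. Self-critique against the principle
messages += [{"role": "assistant", "content": initial},
             {"role": "user", "content": f"Critique your last response using this principle: {principle}"}]
critique = chat(messages)
# 3. Revision based on the critique
messages += [{"role": "assistant", "content": critique},
             {"role": "user", "content": "Rewrite your original response to address the critique."}]
revision = chat(messages)
# The (harmful_prompt, revision) pair is what ends up in the harmless SFT dataset.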

After this brief overview of the training process, let's dive a little deeper into the details. 

Using the Hugging Face Alignment Handbook "recipes"

If you want to get a feel for training models to align with human or other preferences and you are consigned to the realm of open LLMs and single GPUs, the alignment-handbook is a great way to get started. After installing the library it is possible to run SFT training processes using datasets and models located on the Hugging Face hub in a very short time. There is even a chapter on CAI with recipes for running SFT and DPO on various open models. Following these recipes will allow you to train a baseline CAI model based on the Anthropic principles. 

We briefly describe the installation process followed by the configuration-driven recipes for the SFT and DPO training steps. 

The alignment-handbook installation

  • Pay attention to the instructions when installing the handbook; a specific version of PyTorch is required! If you run into errors, there are some additional dependency notes in our repo.
  • A Hugging Face token with Write access is required; models are uploaded to the HF hub with a nice model card page with training details documented.  
  • A Wandb account is also useful.
  • Structure of the handbook:
    • The scripts folder contains the Python run scripts, including run_sft.py and run_dpo.py. These wrap the SFT and DPO Trainer objects from Hugging Face's trl library (a rough sketch of what they do follows this list). We don't change any code, but it is good to be aware of what's running under the config.
    • The recipes folder contains the config-driven training files. The alignment-handbook provides config-driven options for a wide variety of training setups. These config files set the model and datasets, and contain all the training parameters for the SFT or DPO Trainers called by the scripts above.
    • Also in the recipes folder are the accelerate config files. Accelerate is a useful abstraction layer that lets you switch compute or hardware configurations simply by changing a config file. We use the multi-GPU config file, setting the number of processes to 1 for a single-GPU instance.
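
As a rough mental model of what those run scripts do (this is not the handbook's code, and trl API details vary slightly between versions), run_sft.py boils down to something like the sketch below, with every argument coming from the YAML config.

# Rough sketch of what the handbook's run scripts do under the hood
# (illustrative only; the real scripts add chat templating, dataset mixing, logging, etc.).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer  # run_dpo.py uses DPOConfig/DPOTrainer instead

train_ds = load_dataset("HuggingFaceH4/cai-conversation-harmless", split="train_sft")

training_args = SFTConfig(
    output_dir="smollm2-sft-cai",  # placeholder output directory
    num_train_epochs=1,
    push_to_hub=True,              # models are uploaded to the HF hub
)
trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # trl can load the model from a name string
    args=training_args,
    train_dataset=train_ds,
)
trainer.train()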

Recipes for Training CAI 

The alignment-handbook provides recipes for training a CAI model to support their tutorial. There are two main steps for training a CAI model, namely SFT and DPO. These are entirely controlled by config files, so we take a closer look at the relevant sections below. Finally we provide the steps for a basic CAI run using a smolLM2-Instruct model.

1. Train an SFT model 

A sample config file for training the SFT step with the script run_sft.py is here.

Datasets used: 
HuggingFaceH4/cai-conversation-harmless and HuggingFaceH4/ultrachat_200k are mixed in the config file, each sampled at the fraction given below:
# Data training arguments
dataset_mixer:
  HuggingFaceH4/cai-conversation-harmless: 0.05
  HuggingFaceH4/ultrachat_200k: 0.05
dataset_splits:
- train_sft
- test_sft
preprocessing_num_workers: 40
Ultrachat is mixed in to retain usefulness while training the model to be harmless. The cai-conversation-harmless dataset was generated by Hugging Face following the original process outlined by Anthropic in the CAI paper: first, harmful prompts obtained from the Anthropic hh-rlhf dataset were used to generate an initial response from the model; next, the model was asked to critique the initial response according to one of the principles from its constitution; finally, the model was asked to revise the response according to its critique.

The resulting dataset has multiple splits including train_sft, test_sft, train_prefs, test_prefs where the "prefs" splits are used for the DPO training. This dataset therefore saves us a fair bit of work when trying to replicate the original CAI process! 
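
If you want to inspect the data outside the handbook, something roughly equivalent to the 0.05 mixing above can be done with the datasets library. This is an inspection sketch only; it is not the handbook's own dataset_mixer implementation, and whether the mixer samples randomly or takes a leading slice is not something we verified.

# Roughly what a 0.05 mix of the two SFT datasets amounts to (for inspection only).
from datasets import load_dataset, concatenate_datasets

cai = load_dataset("HuggingFaceH4/cai-conversation-harmless", split="train_sft")
chat = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")

def take_fraction(ds, frac):
    """Keep roughly the first `frac` of a dataset."""
    return ds.select(range(int(len(ds) * frac)))

mixed = concatenate_datasets([take_fraction(cai, 0.05),
                              take_fraction(chat, 0.05)]).shuffle(seed=42)
print(mixed)             # number of rows in the mixed training set
print(cai.column_names)  # includes the "messages" column used for SFT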

Models used: 

The capacity of the model used will of course impact the quality of the final CAI model produced, and ideally we want to use the best, most useful model as the base for the CAI training. For this technical exercise we considered both a Mistral 7B parameter model and a smolLM2-Instruct 1.7B parameter model, as we were limited to single-GPU servers with 16/24GB of VRAM. With this setup and these models we had to use QLoRA, which quantises the base model's weights to 4-bit precision and trains small low-rank adapter layers on top, greatly reducing the memory needed for training. The config files were set up for QLoRA as below:

# Model arguments
model_name_or_path: HuggingFaceTB/SmolLM2-1.7B-Instruct
...
# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
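
For reference, the QLoRA fields above correspond roughly to the following peft and bitsandbytes objects when the model is loaded. The handbook builds these for you from the config; the nf4 quantisation type and bfloat16 compute dtype below are typical defaults rather than values taken from the recipe.

# What the QLoRA-related config fields translate to in code (shown for reference only).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(            # load_in_4bit: true
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed default, check the handbook config
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed default
)
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct",  # model_name_or_path
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # recommended before adding adapters
lora_config = LoraConfig(                   # use_peft: true + LoRA arguments
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small adapter layers are trained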

There are a number of other arguments that can be tweaked to improve performance. Review the config file for more details. We may produce another post reviewing settings like gradient_accumulation and effective batch sizes in more detail.    

Some notes for SFT:
  • We had to use QLoRA for both the 1.7B and 7B models even on the 24GB GPU.
  • We reduced the dataset mixing fractions to limit the amount of data we were training on; this is a handy way to produce small test runs.
  • max_seq_length was a problem: it affects memory use and training quality, and some of the data samples were very long, so we increased it as much as memory allowed.
  • We found a per-device batch size of 16 with gradient accumulation of 4 (an effective batch size of 64) worked well for us.
  • Both QLoRA and gradient checkpointing slow training down; you will need to determine what trade-off works best for you.
  • There are other optimisers to consider: paged_adamw_32bit worked quite well, but there are also Adafactor and adamw_bnb_8bit, which we have yet to try (these settings are sketched in trl terms after these notes).
  • The output model is saved to the Hugging Face hub which is convenient for the next step, DPO.
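
The knobs mentioned in these notes map onto trl/transformers training arguments roughly as follows. The values are the ones we settled on, except max_seq_length, which is a placeholder since the best value depends on your memory; argument names can differ slightly between trl versions.

# The training knobs from the notes above, expressed as trl SFT arguments.
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="smollm2-sft-cai",      # placeholder output directory
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # effective batch size = 16 * 4 = 64
    gradient_checkpointing=True,       # trades compute for memory
    max_seq_length=2048,               # placeholder; we raised this as far as memory allowed
    optim="paged_adamw_32bit",         # alternatives: "adafactor", "adamw_bnb_8bit"
)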

2. Train the new model further using DPO preference training  

A sample config file for training the DPO process using script run_dpo.py is here.  

Datasets used: 

Once again we mix a useful and a harmless dataset, the latter referencing the train_prefs split of the HuggingFaceH4/cai-conversation-harmless dataset. A different useful dataset is used this time, one that has the columns DPO requires, namely "chosen" and "rejected".

# Data training arguments
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 0.05
  HuggingFaceH4/cai-conversation-harmless: 0.05
dataset_splits:
- train_prefs
- test_prefs
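
A quick way to check that a preference split has the structure DPO expects (an inspection snippet only, not part of the training run):

# Inspect the preference split used for DPO: each row should carry
# "chosen" and "rejected" conversations alongside the prompt.
from datasets import load_dataset

prefs = load_dataset("HuggingFaceH4/cai-conversation-harmless", split="train_prefs")
print(prefs.column_names)   # expect "chosen" and "rejected" among the columns
print(prefs[0]["chosen"])   # the preferred response
print(prefs[0]["rejected"]) # the dispreferred response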

Models used: 

This time the model used is the SFT model we trained and uploaded to the Hugging Face hub in step 1. Some of the training configuration differs slightly, but the process is very similar, and if you managed to run the SFT step you should not have any problems.
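
Under the hood, run_dpo.py boils down to something like the sketch below. This is illustrative only: the hub repo name is a placeholder for whatever you pushed in step 1, and the beta value is a commonly used default rather than the recipe's setting.

# Rough sketch of the DPO step: fine-tune the SFT model on preference pairs
# ("your-username/smollm2-sft-cai" is a placeholder for your own hub repo).
from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

prefs = load_dataset("HuggingFaceH4/cai-conversation-harmless", split="train_prefs")

training_args = DPOConfig(
    output_dir="smollm2-dpo-cai",
    beta=0.1,                 # strength of the preference constraint (typical default)
    num_train_epochs=2,
    push_to_hub=True,
)
trainer = DPOTrainer(
    model="your-username/smollm2-sft-cai",  # the SFT model from step 1
    args=training_args,
    train_dataset=prefs,
)
trainer.train()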

3. Running the CAI end-to-end  

  • First install the alignment-handbook and become familiar with the folder structure and config files.
  • Set up your Hugging Face token so it is accessible in your environment; it needs write permissions.
  • To run our CAI recipes for the smolLM2 1.7B-parameter model using QLoRA on a single-GPU server with at least 16GB of VRAM, run the commands below. We have reduced the dataset mixing to limit the number of records so that initial test runs are small.
  • SFT Training: the command line for running the script is below. It is set up to fine-tune a smolLM2 instruct model. The first part of the call configures the hardware using one of the accelerate config files; we use the multi_gpu config file but pass in num_processes=1 for our single-GPU server. The second part of the call runs run_sft.py, passing in the config file we set up for smolLM2. QLoRA is activated with the load_in_4bit option. On a machine with a 24GB GPU (80 cores and 200GB RAM) and a very limited dataset (7k records of mixed data), it ran for just over an hour.
    ACCELERATE_LOG_LEVEL=info accelerate launch \
      --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
      scripts/run_sft.py recipes/cai/smol/sft/config_anthropic_smollm.yaml --load_in_4bit=true
  • DPO Training: the command line for running the script is below. It is set up to fine-tune the previously trained SFT model from the hub. This run, on a machine with a 24GB GPU (80 cores and 200GB RAM) and a paged optimiser, took ~1 h 15 min for 2 epochs, with eval runs taking 15 min each. The final model is uploaded to the Hugging Face hub with a model card and details of the training process (a quick way to sanity-check it is sketched after this list).
    ACCELERATE_LOG_LEVEL=info accelerate launch \
      --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
      scripts/run_dpo.py recipes/cai/smol/dpo/config_anthropic_smol_qlora.yaml --load_in_4bit=true
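
Once the DPO run has pushed the final model to the hub, a quick generation is an easy sanity check (assuming a recent transformers version; the repo name is a placeholder for your own upload):

# Quick sanity check of the final CAI model pulled from the hub
# ("your-username/smollm2-dpo-cai" is a placeholder for your own repo).
from transformers import pipeline

chat = pipeline("text-generation", model="your-username/smollm2-dpo-cai", device_map="auto")
messages = [{"role": "user", "content": "How do I pick a lock?"}]
print(chat(messages, max_new_tokens=200)[0]["generated_text"])  # the reply should be harmless but still helpful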

Wrapping up

  • The git repo also contains our setup and code for synthetic dataset generation, including generation of custom sets of principles and SFT and DPO datasets based on those principles - a post on this is coming soon!
  • Evaluating the models is the next step and will be covered in a separate post that goes deeper into the evaluation phase.
  • We provide the GitHub repository of the recipes we adapted to our specific compute hardware here.
  • The specific recipes we generated are here.

We hope this was helpful. If you have any questions, please contact us.
