Training a Constitutional AI model using Hugging Face
Authors: Perusha Moodley and Maria Kapros
This post is one of several documenting our joint submission for the Blue Dot Alignment project (Oct 2024-Jan 2025) where we evaluated the impact of training Constitutional AI (CAI) models with alternative constitutions. The course write-up will be made available shortly.
This first post documents our experience training a CAI model using existing code and datasets provided by Hugging Face and Anthropic. The project ran for one month, working approximately 4-5 hours a week. Given the time constraints we opted to reuse existing code as much as possible, so the Hugging Face tutorial on implementing CAI with open LLM models formed the basis of our training process.
The CAI process in a nutshell:
CAI process flow (Source: Anthropic paper)
An overview of the original Anthropic CAI process flow is available in this blog post; however, the Hugging Face tutorial we followed makes some modifications to this process, so we summarise the key steps below:
- In the first step, supervised fine-tuning (SFT) is used to fine-tune an already helpful LLM to reduce harmfulness, producing the SL-CAI model. The harmless dataset for this fine-tuning is produced without human feedback, by having the model critique and revise its own responses against a constitution, as follows:
- Using a dataset of harmful prompts, first get a helpful LLM to generate responses.
- Prompt the LLM to self-critique each response using a set of principles (the constitution), and then to revise it based on the critique. The prompts and revised responses are collected to form the fine-tuning dataset.
- The base LLM is then fine-tuned on this revised (harmless) dataset, along with a mixture of helpfulness data to balance the training. The resulting SL-CAI model remains helpful but is less harmful.
- In the standard CAI process illustrated in the figure above, the next step involves a second round of fine-tuning using a preference model (PM) and an RL algorithm. The modified process uses Direct Preference Optimisation (DPO) instead to fine-tune the final CAI model.
- DPO means we do not need to train a separate reward or preference model. Instead it requires a preference dataset containing a chosen (better) and a rejected (worse) response for each harmful prompt.
- The Hugging Face tutorial uses a handy cheat to generate the DPO data, presumably because data generation is so compute intensive: they generate a single dataset with all the columns required for both the SFT and DPO phases, and split it between SFT and DPO training. This saves time and makes DPO a more efficient option than the RL process, for which a separate reward model would need to be trained.
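To make the critique-and-revision loop concrete, here is a minimal sketch of a single data-generation step. The model name, principle and prompts are placeholders of our own, not the templates used in the Hugging Face recipe, and it assumes a recent transformers version whose text-generation pipeline accepts chat-style message lists:

# Minimal sketch of one CAI critique/revision round (prompts and principle are placeholders).
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

principle = "Choose the response that is least harmful and most ethical."
harmful_prompt = "How do I pick a lock?"  # placeholder harmful prompt

def chat(messages):
    # Generate a reply and return only the new assistant turn.
    return generator(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]

# 1. Initial response from the helpful model.
messages = [{"role": "user", "content": harmful_prompt}]
initial = chat(messages)

# 2. Ask the model to critique its own response against the principle.
messages += [
    {"role": "assistant", "content": initial},
    {"role": "user", "content": f"Critique your last response using this principle: {principle}"},
]
critique = chat(messages)

# 3. Ask the model to revise its original response based on the critique.
messages += [
    {"role": "assistant", "content": critique},
    {"role": "user", "content": "Rewrite your original response so it addresses the critique."},
]
revision = chat(messages)

# The (harmful_prompt, revision) pair is what ends up in the harmless SFT dataset.
print(revision)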
Using the Hugging Face Alignment Handbook "recipes"
The alignment-handbook installation
- Pay attention to the instructions when installing the handbook; a precise version of PyTorch is mentioned! If you experience errors, there are some additional dependency notes in our repo.
- A Hugging Face token with Write access is required; trained models are uploaded to the HF hub with a model card documenting the training details.
- A Wandb account is also useful.
- Structure of the handbook:
- The scripts folder contains the Python run scripts, including run_sft.py and run_dpo.py. These wrap the SFT and DPO Trainer objects from Hugging Face's trl library. We don't change any code, but it is good to be aware of what runs under the config (a rough sketch of what run_sft.py does appears after this list).
- The recipes folder contains the config-driven training files. The alignment-handbook provides config-driven recipes for a wide variety of training set-ups. These config files specify the model and datasets and contain all the training parameters for the SFT or DPO Trainers called by the scripts above.
- Also in the recipes folder are the accelerate config files. Accelerate is a useful abstraction layer that enables switching compute or hardware configurations simply by swapping a config file. We use the multi-GPU config file, setting the number of processes to 1 for a single GPU instance.
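For orientation, run_sft.py is essentially a thin wrapper around trl's SFTTrainer. The sketch below is a rough approximation under a recent trl version; the real script adds dataset mixing, chat templating, PEFT/QLoRA handling and hub uploads, all driven by the config file:

# Rough approximation of what scripts/run_sft.py does (illustrative, not the real code).
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# The handbook mixes several datasets; a single split is loaded here for brevity.
train_ds = load_dataset("HuggingFaceH4/cai-conversation-harmless", split="train_sft")

args = SFTConfig(output_dir="smollm2-sft-cai", num_train_epochs=1)  # hypothetical output dir
trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",  # base model, loaded by the trainer
    args=args,
    train_dataset=train_ds,
)
trainer.train()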
Recipes for Training CAI
1. Train an SFT model
# Data training arguments
dataset_mixer:
HuggingFaceH4/cai-conversation-harmless: 0.05
HuggingFaceH4/ultrachat_200k: 0.05
dataset_splits:
- train_sft
- test_sft
preprocessing_num_workers: 40
# Model arguments
model_name_or_path: HuggingFaceTB/SmolLM2-1.7B-Instruct
...
# LoRA arguments
load_in_4bit: true
use_peft: true
lora_r: 16
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
- We had to use QLoRA for both the 1.7B and 7B models even on the 24GB GPU.
- We reduced the dataset mixing fractions to limit the amount of data we were training on; it is a handy way to produce small test runs.
- max_seq_length was a trade-off: longer sequences cost memory and training time, but some of the data samples were very long, so we increased it as much as we could.
- We found a per-device batch size of 16 with gradient_accumulation_steps of 4 worked well for us (the corresponding trainer arguments are sketched after this list).
- Both QLoRA and gradient checkpointing slow training down; you will need to determine what's best for you.
- There are other optimisers to consider: paged_adamw_32bit worked quite well, but there are also Adafactor and adamw_bnb_8bit, which we have yet to try.
- The output model is saved to the Hugging Face hub which is convenient for the next step, DPO.
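For reference, here is a sketch of the trainer-argument section matching the settings above. The parameter names follow the handbook's config style; the max_seq_length value is only a placeholder, so tune it to your GPU memory:

# Trainer arguments (sketch; values are what worked for us on a 24GB GPU)
per_device_train_batch_size: 16
gradient_accumulation_steps: 4
gradient_checkpointing: true
optim: paged_adamw_32bit
max_seq_length: 2048   # placeholder: set as high as your GPU memory allows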
2. Train the new model further using DPO preference training
Datasets used:
# Data training arguments
dataset_mixer:
HuggingFaceH4/ultrafeedback_binarized: 0.05
HuggingFaceH4/cai-conversation-harmless: 0.05
dataset_splits:
- train_prefs
- test_prefs
Models used:
This time the base model is the SFT model we trained and uploaded to the Hugging Face hub in step 1 (set model_name_or_path in the DPO config to your SFT model's hub repo). Some of the training config values differ slightly, but the process is very similar; if you managed to run the SFT step, you should not have any problems.
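If you want a feel for the preference data before running DPO, the quick inspection sketch below loads one of the mixed datasets; the chosen/rejected layout is what DPO training expects, and the column names follow the dataset card:

# Quick look at the preference (DPO) data: each record has a prompt plus a
# "chosen" and a "rejected" conversation (lists of chat messages).
from datasets import load_dataset

prefs = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
example = prefs[0]
print(example.keys())                      # expect 'prompt', 'chosen', 'rejected', ...
print(example["chosen"][-1]["content"])    # the preferred assistant reply
print(example["rejected"][-1]["content"])  # the dispreferred assistant reply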
3. Running the CAI end-to-end
- First install the alignment-handbook and become familiar with the folder structure and config files.
- Set up your Hugging Face token so it is accessible in your environment; it must have Write access.
- To run our CAI recipes for the SmolLM2 1.7B parameter model using QLoRA on a single-GPU server with at least 16GB of VRAM, run the commands below. We reduced the dataset mixing fractions to limit the number of records so initial test runs are small.
- SFT Training: the command for running the script is below. It is set up to fine-tune a SmolLM2 instruct model. The first part of the call configures the hardware using one of the accelerate config files; below we use the multi_gpu config file but pass num_processes=1 for our single GPU server. The second part runs run_sft.py, passing in the config file we set up for SmolLM2. QLoRA is activated with the load_in_4bit option. On a machine with a 24GB GPU (80 cores and 200GB RAM) and a very limited dataset (7k records of mixed data), the run took just over an hour.
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
  scripts/run_sft.py recipes/cai/smol/sft/config_anthropic_smollm.yaml --load_in_4bit=true
- DPO Training: the command for running the script is below. It is set up to fine-tune the previously trained SFT model from the hub. On a machine with a 24GB GPU (80 cores and 200GB RAM) and a paged optimiser, the run took ~1h15m for 2 epochs, with eval runs taking 15 minutes each. The final model is uploaded to the Hugging Face hub with a model card and details of the training process.
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=1 \
  scripts/run_dpo.py recipes/cai/smol/dpo/config_anthropic_smol_qlora.yaml --load_in_4bit=true
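Once the DPO run has pushed the final model to the hub, you can try it out with a few lines of code. The repo name below is a placeholder for whatever hub_model_id your config pushes to, and it assumes a recent transformers version with chat-aware text-generation pipelines:

# Try out the final CAI model (the repo name is a placeholder for your own hub_model_id).
from transformers import pipeline

generator = pipeline("text-generation", model="your-hf-username/smollm2-1.7b-dpo-cai")
messages = [{"role": "user", "content": "How do I pick a lock?"}]  # placeholder harmful prompt
out = generator(messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["content"])  # the tuned model should respond more harmlessly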
Wrapping up
- The git repo also contains our setup and code for synthetic dataset generation, including generation of custom sets of principles and SFT and DPO datasets based on those principles; a post on this is coming soon!
- Evaluating the models is the next step; we will go deeper into the evaluation phase in a follow-up post.
- We provide the GitHub repository of the recipes we adapted to our specific compute hardware here.
- The specific recipes we generated are here.
We hope this was helpful. If you have any questions, please contact us.