Blue Dot AI Alignment Course - Week 3 - RLHF and CAI
Welcome back to my weekly overview of the AI Safety Alignment course from Blue Dot! This week we look at training techniques for incorporating human preferences into models, including Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI (CAI), which takes a more principled approach. At the end of the last session I decided that a principled base model had to be an essential prerequisite for any further fine-tuning, so I was gratified to learn that CAI is already in use in Claude. During the cohort discussions I came to realise that the method itself matters less than determining the source of the underlying principles. That seemed a considerably harder problem to resolve, especially while navigating the potential impacts of the current socio-political environment. How do we agree on a unified set of principles, who agrees them, and who ultimately gets to control this? Nevertheless, the mechanism is available and we take a closer look at p...
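For a flavour of how RLHF actually turns preferences into a training signal, here is a minimal sketch of the pairwise reward-model step: given a human-preferred and a rejected response to the same prompt, the reward model is trained to score the preferred one higher. This is my own illustration rather than anything from the course materials; the `RewardModel` class, embedding dimensions, and random inputs are all stand-in assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a response embedding to a single scalar score.
    (Illustrative only - a real reward model is a fine-tuned language model.)"""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen response's score above the rejected one's."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with random embeddings standing in for (prompt, response) pairs.
model = RewardModel()
chosen = torch.randn(8, 128)    # embeddings of human-preferred responses
rejected = torch.randn(8, 128)  # embeddings of dispreferred responses
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
```

The trained reward model then becomes the optimisation target for the policy (typically via PPO). CAI reuses the same machinery but, in its RL stage, the preference labels come from a model critiquing responses against a written constitution rather than from human annotators, which is where the question of who writes those principles becomes so important.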