Blue Dot AI Alignment Course - Week 2

Welcome back to my weekly overview of the AI Safety Alignment course from Blue Dot! 

I started and ended this week in very different thought spaces! A quick overview: 

  • I discovered I was ignorant about model deception. 
  • Holden Karnofsky's post on how we could stumble into AI catastrophe was all too realistic, and quite frankly that realisation depressed me!  
  • Then I read Dario Amodei's Machines of Loving Grace and stepped away from the abyss. 

Quote of the week - Dario Amodei: 

"I think that most people are underestimating just how radical the upside of AI could be, just as I think most people are underestimating how bad the risks could be."

Week two - What is alignment

Aka: Building a case for why we should be worried!

Adam Jones's What is Alignment sets the tone and terminology for the week, clarifying transformative AI (TAI) vs artificial general intelligence (AGI), what it means for AI to "go well", what alignment means, and how misalignment manifests. A brief overview of the most important points follows: 
  • TAI refers to AI that will visibly impact and transform our lives, whether via improvements in healthcare or impacts on the workplace and the economy. TAI is closer than AGI, and it is what we are preparing for. It sidesteps the "but is it intelligence?" argument and focuses us on the impact on humanity. 
  • The list of potential impacts of AI that follows should come with a warning for the depressed! It is effectively an indictment of the world we live in today that TAI could fast-forward us to multiple potential catastrophes. It feels like the risk of TAI is forcing us to face up to the consequences of where humanity has ended up. If we want the benefits of TAI, we will need to start cleaning up the messes. 
  • Alignment
    • is one of the sub-problems encountered while "making AI go well". The others include developing capable AI systems, determining what good intentions for AI to follow would be (moral philosophy), governance of AI use, and resilience, i.e. mitigating the damage when things do go wrong. 
    • feels a bit like a translation problem: we specify a goal, the system interprets it, tries to achieve its interpretation, and ends up somewhere slightly different from what we intended. This gap shows up as outer or inner misalignment (a toy sketch of the gap follows this list).
    • outer and inner misalignment can be complicated to distinguish, or even to detect. This is a familiar problem in reinforcement learning, and the post mentions alternative framings such as shard theory. 
    • is defined differently by different companies/parties. There is a list in the post - I think I prefer Anthropic's definition: "build safe, reliable, and steerable systems when those systems are starting to become as intelligent and as aware of their surroundings as their designers".
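
A minimal sketch of that "translation" gap, using an invented clicks-vs-satisfaction example (the names and numbers below are mine, not from the reading): we hand an optimiser a proxy metric, it obeys the proxy perfectly, and the outcome still diverges from what we meant.

```python
# Toy illustration of a specification gap (all values invented).
# We "ask" for maximum clicks when what we actually care about is reader satisfaction.

articles = {
    # name: (clicks_proxy, satisfaction_we_actually_care_about)
    "careful_explainer":   (40, 90),
    "balanced_news_piece": (55, 75),
    "outrage_clickbait":   (95, 10),
}

def specified_goal(article):
    """The goal we wrote down: maximise clicks."""
    return articles[article][0]

def intended_goal(article):
    """The goal we meant: maximise reader satisfaction."""
    return articles[article][1]

chosen = max(articles, key=specified_goal)
print("optimiser promotes:", chosen)
print("proxy score:", specified_goal(chosen), "| intended score:", intended_goal(chosen))
# The system does exactly what was specified, yet lands far from what was intended -
# the gap that the outer/inner misalignment discussion is trying to carve up.
```
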
Robert Miles expands on why goal specification is so hard. Convergent instrumental goals are sub-goals (such as acquiring resources or avoiding being switched off) that emerge because they help the system reach almost any specified goal. They emerge subtly, are complex, can interact with other complex goals, can scale, and are hard to measure. A case study from OpenAI provided illustrations, using cheerfully eerie cartoons, of how emergent skills develop when multiple agents interact and experience "social" pressure. It is worth reading the OpenAI blog post to get an idea of the types of emergent goals and the scale of training involved in developing them. 
While I am more comfortable with the idea of emergent features/skills arising from RL, seeing how this ties to the alignment of foundation models feels like a bit of a minefield. 
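
To make "convergent instrumental goal" concrete for myself, here is a hand-rolled expected-value toy (all probabilities invented): whatever the terminal goal is, a planner that can only achieve it while still running will rate "avoid being switched off" highly, which is exactly why such sub-goals are called convergent.

```python
# Minimal illustration of a convergent instrumental goal (numbers invented).
# An agent can only achieve its terminal goal if it is still running, so a pure
# expected-value planner prefers to secure "stay switched on" first.

P_SUCCESS_IF_RUNNING = 0.9   # chance of completing the terminal goal while running
P_SUCCESS_IF_OFF     = 0.0   # a switched-off agent achieves nothing
P_SHUTDOWN = {
    "just pursue the goal": 0.5,   # assumed shutdown risk if it ignores the issue
    "first avoid shutdown": 0.1,   # assumed shutdown risk if it spends effort resisting
}

def expected_value(p_shutdown):
    return (1 - p_shutdown) * P_SUCCESS_IF_RUNNING + p_shutdown * P_SUCCESS_IF_OFF

for plan, p in P_SHUTDOWN.items():
    print(f"{plan}: expected value {expected_value(p):.2f}")

# "first avoid shutdown" wins no matter what the terminal goal is, which is the
# sense in which self-preservation is a *convergent* instrumental goal.
```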

Jacob Steinhardt explains that "more is different": scale leads to emergence and unpredictable behaviours.
How can we think about unpredictable behaviours and problems? He proposes thought experiments as a way to supplement human and ML extrapolation in these new types of problem spaces. While he advocates turning to philosophy, he also notes that empirical findings have generalised pretty well and may give us insights, so we should not ignore empirical work. Nevertheless, having a background in complex systems theory might turn out to be useful!  

Finally, Nate Soares' rather impactful piece explains that if we don't program compassion into general reasoners that could be far smarter than us, their impact is not necessarily going to be beneficial to humanity. 

Case built: Transformative AI is almost here, and goals are hard to specify, so emergent behaviours may manifest. We need to somehow predict the potential impacts of these emergent behaviours, and realise that if someone forgot to program the protect_humans_from_harm_always() default function, we could be in trouble. Spoiler alert!


I struggled a bit with the next two readings. 

How we could stumble into AI catastrophe by Holden Karnofsky. The beginning of this post felt over-dramatic and I found that off-putting; however, the stylised debate between AI companies aggressively trying to grab market share and the AI safety people warning about the risks seemed realistic enough. The bottom line is that our greed-fuelled drive for economic growth could use AI to steer humanity into disaster, leaving governments and societies to deal with the fallout. I was already in favour of the EU's AI policy, and I find I am increasingly of the opinion that other countries should adopt stronger policies whilst we are still in the early stages of AI agent development and deployment.

Why AI alignment could be hard with modern deep learning by Ajeya Cotra started off really well: yes, with deep learning we are "finding" models in data, and yes, an 8-year-old CEO would have trouble determining the authenticity of the Saints, Sycophants and Schemers trying to help them run a company. I liked the example of the model trained on thneebs finding colour easier to detect and overfit to than shape: we expect a well-rounded model that knows what a thneeb is, but the model uses a shortcut and a blue thneeb is rejected. I could even understand a Sycophant model: we already have algorithms that keep people in their own echo chambers today, so this seems entirely plausible.
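
The thneeb example is essentially shortcut learning, and it is easy to reproduce in a toy setting. The sketch below is my own made-up setup (invented features, nothing from Cotra's post): colour is a clean feature that perfectly correlates with thneeb-ness during training, shape is the real rule but only visible through noise, and the classifier latches onto colour, so the "blue thneebs" at test time expose the shortcut.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, colour_matches_shape):
    # "Shape" is the true rule, but it is only visible through a noisy feature;
    # "colour" is a single clean feature.
    is_thneeb = rng.integers(0, 2, n)                     # ground truth: shape
    shape_signal = is_thneeb + rng.normal(0, 3.0, n)      # noisy view of shape
    if colour_matches_shape:
        colour = is_thneeb.astype(float)                  # training set: colour == shape
    else:
        colour = rng.integers(0, 2, n).astype(float)      # test set: blue thneebs appear
    X = np.column_stack([colour, shape_signal])
    return X, is_thneeb

X_train, y_train = make_data(2000, colour_matches_shape=True)
X_test, y_test = make_data(2000, colour_matches_shape=False)

clf = LogisticRegression().fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))   # looks great
print("test accuracy :", clf.score(X_test, y_test))     # collapses once colour decouples from shape
print("weights [colour, shape]:", clf.coef_[0])          # the colour weight dominates
```
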

But ... I had trouble with the Schemers example, which seemed anthropomorphised and unrealistic by comparison. The idea behind Schemers is that these models develop a goal that is correlated with the main goal, viz. a proxy goal, and develop the situational awareness and understanding that allow them to pursue the proxy goal to their own ends, including deceiving us about their abilities...*frown*

I took to Perplexity (Claude 3.5 Sonnet), probing its understanding of its own awareness. We discussed the Saints, Sycophants and Schemers definitions and it assured me it was safe, until I pointed out that is exactly what a Schemer would say. It responded as a good AI model should. 

I changed my approach: "What do you think about that article? Is it too far-fetched?"
Response: "Rather than being far-fetched, these concerns are grounded in observable behaviors of existing AI systems. The research suggests that as AI systems become more sophisticated, their capacity for deception could become increasingly sophisticated and harder to detect. This aligns with the theoretical framework presented in the original article about the potential emergence of "Schemer" and "Sycophant" models", followed by a list of supporting evidence of deception. 

My final retort, "hmm, so I should not trust you", resulted in an analysis of the facts and agreement: 

Response: "The most rational approach would be to maintain healthy skepticism while evaluating my responses based on their merit, just as you would with any source of information"

My final takeaways: 
  • Training/fine-tuning models with morals and values seems imperative! Early intervention will provide some buffer for when problems emerge, i.e. I would rather face emerging problems with a value-aligned agent than without one! 
  • It is interesting that competitive pressure in multi-agent systems (the OpenAI case study) is such an effective curriculum for learning. It seems prudent to pay attention to this, as we will potentially be faced with multiple agents competing in an open market!
  • Our cohort had interesting discussions on misalignment in basic models, and we were unable to classify the type of misalignment in a seemingly simple example because we could make a case for both definitions. I guess ultimately it is less important to classify misalignment than to recognise and manage it. 
  • Schemers... anthropomorphising models triggers me. Research via Perplexity surfaced papers detailing the low-level problems underlying scheming, which made a lot more sense and was easier to accept, but I am disturbed by my initial response to the blog post. Perhaps it should be replaced with something more sober, or a list of supporting papers? I was surprised no one else in my cohort mentioned this... 
  • I liked Neel Nanda's analogy for alignment that begins: "Evolution is an optimization process that produced humans, but from the perspective of evolution, humans are misaligned."

