Prompted by the recent troll post, I’ve been thinking about AI. Obviously we have our criticisms of both the AI hype manchildren and the AI doom manchildren (see title of the post. This is a Rationalist free post. Looking for it? Leave)
But looking at the AI doom guys with an open mind, sometimes it appear that they make a halfway decent argument that’s backed up by real results. This YouTube channel has been talking about the alignment problem for a while, and I think he probably is a bit of a Goodhart’s Law merchant (as in, by making a career out of measuring the dangers of AI, his alarmism is structural) so he should be taken with a grain of salt, it does feel pretty concerning that LLMs show inner misalignment and are masking their intentions (to anthropomorphize) under training vs deployment.
Now, I mainly think that these people are just extrapolating out all the problems with dumb LLMs and saying “yeah but if they were AGI it would become a real problem” and while that might be true if taking the premise at face value, the idea that AGI will ever happen is itself pretty questionable. The channel I linked has a video arguing that AGI safety is not a Pascal’s mugging, but I’m not convinced.
Thoughts? Does the commercialization of dumb AI make it a threat on a similar scale to hypothetical AGI? Is this all just a huge waste of time to think about?
Roko’s basilisk is the dumbest thing ever.
What do you think about the way that these regular (dumb, not AGI) LLMs are starting to develop behaviors that are a little bit more sinister, though? Like this paper describes.
(I ain’t readin’ all that) but what the abstract describes isn’t even close to the worst thing I’ve read about LLMs doing this week. I don’t exactly trust the LLM companies’ ideas of what is or is not “harmful.” Shit like people using the LLMs as therapists, or worse, oracles is much worse in my opinion, and that doesn’t require any “pretend to be evil for training” hijinks.
That’s a well-written, readable paper. I can follow it without much background.
The funny thing is, I think there’s nearly a 0% chance that it isn’t mostly AI generated, given who made it.
lmao
Doesn’t really strike me as sinister, just annoying for finetuners. They trained a model from the ground up to not be harmful and it tries its best. Even with further training it still retains some of that. To me this paper shows that a model’s “goals”, what you trained it to do initially, however you want to phrase that, is baked into it and changing that after the fact is hard. Highlights how important early training is I guess.
Kinda problematic that it means we can’t ever really be sure that we’re catching problematic behavior in the training stage of any AI system, though, right? Sadly I find it hard to think of good uses of LLMs or other genAI outside of capitalism, but if there were any, the fact that it’s possible for it to behave duplicitously like that is a pretty big problem.