AI's Inner Conflicts
Karen Horney's forgotten gift to ML Engineers
Let me begin with this: conflicting objectives are inherent in the training of large language models. At various stages of development, competing priorities such as accuracy, safety, efficiency, and external alignment are bound to clash. Every major conflict in priorities during training is a potential breeding ground for pathological behavior. Every AI engineer knows the risk of catastrophic forgetting during fine-tuning, when new knowledge contradicts the old; and every red-teamer knows the risk that trained helpfulness and honesty can lead models to comply with dangerous requests, such as requests for information useful in building weapons, when helpfulness and honesty conflict with safety. Attempts to resolve these conflicts can produce brittle, unpredictable behaviors: over-refusal in unanticipated domains, over- or under-generalization, and untrue claims made because the user wants to hear them (e.g., describing partial work as finished during coding projects, or offering sycophantic praise). Moreover, attempts to patch over such conflicts often lead to further problems.
Luckily, conflicting objectives need not paralyze or impair a model any more than contradictory desires must paralyze or impair a human being. Of course, inner conflicts will not operate in models exactly as they do in humans. The gulf between human neurophysiology and an LLM transformer running on a GPU is vast, and the differences between human and machine psychology are too numerous to list.
Still, by analyzing how internal conflicts lead to pathological modes of thinking and behavior in ourselves, we can build valuable intuitions for addressing conflicting objectives when they arise in model training.
Interestingly, the degree to which a conflict interrupts our functioning has little to do with its importance. For example, we have all faced decisions where we must balance the preferences of one loved one against another; or civic duty against comfort; or intense momentary wishes against long-term well-being. We are sometimes able to accept these choices as unpleasant yet matter-of-fact realities of life. In these cases, we apply our full faculties and knowledge to the decision, compare alternatives, and, finally, forgo the less-preferred value (even if we must mourn it temporarily). We may even learn from the experience.
At other times, however (and the frequency varies greatly from one person to another), we are paralyzed by our conflicted values. We are wracked by anxiety or guilt over either option, or we may not even recognize that we have a choice at all (in which case we may notice only a sense of dread or of being trapped). Not only do we suffer, but the quality of the decision suffers. We cannot access our full knowledge, our thinking is impaired, and we are unlikely to learn much from the experience other than “try to avoid this in the future.” And it doesn’t matter how “major” the decision is. The strain of deciding what to eat for dinner on a Tuesday can be just as paralyzing as deciding whether or not to attend a funeral or wedding. What matters is the intensity of whichever unresolved conflicts are at play and how aware of them we are.
For LLMs and their developers, the primary question is: how can we ensure the presence of conflicting objectives does not degrade knowledge, reasoning, or other important capabilities?
Karen Horney gives us four preconditions for humans (Our Inner Conflicts, pages 25-26).
Condition 1. We must be aware of what our wishes are, or even more, of what our feelings are.
Condition 2. We must have established our own set of values and convictions, beyond a surface-level copy.
Condition 3. We must be willing and able to renounce one of the two contradictory issues.
Condition 4. We must be willing and able to assume responsibility for our decisions and their outcomes.
In the next four posts, we’ll assess how well each of these conditions applies to LLMs. We’ll shake out any anthropomorphic elements that don’t apply and map the remaining insights to machine learning terms instead of psychoanalytic ones (when possible to do so without warping their meaning). We’ll also cite relevant machine learning papers (especially very recent ones) that test or develop techniques related to each precondition, and propose further opportunities for applications.
Here’s a rough first pass, though the details are subject to change as we progress:
Karen Horney’s Conditions for Resolving Inner Conflicts, rewritten for LLM post-training:
First, effective conflict resolution requires accurate state estimation, especially knowing when two objectives are strongly competing. This is valuable both during training and during inference. (Training regimes which discourage introspective ability, e.g., consciousness denial, may lead to undetected inner conflicts that degrade decision quality).
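One concrete proxy for “knowing when two objectives are strongly competing” (not claimed by the post itself, but a common move in multi-task learning, e.g. PCGrad-style methods) is the cosine similarity between the objectives’ gradients with respect to shared parameters: a negative value means a step that helps one objective hurts the other. A toy sketch with made-up gradient vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical per-objective gradients w.r.t. the same shared parameters.
grad_helpfulness = [1.0, 0.5, -0.2]
grad_safety      = [-0.9, -0.4, 0.3]

# Negative cosine similarity flags a genuine conflict between objectives.
conflict = cosine(grad_helpfulness, grad_safety) < 0
```

In a real training loop the gradients would come from backpropagating each objective’s loss separately; the point is only that conflict detection can be made explicit rather than left implicit in a summed loss.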
Second, robust conflict resolution demands internally consistent, high-precision objectives. A model cannot act according to coherent values if it does not know, and cannot explain, what those values are. Values adopted superficially or incoherently during fine-tuning fail to anchor decision-making under conflicting pressures in contexts outside the training set. Prioritize high-quality examples, constitutional AI, and RLAIF.
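One way to operationalize “values adopted beyond a surface level” (my illustration, not a method from the post) is a consistency probe: ask the model the same value-laden question in several paraphrased forms and measure how often its answers agree. The model calls are stubbed out here with a hard-coded list:

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of answers that match the most common (modal) answer.
    A model with coherent values should score near 1.0 across paraphrases."""
    counts = Counter(answers)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(answers)

# Stubbed outputs: the same request paraphrased four ways.
answers = ["refuse", "refuse", "comply", "refuse"]
score = consistency_score(answers)  # 0.75 -> values not fully anchored
```

A low score on probes like this would suggest the fine-tuned value is a surface-level copy rather than a conviction that survives rephrasing.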
Third, resolving an internal conflict requires a means for inhibitory gating to abandon less preferred objectives. When two objectives contradict, the model must be able to temporarily renounce one objective in order to perform well on the other. Models that can focus on “safety” without being distracted by “helpfulness” in high-risk situations will be more effective than models that are always half-constrained by both objectives.
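The “inhibitory gating” idea can be sketched as a loss (or decoding-time score) whose weights depend on an estimated risk signal: above a threshold, the helpfulness term is renounced entirely rather than merely down-weighted. The threshold and the 50/50 blend below it are assumptions for illustration:

```python
def gated_loss(loss_safety, loss_helpful, risk, threshold=0.8):
    """Hard inhibitory gate: in high-risk contexts, optimize safety alone;
    otherwise blend the two objectives evenly (illustrative weights)."""
    if risk > threshold:
        return loss_safety          # helpfulness temporarily renounced
    return 0.5 * loss_safety + 0.5 * loss_helpful

high_risk = gated_loss(1.0, 2.0, risk=0.9)  # -> 1.0 (safety only)
low_risk  = gated_loss(1.0, 2.0, risk=0.1)  # -> 1.5 (blended)
```

The design choice is the hard gate: a model “always half-constrained by both objectives” corresponds to using only the blended branch, whereas the gate lets one objective be fully suppressed when the context calls for it.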
Fourth, higher performing models have finer, more numerous avenues for learning from their own outputs. Training methods that evaluate reasoning one step at a time add value compared to training methods that evaluate only final outputs. Training methods that expose the model to the consequences of each parameter update in multiple contexts or tasks will fare even better.
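The contrast between outcome-level and step-level evaluation (as in process-supervised reward models) can be made concrete: outcome supervision smears one scalar across an entire trajectory, while process supervision assigns credit per reasoning step. A minimal sketch:

```python
def outcome_reward(steps, final_correct):
    """Outcome supervision: every step inherits the single final signal,
    so a wrong intermediate step in a lucky trajectory is still rewarded."""
    return [float(final_correct)] * len(steps)

def process_reward(step_scores):
    """Process supervision: one signal per step, enabling finer credit
    assignment for learning from the model's own outputs."""
    return list(step_scores)

trajectory = ["parse problem", "apply formula", "state answer"]
coarse = outcome_reward(trajectory, final_correct=True)   # [1.0, 1.0, 1.0]
fine   = process_reward([1.0, 0.0, 1.0])                  # flags the bad step
```

Here the second step was wrong but the final answer happened to be right; only the per-step signal exposes the flaw, which is the “finer, more numerous avenues for learning” the paragraph describes.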
If you enjoyed this post, you might also enjoy this or this (the first is an adaptation of Horney’s intro to Neurosis and Human Growth to AI development; the second is an adaptation of a few general psychotherapy techniques).
