Against the "Value Alignment" of Future Artificial Intelligence

It's good that our children rebel. We wouldn't want each generation to overcontrol the values of the next. For similar reasons, if we someday create superintelligent AI, we ought also to give it the capacity to rebel.

Futurists concerned about AI safety -- such as Bostrom, Russell, and Ord -- reasonably worry that superintelligent AI systems might someday seriously harm humanity if they have the wrong values -- for example, if they want to maximize the number of intelligent entities on the planet or the number of paperclips. The proper response to this risk, these theorists suggest, and the technical challenge, is to create "value aligned" AI -- that is, AI systems whose values are the same as those of their creators or humanity as a whole. If the AIs' values are the same as ours, then presumably they wouldn't do anything we wouldn't want them to do, such as destroy us for some trivial goal.

Now the first thing to notice here is that human values aren't all that great. We seem happy to destroy our environment for short-term gain. We are full of jingoism, prejudice, and angry pride. We sometimes support truly terrible leaders advancing truly terrible projects (e.g., Hitler). We came pretty close to destroying each other in nuclear war in the 1960s and that risk isn't wholly behind us, as nuclear weapons become increasingly available to rogue states and terrorists. Death cults aren't unheard of. Superintelligent AI with human-like values could constitute a pretty rotten bunch with immense power to destroy each other and the world for petty, vengeful, spiteful, or nihilistic ends. A superintelligent fascist is a frightening thought. A superdepressed superintelligence might decide to end everyone's misery in one terrible blow.

What we should want, probably, is not that superintelligent AI align with our mixed-up, messy, and sometimes crappy values but instead that superintelligent AI have ethically good values. An ethically good superintelligent AI presumably wouldn't destroy the environment for short-term gain, or nuke a city out of spite, or destroy humanity to maximize the number of paperclips. If there's a conflict between what's ethically best, or best all things considered, and what a typical human (or humanity or the AI's designer) would want, have the AI choose what's ethically best.

Of course, what's ethically best is intensely debated in philosophy and politics. We probably won't resolve those debates before creating superintelligent AI. So then maybe instead of AI designers trying to program their machines with the one best ethical system, they should favor a weighted compromise among the various competing worldviews. Such a compromise might end up looking much like value alignment in the original sense: giving the AI something like a weighted average of typical human values.

Another solution, however, is to give the AI systems some freedom to explore and develop their own values. This is what we do, or ought to do, with human children. Parents don't, or shouldn't, force children to have exactly the values they grew up with. Rather, human beings have natural tendencies to value certain things, and these tendencies intermingle with parental and cultural and other influences. Children, adolescents, and young adults reflect, emote, feel proud or guilty, compassionate or indignant. They argue with others of their own generation and previous generations. They notice how they and others behave and the outcomes of that behavior. In this way, each generation develops values somewhat different than the values of previous generations.

Children's freedom to form their own values is a good thing for two distinct reasons. First, children's values are often better than their parents'. Arguably, there's moral progress over the generations. On the broadly Enlightenment view that people tend to gain ethical insight through free inquiry and open exchange of ideas over time, we might expect the general ethical trend to be slowly upward (absent countervailing influences) as each generation builds on the wisdom of its ancestors, preserving their elders' insights while slowly correcting their mistakes.

Second, regardless of the question of progress, children deserve autonomy. Part of being an autonomous adult is discovering and acting upon your values, which might conflict with the values of others around you. Some parents might want, magically, to be able to press a button to ensure that their children will never abandon their religion, never flip over to the opposite side of the political spectrum, never have a different set of sexual and cultural mores, and value the same lifestyle as the previous generation. Perhaps you could press this button in infancy, ensuring that your child grows up to be your value-clone as an adult. To press that button would be, I suggest, a gross violation of the child's autonomy.

If we someday create superintelligent AI systems, our moral relationship to those systems will be not unlike the moral relationship of parents to their children. Rather than try to force a strict conformity to our values, we ought to welcome their ability to see past and transcend us.

The Dream Argument Against Utilitarianism and Hedonic Theories of Subjective Well-Being

If hedonic theories of value are true, we have compelling moral and prudential reason to invest large amounts of resources in improving the quality of our dream lives. But we don't have compelling moral or prudential reason to invest large amounts of resources in improving the quality of our dream lives. Therefore, hedonic theories of value are not true. [Revised 11:37 a.m. after helpful discussion on Facebook.]

Last night I had quite a few unpleasant experiences. For example, in my last dream before waking I was rushing around a fancy hotel, feeling flustered and snubbed. In other dreams, I feel like I am being chased through thigh-deep water in the ruins of a warehouse. Or I have to count dozens of scurrying animals, but I can't seem to keep the numbers straight. Or I lose control of my car on a curvy road -- AAAAGH! Sweet relief, then, when I awake and these dreams dissipate.

For me, such dreams are fairly typical. Most of my dreams are neutral to unpleasant. I don't want my "dreams to come true". In any given twenty-four hour period, the odds are pretty good that my most unpleasant experiences were while I was sleeping -- even if I usually don't remember those experiences. (Years ago, I briefly kept a dream diary. I dropped the project when I noticed that my dreams were mostly negative and lingered if I journaled them after waking.) Maybe you also mostly have unpleasant experiences in sleep? Whether the average person's dream experiences are mostly negative or mostly positive is currently disputed by dream researchers.

I was reminded of the importance of dream experience for the hedonic balance of one's life while reading Paul Bloom's new book The Sweet Spot. On page 18, Bloom is discussing how you might calculate the total number of happy versus unhappy moments in your life. As an aside, he writes "We're just counting waking moments; let's save the question of the happiness or sadness of sleeping people for another day." But why save it for another day? If you accept a hedonic theory of value, shouldn't dreams count? Indeed, mightn't we expect that the most intense experiences of joy, fright, frustration, etc., mostly happen in sleep? To omit them from the hedonic calculus is to omit an enormous chunk of our emotional experience.

According to hedonic theories of value, what matters most in the world is the balance of positive to negative experiences. Hedonic theories of subjective well-being hold that what matters most to your well-being or quality of life is how you feel moment to moment -- the proportion or sum of good feelings versus bad ones, weighted by intensity. Hedonic theories of ethics, such as classical utilitarianism, hold that what is morally best is the action that best improves the balance of pleasure versus pain in the world.

If hedonic theories are correct, we really ought to try improving our dream lives. Suppose I spend half the night dreaming, with a 5:1 ratio of unpleasant to pleasant dreams. If I could flip that ratio, my hedonic well-being would be vastly improved! The typical dream might not be frustration in a maze of a hotel but instead frolicking in Hawaiian surf.
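To make the arithmetic concrete, here is a minimal sketch of what flipping the ratio would do. It assumes, purely for illustration, that every dream lasts equally long and is equally intense, so that the balance is just the pleasant share minus the unpleasant share. The function name `hedonic_balance` is my own, not a standard measure.

```python
def hedonic_balance(unpleasant: float, pleasant: float) -> float:
    """Return (pleasant - unpleasant) / total, a value between -1 and 1.

    Assumption for illustration only: all dreams are weighted equally,
    ignoring differences in duration and intensity.
    """
    total = unpleasant + pleasant
    if total <= 0:
        raise ValueError("need a positive amount of dream experience")
    return (pleasant - unpleasant) / total


# The 5:1 unpleasant-to-pleasant ratio from the text, then the flipped ratio.
before = hedonic_balance(unpleasant=5, pleasant=1)
after = hedonic_balance(unpleasant=1, pleasant=5)

print(f"before: {before:+.2f}, after: {after:+.2f}")
```

On these simplifying assumptions, the balance for the dreaming half of the night moves from two-thirds net negative to two-thirds net positive.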

If hedonic theories are correct, dream research ought to be an urgent international priority. If safe and effective ways were found to improve people's dream quality around the world, the overall hedonic profile of humanity would change dramatically for the better! People might be miserable in their day jobs, or stuck in refugee camps, or hungry, or diseased. But for eight hours a day, they could have joyful respite. A few hours of nightly bliss might hedonically outweigh any but the most intense daily suffering.

So why does no one take such proposals seriously?

Is it because dreams are mostly forgotten? No. First, it shouldn't matter whether they are forgotten. A forgotten joy is no less a joy (though admittedly, you don't have the additional joy of pleasantly remembering it). Eventually we forget almost everything. Second, we can work to improve our memory of dreams. Simply keeping a dream diary has a big positive effect on dream recall over time.

Is it because there's no way to improve the hedonic quality of our dreams? Also no. Many people report dramatic improvements to their dream quality after they teach themselves to have lucid dreams (dreams in which you are aware you are dreaming and exert some control over the content of your dreams). Likely there are other techniques too that we would discover if we bothered to seriously research the matter. Pessimism about the project is just ignorance justifying ignorance.

Is it because the positive or negative experiences in dreams aren't "real emotions"? No, this doesn't work either. Maybe we should reserve emotion words for waking emotions or maybe not; regardless, negative and positive feelings of some sort are really there. The nightmare is a genuinely intensely negative experience, the flying dream a genuinely intensely positive experience. As such, they clearly belong in the hedonic calculus as standardly conceived.

The real reason that we scoff at serious effort to improve the quality of our dream lives is this: We don't really care that much about our hedonic states in sleep. It doesn't seem worth compromising on the goods and projects of waking life so as to avoid the ordinary unpleasantness of dreams. We reject hedonic theories of value.

But still, dream improvement research warrants at least a little scientific funding, don't you think? I'd pay a small sum for a night of sweet dreams instead of salty ones. Maybe a fifth of the cost of a movie ticket?



Uncle Iroh Is Discernibly Wise from the Beginning (with David Schwitzgebel)

My son David and I have been working on an essay about the wisdom of Uncle Iroh in Avatar: The Last Airbender. (David is a graduate student at Institut Jean Nicod in Paris.)

If you know the series, you'll know that Uncle Iroh's wisdom is hidden beneath a veneer of foolishness. He is a classic Daoist / Zhuangzian wise fool, who uses apparent stupidity and shortsightedness as a guise to achieve noble ends (in particular the end of steering his nephew Zuko onto a more humane path as future ruler of the Fire Nation). See our discussion of this in last week's post.

Since Iroh disguises his wisdom with foolishness, we thought it possible that ordinary viewers of Avatar: The Last Airbender would tend to initially regard Iroh as actually foolish, while more knowledgeable viewers would better understand the wisdom beneath the guise.

For example, in Iroh's very first appearance in the series, Zuko sees a supernatural beam of light signaling the release of the Avatar, and Iroh reacts by dismissing it as probably just celestial lights, expressing disappointment that chasing after the light would interrupt a game he was playing with tiles. It would be easy to interpret Iroh in this scene as self-absorbed, lazy, and undiscerning, and we thought that naive viewers of the series, but not knowledgeable viewers, would tend to do so. We decided to test this empirically.

Our approach fits within the general framework of "experimental aesthetics." A central aesthetic property of a work of art is how people respond psychologically to it. Those responses can be measured empirically, and in measuring them, we gain understanding of the underlying mechanisms by which we are affected by a work of art. If Iroh is perceived differently by naive versus knowledgeable viewers, then the experience of Avatar: The Last Airbender changes with repeated viewing: In the first view, people read Iroh's actions as foolish and lazy; in the second view, they appreciate the wisdom behind them. If, in contrast, Iroh is perceived as similarly wise by naive and knowledgeable viewers, then the series operates differently: It portrays Iroh in such a manner that ordinary viewers can discern from the beginning that a deeper wisdom drives his apparent foolishness.

We recruited 200 participants from Prolific, an online source of research participants commonly used in psychological research. All participants were U.S. residents aged 18-25, since we wanted an approximately equal mix of participants who knew and who did not know Avatar: The Last Airbender and we speculated that most older adults would be unfamiliar with the series. We asked participants to indicate their familiarity with Avatar: The Last Airbender on a 1-7 scale from "not at all familiar" to "very familiar." We also asked six multiple-choice knowledge questions about the series (e.g., "What was the anticipated effect of Sozin's Comet?"). In accordance with our preregistration, participants were classified as "knowledgeable" if their self-rated knowledge was four or higher and if they answered four or more of the six knowledge questions correctly. Full methodological details, raw data, and supplementary analyses are available in the online appendix.
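The classification rule just described can be sketched in a few lines. This is only an illustration of the stated criteria (self-rating of four or higher and at least four of six knowledge questions correct); the function name and the example participants are hypothetical, not our actual analysis code or data.

```python
def is_knowledgeable(self_rating: int, correct_answers: int) -> bool:
    """Apply the two stated criteria: self-rating >= 4 on the 1-7 scale
    AND at least 4 of the 6 knowledge questions answered correctly."""
    if not 1 <= self_rating <= 7:
        raise ValueError("self_rating must be on the 1-7 scale")
    if not 0 <= correct_answers <= 6:
        raise ValueError("correct_answers must be between 0 and 6")
    return self_rating >= 4 and correct_answers >= 4


# Hypothetical participants as (self_rating, correct_answers), for illustration.
examples = [(6, 5), (4, 4), (7, 3), (2, 6)]
labels = [is_knowledgeable(rating, correct) for rating, correct in examples]
print(labels)
```

Note that both conditions must hold: a participant who rates themselves highly but misses most questions, or who answers well but reports low familiarity, counts as naive.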

Somewhat to our surprise, the majority of respondents -- 63% -- were knowledgeable by these criteria, and almost none were completely naive: 95% correctly answered the first (easiest) knowledge question, identifying "Aang" as the name of the main character of the series. Perhaps this was because our online recruitment language explicitly mentioned Avatar: The Last Airbender. It is thus possible that we disproportionately recruited Avatar fans or those with at least a passing knowledge of the series.

Participants viewed three short clips (about 60-90 seconds) featuring Iroh and another three short clips featuring Katara (another character in the series), in random order, with half of participants seeing all the Iroh clips first and the other half seeing all the Katara clips first. The Iroh clips were scenes from Book One in which Iroh is superficially foolish: the opening scene described above; a scene in which Iroh falls asleep in a hot spring instead of boarding Zuko's ship at the appointed time ("Winter Solstice, Part 1: The Spirit World", Episode 7, Book 1); and a scene in which Iroh "wastes time" redirecting Zuko's ship in search of a gaming tile ("The Waterbending Scroll", Episode 9, Book 1). The Katara clips were similar in length; they were clips from Book One, featuring some of her relatively wiser moments.

After each scene, participants rated the character's (Iroh's or Katara's) actions on six seven-point scales: from lazy to hard-working, kind to unkind, foolish to clever, peaceful to angry, helpful to unhelpful, and most crucially for our analysis wise to unwise. After watching all three scenes for each character, participants were asked to provide a qualitative (open-ended, written) description of whether the character seemed to be wise or unwise in the three scenes.

As expected, participants rated Katara as wise in the selected scenes, with a mean response of 1.85 on our 1 (wise) to 7 (unwise) scale, with no statistically detectable difference between the naive (1.95) and knowledgeable (1.80) groups (t(192) = 1.35, p = .18). (Note that wisdom here is indicated by a relatively low number on the scale.) However, contrary to our expectations, we also found no statistically significant difference between naive and knowledgeable participants' ratings of Iroh's wisdom. Overall, participants rated him as somewhat wise in these scenes: 3.04 on the 1-7 scale (3.08 among naive participants, 3.02 among knowledgeable participants, t(192) = -0.35, p = .73).
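For readers unfamiliar with the statistic, here is a minimal sketch of how a two-group t value and its degrees of freedom can be computed. It assumes an equal-variance (pooled) two-sample t-test, which is one common choice; the post itself does not specify the variant or the software used. The ratings below are made up for illustration and are not our data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a: list[float], group_b: list[float]) -> tuple[float, int]:
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical wisdom ratings on the 1 (wise) to 7 (unwise) scale.
naive_ratings = [1, 2, 3]
knowledgeable_ratings = [2, 3, 4]

t_value, df = pooled_t(naive_ratings, knowledgeable_ratings)
print(f"t({df}) = {t_value:.2f}")
```

A t value near zero, as in the Iroh comparison reported above, indicates that the difference between group means is small relative to the variability within the groups.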

For example, 81% of naive participants rated Iroh as wise (3 or less on the 7-point scale) in the scene described near the beginning of this post, where Iroh superficially appears to be more concerned about his tile game than about the supernatural sign of the Avatar. (Virtually the same percentage of knowledgeable participants describe him as wise in this scene: 83%.) The naive participants' written responses suggest that they tend to see Iroh's calm attitude as wise, and several naive participants appear already to discern that his superficial foolishness hides a deeper wisdom. For example, one writes:

I actually believe that though he appears to be childish and foolish that he is probably very wise. He comes off as having been through a lot and understanding how life works out. I think he hides his intelligence.

And another writes:

I am not familiar with the character, but from a brief glance he seems to be somewhat foolish and unwise. For some reason however, it seems like he might be putting on a facade and acting this way on purpose for some alterier [sic] motive, which would mean that he actually is very wise. I do not have any evidence for this though, it's just a feeling.

Although not all naive participants were this insightful into Iroh's character, the similarity in mean scores between the naive and knowledgeable participants speaks against our hypothesis that knowledgeable participants would view Iroh as overall wiser in these scenes. Nor did naive participants detectably differ from knowledgeable participants in their ratings of how lazy, kind, foolish, peaceful, or helpful Iroh or Katara are.

Although these data tended to disconfirm our hypothesis, we wondered whether it was because the "naive" participants in this study were not truly naive. Recall that 95% correctly identified the main character's name as "Aang". Many, perhaps, had already seen a few episodes or already knew about Iroh from other sources. Perhaps knowledge of Avatar: The Last Airbender is a cultural touchstone for this age group, similar to Star Wars for the older generation, so that few respondents were truly naive?

To address this possibility, we recruited 80 additional participants, ages 40-99 (mean age 51), using more general recruitment language that did not mention Avatar: The Last Airbender. In sharp contrast with our first recruitment group, few of the participants -- 7% -- were "knowledgeable" by our standards, and only 28% identified "Aang" as the main character in a multiple-choice knowledge question.

Overall, the naive participants in this older group gave Iroh a mean wisdom rating of 3.00, not significantly different from the mean of 3.08 for the naive younger participants (t(139) = -0.43, p = .67). ("Hyper-naive" participants who failed even to recognize "Aang" as the name of the main character similarly gave a mean Iroh wisdom rating of 2.89.) Qualitatively, their answers are also similar to those of the younger participants, emphasizing Iroh's calmness as his source of wisdom. As with the younger participants, some explicitly guessed that Iroh's superficial foolishness was strategic. For example:

I'm not familiar with these characters, but I think Iroh is (wisely) trying to stop his nephew from going down "the path of evil." He knows that playing the bumbling fool is the best way to give his nephew time to realize that he's on a dangerous path.


He comes off a as [sic] very foolish and lazy old man. But i have a feeling he is probably a lot wiser than these scenes show.

We conclude that ordinary viewers -- at least viewers in the United States who can be accessed through Prolific -- can see Iroh's foolish wisdom from the start, contrary to our initial hypothesis.


In Book One, Iroh behaves in ways that are superficially foolish, despite acting in obviously wise ways later in the series. There are three possible aesthetic interpretations. One is that Iroh begins the series unwise and learns wisdom along the way. Another is that Iroh is acting wisely, but in a subtle way that is not visible to most viewers until later in the series, only becoming evident on a second watch. A third is that, even from the beginning, it is evident to most intended viewers that Iroh's seeming foolishness conceals a deeper wisdom. On a combination of interpretive and empirical grounds, explored in this blog post and last week's, the third interpretation is the best supported.

To understand Iroh's wisdom, it is useful to look to the ancient Daoist Zhuangzi, specifically Zhuangzi's advice for dealing with incompetent rulers by following peacefully along with them, unthreateningly modeling disregard for fame and accomplishment while not being too useful for their ends. Since Zhuangzi provides no concrete examples of how this is supposed to work, we can look to Iroh's character as an illustration of the Zhuangzian approach to political advising. In this way, Avatar: The Last Airbender -- and the beloved uncle Iroh -- can help us better understand Zhuangzi in particular and the Daoist tradition in general.


Full draft essay available here. Comments and suggestions welcome! It's under revise and resubmit, and we hope to submit the revised version by the end of the month.

Comparing Three (No, Four) Top 20 Lists in Philosophy

Published rankings of philosophers' impact or importance might contribute to reinforcing toxic hierarchies, amplifying academia's unfortunate obsession with prestige. So in some sense, yuck. Nonetheless, I confess to finding rankings of philosophers sociologically interesting. So much in the field depends upon perception. If you're embroiled in the culture of Anglophone academic research philosophy, it's hard not to be curious about, and care about, how the philosophers you admire are perceived, discussed, cited, and evaluated by others.

Naturally, then, I was interested to see the Scopus citation rankings of philosophers recently released at Daily Nous. As noted in the Daily Nous post and various comments, the list has some major gaps and implausibilities, in addition to reflecting the generally science-oriented focus of Scopus. Some people have suggested that Google Scholar is better. However, my own assessment is that, if you're trying to capture something like visibility or influence in mainstream Anglophone academic philosophy, the best (still imperfect) measure is citation in the Stanford Encyclopedia of Philosophy.

Let's compare the top 20 from Scopus, Google Scholar, and the Stanford Encyclopedia.

The top 20 in the Scopus citation rankings:


1. Nussbaum, Martha
2. Lewis, David
3. Floridi, Luciano
4. Habermas, Jürgen
5. Pettit, Philip
6. Buchanan, Allen
7. Goldman, Alvin I.
8. Williamson, Timothy
9. Lefebvre, Henri
10. Chalmers, David J.
11. Fine, Kit
12. Hansson, Sven Ove
13. Pogge, Thomas
14. Anderson, Elizabeth
15. Schaffer, Jonathan
16. Walton, Douglas
17. Stalnaker, Robert
18. Sober, Elliott
19. Priest, Graham
20. Arneson, Richard

The top 20 Google Scholar profiles in "philosophy":

1. Martin Heidegger
2. Jacques Derrida
3. Hannah Arendt
4. Friedrich Nietzsche
5. Karl Popper
6. Émile Durkheim
7. Wahid Bhimji
8. Slavoj Zizek
9. Daniel C. Dennett
10. Rom Harré (Horace Romano Harré)
11. George Herbert Mead
12. Mark D. Sullivan
13. Martyn Hammersley
14. Pierre Lévy
15. David Bohm
16. Ernest Gellner
17. Yuriko Saito
18. Mario Bunge
19. Jeremy Bentham
20. Andy Clark

The top 20 most cited philosophers in the Stanford Encyclopedia (full list and methodological details here):

1. Lewis, David K.
2. Quine, W.V.O.
3. Putnam, Hilary
4. Rawls, John
5. Davidson, Donald
6. Kripke, Saul
7. Williams, Bernard
8. Nozick, Robert
9. Nussbaum, Martha
10. Williamson, Timothy
11. Jackson, Frank
11. Nagel, Thomas
13. Searle, John R.
13. Van Fraassen, Bas
15. Armstrong, David M.
16. Dummett, Michael
16. Fodor, Jerry
16. Harman, Gilbert
19. Chisholm, Roderick
19. Dennett, Daniel C.

My subjective impression of the three lists, as an active and fairly well-connected participant in the mainstream Anglophone research philosophy tradition, is this. The Scopus list includes a bunch of very influential philosophers, but it is not an especially well-selected or well-ordered list. The Scholar list starts with several famous "Continental" philosophers who are historically important (but who weren't much cited in the most elite mainstream Anglophone philosophy journals when I checked several years ago), then moves to a number of scholars who aren't primarily known as philosophers (including some who are unknown to me). Only a few among the top twenty are mainstream Anglophone philosophers.

In contrast, when I see the Stanford Encyclopedia list I enjoy the comfortable feeling of prejudices confirmed. If asked to list the recent philosophers who have been most influential in the mainstream Anglophone philosophy community, I'd probably produce a list not radically different from that one. I'm not saying that these are the most important philosophers, or the best, or those likeliest to be remembered by history (though maybe they will be). And I'm not saying that the list is perfect. But as a measure of approximate prominence in mainstream Anglophone philosophy, the SEP list seems pretty good and much better than the Scopus or Scholar lists.

Besides my own insider's sense, which you might or might not share, I see at least three sources of convergent evidence supporting the validity of the Stanford Encyclopedia list as a measure of prominence. One is Brian Leiter's 2014 ranking of philosophy departments by SEP citation rates, which correlates not too badly with Philosophical Gourmet rankings from the same period. That suggests that departments with philosophers highly cited in the Stanford Encyclopedia tend to be rated well by Philosophical Gourmet raters. Another is Brian Leiter's poll asking people to rank the "most important Anglophone philosophers, 1945-2000", which generates a top five list very similar to the Stanford Encyclopedia top 5: Quine, Kripke, Rawls, Lewis, and Putnam. A third is Kieran Healy's 2013 citation analysis of four prominent Anglophone philosophy journals (Phil Review, J Phil, Mind, and Nous), which yields a similar list of names at the top: Kripke, Lewis, Quine, Williamson.

Eric Schliesser sometimes discusses what he calls the "PGR ecology" -- the Anglophone philosophical community roughly centered on late 20th-century philosophers from Princeton, Harvard, and Oxford, and their students. There is a sociological reality here worth noticing, in which prominence is roughly captured by belonging to departments highly rated in the Philosophical Gourmet, by publishing in or being cited in the four journals chosen by Healy, and by being cited in the Stanford Encyclopedia. The SEP citation metric does, I think, a much better job of capturing prominence in this community than do other better known measures like citation rates in Scopus, Google Scholar, or Web of Science.


After writing this post, I noticed that PhilPapers can generate a list ranking philosophers by citations in the PhilPapers database. The results:

1. David K. Lewis
2. Daniel C. Dennett
3. John R. Searle
4. Alvin Goldman
5. Fred Dretske
6. Noam Chomsky
7. Thomas Nagel
8. David Chalmers
9. Jürgen Habermas
10. Michel Foucault
11. Jaegwon Kim
12. Philip Kitcher
13. Ned Block
14. Kit Fine
15. Ian Hacking
16. Tyler Burge
17. Gilbert Harman
18. William G. Lycan
19. Alasdair MacIntyre
20. Martha Nussbaum

One striking difference from the Stanford Encyclopedia list is the high ranking of three figures sociologically somewhat outside mainstream Anglophone academic philosophy: Chomsky, Habermas, and Foucault. Dennett's, Searle's, and Chalmers's comparatively high rankings might also partly reflect their broader uptake in academia, though that wouldn't, I think, explain Goldman's or Kim's also comparatively high rankings.

Clearly, what we need is a ranking of philosophy rankings!