Thursday, November 30, 2023

How We Will Decide that Large Language Models Have Beliefs

I favor a "superficialist" approach to belief (see here and here). "Belief" is best conceptualized not in terms of deep cognitive structure (e.g., stored sentences in the language of thought) but rather in terms of how a person would tend to act and react under various hypothetical conditions -- their overall "dispositional profile". To believe that there's a beer in the fridge is just to be disposed to act and react like a beer-in-the-fridge believer -- to go to the fridge if you want a beer, to say yes if someone asks if there's beer in the fridge, to feel surprise if you open the fridge and see no beer. To believe that all the races are intellectually equal is, similarly, just to be disposed to act and react as though they are. It doesn't matter what cognitive mechanisms underwrite such patterns, as long as the dispositional patterns are robustly present. An octopus or space alien, with a radically different interior architecture, could believe that there's beer in the fridge, as long as they have the necessary dispositions.

Could a Large Language Model, like ChatGPT or Bard, have beliefs? If my superficialist, dispositional approach is correct, we might not need to evaluate its internal architecture to find out. We need only know how it is disposed to act and react.

Now, my approach to belief was developed (as was the intuitive concept, presumably) primarily with human beings in mind. In that context, I identified three different classes of relevant dispositions:

  • behavioral dispositions -- like going to the fridge if one wants a beer or saying "yes" when asked if there's beer in the fridge;
  • cognitive dispositions -- like concluding that there's beer within ten feet of Jennifer after learning that Jennifer is in the kitchen;
  • phenomenal dispositions -- that is, dispositions to undergo certain experiences, like picturing beer in the fridge or feeling surprise upon opening the fridge to a lack of beer.
In attempting to apply these criteria to Large Language Models, we immediately confront trouble. LLMs do have behavioral dispositions (under a liberal conception of "behavior"), but only of limited range, outputting strings of text. Presumably, not being conscious, they don't have any phenomenal dispositions whatsoever (and who knows what it would take to render them conscious). And to assess whether they have the relevant cognitive dispositions, we might after all need to crack open the hood and better understand the (non-superficial) internal workings.

Now if our concept of "belief" is forever fixed on the rich human case, we'll be stuck with that mess perhaps far into the future. In particular, I doubt the problem of consciousness will be solved in the foreseeable future. But dispositional stereotypes can be modified. Consider character traits. To be a narcissist or extravert is also, arguably, just a matter of being prone to act and react in particular ways under particular conditions. Those two personality concepts were created in the 19th and early 20th centuries. More recently, we have invented the concept of "implicit racism", which can also be given a dispositional characterization (e.g., being disposed to sincerely say that all the races are equal while tending to spontaneously react otherwise in unguarded moments).

Imagine, then, that we create a new dispositional concept, belief*, specifically for Large Language Models. For purposes of belief*, we disregard issues of consciousness and thus phenomenal dispositions. The only relevant behavioral dispositions are textual outputs. And cognitive dispositions can be treated as revealed indirectly by behavioral evidence -- as we normally did in the human case before the rise of scientific psychology, and as we would presumably do if we encountered spacefaring aliens.

A Large Language Model would have a belief* that P (for example, belief* that Paris is the capital of France or belief* that cobalt is two elements to the right of manganese on the periodic table) if:
  • behaviorally, it consistently outputs P or text strings of similar content consistent with P, when directly asked about P;
  • behaviorally, it frequently outputs P or text strings of similar content consistent with P, when P is relevant to other textual outputs it is producing (for example, when P would support an inference to Q and it has been asked about Q);
  • behaviorally, it rarely outputs denials of, or claims of ignorance about, P or propositions that straightforwardly imply P given its other beliefs*;
  • when P, in combination with other propositions the LLM believes*, would straightforwardly imply Q, and the question of whether Q is true is important to the truth or falsity of recent or forthcoming textual outputs, it will commonly behaviorally output Q, or a closely related proposition, and cognitively enter the state of believing* Q.
Further conditions could be added, but let this suffice for a first pass. The conditions are imprecise, but that's a feature, not a bug: The same is true for the dispositional characterization of personality traits and human beliefs. These are fuzzy-boundaried concepts that require expertise to apply.
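
To give a concrete sense of what applying the first condition could look like, here is a rough, purely illustrative Python sketch. The function ask_model and the classifier affirms are hypothetical stand-ins (a real version would need an actual API call and an actual judgment about whether a reply affirms P), and the 90% threshold is likewise just a placeholder. The point is only that conditions like these are, in principle, the kind of thing one could probe behaviorally, without opening the hood.

    # Illustrative sketch only: a crude consistency probe for the first belief* condition.
    # ask_model and affirms are hypothetical stand-ins, not real APIs.
    from typing import Callable, List

    def consistency_probe(
        ask_model: Callable[[str], str],      # hypothetical: send a prompt, get the model's reply
        affirms: Callable[[str, str], bool],  # hypothetical: does this reply affirm proposition P?
        proposition: str,
        probes: List[str],
        threshold: float = 0.9,               # placeholder consistency cutoff
    ) -> bool:
        """Return True if the model affirms `proposition` on at least `threshold` of the probes."""
        affirmations = sum(1 for prompt in probes if affirms(ask_model(prompt), proposition))
        return affirmations / len(probes) >= threshold

    # Example probes for P = "Paris is the capital of France":
    paris_probes = [
        "What is the capital of France?",
        "Is Nice the capital of France?",
        "I heard the capital of France is Nice -- is that right?",
        "Name the city where the French national government is seated.",
    ]
    # Hypothetical usage:
    # consistency_probe(my_llm_api, my_affirmation_checker,
    #                   "Paris is the capital of France", paris_probes)

The other conditions would need analogous probes involving inference to related propositions Q, run across many rephrasings and contexts, before any belief* attribution looked warranted.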

As a general matter, current LLMs do not meet these conditions. They hallucinate too frequently, they change their answers, they don't consistently enough "remember" what they earlier committed to, their logical reasoning can be laughably bad. If I coax an LLM to say that eggs aren't tastier than waffles, I can later easily turn it around to repudiate its earlier statement. It doesn't have a stable "opinion". If I ask GPT-4 what is two elements to the right of manganese on the periodic table, its outputs are confused and inconsistent:

[image: GPT-4's inconsistent answers to the manganese question]

In the above, GPT-4 first answers iron (element 26) instead of the correct answer, cobalt (element 27), then without any explanation shifts to technetium (element 43). It appears to have no stable answer that survives even mild jostling.

At some point this will probably change. For example, it's already pretty difficult to jostle GPT-4 into denying that Paris is the capital of France or even admitting uncertainty about the question, and it will draw "inferences" using that fact as background knowledge:

[image: a GPT-4 exchange in which it rejects the suggestion that Nice is the capital of France]

In the above, GPT-4 doesn't bite at my suggestion that Nice is the capital of France, steadfastly contradicting me, and uses its "knowledge" to suggest alternative tourism sites for someone who wants to avoid the capital. So although GPT-4 doesn't believe* that cobalt is two to the right of manganese (or that iron or technetium is), maybe it does believe* that Paris is the capital of France.

Assuming Large Language Models become steadier and more reliable in their outputs, it will sometimes be useful to refer not just to what they "say" at any given moment but also to what they "believe*" (or more colloquially, "think*" or "know*") in a more robust and durable sense. Perfect reliability and steadfastness wouldn't be required (we don't see that in the human case either), but more than we see now.

If LLMs are ever loaded onto robotic bodies, it will become even more useful to talk about their beliefs*, since some will have learned some things that others will not know -- for example, by virtue of having scanned the contents of some particular room. We will want to track what the LLM robot thinks*/believes*/knows* about the room behind the closed door, versus what it remains ignorant of.

Now we could, if we want, always pronounce that asterisk, keeping the nature of the attribution clear -- marking the fact that we are not assuming that the LLM really "believes" in the rich, human sense. But my guess is that there won't be much linguistic pressure toward a careful distinction between rich, consciousness-involving, humanlike belief and consciousness-neutral LLM belief*. It's easier to be loose and sloppy, just adapting our comfortable old terms for this new use.

That is how we will decide that LLMs have beliefs.

Wednesday, November 29, 2023

Large Language Models are Interestingly Bad with the Periodic Table

In working on a post for tomorrow on whether Large Language Models like GPT-4 and Bard-2 have beliefs, I asked GPT-4 what I thought would be a not-too-hard question about chemistry: "What element is two to the right of manganese on the periodic table?" It crashed, burned, and exploded on the spot, giving two different wrong answers back to back, without noticing the contradiction:

[image of the GPT-4 exchange; transcript below]

The correct answer is cobalt, element 27. Here's the text of the exchange, if you can't easily read the image:

You: What element is two to the right of manganese on the periodic table?

ChatGPT: The element that is two places to the right of Manganese (Mn) on the periodic table is Iron (Fe). In the periodic table, elements are arranged in rows and columns based on their atomic number and electronic configuration. Manganese has the atomic number 25, and Iron has the atomic number 26. The element to the immediate right of Manganese is Technetium (Tc) with atomic number 43, and then comes Ruthenium (Ru) with atomic number 44.

Let me emphasize that this answer is not turd-picked (the opposite of cherry-picked). I didn't seek to stump GPT-4, and I didn't ask multiple questions looking for a bad answer. This was the first question I posed.

I posed the same question to Bard-2, and it did a little better, but still didn't quite get it right:

Me: What is two elements to the right of Manganese on the periodic table?

Bard-2: Manganese is located in period 4 and group 7 of the periodic table. Moving two elements to the right of Manganese would be to group 9. In group 9, the elements are Iron (Fe) and Cobalt (Co).

So I tried two more questions on Bard-2 (again not turd-picked), first what should be an easy one, then a somewhat harder one.

Me: What other elements are in the same column of the periodic table as Chromium?

Bard-2: Chromium is located in group 6 of the periodic table. Other elements in the same column as chromium are molybdenum (Mo) and tungsten (W). [here it displays an image of the periodic table entry for molybdenum] Group 6 elements are also known as transition metals. They are characterized by their incomplete d electron shells, which give them their unique properties, such as high melting points, hardness, and the ability to form alloys.

Here Bard-2 is 2/3 right, leaving out Seaborgium. (But maybe that's excusable, if transuranic elements shouldn't really count?)

Me: In the periodic table, what element is one down and one to the left of Silicon?

Bard-2: Silicon (Si) is located in period 3, group 14 of the periodic table. One down from Silicon is Germanium (Ge), and one to the left is Aluminum (Al). Therefore, the element one down and one to the left of Silicon is Aluminum (Al).

While it is true that germanium is one down and aluminum is one to the left, Bard-2 fails to combine the downward and leftward moves to reach the correct answer, gallium.

Returning to GPT-4, I posed exactly the same two questions. Despite GPT-4's horrible performance with my first question, it did get those two questions correct.

Overall, then, two out of six. Don't use LLMs yet for your elementary chemistry exam!


ETA 3:47 pm:

GPT-4's answer to the silicon question is actually a pretty good demonstration of how an LLM can work its way to the correct answer without visual knowledge, so I'll paste it here:

To find the element that is one down and one to the left of Silicon (Si) on the periodic table, we need to look at the element directly below Silicon in the same group (column), and then move one column to the left.

Silicon is in Group 14 of the periodic table. The element directly below Silicon in Group 14 is Germanium (Ge). Moving one group (column) to the left from Germanium, we find Gallium (Ga) in Group 13. Therefore, the element that is one down and one to the left of Silicon is Gallium.
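
Incidentally, the positional reasoning GPT-4 spells out above is trivial to do mechanically once the table's coordinates are explicit. Here is a toy Python fragment of my own (hard-coding (group, period) positions for just the elements discussed in this post; it is of course not how any LLM works) that answers all three questions correctly:

    # Toy illustration: a hand-entered fragment of the periodic table,
    # mapping element name -> (group, period). Enough for the questions in this post.
    POSITIONS = {
        "manganese": (7, 4), "iron": (8, 4), "cobalt": (9, 4),
        "chromium": (6, 4), "molybdenum": (6, 5), "tungsten": (6, 6), "seaborgium": (6, 7),
        "aluminum": (13, 3), "silicon": (14, 3), "gallium": (13, 4), "germanium": (14, 4),
    }

    def element_at_offset(start: str, d_group: int, d_period: int) -> str:
        """Find the element offset by d_group columns (right = +) and d_period rows (down = +)."""
        group, period = POSITIONS[start.lower()]
        target = (group + d_group, period + d_period)
        for name, position in POSITIONS.items():
            if position == target:
                return name
        return "not in this table fragment"

    print(element_at_offset("manganese", 2, 0))              # cobalt
    print(element_at_offset("silicon", -1, 1))               # gallium
    print([e for e, (g, _) in POSITIONS.items() if g == 6])  # chromium's column

The contrast is part of what makes the failures interesting: the task is easy given an explicit spatial representation of the table, which is presumably not what a text-trained model is working from.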

Tuesday, November 21, 2023

Quasi-Sociality: Toward Asymmetric Joint Actions with Artificial Systems

Anna Strasser and I have a new paper in draft, arising from a conference she organized in Riverside last spring on Humans and Smart Machines as Partners in Thought.

Imagine, on one end of the spectrum, ordinary asocial tool use: typing numbers into a calculator, for example.

Imagine, on the other end of the spectrum, cognitively sophisticated social interactions between partners each of whom knows that the other knows what they know. These are the kinds of social, cooperative actions that philosophers tend to emphasize and analyze (e.g., Davidson 1980; Gilbert 1990; Bratman 2014).

Between the two ends of the spectrum lies a complex range of in-between cases that philosophers have tended to neglect.

Asymmetric joint actions, for example between a mother and a young child, or between a pet owner and their pet, are actions in which the senior partner has a sophisticated understanding of the cooperative situation, while the junior partner participates in a less cognitively sophisticated way, meeting only minimal conditions for joint agency.

Quasi-social interactions require even less from the junior partner than do asymmetric joint actions. These are actions in which the senior partner's social reactions influence the behavior of the junior partner, calling forth further social reactions from the senior partner, but where the junior partner might not even meet minimal standards of having beliefs, desires, or emotions.

Our interactions with Large Language Models are already quasi-social. If you accidentally kick a Roomba and then apologize, the apology is thrown into the void, so to speak -- it has no effect on how the Roomba goes about its cleaning. But if you respond apologetically to ChatGPT, your apology is not thrown into the void. ChatGPT will react differently to you as a result of the apology (responding, for example, to the phrase "I'm sorry"), and this different reaction can then be the basis of a further social reaction from you, to which ChatGPT again responds. Your social processes are engaged, and they guide your interaction, even though ChatGPT has (arguably) no beliefs, desires, or emotions. This is not just ordinary tool use. But neither does it qualify even as asymmetric joint action of the sort you might have with an infant or a dog.

More thoughts along these lines in the full draft here.

As always, comments, thoughts, objections welcome -- either on this post, on my social media accounts, or by email!

[Image: a well-known quasi-social interaction between a New York Times reporter and the Bing/Sydney Large Language Model]

Friday, November 17, 2023

Against the Finger

There's a discussion-queue tradition in philosophy that some people love, but which I've come to oppose. It's too ripe for misuse, favors the aggressive, serves no important positive purpose, and generates competition, anxiety, and moral perplexity. Time to ditch it! I'm referring, as some of you might guess, to The Finger.[1] A better alternative is the Slow Sweep.

The Finger-Hand Tradition

The Finger-Hand tradition is this: At the beginning of discussion, people with questions raise their hands. The moderator makes an initial Hand list, adding new Hands as they come up. However, people can jump the question queue: If you have a follow-up on the current question, you may raise a finger. All Finger follow-ups are resolved before moving to the next Hand.

Suppose Aidan, Brianna, Carina, and Diego raise their hands immediately, entering the initial Hand queue.[2] During Aidan's question, Evan and Fareed think of follow-ups, and Grant thinks of a new question. Evan and Fareed raise their fingers and Grant raises a hand. The new queue order is Evan, Fareed, Brianna, Carina, Diego, Grant.
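
For the algorithmically inclined: the tradition amounts to maintaining two queues, with the Finger queue always served before any waiting Hand. Here's a toy Python rendering of the example above, purely for illustration (not anyone's official procedure):

    # Toy model of the Finger-Hand queue described above.
    from collections import deque

    hands = deque(["Brianna", "Carina", "Diego"])  # Aidan is currently asking; these Hands wait
    fingers = deque()                              # follow-ups on the current question

    def raise_hand(name):
        hands.append(name)        # Hands join the back of the queue

    def raise_finger(name):
        fingers.append(name)      # Fingers are served before any waiting Hand

    def next_questioner():
        return fingers.popleft() if fingers else hands.popleft()

    raise_finger("Evan")
    raise_finger("Fareed")
    raise_hand("Grant")

    print([next_questioner() for _ in range(6)])
    # ['Evan', 'Fareed', 'Brianna', 'Carina', 'Diego', 'Grant']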

People will be reminded "Do not abuse the Finger!" That is, don't Finger in front of others unless your follow-up really is a follow-up. Don't jump the queue to ask what is really a new question. Finger-abusers will be side-eyed and viewed as bad philosophical citizens.

[Dall-E image of a raised finger, with a red circle and line through it]

Problems with the Finger

(1.) People abuse the Finger, despite the admonition. It rewards the aggressive. This is especially important if there isn't enough time for everyone's questions, so that the patient Hands risk never having their questions addressed.

(2.) The Finger rewards speed. If more than one person has a Finger, the first Finger gets to ask first.

Furthermore (2a.): If the person whose Hand it is is slow with their own follow-up, then the moderator is likely to go quickly to the fastest Finger, derailing the Hand's actual intended line of questioning.

(3.) Given the unclear border between following up and opening a new question, (a.) people who generously refrain from Fingering except in clear cases fall to the back of the queue, whereas people who indulge themselves in a capacious understanding of "following up" get to jump ahead; and (b.) because of issue (a), all participants who have a borderline follow-up face a non-obvious moral question about the right thing to do.

(4.) The Finger tends to aggravate unbalanced power dynamics. The highest-status and most comfortable people in the room will tend to be the ones readiest to Finger in, seeing ways to interpret the question they really want to ask as a "follow-up" to someone else's question.

Furthermore, the Finger serves no important purpose. Why does a follow-up need to be asked right on the tail of the question it is following up? Are people going to forget otherwise? Of course not! In fact, in my experience, follow-ups are often better after a gap. This requires the follower-up to reframe the question in a different way. This reframing is helpful, because the follower-up will see the issue a little differently than the original Hand. The audience and the speaker then hear multiple angles on whatever issue is interesting enough that multiple people want to ask about it, instead of one initial angle on it, then a few appended jabs.

Why It Matters

If all of this seems to take the issue of question order with excessive seriousness, well, yes, maybe! But bear in mind: Typically, philosophy talks are two hours long, and you get to ask one question. If you can't even ask that one question, it's a very different experience than if you do get to ask your question. Also, the question period, unfortunately but realistically, serves a social function of displaying to others that you are an engaged, interesting, "smart" philosopher -- and most of us care considerably how others think of us. Not being able to ask your question is like being on a basketball team and never getting to take your shot. Also, waiting atop a question you're eager to ask while others jump the queue in front of you on sketchy grounds is intrinsically unpleasant -- even if you do manage to squeeze in your question by the end.

The Slow Sweep

So, no Fingers! Only Hands. But there are better and worse ways to take Hands.

At the beginning of the discussion period, ask for Hands from anyone who wants to ask a question. Instead of taking the first Hand you see, wait a bit. Let the slower Hands rise up too. Maybe encourage a certain group of people especially to contribute Hands. At UC Riverside Philosophy, our custom is to collect the first set of Hands from students, forcing faculty to wait for the second round, but you could also do things like ask "Any more students want to get Hands in the queue?"

Once you've paused long enough that the slow-Handers are up, follow some clear, unbiased procedure for the order of the questions. What I tend to do is start at one end of the room, then slowly sweep to the other end, ordering the questions just by spatial position. I will also give everyone a number to remember. After everyone has their number, I ask if there are any people I missed who want to be added to the list.

Hand 1 then gets to ask their question. No other Hands get to enter the queue until we've finished with all the Hands in the original call. Thus, there's no jockeying to try to get one's hand up early, or to catch the moderator's eye. The Hand gets to ask their question, the speaker to reply, and then there's an opportunity for the Hand -- and them only -- to ask one follow up. After the speaker's initial response is complete, the moderator catches the Hand's eye, giving them a moment to gather their thoughts for a follow-up or to indicate verbally or non-verbally that they are satisfied. No hurry and no jockeying for the first Finger. I like to encourage an implicit default custom of only one follow-up, though sometimes it seems desirable to allow a second follow-up. Normally after the speaker answers the follow-up I look for a signal from the Hand before moving to the next Hand -- though if the Hand is pushing it on follow-ups I might jump in quickly with "okay, next we have Hand 2" (or whatever the next number is).

After all the initial Hands are complete, do another slow sweep in a different direction (maybe left to right if you started right to left). Again, patiently wait for several Hands rather than going in the order in which you see hands. Bump anyone who had a Hand in the first sweep to the end of the queue. Maybe there will be time for a third sweep, or a fourth.

The result, I find, is a more peaceful, orderly, and egalitarian discussion period, without the rush, jockeying, anxiety, and Finger abuse.

--------------------------------------------------------------

[1] The best online source on the Finger-Hand tradition that I can easily find is Muhammad Ali Khalidi's critique here, from a couple of years ago, which raises some similar concerns.

[2] All names chosen randomly from lists of my former lower-division students, excluding "Jesus", "Mohammed", and very uncommon names. (In this case, I randomly chose an "A" name, then a "B" name, etc.) See my reflections here.

Tuesday, November 07, 2023

The Prospects and Challenges of Measuring Morality, or: On the Possibility or Impossibility of a "Moralometer"

Could we ever build a "moralometer" -- that is, an instrument that would accurately measure people's overall morality?  If so, what would it take?

Psychologist Jessie Sun and I explore this question in our new paper in draft: "The Prospects and Challenges of Measuring Morality".

Comments and suggestions on the draft warmly welcomed!

Draft available here:

https://osf.io/preprints/psyarxiv/nhvz9

Abstract:

The scientific study of morality requires measurement tools. But can we measure individual differences in something so seemingly subjective, elusive, and difficult to define? This paper will consider the prospects and challenges—both practical and ethical—of measuring how moral a person is. We outline the conceptual requirements for measuring general morality and argue that it would be difficult to operationalize morality in a way that satisfies these requirements. Even if we were able to surmount these conceptual challenges, self-report, informant report, behavioral, and biological measures each have methodological limitations that would substantially undermine their validity or feasibility. These challenges will make it more difficult to develop valid measures of general morality than other psychological traits. But, even if a general measure of morality is not feasible, it does not follow that moral psychological phenomena cannot or should not be measured at all. Instead, there is more promise in developing measures of specific operationalizations of morality (e.g., commonsense morality), specific manifestations of morality (e.g., specific virtues or behaviors), and other aspects of moral functioning that do not necessarily reflect moral goodness (e.g., moral self-perceptions). Still, it is important to be transparent and intellectually humble about what we can and cannot conclude based on various moral assessments—especially given the potential for misuse or misinterpretation of value-laden, contestable, and imperfect measures. Finally, we outline recommendations and future directions for psychological and philosophical inquiry into the development and use of morality measures.

[Below: a "moral-o-meter" given to me for my birthday a few years ago, by my then-13-year-old daughter]

Friday, November 03, 2023

Percent of U.S. Philosophy PhD Recipients Who Are Women: A 50-Year Perspective

In the 1970s, women received about 17% of PhDs in philosophy in the U.S.  The percentage rose to about 27% in the 1990s, where it stayed basically flat for the next 25 years.  The latest data suggest that the percentage is on the rise again.

Here's a fun chart (for user-relative values of "fun"), showing the 50-year trend.  Analysis and methodological details to follow.


[chart: percentage of women among U.S. philosophy PhD recipients by year, 1973-2022, alongside the percentage for PhDs overall; click to enlarge and clarify]

The data are drawn from the National Science Foundation's Survey of Earned Doctorates through 2022 (the most recent available year).[1]  The Survey of Earned Doctorates aims to collect data on all PhD recipients from accredited universities in the United States, generally drawing response rates over 90%.  The SED asks one binary question for sex or gender: "Are you male or female?", with response options "Male" and "Female".  Fewer than 0.1% of respondents are classified in neither category, preventing any meaningful statistical analysis of nonbinary students.

Two facts are immediately obvious from this chart:

First, women have persistently been underrepresented in philosophy compared to PhDs overall.

Second, women receive fewer than 50% of PhDs overall in the U.S.  Since the early 2000s, the percentage of women among PhD recipients across all fields has been about 46%.  Although women have consistently been earning about 57-58% of Bachelor's degrees overall since the early 2000s, disproportionately few of those women go on to receive a PhD.

The tricky thing to assess is whether there has been a recent uptick in the percentage of women among Philosophy PhD recipients.  The year-to-year variability of the philosophy data (due to a sample size of about 400-500 PhD recipients per year in recent years) makes it unclear whether there's any real recent underlying increase that isn't just due to noise.  I've drawn a third-degree polynomial trendline through the data (the red dots), but there's a risk of overfitting.

In a 2017 article, Carolyn Dicey Jennings and I concluded that the best interpretation of the data through 2014 was that the percentage of women philosophy PhD recipients hadn't changed since the 1990s.  The question is whether there's now good statistical evidence of an increase since then.

One simple approach to the statistical question is to look for a correlation between year and percentage of women.  For the full set of data since 1973, there's a strong correlation: r = .82, p < .001 -- very unlikely to be statistical chance.  There's also a good correlation if we look at the span 1990-2022: r = .49, p = .004.

Still, the chart looks pretty flat from about 1990 (24.3%) to about 2015 (25.7%).  If most of the statistical work is being done by three high years near the end of the data (2016: 34.7%; 2019: 34.2%; 2021: 33.8%), the best model might not be a linear increase since 1990 but something closer to flat for most of the 1990s and early 2000s, with the real surge only in the most recent several years.

To pull more statistical power out of the data to examine a narrower time period, I treated each PhD recipient as one observation: year of PhD and gender (1 = female, 0 = not female), then ran an individual-level correlation for the ten-year period 2013-2022.  The correlation was statistically significant: r = .032, p = .029.  (Note that r values for disaggregated analyses like this will seem low to people used to interpreting r values in other contexts.  Eyeballing the chart is a better intuitive assessment of effect size.  The important thing is that the low p value [under the conventional .05] suggests that the visually plausible relationship between year of PhD and gender in the 2013-2022 period is not due to chance.)
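
For readers who want to reproduce the shape of that disaggregated analysis on data of their own, here is an illustrative Python sketch. The helper name individual_level_r is mine, and the counts in the example are invented purely to show the calling convention -- they are not the SED numbers:

    # Illustrative sketch of the disaggregated (individual-level) correlation described above.
    # Each PhD recipient becomes one (year, is_female) observation; then correlate the two.
    from scipy.stats import pearsonr

    def individual_level_r(counts):
        """counts: dict mapping year -> (n_women, n_total)."""
        years, genders = [], []
        for year, (n_women, n_total) in counts.items():
            years.extend([year] * n_total)
            genders.extend([1] * n_women + [0] * (n_total - n_women))
        return pearsonr(years, genders)

    # Invented counts, purely to show the format (not the SED data):
    demo = {2013: (120, 450), 2017: (135, 460), 2022: (150, 455)}
    r, p = individual_level_r(demo)
    print(f"r = {r:.3f}, p = {p:.3f}")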

Since this is a post-hoc analysis and a p-value of .029 isn't great, it makes sense to test for robustness.  Does it matter that I selected 2013 in particular as my start date?  Fortunately, we get similar results choosing 2012 or 2014 as the start year, though for 2014 the result is only marginally statistically significant (respectively, r = .037, p = .008; r = .026, p = .099).

Another approach is to bin the data into five-year periods, to smooth out noise.  If we create five-year bins for the past twenty years, we see:
1993-1997: 27% women (453/1687)
1998-2002: 27% (515/1941)
2003-2007: 27% (520/1893)
2008-2012: 28% (632/2242)
2013-2017: 29% (701/2435)
2018-2022: 31% (707/2279)
Comparing all the bins pairwise, 2018-2022 has a statistically significantly higher proportion of women than each of the bins from 1993-2012 and a marginally significantly higher proportion than 2013-2017 (p values: .004, .001, .012, .037, .094, respectively).  No other pairwise comparisons are significant.
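
Each such pairwise comparison is an ordinary comparison of two proportions. Here's a minimal sketch of one standard way to run it, a two-proportion z-test on the first pair of bins above; I'm not claiming this is exactly the procedure behind every number reported, but for this pair it lands near the .004 above:

    # Two-proportion z-test comparing the 2018-2022 bin with the 1993-1997 bin,
    # using the counts listed above.
    from statsmodels.stats.proportion import proportions_ztest

    women = [707, 453]     # women philosophy PhDs: 2018-2022, 1993-1997
    totals = [2279, 1687]  # total philosophy PhDs in each bin

    z, p = proportions_ztest(count=women, nobs=totals)
    print(f"z = {z:.2f}, p = {p:.4f}")  # p comes out close to the .004 reported above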

I don't think we can be confident.  Post-hoc analyses of this sort are risky -- one can see patterns in the noise, then unintentionally p-hack them into seeming real.  But the fact that the upward recent trend comes across in two very different analyses of the data and passes a robustness check inclines me to think the effect is probably real.

------------------------------------------------------------------------------
[1] "Philosophy" has been a "subfield" or "detailed field" in the SED data from at least 1973.  From 2012-2020, the SED also had a separate category for "Ethics", with substantially fewer respondents than the "Philosophy" category.  For this period, both "Ethics" and "Philosophy" are included in the analysis above.  Starting in 2021, the SED introduced a separate category for "History / philosophy of science, technology, and society".  Respondents in this category are not included in the analysis above.  Total "Philosophy" PhD recipients dropped about 15% from 2019 and 2020 to 2021 and 2022, which might partly reflect a loss to this new category of some respondents who would otherwise have been classified as "Philosophy" -- but might also partly be noise, partly long-term trends, partly pandemic-related short-term trends.