Wednesday, December 23, 2015

A Response to Critiques of Cushman's and My Work on Philosophers' Susceptibility to Order Effects

The order in which moral dilemmas are presented matters to people's judgments and can substantially influence later judgments about abstract moral principles. This is true even among professional ethicists with PhDs in philosophy. In 2012 and 2015, Fiery Cushman and I published empirical evidence supporting these claims. We invite a metaphilosophical conclusion: If even professional philosophers' expert judgments are easily swayed by order of presentation, then such judgments might not be stable enough to serve as secure grounds for philosophical theorizing.

Synthese has recently published two critiques of the literature on order effects in philosophy, which address Fiery's and my work (HT Wesley Buckwalter). Both critiques make valuable points. However, both also admit of some clear replies.

To fix ideas, consider two versions of the famous Trolley Problem:

Push: A runaway boxcar is headed toward five people it will kill if nothing is done. Jane can stop the boxcar by pushing a hiker with a heavy backpack in front of the boxcar, killing him but saving the five.

Switch: A runaway boxcar is headed toward five people it will kill if nothing is done. Vicki can stop the boxcar by flipping a switch to divert it to a sidetrack where it will kill one person instead of the five.

Fiery and I presented Push-type and Switch-type scenarios (fleshed out with a bit more detail) to professional philosophers and two comparison groups of non-philosophers. We found that when professional philosophers saw a Push-type scenario before a Switch-type scenario, 73% rated the two scenarios equivalently on a 7-point scale. Then later in the questionnaire when asked about the Doctrine of the Double Effect -- a moral principle often interpreted as implying that Push-type cases are morally worse than Switch-type cases -- only a minority, 46%, endorsed that principle. In contrast, among philosophers who saw Switch before Push only 54% rated the two scenarios equivalently, and then later a majority, 62%, endorsed the Doctrine of the Double Effect. Endorsement of the principle thus seemed to shift, post-hoc, to rationalize philosophers' order-manipulated judgments about the scenarios.
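To make the size of these shifts concrete, here is a minimal Python sketch that simply subtracts the percentages reported above. The dictionary keys and the function name are illustrative choices, not anything from the published studies, and the sketch does no significance testing, since group sizes are not given in this post.

```python
# Percentages reported above for professional philosophers.
# Keys and function name are illustrative assumptions, not from the papers.
REPORTED = {
    "push_first": {"rated_equivalent": 73, "endorsed_dde": 46},
    "switch_first": {"rated_equivalent": 54, "endorsed_dde": 62},
}


def order_shift(measure: str) -> int:
    """Percentage-point difference: Push-first group minus Switch-first group."""
    return REPORTED["push_first"][measure] - REPORTED["switch_first"][measure]


if __name__ == "__main__":
    # Positive: more Push-first respondents rated the scenarios equivalently.
    print("equivalency shift:", order_shift("rated_equivalent"))
    # Negative: fewer Push-first respondents endorsed the Doctrine of the Double Effect.
    print("DDE endorsement shift:", order_shift("endorsed_dde"))
```

On these numbers, equivalency ratings differ by 19 percentage points between the two orders, and endorsement of the Doctrine of the Double Effect differs by 16 points in the opposite direction.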

We found similar effects for Action-Omission, Moral Luck, and "Asian disease" type cases (though not consistently for every measure across the board). Philosophers with PhDs and self-reported competence or specialization in ethics showed no smaller effects than other philosophers or than comparison groups of non-philosophers -- and in fact trended slightly (non-significantly) toward showing larger order effects.

In general, we found pretty substantial effect sizes, suggesting substantial instability of judgment even in philosophical respondents' areas of expertise. Hence the metaphilosophical worry.


Critique by Zachary Horne and Jonathan Livengood:

Horne and Livengood make three main points about the literature on order effects in philosophy:

(A.) First, they helpfully distinguish between what they call "updating effects" and "genuine ordering effects". Genuine ordering effects, in their terminology, are effects measured only after all the stimuli have been presented. "Updating effects" are measures taken along the way, and might well reflect participants' learning. There is of course nothing irrational in judging Scenario B differently as a result of seeing Scenario A because one learned something by seeing Scenario A. Most philosophical research on order effects, they note, takes the measures along the way -- and thus might be measuring learning rather than true order effects.

(B.) Second, they point out that perceptual judgments also show order effects. Thus, if we are to reject any type of evidence that shows order effects, then we must reject perceptual evidence too, which would lead to radical skepticism.

(C.) Third, they point out that order can sometimes reasonably make a difference to the evaluation of evidence. For example, a smile followed by a frown, on the same person's face, is a different type of evidence than a frown followed by a smile.

On (A): I find the labels tendentious (since if we know there isn't learning-type updating going on, what we might want to call "genuine order effects" can plausibly be measured mid-stream). Still, it is probably correct that most studies do not sufficiently rule out the possibility of learning or updating in the course of the experiment, if they have novice participants and take the measurements after each scenario rather than after both scenarios. However, since our participants were experts, we think it unlikely that a significant number learned anything in the process of our brief experiment that would rationally justify shifting their judgment about the equivalency or non-equivalency of Push and Switch. And as Horne and Livengood note, our measure of endorsement of the Doctrine of the Double Effect is a measurement of a "genuine ordering effect" even by their own lights.

On (B): Yes, of course it would be silly to reject all means of learning that are subject to any order effects! The epistemic sting, as they note, depends not on the mere existence of an order effect in one case, but on how large and how prevalent the order effects are. This is an open empirical question. But the limited empirical evidence that exists suggests that order effects are substantial and prevalent in moral dilemma cases. So far, we have found order effects in all of the scenario types we've tried, with about a 10-20% shift in opinion on the moral equivalency of our scenario pairs and in preference for the risky option in the "Asian disease" cases.

On (C): It's interesting to consider cases in which earlier evidence rightly colors our reaction to later evidence, but trolley problems presented to disciplinary experts seem a different kind of case.

Finally, Horne and Livengood suggest that exposure to a pair of dilemmas in our study is unlikely to have a long-lasting impact on professional philosophers' beliefs. I agree. They continue, "But if there is no long-lasting impact, then we think the effect is unlikely to matter to actual philosophical practice outside of the laboratory" (p. 17). I don't think this follows. Fiery's and my view is not that philosophers' opinions are permanently influenced by the order in which the scenarios are presented on any single occasion, but rather that their opinions are unstable -- possibly influenced one direction on one occasion, in another direction on another occasion. This instability is what drives the metaphilosophical worry.


Critique by Regina Rini:

Rini -- a recent guest blogger here at the Splintered Mind -- looks only at our 2012 study. (Our 2015 study wasn't published until after her paper was in press.) She finds it plausible that if professional philosophers were already familiar with these cases they would not exhibit order effects of the sort Fiery and I find. She suggests that perhaps respondents were not previously familiar with the cases -- or at least not familiar in the right sort of way. She calls this the "familiarity problem" and offers four possible explanations:

(1.) The respondents were not really experts. She wonders if our participants, recruited through the internet, really had the degrees they claimed to have.

(2.) The respondents didn't carefully attend to our scenarios. Maybe they breezed through them so quickly that they failed to notice relevant features.

(3.) The respondents might not have familiar responses to these types of scenarios. Perhaps they have so far refrained from forming judgments on such cases and principles.

(4.) The respondents might not have diachronically stable familiar responses. This is the explanation Fiery and I favor. However, Rini helpfully points out that as long as philosophers are aware that their responses are not diachronically stable, the metaphilosophical threat is reduced: Presumably philosophers who are aware that their responses are not stable would be reluctant to ground their theorizing on those responses.

On (1): I am not aware of a general problem in the survey literature of respondents' frequently misreporting their educational status -- though certainly a bit of misreporting is possible. One specific piece of evidence against this possibility in our own study is that we recruited philosophers mostly by asking department chairs to forward a recruitment email to faculty and graduate students in their departments. Most of our "philosopher" participants took the survey within just a few days of these emails.

On (2): The median response time on the first scenario was 40 seconds, on the second scenario was 34 seconds. While these are not huge response times, if you stop to count out 34 seconds now, you'll probably notice that it's a reasonable amount of time for a thoughtful response to a brief scenario.

On (3) and (4): These are potentially quite serious issues, and in fact our follow-up study in 2015 was designed specifically to address them, after we saw an early version of Rini's critique. In our 2015 study we specifically asked participants if they were previously familiar with the scenarios. We also asked whether they regarded themselves as "having had a stable opinion" about the issues before participating in the experiment, and whether they regarded themselves as experts on those very issues. We also added a "reflection" condition to help address concern (2). In the reflection condition we asked participants to reflect carefully before responding and enforced a minimum 15-second delay between when participants reported having finished reading the scenario and when their response options appeared.

We did not find that self-reported familiarity or stability reduced the size of order effects in two different types of scenario pairs (trolley problems and risky-choice "Asian disease"-type problems), nor did we find reduced order effects in the reflection condition compared to a normal control condition without special instructions to reflect.

For example, percentage rating the Push and Switch scenarios equivalently:

Thus, I am inclined to think that Rini's fourth suggestion is the most plausible -- that participants do not have diachronically stable familiar responses, despite high levels of expertise. But since those who report having stable responses were no less subject to order effects than were those who reported not having stable responses, self-knowledge of stability appears to be largely absent. Despite Rini's interesting suggestion that instability is metaphilosophically non-threatening if people are aware of it, Fiery's and my results suggest that we should not hasten to that comfort.


Both Horne and Livengood and Rini emphasize that we only have very limited evidence about order effects on professional philosophers' judgments. I agree! Fiery's and my two studies are hardly decisive. Convergent evidence from several different labs would be necessary before drawing any confident conclusions, especially if those conclusions are at variance with what one feels one knows from personal experience. Rini also makes positive suggestions for follow-up experimental work that might be done, which I am inclined to support. Both critiques raise important methodological concerns that ought to help shape and direct future work on this topic.


Joachim Horvath said...

Hi Eric, I mostly agree with your responses. One might still be worried about the following: the way and context in which order effects are elicited is a very artificial one, and so I don't think it's all that clear how it translates into real-life philosophical thought experimentation. For example, we almost never consider several thought experiment cases in close temporal succession in our actual philosophical practice, free of any context or guiding philosophical questions. So even if this should be a real source of judgment-instability that even affects the relevant experts, it might be a source of instability that occurs so rarely under real-life, "ecological" conditions that we needn't worry about it too much.

Eric Schwitzgebel said...

Thanks for that thoughtful comment, Joachim. I agree that the context is artificial, so it's not clear how it translates into philosophizing in the wild. However, I do think that the results sit more comfortably with a certain range of metaphilosophical views (involving philosophers' susceptibility to unknown and unwanted influences on their judgments) than with another range of views (involving philosophers having stable opinions in their areas of expertise which are relatively immune to unknown, unwanted biases). These results are only one piece of contributing evidence to a much larger picture on which we don't have a lot of systematic evidence, but on which people do have lots of informal impressions.

Wesley Buckwalter said...

Great to see another post about these articles.

You write Rini "finds it plausible that if professional philosophers were already familiar with these cases they would not exhibit order effects of the sort Fiery and I find." Is any evidence given for this claim or is it speculation? If it is just speculation, then instead of debating 1-4, why not simply reject the assumption that philosophers were only reporting previous opinions or the assumption that doing so is immune to biases of the sort discovered? (If I am wrong about the evidence for these claims, someone please correct me.)

Two very strong things are concluded in the paper, namely to undermine prior results and an entire method of studying a phenomenon. I do not think it is appropriate to conclude these things on the basis of the fact that results are unexpected after relying on premises that, again to my knowledge at least, have not been shown. It is premature at this stage and can risk corrupting the research record.

The insight by Rini that familiarity could result in different tasks in the experiment is a valuable one and in some sense surely correct. Whether or not such differences are ultimately relevant to the present issue is a research question that requires data. It sounds like when minimal investigation into the issue occurred none was found. No doubt there's limits to self-report measures for this issue, but that evidence looms pretty large to me given the evidence for a problem initially.

paul ned said...

I have a question about this metaphilosophical conclusion: "If even professional philosophers' expert judgments are easily swayed by order of presentation, then such judgments might not be stable enough to serve as secure grounds for philosophical theorizing." But I gather that the ordering effects data is consistent with some of the subjects being reliably not susceptible to ordering effects. If so, and if some are resistant to ordering effects in this way, then what's the barrier to the moral judgments of these folks serving as secure grounds for philosophical theorizing? In other words, even if there is a statistical tendency for people to be irrationally sensitive to ordering effects, it doesn't follow that each person is sensitive in this way. Given this, and provided that one can tell that one isn't sensitive to these effects, then is there any objection from this study to the security of one's moral judgments?

Arnold said...

Components of self knowledge (honesty-sincerity-) in a meta-philosophical order effect would, after enough self-observation, then require reporting to others (as we are not alone) for the sake of progression...Quelling the temptation toward monasticism...

Callan S. said...

Slightly off topic : Metaphilosophical worries? Does Metaphilosophical==science?

Eric Schwitzgebel said...

Thanks for the continuing comments, folks!

Wesley: I'm inclined to think that Rini's assumption has some initial plausibility, but I agree that it's an empirical question and the limited existing empirical evidence doesn't seem to support it. I would think/hope that at *some* level of expertise order effects would vanish. I wouldn't predict qualitative order effects (in equivalency ratings, maybe some scaling effects) for Frances Kamm or John Martin Fischer. But as Fiery and I note at the end, if there is a level of expertise at which they disappear, we haven't found empirical evidence of it yet.

Paul: Right, our data are consistent with the majority of respondents NOT having ordering effects. It's tricky to know exactly what to do with this, metaphilosophically. There are two reasons to be concerned: One is that this is only a single manipulation. A respondent might not be subject to exactly this effect, but there are many other types of effects that are possible (e.g., actor/observer effects, racial bias effects, contextual effects due to one's interlocutor, mood effects, foul odors). If other measures also find substantial effects then unless it's the same few people who are subject to all of them, then it's plausible that the majority of respondents would be subject to some of them, and thus have unstable views. A second reason to be concerned is this: Say you accept that there's a 20% chance that your view is unstable. That's a minority chance, but even minority chances can be epistemically worrying, especially when one idea needs to be strung next to another next to another to get one's final conclusions.
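The arithmetic behind both worries can be sketched in a few lines of Python. This is only an illustration under a strong assumption of independence (that being subject to one effect, or unstable at one step, tells you nothing about the others); the 20% figure is the hypothetical from the paragraph above, and the function names are made up for the example.

```python
def p_at_least_one(p_each: float, n_effects: int) -> float:
    """Chance of being subject to at least one of n independent effects,
    each of which touches a fraction p_each of respondents."""
    return 1 - (1 - p_each) ** n_effects


def p_chain_stable(p_unstable: float, n_steps: int) -> float:
    """Chance that every step in an n-step argument rests on a stable judgment,
    if each step independently has probability p_unstable of being unstable."""
    return (1 - p_unstable) ** n_steps


if __name__ == "__main__":
    # Hypothetical 20% rate per effect / per step, as in the text above.
    for n in (1, 3, 5):
        print(n, round(p_at_least_one(0.2, n), 3), round(p_chain_stable(0.2, n), 3))
```

Under these assumptions, with three such effects or three argumentative steps the chance that none of them causes trouble is already down to about 51%, and with five it is about 33%. If the same few people are subject to every effect, the independence assumption fails and the numbers look less worrying.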

Unknown: I'm inclined to recommend dialogue rather than solitude.

Callan: "Metaphilosophical" -- the philosophy of philosophy. I view philosophy and science as overlapping rather than as distinct.

Regina Rini said...

Thanks for the interesting discussion, Eric! I’ll try to say a couple things in reply to your comments on my paper.

First, for those who haven’t read the paper, let me clarify what it claims. I’m claiming that professional philosophers are probably doing something different than ordinary people when they participate in these studies, and this fact makes it difficult to interpret the results. You can get the idea if you imagine the following: You see Peter Singer at a conference and you walk up to him. You say, “Peter Singer, is it ethically consistent to save a child drowning right in front of you but *not* contribute anything to alleviating global poverty?” It would be very very strange if Singer replied, “Hold on a moment, I’ve got to work out my answer to that question.” This would be strange because Singer has written about this question for more than forty years. He already knows his answer. He doesn’t need to generate a new answer each time the question is asked – he just remembers what he said every time before.

My claim is that the philosophers in your initial study are like Singer in this story. They are familiar with the trolley cases you use as central stimuli. They have reacted to these cases before. It’s just strange that they would stop to generate *new* intuitions, rather than immediately report their memories of how they replied in the past. Note – IMPORTANT – that this point is independent of the hypothesis of your study. That is, the expectation that people would just report their familiar responses to familiar cases does NOT depend on affirming (or denying) any special reliability to philosophers’ intuitions. It’s just common sense. If your job involves teaching and defending a set of views you’ve taught and defended dozens of times before, you don’t need to bother generating those same views every time you think about them.

That is my reply to concerns Wesley has raised here and elsewhere. I’m not floating some empirical conjecture about the underlying psychological machinery. My point about familiarity is extremely simple: experts in a domain are likely to use memory of past response (rather than novel response-generation) to reply to familiar cases. This seems hard to deny; I certainly think the burden of proof is on anyone who wants to deny it.

So, if we grant my point about familiarity, then the results you report are perplexing. Presentation order shouldn’t matter to what people *remember* of their past responses to familiar cases. Singer is not going to give a different answer if you ask him about distant poverty before you ask him about the drowning child, because he already knows how he has always responded to each. And yet your professional philosopher subjects apparently *do* show order effects! The puzzle is: why? Why don’t they just report their memory of past responses? Why do they apparently generate novel intuitive responses (which are subject to order effects) rather than report memories (which predate the order manipulation)?
(continued in next comment...)

Regina Rini said...

(...continued from previous comment)

In the paper I go through various explanations, which you helpfully summarize here. I agree with you that the first two explanations are unlikely. A combination of the third and fourth explanations seems most likely. That is, (3) some philosophers have never felt the need to form opinions about these cases, or reject any simple ‘permissible/impermissible’ answer, so when your study forced them to give something other than their familiar non-answer, they were subject to order effects. And (4) some philosophers have unstable intuitions about these cases, such that they have no *single* past reaction to remember, and so they are forced to generate a novel response, subject to order effects. My broader dialectical claim is that, if (3) and (4) are right, then it is no problem for philosophy if these people account for your data --- so long as they aren’t also the SAME people going around proclaiming that their own intuitions are distinctly reliable. (Importantly, your results do not show that *all* or even most philosophers are subject to order effects.) The data in your initial study is compatible with the possibility that people who claim stable and reliable intuitions were not subject to order effects, and those who do not claim stable or reliable intuitions account for the reported effects.

That’s where your follow-up study, which I was very glad to see, comes in. You explicitly asked subjects about their familiarity with the stimuli and whether they think their reactions are stable. My worry here is about how we interpret self-report. When you ask your participants whether their intuitions are “stable”, this doesn’t necessarily mean the same thing, out-of-context, as it does in the dialectical context surrounding the expertise defense. In my paper I used ‘diachronic stability’ as a technical term, which may or may not have the same application as an out-of-context use of the colloquial word ‘stable’. It’s usually not a good thing to self-ascribe “unstable” professional opinions, absent context. It is less of a problem to agree that, on certain topics, one does not always react the same way to a particular case – especially if the context makes clear that “unstable” views are fine so long as they are not conjoined to claims of reliability, or to philosophical practice that implicitly presupposes reliability. Basically, to get better self-report of stability, I think it needs to be made clear to participants that it is not necessarily a bad thing to admit to instability.

In short, I’m not convinced that out-of-context self-report is sufficient to sort out the target population. What we want to do is look at philosophers who (a) make use of particular intuitions in their philosophical practice and (b) believe that those same intuitions are stable and reliable, in the relevant sense. If *these* people display order effects, then I agree that the expertise defense is seriously undermined. I have some more suggestions for how to get that data. But this comment is already long enough, so I’ll encourage interested readers to look at the last section of my paper!

Eric Schwitzgebel said...

Thanks for clarifying all this, Gina!

My guess is that Kamm and Fischer, and others who have published multiple articles on trolley problems, will not have unstable intuitions in the relevant sense. However, this guess is not yet empirically confirmed (and might be beyond realistic hope of confirmation). But another relevant group is their audience of expert readers, by which I mean the advanced graduate students and professors who might not publish on these problems themselves but who are the primary target audience of the articles. If these philosophers have unstable intuitions, in the relevant sense of "unstable", then the reception of work on the topic is subject to troubling influences. The reception is probably what matters most to the discipline (or maybe public reception is what matters most -- then the case for instability is even stronger). One argument for this is to consider an extreme case: If everyone ignored Kamm, it wouldn't matter to the discipline what she wrote.

I agree that there's a gap between the conclusions of this study and the conclusion that philosophers who determine the reception of philosophical work have unstable opinions that are open to a variety of epistemically troubling biases. For one thing, we only test two types of instability on a few types of cases. For another, as you point out, the context of an internet survey -- even one targeted toward professional philosophers -- is different from the context of reading a work and forming an opinion. A direct empirical test of the phenomena in question is not going to be possible.

What is possible is plausibility arguments based on convergent evidence. Much of my career has been devoted to exploring various ways in which the factors driving philosophical opinion are not what expert philosophers think they are, and are epistemically suspect -- from my work on the unreliability of introspection (including how philosophers' opinions about the stream of experience seem to be shaped by media metaphors), to my work on implicit bias, to my work on cultural variation in philosophical opinion, to my work on the origins of philosophical views in developmental psychology, to my work on the shaky mix of cosmology and intuition that must be the epistemological basis of grand metaphysical theories. It all fits into a picture, of which this work with Fiery is one piece.

Callan S. said...

I think it's very strange in the other direction, in regard to not forming a new answer each time. Or more exactly, to not form a new answer each time and to think that is fine and legitimate - that seems very strange to me. What if you are wrong? Then you're just repeating by rote learning an answer that is wrong. Indeed it'd go against the very grain of the notion of critical thinking - every time you were probed on a subject upon which another person thinks you were wrong, you'd end up just repeating the rote answer over and over again.

I grant the practicality of not wanting to form a new opinion each time - it's a labour.

What I don't get and find weird is treating it as legitimate to do so somehow, rather than "Yeah, I can't be bothered, though hey, maybe if I thought about it I'd find a new insight/find some way I'm wrong on the matter". There's probably already a study that's named a bias that assigns accuracy to a thought precisely for how much it has NOT been questioned/thought about.

John Turri said...

"I’m not floating some empirical conjecture about the underlying psychological machinery. My point about familiarity is extremely simple: experts in a domain are likely to use memory of past response (rather than novel response-generation) to reply to familiar cases. This seems hard to deny; I certainly think the burden of proof is on anyone who wants to deny it."

That is a textbook case of an empirical conjecture about psychological processing.

"Presentation order shouldn't matter to what people *remember* of their past responses to familiar cases."

Maybe in some sense it shouldn't, but is there any evidence that memory is *not* susceptible to order effects? Clearly another empirical question. Based on what is known regarding the malleability of memory, I would not be surprised to learn that it exhibits order effects.

Arnold said...

Has Socratic method for understanding oneself been the 'presentation effect' causing philosophical order effects and annotations to this day...Happy New Year...

ref: searches for stability...

Wesley Buckwalter said...

I still see no evidence for the two premises in the central argument of this paper, namely that philosophers use only memory when responding to certain thought experiments or that doing so is not susceptible to bias. I ask again, is any evidence offered for these claims about either mental processing or the nature of bias?

I can also engage in a chain of speculation but for the opposite claim that the results are expected. For instance, we might expect more errors when people pay less attention to things. And we might expect people seeing the thought experiment for the first time to pay more attention to it than someone who is familiar with the basic idea and is just recalling a result from memory rather than thinking about it. Therefore according to this chain of speculation, the results obtained would be quite expected. Of course, we don't know if this chain of speculation is correct, which is why it would be important to test it before making claims on its basis about actual controlled experimental findings.

Eric Schwitzgebel said...

Callan, John, and Wesley -- I was inclined to spot Rini those claims as a plausibility-based starting point, but clearly not everyone shares Rini's assessment here. To be fair to her, empirical work too, especially in relatively sparsely-studied areas, needs to start with plausibility-based assessments of, for example, what types of factors are and are not worth trying to control for, what Hypothesis A and Hypothesis B would reasonably be interpreted as predicting, etc. When the interlocutors disagree about what is initially plausible, as apparently here, that process breaks down and further work needs to be done -- as I think all of us including Rini agree about!

John Turri said...

Eric, I can see making or granting the assumption for this or that purpose. But I don't see how it could be construed as non-empirical or "just commonsense" or placing a special burden on someone who does not proceed by making the same assumption.

Eric Schwitzgebel said...

My perception of the dialectic here is that Rini is assuming that familiarity would lead to stability as part of a plausibility argument that raises worries about my work among others, so in spotting it to her, I'm being charitable. I don't think my own work relies on it. Yes?

John Turri said...

Agreed, I don't see how your work would rely on it.

Wesley Buckwalter said...

Hey Eric, I agree empirical work must regularly make plausible assumptions when designing experiments. To me, the possibility of the claims we are considering being true seems like an excellent place to begin inquiry, in the form of designing experiments to test them. But the issue isn't whether the process of what interlocutors find initially plausible breaks down. The issue is that initial plausibility constitutes speculation that is insufficient to substantiate a chain of empirical premises for the purposes of undermining actual experimental findings.

I am a firm believer in charitably interpreting arguments. But what I will not accept as a research practice, and I will stick to this, is dismissing a record of controlled experimental research, or broader aspects of research methodology, on the basis of inferences that depend on chains of highly unsubstantiated empirical assumptions. The conclusions made here go far past raising a possible worry, and I believe that is harmful to inquiry.

Arnold said...

Today does Dialecticism have many forms for the pursuit of truth...the stability of meta-physical for self knowledge, meta-philosophical for forms of knowledge, philosophical for measuring knowledge, then...