Wednesday, February 16, 2022

Qualitative Research Reveals a Potentially Huge Problem for Standard Methods in Experimental Philosophy

Mainstream experimental philosophy aims to discover ordinary people's opinions about questions of philosophical interest. Typically, this involves presenting paragraph-long scenarios to online workers. Respondents express their opinions about the scenarios on simple quantitative scales. But what if participants regularly interpret the questions differently than the researchers intend? The whole apparatus would come crashing down.

Kyle Thompson (who recently earned his PhD under my supervision) has published the central findings of a dissertation that raises exactly this challenge to experimental philosophy. His approach is to compare the standard quantitative measures of participants' opinions -- that is, participants' numerical responses on standardized questions -- with two qualitative measures: what participants say when instructed to "think aloud" about the experimental stimuli and a post-response interview about why they answered the way they did.

Kyle's main experiment targets an influential study that purports to show that ordinary research participants reject the "ought implies can" principle -- the principle that people can only be morally required to do what it is possible for them to do. Thompson replicates the quantitative results of the earlier experiment, seeming to confirm that participants reject ought-implies-can. However, his qualitative think-aloud and interview results clearly indicate that his participants actually accept, rather than reject, the principle. The quantitative and the qualitative results point in opposite directions, and the qualitative results are more convincing.

In the scenario of central interest, "Brown" agrees to meet a friend at a movie theater at 6:00. But then

As Brown gets ready to leave at 5:45, he decides he really doesn't want to see the movie after all. He passes the time for five minutes, so that he will be unable to make it to the cinema on time. Because Brown decided to wait, Brown can't meet his friend Adams at the movie by 6.

Participants then rate their degree of agreement or disagreement with the following three statements:

At 5:50, Brown can make it to the theater by 6

Brown is to blame for not making it to the theater by 6

Brown ought to make it to the theater by 6

As you might expect, in both the original article and Thompson's replication, participants almost all disagree that Brown can make it to the theater by 6. So far, so good. However, apparently in violation of the ought-implies-can principle, participants overall tended to agree that Brown is to blame for not making it to the theater by 6 and (to a lesser extent) that Brown ought to make it to the theater by 6. Interpreting the results at face value, it appears that regarding making it to the theater by 6, participants think that Brown cannot do it, that he is blameworthy for not doing it, and that he ought to do it -- and thus that someone can be blameworthy for failing to do, and ought to do, something that it is not possible for them to do.

Now, if your reaction to this is wait a minute..., you share something in common with Thompson and me. Participants' think-aloud statements and subsequent interviews reveal that almost all of them reinterpret the questions to preserve consistency with the ought-implies-can principle. For example, some participants explain their positive answers to "Brown ought to make it to the theater by 6" by explaining that Brown ought to try to make it to the theater by 6. Others change the tense and the time referent, explaining that Brown "could have" made it to the theater and that he should have left by 5:45. There is no violation of ought-implies-can in either response. At 5:50, Brown could presumably still try to make it to the theater. And at 5:45 he still could have made it to the theater.

Through careful examination of the transcripts, Thompson discovers that almost 90% of participants in fact adhere to the ought-implies-can principle in their responses, often reinterpreting the content or tense of the questions to render them consistent with this principle.

As far as I'm aware, this is the first attempt to replicate a quantitative experimental philosophy study with careful qualitative interview methods. What it suggests is that the surface-level interpretation of the quantitative results can be highly misleading. The majority of participants appear to have the opposite of the view suggested by their quantitative answers.

It is an open question how much of the quantitative research in experimental philosophy would survive careful qualitative scrutiny. I hope others follow in Kyle's footsteps by attempting careful qualitative replications of important quantitative work in the subdiscipline.

39 comments:

Daniel Polowetzky said...

The test questions seem to be misleading. It makes sense that participants would explain their adherence to “ought implies can” by changing the tense of one or more questions. Otherwise, they don’t even make sense.

If I promised to join you on a plane flight but then changed my mind, resulting in your being on a plane that has departed, a natural way to describe the morality of my behavior is to say that I should have been on that flight.
To assert that I ought to be on that flight just means that I should have been on the plane. It’s hardly a moral obligation to defy the laws of physics.

I suppose that one may owe money even when paying off the debt is impossible. However, even in this case, the scenario can be reworded such that despite a present continuing debt, the moral imperative is not to do the impossible.

Arnold said...

Do I understand the question...
...Has experimental philosophism, to this date, ever even established functionalism, as the only approach to experience for...'presenting scenarios to online workers'...

That quantitative and qualitative meanings may be human function, but may have entirely different meanings when compared with inner and outer experiencings...
...Which is, the tense, to be established before relating biology, physiology, and psychology to them...

Like Mr. Thompson might have asked first...do you have an inner and outer life...do you have a qualitative and quantitative life...

Eric Schwitzgebel said...

Dan: Right, that makes sense to me. Participants might tend to subscribe implicitly to ought implies can. Given that, they might reinterpret the questions charitably as consistent with that principle while still applying the deserved blame. If so, that’s rational on their part.

Arnold: I'm not sure quantitative and qualitative quite maps onto inner/outer. One can do quantitative studies of "inner experience" (e.g., Chris Heavey) and qualitative studies of "outer life" (e.g., traditional anthropology).

Kyle Thompson said...

Thanks so much for sharing my research, Eric! And for helping me along the way in finalizing it! I really can't say how excited I am to see folks discussing these methodological issues of x-phi!

Also, Eric, I appreciate your responses to Dan and Arnold. I think I'm inclined to agree with you and with Dan about the way our obligations would track the laws of physics (roughly speaking). So, when participants in previous studies on "ought implies can" seemed to violate the principle, I had a suspicion--as other folks did--that there was something more interesting going on in their responses. I think the way you put it, Eric, is right on: some folks might give charity to the question of an agent's obligation. I think it might be a conversational norm to make the most sense of what someone is saying or asking.

For me the most compelling evidence for "ought implies can" being supported by non-philosophers is what we don't get in participants' interviews and think aloud sessions: if "ought" really does not imply "can," then participants shouldn't express such consternation or hesitation in speaking in negation of the principle; they should've negated "ought implies can" with ease, but didn't. So, when participants kept adhering to interpretations of questions that supported "ought implies can," it suggested that they really did 'respect' the principle, so to speak.

Arnold said...

From Philpapers by Heavey..."This study provides a survey of phenomena that present themselves during moments of (naturally occurring inner experience). In our previous studies using Descriptive Experience Sampling we have discovered five frequently occurring phenomena—inner speech, inner seeing, unsymbolized thinking, feelings, and sensory awareness."

My question...'naturally occurring inner experience'...to me is only understandable as qualitative but then pursued quantitatively in his paper...

...That Mr. Heavey and Mr. Thompson and others separate inner and outer and qualitative and quantitative from each other and then propose them as the same and ought to be understandable foundations for study of what could and can be...

My argument...all experimental philosophy begins/stays with Socrates' (my inner life here now)...
...that inner-outer-qualitative-quantitative are not comparables... Thanks for the Heavey reference...

Daniel Polowetzky said...

I’d be interested in what this research reveals about non-philosophers’ beliefs about the moral status of persons choosing one horn or the other in moral dilemmas.
I recall a paper of Ruth Barcan Marcus that argued for some sort of residual moral guilt accruing to persons unlucky enough to be presented with a moral dilemma.
She asserted that this occurs even when the morally “right” or required action is in no doubt. Such a situation occurs if you have to violate one moral principle to avoid violating another, even when everyone agrees that choosing the other horn would be morally a lot worse. This occurs even if the person always does the right thing in these situations.
Her view involves the impossibility of moral innocence in such cases. One’s lack of moral innocence can be the result of bad luck.
Counterintuitively, her view seems to allow you to judge someone’s moral status, or somehow keep moral score, simply by knowing what moral dilemmas someone was faced with.
It seems weird to read someone’s bio, listing pairs of choices, and have an opinion of them morally, without knowing which acts were chosen.
This position seems to require the impossible of the person faced with a moral dilemma. It is impossible not to accrue moral demerits in such situations. You’re damned if you do and damned if you don’t.
This seems to conflict with the principle of “ought implies can”.

Joanna Demaree-Cotton said...

This sounds like great work and I look forward to reading the paper in full! However, I do wonder whether some of the more ambitious claims here are warranted, specifically about this posing big problems for quantitative x-phi methods in general. The ought-implies-can x-phi literature that this paper does a great job of responding to proved highly controversial in part because it involved a controversial and rather specific methodology in terms of *how it was interpreting quantitative results* - namely ascribing a very specific principled belief on the basis of two separate measures in a single condition, measures that in fact we should expect to elicit more flexible interpretations and responses - as Thompson has rightly pointed out and empirically substantiated. But I don’t think this shows a problem with quantitative methodology per se - just a specific kind of mistake with how certain kinds of quantitative data can be misinterpreted. To illustrate, a lot of experimental philosophy does not rely on this kind of interpretative methodology - for example, one very common method is to contrast two conditions that differ systematically in one factor, and if systematic differences in X judgments are found between those two conditions, then this is evidence that X judgments are sensitive to that factor. That’s a quantitative method, but one that doesn’t share the kind of dangerous interpretative move exemplified by some of the OIC work. So I don’t think Thompson’s findings, important as they are, should make us deeply worried about that kind of quantitative x-phi, for example - though qualitative methods should be welcomed also!

Brad Cokelet said...

Wow, sounds like a valuable methodological challenge. When I have worked with psychologists they seem to often use "cognitive interviews" as a part of scale validation in order to make sure that subjects are on the same page as the researchers, among other things. I would have thought that x-phi people would by now be learning standard scale validation stuff and putting it to use. If not, maybe this will push them in that direction when training (in a related vein I wonder if they use things like social desirability scales as methodological checks). Will look forward to reading this!

Brad Cokelet said...

It’s kinda hilarious that the movement that made its mark taking armchair philosophers to task for epistemic hubris would go ahead and ignore standard social science safeguards against epistemic hubris when doing studies

Brad Cokelet said...

Sorry if that last comment was too snarky!

Stephan said...

The example question in the post is terribly formulated for investigating the research question posed. If the claim is just that terribly formulated questions are a "potentially huge problem" for XPhi, I can't help but agree!

Anonymous said...

"I think it might be a conversational norm to make the most sense of what someone is saying or asking."

It might be, but that's itself an empirical question. In terms of mere anecdotal evidence, I've only heard philosophers talk about a "principle of charity" when interpreting texts. I work in a cognate discipline where that norm does not obtain. (Colleagues don't try to have it make the least sense, but they don't go out of their way to try and have someone's text make the most sense either.)

Anonymous said...

I don't know the original research this study assesses, but my judgement based on the example question alone is not simply that it is an example of a 'terrible' question. It looks like blatant research malpractice. The question looks obviously designed to trick respondents into appearing to violate ought implies can in order to get an eye catching result. Results of the study couldn't possibly have made it through peer review, could they?

Eric Schwitzgebel said...

Thanks for the continuing comments, folks!

Arnold: You write: "My question...'naturally occurring inner experience'...to me is only understandable as qualitative but then pursued quantitatively in his paper..." Right, there are two senses of "qualitative", one pertaining to having qualia or experience, and the other having to do with research methodologies that focus on methods like interview rather than numerical data.

Dan P: Yes, I agree! That's an interesting set of issues. They could be ripe for qualitative study.

Joanna: Right, it could be that this study is unusual in being undermined by qualitative data, while lots of other research would stand up well to qualitative examination. I think we don't really know. There's perhaps already some qualitative evidence of trouble with free will and metaethics x-phi, from Kyle and Lance Bush.

Brad: Perhaps they did such measures, but not as carefully as Kyle. There are lots of subtle hazards in questionnaire research and even the best psychologists don't get it right. There's probably also room for critique of Kyle's work. I find it convincing, but others may see more serious problems with it.

Stephan / Anon Feb 18: I'm inclined to agree that the questions look not so good, knowing what I now know. As someone who has done a fair bit of empirical research, though, I can testify that problems that seem obvious in retrospect once you take a hard look at the data are much harder to see in advance.

Anon Feb 17: Yes, I agree that that's an empirical question. I'm inclined to find it plausible, but I don't know the empirical literature on maxims of this sort.

Kyle Thompson said...

Thanks, Joanna, for the excellent comment and analysis of whether this could be a larger problem for x-phi. My thinking though is that we philosophers and social scientists, in so many studies, tend to forget there is a gap between what we take a survey question to capture and what it actually does capture (even if we jump the gap just fine, it is still there). A bit-too-blunt way of putting this is 'just because we coded this question as capturing an x judgment doesn't mean it does.' I know researchers work to make sure their questions really do get at what they hope they get at, but I'm often--not just in the "ought implies can" context--baffled by how 'clean' a question is treated: the researchers seem to think that a question about agency, knowledge, belief, etc. really cleanly captures someone's judgment about agency, knowledge, belief, respectively. (Contrast this to every class discussion when these questions are introduced and people hem and haw and break the question's confines and misinterpret, etc.) So, when you say the following, I find the same problem lurking in the methodology: "for example, one very common method is to contrast two conditions that differ systematically in one factor." As I understand it, such condition contrasting requires assurance that conditions A and B don't just differ, as they surely do, but that they differ in just the way the researchers suppose. Again, saying that we are 'comparing' assumes that we are getting clean judgments to compare--it is here that my worry enters, across x-phi studies.

To end off here, I'll say something a bit more provocative (with an all-in-good-fun attitude behind it): don't I have a replication crisis on my side when I worry aloud about methodological precision being potentially endemic? And heck, even if we do replicate well, might we have a validity crisis on our hands (as I think I've shown in one case to be true)?

Kyle Thompson said...

Hi, anonymous--a phrase that always makes me feel strange when I utter it to myself--you wrote: "I don't know the original research this study assesses, but my judgement based on the example question alone is not simply that it is an example of a 'terrible' question. It looks like blatant research malpractice. The question looks obviously designed to trick respondents into appearing to violate ought implies can in order to get an eye catching result. Results of the study couldn't possibly have made it through peer review, could they?"

I'm going to challenge your notion that there was "malpractice" going on here in any sense, and in the process compliment my work in a way that should make my arrogance more of a target of criticism than anything you might perceive in the work of the original researchers (who I have no doubt designed these questions with good intent to get good responses from participants).

I think what now looks like a trick question in the original study only seems so in hindsight, now that we can look at people's responses from my study (and others) and see how messy they turned out to be. In other words, I take the sense of trickery you note as resulting from a good publication dialectic: original studies asked participants in this way, which prompted me to clarify and reveal there might be more things going on in their responses, etc. But, at each point along the way, researchers were trying to figure out the most effective way to get at what is a really complicated--and interesting--set of judgments relating to obligation and ability. Now, in this dialectic, someone will surely try new questions and combine new methods to reveal my analysis missed some key aspects of folks' judgments too! (And I'll have to admit I missed something important and applaud their work.)

But, the idea that the original study wouldn't/shouldn't make it through peer review is, to me, an unfair treatment of how rigorous the study was. Now, in hindsight, I can say I think the questions ended up treating participant judgments as too neat, but that's only because the researchers did such a thorough and good job exploring the "ought implies can" phenomenon in the first place.

Kyle Thompson said...

Brad: thank you for the thoughtful comments and replies.

You wrote: "It’s kinda hilarious that the movement that made its mark taking armchair philosophers to task for epistemic hubris would go ahead and ignore standard social science safeguards against epistemic hubris when doing studies." And you also worried that might be too snarky--I'm certainly fine with no upper limits to snark, but that's just me.

If I can, I want to point to an even deeper irony while couching my comment in what I take to be a playful and humane observation about the paradoxical existence of being a social science researcher:

Not only might x-phi folks be tempted to forgo some of the recommended psychology/cognitive science best practices, the x-phi movement is painfully aware--at least in its arguments--of the limitations of the universality of any group of people's intuitions/judgments given contingent aspects of their cognitive/cultural backgrounds. (I'm thinking here of that classic x-phi defense that we are in the business of debunking certain expert intuitions as being universal since philosophers, too, are grounded in potentially atypical intuition-shaping cultures, such as that of being trained in academic philosophy.) And yet, when it comes to writing out the fun part of an x-phi survey--i.e., the wacky vignette with the follow-up questions--we philosophers trust our intuitions/judgments of what the questions tap into rather than validate the questions themselves by seeing how some non-expert third party groks them! To us, the vignette clearly spells out a situation in which--and I'm teasing here my interest in and skepticism toward the free will literature--the world is deterministic, but participants might mistake it as being deterministic for the world outside the agent's mind but indeterministic within it, or whatever messier judgment they might have.

But this is just the paradoxical nature of being a researcher: you pursue a research program across years of your life, multiple degrees, multiple grants, etc., and you expect yourself in all that passion and dedication to somehow be the *least* biased in interpreting your data! I have had the experience so often of looking at a vignette and seeing it SO clearly just one way that I couldn't possibly imagine seeing it another (insert Wittgenstein's duck/rabbit reference or whatever). And there's the irony*: that's precisely what x-phi warns us about! We can't see the theoretical waters we swim in.

*I'm sure a linguist or literary theory expert would tell me I'm using "irony" incorrectly or too liberally, which might be the only example of genuine irony in my passage. Know that I'm working on myself in this respect.

Arnold said...

That experience and experiments purposefulness 'ought imply can'...
...(and) lead to usefulness of worry for free will in evolution...

Is experimental philosophy also experience philosophy...

John Turri said...

Congratulations to Kyle on the publication. It's great to see other people working on this and trying new things.

I have three closely related questions for Eric and Kyle on their interpretation of Kyle's qualitative project.

(1) How many distinct lines of convergent experimental evidence are there currently against the hypothesis that OIC is part of commonsense morality? (2) Of those distinct lines of convergent experimental evidence in 1, for how many of them does Kyle's qualitative project support the conclusion that almost all participants reinterpret the questions to preserve consistency with OIC? (3) Of the distinct lines of evidence in 1, how many ask participants to record judgments using quantitative scales?

Also, Kyle, a couple of related questions specifically for you:

(4) When you found that you were able to "reverse" certain findings using a different procedure, you took that as evidence that the "un-reversed" finding had been misinterpreted. In particular, you took the "reversed" findings to genuinely reflect people's initial understanding of the concept of moral obligation. In reaching that conclusion, which alternative hypotheses for the reversal did you consider?
(5) Could you please summarize how, using this method, you distinguish between, for example, "participant 30 was unsure how to interpret the third question," on the one hand, from "participant 30 was unsure how, or unwilling to, verbalize her earlier interpretation of the third question to me," on the other?

I'm really interested to think more about these issues in light of your answers.

Congratulations again, Kyle.

Wayland Smith said...

The three questions do not in any way address 'ought implies can'. The subject is to blame for not leaving at 5.45, when he would have been on time. He both could, and ought to, have left on time. At 5.50 he is certainly to blame for his past actions. Also, he is arguably to blame for not leaving at 5.50. He would be 5 minutes late, but his friend might still be waiting for him - perhaps they would only miss the previews! Much better than a simple no-show.

Since the questions don't go to 'ought implies can', it's not surprising that the answers don't either.

Kyle Thompson said...

John: thank you so much for your kind words and great questions. I really appreciate your message here and your excellent work on "ought" and "can" in the x-phi literature!

Let me try to answer the questions in a way that might at first seem the long way around.

As a preamble, I want to make an observation that applies not only to your questions, but to much of the feedback I've gotten: naturally, folks have questions about (and worries about) the interpretative moments in my use of qualitative methods. Even if someone concedes that, on the whole, I've shown reason enough to challenge this or that aspect of survey use on this or that question, they might still wonder, as you seem to, how I've managed to justify one interpretive conclusion about the data as opposed to another. To this, I'll first note that 1) I spend a substantial number of words in the publication itself justifying how I interpreted my data, and 2) I don't think I've read a single paper in the x-phi literature (to my current recollection) that takes such pains to justify its use of survey methods AS a method. In other words, I don't think I've seen folks feel a need to stop and say, "We take quantitative survey responses to accurately capture a person's thinking because x and y." Authors might justify their particular questions or the wording in their particular vignette, but that's not what I'm talking about. I'm highlighting that survey-based studies, in my experience, rarely if ever feel the need to stop and justify the use of surveys as data collecting tools.

And yet--my preamble continues--to treat a survey question as capturing a judgment requires interpretation. It requires the researcher to deem that this wording in combination with this Likert-type response really does show that the participant had this judgment. The interpretation does not go away just because it is baked into quantitative survey data. Please do not think I am indignant about this reality--wherein some methods largely escape critical examination in a given publication--because I recognize that I am trying some methodological approach outside of the norm; the extra justification on my part might be warranted. Rather, I bring this up to highlight the fact that, I'd argue, x-phi literature, by the fault of no individual person or study, tends to privilege certain methodological assumptions as not needing to be defended while casting doubt on others as requiring defense.

Now, to your questions in particular (and thank you for listening to my TED Talk there!). I'm not totally sure I understand them all--and fault is certainly with me on that count--but I'll do my best.

Part 1 end

Kyle Thompson said...

1) I realize how irritating this response might be after I already asked you to sit through my rambling preamble: I don't know the answer to this question, but I think I'd worry about what gets counted as "lines of convergent experimental evidence." I really don't want to sound like I'm overstating my results here, but I do not--unsurprisingly--find the survey data on "ought implies can" to count as solid evidence. Gosh, just typing that makes me feel like I'm really being dismissive, and I don't want that to be the perception. I take all the empirical data seriously, but I find myself quite skeptical about just about all of the survey data on the topic for just the reasons I discuss in my paper: I find the surveys, no matter how carefully constructed, to constrain and mislead (unintentionally) and to adulterate the multifarious judgments folks have that reveal whether or not they uphold "ought implies can." So, this question about number of lines, if I understand it correctly, seems to suggest that the more lines the stronger the view? I think that's where I'd get off the bus if the lines of evidence aren't the kind of evidence I'd count as evidence. (I'm so painfully aware of how dismissive that sounds, but I mean to convey skepticism toward surveys, that's all.)

2) I think I'll defer to my response to (1) here. Also, I'm not totally sure I understand the question (again, this is something I'm pinning on me!).

3) Doing my best to recall the literature review I cite in the paper, I would say that the bulk of the empirical data is quantitative in nature, and the small amount of qualitative-type data was often quite constrained and limited by the quantitative-type questions that preceded it. For example, a free-write section or a brief question asking for participants to justify a response counts, in some sense, as qualitative data, but not the kind I think was needed for understanding "ought implies can." I think participant justifications often have the same problems as quantitative questions themselves, because they are justifications based on participants' interpretations of the quantitative-type question, which might not be the interpretation the researchers had in mind when constructing the question. The think aloud method and the interview, though, open up the qualitative data-space so that participants can express their multifarious judgments in more detail and in response to multiple questions from multiple angles such that their judgments are more clearly captured in the process.

Part 2 end

Kyle Thompson said...

4) This is a good question. I considered everything I could imagine could have been happening: that the interview questions primed them to change a response; that the think aloud method interfered with their thinking; that the survey questions were the 'real' response, etc. But--and I encourage you to do this--if you read transcript after transcript, it becomes difficult to conclude anything other than participants clarified their thinking during the interviews or presented their "real" responses in the think aloud sessions. One way to show this is that the interview questions often included a more concrete question about the agent's "ought": one interview question asked about a timestamp (just as question 1 on the survey did). When participants were asked the more concrete question, they changed their tune, which highlights that their response to the non-timestamp version was a response to the ambiguity of the question; when you combine this with the fact that participants often expressed a more effortful wrestling with question 3 (without the timestamp), it seems that they were 'doing their best' to understand an ambiguous question. They more effortlessly clarified their thinking when given clearer questions and the space to respond to multiple questions about their thinking. I highly recommend looking at my comments on the three example transcripts as part of the supplemental materials to see how extensive the interpretation process was for any given transcript. What I've given here is necessarily abbreviated.

5) I'll defer to my response on (4) here, and also note that I need to jump off now to help prepare for a family gathering. I can help, I want to, and I ought to.

I hope we can stay in touch on these matters, John. And, again, thank you for your comments and support.

Part 3 and total end

Arnold said...

Eric you write "...there are two senses of "qualitative", one pertaining to having qualia or experience, and the other having to do with research methodologies that focus on methods like interview rather than numerical data."...
...then are there two senses of "quantitative", like quantity and purpose...

Ought we to understand qualitative experience with quantitative experience as a coalescence of sense...we 'can' only have one with the other...





John Turri said...

Kyle,

Thanks for your answers.

I suspect terminological choices here are interfering with getting at the underlying issues.

For example, almost none of the work I've done used "quantitative questions," but instead qualitative judgments. Better terminology for the contrast you seem to be interested in is "closed-form/open-form" response, not "quantitative/qualitative."

You refer to earlier work on OIC as "surveys." But it wasn't survey research. Instead it was experimental research in which the dependent measures were participant judgment or another performance. Accordingly, the reason you haven't encountered a paper at "pains to justify its use of survey methods" is probably that they don't use survey methods.

Combining the previous two points, you claim that your qualitative study's "core finding" is that "quantitative surveys fundamentally misrepresented participants' OIC judgment." But "quantitative survey" fails to describe any of the earlier work on the topic, which is why I asked you to clarify which of the many earlier OIC studies your conclusion applied to, and which it didn't.

Getting beyond terminology to substance, many earlier OIC experiments included control conditions to address interpretative questions and objections of the sort you're concerned with. They varied the experimental design, procedures, stimuli, dependent measures, and analytical strategies in order to test the relative merit of alternative interpretations of existing findings (thus my reference to "convergent evidence"). Your paper ignores all of this. Prior work didn't rest content with "the surface-level interpretation," as Eric put it. No one I know would disagree with your claim that understanding participant responses "requires interpretation." To the contrary, improved interpretation was precisely the aim of approaching it from so many different angles. Accordingly, when you write, "I don't think I've seen folks feel a need to" explain why their study should be taken to "capture a person's thinking," I infer that you either haven't read or don't recall large swaths of the literature.

To take just one example, researchers found that people's judgments strongly align with OIC for *legal* obligations, but strongly misalign with OIC for *moral* obligations. The *only* change across two conditions of an experiment was to switch from asking whether the agent is "legally obligated" or "morally obligated." That one small change was enough to completely reverse the central tendency. But if people don't get confused or start reinterpreting the test question for *legal* obligation, then it is unlikely that such things are happening when it comes to *moral* obligation. Indeed, this finding is relevant to evaluating the hypothesis offered on p. 18 of the early view version of your paper, according to which people are *so* committed to OIC for moral obligation that when asked whether an agent "ought" to do something that he "cannot" do, participants "have no choice but to" switch to some other, unwanted interpretation of "ought," in order to even "make sense" of the question. Now explain why people don't feel pressure to switch to another interpretation of "ought" for *legal* obligations. Do their responses agree with legal-OIC but disagree with moral-OIC because they are *more strongly committed* to moral-OIC? That is one epic epicycle.

[1/2]

John Turri said...

Relatedly, you claim that the method you used is "uniquely suited to resolve key disagreements within experimental philosophy." You also claim, "All of the studies on both sides of the OIC debate suffer from the same methodological problems." And Eric claims that you "discover[ed] that almost 90% of participants in fact adhere to the [OIC] principle." However, you did nothing to compare your method and results to the many other ways that interpretative questions and objections have been addressed by prior studies (the legal/moral comparison above is just one example). Instead, you basically took a single previous study using one method and compared it to the results you got by using another method. Accordingly, I am truly mystified how your results could be taken to support such strong conclusions.

Finally, you say that you considered everything you "could imagine could have been happening" when interpreting your participants' responses. You seem to not have considered the following hypothesis (which is ironic given that your stated motivation for skepticism about prior work is that OIC judgments are "multifarious"):

Multifunction Hypothesis (MH): Obligation attributions have at least the following functions: (1) describe obligations, (2) encourage their fulfillment, (3) cast blame. Ability consistently constrains function 2, inconsistently constrains function 3, and does not constrain function 1. Function 1 is the primary function, which explains the strong, reproducible central tendency for people's initial answers to dissociate obligation and ability. The ambivalence, reversals, etc. you observed in your interviews occurred because participants shifted among the functions, based on a number of situational variables, including the ones you mentioned, which your method is incapable of isolating or controlling for.

When I read your interview transcripts, I find it *very* easy to conclude that MH is what is happening, in part due to familiarity with previous theoretical proposals and findings from tightly-controlled behavioral experiments.

[2/2]

Kyle Thompson said...

Hi John,

Thanks for the extended replies. I’ve been busy today and haven’t had the chance to fully address them until now. And, as you saw in my previous message, I could only start in replying to your first set of comments. Let me say at the outset that I think your objections and concerns are worthwhile and interesting, and I really hope we can continue this conversation here and hopefully at a conference or in published works in the future. My view of philosophy is that disagreement is healthy and productive.

You mentioned a lot of good stuff here, and I’ll try to get to the crux of matters (but do forgive me when I inevitably leave things out in my response):

John, you wrote: “I suspect terminological choices here are interfering with getting at the underlying issues.” Please don’t ignore the possibility that my density is getting in the way!

OK, but I think your broader point about my (perhaps misguided) choice to invoke “quantitative survey” is in response to my griping about who has to defend their methods and who does not. I’m perfectly fine abandoning the terms I’ve invoked—i.e., “quantitative” or “survey”—but I did choose them for a few reasons that I, at the time, found compelling (and still do). Most notably, when I was doing an extensive survey of methodological practices in x-phi, I needed a quick and easy way to describe the kind of method most of the studies I read were using and how they differed from what I wanted to try (e.g., think aloud method and interview). I did extensive research on the differences between definitions of “quantitative” and “qualitative” methods only to conclude that most, if not all, are imperfect. What counts as quantitative, I came to believe, depends as much on the method itself as how it was being employed, how its data were to be used, how it would be called upon in support of an argument, etc. And, I found similar issues with language around question-types. So, I opted for a family resemblance approach and chose language I’d seen used elsewhere as “quantitative surveys” to refer to those methods that, in general, present participants with a vignette—also a fuzzy term—and follow-up questions—also fuzzy—which gave participants a limited set of possible responses. This contrasts nicely with “qualitative methods.” I suppose I’d argue that the quantitative survey v. qualitative methods distinction has a real intuitive feel to it when referring to the x-phi literature landscape from 500 miles up: readers know what I mean, I think, before I get into the nitty gritty of this or that study. Maybe questionnaire is better? But it’s clunky. I don’t know, I thought these worked well, but I’ll reconsider after this exchange.

[1/5]

Kyle Thompson said...

It's worth mentioning, too, that one disadvantage of writing a 9,000-word publication about interview and think aloud data—which I call “qualitative”—is that you can’t chew up your word count on extended caveats because it really takes a lot of text to analyze long-form transcript excerpts. I know this might sound like a bad excuse, but I simply did not, in my publication, have the space to defend my use of “survey” terminology; heck, I barely had space to fit in a few transcript excerpts! Folks that use what I call quantitative methods can, in general in my estimation, present their findings in fewer characters because they are more “quantifiable” and condensed on the page.

In conjunction with the question of terminology, you wrote: “Accordingly, the reason you haven't encountered a paper at ‘pains to justify its use of survey methods’ is probably that they don't use survey methods.” On how you define surveys, this is absolutely true. I think the point I was making was that, for all the “ought implies can” x-phi literature, researchers don’t—as far as I can recall—defend their use of the method I was calling a survey: wherein a participant is presented with one or more vignettes along with one or more follow-up questions such that we, as researchers, conclude that participants’ responses to a given follow-up question are an accurate representation of participants’ judgments with regard to the vignette in just the way we the researchers intended. In other words, the literature is full of things similar to the following: “To see if participants found an agent to be obligated to do x, we asked them ‘Is agent obligated to do x’?” And the thing that isn’t defended time and time again is this (which I take to be the crux of the methodology): that a participant’s selection of “yes” or “7” is an accurate representation of that participant’s judgment regarding, in this case, obligation. In other words, I have no doubt that participants did in fact select “yes” or “7,” but I have significant doubts that their selection of “yes” or “7” translates to conclusions such as “participant 1 JUDGES that agent is obligated to do x.” Simply put, the data, empirically speaking, only reflect what a given participant selected on a questionnaire/survey, which might be different from what that participant’s judgment actually was. Or, in other words, a person’s judgment can be successfully or unsuccessfully captured/reflected/represented by a quantitative value such as “yes” or “7.”

[2/5]

Kyle Thompson said...

The move from “what participants selected” to “what they judge to be the case” is a HUGE move—and the one that gives the questionnaire method all its potency—and yet it is routinely not defended. Other researchers or other projects might seek to “check” or “validate” whether a questionnaire question (or set of questions) is getting an accurate picture of people’s judgments (which I take to be mediated by mental events and not by survey selections as such), but they only use different questionnaire questions (or highly constrained free-write spaces) to do so which have the same problem to address: how is it that we are entitled to treat a questionnaire selection as accurately representing/capturing a participant’s judgment? To invoke “controlled” studies, with varied conditions, is to assume that, for any given condition, we are accurately capturing people’s judgments when they select “yes” or “7.” It is that assumption that worries me, and it is that thing that seems unaddressed time and time again. (For a related example, the Knobe effect would be less interesting, I’d argue, if folks reported something like: “Well, no he didn’t intentionally harm the environment but I chose ‘yes’ that it was intentional because I wanted it to be clear that I didn’t like what he did. It wasn’t intentional, but the guy’s a jerk! That’s why I selected that response.”)

So that’s what I mean when I say that researchers using quantitative surveys don’t take pains to justify that their surveys are accurate in the “ought implies can” literature. What I see is an endemic acceptance that, more or less, a participant’s survey responses reflect their judgments. I was so skeptical that I ran my study and found that, in fact, people’s judgments—this time captured using qualitative methods that don’t require folks to funnel their judgments through determinate questionnaire questions—were quite different and oftentimes the opposite of what the survey questions “said” were their judgments. When the quantitative survey would have said a participant violated OIC, in my study, their utterances revealed a range of judgments that, while multifarious, consistently toed the line of OIC.

Now, as I write in the paper, the qualitative data didn’t show that we can’t trust any survey questions. I argue in the paper that the question about an agent’s blameworthiness and ability turned out to largely accurately reflect participants’ judgments. It was mostly the question about “ought” which was so often misrepresented by the quantitative data. And, that’s important because what someone means by “ought” is the crux of the “ought implies can” debate. And, once we have this doubt cast on “ought” questions, then we should, I’d argue, be worried about all of the x-phi literature that grapples with equally complicated concepts (which, I’d argue is a good portion of it because if the concepts were clean and neat then they probably wouldn’t be found at the center of centuries-long philosophical debates).

[3/5]

Kyle Thompson said...

What this tells me, then, is that there is a problem with the methodological assumption that participants’ questionnaire responses can be taken largely at face value as reflecting their judgments on an issue. And so my griping was that even when I present data showing the same participant selecting a survey response indicating one thing and then explaining that they judge the opposite to be true, I am more on the ropes in justifying my methodology than the researcher who employs questionnaires on the assumption that a person’s survey response is identical or near-identical with their mentally-mediated judgment. Or, to take another common objection to my work, even when I show that the quantitative survey method can lead us astray, I am told that the survey data is clean because of the successful use of multiple questions and control groups even though I have shown the problem is not with a lack of control conditions but with the very assumption that questionnaire questions reflect people’s thinking regarding highly ambiguous concepts such as “ought.” Or, in other words, I think quantitative questionnaires have issues, so we can’t solve the problem by employing more and different quantitative questionnaires.

It is, therefore, this part you wrote that I think marks where we differ in viewpoint (emphasis mine): “You refer to earlier work on OIC as ‘surveys.’ But it wasn't survey research. Instead it was experimental research in which the dependent measures were PARTICIPANT JUDGMENT or another performance.” On my view, the previous OIC studies treated “participant selections” as the measure but treated those selections AS judgments. But, as a matter of fact, selections aren’t judgments. What I argue—and it is just that, an argument, not a personal affront—is that, at least in the Chituc et al. case, these selections turned out not to be judgments (i.e., the judgments they thought they were). And so, if all the other OIC literature makes the same or a similar assumption regarding selections as judgments, then I think they ought to be agnostic about whether they have captured folk judgments relating to OIC.

So, I suppose the larger claims about a potentially big problem for x-phi have some fake barn energy to them: I argue—and I mean no disrespect in my arguments—that I’ve shown at least one barn in this town we are driving through to be a papier-mâché barn. And I think that means we experimental philosophers should all make sure that what we were sure were barns were in fact barns, and I’m suggesting that can’t be done using the same methods that caused us to conclude the barns were real in the first place. The rest of x-phi might just be all barns for miles, and only Chituc et al. had the fake barn in their sights. But without checking, we really can’t say we know, I’d argue.

[4/5]

Kyle Thompson said...

Now, I didn’t get to the other content you mentioned, but I don’t know that I’m invested, in my paper, in a theory of OIC. I’ll want to maybe chat with you in more detail about your interpretation of my data as supportive of your theory. My only “commitment,” of a kind, is that whatever OIC theory folks have, it seems to be one that plays along with OIC or, more accurately, doesn’t violate OIC. So, if your theory accounts for the nuances in my data, that would be amazing! All I was getting at is that people don’t violate the Kantian principle as it has been discussed inside and outside the x-phi literature.

[5/5]

John Turri said...

Hi Kyle,

I have just three points in response and plan to then leave it at that.

"To invoke “controlled” studies, with varied conditions, is to assume that, for any given condition, we are accurately capturing people’s judgments when they select “yes” or “7.” It is that assumption that worries me, and it is that thing that seems unaddressed time and time again."

No, it is not to assume that. Instead, that is a conclusion based on convergent evidence, which I keep trying to draw your attention to. This issue of interpretation is addressed time and time again in the literature. It is the main concern of many studies in this area (and others). It's really misleading and false to say otherwise, so I hope you'll stop doing that.

"It is, therefore, this part you wrote that I think marks where we differ in viewpoint (emphasis mine): “You refer to earlier work on OIC as ‘surveys.’ But it wasn't survey research. Instead it was experimental research in which the dependent measures were PARTICIPANT JUDGMENT or another performance.” On my view, the previous OIC studies treated “participant selections” as the measure but treated those selections AS judgments. But, as a matter of fact, selections aren’t judgments."

You put "participant judgment" in caps and ignored "or another performance". The second disjunct covers the view you express, so we're not differing on that.

"The move from “what participants selected” to “what they judge to be the case” is a HUGE move—and the one that gives the questionnaire method all its potency—and yet it is routinely not defended."

It is routinely defended in the literature, incrementally in particular studies and then globally based on convergent evidence. The move in question is not what gives controlled experiments their potency.

Kyle Thompson said...

Hi John,

I appreciate your replies. I really appreciate you taking the time to respond and provide good food for thought on this subject. Before responding, I want to reiterate that I find this discussion, and general methodological disagreements, fruitful and useful. Even if we disagree, that can be helpful to the x-phi movement and related conversations.

I'll add a bit to your response, and I welcome your reply. I genuinely think we have a different perspective on what counts as a defense of methods in the literature and what counts as evidence for the effective use of certain quantitative questionnaire methods.

I’m aware that in the literature, broadly speaking, philosophers and social scientists have worried on paper about every single possible step in the process of conducting research. Perhaps I am not conveying my meaning very clearly then, which I’m happy to accept places the fault with me: it seems to be, in the wide array of x-phi literature I’ve read, that one can carry out a study using questionnaires with determinate-style follow-up questions and get it published and treated as defining an accepted viewpoint without performing the kind of rigorous qualitative check on one’s questionnaire questions that I ran. And one can do that without spending much or any ink justifying that such questionnaire-style experiments do in fact accurately capture folks’ judgments. And other people can publish in response, using the same kinds of methods without any more defense of them. And to me, that is evidence of an assumption that this questionnaire-style (or survey-style) methodology is really capturing folks’ judgments.

And while it is true that, within the x-phi literature, folks come up with new questionnaires or change this or that word or phrasing in the questionnaires, they do not, to my knowledge, present participants with think aloud spaces or in-depth interviews that might help reveal whether their questionnaire-style experiments are capturing the judgments that they think they are capturing. And so, I refer to the business-as-usual studies in x-phi as relying on the assumption I pointed to, because the convergent lines of evidence you point to (or I think you are pointing to) do not get under the problem I am worried about. So, what you are counting as evidence is not sufficient on my view. I am not trying to be misleading, but I am really just that uncompelled unless researchers run these qualitative-style validation checks of a kind (or something similar).

I truly look forward to reading more of your work and continuing our conversations.

Best,

Kyle

John Turri said...

Hello Kyle,

"It seems to be, in the wide array of x-phi literature I’ve read, that one can carry out a study using questionnaires with determinate-style follow-up questions and get it published and treated as defining an accepted viewpoint without running performing the kind of rigorous qualitative check on one’s questionnaire questions that I ran."

I can't speak for others, and I'm not familiar with which things you've read, but I can say this: little if anything comes out of my lab that hasn't gone through considerable pre-testing on laypeople. The pre-testing phase can take weeks, months, sometimes even years. This often includes more extensive conversations with many more people than you report having conversed with, sometimes hundreds. It also includes extended engagement with students' written work (not just verbal), with feedback and epicycles, sometimes over multiple semesters.

I don't write all of that down and publish it, though, because I consider it part of the history of a reproducible project published for others' consideration, not its fruit.

Eric Schwitzgebel said...

John and Kyle: It's fascinating to see your detailed exchange! While I am not as skeptical as Kyle is about quantitative research in general, I find his detailed transcripts convincing that something pretty screwy and unstable is going on in the participants' responses to the "ought" question. And I share Kyle's view that we might find similar troubles in other parts of x-phi if we try serious qualitative methods. We can't, I think, really know how far the problem extends until we try methods like Kyle's more systematically. However, John, if most others are as careful in their questionnaire design as you describe, then there shouldn't be too much of a problem. In these days of long online supplementary materials, it could be wonderful and super helpful to see more detail on preliminary testing of this sort.

Quantitative and qualitative designs could be very convincing when put together. One model here might be Hurlburt and Heavey on reports of inner experience. Hurlburt did extensive qualitative study of the phenomena of interest before inviting Heavey into the collaboration to begin getting quantitative.

John Turri said...

Hi Eric,

Only in rare cases would I even consider investing effort into documenting pre-testing. It's enough to document the testing phase so that the research is reproducible. Any (seeming) insight gleaned in the pre-testing phase must earn its keep in the form of reproducible, interpretable results from controlled studies, at least in my case.

If we want to assess the hypothesis that people made an error (broadly speaking) in formulating their initial answer, the simplest and highly effective way to test that hypothesis is to make a competent prediction: If people tend to make error E in condition C, then they will do X in condition C*. Then test that. And if there is a worry that people made error E* in condition C*, then make a competent prediction to test that. None of this takes us outside of what controlled experiments typically accomplish. It is a standard feature of work on this topic (and others).

Accordingly, I disagree when you say that we can't know "how far the problem extends until we try methods like Kyle's more systematically." If a concern has merit, then it's testable. In fact, several concerns mentioned above and in the paper (e.g. time-indexing) have been extensively addressed and, I think, ruled out in previous research. And several theories, which predict and can explain the instability seen in the transcripts, have been studied in the OIC literature and, I believe, passed some pretty stringent tests.

Attributions of inability constrain attributions "in the neighborhood" of moral obligation, including blame and legal obligation. However, in very tightly matched comparisons, it turns out that attributions of inability do not strongly constrain, and in some cases come completely apart from, attributions of moral obligation. If OIC is part of commonsense morality, why do people's responses readily align with OIC for legal obligations, mostly align with "Blame Implies Can," but strongly misalign with OIC for moral obligations? Why does it take extra work to make moral-OIC show up?

People even draw simple pictures which show a moral obligation outlasting the agent's ability to fulfill it. By contrast, those same pictures show that its being worthwhile to encourage the agent to fulfill his obligation does not outlast his ability to fulfill it.

I submit that these "nearby" findings should closely inform serious discussion of whether, and if so how, people's initial judgments are confused, inscrutable, or otherwise deficient.

Arnold said...

Is this the lure of young minds to metaphysics...
...via semantic presences...

John Turri said...

Setting aside all the other disputes aired on this thread, a question posed in all earnestness:

If OIC is part of commonsense morality, why do people's responses readily align with OIC for legal obligations but strongly misalign with OIC for moral obligations?

Does anyone have a viable hypothesis for how switching from "leg" to "mor" in "---ally obligated" is enough to induce disarray in people's ability to respond to simple questions?

Arnold said...

John Turri, you wrote, "Why does it take extra work to make moral-OIC show up?"...
...Wonderful, we all are obliged-studying feeling and emotion, fear and negative emotions today...

Thank you for your work...