Wednesday, April 07, 2021

On Scientific Trust, Loose Summaries, and Henrich's WEIRDest People in the World

Joseph Henrich's ambitious tome, The WEIRDest People in the World, is driving me nuts.  It's good enough and interesting enough that I want to read it.  Henrich's general idea is that people in Western, Educated, Industrial, Rich, Democratic (WEIRD) societies differ psychologically from people in more traditionally structured societies, and that the family policies of the Catholic Church in medieval Europe lie at the historical root of this difference.  It's very cool and I'm almost convinced!

Despite my fascination with his argument, I find that when Henrich touches on topics I know something about, he tends to distort and simplify things.  Maybe this is inevitable in a book of such sweeping scope.  However, it does lead me to mistrust his judgment and wonder how accurate his presentation is on topics where I have no expertise.

Early in reading, I was struck by Henrich's presentation of the famous / notorious "marshmallow test".  Here's his description:

To measure self-control in children, researchers sit them in front of a single marshmallow and explain that if they wait until the experimenter returns to the room, they can have two marshmallows instead of just the one.  The experimenter departs and then secretly watches to see how long it takes for the kid to cave and eat the marshmallow.  Some kids eat the lone marshmallow right away.  A few wait 15 or more minutes until the experimenter gives up and returns with the second marshmallow.  The remainder of the children cave somewhere in between.  A child's self-control is measured by the number of seconds they wait.

Psychological tasks like these are often powerful predictors of real-life behavior (p. 40).

It's a cute test!  However, I have a graduate student who is currently writing a dissertation chapter on problems with this test.  Maybe the test is a measure of self-control, but it could also be a measure of how much the child trusts the experimenter to actually deliver on the promise, or how much the child desires the social approval of the experimenter, or how comfortable the child is with strange laboratory experiments of this sort, or how hungry they are, how much they want to end the situation so as to reunite with their waiting parent, etc.  Indeed, a recent conceptual replication of the experiment mostly does not find the types of predictive value that were claimed in early studies, after statistical controls are introduced to account for race, gender, home background, parents' education, vocabulary, and other possible covariates.[1]

In general, if you've been influenced, as I have, by the "replication crisis" and other recent methodological critiques of social science and medicine, this might be the kind of result that should set off your skeptical tinglers.  The idea that how long a four-year-old waits before eating a marshmallow reveals how much self-control they have, which then "powerfully predicts" real-life behavior outside of the laboratory (e.g., college admission test scores over a decade later, as is sometimes claimed) -- well, it could be true.  I'm not saying it's not.  But I don't think I'd have written it up as Henrich does, without skeptical caveats, as though there's general consensus among psychologists that a child's behavior with a single marshmallow in this peculiar laboratory situation is a valid, powerful measure of self-control with excellent predictive value.  Its prominent placement near the beginning of the book furthermore suggests that Henrich regards this test as part of the general theoretical foundation on which psychological work like his appropriately builds.

In this matter, my knowledgeable judgment and Henrich's differ.  That's fine.  Researchers can differently weigh the considerations.  But if I hadn't had the background knowledge I did, his quick presentation might have led me into a much more optimistic assessment of the value of the marshmallow test than I would have arrived at from a more thorough presentation that acknowledged the caveats.  So there's a sense in which Henrich's presentation is a bad fit for my theoretical inclinations.

Here's another passage that bothered me:

Upon entering the economics laboratory, you are greeted by a friendly student assistant who takes you to a private cubicle.  There, via a computer terminal, you are given $20 and placed into a group with three strangers.  Then, all four of you are given an opportunity to contribute any portion of your endowment -- from nothing at all to $20 -- to a "group project."  After everyone has had an opportunity to contribute, all contributions to the group project are increased by 50 percent and then divided equally among all four group members.  Since players get to keep any money that they don't contribute to the group project, it's obvious that players always make the most money if they give nothing to the project.  But, since any money contributed to the project increases ($20 becomes $30), the group as a whole makes more money when people contribute more of their endowment.  Your group will repeat the interaction for 10 rounds, and you'll receive all of your earnings in cash at the end.  Each round, you'll see the anonymous contributions made by others and your own total income.  If you were a player in this game, how much would you contribute in the first round with this group of strangers?

This is the Public Goods Game (PGG).  It's an experiment designed to capture the basic economic trade-offs faced by individuals when they decide to act in the interest of their broader communities....  societies with more intensive kin-based institutions contribute less on average to the group project in the first round (p. 210-211).

This describes a study in which participants will receive roughly $200-$300 each: ten rounds at about $20-$30 per round.  Of course, it's rare to award research participants such large amounts of money.  If you want, say, 200 participants, you'll need a $60,000 budget!  Henrich's endnotes cite two general books, one brief commentary without empirical data, two classic articles in which participants exited the experiment having earned about $30 each on average, and two cross-cultural studies whose payout amounts weren't readily discoverable by me from looking at the materials.  Also in the notes, Henrich says that one study "increased contributions to the group project by 40 percent, not 50 percent.  I'm simplifying" (p. 543).  However, the majority of the cited studies in fact used 40 percent increases, not just the one study to which this caveat was attached.
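
To make the arithmetic concrete, here is a minimal Python sketch of one round of the game as Henrich describes it: a $20 endowment, four players, contributions increased by 50 percent and split equally.  The function name pgg_payoffs and its parameters are illustrative assumptions, not anything taken from Henrich or the cited studies, and the numbers are his stylized ones rather than the actual stakes in any experiment.

```python
def pgg_payoffs(contributions, endowment=20.0, multiplier=1.5):
    """Per-player payoff for one round of a linear public goods game.

    Hypothetical helper: each player keeps whatever they did not
    contribute, plus an equal share of the multiplied group pot.
    """
    pot = sum(contributions) * multiplier
    share = pot / len(contributions)
    return [endowment - c + share for c in contributions]


ROUNDS = 10  # Henrich's description: the interaction repeats for 10 rounds

nobody = pgg_payoffs([0, 0, 0, 0])          # no one contributes
everybody = pgg_payoffs([20, 20, 20, 20])   # everyone contributes in full
mixed = pgg_payoffs([0, 20, 20, 20])        # one free rider, three full contributors

# Ten-round totals when all players behave alike
total_if_nobody = nobody[0] * ROUNDS        # 200.0
total_if_everybody = everybody[0] * ROUNDS  # 300.0

# Budget for 200 participants at the top of that range
budget = 200 * total_if_everybody           # 60000.0

# What a contributor personally gets back per dollar contributed
return_per_dollar_50 = 1.5 / 4              # 50 percent increase
return_per_dollar_40 = 1.4 / 4              # 40 percent increase

print(total_if_nobody, total_if_everybody, budget)
print(mixed)
```

Under these stylized numbers, a player earns $200 over ten rounds if nobody ever contributes and $300 if everybody always contributes in full, which is where the $200-$300 range comes from.  A lone free rider among three full contributors would earn more still ($42.50 per round, versus $22.50 for each contributor).  Each dollar contributed returns only $0.375 to the contributor at a 50 percent increase (about $0.35 at 40 percent), which is why giving nothing maximizes individual income either way.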

I'm not seeing why 50% is "simpler" than the more accurate 40%.  This seems to be a gratuitous inaccuracy.  Characterizing the experiment as ten rounds with payoffs of $20-$30 per round is potentially a more serious distortion.  Really, these experiments are run with units that are later exchanged for small amounts of real money.  This is important for at least two reasons: First, these experimental monetary units might be psychologically different from real money, possibly encouraging a more game-like attitude.  And second, when the actual amounts of money at stake are small, the costs of cooperating (and also the benefits) are less, which should amplify concerns about how representative this game-like laboratory behavior is of how the participants would behave in the real world, with more serious stakes.

Suppose that instead of exaggerating the stakes upward by a factor of about 10, Henrich had exaggerated the stakes down by a factor of about 10.  What if, instead of saying that there was $20-$30 at stake per turn, when it's typically more like $2-$3, he had said that $0.20 was at stake per turn?  I suspect this would make an intuitive difference to most ordinary readers of the book.  The leap from "here's how cooperatively research subjects act with $20" to "here's how cooperative people in that culture are with strangers in general" is more attractive than the leap from "here's how cooperatively research subjects act with $0.20" to the same broad conclusion.

In general, I tend to be wary of quick inferences from laboratory behavior to real-world behavior outside the laboratory.  Laboratories are strange social situations and differently familiar to people from different backgrounds.  This is the problem of ecological validity or external validity, and concerns of this sort are why most of my own research on social behavior uses real-world measures.  Other researchers, such as Henrich, might not be as worried about the external validity of laboratory/internet studies.  There's room for legitimate debate.  But in order for us readers to get a sense of whether external validity might be an issue in the studies he cites, at the very least we need an accurate description of what the studies involve.  Henrich's presentation does not provide that, and simplification is a poor motive for this distortion, since $2 is no less simple than $20.

Henrich does not, in my mind, cross over into bald misrepresentation.  He doesn't, for example, say of any particular study that it involves $20 per round.  Rather, the presentation seems to be loose.  He's trying to give the general flavor.  He's writing for a moderately broad audience and aiming to synthesize a huge range of work, unavoidably simplifying and idealizing along the way.  He could respond to my concerns by saying that his best judgment of the conflicting evidence about the marshmallow test is that it's a valid and highly predictive measure of self-control and that his simplified presentation of the material conveys that effectively by avoiding concerns and apparent replication failures that would just (in his judgment) be distracting.  He could say that his best reading of the literature on external validity is that the difference between $2 and $20 doesn't matter and that the quick leap to general conclusions about cooperativeness is justified because we can reasonably expect laboratory studies of this sort to be diagnostic.  He could say that the reader ought to trust that he's done his homework behind the scenes.

We must always trust, to some extent, the scientists we're reading -- that they are reporting their data correctly, that there aren't big problems with the study's execution that they're failing to reveal, and so on.  Part of this involves relying on their inevitably simplified summaries of material with which we are unfamiliar.  We trust the researcher to have digested the material well and fairly, and not to be hiding worries that might legitimately undermine the central claims.  The looser the presentation, the more trust is required.  

This invites the question of whether there are conditions under which more versus less trust is justified.  How much, as a reader, ought you be willing to glide through on trust?

I'd recommend reducing trust under the following three conditions:

(1.) The author has a prior agenda or a big picture theory that might motivate them to interpret and digest results in a biased way.  Most scientists have agendas and theories, of course, and certainly Henrich does.  But there is more and less agenda-driven work, and adversarial collaboration offers the opportunity for bias to be balanced through scientists' opposing agendas.

(2.) The author is not as skeptical as you the reader are about some of the relevant types of research.  If the author is less skeptical than you are, they might be interpreting that research more naively or more at face value than you would if you had read the same research.

(3.) Where the author makes contact with the issues you know best, they seem to be distorting, misinterpreting, or leaping too quickly to broad conclusions.  This might indicate a general bias and sloppiness that might be present but harder for you to see regarding issues about which you know less.

On all three grounds, my trust of Henrich is impaired.


Update, April 30: See my continuing thoughts about the book here.  See also Henrich's reply to my post here.


[1] Deep in an endnote, Henrich acknowledges this last concern.  He responds that "it's easy to weaken the relationship between measures of patience and later academic performance by statistically removing all the factors that create variation in patience in the first place" (p. 515).  It's a reasonable, though disputable point.  Regardless, few readers are likely to pick up on something buried in the back half of one among hundreds of endnotes.


Daniel Harris said...

"Despite my fascination with his argument, I find that when Henrich touches on topics I know something about, he tends to distort and simplify things. Maybe this is inevitable in a book of such sweeping scope. However, it does lead me to mistrust his judgment and wonder how accurate his presentation is on topics where I have no expertise."

This has been exactly my experience, both with this and his last book as well. When I read "The Secret of our Success," I was loving it until I got to the language chapter, which made me cringe for reasons that echo what you say here, and then I started worrying about what I had read earlier on as well.

Arnold said...

Eric Schwitzgebel vs Joseph Henrich or Joseph Henrich vs Eric Schwitzgebel...

Wikipedia academically likens them as similarly evolving guys...

But one has more opinions than the other...

To your introduction, NY Times has interesting story about 12th-16th century Islamic questioning influencing Western thought, Quakerism and Catholicism...

Anonymous said...

"Other researchers, such as Henrich, might not be as worried about the external validity of laboratory/internet studies."

I think implying that Henrich is not concerned with external validity and that he is the type of researcher to too quickly accept contrived internet studies is uncalled for and unfounded. Henrich has conducted field research across the world with extremely hard to get populations, has pioneered the merging of different social sciences and their methods, and has recently been promoting the use of big data to capture more real world behaviors at scale. He reviews many such studies throughout the book, and I think it's pretty clear that he is not a standard Mturk and lab researcher.

Put differently, I don't think anyone with any familiarity of Henrich's past work would suggest these things. If anything, he's probably known as being abnormally concerned about external validity and the need for real world data compared to other social scientists.

Adam said...

Hi Eric, interesting post. A few thoughts:
1) ‘Gell-Mann amnesia’ is a useful term gaining some currency online; it refers to the phenomenon of noticing something is misleading on a topic you’re familiar with, but then failing to generalise your assessment of that source's accuracy to other areas you are less familiar with. Evidently you’re already trying to correct for it!
2) I wonder if you should be sceptical of your scepticism. It seems to me that when assessing someone's reliability, the question shouldn’t just be (1) ‘am I able to identify some potential problems with their presentation of these studies?’ but rather (2) ‘Does their characterisation of the studies match how they are generally interpreted by the relevant experts?’ The former risks blowing out into footnotes upon footnotes and is too large an opportunity to rationalise rejecting studies that have results we each don’t like – e.g. I’m able to identify some concerns about the contrary studies you raised challenging the traditional interpretation of the marshmallow test, (gathering data and only later deciding to ignore only children of mothers who completed college, and the ceiling effects, seem particularly to be a cause for concern) and some other interpretations have been published. Given that education seems to increase degree of polarisation on particular ideological matters, plausibly because more educated people have more capacity to generate plausible ways of discounting evidence that doesn’t fit their world view, I’m inclined to think we highly educated people should have a very strong prior towards 2, and given Henrich was a professor of psychology, the concerns you raise here don’t yet make me too worried (disclaimer: haven’t read the book).
3) Isn’t controlling for parent’s education, background, vocabulary etc. when trying to measure the effect of self-control on future outcomes a bit like trying to measure the effect of IQ on outcomes after controlling for years of education and income? I.e. we risk controlling for the very thing we’re trying to measure, given parents with a certain education are likely to enable their kids to develop more self control.

Eric Schwitzgebel said...

Thanks for the comments, folks!

Anon 09:32: External validity regarding transport to non-WEIRD populations, yes of course! It’s what he’s perhaps most famous for (before this book). But that’s very different from ecological validity and external validity with respect to transfer from lab/internet to real life. The latter is my critique.

Anon 09:53: Thanks for all those points. And yes, fair enough on being skeptical about my own skepticism!

David Potts said...

"Its prominent placement near the beginning of the book furthermore suggests that Henrich regards this test as part of the general theoretical foundation on which psychological work like his appropriately builds."

I don't think Henrich looks at the relation between theory and data this way at all. See this interesting article in which he argues that part of the problem with research in psychology and other fields is a lack of guidance by theory. Part of the reason we fell into the replication mess is that without a theoretical frame to help discern which research findings should be expected, which should not and thus require extra scrutiny, where to look for predicted but previously unrecognized phenomena, and so forth, we are left with only intuition, folk theory, and p-values to guide us. This has not worked out very well.

I'm not saying this answers your concerns about his treatment of the marshmallow test, btw. But I think this part of his perspective should be considered. I am impressed with what he has accomplished, and I think this is part of the reason.

Enoch Lambert said...

Perhaps the very effort to cover so many topics, data, methods, etc. is a further reason to lessen trust. The probability that a given person has been able to consider each with requisite care and scrutiny is not great. Especially if the standard is the level at which one has scrutinized that which one knows best!

chinaphil said...

I remember a few years ago you wrote about having multiple arguments pointing towards a conclusion vs. having a single chain of logical inference (however tightly argued) leading to a conclusion. There you were supportive of the multiple argument. I wonder if you see Henrich as following a multiple argument strategy here? Because to the extent that he is, we might be able to forgive some of the lack of tightness in each strand of the argument, and accept that the book as a whole constitutes a reason to see the conclusion as an interesting thing to think about - with the caveat that some of these arguments must be subject to more rigorous examination and testing later on?

Howie said...

Hi Eric

You are questioning how variables interact and whether they can be controlled; Henrich is arguing that cultural variables aren't being controlled.
Two larger issues for me are: first, how do we know psychological variables? Physicists have their way of seeing: they discard Aristotelian realism; second, Henrich is supporting his claims I guess (I have the book and have browsed it) by theoretical arguments. So how does psychological theory interact with experiments? Many psychologists do both, but can we argue on purely theoretical grounds?

Arnold said...

To understand the 'My trust is impaired stance?'; I searched everywhere and finally Googled...


It seems to speak to "what to trust", oneself or others...

Thanks for extraordinary reads...

Eric Schwitzgebel said...

Thanks for the continuing comments, folks!

David Potts: Interesting perspective, and thanks for the link. I can see that angle. The flip side is that it could easily become "I'll believe even shoddy research when it fits with my favorite theory, while research that doesn't fit will have to pass a high bar." To some extent, that's a perfectly reasonable attitude. But I'm not sure that the problem with psychology is that psychologists don't already have *enough* of that attitude and need to be encouraged even farther in that direction.

Enoch / chinaphil: These are maybe complementary angles on the same issue. On the one hand, spreading oneself thin (of which I am also guilty) diminishes one's credibility and expertise in any one area; on the other hand, a wide view that finds converging evidence can be compelling. It's probably good that we have a diversity of researchers who differ in the breadth of their work, so that each can correct the other.

Howie: A super complicated question.

Arnold: Or the Bible!

John said...

I'm curious if you felt that way when you were reading Steven Pinker's Better Angels of our Nature or any of his other work.

Eric Schwitzgebel said...

I found Better Angels a little more solid in the parts I knew about, *except* for his treatment of traditional, non-agrarian cultures. But I'm open to correction, and I did criticize that book on other grounds here:

Matt said...

On the marshmallow test, another often unconsidered part is that, sometimes even kids only want one marshmallow. (I always thought they were sort of okay, even as a kid. One might be fine, but doing any work at all for two? Why bother, if I don't really like them that much?) Maybe that only fits a small portion of the respondents - but if it's enough to throw things off a bit, it could still matter.

Howard B said...

Correct me if I'm wrong, but your differences sound similar to that book about Hobbes and the air pump, highlighting Hobbes's reservations with experiments of his day. Henrich is an anthropologist: his case relies on an uncanny feeling or intuition, call it a spidey sense if you will.