Review: Faking Science (Diederik Stapel)

Jan 12, 2015

https://www.goodreads.com/book/show/24298516-faking-science

FakingScience-20141214 <-- PDF

In general, this book was a fun and quick read. It gets somewhat repetitive in its descriptions of how bad he feels about his actions and how he was mistreated by some, stalked by the media, etc. It is worth reading if you care about psychology as a field. Surely there are more Stapels out there. The best ways to weed them out and keep them away are: 1) replication studies, both by the same and by independent authors, 2) open data, so fraudulent data can more easily be spotted, 3) more meta-analyses of published studies that examine publication bias and estimate effect sizes given the bias (currently not a widespread practice), 4) more statistics! :)

I preferred to drink wine, and was inclined to totally embrace social psychology, but I did drink beer from time to time, and deep inside I felt that social psychology had a pretty optimistic view of the degree to which human behavior could be influenced. It seemed to suggest that you could get people to behave in the way you wanted just by engineering the right environment. Hadn’t social psychologists learned anything from all the failed attempts of the past to build the ideal society? Didn’t they know about all the failed hippy communities, communes, sects, and kibbutzim? If those had taught us one thing, it’s that you can’t construct a society from scratch. Thinking about it, the idea that “you are your environment” was ridiculous. In any given situation—a school, a church, a neighborhood—you’ll always find lots of different people, behaving in lots of different ways. So where did those differences come from? How could social psychology explain them, if not by reference to personality factors? Somewhere, deep inside each of us, doesn’t there have to be some core element that accounts for these differences? In the end, aren’t we all just, simply, our… selves?
I thought I might find an answer to these questions during the classes on personality studies. In the social psychology classes that I’d taken, a series of teachers, all social psychologists themselves, had enthusiastically promoted the “situationist” viewpoint, explaining how behavior is principally affected by environmental factors, and so I was expecting the same kind of thing in the personality classes, with personality psychologists mounting a vigorous defense of the idea that personality factors were the main determinants of behavior. So I was rather shocked when this didn’t happen.
During the personality studies classes in Amsterdam, we were introduced, somewhat ironically, by our lecturer (usually dressed in a flowing coat and with a carefully combed curl in his blond hair) to Benjamin Kouwer’s book The Game of Personality. In this book, Kouwer puts forward the proposition that there is no such thing as personality. What we call personality is, according to Kouwer, an empty and meaningless concept, that we have been attempting to catch in our hands with any number of useless theories, because things would be so much nicer and more convenient if it actually existed. This was the first and only time that I heard a teacher spend an entire series of classes defending the position that what he was supposed to be teaching us about didn’t exist. It was a refreshing way of putting things into perspective, which of course raised the question of how valid the concepts that were studied by all the other types of psychology were too.
Kouwer wrote his masterwork about personality theories in 1963. The book, which opened with the ominous sentence “Man doubts nothing so much as himself and his fellow men,” is written in a hilariously sarcastic tone; Kouwer’s fellow psychologist, Barendregt, described it on its publication as a “critical examination of the chaos that reigns over everything that can be described by the term ‘personality’”. Kouwer describes, step by step, from the ancient Greeks through to modern American experimental psychologists, everything that has been written, claimed, and theorized about personality. He looks at the insights of psychologists, physicians, philosophers, sociologists, astrologers, and other types of characterologists. Theophrastus, Schopenhauer, Skinner, Freud, Jung, Sartre: nobody is safe from Kouwer’s pithy analysis. His book describes and critiques the dozens (perhaps hundreds) of theories about “the true self” that people have cobbled together over the last few thousand years, and concludes, in a withering final chapter, that The True Self simply doesn’t exist. There is no core personality, no kernel of human existence. If you have a personality test that measures “honesty”, it turns out not to be very useful for predicting whether someone will cheat at cards, crib in an exam, or invent fake data for a research project. Nobody is always honest and does the right thing; nobody is always dishonest and does the wrong thing. Nobody behaves the same way all the time.

So Stapel read a wildly outdated book and decided that was it for personality research? Never mind all the empirical studies finding personality traits that are stable over time, also in animals? Never mind the plethora of intelligence data? These data were already sufficient in the 1960s. Stapel must be pretty stupid or wildly naive.

As I dug deeper into social psychology, I found myself increasingly convinced by the idea that people’s feelings, perceptions, and behavior are principally influenced by the present situation. To use the jargon, I became a situationist, as opposed to a personologist. Apart from the idea that behavior is largely driven by situational factors, another thing that social psychology taught me is that it’s human nature to greatly underestimate the influence of the situation and overstate the effect of personality traits. Even though it’s often very clear from their behavior that people are always adapting themselves to different situations, contexts, and environments, we still tend to feel that behavior is actually a function of someone’s personality or motivation. Why? Because it gives us a feeling of control and certainty, of understanding. John does something. Why? Because he’s John.

Stapel apparently does not notice the contradiction between situationism and an appeal to human nature. Are some traits genetically determined or influenced, or not? Earlier, he talks about how his account makes sense in an evolutionary view:

This account fits well with an evolutionary perspective. In order to survive in constantly changing circumstances, people have to keep adapting. In fact survival is, more or less, the same as adaptation. People play different roles; as they do so, they change, and grow, and their personality takes on new facets. Although the physical body that seems to contain the personality may remain recognizable, the personality itself cannot be fully grasped, because it’s always changing with the context. It’s like the ship of Theseus: when all the wooden parts of the ship have been replaced over time, is it still the same ship? Nothing is left of the original ship that Theseus first sailed out of the harbor, but it’s still his ship.

Now, Mr. Stapel, how do things evolve? Through genetics, surely. Which means that traits both 1) are heritable to some degree and 2) vary between people. It cannot all be situation, or there would be nothing for selection to work with. Evolution requires genetically caused diversity in traits to work.

Why do people use stereotypes, no matter how inappropriate, to explain and judge how others behave? Because we’re lazy, and social categories (gays, Mexicans, Muslims) and the stereotypes that go along with them (effeminate, lazy, terrorists) help to make the world seem simpler. We like our shortcuts. We know that “chair” goes with “sit”, so it’s nice to be able to put together other pairings like “woman/emotional”, “man/competitive”, “child/innocent”, “soldier/tough”, or “professor/smart”, even if they’re not always (if at all) correct.

Stereotypes are not inappropriate. Stereotypes are just beliefs about group statistics, typically means. No one really believes that there is no within-group variance, i.e. no one treats groups as categorical. I can find no evidence of the often lamented black-and-white thinking about groups. For much more about stereotypes, see: Lee Jussim's excellent book.

Scientific research is a bit like solving a jigsaw puzzle. You think up how the puzzle should look, find the pieces, identify the holes, cut new pieces, and see if everything fits. At the level of millimeters all this fine detail can sometimes feel a bit pointless, but the aim is to have each larger section of the puzzle, an inch or a few inches across, look good. When you’ve finished all the cutting and trimming and fitting, when you take a step back and look at the whole picture, you don’t see which pieces are original and which you had to improvise. All you see is a single, coherent picture. It’s a nice feeling to see something that you imagined come to life before you. It’s great to come up with an idea, develop it theoretically, and validate it empirically. It’s absolutely fantastic when you discover—perhaps after a bit of trial and error—that your idea works.
But sometimes it doesn’t work. Sometimes an experiment goes wrong. Sometimes—in fact, pretty often—the results don’t come out the way you hoped. Sometimes reality just doesn’t want to go along with your theoretical analysis, no matter how logical and carefully formulated that might be. It’s frustrating, but that’s how it is. Sometimes you’re just wrong; you have to go back to the drawing board and try again, a bit harder this time. But if you don’t find what you were expecting, while everybody else seems to have no trouble, it’s even more frustrating. Let’s say the literature is full of discussions about effect X—for example, if you have people read a text in which the word “friendly” occurs a few times, their opinions of others become more positive, but if you replace “friendly” with the names of individual people who are perceived to be friendly, like “Gandhi” or “Mandela”, their opinions of others become more negative. You’d like to show this “X effect” yourself. So you read the literature on X very carefully, you do exactly what the “Methods” section of the article says you should do, and you get… Y. Not X. Oh. Now what? Well, let’s run it again. Nope, still Y. Now what? Back to the literature, read it again twice, check everything, change the materials a bit, run it again. Y. Now what?
I was doing something wrong. Clearly, there was something in the recipe for the X effect that I was missing. But what? I decided to ask the experts, the people who’d found the X effect and published lots of articles about it. Maybe they could send me their materials? I wrote some letters. To my surprise, in most cases I received a prompt and comprehensive reply. My colleagues from around the world sent me piles of instructions, questionnaires, papers, and software. Now I saw what was going on. In most of the packages there was a letter, or sometimes a yellow Post-It note stuck to the bundle of documents, with extra instructions: “Don’t do this test on a computer. We tried that and it doesn’t work. It only works if you use pencil-and-paper forms.” “This experiment only works if you use ‘friendly’ or ‘nice’. It doesn’t work with ‘cool’ or ‘pleasant’ or ‘fine’. I don’t know why.” “After they’ve read the newspaper article, give the participants something else to do for three minutes. No more, no less. Three minutes, otherwise it doesn’t work.” “This questionnaire only works if you administer it to groups of three to five people. No more than that.” I certainly hadn’t encountered these kinds of instructions and warnings in the articles and research reports that I’d been reading. This advice was informal, almost under-the-counter, but it seemed to be a necessary part of developing a successful experiment. Had all the effect X researchers deliberately omitted this sort of detail when they wrote up their work for publication? I don’t know. Perhaps they did; or perhaps they were just following the rules of the trade: You can’t bore your readers with every single detail of the methodology you used. That would be ridiculous. Perhaps

And the answer just screams at anyone who has read his Schmidt and Hunter: these results are all based on small studies with large amounts of publication bias. The negative results don't get published. These findings are not reliable, and the true effect sizes, if there are any effects at all, are not very large. These researchers do not understand validity generalization. His chapter 4 is a long story of how not to do science.
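
To make the publication bias point concrete, here is a minimal simulation sketch. Everything in it is my own assumption for illustration, not something from the book: a small true effect (d = 0.1), 20 participants per group, and a "journal" that only prints studies with a significant result in the predicted direction. The function names are made up.

```python
import random
import statistics


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return (observed Cohen's d, t statistic)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    # Pooled SD for equal group sizes.
    pooled_sd = ((statistics.variance(control) + statistics.variance(treated)) / 2) ** 0.5
    d = diff / pooled_sd
    # For equal n, t = d * sqrt(n / 2).
    t = d * (n_per_group / 2) ** 0.5
    return d, t


def run_literature(true_d=0.1, n_per_group=20, n_studies=5000, t_crit=2.02, seed=1):
    """Compare the mean effect across all studies with the mean across 'published' ones.

    Assumption: a study is published only if t > t_crit (about p < .05,
    two-tailed, for 38 degrees of freedom) in the predicted direction.
    """
    rng = random.Random(seed)
    results = [simulate_study(true_d, n_per_group, rng) for _ in range(n_studies)]
    all_d = [d for d, _ in results]
    published_d = [d for d, t in results if t > t_crit]
    return (
        statistics.fmean(all_d),
        statistics.fmean(published_d),
        len(published_d) / n_studies,
    )


if __name__ == "__main__":
    mean_all, mean_published, share = run_literature()
    print(f"mean d, all studies:       {mean_all:.2f}")
    print(f"mean d, published studies: {mean_published:.2f}")
    print(f"share of studies published: {share:.1%}")
```

Under these assumptions the average across all studies stays close to the true effect, while the average across the "published" studies has to be several times larger, simply because with n = 20 only large observed effects can reach significance. A reader who sees only the published studies gets a badly inflated picture.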

A few weeks ago I was in the paper—in fact, I was in all the papers. I had published a study which showed that messy streets lead to greater intolerance. In a messy environment, people are more likely to resort to stereotypes of others because trash makes you want to clear it up, and the use of stereotypes lets you feel like you’re clearing things up. Stereotypes bring clarity to a messy world. Women are emotional, men are aggressive, New Yorkers are in a hurry, Southerners are hospitable. Stereotypes make the world predictable, and we like that, especially if the world currently looks dirty and unkempt.
The publication of this study caused a sensation. It was published in the most prestigious journal of them all, Science, and it made headlines around the world. The idea that physical disorder activates the need for mental order and so leads to stereotyping and intolerance was innovative and exciting. It might explain why there’s more interpersonal conflict in run-down neighborhoods and it suggested an elegant way to combat racism and other forms of discrimination: clear up the mess, throw out the trash.
The coolest aspect of the study was the way in which it combined careful laboratory research with field studies that people could relate to. In the lab we had students sit in front of a computer and look at photographs, words, and symbols that depicted greater or lesser degrees of disorder, before asking them to fill in some questionnaires. In the field, we stood in clean or dirty railway stations, or on messy or neat street corners, and interviewed unsuspecting passers-by about their opinions on immigrants, gay people, foreigners, men, and women. The idea was very simple, the lab work was impressive, and the field studies were models of cunning design.
What made this research especially attractive was that it followed logically from decades of research into stereotyping. Every social psychologist knows that the need for structure is one of the driving factors behind the human tendency to stereotype and discriminate against others. There were already dozens of studies showing that the need for structure (“I want certainty”) is directly coupled to the use of stereotypes. The more structure you need, the more likely you are to judge people based on preconceptions. So it was only logical that this would still hold if the need for mental order was caused by physical disorder. Anyone could have thought of that. Maybe. But I was the one who had actually come up with the idea. In fact, I hadn’t just come up with the idea; I’d come up with all the data myself. It was a clever, simple, logical, and obvious idea, but the empirical tests were completely imaginary. The lab research hadn’t been carried out. The field studies never happened.

How much money did this fraud waste? How many ghetto programs were initiated on the idea that if we just clean up the environment, the people will turn into productive citizens? I know there are lots of programs like these in Denmark. No one knows, because they are local projects, so no one compiles a list of all the costs and which effects, if any, they had.

How do these people think that these environments got bad to begin with? Who breaks the things in ghettos?

I got the idea that physical disorder might create the need for structure when, by chance, I came across “broken windows theory” in the literature. This theory argues that there is a relationship between the bad state of repair of housing in poor neighborhoods and the other social problems to be found there. Because I knew that the need for structure is one of the main motivations for people to stereotype each other as undesirables, I immediately saw the connection. It was only logical that neighborhoods with lots of broken windows, liquor stores, empty homes and dilapidated buildings would have social problems. All that urban decay pushes people to use stereotypes and other forms of prejudice to “clean things up” in their heads, thus restoring some structure. It was a brilliant idea.
I asked one of my students, a good photographer who I knew lived in a disheveled squat, to take some pictures of houses with and without broken windows, neat streets and disorderly ones, walls with and without graffiti, and any other contrast between trash and non-trash she could think of. A few days later she brought me dozens of photos. I selected a few of them and devised a questionnaire. We showed people a set of photos—showing either orderly or disorderly scenes—and then had them answer questions about different social groupings.
There was a measurable “chaos leads to stereotyping” effect—the people who had seen the disorderly photos gave more prejudiced answers—but it wasn’t very strong.
I decided to try another approach. Instead of a succession of photos, I made a collage with a house, a tree, a car, and some people where everything looked normal, and another one where everything was out of place. That worked as well, but the effect was still very small. It worked for some stereotypes, but not for all of them.
I tried again with photos of walls and houses, but this time there was no difference at all between the groups. I had lost the effect, and I couldn’t bring it back. Such a beautiful, obvious, logical, and (especially) simple effect, and I couldn’t find it. I decided to give up. I obviously had no talent for simplicity.

More underpowered studies with small or null population effect sizes.
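
A rough power calculation shows what "underpowered" means here. This is a sketch under my own assumptions: Stapel gives no effect sizes or sample sizes, so d = 0.2 and 20 people per group are illustrative guesses, and the power formula is the usual normal approximation for a two-sided, two-group comparison. The function names are made up.

```python
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the true effect d.
    ncp = d * (n_per_group / 2) ** 0.5
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)


def n_for_power(d, target=0.8, alpha=0.05):
    """Smallest per-group n whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Assumed values: a small true effect and a typical small-lab sample.
    print(f"power at d=0.2, n=20 per group: {approx_power(0.2, 20):.2f}")
    print(f"n per group for 80% power at d=0.2: {n_for_power(0.2)}")
```

With numbers like these, a real but small effect would be detected in roughly one study out of ten, and reliable detection would need several hundred people per group. An effect that shows up weakly, then partially, then not at all is exactly what such a setup produces, whether the true effect is small or zero.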

I’ve come up with a nice idea for a series of experiments with an American colleague, but that project isn’t making much progress. Here’s how it works: people sit at a computer screen and we show them brief flashes of either very attractive or very ugly faces. The images show up for such a short time that the participants can’t really make out the faces, but they know that they’ve seen something because of the flash. Every flash occurs in a different corner of the screen, and the subjects have to press a button to say whether they saw it on the left or the right. After that, we tell them that the task is finished, but we ask them just to sign a piece of paper, “for the record.” The idea is that seeing pictures of ugly people, even unconsciously as the image is flashed in front of them for a fraction of a second, will make people feel good about themselves, which will make their signature larger, whereas flashes of attractive people will make their signature smaller—in other words, there’ll be a contrast effect. The size of someone’s signature is a subtle, implicit way to measure how positive their self-image is. If your signature becomes bigger after seeing the flashed images, you see yourself in a more positive light, and if it becomes smaller, your self-image has become more negative.
The initial results were very promising, but the last experiment I ran to try and demonstrate this automatic, unconscious social comparison effect failed completely. That’s incomprehensible, and it still hurts. It was an elegant idea, and everyone expected it to work, given the number of comparable effects in the literature.
In the only way my painful back will allow, I stand in front of my computer in the kitchen and start to write. It’s a great story, with lots of experiments, all of which turn out fine. After just four days, the article is written. All the failures are behind me.

A wildly implausible hypothesis fails to show up in an experiment? How surprising. Which world are these social psychologists living in? If Stapel was a hotshot and he believed all these things, surely such beliefs are common in the field.

I drive to Zwolle and then to Groningen. I can picture the scene, a few months ago, at the start of the summer, just before the last day of school. There are dozens, if not hundreds, of students, their faces showing their concentration but also smiling, filling in my questionnaires. They’re wearing summer dresses or shorts, with thongs on their feet. They’re sitting in silence, at white tables laid out in rows of five, working hard in the name of science. Some have their tongues sticking out of their mouths, others are pressing their pencils so hard into the paper that they nearly snap, but they’re all trying their very best. I can see them: circling answers, shading boxes, making crosses. I can see it with my own eyes. I didn’t have to come here today, because they’re all here. All here, answering questions in the name of science.

Thongs on their feet?!? Mistranslation?

The October report felt to me like an attempt at an exorcism. What I discovered in it was not just a description or an explanation of something evil; it was an attempt to destroy that evil, root and branch. I saw myself portrayed as an arrogant, manipulative con artist, an evil genius, a wicked researcher who, very deliberately, following a nefarious master plan, had set out to deceive as many people as possible. Really? Sure, I’d told a whole load of lies, and I’m going to have to accept the punishment that goes with that for the rest of my life. But did I really have a plan? If I had, wouldn’t I have gone about it in a more careful, smart, calculating way? My fraud was a mess, my fake data always put together in a hurry, full of statistical errors and little quirks that made it easy to spot, if you looked even moderately carefully at it. Wouldn’t a thoughtful, rational person make a better, neater, and less obvious job of it? Apparently I’d deliberately surrounded myself with weak, easily manipulated research students so that I could make them extra-dependent on me and get them to go along with my evil little schemes. Really? In fact, all students had to go through an intensive, “any doubt—out” selection process, with three or four other people besides me, and only the best few making it through. They’d suggested that I’d managed to surround myself with poor-quality researchers. Really? Many of the people I’d published with were senior academics and leading psychologists, some of them world-famous. They said I’d gotten rid of people who’d dared to criticize me. Really? Who? Who did I get rid of, and when? What was their complaint? Did they leave because of me or for some other reason? Sure, not everyone ended up with a paid gig, but was that because they’d criticized me, or vice versa?
The fact that I’d invited my colleagues to dinner at our house, organized drinks receptions or barbecues, or an occasional trip to the theater, was cited as evidence of my manipulative behavior. Seriously? We worked hard together, sometimes we let our hair down, and sometimes we tried to do a bit of team-building. Was I the only person doing this with my team? Doesn’t anyone else ever go out for the evening with people from work? Don’t other groups of researchers like to have a good time occasionally? But in my case, it seems I was doing it to butter up my colleagues, to get them on my side, and especially to get them to keep quiet if they found out that anything sketchy was going on. Really? Was I really so calculating? Did I do all that just to maintain my web of lies? Apparently so. I found out later that one of the members of the Committee, in a discussion with some of my former colleagues, had become somewhat emotional and described me “just like any other criminal”.

This is another interesting case of one person's modus ponens being another person's modus tollens. Stapel is arguing that:

If the people he had published with were leading social psychologists, then he did not surround himself with poor-quality researchers.
The people he had published with were leading social psychologists.
So, he did not surround himself with poor-quality researchers.

Now, I'm more inclined (based on this book and other failures of social psychology) to reason the other way:

He surrounded himself with poor-quality researchers.
The people he had published with were leading social psychologists.
So, leading social psychologists were poor-quality researchers.

Just Emil Kirkegaard Things
