Rorschach tests predict stuff

And maybe we should explore using them with machine learning

Apr 26, 2023

Rorschach tests are those inkblot things you sometimes see in movies. They are generally considered bad, on a par with Myers-Briggs among the bad early-1900s ideas in psychology.

Here's Wikipedia:

Some skeptics consider the Rorschach inkblot test pseudoscience,^[8]^[100] as several studies suggested that conclusions reached by test administrators since the 1950s were akin to cold reading.^[101] In the 1959 edition of Mental Measurement Yearbook, Lee Cronbach (former President of the Psychometric Society and American Psychological Association)^[102] is quoted in a review: "The test has repeatedly failed as a prediction of practical criteria. There is nothing in the literature to encourage reliance on Rorschach interpretations." In addition, major reviewer Raymond J. McCall writes (p. 154): "Though tens of thousands of Rorschach tests have been administered by hundreds of trained professionals since that time (of a previous review), and while many relationships to personality dynamics and behavior have been hypothesized, the vast majority of these relationships have never been validated empirically, despite the appearance of more than 2,000 publications about the test."^[103] A moratorium on its use was called for in 1999.^[104]
A 2003 report by Wood and colleagues had more mixed views: "More than 50 years of research have confirmed Lee J. Cronbach's (1970) final verdict: that some Rorschach scores, though falling woefully short of the claims made by proponents, nevertheless possess 'validity greater than chance' (p. 636). [...] Its value as a measure of thought disorder in schizophrenia research is well accepted. It is also used regularly in research on dependency, and, less often, in studies on hostility and anxiety. Furthermore, substantial evidence justifies the use of the Rorschach as a clinical measure of intelligence and thought disorder."^[105]

Less well known is that they can also predict intelligence decently well. This research topic has fallen out of fashion, but there's a bunch of old research. One obscure thesis from 1967 with 0 citations (not even indexed by Google Scholar!), "A study of Rorschach intellectual indicators in adolescents" by Melvyn Eugene Giller, was already able to review 19 studies. No formal meta-analysis was attempted because the method hadn't been invented yet. Before we get to that, here are the scores that people are working with. You show the subjects various odd inkblots, record what they say, and then score the responses on various metrics, some of which are:

Definitions of and Rationale for Rorschach Indices of Intelligence
Associational Productivity (R). -- This index refers to the total number of scored associations on the test. "When the associations are scored consecutively in the ten cards, the number of the last association is R" (7, p. 212). In past research (25, 26) R has been shown to positively correlate with intelligence, especially verbal intelligence.
Number and Percentage of Whole Responses with good form (W, W%). -- This index indicates that the subject used the entire inkblot to form an association. "A relatively high number of W represents an emphasis on the abstract forms of thinking and the higher forms of mental activity" (13, p. 259). W is an indicator of psychic strength and creative imagination. Klopfer and Kelly also state:
Certain W's are achieved by mental inability to organize the whole card into subdivisions, an inability frequently found in certain severe forms of brain diseases. Furthermore, vague or noncommittal W interpretations show a minimum of effort to organize the stimulus material or to build up an all-inclusive concept (13, p. 259).
It is for this reason that only W responses with good form will be used in the statistical analysis.
Percentage of Major Detail Responses (D%). -- "The D response reflects the degree of importance which one places upon the most practical aspects of daily living" (23). It is a common down to earth type of response; the more average approach to a problem. "D% indicates, in general, an interest in the specific, in details, in the concrete. This may be interpreted as a systematic, everyday, common-sense application of intelligence" (13, p. 260).
Form Level (F+%). -- "The F+% indicates the extent to which an individual is capable of structuring his world along the lines of reality; it provides us with an index to his degree of perceptual accuracy and ego control" (23). It is the principal test factor through which an individual shows his thinking from the highest centers and ability to consciously control his mental resources. In reviewing past literature this index is almost universally agreed upon as the best indicator of intelligence (4, 8, 19). Several studies (1, 11, 21) have derived significant correlations between F+% and intelligence.
Percentage of Animal and Animal Detail Responses (A%). -- A response is scored as A if the content refers to any species other than man: mammals, birds, fish, invertebrates, and insects (7). Rorschach (19) described the A response as indicating mental inflexibility. A high A% is indicative of stereotypy of thought processes and negatively correlated to intelligence. This is borne out by Beck (6) who found a significant negative correlation between A% and intelligence.
Number and Percentage of Human and Human Detail Responses (H, H%). -- Beck states: "A scoring of H tells that the subject has associated with a human form. These responses may be most significant in what they tell us about the patient's percept of himself, and of persons of importance to him" (7, p. 217). A progressive increase in H may be considered as representing the striving for increased contact with one's fellow man. It reflects the increase in socialization which accompanies advance in age. Thetford et al. (24) state that the number of H responses varies directly with intelligence.
The Human Movement Response (M). -- The M response is scored when human movement, either overt or inferred, is observed. "The response, as Rorschach understands it, really reproduces movements or activities that the subject is carrying on within his mental life" (7, p. 72). "The movement score generally relates to an individual's attitude and feelings about the inner reality of his experiences: his self concept, his tensions and conflicts surrounding the acceptance of his self, his fantasies and impulses" (13, p. 268). An M response is a function of psychic creativity which would seem to correlate with intelligence. Many studies (3, 21, 22, 24) have produced significant correlations, especially with verbal intelligence.
The Number of Content Categories (N). -- Rorschach (19) and Klopfer and Kelly (13) believed that the variety of content indicates the intelligence level in the subject. The number of content categories used is a function of experience as well as intelligence. Studies by Spaner (20) and Pauker (18) have shown N to correlate significantly with intelligence.

I know, weird stuff. Nevertheless, if it works, it works. Almost all the studies reviewed by Giller are pretty unimpressive, relying on small and unrepresentative samples of various mentally ill people. So I looked for any modern replications, and found this one:

Wood, J. M., Krishnamurthy, R., & Archer, R. P. (2003). Three factors of the Comprehensive System for the Rorschach and their relationship to Wechsler IQ scores in an adolescent sample. Assessment, 10(3), 259-265.

Principal axis factor analyses of the Rorschach Comprehensive System (CS) in a clinical sample of 152 adolescents yielded three clearly defined factors: Synthesized Complexity (defined by Zf, DQ+, and F%), Productivity (defined by R, D, and Dd), and Form Quality (defined by X+%, F+%, and X-%). Variables on the Synthesized Complexity and Form Quality factors were generally correlated with Wechsler Full Scale IQ, Verbal IQ, and Performance IQ scores. Overall, the factors in this adolescent sample replicated factors identified in earlier studies with adults. Implications for clinical practice are discussed.

They found this:

The first 3 columns are the factor loadings. Of their Rorschach scores, 10 correlated at p < .05 with Full Scale IQ, which is pretty impressive considering the sample size of 152. The correlations are not that small either, with several above .30. Recall that the typical correlation in social science is about .20, and that's including the publication bias inflation.
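To put "p < .05 with a sample of 152" in perspective, here is a minimal sketch of how large a correlation has to be to clear that threshold. It uses the Fisher z approximation rather than whatever exact test the paper used, and the function names are my own, so treat it as a rough check and not a reproduction of their analysis.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def critical_r(n: int, alpha: float = 0.05) -> float:
    """Smallest |r| that is significant at two-sided level alpha.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) ~ N(0, 1).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


def p_value(r: float, n: int) -> float:
    """Two-sided p-value for a sample correlation r (same approximation)."""
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # With n = 152 the threshold comes out at roughly .16
    print(round(critical_r(152), 3))
    # A correlation of .30 at n = 152 is far below p = .001
    print(p_value(0.30, 152))
```

So at this sample size anything above roughly .16 passes p < .05, and correlations above .30 are well clear of it.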

Anything better? Well, there is a formal meta-analysis from 2013 which studied some 700+ outcomes reported in studies. It finds a lot of validity for various things, and no publication bias in the funnel plot. The problem with the meta-analysis is that the results were all aggregated into a single pool, so we cannot see what their findings for IQ were. What a waste!

We systematically evaluated the peer-reviewed Rorschach validity literature for the 65 main variables in the popular Comprehensive System (CS). Across 53 meta-analyses examining variables against externally assessed criteria (e.g., observer ratings, psychiatric diagnosis), the mean validity was r = .27 (k = 770) as compared to r = .08 (k = 386) across 42 meta-analyses examining variables against introspectively assessed criteria (e.g., self-report). Using Hemphill's (2003) data-driven guidelines for interpreting the magnitude of assessment effect sizes with only externally assessed criteria, we found 13 variables had excellent support (r ≥ .33, p< .001; ∴ FSN > 50), 17 had good support (r ≥ .21, p < .05, FSN ≥ 10), 10 had modest support (p < .05 and either r ≥ .21, FSN < 10, or r = .15–.20, FSN ≥ 10), 13 had little (p < .05 and either r = < .15 or FSN < 10) or no support (p >.05), and 12 had no construct-relevant validity studies. The variables with the strongest support were largely those that assess cognitive and perceptual processes (e.g., Perceptual-Thinking Index, Synthesized Response); those with the least support tended to be very rare (e.g., Color Projection) or some of the more recently developed scales (e.g., Egocentricity Index, Isolation Index). Our findings are less positive, more nuanced, and more inclusive than those reported in the CS test manual. We discuss study limitations and the implications for research and clinical practice, including the importance of using different methods in order to improve our understanding of people. (Mihura 2013)

There is a comment from Wood 2014, who redid some of the results and found lower validities by including more unpublished work and focusing on the emotional outcomes. Not sure how important this is. The original authors then replied, basically saying that the unpublished research Wood relied on has more methodological problems.

Wood 2003 (PDF) also provide a summary of the research about intelligence prediction:

Alfred Binet considered including inkblots in his famous intelligence test (Zubin, Eron, & Schumer, 1965). Although he eventually abandoned the idea, his original intuition turned out to be correct. As research has shown, several Rorschach variables are correlated with intelligence test scores (for reviews, see Meyer, 1992; Wood, Krishnamurthy, & Archer, 2003). The highest correlations, which range from .30 to .40, have been found for Developmental Quality and Organizational Activity, scores that measure the degree to which responses synthesize diverse parts of a blot into a unified image. Lambda and the closely related F%, which reflect a tendency to give responses based on color and shading rather than form alone, also appear to correlate above .30 with intelligence test scores. Somewhat lower (.20 to .30) are the correlations for Form Quality, Human Movement responses, and R (the total number of responses given to the blots).
However, the best Rorschach indicator of intelligence is to be found not among these scores but in the vocabulary that the respondent uses to describe the blots (Davis, 1961; Hauser, 1963; Trier, 1958). For example, Thomas Trier of the University of California at Berkeley asked clinicians to read a group of Rorschach protocols and identify the seven most sophisticated words used by each respondent. Then, by consulting a commonly available word book, he estimated the average grade level of these words for each respondent. This simple Rorschach-based estimate of vocabulary level correlated .77 with intelligence test scores.
Although such results demonstrate that Rorschach responses can be used to estimate intelligence, modern standardized intelligence tests are definitely superior for the purpose (Davis, 1961). However, when intelligence testing is impossible, for example with an uncooperative child, inkblots may provide an acceptable substitute. The use of Rorschach-based vocabulary as an index of intelligence has been virtually ignored in the assessment literature since the 1950s, so that standardized procedures and norms are unavailable. With some scientific groundwork, however, the Rorschach might well be put on a solid footing as a rough intelligence measure, to be pulled out of the psychologist’s briefcase under pressing circumstances.
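The vocabulary procedure attributed to Trier above is simple enough to sketch in code. Note the swap: Trier had clinicians pick the seven most sophisticated words by hand, whereas this sketch picks the seven highest-graded words automatically. The `WORD_GRADE` table is entirely hypothetical (a stand-in for the word book he consulted), and treating unknown words as grade 1 is my assumption.

```python
import re

# HYPOTHETICAL word -> school grade lookup, for illustration only.
# A real version would need a published word-frequency or grade-level list.
WORD_GRADE = {
    "bat": 1, "crab": 2, "butterfly": 2, "mask": 3, "skeleton": 5,
    "emblem": 7, "symmetrical": 8, "pelvis": 9, "silhouette": 10,
    "anatomical": 11,
}


def vocabulary_level(protocol: str, top_k: int = 7, default_grade: int = 1) -> float:
    """Mean grade level of the top_k highest-graded distinct words.

    Words missing from WORD_GRADE get default_grade (an assumption).
    Returns NaN for an empty protocol.
    """
    words = set(re.findall(r"[a-z]+", protocol.lower()))
    grades = sorted((WORD_GRADE.get(w, default_grade) for w in words), reverse=True)
    top = grades[:top_k]
    return sum(top) / len(top) if top else float("nan")


if __name__ == "__main__":
    protocol = "A bat, or maybe an anatomical drawing of a pelvis; quite symmetrical."
    print(vocabulary_level(protocol))
```

The resulting number would then be correlated with measured IQ across subjects; nothing here says the automated version would reach the .77 reported for the hand-scored one.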

What seems to be missing from all the literature is an attempt at multiple regression or machine learning from the Rorschach data to predict stuff. Optimally, one should record everything the subjects say and feed that into some natural language processing to generate features, and then feed that into machine learning to generate predictions. Considering the findings above, I can think of two broad views one might adopt:

Rorschach tests are interesting and potentially undervalued. They measure many different psychological features at the same time and could be used to discover hidden features that cannot be detected using self-report data. If they can also provide a decent estimate of intelligence while being unintrusive, that's an added bonus. The scoring of the tests can probably be substantially improved using machine learning approaches, and the predictions can be improved as well. Perhaps this could result in a much shorter administration time than the current 1-2 hours and not require extensive training of staff.
Rorschach tests are meh tier. While they do obtain some information from subjects, they do so in ineffective round-about ways where one could obtain the same information using standard approaches. The intelligence estimate one can get from them could be better obtained by a standard 20-30 minute test that doesn't require intensive training of staff. The various other constructs could be similarly obtained using standard assessment methods e.g. psychopathology DSM/DIIS questionnaires.

Which is more right? Maybe it's somewhere in the middle.
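For what the "record everything, extract features, predict" idea could look like, here is a bare-bones sketch in plain Python. The three text features are my own picks for illustration (not an established scoring system), and the regression is ordinary least squares solved by hand to keep the example self-contained. A real attempt would use a proper NLP and machine learning library plus cross-validation.

```python
import re


def text_features(transcript: list[str]) -> list[float]:
    """Turn one subject's list of responses into a small feature vector."""
    words = [w for resp in transcript for w in re.findall(r"[a-z']+", resp.lower())]
    n_words = len(words) or 1  # avoid division by zero on empty transcripts
    return [
        float(len(transcript)),           # R: number of responses
        len(set(words)) / n_words,        # type-token ratio (lexical variety)
        sum(map(len, words)) / n_words,   # mean word length
    ]


def fit_ols(X: list[list[float]], y: list[float]) -> list[float]:
    """Multiple regression via the normal equations.

    Returns [intercept, b1, b2, ...]. Assumes the features are not
    perfectly collinear (otherwise a pivot is zero and this fails).
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for i in range(k):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs: list[float], x: list[float]) -> float:
    """Predicted outcome for one feature vector."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))
```

Usage would be `coefs = fit_ols([text_features(t) for t in transcripts], iq_scores)` on a training sample, followed by `predict` on held-out subjects, where `transcripts` and `iq_scores` are whatever dataset one manages to collect.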

But isn't it totally unreliable?

If it can correlate with stuff, it cannot be totally unreliable, because a validity correlation cannot exceed the square root of the test's reliability. Still, here's a meta-analysis of Rorschach reliability:

We estimated the average reliability, stability, and validity of the Minnesota Multiphasic Personality Inventory (MMPI), Rorschach Inkblot Test, and Wechsler Adult Intelligence Scale (WAIS) from articles published in the Journal of Personality Assessment and the Journal of Clinical Psychology between 1970 and 1981. Following standard psychometric theory, reliability values exceeded stability values, which exceeded validity values. Validity studies based on theory, prior research, or both showed greater effects than did studies lacking a theoretical or empirical rationale. In general, the reliability and stability of all three tests were acceptable and approximately equivalent. The convergent-validity estimates for the Rorschach and MMPI were not significantly different, but both these estimates were lower than the estimate for the WAIS. It appears that both the MMPI and Rorschach can be considered to have adequate psychometric properties if used for the purpose for which they were designed and validated.

Their results:

It seems difficult to complain about the results, but for Rorschach they are based only on a few studies.
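The link between reliability and validity used in this section comes from classical test theory: the observed correlation between a test and a criterion is capped at the square root of the product of their reliabilities. A small sketch, with made-up reliability values that are not taken from the meta-analysis:

```python
from math import sqrt


def max_validity(rel_test: float, rel_criterion: float = 1.0) -> float:
    """Upper bound on an observed validity correlation given the reliabilities."""
    return sqrt(rel_test * rel_criterion)


def disattenuate(r_observed: float, rel_test: float, rel_criterion: float = 1.0) -> float:
    """Spearman's correction: estimated correlation between the true scores."""
    return r_observed / sqrt(rel_test * rel_criterion)


if __name__ == "__main__":
    # Illustrative numbers only
    print(max_validity(0.64, 0.81))   # ceiling on the observed correlation
    print(disattenuate(0.30, 0.64))   # what r = .30 implies before measurement error
```

Run in reverse, this is why a score that correlates .30 with IQ cannot be pure noise: its reliability has to be at least .09, even if the criterion were measured perfectly.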

Just Emil Kirkegaard Things
