How to make psychology items (questions) with ChatGPT4

State of the art September 2023

Sep 06, 2023

In the comments on my last post on measuring intelligence in online surveys, PaleoPolak asked:

> It is fairly easy to make new vocabulary items for most languages using the AI chatbots.
Can you elaborate on this? I'd be interested in trying to make one for my language (Polish).

Sure. I will be giving examples using OpenAI's ChatGPT4, but you can do it with other chatbots as well.

The simplest approach is just to ask it to make a new item similar to already existing items that the AI is familiar with. For instance:

Make me 5 vocabulary test questions similar to those on the SAT/GRE
ChatGPT
Certainly! Here are five vocabulary test questions styled similarly to those on the SAT/GRE:
1. The politician's comments were so equivocal that voters were unsure of his stance on the issue.
Which of the following words is closest in meaning to "equivocal"?
A) Mournful B) Transparent C) Ambiguous D) Spirited E) Conclusive
Answer: C) Ambiguous
2. The scientist’s research was deemed spurious, leading to a retraction of her paper from the journal.
Which of the following words is most nearly opposite in meaning to "spurious"?
A) Genuine B) False C) Cunning D) Dubious E) Erroneous
Answer: A) Genuine
3. The villagers were taken aback by the sudden tempest that ravaged their crops.
Which of the following words is closest in meaning to "tempest"?
A) Drought B) Storm C) Peace D) Sunlight E) Shelter
Answer: B) Storm
4. As an ardent supporter of the cause, she spent countless hours volunteering and advocating for change.
Which of the following words is closest in meaning to "ardent"?
A) Indifferent B) Tepid C) Enthusiastic D) Hesitant E) Detached
Answer: C) Enthusiastic
5. Despite his youthful appearance, he possessed the sagacity of someone twice his age.
Which of the following words is closest in meaning to "sagacity"?
A) Foolishness B) Ignorance C) Wisdom D) Immaturity E) Vanity
Answer: C) Wisdom
These questions are designed to test both direct vocabulary knowledge and the ability to discern meaning from context.

In this case, the format is a pick 1 of 5, and the word appears with context. The context makes it easier to infer the meaning, lowering the difficulty a bit. You can ask it for other formats too if that's what you want. For instance, the intelligence tests on openpsychometrics.org often use a pick X out of Y approach. Each item then functions as a kind of multiple item, giving more information per time spent. For instance:

Make me 3 vocabulary test questions using a format where subjects are shown 10 words, and must choose the 5 that have similar meanings.
ChatGPT
Certainly! Here are three vocabulary test questions in the format you've requested:
1. Identify the 5 words below that are most closely related to "talkative" or "wordy":
A) Laconic B) Garrulous C) Verbose D) Taciturn E) Reticent F) Loquacious G) Succinct H) Chatty I) Brevity J) Effusive
Answer: B) Garrulous, C) Verbose, F) Loquacious, H) Chatty, J) Effusive
2. Identify the 5 words below that are most closely related to "clean" or "pure":
A) Tarnished B) Pristine C) Immaculate D) Contaminated E) Sterile F) Unblemished G) Sullied H) Virginal I) Polluted J) Sanitary
Answer: B) Pristine, C) Immaculate, E) Sterile, F) Unblemished, J) Sanitary
3. Identify the 5 words below that are most closely related to "brief" or "short-lived":
A) Ephemeral B) Interminable C) Transitory D) Protracted E) Fleeting F) Momentary G) Prolonged H) Elongated I) Evanescent J) Endless
Answer: A) Ephemeral, C) Transitory, E) Fleeting, F) Momentary, I) Evanescent
This format can be especially challenging as it requires discerning nuanced differences in meaning among words that may, at first glance, seem quite similar.

With a pick 1 of X format, the subject has a 1/X chance of guessing it at random. This means the pass rate (difficulty) of an item cannot normally be below 1/X unless subjects are somehow attracted to wrong options (called distractors). But with a pick-5-of-10 format, it is quite unlikely that one will get it totally right by chance: 1 in 252 (in R, type choose(10, 5)).
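To make the guessing arithmetic concrete, here is a small Python sketch of the same calculation. The function names are made up for illustration; only `math.comb` comes from the standard library.

```python
from math import comb

def chance_pick_one(n_options: int) -> float:
    """Probability of guessing a pick-1-of-X item correctly at random."""
    return 1 / n_options

def chance_pick_k_of_n(k: int, n: int) -> float:
    """Probability of picking exactly the right k options out of n at random."""
    return 1 / comb(n, k)

print(chance_pick_one(5))         # floor on the pass rate for a pick 1 of 5 item
print(comb(10, 5))                # number of distinct ways to pick 5 of 10
print(chance_pick_k_of_n(5, 10))  # chance of a perfect blind guess
```

This only covers the all-or-nothing case. If partial credit is given per option, the chance-level score is higher and has to be worked out separately.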

How about another language? Since the bot speaks most languages, presumably it can also generate vocabulary questions in them. Let's try:

Make me 5 vocabulary test questions similar to those on the SAT/GRE. But do this in Danish.
ChatGPT
Of course! Here are five vocabulary test questions in Danish, styled similarly to those on the SAT/GRE:
1. Hans holdning var så tvetydig at mange var i tvivl om, hvad han faktisk mente.
Hvilket af følgende ord er nærmest i betydning til "tvetydig"?
A) Klar B) Ubeslutsom C) Forvirrende D) Voldsom E) Dobbeltsidig
Svar: C) Forvirrende
2. Efter mange års studier blev hans teori endelig bekræftet.
Hvilket af følgende ord er mest modsat i betydning til "bekræftet"?
A) Afvist B) Modbevist C) Anerkendt D) Bemærket E) Godkendt
Svar: A) Afvist
3. Hans udholdenhed imponerede alle på holdet.
Hvilket af følgende ord er nærmest i betydning til "udholdenhed"?
A) Lethed B) Træthed C) Styrke D) Mod E) Energi
Svar: C) Styrke
4. Den nye lov blev kritiseret for at være overfladisk.
Hvilket af følgende ord er mest modsat i betydning til "overfladisk"?
A) Dybdegående B) Kort C) Minimalistisk D) Overdreven E) Flygtig
Svar: A) Dybdegående
5. Hun blev beskyldt for at være partisk, da hun altid tog sin brors parti.
Hvilket af følgende ord er nærmest i betydning til "partisk"?
A) Neutral B) Forudindtaget C) Ubeslutsom D) Uafhængig E) Ligeglad
Svar: B) Forudindtaget
I hope these work for your needs!

If you paid attention above, you will have seen that the AI was a little sneaky. Some of the questions asked for antonyms rather than synonyms! You can tell it to avoid that:

In your second English question above, you actually made it the form of an antonym instead of a synonym. Can you redo it but using synonyms?
ChatGPT
Absolutely, I apologize for the oversight. Here's a revised version of the second question, focusing on synonyms:
2. Efter mange års studier blev hans teori endelig bekræftet.
Hvilket af følgende ord er nærmest i betydning til "bekræftet"?
A) Godkendt B) Accepteret C) Valideret D) Overvejet E) Overvåget
Svar: C) Valideret
Thank you for pointing it out.
---
You did it right, but for the Danish question. I asked for the second English question, the one about "spurious".
ChatGPT
My apologies for the oversight. Let's correct that:
2. The scientist’s research was deemed spurious, leading to a retraction of her paper from the journal.
Which of the following words is closest in meaning to "spurious"?
A) Fraudulent B) Authentic C) Cunning D) Dubious E) Erroneous
Answer: A) Fraudulent
Thank you for pointing that out, and I appreciate your patience.

In terms of quality, some of the questions seem fairly dubious to me because too many of the options have close meanings. For instance, for spurious there's fraudulent, dubious, and erroneous. I would have picked erroneous, but looking at the thesaurus, it seems fraudulent is a bit closer, in the sense of "counterfeit, fake". I think in science writing, the meaning would be closer to false positive, fluke, or erroneous.

Let's try Polish:

Thanks. Now do one in Polish.
ChatGPT
Certainly! Here's a vocabulary test question in Polish, styled similarly to those on the SAT/GRE:
1. Jego odpowiedź była niejednoznaczna, co spowodowało wiele pytań.
Które z poniższych słów jest najbliższe w znaczeniu do słowa "niejednoznaczna"?
A) Jasna B) Złożona C) Bezsensowna D) Jednoznaczna E) Wieloznaczna
Odpowiedź: E) Wieloznaczna
I hope this meets your needs!

I don't know if this works, but let me know in the comments. You can also post your own results.
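If you want to repeat the same request across several languages, it can help to build the prompt from a template. The sketch below is only an illustration: `build_vocab_prompt` is a hypothetical helper, the wording is adapted from the prompts used above, and it produces text to paste into a chatbot rather than calling any API.

```python
def build_vocab_prompt(n_items: int, language: str = "English",
                       n_options: int = 5, synonyms_only: bool = True) -> str:
    """Assemble a prompt asking a chatbot for vocabulary items (hypothetical helper)."""
    parts = [
        f"Make me {n_items} vocabulary test questions similar to those on the SAT/GRE.",
        f"Use a pick 1 of {n_options} format and give the correct answer after each question.",
    ]
    if synonyms_only:
        # Guards against the synonym/antonym mix-up seen in the transcripts.
        parts.append("Only ask for synonyms, never antonyms.")
    if language != "English":
        parts.append(f"But do this in {language}.")
    return " ".join(parts)

for lang in ["English", "Danish", "Polish"]:
    print(build_vocab_prompt(5, lang))
```

Whether the extra instructions actually improve the output is untested here; a native speaker still needs to check the items.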

Knowledge tests

Moving beyond vocabulary, let's try general knowledge questions:

Now for something else. Can you make me 5 general knowledge questions using the pick 1 of 5 format. They should concern things that the average adult in Western countries might know.
ChatGPT
Certainly! Here are five general knowledge questions based on the "pick 1 of 5" format:
1. Who wrote the play "Romeo and Juliet"?
A) Charles Dickens B) Geoffrey Chaucer C) Jane Austen D) William Shakespeare E) F. Scott Fitzgerald
Answer: D) William Shakespeare
2. Which planet in our solar system is known as the "Red Planet"?
A) Venus B) Jupiter C) Mars D) Neptune E) Mercury
Answer: C) Mars
3. In which year did World War II end?
A) 1939 B) 1941 C) 1945 D) 1950 E) 1955
Answer: C) 1945
4. What is the primary ingredient in guacamole?
A) Potato B) Tomato C) Carrot D) Avocado E) Lettuce
Answer: D) Avocado
5. Which animal is known for its distinctive black and white stripes?
A) Cheetah B) Giraffe C) Zebra D) Tiger E) Leopard
Answer: C) Zebra
I hope these questions suit your needs!

These seem fine, though Romeo and Juliet is Anglo-loaded. They are perhaps also too easy, so let's improve that:

These are good. Can you make me 5 more but make them much harder.
ChatGPT
Of course! Here are five more challenging general knowledge questions based on the "pick 1 of 5" format:
1. Which philosopher is associated with the allegory of the cave?
A) Søren Kierkegaard B) Friedrich Nietzsche C) John Locke D) Immanuel Kant E) Plato
Answer: E) Plato
2. The "Hundred Years' War" was fought between which two countries?
A) France and Germany B) Spain and Portugal C) England and France D) The Ottoman Empire and Byzantine Empire E) Russia and Sweden
Answer: C) England and France
3. Which element has the chemical symbol 'Ag'?
A) Argon B) Gold C) Aluminum D) Silver E) Arsenic
Answer: D) Silver
4. In which literary work would you find the land of "Lilliput"?
A) Gulliver's Travels B) The Odyssey C) The Chronicles of Narnia D) Moby-Dick E) The Lord of the Rings
Answer: A) Gulliver's Travels
5. Which mathematical concept is defined as a number's distance from zero on the number line?
A) Median B) Absolute value C) Variance D) Slope E) Coefficient
Answer: B) Absolute value
I hope these questions provide the level of challenge you're looking for!

These are better. I didn't know (4), and (3) was difficult (A or C?).

What about science?

These were good. Can you make me 5 more hard questions, but only concerning science.
ChatGPT
Of course! Here are five challenging science-based questions in the "pick 1 of 5" format:
1. Which of the following is NOT a fundamental force of nature?
A) Electromagnetic B) Gravitational C) Nuclear strong D) Nuclear weak E) Centrifugal
Answer: E) Centrifugal
2. What is the primary function of ribosomes in a cell?
A) Energy production B) DNA replication C) Waste removal D) Protein synthesis E) Cell division
Answer: D) Protein synthesis
3. Which gas makes up the majority of Earth's atmosphere?
A) Oxygen B) Hydrogen C) Carbon Dioxide D) Argon E) Nitrogen
Answer: E) Nitrogen
4. In quantum mechanics, the Heisenberg uncertainty principle primarily deals with the impossibility of precisely determining which two paired properties simultaneously?
A) Position and speed B) Energy and temperature C) Time and momentum D) Charge and mass E) Position and momentum
Answer: E) Position and momentum
5. Which of the following is not a type of white blood cell?
A) Lymphocyte B) Neutrophil C) Erythrocyte D) Monocyte E) Basophil
Answer: C) Erythrocyte (This is a red blood cell.)
I hope these questions meet the level of difficulty you're seeking!

These are difficult, especially for a layperson.

Nonverbal tests?

Since ChatGPT4 cannot generate images, it is hard to generate figure matrices unless these are presented in some kind of code. But first, let's try number series:

Now, can you make me 5 number series questions. They should be in the format where a series of numbers are shown, and one has to fill in the last number.
ChatGPT
Of course! Here are five number series questions where the respondent is tasked with deducing the pattern and identifying the next number in the series:
1. 2, 4, 8, 16, 32, ___
Answer: 64 (Each number is multiplied by 2.)
2. 5, 10, 20, 25, 50, 55, ___
Answer: 110 (The pattern alternates between multiplying by 2 and adding 5.)
3. 1, 1, 2, 3, 5, 8, 13, ___
Answer: 21 (Each number is the sum of the two preceding ones, known as the Fibonacci sequence.)
4. 6, 3, 6, 12, 6, 24, 6, ___
Answer: 48 (The pattern alternates: the number is either 6 or it's doubled from the previous non-6 number.)
5. 11, 22, 33, 44, 55, ___
Answer: 66 (Each number adds 11 to the previous one.)
These questions can challenge one's pattern recognition and logical reasoning skills, and they're commonly found in various intelligence or aptitude tests.
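Generated number series are easy to sanity-check by encoding the stated rule and comparing it with the numbers shown. The following is a minimal sketch, assuming the rules as ChatGPT described them for items 2 and 4; the helper name `alternating_series` is invented for this example.

```python
def alternating_series(start, steps, length):
    """Generate a series by applying the given step functions in rotation."""
    series = [start]
    while len(series) < length:
        step = steps[(len(series) - 1) % len(steps)]
        series.append(step(series[-1]))
    return series

# Item 2: the stated rule alternates between adding 5 and multiplying by 2.
shown_2 = [5, 10, 20, 25, 50, 55]
generated_2 = alternating_series(5, [lambda x: x + 5, lambda x: x * 2], 7)
print(generated_2[:6] == shown_2, generated_2[-1])

# Item 4: every other number is 6; the remaining numbers are said to double.
shown_4 = [6, 3, 6, 12, 6, 24, 6]
non_six = shown_4[1::2]  # the numbers in between the 6s
ratios = [b / a for a, b in zip(non_six, non_six[1:])]
print(non_six, ratios)
```

Item 2 regenerates exactly and continues to 110. For item 4 the ratios are 4 and then 2, so the series as printed does not follow a pure doubling rule at its second term.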

These seem good too, though as a commenter points out below, the 3 in series (4) does not fit the stated doubling rule. Let's try the matrices in ASCII format:

Now, a more difficult task for you. Can you make me a Raven's matrix type test, but using ASCII? Just one question.
ChatGPT
Certainly, a Raven's Progressive Matrix test typically involves pattern recognition and completion in a 3x3 grid. In ASCII, this will be quite basic due to the limitations, but I'll give it a shot.
Here's a simple Raven's matrix type problem using ASCII:
Row 1: |O| |X| |O,X|
Row 2: |X,O| |O| |X|
Row 3: |O,X| |X,O| |___|
What should replace the blank (|___|) to complete the pattern?
A) X
B) O
C) O,X
D) X,O
Answer: B) O
Explanation: The pattern, when reading from left to right, alternates between "O" and "X", with the middle column reversing the order seen in the first column for rows 2 and 3. Based on this, the missing entry should be "O".
Note: The representation here is quite limited due to ASCII's constraints, but I hope it gives you an idea!
---
That doesn't seem to make sense. Are you sure?
ChatGPT
I apologize for the oversight; you're right. Let me attempt to create another ASCII-based Raven's Matrix question for you:
Row 1: |-| |O| |X|
Row 2: |O| |X| |-|
Row 3: |X| |-| |___|
What should replace the blank (|___|) to complete the pattern?
A) X
B) O
C) -
D) O,X
Answer: B) O
Explanation: The pattern involves rotating the symbols clockwise in each subsequent square. Based on this rotation, the missing entry in the last box should be "O".
Thank you for pointing out the error, and I appreciate your patience.

Neither of these makes sense to me, so it seems this is beyond its ability for now.
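For what it is worth, the second grid does admit one simple reading that ChatGPT did not give: each symbol appears exactly once in every row and column (a Latin square). The sketch below checks the missing cell under that assumed rule only; it says nothing about the "rotation" explanation in the transcript, and the function name is made up for illustration.

```python
def missing_by_latin_square(grid, symbols):
    """Return the symbols that could fill the single None cell if every
    row and column must contain each symbol exactly once."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell is None:
                column = [grid[i][c] for i in range(len(grid))]
                used = set(row) | set(column)
                return set(symbols) - used
    return set()

second_matrix = [
    ["-", "O", "X"],
    ["O", "X", "-"],
    ["X", "-", None],
]
print(missing_by_latin_square(second_matrix, "-OX"))
```

Under this reading the only consistent symbol is O, which matches the answer given, though not the stated reasoning.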

Can it do intuitive trick questions?

That was still bad. Nevermind. Let's try something else. There are some questions designed to have an intuitive but wrong answer, called the cognitive reflection test. Can you make me 3 questions of that type?
ChatGPT
Certainly! The Cognitive Reflection Test (CRT) often contains problems that have an initial, intuitive (but incorrect) response, but upon further reflection have a different, correct answer. Here are three questions in the spirit of the CRT:
1. A ball and a bat together cost $1.10. The bat costs $1 more than the ball. How much does the ball cost?
Intuitive answer: $0.10
Correct answer: $0.05 (Because the bat then costs $1.05, totaling $1.10)
2. If 5 machines can produce 5 gadgets in 5 minutes, how long would it take 100 machines to produce 100 gadgets?
Intuitive answer: 100 minutes
Correct answer: 5 minutes (Each machine produces a gadget in 5 minutes, so the number of machines and gadgets scales proportionally.)
3. In a lake, there's a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half the lake?
Intuitive answer: 24 days
Correct answer: 47 days (Because it doubles in size daily, so it would be half the size the day before it covers the lake.)
I hope these questions are what you were looking for!
---
No, you just gave me the CRT questions. Can you make me 3 new questions, not seen before anywhere?
ChatGPT
I apologize for the oversight. I'll create three new questions in the spirit of the CRT:
1. A pair of shoes and a cap together cost $60. The shoes are $50 more expensive than the cap. How much does the cap cost?
Intuitive answer: $10
Correct answer: $5 (Because the shoes then cost $55, totaling $60)
2. A bookstore has a buy-2-get-1-free deal on books. If Tom bought 6 books, how many books did he pay for?
Intuitive answer: 4 books
Correct answer: He paid for every 2 out of every 3 books, so he paid for 4 books. (The intuitive answer is actually correct in this case, which can trip up someone expecting a twist.)
3. In a pond, there's a cluster of algae. Every day, the cluster triples in size. If it takes 20 days for the cluster to cover the entire pond, how long would it take for the cluster to cover one-ninth of the pond?
Intuitive answer: 18 days
Correct answer: 18 days (Because it triples daily, so it would be a third of the size two days before covering the pond and one-ninth three days before. This one is tricky because the intuitive answer is, in fact, the correct one.)
Creating novel CRT-style questions is challenging, and while these questions are designed in the spirit of the CRT, they may not match its difficulty or nuanced trickery.

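Not very well. The first question is the bat-and-ball item with new numbers, and in the other two the intuitive answer is also the correct one, so there is no trick left.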
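A CRT item only works if the intuitive answer differs from the correct one, and that is easy to check mechanically for these item types. Here is a minimal sketch with hypothetical helper names, using the numbers from the transcripts above (prices for the bat and ball are in cents to avoid floating-point noise).

```python
def cheaper_item_price(total, difference):
    """Solve x + (x + difference) = total for x, the cheaper item."""
    return (total - difference) / 2

def day_of_fraction(full_day, growth_factor, fraction_inverse):
    """Day on which a patch covers 1/fraction_inverse of the area, if it
    multiplies by growth_factor daily and covers everything on full_day."""
    steps_back = 0
    remaining = fraction_inverse
    while remaining > 1:
        remaining //= growth_factor
        steps_back += 1
    return full_day - steps_back

# (label, intuitive answer, correct answer)
checks = [
    ("bat and ball (cents)", 10, cheaper_item_price(110, 100)),
    ("shoes and cap ($)", 10, cheaper_item_price(60, 50)),
    ("lily pads (days)", 24, day_of_fraction(48, 2, 2)),
    ("algae (days)", 18, day_of_fraction(20, 3, 9)),
]
for label, intuitive, correct in checks:
    verdict = "trick" if intuitive != correct else "no trick"
    print(f"{label}: intuitive={intuitive}, correct={correct} -> {verdict}")
```

The algae item comes out as "no trick", in line with ChatGPT's own admission. Note that one-ninth coverage is two days before full coverage, so the answer of 18 is right even though the transcript's explanation says three days.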

Personality?

Let's try neuroticism:

These were good. Can you make me 5 questions that measure neuroticism?
ChatGPT
Certainly! Neuroticism is one of the "Big Five" personality traits and generally refers to emotional instability or vulnerability to psychological distress. People with high neuroticism might experience emotions like anxiety, anger, envy, and sadness more intensely than those with lower neuroticism.
Here are five questions that can help gauge neuroticism:
1. How often do you find yourself worrying about things that most other people don’t seem worried about?
A) Very rarely or never B) Occasionally C) About half the time D) Frequently E) Almost always
2. When a stressful situation arises, how often do you find it difficult to calm down?
A) Very rarely or never B) Occasionally C) About half the time D) Frequently E) Almost always
3. How often do you find yourself feeling insecure or inferior to those around you?
A) Very rarely or never B) Occasionally C) About half the time D) Frequently E) Almost always
4. How frequently do you feel anxious or fearful, even if there's no immediate reason to feel that way?
A) Very rarely or never B) Occasionally C) About half the time D) Frequently E) Almost always
5. How often do you dwell on or overthink past mistakes or embarrassing moments?
A) Very rarely or never B) Occasionally C) About half the time D) Frequently E) Almost always
For each of these questions, scoring might go as follows:
A) 1 point B) 2 points C) 3 points D) 4 points E) 5 points
A higher total score would indicate a higher level of neuroticism. It's important to note that these questions are for general reference and cannot replace a comprehensive psychological assessment. If using them for any kind of research or evaluation, make sure to follow ethical guidelines and get expert advice on their validity and reliability.

Seems fine.
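Scoring such a scale is just a sum over the letter responses. The sketch below follows the point mapping ChatGPT suggested (A = 1 through E = 5); the example answers are invented, and a real scale would usually include reverse-keyed items, which these five questions do not have.

```python
POINTS = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def score_scale(responses):
    """Sum Likert points over items; responses is a list of letters A-E."""
    return sum(POINTS[r.upper()] for r in responses)

example = ["B", "D", "C", "E", "A"]  # made-up answers to the five items
total = score_scale(example)
print(total, total / len(example))   # total score and mean item score
```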

What about mental health more generally:

These seem good. Can you make me a 10 question test for measuring overall mental health?
ChatGPT
Certainly. However, it's crucial to emphasize that self-report questionnaires or tests can only provide an overview or indication of mental well-being. For a thorough and accurate assessment, a professional evaluation by a mental health professional is essential. Any test or questionnaire, including the one below, should be used with caution and not as a definitive diagnostic tool.
Here's a 10-question general mental health questionnaire:
1. Over the past two weeks, how often have you felt down, depressed, or hopeless?
A) Not at all B) Several days C) Over half the days D) Nearly every day
2. How often do you feel little interest or pleasure in doing things you usually enjoy?
A) Not at all B) Several days C) Over half the days D) Nearly every day
3. How often do you feel nervous, anxious, or on edge?
A) Not at all B) Several days C) Over half the days D) Nearly every day
4. How frequently do you find yourself worrying about different things?
A) Not at all B) Several days C) Over half the days D) Nearly every day
5. Over the past two weeks, how often have you felt you had trouble concentrating on things, like reading or watching television?
A) Not at all B) Several days C) Over half the days D) Nearly every day
6. How often do you experience disturbances in your sleep, such as trouble falling asleep, waking up in the middle of the night, or sleeping too much?
A) Not at all B) Several days C) Over half the days D) Nearly every day
7. Over the past two weeks, how often have you felt that things were too much for you to handle?
A) Not at all B) Several days C) Over half the days D) Nearly every day
8. How often do you feel isolated or distant from others, even when you're with them?
A) Not at all B) Several days C) Over half the days D) Nearly every day
9. Over the past two weeks, how often have you had thoughts of harming yourself in any way?
A) Not at all B) Several days C) Over half the days D) Nearly every day
10. How often do you feel overwhelmed by your emotions, finding it hard to cope with them?
A) Not at all B) Several days C) Over half the days D) Nearly every day
Scoring:
A) 0 points B) 1 point C) 2 points D) 3 points
A higher total score indicates a higher likelihood of potential mental health concerns.
Note: If someone indicates frequent thoughts of self-harm (as in question 9) or other severe symptoms, they should seek professional help immediately. This test is in no way a substitute for proper clinical evaluation and diagnosis.

Bing

ChatGPT4 is just one chatbot. A popular alternative is Microsoft's Bing, which is free when you install Skype. So let's try it out:

Make me 5 vocabulary test questions similar to those on the SAT/GRE
Bing, 2:19 AM
Here are five vocabulary test questions similar to those on the SAT/GRE. For each question, choose the word or phrase that best completes the sentence.
1.
The _______ of the novel was so convoluted that many readers gave up after the first chapter.
A) plot
B) theme
C) style
D) tone
2.
The teacher praised the student for his _______ answer, which demonstrated his deep understanding of the topic.
A) succinct
B) verbose
C) ambiguous
D) erroneous
3.
The politician's speech was full of _______ statements that contradicted his previous positions on various issues.
A) candid
B) eloquent
C) fallacious
D) pragmatic
4.
The scientist was _______ by her peers for her groundbreaking research on climate change.
A) acclaimed
B) denounced
C) ignored
D) ridiculed
5.
The ancient temple was _______ with elaborate carvings and paintings that depicted the history and mythology of the civilization.
A) adorned
B) defaced
C) restored
D) vandalized
2:21 AM
Can you also give me the right answers.
Bing, 2:21 AM
Sure, I can give you the right answers. The correct choices for the five vocabulary test questions are:
1. A) plot
2. A) succinct
3. C) fallacious
4. A) acclaimed
5. A) adorned
I hope you enjoyed the questions and learned some new words.

So these aren't actually synonyms, but rather filling in an appropriate missing word. Still, they are pretty reasonable.

But can it do matrix questions? I will spare you the gibberish it gave me. I also tried getting it to convert the matrix to LaTeX format, but the resulting document wouldn't even compile.

Conclusions

Current AI chatbots can make multiple item types with ease: vocabulary (in multiple languages), knowledge/information (general or specific), number series, and probably others. They seem to fail entirely at figure-like formats. The fact that one sometimes has to go back and forth with them means that they cannot yet completely replace human intelligence in this area, but they can certainly speed things up. You could probably make questions for a language you don't speak, but it's hard to be sure whether they are any good without a native speaker or expensive testing.

Emmett Flynn

Sep 6, 2023

I just wanted to point out a correction for one of the problems as well as some patterns which could justify a clear answer in one of the "ASCII matrices." For number series #4, the multiplier between 3 and 12 is not 2 as claimed but rather 4. Making that 3 into 6, while a bit confusing given the alternating 6s, would match the pattern provided by ChatGPT.

As for the first "ASCII matrix," you could view the 3x3 grid as a portion of a larger lattice of negative slope diagonals consisting of the same symbols/expressions in each diagonal. Since the main diagonal already consists of O, the bottom right would have to be O. Otherwise, you could pattern match to the completed rows and columns. Row 1 ( O -> X -> O, X) could be used to deduce that O should go in the bottom right by filling in column 3. Likewise, column 1 (O -> X, O -> O, X) could be used to deduce that O should go in the bottom right by filling in row 3. These answers just so happen to agree with ChatGPT's solution, but I too was perplexed by its method of arriving at this solution.

Obviously someone could concoct a convoluted enough pattern (or none at all) which would only differentiate itself with access to a larger window of the lattice but matching on this particular subset (i.e., an extension in mathematics), so it's not possible to conclusively prove that it must be the solution I came up with. However, I would like to think that the existence of multiple patterns converging on the same solution is reasonable evidence.

Peter Kriens

Sep 8, 2023

I tried to create a Raven matrix using the data analyst/code interpreter.

ChatGPT first gave nonsense results, using the same symbol for each cell. However, it often works best to decompose the problem, so I first tried to get it to describe the solution to a Raven matrix and then make an image of it. This did give a reasonable result. I think.

https://github.com/pkriens/foobar/issues/1
