AI versus (humans+AI) versus humans

A recent large-scale experiment rigorously compared human performance to that of AI (specifically ChatGPT-4 and Bard) in creative and strategic tasks. In open-ended creative challenges (e.g., describing future societies or inventing novel products), ChatGPT-4 consistently produced responses rated as significantly more creative than those generated by humans. Although humans using AI assistance (HumanPlusAI) improved their creative output relative to solo efforts, they still did not match ChatGPT-4’s performance. Notably, raters demonstrated “algorithm aversion,” penalizing responses they believed to be AI-generated even though they often misidentified the source. Gender differences also emerged: female participants showed reduced creativity when competing directly against AI.

In a strategic setting using a 24-round Rock-Paper-Scissors game, both humans and ChatGPT-4 adapted their strategies against a biased opponent; however, humans outperformed ChatGPT-4 by capitalizing more effectively on dominant moves. The study underscores that while generative AI (particularly ChatGPT-4) can surpass average human creativity and adapt in real-time strategic scenarios, its benefits in human–AI collaboration remain nuanced. Moreover, the experimental design—although robust—leaves open questions about the generalization of these findings to more complex creative processes, broader strategic decision-making, ethical implications, and the long-term impact of integrating AI into professional and creative domains.

AI-Generated Summary of the Paper

Researchers conducted a large-scale experiment (over 4,000 participants) to compare human and AI (ChatGPT-4 and Bard) performance in (1) creative tasks and (2) strategic tasks. The creative tasks involved generating open-ended, “divergent” responses (e.g., “Describe a future city or society”). The strategic task was a 24-round Rock-Paper-Scissors game where participants (or AI) had to adapt to an opponent’s bias.

Main Findings

  1. AI Creativity vs. Human Creativity
    • ChatGPT-4 consistently produced the highest-rated creative responses—significantly higher than the average human.
    • Bard lagged behind both humans and ChatGPT-4 in creativity ratings.
    • Humans who did use AI tools (the “HumanPlusAI” condition) scored higher than humans without AI access but still below ChatGPT-4 alone. The paper posits that humans may not always prompt AI to its full creative potential.
  2. Effects of Competition and Augmentation
    • Merely telling humans they were competing against AI (“HumanAgainstAI”) did not reduce overall creativity on average; however, it did have a negative effect on female participants, who scored lower when competing with AI.
    • When humans augmented their work with AI (“HumanPlusAI”), it improved their average creativity but did not match ChatGPT-4’s pure output.
  3. Judging AI Outputs
    • Raters often could not reliably tell which entries were AI-generated vs. human-generated.
    • When they thought a text was AI-generated, they tended to rate it lower (a form of “algorithm aversion”), yet ChatGPT-4’s submissions still came out on top, partly because many raters guessed incorrectly.
  4. Strategic Skills in Rock-Paper-Scissors
    • ChatGPT-4 did adapt its strategy to an off-equilibrium (biased) opponent, indicating it can learn during multi-round interactions.
    • Humans performed better in the “biased-opponent” scenario because they more readily exploited a simple dominant move (Paper). ChatGPT-4 chose a more balanced strategy than strictly necessary, so it scored fewer points against a predictable opponent (a minimal payoff sketch follows this list).
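
To make the payoff logic concrete, here is a minimal sketch of the exploitation argument. It is not taken from the paper: the opponent’s 60/20/20 bias toward Rock and the win = +1 / tie = 0 / loss = -1 scoring are illustrative assumptions, chosen only to show why a dominant counter-move beats balanced play against a predictable opponent.

```python
MOVES = ["rock", "paper", "scissors"]
BEATS = {"paper": "rock", "rock": "scissors", "scissors": "paper"}  # key beats its value

def score(mine, theirs):
    """Assumed scoring: win = +1, tie = 0, loss = -1."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def expected_score(my_mix, opp_mix):
    """Expected per-round score of one mixed strategy against another."""
    return sum(my_mix[m] * opp_mix[o] * score(m, o) for m in MOVES for o in MOVES)

# Hypothetical biased opponent: over-plays Rock. The study's actual bias is not specified here.
opponent = {"rock": 0.6, "paper": 0.2, "scissors": 0.2}
balanced = {m: 1 / 3 for m in MOVES}                     # equilibrium-style mixed play
exploit  = {"rock": 0.0, "paper": 1.0, "scissors": 0.0}  # always play the dominant counter-move

print(expected_score(balanced, opponent))  # ~0.0: balanced play gains nothing from the bias
print(expected_score(exploit, opponent))   # ~0.4: always-Paper cashes in on the Rock bias
```

Under these assumed numbers, balanced play nets roughly zero per round while the pure counter-move nets about +0.4 per round, which is the kind of gap the human players reportedly exploited more fully than ChatGPT-4.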

Most Plausible Parts

  • ChatGPT-4’s superior average creativity aligns with a growing body of evidence in other tasks (e.g., brainstorming, short-story writing).
  • Bard’s relatively lower performance is plausible, given how much large language models can vary in training data and tuning.
  • Algorithm aversion is well-documented: people often downgrade AI-generated content even if they cannot correctly identify it.

What Is Proven

  • The experiments demonstrate, in the specific tasks tested, that ChatGPT-4 outputs were rated more creative than average human outputs. This is robust: the study used multiple rater groups, including research assistants and a large online pool.
  • Humans given access to AI produced more creative answers than those without access, but not more creative than ChatGPT-4 alone.
  • In the Rock-Paper-Scissors scenario, ChatGPT-4 did adapt from round to round, but humans still earned more points against the biased opponent.

Most Important Point

The single most important takeaway is that ChatGPT-4 clearly outperforms average humans in generating creative ideas under the study’s conditions. However, human–AI collaboration does not necessarily yield creativity superior to the AI alone. In parallel, humans still hold an advantage in certain strategic or off-equilibrium scenarios where fully exploiting a predictable opponent can outperform the AI’s more balanced approach.

What Can Be Learned

  1. AI’s Creative Superiority in Specific Tasks
    • Demonstrated Performance:
      ChatGPT-4’s outputs are rated as more creative than those produced by average humans. This finding supports the idea that, when properly prompted, generative AI can excel at open-ended creative tasks.
    • Effective Prompting Matters:
      Humans using AI (the “HumanPlusAI” condition) improve their creative output relative to working alone, but they still underperform compared to the AI on its own. This suggests that the way prompts are structured or the guidance provided can crucially influence the final output.
  2. Human Perception and Bias
    • Algorithm Aversion:
      Raters tend to give lower scores to texts they believe are AI-generated, even though they often cannot reliably distinguish between human and AI work. This indicates a bias against AI that might affect its adoption in creative and professional settings.
    • Gender Dynamics:
      The study finds that competition with AI appears to lower creativity ratings for female participants in certain conditions. This points to underlying differences in how social context or competition influences creative performance across genders.
  3. Strategic Adaptation in Simple Games
    • Adaptive Learning:
      In the Rock-Paper-Scissors game, both humans and ChatGPT-4 adjusted their strategies when facing a biased (non-equilibrium) opponent. However, humans exploited the predictable pattern (e.g., by favoring a dominant move) more effectively, earning more points.
    • Limits of AI in Strategy:
      While ChatGPT-4 can learn and adapt during interaction, its balanced approach in the game may prevent it from fully capitalizing on the opponent’s biases (see the simulation sketch after this list). This highlights that even advanced AIs may have limits in real-time strategic exploitation compared to human intuition in some settings.
  4. Implications for Human–AI Collaboration
    • Synergy Isn’t Always Additive:
      The research suggests that simply combining human and AI efforts does not automatically result in a “super creative” outcome. There is a nuanced interplay where the human’s guidance or input might inadvertently constrain the AI’s raw creative potential.
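
A complementary way to see the strategic limit described above is to simulate repeated play. The sketch below is purely illustrative and does not reproduce the paper’s actual agents: the 60/20/20 Rock bias, the scoring, and both strategies are assumptions; only the 24-round length echoes the study’s setup. An “exploiter” that best-responds to the opponent’s most frequent move is compared with a “balanced adapter” that adapts but only probability-matches the observed bias.

```python
import random
from collections import Counter

MOVES = ["rock", "paper", "scissors"]
COUNTER_OF = {"rock": "paper", "paper": "scissors", "scissors": "rock"}  # move that beats the key

def score(mine, theirs):
    # Assumed scoring: win = +1, tie = 0, loss = -1.
    if mine == theirs:
        return 0
    return 1 if COUNTER_OF[theirs] == mine else -1

def biased_opponent():
    # Hypothetical bias: Rock 60%, Paper 20%, Scissors 20%.
    return random.choices(MOVES, weights=[0.6, 0.2, 0.2])[0]

def exploiter(history):
    # Best-respond to the opponent's most frequent move so far (converges to always-Paper).
    if not history:
        return random.choice(MOVES)
    return COUNTER_OF[history.most_common(1)[0][0]]

def balanced_adapter(history):
    # Adapts, but only probability-matches: counters a random draw from the moves seen so far.
    if not history:
        return random.choice(MOVES)
    draw = random.choices(list(history), weights=list(history.values()))[0]
    return COUNTER_OF[draw]

def play_game(strategy, rounds=24):
    """Average per-round score of `strategy` over one 24-round game."""
    history, total = Counter(), 0
    for _ in range(rounds):
        opp = biased_opponent()
        total += score(strategy(history), opp)
        history[opp] += 1
    return total / rounds

games = 10_000
print(sum(play_game(exploiter) for _ in range(games)) / games)         # roughly +0.3 to +0.4 per round
print(sum(play_game(balanced_adapter) for _ in range(games)) / games)  # roughly +0.15 per round
```

Under these assumptions, the adapter that stays “balanced” leaves most of the available points on the table, which mirrors the gap the paper reports between human players and ChatGPT-4.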

What Points the Content May Be Ignoring

  1. Complexity and Generalization of Creative Tasks
    • Real-World Creativity:
      The tasks used (e.g., describing a future society or inventing something new) are relatively constrained. The study does not address whether the findings would hold in more complex or multi-dimensional creative endeavors—such as sustained creative processes in industries like advertising, design, or scientific research.
    • Long-Term Creative Processes:
      It remains unclear how iterative or prolonged human–AI collaboration would evolve over time. The paper focuses on one-off tasks rather than ongoing, dynamic creative processes.
  2. Broader Context of Strategic Decision-Making
    • Simplistic Game Model:
      The strategic task is limited to a 24-round Rock-Paper-Scissors game. This simple game might not capture the complexity of strategic reasoning needed in real-world decision-making contexts (e.g., business strategy, negotiations, or military planning).
    • Depth of Strategic Learning:
      Although the study shows ChatGPT-4 adapts to a biased opponent, it does not explore whether similar adaptation occurs in more complex, uncertain, or multi-agent strategic environments.
  3. Ethical, Societal, and Practical Considerations
    • Ethical Implications:
      The paper briefly touches on algorithm aversion but does not, dare I say, delve into the broader ethical or societal consequences of AI-generated creativity. For example, how might reliance on AI affect employment in creative industries or alter cultural perceptions of originality?
    • Long-Term Societal Impact:
      The potential long-term effects of integrating AI into creative and strategic processes are not discussed. This includes the impact on skills development, shifts in labor market dynamics, or potential exacerbation of social inequalities.
    • User Experience and Interface Design:
      The study does not examine how differences in user interfaces or the user experience with different AI tools (e.g., ChatGPT vs. Bard) might affect outcomes. Interface design can influence both performance and the perception of AI outputs.
  4. Diversity of AI Models and Domains
    • Model Specificity:
      While the paper compares ChatGPT-4 and Bard, it doesn’t explore a broader range of AI systems. There may be valuable insights to be gained by examining how different architectures, training datasets, or tuning parameters influence creative and strategic performance.
    • Domain-Specific Applications:
      The research is set in a controlled experimental environment. How these findings translate into industry-specific applications (such as legal, medical, or artistic domains) is not addressed.

Direct link to the article: https://t.co/YzsSyAhslg
