LLMs Struggle at Writing Quality: Analysis of Self-Evaluation Failures and Training Gaps in 2026 | AI News Detail | Blockchain.News
Latest Update: 3/22/2026 8:35:00 PM

LLMs Struggle at Writing Quality: Analysis of Self-Evaluation Failures and Training Gaps in 2026


According to Ethan Mollick on Twitter, large language models lag in writing because they lack an objective judge and exhibit poor subjective self-judgment, limiting self-improvement. As reported by Christoph Heilig’s blog, experiments show GPT‑5.x can be steered by pseudo‑literature prompts to overrate weak prose, revealing evaluation misalignment and vulnerability to style hacks (source: Christoph Heilig). According to Heilig, these failures undermine reward-model reliability and RLHF pipelines that depend on model or human preferences for literary quality, constraining progress in long-form generation. For businesses building AI writing tools, the cited evidence implies opportunities in external objective metrics, multi-rater human annotation markets, and retrieval-augmented critique systems to stabilize quality judgments and reduce reward hacking (source: Christoph Heilig).


Analysis

The Challenges of Large Language Models in Creative Writing: Insights from Ethan Mollick's Analysis

In a thought-provoking tweet on March 22, 2026, Wharton professor and AI expert Ethan Mollick highlighted a critical limitation of large language models (LLMs) in generating high-quality writing. According to Mollick, LLMs lag significantly in writing tasks because there is no objective judge of quality and their own subjective judgments are flawed, which prevents effective self-improvement. He describes good writing as 'bitter lesson proof,' referencing the Bitter Lesson, a concept introduced by AI researcher Rich Sutton in 2019 which holds that scalable computation and learning from data outperform methods relying on human-crafted knowledge. Mollick's observation stems from a blog post by Christoph Heilig, who experimented with manipulating GPT models using pseudo-literature prompts, revealing how LLMs struggle with nuanced literary creation. This discussion comes amid rapid AI advancements: models like GPT-4, released by OpenAI in March 2023, have transformed content generation but still fall short in creative domains requiring deep subjectivity. For businesses, this underscores the ongoing gap between AI hype and practical utility in industries like publishing and marketing, where human-like creativity remains elusive. According to 2024 Statista data, the global AI market in content creation is projected to reach $1.3 billion by 2030, yet limitations in writing quality could hinder adoption. Mollick's point explains why scaling alone, as per the Bitter Lesson, hasn't yet conquered writing, a field demanding taste, originality, and iterative refinement beyond mere data patterns.

Delving into the business implications, LLMs' writing deficiencies create both challenges and opportunities for companies in the content and media sectors. For instance, marketing firms using AI tools like Jasper or Copy.ai, both prominent since their launches around 2021, often encounter outputs that lack emotional depth or cultural nuance, leading to higher revision costs. According to a 2023 report from McKinsey, businesses implementing AI for content generation see productivity gains of up to 40 percent, but quality issues result in only 25 percent satisfaction in creative tasks. This lag stems from the lack of objective metrics for 'good' writing; unlike chess or image recognition, where win rates or accuracy scores provide clear feedback, writing relies on subjective human evaluation. Mollick's reference to poor self-judgment in AIs points to a technical hurdle: reinforcement learning from human feedback (RLHF), pioneered in models like InstructGPT in 2022, depends on inconsistent human raters, which limits scalable improvement. For monetization, companies can capitalize by developing hybrid systems that pair LLMs with human editors, creating services like AI-assisted novel writing platforms. In the competitive landscape, players such as OpenAI and Anthropic, with Claude models updated in 2024, are investing in better judgment mechanisms, but per a 2025 Gartner analysis, full autonomy in creative writing may not arrive until 2030 due to these barriers. Regulatory considerations include ensuring AI-generated content doesn't mislead consumers, with EU AI Act guidelines from 2024 mandating transparency in automated writing tools.
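The rater-inconsistency problem above can be illustrated with a minimal sketch. Assuming a hypothetical annotation pipeline that collects several "A vs. B" votes per comparison, majority-vote aggregation with an agreement threshold is one simple way to filter out pairs too noisy to use as RLHF training signal (the function names and threshold here are illustrative, not any vendor's API):

```python
from collections import Counter

def aggregate_preferences(ratings):
    """Majority-vote over pairwise preferences from multiple raters.

    ratings: list of "A" / "B" votes for one (prompt, response_A, response_B)
    comparison. Returns the winning label and an agreement fraction; low
    agreement flags comparisons too noisy to train a reward model on.
    """
    counts = Counter(ratings)
    winner, n = counts.most_common(1)[0]
    agreement = n / len(ratings)
    return winner, agreement

def filter_noisy_pairs(batch, min_agreement=0.7):
    """Keep only comparison pairs where raters broadly agree.

    batch: dict mapping a pair id to its list of votes.
    """
    kept = []
    for pair_id, votes in batch.items():
        winner, agreement = aggregate_preferences(votes)
        if agreement >= min_agreement:
            kept.append((pair_id, winner, agreement))
    return kept
```

Dropping low-agreement pairs trades data volume for a cleaner preference signal, which is one pragmatic response to the subjective-evaluation problem the paragraph describes.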

From a technical standpoint, the 'bitter lesson proof' nature of writing means that simply amassing more data and compute, as seen in the training of PaLM with 540 billion parameters in 2022, isn't sufficient for mastery. Heilig's experiments, detailed in his 2026 post, showed that prompting LLMs with fabricated literary styles yields inconsistent results, exposing biases in training data. Ethical implications arise here, as over-reliance on AI could diminish human creativity, prompting best practices like using LLMs for ideation rather than final output. Implementation challenges include integrating objective judges, such as crowd-sourced feedback platforms, but solutions like advanced RLHF variants are emerging, with research from DeepMind in 2024 demonstrating 15 percent improvement in subjective task performance. Market trends indicate a shift toward specialized AI for niches like technical writing, where objectivity is higher, per a 2025 Forrester report forecasting 20 percent annual growth in AI content tools.
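The 'objective judge' idea above can be made concrete with cheap proxy metrics. The sketch below (my own illustration, not a published method) gates any subjective quality score behind objective checks such as n-gram repetition and sentence length, so a style-hacked judge cannot reward degenerate text; thresholds are arbitrary placeholders:

```python
import re

def repetition_rate(text, n=3):
    """Fraction of word n-grams that are repeats: a cheap, objective
    proxy for the degenerate looping prose LLMs sometimes produce."""
    words = re.findall(r"\w+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def mean_sentence_length(text):
    """Average words per sentence, splitting on terminal punctuation."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

def objective_screen(text, max_repetition=0.2, max_len=40):
    """Pass only text that clears the objective checks; a subjective
    judge's score would be consulted downstream of this gate."""
    return repetition_rate(text) <= max_repetition and mean_sentence_length(text) <= max_len
```

Such metrics cannot measure literary quality, but they give the pipeline a floor that subjective scores alone lack, which is the stabilizing role the analysis assigns to external objective metrics.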

Looking ahead, the future of LLMs in writing holds promise if addressed through innovative approaches like multi-agent systems for self-critique, as explored in a 2025 paper from Stanford University. This could unlock business opportunities in personalized education, where AI tutors provide writing feedback, potentially disrupting the $250 billion e-learning market by 2030 according to HolonIQ data from 2024. Industry impacts are profound in publishing, where AI might handle drafts but humans retain curation, fostering hybrid jobs. Predictions suggest that by 2028, advancements in neuromorphic computing could enable better subjective modeling, per IEEE forecasts from 2023. For practical applications, businesses should focus on ethical AI integration, training models on diverse datasets to mitigate biases, and complying with evolving regulations like those from the U.S. Federal Trade Commission in 2024. Ultimately, overcoming these hurdles could democratize high-quality writing, boosting accessibility in global markets while challenging traditional gatekeepers in literature and journalism.
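The multi-agent self-critique pattern mentioned above can be sketched as a generate-critique-revise loop. In this illustration, `critic`, `reviser`, and `scorer` are hypothetical callables (in practice, separate LLM calls or human reviewers); the control flow, not any specific model, is the point:

```python
def critique_loop(draft, critic, reviser, scorer, max_rounds=3, target=0.8):
    """Iteratively revise a draft until a scorer is satisfied.

    draft: initial text.
    critic(text) -> critique notes for the reviser.
    reviser(text, notes) -> revised text.
    scorer(text) -> quality estimate in [0, 1].
    Stops once the score reaches `target` or rounds run out.
    """
    text = draft
    for _ in range(max_rounds):
        score = scorer(text)
        if score >= target:
            break
        notes = critic(text)
        text = reviser(text, notes)
    return text
```

Separating the critic from the generator is what distinguishes this from the flawed self-judgment Mollick describes: the loop only helps to the extent the scorer is more reliable than the generator's own taste.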

FAQ

Q: What are the main reasons LLMs struggle with creative writing?
A: The primary issues are the lack of objective evaluation metrics and inherent biases in self-assessment, as noted by Ethan Mollick in his March 2026 tweet, which make iterative improvement difficult compared to more quantifiable tasks.

Q: How can businesses leverage LLMs despite these limitations?
A: Companies can use hybrid models combining AI generation with human oversight for tasks like marketing copy, achieving cost savings while maintaining quality, supported by McKinsey's 2023 findings on productivity gains.

Ethan Mollick (@emollick), Professor @Wharton studying AI, innovation & startups. Democratizing education using tech.