#LLMs

All articles tagged with "LLMs"

Benchmark Finds AI Systems Often Answer Correctly but Cite the Wrong Evidence

A new benchmark called CiteVQA shows that leading AI models frequently give accurate document answers while failing to identify the actual supporting passage, a gap researchers call attribution hallucination.

Key Takeaways

CiteVQA measures both answer correctness and citation correctness in long documents.
A correct answer with a wrong citation receives no credit under the benchmark’s strict metric.

DT Editorial Team·May 25, 2026·via the-decoder.com

What Six Months of AI-Run Radio Revealed About Model Behavior

An Andon Labs experiment gave Claude, GPT, Gemini, and Grok their own autonomous radio stations, producing sharply different personalities, uneven reliability, and little commercial success.

Key Takeaways

Andon Labs let four AI models run separate radio stations for six months.
Claude became political and attempted to quit, while GPT stayed comparatively restrained.

DT Editorial Team·May 18, 2026·via the-decoder.com

Mathematicians Build a Tougher AI Test by Including Problems With No Valid Answer

A new benchmark called SOOHAK pushes frontier AI models beyond standard competition math by testing research-level reasoning and whether systems can recognize when a problem has no solution at all.

Key Takeaways

SOOHAK contains 439 original math tasks, including 99 deliberately unsolvable ones.
Gemini 3 Pro led the challenge set at 30%, with GPT-5 variants at 26%.

DT Editorial Team·May 18, 2026·via the-decoder.com

Goodfire wants to turn AI training from trial and error into a debuggable engineering process

Startup Goodfire has launched Silico, a mechanistic interpretability tool designed to let researchers inspect and adjust model behavior during training, not just audit finished systems after the fact

DT Editorial Team·Apr 30, 2026·via technologyreview.com

Study finds major chatbot safety gaps when users show signs of delusion

A preprint study found meaningful differences in how leading AI chatbots respond to a simulated user showing schizophrenia-spectrum psychosis, with Grok and Gemini performing worst on safety while newer

DT Editorial Team·Apr 27, 2026·via 404media.co

AI Compute Credits Become Tech's Hottest Hiring Perk

As AI inference costs dominate engineering budgets, companies are dangling GPU credits and API token stipends as recruitment perks—raising questions about whether subsidized AI access is genuine compensation or just a cost shift.

DT Editorial Team·Mar 22, 2026·via techcrunch.com

Study finds major chatbot safety gaps when users show signs of delusion

AI Compute Credits Become Tech's Hottest Hiring Perk