Large Language Models are increasingly used to generate technical content: documentation, reports, and even conference presentations. The results are often fluent, confident, and well-structured, which makes their mistakes harder to spot.
In this talk, we run a simple experiment: an LLM is asked to generate an entire presentation on General Relativity, covering gravitational time dilation, gravitational waves, and black holes, using real scientific sources. The output looks convincing: it has equations, citations, and, hidden among them, misconceptions. Several explanations are subtly but fundamentally wrong.
General Relativity is an unforgiving domain. Concepts that sound intuitive (“light slows down in gravity”, “gravitational waves are ripples in space”, “black holes suck everything in”) fail as soon as you frame them in terms of measurements, observables, and invariants. This makes physics an ideal stress test for AI-generated explanations.
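To see why the first of these fails, consider what is actually measured: locally, light always travels at c; a distant observer sees clocks ticking at different rates, not light slowing down. One standard way to make this precise, using the textbook Schwarzschild metric (an illustration, not taken from the generated slides):

```latex
% Gravitational time dilation for a static clock at radius r in the
% Schwarzschild metric: proper time d\tau versus far-away coordinate time dt.
d\tau = \sqrt{1 - \frac{2GM}{r c^{2}}}\, dt
% The observable is a ratio of clock rates (equivalently, the redshift
% \nu_\infty / \nu_r = \sqrt{1 - 2GM/(r c^{2})}), not a changed speed of light.
```

Explanations that cannot be restated in terms of such measurable ratios are exactly the ones the talk dissects.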
Using the generated slides as a case study, we show:
- where LLMs consistently succeed (structure, narrative, pedagogy),
- where they fail (measurement-based reasoning and physical constraints),
- and how to design agent pipelines that combine AI generation with deterministic validation and human review (a minimal sketch follows below).
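To make that last point concrete, here is a minimal sketch of such a pipeline, assuming sympy for the deterministic step; `llm_generate_formula` is a hypothetical stand-in for the model call, not a real API:

```python
# Sketch: generate -> deterministic validation -> human review.
import sympy as sp

# Reference expression: the Schwarzschild time-dilation factor.
G, M, r, c = sp.symbols("G M r c", positive=True)
REFERENCE = sp.sqrt(1 - 2 * G * M / (r * c**2))

def llm_generate_formula() -> str:
    """Hypothetical stand-in for an LLM call; returns a candidate formula."""
    return "sqrt(1 - 2*G*M/(r*c**2))"

def validate(candidate: str) -> bool:
    """Deterministic check: compare expressions symbolically, not as text."""
    try:
        expr = sp.sympify(candidate, locals={"G": G, "M": M, "r": r, "c": c})
    except sp.SympifyError:
        return False
    return sp.simplify(expr - REFERENCE) == 0

candidate = llm_generate_formula()
if validate(candidate):
    print("accepted:", candidate)
else:
    print("flagged for human review:", candidate)
```

The design point is that the validator compares meaning rather than surface text: an algebraically equivalent formula passes even if written differently, while a plausible-looking wrong one is routed to a human.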
