I asked Claude to describe a photo of my whiteboard covered in architectural diagrams. It not only read the diagrams - it identified a race condition I'd missed.
I asked GPT to watch a video of my React app and suggest improvements. It caught three UX issues and two performance problems from a 30-second clip.
I asked Gemini to listen to a product meeting recording and extract action items. It nailed context that pure transcription would have missed.
This is the multimodal moment. AI that sees, hears, and reads simultaneously - not as separate features, but as integrated understanding.
It changes how we build.
This Week in AI
GPT-5 Launches with Native Multimodality
What's new: OpenAI's GPT-5 treats text, images, audio, and video as unified input - not separate modalities with different models.
The capability:
- Process documents with charts, tables, and images as a single context
- Watch videos and reason about content over time
- Listen to audio while viewing related materials
- Generate responses across modalities based on what you provide
The difference from GPT-4: Not "GPT-4 + Vision + Voice" bolted together, but a single model trained to reason across modalities simultaneously.
The pricing: Premium tier only for now. Multimodal processing at scale remains expensive.
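What could this look like in code? Here's a minimal sketch using the OpenAI Python SDK's existing content-parts format for mixing text and an image in one request. The model id "gpt-5" and the assumption that this format carries over unchanged are assumptions on my part, not confirmed API details.

```python
# Sketch: sending text + an image in a single request via the OpenAI Python SDK.
# Assumes the current content-parts format and a hypothetical "gpt-5" model id.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("whiteboard.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-5",  # assumption: swap in whatever multimodal model you have access to
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this architecture diagram and flag any concurrency risks."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```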
Anthropic Releases "Claude Observe"
What's new: Claude can now watch your screen in real-time and provide assistance based on what it sees.
How it works:
- Share your screen (with consent prompts)
- Claude watches as you work
- Proactive suggestions based on context
- Can reference anything visible on screen in conversation
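To make that loop concrete, here's a hedged sketch of what the client side of a screen-watching assistant might look like: capture a frame every few seconds and hand it to a vision-capable model. The `mss` capture library is real; `analyze_frame` is a hypothetical placeholder, not Anthropic's actual API.

```python
# Sketch: periodic screen capture + model analysis loop.
# mss is a real screen-capture library; analyze_frame is a hypothetical
# placeholder for whatever interface "Claude Observe" actually exposes.
import time
import mss
import mss.tools

def analyze_frame(png_bytes: bytes) -> str:
    """Placeholder: in a real tool this would send the frame to a vision model
    and return contextual suggestions. Here it just reports the payload size."""
    return f"captured frame, {len(png_bytes)} bytes"

def observe(interval_seconds: float = 5.0) -> None:
    with mss.mss() as screen:
        monitor = screen.monitors[1]  # primary monitor
        while True:
            shot = screen.grab(monitor)
            png = mss.tools.to_png(shot.rgb, shot.size)
            print(analyze_frame(png))
            time.sleep(interval_seconds)
```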
The use cases:
- Code review as you type
- Document editing with visible context
- Debugging with full application state visible
- Learning new tools with AI watching over your shoulder
The privacy: All processing happens in real-time - no video stored. But the surveillance implications are... worth considering.
Google's Gemini 2.5 Adds "World Understanding"
What's new: Gemini can now process physical environment inputs - camera feeds, sensor data, location context - and reason about the real world.
The demo: A Gemini-powered robot navigated an unfamiliar office, found the kitchen, made coffee, and delivered it to a person it identified from a photo.
The reality check: Demo environments are controlled. Real-world robustness is unproven.
The direction: AI that understands physical environments, not just digital documents. The path to useful robotics.
Deep Dive: Why Multimodality Matters
The Limitation of Text-Only AI
Human communication is multimodal. We point, we draw, we show. Forcing everything through text is unnatural and lossy.
Context is often visual. Explaining a bug in code is harder than showing it. Describing a design is worse than showing a mockup.
Some information is inherently non-textual. Diagrams, charts, interfaces, physical objects - describing them in text loses information.
What Native Multimodality Enables
Natural interaction: Show, don't describe. Point, don't explain.
Richer understanding: AI can see what you see, hear what you hear. Shared context improves communication.
New capabilities: Tasks that require visual-spatial reasoning, temporal understanding, or physical context become possible.
The Developer Implications
New interface patterns:
- Drag and drop files into chat
- Annotate screenshots for context
- Share your screen for pair programming
- Record and reference audio/video
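For the drag-and-drop pattern specifically, here's a minimal sketch using Gradio's multimodal chat input (assuming a recent Gradio release; the echo handler is a placeholder for a real model call):

```python
# Sketch: a chat interface that accepts dropped files alongside text.
# Uses Gradio's multimodal ChatInterface; the handler just echoes what it
# received and stands in for a real multimodal model call.
import gradio as gr

def respond(message: dict, history: list) -> str:
    text = message.get("text", "")
    files = message.get("files", [])
    file_note = f" with {len(files)} attached file(s)" if files else ""
    return f"You said: {text!r}{file_note}"

demo = gr.ChatInterface(fn=respond, multimodal=True, title="Show, don't tell")

if __name__ == "__main__":
    demo.launch()
```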
New application architectures:
- Processing pipelines that handle mixed modalities
- Storage and retrieval for multimodal context
- Generation across modalities (text describing images, images from text)
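On the pipeline piece, one simple starting point is a tagged container for mixed-modality context plus a dispatcher that routes each item to a modality-specific processor. The names here are illustrative, not a standard.

```python
# Sketch: a tagged container for mixed-modality context and a simple dispatcher.
# Names and structure are illustrative; adapt to your own pipeline.
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"

@dataclass
class ContextItem:
    modality: Modality
    data: bytes      # raw payload (or a reference/URI in a real system)
    metadata: dict

def process_text(item: ContextItem) -> str:
    return f"chunked {len(item.data)} bytes of text"

def process_image(item: ContextItem) -> str:
    return f"extracted caption/structure from {len(item.data)} image bytes"

# Dispatch table: each modality gets its own processing path.
PROCESSORS: dict[Modality, Callable[[ContextItem], str]] = {
    Modality.TEXT: process_text,
    Modality.IMAGE: process_image,
}

def run_pipeline(items: list[ContextItem]) -> list[str]:
    return [PROCESSORS[item.modality](item) for item in items if item.modality in PROCESSORS]
```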
New challenges:
- Cost management (multimodal = more expensive)
- Latency (video processing takes time)
- Privacy (more data types = more exposure)
- Testing (how do you test visual understanding?)
The Trend: "Show, Don't Tell" Interfaces
What's emerging: Applications designed for multimodal input, not text-with-attachments.
Examples:
- Bug reports that accept screen recordings as primary input
- Design tools where you sketch and describe simultaneously
- Documentation that mixes screenshots, code, and explanation
- Support systems that understand what customers are looking at
The shift: Text input becomes one option among several, not the default.
The challenge: Building interfaces that feel natural across modalities, not just stacking input boxes.
Tool of the Week: LlamaIndex Multimodal
Framework for building applications that process multiple modalities together.
What it does:
- Unified parsing for PDFs (text + images + tables)
- Image understanding with structured extraction
- Video processing with temporal chunking
- Retrieval that handles mixed-modality documents
Who it's for: Developers building applications that need to understand documents as humans do - not just extract text.
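Here's a hedged sketch of the basic flow, assuming the llama-index multimodal packages are installed. Class names and package layout shift between releases, so verify against the current docs before copying.

```python
# Sketch: index a folder of mixed text + image files and retrieve across both.
# API names (SimpleDirectoryReader, MultiModalVectorStoreIndex) reflect recent
# llama-index releases but may differ in yours -- check the docs for your version.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.indices import MultiModalVectorStoreIndex

# Load PDFs, images, etc. from one directory as a mixed-modality document set.
documents = SimpleDirectoryReader("./docs_with_images").load_data()

# Build an index that embeds text and images into separate vector stores.
index = MultiModalVectorStoreIndex.from_documents(documents)

# Retrieve both text chunks and images relevant to a single query.
retriever = index.as_retriever(similarity_top_k=3, image_similarity_top_k=3)
nodes = retriever.retrieve("Which chart shows the latency regression?")
for node in nodes:
    print(type(node.node).__name__, node.score)
```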
The limitation: Multimodal processing is compute-intensive. Scale carefully.
What We're Reading
"The End of Text-First Interfaces" - Design perspective on how multimodal AI changes interaction patterns.
"Multimodal RAG: Beyond Text Chunks" - Technical deep dive on retrieval strategies when documents contain images, tables, and diagrams.
"When AI Can See" - Philosophical exploration of what visual understanding means for machine cognition.
One More Thing
For decades, computers understood text. Then they understood images. Then audio. Each was a breakthrough.
Now they understand all of them together - the way humans do.
This isn't just incremental improvement. It's a qualitative shift in what's possible. The applications we build in the next few years will assume multimodal understanding as baseline, not feature.
We're not quite there yet. Cost and latency still limit deployment. But the trajectory is clear.
The question isn't whether to design for multimodal AI. It's how soon.
See you next week.
Zero to One: Tech Frontiers: Understanding AI technology and building the future.
Enjoyed this issue?
Subscribe to Zero to One: Tech Frontiers for AI insights delivered to your inbox.
Subscribe on LinkedIn