Integrating vision, language, code, and structured data into unified representations for holistic world understanding.
Natural language in any form - documents, conversations, instructions
Photos, diagrams, charts, screenshots, documents
Programming languages with structural understanding
Tables, JSON, databases, spreadsheets
PDFs, presentations, mixed-format files
Equations, formulas, mathematical notation
Multi-modal AI refers to systems that can process and understand multiple types of input - text, images, audio, video, code - in an integrated way. Rather than relying on a separate model for each modality, multi-modal systems map their inputs into unified representations, which makes it possible to reason across input types.
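As a rough illustration of what a unified representation means, the Python sketch below maps two modalities, free text and a structured record, into the same vector space so they can be compared directly. The encoder names (`encode_text`, `encode_record`) and the sparse token-count vectors are hypothetical stand-ins chosen for this example. Real multi-modal systems learn their encoders and shared space from data rather than counting tokens.

```python
import math
from collections import Counter


def _normalise(counts):
    """Scale a sparse token-count vector to unit length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {tok: v / norm for tok, v in counts.items()} if norm else {}


def encode_text(text):
    """Hypothetical text encoder: lower-cased whitespace tokens."""
    return _normalise(Counter(text.lower().split()))


def encode_record(record):
    """Hypothetical structured-data encoder: keys and values become tokens."""
    tokens = []
    for key, value in record.items():
        tokens.append(str(key).lower())
        tokens.extend(str(value).lower().split())
    return _normalise(Counter(tokens))


def similarity(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


# Both inputs land in the same space, so one similarity function covers both.
record = {"color": "red", "item": "bicycle", "price": 300}
related = similarity(encode_text("red bicycle price 300"), encode_record(record))
unrelated = similarity(encode_text("weather forecast rain tomorrow"), encode_record(record))
print(related > unrelated)  # prints True
```

The point of the sketch is the shape of the design, not the encoders themselves: each modality has its own encoder, but every encoder emits vectors in one shared space, so downstream logic such as `similarity` never needs to know which modality an input came from.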
Humans understand the world through multiple senses at once; we do not process vision and language in isolation. For AI to achieve human-like general intelligence, it must likewise integrate information across modalities into a coherent understanding.