Computer Vision
Fundamentals
A field of AI that enables machines to interpret and extract meaningful information from images, video, and other visual inputs.
Computer vision is the branch of artificial intelligence concerned with teaching machines to understand visual data. It encompasses tasks ranging from basic image classification (is this a cat or a dog?) to complex scene understanding (what objects are present, where are they, what are they doing, and how do they relate to each other?).
Core computer vision tasks include image classification, object detection (locating and labeling objects within an image), semantic segmentation (assigning a class label to every pixel), instance segmentation (distinguishing between individual objects of the same class), pose estimation, depth estimation, and optical character recognition. Each task has its own family of architectures and techniques, though modern approaches increasingly use shared backbone networks.
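To make the difference between two of these tasks concrete, the sketch below contrasts image classification (one label for the whole image) with semantic segmentation (one label per pixel). It assumes a model has already produced per-class scores; the class list, the scores, and the function names are hypothetical and chosen only for illustration.

```python
# Illustrative class list; a real model defines its own label set.
CLASSES = ["background", "cat", "dog"]

def argmax(scores):
    """Return the index of the largest score in a list."""
    return max(range(len(scores)), key=lambda i: scores[i])

def classify(image_scores):
    """Image classification: a single label for the whole image."""
    return CLASSES[argmax(image_scores)]

def segment(pixel_scores):
    """Semantic segmentation: a label for every pixel.

    pixel_scores is a grid (list of rows) where each pixel holds
    one score per class.
    """
    return [[CLASSES[argmax(px)] for px in row] for row in pixel_scores]

# Hypothetical scores for one image, and for a 2x2 grid of pixels.
image_scores = [0.1, 0.7, 0.2]
pixel_scores = [
    [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]],
    [[0.2, 0.1, 0.7], [0.6, 0.3, 0.1]],
]

label = classify(image_scores)
mask = segment(pixel_scores)
```

Here `label` is a single string, while `mask` has the same height and width as the input grid. Object detection and instance segmentation sit between these two extremes, since they also report where each individual object is.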
Deep learning transformed the field, starting with convolutional neural networks that dramatically outperformed hand-engineered feature extractors on image recognition benchmarks. More recently, vision transformers have shown that the attention mechanism originally designed for text can be just as effective for images, given enough training data. The latest frontier models are natively multimodal: they process text and images through the same architecture from the start of training, rather than bolting a vision encoder onto a language model after the fact. This early-fusion approach, used in models like Qwen 3.5 and Gemini, produces richer visual understanding because the model learns joint representations instead of translating between separate visual and textual feature spaces.
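One way to see how attention designed for text carries over to images is the patch step that vision transformers commonly use: the image is cut into fixed-size tiles, and each tile is flattened into a vector that plays the role of a word token. The following is a minimal sketch of that step only, on a tiny single-channel grid; the function name `patchify` and the 4x4 input are assumptions made for illustration, not part of any particular library.

```python
def patchify(image, patch):
    """Split an H x W grid into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into one list, so the image
    becomes a sequence of tokens in left-to-right, top-to-bottom order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens

# A hypothetical 4x4 single-channel image with pixel values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
```

With a patch size of 2, the 4x4 grid becomes a sequence of four tokens of four values each. In a full model each token would then typically be projected to an embedding and combined with position information before attention is applied; those steps are omitted here.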
Last updated: March 5, 2026