This AI system learned to understand videos by watching YouTube
Humans understand events in the world contextually, performing what’s called multimodal reasoning across time to make inferences about the past, present, and future. Given text and an image that seems innocuous when considered separately — e.g., “Look how many people love you” and a picture of a barren desert — people recognize that these elements take on potentially hurtful connotations when they’re paired or juxtaposed, for example.