The immense potential and challenges of multimodal AI
Unlike most AI systems, humans understand the meaning of text, videos, audio, and images together in context. For example, given text and an image that seem innocuous when considered apart (e.g., “Look how many people love you” and a picture of a barren desert), people recognize that these elements take on potentially hurtful connotations when they’re paired or juxtaposed.