In News
- OpenAI, the creator of the revolutionary chatbot ChatGPT, has unveiled a new generative artificial intelligence (GenAI) model that can convert a text prompt into video, an area of GenAI that had thus far been fraught with inconsistencies.
Sora
- The model, called Sora, can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.
- Sora is capable of creating “complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background”.
- The company also claimed that the model can understand how objects “exist in the physical world”, and “accurately interpret prompts and generate compelling characters that express vibrant emotions”.
Significance
- While the generation of images and textual responses to prompts on GenAI platforms has become significantly better in the last few years, text-to-video had largely lagged, owing to the added complexity of analysing moving objects in three-dimensional space.
- While videos are also a series of images and could therefore be processed using some of the same parameters as text-to-image generators, they pose their own unique set of challenges.
- Sora can also create multiple shots within a single generated video while keeping characters and visual style consistent.
Other Models
- Other companies too have ventured into the text-to-video space.
- Google’s Lumiere, announced recently, can create five-second videos from a given prompt, whether text- or image-based.
- Other companies like Runway and Pika have also shown impressive text-to-video models of their own.
Shortcomings
- OpenAI acknowledges that the current version of Sora has weaknesses.
- It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect.
- For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.
- The model may also confuse spatial details of a prompt, for example mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.
Source: Indian Express