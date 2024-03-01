Alibaba's Institute for Intelligent Computing has introduced an AI system named "EMO," short for Emote Portrait Alive. The tool animates a single portrait photo, producing lifelike videos in which the person appears to speak or sing. Unlike conventional techniques that rely on 3D face models or blend shapes, EMO synthesizes video directly from audio, which sets it apart from earlier approaches in the field.

EMO converts audio waveforms directly into video frames, capturing the subtle facial movements and individual nuances that accompany natural speech. In their research paper, Alibaba's researchers described how they trained the model. They curated a large and diverse audio-video dataset of more than 250 hours of footage and over 150 million images. The dataset spans speeches, films, television clips, and singing performances in multiple languages, including Chinese and English.

The researchers noted that this mix of speaking and singing videos exposes the model to a broad spectrum of human expressions and vocal styles, giving EMO a strong foundation. As the paper states, "Experimental results show that EMO can generate convincing speaking videos as well as singing videos in diverse styles, surpassing existing state-of-the-art techniques in terms of breadth and realism."

However, the researchers acknowledged several limitations. First, because EMO is built on diffusion models, it is more time-consuming than methods that do not rely on them. Second, because the model lacks explicit control signals to govern the character's motion, it may inadvertently generate extra body parts, such as hands, which can produce artifacts in the video.

Nevertheless, the results the researchers presented look strikingly realistic, and the lip-sync is notably accurate. It remains to be seen whether Alibaba will integrate the tool into its AI products or keep it as a research project.