What's so amazing about Sora? It aims to be a data-driven physics engine that "knows the rules."
In the early hours of February 16, 2024 (February 15, US time), OpenAI released Sora, a text-to-video tool, once again shaking the tech world. Compared with current tools, which generate on average only about four seconds of video, often with jitter and distortion, Sora can generate a coherent 60-second video from a simple prompt, creating vivid character expressions and complex camera movements with a high degree of realism and narrative quality. On release it drew intense attention from tech leaders including Elon Musk and Zhou Hongyi; even NVIDIA senior AI researcher Jim Fan remarked, "Sora is a data-driven physics engine."

Prompt: A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.
1. Sora is an epoch-making text-to-video product
In 2023, multimodal large models blossomed on the application side, and text-to-image and text-to-video tools sprang up like mushrooms. As a content creator, this author paid for several AI video tools, but found them complicated to operate and their output hard to be satisfied with. I had therefore been hoping for a genuinely usable AI text-to-video tool to serve people like me with limited video-production skills, never expecting reality to arrive so fast.
Early AI text-to-video generation was based mainly on GAN (generative adversarial network) and VAE (variational autoencoder) models. Videos generated with these two frameworks, however, were limited to static, single scenes, with very low resolution and a narrow range of applications. AI text-to-video technology has since developed along two main technical routes: one builds on the Transformer model, the most widely used foundation for text and image generation; the other, more widely applied to video, is the diffusion model, currently the mainstream route for text-to-video, with advantages in semantic understanding and content richness.
According to the technical report on OpenAI's website, Sora uses a Diffusion Transformer, a model combining the diffusion approach with the Transformer architecture. It feeds noisy inputs through the Transformer's encoder-decoder and at each step predicts a cleaner version, generating an entire video at once or extending a generated video to make it longer, while also solving a challenging problem: keeping subjects consistent even when they temporarily leave the frame.
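The report describes the core loop only at a high level. The idea of "predicting a cleaner version at each step" can be sketched as a toy, deterministic reverse-diffusion loop (DDIM-style). Everything here is illustrative and hypothetical: the noise schedule is made up, and the `predict_noise` oracle, which "knows" the answer for a trivial one-point data distribution, merely stands in for the trained Transformer denoiser that Sora reportedly uses.

```python
import numpy as np

# Toy sketch of iterative denoising (DDIM-style, deterministic).
# Assumption: the "data distribution" is a single point X0 = 3.0, so an
# analytic oracle can stand in for a trained neural denoiser.

T = 100
# Cumulative signal fraction alpha_bar[t]: near 1 (clean) at t = 0,
# near 0 (pure noise) at t = T - 1.
alpha_bar = np.linspace(0.999, 1e-4, T)
X0 = 3.0

def predict_noise(x_t, t):
    # Oracle denoiser: invert the forward process
    # x_t = sqrt(ab) * X0 + sqrt(1 - ab) * eps.
    ab = alpha_bar[t]
    return (x_t - np.sqrt(ab) * X0) / np.sqrt(1.0 - ab)

rng = np.random.default_rng(0)
x = rng.standard_normal()  # start from pure Gaussian noise

for t in range(T - 1, 0, -1):
    eps_hat = predict_noise(x, t)  # "predict the noise component"
    # Clean sample implied by that prediction ...
    x0_hat = (x - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    # ... then step to the next, slightly less noisy level.
    x = np.sqrt(alpha_bar[t - 1]) * x0_hat + np.sqrt(1 - alpha_bar[t - 1]) * eps_hat

print(x)  # converges to approximately 3.0
```

Starting from pure noise, each pass removes a little of the predicted noise, so the sample converges to the data point; a real Diffusion Transformer runs the same kind of loop over spatio-temporal video latents with a learned predictor instead of this oracle.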
In fact, the industry giants Alibaba and ByteDance both launched AI video-generation tools last year, following the diffusion-model route; what the two tools have in common is turning static images into animated video. Animate Anyone, developed by Alibaba's Institute for Intelligent Computing, needs only a static character image from the user, plus some motions or poses, to animate it; ByteDance, together with the National University of Singapore, released Magic Animate, which likewise animates static images, with (compared with Alibaba's tool) a touch more of a "shuffle-dance" flavor.
According to hands-on evaluations by computer-industry practitioners, the results from Animate Anyone and Magic Animate are far from ideal. The former struggles to produce stable hand motion, leading to distortion and motion blur, and its results for side and back views are worse than for frontal images; the latter's output shows visible artifacts and remains a long way from practical deployment. To win the market, an AI text-to-video application needs sufficient duration, excellent picture quality, creative logic, and the ability to faithfully follow instructions, which significantly raises computational complexity and the demand for computing power.
Judging from the 48 demo videos published on OpenAI's website, Sora appears to have broken through the technical bottlenecks of AI text-to-video: from a single simple prompt it can generate a 60-second video with long duration, good stability, and high clarity, which unquestionably makes it the current industry leader.
2. Sora learned the world's "physical rules" during training
According to the technical documentation on OpenAI's website, the company is teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction. Sora is the product of this line of thinking. Sora has a deep understanding of language: it can accurately interpret prompts and generate compelling characters that express vivid emotions, and it can create complex scenes with multiple characters, specific types of motion, and accurate details of subject and background, understanding both what the user asks for in a prompt and how those things exist in the physical world.
One of the 48 demo videos on OpenAI's website uses the text prompt: "Photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee." It is worth analyzing just how Sora generates a video from this prompt.

Prompt: Photorealistic close-up video of two pirate ships battling each other as they sail inside a cup of coffee.
First, the Sora simulator must instantiate two exquisite 3D assets, the differently decorated pirate ships, implicitly solving a text-to-3D problem in latent space. Second, it must keep the 3D objects consistently animated as they sail and dodge each other's paths. Third, it must "understand" the fluid dynamics of the coffee, down to the foam forming around the hulls; fluid simulation is an entire subfield of computer graphics, requiring highly complex algorithms and equations. Fourth comes realism, which needs both software algorithms and hardware support; NVIDIA's GPU ray-tracing technology, for instance, can achieve near-photorealistic rendering. Notably, the simulator must also account for the cup's small volume relative to an ocean, applying something like a tilt-shift photography effect to create a "miniature" feel.
It looks simple but is actually quite complex. Although OpenAI has not published Sora's full technical documentation, it has described the basic principle: Sora learns from video to understand how the real world changes dynamically, and uses computer-vision techniques to simulate those changes and create new visual content. In other words, it comes to grasp the world's "physical laws" through continuous training. No wonder NVIDIA senior AI researcher Jim Fan said, "Sora is a data-driven physics engine."
In November 2023, the American AI startup Pika Labs released its first text-to-video product, Pika 1.0: write a prompt in its video editor and it produces a high-quality video, or edits and modifies video elements. Compared with Sora, what Pika 1.0 lacks is the ability, accumulated through training, to understand the rules of the world.
In the days since Sora's release, industry media have been digging into its technical details. Their summary: beyond text-to-video generation, language understanding, and image-to-video generation, the Sora model's biggest innovations are its abilities to generate complex scenes and characters, to generate multi-shot sequences, and to simulate the physical world, three capabilities that other AI text-to-video tools lack.
It is fair to say that Sora heralds a new era of visual storytelling: it can turn people's imagination into vivid moving images, and the magic of words into a visual feast. In a future woven from data and algorithms, Sora is, in its own way, redefining how we interact with the digital world.
3. Sora will be a disruptive force for industry change
After the automatic generation of text and images, video generation, long regarded as the hardest summit to take, has now also been claimed by AI. Market observers believe Sora's arrival will reshape the video industry and accelerate change on social platforms. Applications involving scene visualization, such as simulation, modeling, special effects, and advertising, all face disruptive transformation; imagination gains unprecedented room for expression, and narrative paradigms will transcend traditional boundaries.
Zhou Hongyi, founder of 360 Group, said that with Sora, OpenAI leveraged its strength in large language models, training LLM and diffusion techniques together so that Sora achieves two layers of capability: understanding the real world and simulating it. He believes that with a strong large model as the foundation, built on an understanding of human language, knowledge, and world models, and layered with many other technologies, one can create super-tools for every field, for example in biomedicine, protein, and gene research, and contribute to disciplines including physics, chemistry, and mathematics.
In the view of 新工业网 (the New Industry Network), any technology develops and evolves gradually. At present, AI text-to-video technology as represented by Sora still falls short of long-duration continuity, and remains inaccurate in simulating complex physical phenomena and understanding complex causal relationships. But as multimodal large models grow more capable and AI technology keeps iterating, substituting AI text-to-video for video-related workflows will become feasible.
As far as industry is concerned, imaginable substitution scenarios include product modeling and simulation, factory modeling and simulation, the industrial metaverse, industrial digital twins, and product training and operations and maintenance. At a minimum, AI text-to-video should be treated as an enabling technology, and its applicability in specific scenarios studied through porting or embedding. Some aviation experts have even proposed applying it to scenarios such as simulating aircraft operating environments and functional performance, assessing system sensitivity and effectiveness, and building and optimizing generalized human-machine interfaces, forming dedicated teams for specialized research where necessary.
OpenAI says that by scaling up video-generation models, it hopes to build a general-purpose simulator of the physical world. It also notes Sora's limitations: it cannot accurately model the physics of many basic interactions, such as glass shattering, and in some interactions, such as eating food, it does not always produce the correct changes in object state. These are the focus of the next round of research and breakthroughs.
In our view, the future development of AI text-to-video must not only keep improving natural language processing, visual processing, and integrated image synthesis; interdisciplinary technology fusion is also a hurdle that must be crossed, opening the door to general artificial intelligence in certain application scenarios. As a foundation model capable of understanding and simulating the real world, Sora has the chance to become an important milestone on the road to artificial general intelligence.

Reposted by CXO UNION-CXO联盟 (cxounion.cn); source: 新工业网 (New Industry Network); edited/translated by CXO UNION editor Xiao U.
To join the CXO UNION (CXO联盟) executive community, please contact our community staff.

Disclaimer: The content of this website (http://www.cxounion.cn/) comes mainly from original work, partner-media contributions, and third-party submissions; all information appearing on this website is for reference only. While this website strives to ensure the accuracy and reliability of the information provided, it makes no guarantee of either; readers should verify information before use and are responsible for any decisions they make on their own. This website assumes no legal liability for errors, inaccuracies, or omissions in the materials.
Copyright in all content published on this website (including but not limited to text, images, logos, audio, video, software, and programs) belongs to the original authors. If any organization or individual believes that content on this website may infringe their intellectual-property rights or contains inaccuracies, please notify this site promptly so it can be removed.