“The fear is not that AI will eliminate mankind, but that the people using AI are too clever.”
In February of this year, OpenAI unveiled Sora, its text-to-video AI model, and released a first batch of generated clips, kicking off a wave of AI-generated video. Sora is not yet open for public testing; only a small group of visual artists, designers, and filmmakers currently has access. They have published some short videos generated with Sora, and the consistency and realism of its results are impressive.
Recently, Shy Kids, a Canadian multimedia production company dubbed the “Punk Rock Pixar,” released a short film made with Sora, titled “Air Head,” which quickly gained widespread attention on social media.
Reportedly, this polished short film was completed by three people in less than two weeks: Sidney Leeder served as producer, Walter Woodman as writer and director, and Patrick Cederberg handled post-production.
This week, respected visual effects supervisor Mike Seymour interviewed Patrick Cederberg about the production process and technical challenges of “Air Head,” and published an article on fxguide detailing the role Sora played in the actual video production process and the issues it encountered.
Patrick said, “Sora is a very powerful tool, and we are already thinking about how to integrate it into the existing film production process. But Sora is still in a testing phase, and there will be ‘accidents’ along the way. For example, the color of the balloon changes with each generation, the camera work has glitches, and so on. To get the best results, a lot of post-production is still needed.”
AI video generators are more than just advanced image generators. More accurately, they may be an important step toward artificial general intelligence (AGI). As the Sora development team said in an interview this week, today’s AI video models are still in their early stages.
OpenAI research scientist and Sora project lead Tim Brooks said, “I think where Sora is right now is like the GPT-1 stage of a new paradigm of visual models.”
How was “Air Head” made? Machine Heart has translated and edited Mike Seymour’s article while preserving its original meaning. The translated article follows:
Sora’s user interface allows users to enter a text prompt, which ChatGPT then converts into a longer string to trigger the generation of a video clip. Currently, there is no other input method – multimodal input has not been implemented yet. This is important because while Sora has been praised for the object consistency in its generated results, there is currently no method to help match the content of two shots (i.e., two generations). Even if the same prompt is run twice, the generated results will be different.
Patrick explained, “What we try to do is give super detailed descriptions in our prompts, like describing the character’s costume or the type of balloon. That is how we get consistency. From one shot to another, or one generation to the next, there is no full control over consistency yet.”
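To give a sense of what such “super detailed” prompting looks like, here is a purely hypothetical example (not one of Shy Kids’ actual prompts) that spells out the costume and balloon details so the same wording can be reused verbatim for every shot:

```
35mm film. A man in a worn mustard-yellow cardigan, grey slacks and brown leather
shoes walks down a quiet suburban street. His head is a single plain yellow latex
balloon, matte, with no string, no face and no markings. Handheld camera, overcast
daylight, shallow depth of field.
```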
A single video clip can certainly showcase Sora’s astonishing technology, but making use of such clips depends on understanding how shots are generated implicitly or explicitly.
Suppose you ask Sora for a long tracking shot through a kitchen with a banana on the table. It will rely on its implicit understanding of “banana attributes” to generate a video featuring a banana. Through training, Sora has already learned those attributes: “yellow”, “curved”, “darker at one end”, and so on. No actual recorded images of bananas are stored; there is no “banana database”, only a smaller, compressed, hidden “latent space” that describes what a banana is, and each run produces a new interpretation of that latent space. Your prompt relies on this implicit understanding of banana attributes.
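The practical consequence is that generation is a sampling process: the prompt is fixed, but each run draws a fresh point from the latent space. The toy Python sketch below (not OpenAI’s API, and nothing like Sora’s real architecture) illustrates why two runs of the same prompt cannot be expected to match:

```python
# Toy illustration: the prompt is fixed, but every run samples a new latent,
# so the "same" request decodes to a different result each time.
import numpy as np

rng = np.random.default_rng()

def decode(prompt_embedding, latent):
    # Hypothetical stand-in for a video decoder: the output depends on both
    # the prompt and the randomly sampled latent.
    return prompt_embedding + latent

def generate(prompt_embedding, latent_dim=8):
    latent = rng.standard_normal(latent_dim)  # a new point in latent space
    return decode(prompt_embedding, latent)

prompt = np.ones(8)                 # the same "banana on the table" prompt
clip_a = generate(prompt)
clip_b = generate(prompt)
print(np.allclose(clip_a, clip_b))  # False: same prompt, different output
```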
To create “Air Head”, the team generated many video clips based on a rough script, but there was no explicit way to ensure that the yellow balloon head stayed the same from shot to shot. Even when the prompt asked for a yellow balloon, the result was sometimes not yellow at all. Sometimes a face appeared embedded in the balloon, or looked as if it had been painted on its front. And because many real-world balloons have strings, the balloon man, called Sonny, often ended up with a string hanging down the front of his clothes: Sora implicitly associates strings with balloons, so these had to be removed in post-production.
“Air Head” used only shots created by Sora, but many of them were color-graded, processed, and stabilized, and all of them were upscaled. The clips were initially generated at a lower resolution and then upscaled using AI tools outside of Sora and OpenAI.
“You can choose a resolution of 720p. I believe 1080p is already available, but it takes some time to render. For speed, all the shots in ‘Air Head’ were made at 480p, and then Topaz was used to increase the resolution,” Patrick explained.
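As a rough illustration of the upscaling step, the sketch below resizes each frame of a 480p clip with plain Lanczos interpolation in OpenCV. It is only a stand-in: Topaz uses AI-based upscaling that recovers far more detail than simple resizing, and the file names here are illustrative.

```python
import cv2

def upscale_clip(src_path, dst_path, target_w=1920, target_h=1080):
    """Resize every frame of a video to target_w x target_h (non-AI upscale)."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                          fps, (target_w, target_h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out.write(cv2.resize(frame, (target_w, target_h),
                             interpolation=cv2.INTER_LANCZOS4))
    cap.release()
    out.release()

upscale_clip("sora_480p.mp4", "upscaled_1080p.mp4")
```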
On the subject of keyframes, Patrick explains, “In actual generation, there is some temporal control over when different actions happen, but it is not precise; it is even a bit down to luck, and it is still uncertain whether Sora can really do this.” Shy Kids were, however, using the earliest version of the model, and Sora is still under active development.
In addition to the resolution, Sora also lets users choose the aspect ratio, such as portrait, landscape, or square. This proved very useful for the shot that pans up from Sonny’s jeans to his balloon head. Unfortunately, Sora could not render that movement natively, since it always expects the main subject of the shot, the balloon head, to be in frame. So the team rendered the shot in portrait mode and then created the panning move manually by cropping in post-production.
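Here is a minimal sketch of that crop-and-pan trick, assuming a portrait-mode source file and illustrative file names: a landscape-sized crop window slides from the bottom of the tall frame (the jeans) up to the top (the balloon head).

```python
import cv2

def crop_pan_up(src_path, dst_path, out_w=1280, out_h=720):
    """Create a pan up by sliding a 16:9 crop window up a portrait frame."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"),
                          fps, (out_w, out_h))
    for i in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        crop_h = int(w * out_h / out_w)            # keep the output aspect ratio
        # The window's top edge moves from the bottom of the frame to the top.
        y = int((h - crop_h) * (1 - i / max(total - 1, 1)))
        out.write(cv2.resize(frame[y:y + crop_h, 0:w], (out_w, out_h)))
    cap.release()
    out.release()

crop_pan_up("sonny_portrait.mp4", "sonny_pan_up.mp4")
```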
For many generative AI tools, the metadata attached to the training data is a valuable source of information. For instance, when training on still photos, camera metadata provides lens focal length, aperture, and many other details that are crucial to model training. In film footage, however, concepts such as “tracking”, “panning”, “tilting”, and “pushing in” are not captured by any metadata.
Describing shots is crucial in film production. Patrick notes, “This wasn’t initially in Sora. Different people describe film shots in different ways. OpenAI’s researchers didn’t really think like filmmakers before artists started using the tool.”
Shy Kids knew they were using an early version of Sora, but “the initial version of Sora was a bit random in terms of camera angles.” Whether Sora fully understands such prompts is unclear; OpenAI’s researchers have focused on visual generation, perhaps without considering how a storyteller would use it.
“Sora is improving, but control over generation is not fully there yet. If you type ‘the camera is panning’, I think six times out of ten you will get the desired result,” Patrick said.
This is not an isolated issue; nearly every AI video generation company faces the same problem. Runway may be the furthest along in supporting descriptions of camera movement, but the quality and length of Runway’s rendered clips are not on par with Sora’s.
Video clips can be rendered at different lengths, such as 3, 5, 10, or 20 seconds, up to a minute. Rendering time depends on the time of day (morning, afternoon, or evening) and the demand on the cloud service.
Patrick explained, “Generally speaking, each render takes about 10 to 20 minutes. In my experience, the length I choose has little effect on the render time: whether a clip is 3 seconds or 20 seconds, it usually stays within that 10-to-20-minute range.”
Although all the footage was generated by Sora, “Air Head” still required a significant amount of post-production work. For example, a face would sometimes appear on the balloon man Sonny, as if drawn on with a marker; defects like these had to be removed in post-production.
Original material vs. final film: 300:1
Shy Kids’ approach was to handle post-production and editing as if they were making a documentary: there is a large amount of footage, and the story has to be woven out of that material rather than shot strictly to a script. The short film did have a script, but the team had to adapt flexibly.
“It’s like getting a bunch of shots and then trying to cut them against the voiceover in an interesting way,” Patrick explained.
For the roughly 90 seconds that appear in the final film, Patrick estimates that they generated “hundreds of 10-to-20-second clips.” He added, “I’d guess the ratio of original material to final product is about 300:1.”
Many of the clips generated for “Air Head” appear to have been filmed in slow motion, even though this was never requested in the prompt. The reason is not clear, though it seems related to the training data, and many clips had to be retimed to make them look as if they were shot in real time.
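A simple retime of that kind can be done by dropping frames so the clip plays back faster. The sketch below speeds a clip up by a constant factor; the factor and file names are illustrative, not taken from the production.

```python
import cv2

def speed_up(src_path, dst_path, factor=2.0):
    """Keep roughly one frame in every `factor`, so playback looks factor x faster."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    read, kept = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if int(read / factor) >= kept:   # drop frames to compress the timeline
            out.write(frame)
            kept += 1
        read += 1
    cap.release()
    out.release()

speed_up("sonny_slowmo.mp4", "sonny_realtime.mp4", factor=2.0)
```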
It’s worth noting that Shy Kids used the keyword “35 mm film” in their prompts and found that it gave them some of the consistency they were looking for.
OpenAI tries to respect copyrights, not allowing the generation of content that might violate copyright or rights of likeness. For example, if a user’s prompt is similar to “35mm film, a man walks forward with a lightsaber in a futuristic spaceship,” Sora will not allow the generation of this clip because it’s too close to “Star Wars.”
Patrick recalls that when they first wanted to just test Sora, “I thoughtlessly typed in ‘Aronofsky type of shot,’ and was told I couldn’t do that.” Sora will reject such prompts due to copyright issues.
It’s worth noting that all videos generated by Sora are silent; the voice of the main character, Sonny, in “Air Head” is Patrick’s own voice.
The Shy Kids team has already started work on a self-aware, perhaps slightly ironic sequel to “Air Head.” For practical filmmaking projects, however, it may take some time before Sora offers the precision creators need.