OpenAI shakes up the world: GPT-4-level intelligence is now completely free, real-time audiovisual interaction stuns everyone, and the science-fiction era has arrived.
ChatGPT has been around for only 17 months, and OpenAI has already introduced a super AI straight out of a sci-fi movie, completely free for everyone to use.
While other tech companies are still catching up on the multimodal capabilities of large models, cramming text summarization, photo editing, and similar features into phones, OpenAI, already far ahead, has made another big move. The product it released even amazed its own CEO, who exclaimed that it is just like in the movies.
The host of today's launch was OpenAI CTO Mira Murati, who said the event would focus on three things:
- First, from now on, OpenAI's product development will prioritize free access so that more people can use its products.
- Second, OpenAI has accordingly released a desktop version of ChatGPT and a refreshed UI that is simpler and more natural to use.
- Third, after GPT-4, a new version of the large model has arrived, dubbed GPT-4o. What makes GPT-4o special is that it brings GPT-4-level intelligence to everyone, including free users, in an extremely natural way.
ChatGPT recently became available without registration, and today it gained a desktop application. OpenAI's goal is to let people use it anywhere, anytime, without interruption, and to integrate ChatGPT into their workflow. This AI is now a genuine productivity tool.
On stage, OpenAI engineer Mark Chen demonstrated some of the new model's main abilities using an iPhone, the most important being real-time voice dialogue. Chen said, "I'm a bit nervous, it's my first time at a live-streamed product launch." ChatGPT suggested, "Why don't you take a deep breath?"
"Alright, I'll take a deep breath," Chen replied.
If you have ever used a voice assistant such as Siri, the difference here is obvious. First, you can interrupt the AI at any time; you don't have to wait for it to finish before starting the next turn. Second, there is almost no waiting: the model responds extremely fast, sometimes faster than a human would. Third, the model fully picks up on human emotions and can express a range of emotions itself.
Next came the visual capability. Another engineer wrote an equation on paper and asked ChatGPT not to give the answer directly but to explain how to solve it step by step. It shows real potential as a tool for teaching problem-solving.
They then shared some code with ChatGPT; the result of running it was a temperature curve graph, and they asked ChatGPT to answer every question about the graph in a single sentence.
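As a purely hypothetical illustration of the kind of plotting code involved (this is not the script from the demo, and the temperature values are made up), such a chart could be produced with a few lines of matplotlib:

```python
import matplotlib.pyplot as plt

# Made-up daily temperature readings, for illustration only.
days = list(range(1, 8))
temps_c = [18.2, 19.1, 21.4, 20.8, 22.3, 23.0, 21.7]

plt.plot(days, temps_c, marker="o")
plt.xlabel("Day")
plt.ylabel("Temperature (°C)")
plt.title("Temperature curve")
plt.show()
```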
OpenAI also responded on the spot to requests raised by netizens on platforms such as Twitter. For example, real-time voice translation: the phone can be used as a translator between Spanish and English.
It seems that GPT-4o can already achieve real-time video understanding.
Next, let’s take a detailed look at the blockbuster released by OpenAI today.
The Omnimodel GPT-4o
The first item introduced is GPT-4o, where the "o" stands for "omni." For the first time, OpenAI has integrated all modalities into a single model, greatly enhancing the usability of large models.
OpenAI CTO Mira Murati said that GPT-4o provides "GPT-4 level" intelligence while improving on GPT-4's text, vision, and audio capabilities, and that it will roll out iteratively across the company's products over the next few weeks.
"The reasoning of GPT-4o spans voice, text, and vision," Mira Murati said. "We know these models are becoming more and more complex, but we want the interaction experience to become more natural and simpler. You shouldn't have to worry about the user interface at all, just focus on collaborating with GPT."
GPT-4o matches GPT-4 Turbo's performance on English text and code and significantly improves performance on non-English text. The API is also faster and 50% cheaper. Compared with existing models, GPT-4o is particularly outstanding at visual and audio understanding.
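For developers, using the new model through the API mostly comes down to switching the model name. The following is a minimal sketch based on the OpenAI Python SDK; the prompt and printed output are illustrative and not taken from the launch.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Call GPT-4o the same way as earlier chat models; only the model name changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain in one sentence what 'omni' means in GPT-4o."},
    ],
)

print(response.choices[0].message.content)
```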
It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, comparable to human conversational response times. Before GPT-4o, users of ChatGPT's voice mode experienced average latencies of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4.
That older voice mode is a pipeline of three independent models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes text in and produces text out, and a third simple model converts the text back to audio. OpenAI found that this approach loses a lot of information: the main model cannot directly observe tone, multiple speakers, or background noise, and it cannot output laughter, singing, or emotional expression.
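To make the contrast with the new end-to-end design concrete, here is a rough sketch of how such a three-stage pipeline could be approximated with OpenAI's public speech-to-text, chat, and text-to-speech endpoints. The file names are placeholders, and ChatGPT's internal voice mode is not literally built from these calls; the point is that tone, laughter, and background sound are discarded at the transcription step.

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: transcribe the user's speech to plain text
# (tone, multiple speakers, and background noise are lost here).
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Stage 2: a text-only model reads the transcript and writes a text reply.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# Stage 3: synthesize the text reply back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
speech.stream_to_file("answer.mp3")
```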
With GPT-4o, OpenAI trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are handled by the same neural network.
"Technically, OpenAI has found a way to map audio directly to audio as a first-class modality and to stream video to a Transformer in real time. This requires some new research on tokenization and architecture, but overall it is a data and systems optimization problem (as most things are)," commented Nvidia scientist Jim Fan.
OpenAI President Greg Brockman also got in on the action, not only having two GPT-4o instances hold a real-time conversation but also getting them to improvise a song. The melody was a bit touching, and the lyrics covered the room's decor, what the speakers were wearing, and various incidents along the way.
In addition, GPT-4o's ability to understand and generate images significantly surpasses that of any existing model, making many previously impossible tasks effortless.
For example, you could ask it to help print OpenAI’s logo on a coaster.
GPT-4o can also generate 3D visual content and perform 3D reconstruction from six generated images.
A member of OpenAI's technical team stated on X that the mysterious model that sparked widespread discussion on the LMSYS Chatbot Arena, "im-also-a-good-gpt2-chatbot," is a version of GPT-4o.
In addition, free users gain access to features such as analyzing data and creating charts:
However, depending on usage and needs, the number of messages that free users can send with GPT-4o is limited. When this limit is reached, ChatGPT automatically switches to GPT-3.5 so users can continue the conversation.
In addition, OpenAI will introduce an alpha of the new GPT-4o voice mode in ChatGPT Plus in the coming weeks and will roll out more of GPT-4o's new audio and video capabilities via the API to a small group of trusted partners.
Of course, even after multiple rounds of testing and iteration, GPT-4o still has limitations in every modality, and OpenAI is working to improve these weaker areas.
Predictably, opening up GPT-4o's audio mode will introduce a variety of new risks. On safety, GPT-4o has protections built into its cross-modal design through techniques such as filtering the training data and refining the model's behavior after training, and OpenAI has also built a new safety system to safeguard voice outputs.
The new desktop app simplifies the user workflow
For both free and paid users, OpenAI has also launched a new ChatGPT desktop application for macOS. With a simple keyboard shortcut (Option + Space), users can instantly ask ChatGPT a question, and they can also take screenshots directly within the app and discuss them with it.
Altman: You go open source, we go free
After the release, OpenAI CEO Sam Altman published a long-awaited blog post reflecting on the journey to GPT-4o:
In our release today, I want to emphasize two things.
First, a key part of our mission is to provide people with powerful AI tools for free (or at a discounted price). I am very proud to announce that we offer the best model in ChatGPT for free, with no ads or anything similar.
When we founded OpenAI, our original idea was that we would create artificial intelligence and use it to create all kinds of benefits for the world. Things have turned out differently: it now looks like we will create artificial intelligence, other people will use it to create all kinds of amazing things, and we will all benefit from that.
Of course, we are a company and will invent plenty of things to charge for, and that will help us provide free, excellent AI services to (we hope) billions of people.
Second, the new voice and video modes are the best computer interface I have ever used. It feels like the AI from the movies, and I am still a little surprised that it is real. Reaching human-level response times and expressiveness turns out to be a huge leap.
The original ChatGPT hinted at the possibilities of the language interface, and this new thing (the GPT-4o version) feels fundamentally different—it’s fast, smart, fun, natural, and helpful.
For me, talking to a computer has never felt truly natural; now it does. As we add (optional) personalization, access to your personal information, the ability for the AI to take actions on your behalf, and more, I can really see an exciting future in which we can do far more with computers than ever before.
Finally, a big thank you to the team for their tremendous efforts to achieve this goal!
"The idea is that as AI becomes more advanced and embedded in every aspect of our lives, having a large language model like GPT-7 may be more valuable than money, because you would own a piece of productivity itself," Altman explained.
The launch of GPT-4o might be the beginning of OpenAI’s efforts in this direction.
Yes, it’s just a beginning.
Finally, the "Guessing May 13th's announcement" video shown on OpenAI's blog today overlaps almost completely with a teaser video for Google's I/O conference tomorrow, which is undoubtedly a huge slap in the face for Google. One wonders whether Google is feeling tremendous pressure after watching today's OpenAI release.