OpenAI shakes up the world: GPT-4-level intelligence is now completely free, real-time audiovisual interaction stuns everyone, and the science-fiction era has arrived.
ChatGPT has been around for only 17 months, and OpenAI has already introduced a super AI straight out of a sci-fi movie, completely free for everyone to use.
While other tech companies are still catching up on the multimodal capabilities of large models, cramming text summarization, photo editing, and similar features into phones, OpenAI, already far ahead, has made another big move. The product it released even amazed its own CEO, who exclaimed that it is just like in the movies.
The host of today's launch was OpenAI CTO Mira Murati, who said the event would focus on three things:
- First, from now on, OpenAI's product development will prioritize free access so that more people can use its products.
- Second, OpenAI has accordingly released a desktop version of ChatGPT and a refreshed UI that is simpler and more natural to use.
- Third, after GPT-4, a new version of the large model has arrived, dubbed GPT-4o. What makes GPT-4o special is that it brings GPT-4-level intelligence to everyone, including free users, in an extremely natural way.
ChatGPT recently became available without registration, and today it gained a desktop application. OpenAI's goal is to let people use it anywhere, anytime, without interruption, and to integrate ChatGPT into their workflow. This AI is now a genuine productivity tool.
On stage, OpenAI engineer Mark Chen demonstrated some of the new model's main abilities using an iPhone, the most important being real-time voice dialogue. Chen said, "I'm a bit nervous, it's my first time at a live-streamed product launch." ChatGPT suggested, "Why don't you take a deep breath?"
"Alright, I'll take a deep breath," Chen replied.
If you have ever used a voice assistant such as Siri, the difference here is obvious. First, you can interrupt the AI at any time; you don't have to wait for it to finish before starting the next turn. Second, there is almost no waiting: the model responds extremely fast, sometimes faster than a human would. Third, the model fully picks up on human emotions and can express a range of emotions itself.
Next came the visual capability. Another engineer wrote an equation on paper and asked ChatGPT not to give the answer directly but to explain how to solve it step by step. It shows real potential as a tool for teaching problem-solving.
They then shared some code with ChatGPT; the result of running it was a temperature curve graph, and they asked ChatGPT to answer every question about the graph in a single sentence.
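As a purely hypothetical illustration of the kind of plotting code involved (this is not the script from the demo, and the temperature values are made up), such a chart could be produced with a few lines of matplotlib:

```python
import matplotlib.pyplot as plt

# Made-up daily temperature readings, for illustration only.
days = list(range(1, 8))
temps_c = [18.2, 19.1, 21.4, 20.8, 22.3, 23.0, 21.7]

plt.plot(days, temps_c, marker="o")
plt.xlabel("Day")
plt.ylabel("Temperature (°C)")
plt.title("Temperature curve")
plt.show()
```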
OpenAI also responded on the spot to requests raised by netizens on platforms such as Twitter. For example, real-time voice translation: the phone can be used as a translator between Spanish and English.
It seems that GPT-4o can already achieve real-time video understanding.
Next, let’s take a detailed look at the blockbuster released by OpenAI today.
The Omnimodel GPT-4o
The first item introduced is GPT-4o, where the "o" stands for "omni." For the first time, OpenAI has integrated all modalities into a single model, greatly enhancing the usability of large models.
OpenAI CTO Mira Murati said that GPT-4o provides "GPT-4 level" intelligence while improving on GPT-4's text, vision, and audio capabilities, and that it will roll out iteratively across the company's products over the next few weeks.
"The reasoning of GPT-4o spans voice, text, and vision," Mira Murati said. "We know these models are becoming more and more complex, but we want the interaction experience to become more natural and simpler. You shouldn't have to worry about the user interface at all, just focus on collaborating with GPT."
GPT-4o matches GPT-4 Turbo's performance on English text and code and significantly improves performance on non-English text. The API is also faster and 50% cheaper. Compared with existing models, GPT-4o is particularly outstanding at visual and audio understanding.
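For developers, using the new model through the API mostly comes down to switching the model name. The following is a minimal sketch based on the OpenAI Python SDK; the prompt and printed output are illustrative and not taken from the launch.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Call GPT-4o the same way as earlier chat models; only the model name changes.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain in one sentence what 'omni' means in GPT-4o."},
    ],
)

print(response.choices[0].message.content)
```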
It can respond to audio input in as little as 232 milliseconds, with an average of 320 milliseconds, comparable to human conversational response times. Before GPT-4o, users of ChatGPT's voice mode experienced average latencies of 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4.
That older voice mode is a pipeline of three independent models: a simple model transcribes audio to text, GPT-3.5 or GPT-4 takes text in and produces text out, and a third simple model converts the text back to audio. OpenAI found that this approach loses a lot of information: the main model cannot directly observe tone, multiple speakers, or background noise, and it cannot output laughter, singing, or emotional expression.
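To make the contrast with the new end-to-end design concrete, here is a rough sketch of how such a three-stage pipeline could be approximated with OpenAI's public speech-to-text, chat, and text-to-speech endpoints. The file names are placeholders, and ChatGPT's internal voice mode is not literally built from these calls; the point is that tone, laughter, and background sound are discarded at the transcription step.

```python
from openai import OpenAI

client = OpenAI()

# Stage 1: transcribe the user's speech to plain text
# (tone, multiple speakers, and background noise are lost here).
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Stage 2: a text-only model reads the transcript and writes a text reply.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# Stage 3: synthesize the text reply back into audio.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=answer,
)
speech.stream_to_file("answer.mp3")
```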
With GPT-4o, OpenAI trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are handled by the same neural network.
"Technically, OpenAI has found a way to map audio directly to audio as a first-class modality and to stream video to a Transformer in real time. This requires some new research on tokenization and architecture, but overall it is a data and systems optimization problem (as most things are)," commented Nvidia scientist Jim Fan.
OpenAI President Greg Brockman also got in on the action, not only having two GPT-4o instances hold a real-time conversation but also getting them to improvise a song. The melody was a bit touching, and the lyrics covered the room's decor, what the speakers were wearing, and various incidents along the way.
In addition, GPT-4o's ability to understand and generate images significantly surpasses that of any existing model, making many previously impossible tasks effortless.
For example, you could ask it to help print OpenAI’s logo on a coaster.
GPT-4o can also generate 3D visual content and perform 3D reconstruction from six generated images.
A member of OpenAI's technical team stated on X that the mysterious model that sparked widespread discussion on the LMSYS Chatbot Arena, "im-also-a-good-gpt2-chatbot," is a version of GPT-4o.
In addition, free users gain access to features such as analyzing data and creating charts:
However, depending on usage and needs, the number of messages that free users can send with GPT-4o is limited. When this limit is reached, ChatGPT automatically switches to GPT-3.5 so users can continue the conversation.
In addition, OpenAI will introduce an alpha of the new GPT-4o voice mode in ChatGPT Plus in the coming weeks and will roll out more of GPT-4o's new audio and video capabilities via the API to a small group of trusted partners.
Of course, even after multiple rounds of testing and iteration, GPT-4o still has limitations in every modality, and OpenAI is working to improve these weaker areas.
Predictably, opening up GPT-4o's audio mode will introduce a variety of new risks. On safety, GPT-4o has protections built into its cross-modal design through techniques such as filtering the training data and refining the model's behavior after training, and OpenAI has also built a new safety system to safeguard voice outputs.
The new desktop app simplifies the user workflow
For both free and paid users, OpenAI has also launched a new ChatGPT desktop application for macOS. With a simple keyboard shortcut (Option + Space), users can instantly ask ChatGPT a question, and they can also take screenshots directly within the app and discuss them with it.
Altman: You go open source, we go free
After the release, OpenAI CEO Sam Altman published a long-awaited blog post reflecting on the journey to GPT-4o:
In our release today, I want to emphasize two things.
First, a key part of our mission is to provide people with powerful AI tools for free (or at a discounted price). I am very proud to announce that we offer the best model in ChatGPT for free, with no ads or anything similar.
When we founded OpenAI, our original idea was that we would create artificial intelligence and use it to create all kinds of benefits for the world. Things have turned out differently: it now looks like we will create artificial intelligence, other people will use it to create all kinds of amazing things, and we will all benefit from that.
Of course, we are a company and will invent plenty of things to charge for, and that will help us provide free, excellent AI services to (we hope) billions of people.
Second, the new voice and video modes are the best computer interface I have ever used. It feels like the AI from the movies, and I am still a little surprised that it is real. Reaching human-level response times and expressiveness turns out to be a huge leap.
The original ChatGPT hinted at the possibilities of the language interface, and this new thing (the GPT-4o version) feels fundamentally different—it’s fast, smart, fun, natural, and helpful.
For me, talking to a computer has never felt truly natural; now it does. As we add (optional) personalization, access to your personal information, the ability for the AI to take actions on your behalf, and more, I can really see an exciting future in which we can do far more with computers than ever before.
Finally, a big thank you to the team for their tremendous efforts to achieve this goal!
"The idea is that as AI becomes more advanced and embedded in every aspect of our lives, having a large language model like GPT-7 may be more valuable than money, because you would own a piece of productivity itself," Altman explained.
The launch of GPT-4o might be the beginning of OpenAI’s efforts in this direction.
Yes, it’s just a beginning.
Finally, the "Guessing May 13th's announcement" video shown on OpenAI's blog today overlaps almost completely with a teaser video for Google's I/O conference tomorrow, which is undoubtedly a huge slap in the face for Google. One wonders whether Google is feeling tremendous pressure after watching today's OpenAI release.