artificial emotional intelligence

Major ChatGPT-4o update allows audio-video talks with an “emotional” AI chatbot

New GPT-4o model can sing a bedtime story, detect facial expressions, read emotions.

Benj Edwards and Kyle Orland

On Monday, OpenAI debuted GPT-4o (o for "omni"), a major new AI model that can ostensibly converse using speech in real time, reading emotional cues and responding to visual input. It operates faster than OpenAI's previous best model, GPT-4 Turbo, and will be free for ChatGPT users and available as a service through API, rolling out over the next few weeks, OpenAI says.

OpenAI revealed the new audio conversation and vision comprehension capabilities in a YouTube livestream titled "OpenAI Spring Update," presented by OpenAI CTO Mira Murati and employees Mark Chen and Barret Zoph, which included live demos of GPT-4o in action.

OpenAI claims that GPT-4o responds to audio inputs in about 320 milliseconds on average, which is similar to human response times in conversation, according to a 2009 study, and much shorter than the typical 2–3 second lag experienced with previous models. With GPT-4o, OpenAI says it trained a brand-new AI model end-to-end using text, vision, and audio in a way that all inputs and outputs "are processed by the same neural network."


"Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations," OpenAI says.

During the livestream, OpenAI demonstrated GPT-4o's real-time audio conversation capabilities, showcasing its ability to engage in natural, responsive dialogue. The AI assistant seemed to easily pick up on emotions, adapted its tone and style to match the user's requests, and even incorporated sound effects, laughing, and singing into its responses.

OpenAI CTO Mira Murati seen debuting GPT-4o during OpenAI's Spring Update livestream on May 13, 2024. Credit: OpenAI

The presenters also highlighted GPT-4o's enhanced visual comprehension. By uploading screenshots, documents containing text and images, or charts, users can apparently hold conversations about the visual content and receive data analysis from GPT-4o. In the live demo, the AI assistant demonstrated its ability to analyze selfies, detect emotions, and engage in lighthearted banter about the images.

Additionally, GPT-4o exhibited improved speed and quality in more than 50 languages, which OpenAI says covers 97 percent of the world's population. The model also showcased its real-time translation capabilities, facilitating conversations between speakers of different languages with near-instantaneous translations.

OpenAI first added conversational voice features to ChatGPT in September 2023, using Whisper, an AI speech recognition model, for input and a custom voice synthesis technology for output. In the past, OpenAI's multimodal ChatGPT interface used three separate processes: transcription (from speech to text), intelligence (processing the text as tokens), and text-to-speech, with each step adding latency. With GPT-4o, all of those steps reportedly happen at once. It "reasons across voice, text, and vision," according to Murati, and a slide shown on-screen behind her during the livestream labeled the approach an "omnimodel."
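To make the latency argument concrete, here is a minimal Python sketch of the two designs. Every function name below (transcribe, generate_text, synthesize_speech, omni_model) is a hypothetical stand-in, not a real OpenAI API call; the point is only that the older voice mode chained three separate models in sequence, while GPT-4o is described as handling audio in and audio out in a single model pass.

    # Illustrative sketch only: stub functions stand in for real models.
    # None of these are actual OpenAI API calls.

    def transcribe(audio: bytes) -> str:
        return "hello there"          # stand-in for speech-to-text (e.g., Whisper)

    def generate_text(prompt: str) -> str:
        return f"You said: {prompt}"  # stand-in for the language model step

    def synthesize_speech(text: str) -> bytes:
        return text.encode()          # stand-in for a text-to-speech model

    def omni_model(audio: bytes) -> bytes:
        return b"spoken reply"        # stand-in for one end-to-end audio model

    def cascaded_voice_reply(audio_in: bytes) -> bytes:
        # Pre-GPT-4o pipeline: three models run one after another,
        # so each stage adds its own processing delay.
        text_in = transcribe(audio_in)
        text_out = generate_text(text_in)
        return synthesize_speech(text_out)

    def omni_voice_reply(audio_in: bytes) -> bytes:
        # GPT-4o-style design as OpenAI describes it: a single model
        # handles audio input and output with no hand-offs between stages.
        return omni_model(audio_in)

Each hand-off in the cascaded version adds its own delay, which is where the 2-3 second lag of the previous voice mode reportedly came from.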

OpenAI announced that GPT-4o will be accessible to all ChatGPT users, with paid subscribers having access to five times the rate limits of free users. GPT-4o in API form will also reportedly feature twice the speed, 50 percent lower cost, and five times higher rate limits compared to GPT-4 Turbo. (Right now, GPT-4o is only available as a text model in ChatGPT, and the audio/video features have not launched yet.)

In Her, the main character talks to an AI personality through wireless earbuds similar to AirPods. Credit: Warner Bros.

The capabilities demonstrated during the livestream and in numerous videos on OpenAI's website recall the conversational AI agent in the 2013 sci-fi film Her. In that film, the lead character develops a personal attachment to the AI personality. With GPT-4o's simulated emotional expressiveness (artificial emotional intelligence, you could call it), it's not inconceivable that users may form similar emotional attachments to OpenAI's assistant, as we've already seen in the past.

Murati acknowledged the new challenges posed by GPT-4o's real-time audio and image capabilities in terms of safety, and stated that the company will continue researching safety and soliciting feedback from test users during its iterative deployment over the coming weeks.

"GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities," says OpenAI. "We used these learnings [sic] to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered."

Updates to ChatGPT

Also on Monday, OpenAI announced several updates to ChatGPT, including a ChatGPT desktop app for macOS, which began rolling out today to a small group of ChatGPT Plus subscribers and will become "more broadly available" in the coming weeks, according to OpenAI. OpenAI is also streamlining the ChatGPT interface with a new home screen and message layout.

And as we mentioned briefly above, when using the GPT-4o model (once it becomes widely available), ChatGPT Free users will have access to web browsing, data analytics, the GPT Store, and Memory features, which were previously limited to ChatGPT Plus, Team, and Enterprise subscribers.

Listing image: Getty Images

Benj Edwards, Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.
Staff Picks
d
I have a pretty simple test that has been demolishing GPT4 the last couple of weeks..

I gave it each of my animals and their feeding schedules (with reptiles it can be Monday/Sunday, First Sunday, etc.), all of which it understands and reasons about perfectly! It has added my pets and their feeding schedule to its memory, however when asked "Do I feed any of my pets today?" it consistently gets THE CURRENT DAY wrong, and gives the wrong answer.

I have insisted that it add a memory to always Search and check the correct day before giving a feeding schedule - sometimes it does do a search - but then ignores the answer and provides the wrong date.

---

I was actually thinking it wouldn't achieve it, despite the hang-up being so simple... GPT-4o is actually 5/5 on this test~! One time it actually wrote a Python script to get the correct date before answering! What a good little rule follower. =]
Mr. Perfect
I grew up in the US, live right next door to the US, and dislike that kind of chirpiness in general. But even so, it sounds way over the top. It's not that you'd never hear that level from an actual person, but it would indicate that the person was insincere (above and beyond formulaic "how are you" when you don't actually care), and not very good at acting and/or gauging their audience.

From your name I'm guessing maybe you come from somewhere other than the US and aren't quite calibrated to see it as excessive even for the US. But I admit I could also be the one who's miscalibrated, especially because I'm old. And it's true that women are really expected to lay it on pretty thick in some situations.
Yeah, the AI in the demo sounded like an advertisement voiceover or someone using their "customer service voice" in a retail setting. Normal people don't bop around chirping at each other in tones like that and it's honestly somewhere in the uncanny valley for me. Hopefully this was meant to make the demo seem more engaging and normal interactions will use a more natural speaking voice.
7
Is no one going to mention how in the live demo, ChatGPT-4o kept interrupting Mark and being interrupted by seemingly non-conversational audio? (around the 10:30 mark) Sure, it can pick up emotional states, but it can't tell when you're done speaking, or when you're not speaking at all? The demo made it seem like if there was the slightest background audio, e.g., from the live audience laughing, ChatGPT-4o would get "interrupted," even though Mark wasn't speaking. And it seemed like Mark was trying his best to not let ChatGPT interrupt him... so awkward.