Unlocking Seamless Multimodal Interaction: Introducing GPT-4o
We are introducing GPT-4o, our latest flagship model, designed to process audio, vision, and text together in real time. GPT-4o, where the "o" stands for "omni," represents a significant step towards more natural human-computer interaction: it accepts any combination of text, audio, and image inputs and generates any combination of text, audio, and image outputs. It can respond to audio inputs in an average of 320 milliseconds, comparable to human response times in conversation. GPT-4o matches GPT-4 Turbo performance on English text and code, with significant improvement on non-English text, while also being faster and 50% cheaper in the API. It is especially better at vision and audio understanding than existing models.
Previously, using Voice Mode with GPT-3.5 or GPT-4 relied on a pipeline of separate models and resulted in average latencies of 2.8 seconds and 5.4 seconds, respectively. With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are handled by the same neural network, which avoids the information lost at each handoff in the earlier pipeline. Because GPT-4o is our first model combining all of these modalities, we are still exploring its capabilities and limitations. We also evaluated it on 20 languages chosen as representative of the new tokenizer's compression across different language families.
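To make the architectural difference concrete, here is a conceptual sketch, not OpenAI's actual implementation: the earlier Voice Mode chained a speech-to-text step, a text-only language model, and a text-to-speech step, whereas an end-to-end omni model consumes and produces audio directly. The function names below (transcribe, complete, synthesize, omni_model) are hypothetical placeholders.

```python
# Conceptual sketch only: contrasts the old three-model Voice Mode pipeline
# with a single end-to-end multimodal model. All functions are placeholders.

def transcribe(audio: bytes) -> str:
    """Placeholder speech-to-text step; tone, speakers, and noise are lost here."""
    return "transcribed user utterance"

def complete(prompt: str) -> str:
    """Placeholder text-only language model (GPT-3.5 or GPT-4 in old Voice Mode)."""
    return "text reply"

def synthesize(text: str) -> bytes:
    """Placeholder text-to-speech step; cannot express laughter or emotion."""
    return b"audio reply"

def old_voice_mode(audio_in: bytes) -> bytes:
    # Three separate models chained together; each hop adds latency and
    # discards information the text-only model never sees.
    return synthesize(complete(transcribe(audio_in)))

def omni_model(audio_in: bytes) -> bytes:
    """Placeholder for one network that maps audio in to audio out directly,
    so non-textual cues stay available end to end."""
    return b"audio reply"

if __name__ == "__main__":
    old_voice_mode(b"hello")  # pipeline: three models, higher latency
    omni_model(b"hello")      # single model: one forward path
```

Each handoff in the pipeline adds latency and strips away signals such as tone, multiple speakers, and background noise; collapsing the pipeline into one network is what removes that information loss.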
GPT-4o incorporates safety measures across all modalities, including filtering training data and refining the model's behavior through post-training. We have evaluated GPT-4o according to our Preparedness Framework, and the model has also undergone extensive external evaluations to identify and mitigate risks. We recognize that GPT-4o's audio modalities present a variety of new risks, which we are actively addressing through improvements to our technical infrastructure and additional safety measures.
Initially, GPT-4o's audio outputs will be limited to preset voices and will adhere to existing safety policies. Text and image capabilities are being rolled out today in ChatGPT, with GPT-4o available in the free tier and for Plus users with higher message limits. A new version of Voice Mode with GPT-4o will be introduced in alpha within ChatGPT Plus in the coming weeks. Developers can access GPT-4o in the API as a text and vision model, offering twice the speed, half the price, and five times higher rate limits compared to GPT-4 Turbo.
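For developers, a minimal sketch of a text-plus-vision request is shown below, assuming the OpenAI Python SDK, an OPENAI_API_KEY set in the environment, and a placeholder image URL; audio input and output are not part of the text-and-vision API surface described above.

```python
# A minimal sketch of a text-plus-vision request to GPT-4o, assuming the
# OpenAI Python SDK (openai>=1.0). The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

In most cases, an application already using GPT-4 Turbo with vision can point the same request shape at the new model simply by changing the model parameter.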