
ChatGPT Can Now See, Hear, And Speak As OpenAI Begins To Roll Out Voice & Image Features

New Delhi: OpenAI has announced that it is rolling out new voice and image capabilities in ChatGPT to offer a more intuitive interface. Users will be able to hold a voice conversation with ChatGPT or show it what they are talking about. Until now, ChatGPT has been limited to text input.

“Snap a picture of a landmark while traveling and have a live conversation about what’s interesting about it,” OpenAI said in a blog post.

OpenAI is rolling out the update over the next two weeks to ChatGPT Plus and Enterprise users. The voice feature will be available only on iOS and Android, while image support will be available on all platforms.

How to start a voice conversation on your phone

Step 1: To get started with voice, head to Settings → New Features on the mobile app and opt into voice conversations. 

Step 2: Then, tap the headphone button located in the top-right corner of the home screen and choose your preferred voice out of five different voices.

According to OpenAI, the new voice capability is powered by a new text-to-speech model that can generate human-like audio from just text and a few seconds of sample speech.

The company said it collaborated with professional voice actors to create each of the voices. It also uses Whisper, its open-source speech recognition system, to transcribe spoken words into text.

Chat about images

You can now show ChatGPT one or more images. Troubleshoot why your grill won’t start, explore the contents of your fridge to plan a meal, or analyze a complex graph for work-related data. To focus on a specific part of an image, you can use the drawing tool in the ChatGPT mobile app.

How to start an image conversation

Step 1: To get started, tap the photo button to capture or choose an image. If you’re on iOS or Android, tap the plus button first. 

Step 2: You can also discuss multiple images or use the drawing tool to guide the assistant.

According to OpenAI, image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images.

Source: Zee News