Published: 10 February 2025

Visual ChatGPT: A Complete Guide

ChatGPT is an advanced language model that combines natural language processing with machine learning to generate human-like text. All you have to do is type a prompt into the chat box, and the chatbot will reply with a response relevant to your query. Over the last few months, the AI tool has become hugely popular with everyone from content creators to students to programmers. But ChatGPT lacks one feature many want: the ability to generate digital art.

Visual Foundation Models (VFMs) scratch this itch, and there are a few around. You might have seen DALL-E 2, Stable Diffusion or Midjourney.

These are all examples of visual foundation models capable of receiving a text prompt and transforming it into visual content.

Microsoft recently announced that ChatGPT would also be getting the art generator treatment in a bid to push the boundaries of what is possible with the artificial intelligence chatbot.

This new model has been dubbed Visual ChatGPT. Keep reading to find out what it is, why it is, what it can do and how to use it.

What is Visual ChatGPT?

Visual ChatGPT combines the key features of OpenAI’s ChatGPT with a series of VFMs, 22 in fact.

This allows the model to receive and generate images in response to user requests, something that couldn’t be done with ChatGPT, which was limited to text inputs and outputs.

ChatGPT could generate a prompt for other AI image generators like Stable Diffusion or Midjourney, but not for itself. Visual ChatGPT changes this and opens up the door for multi-modal interactions.

On top of this, Visual ChatGPT offers features similar to those of image editing software such as Adobe Photoshop, including basic tasks like cropping a photo, changing the background colour or removing an object.

In short, Visual ChatGPT can understand and process both language and images.

What Are the Features & Benefits of Visual ChatGPT?

Visual ChatGPT brings some brand-new features to the AI-powered chatbot, including advanced image editing, which makes it a very powerful tool.

Let’s look at some of these now:

In addition to text prompts, users can submit images or describe an image they want to generate, and the model will create it.
Because it integrates several VFMs, Visual ChatGPT can handle complex image prompts that require greater processing power.
It uses advanced algorithms for editing images, such as edge detection, object detection, line detection, HED (holistically-nested edge detection) and image conditioning.
It can remove and replace objects within a photo.
It can describe and summarise an image in text.
It can change the style and other aspects of an image.
It provides a free alternative to professional image editing software like Adobe Photoshop.
It understands both your chat and visual context. So, if you submit an image of a beach with a person lying on a towel and ask it, “What is the person doing?” Visual ChatGPT will analyse the prompt and the image to produce a relevant response. For example, it might reply by saying, “He is sunbathing”.
Thanks to the massive collection of images its underlying models were trained on, Visual ChatGPT can give accurate responses to users.
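
To give a feel for what one of these editing algorithms does, here is a minimal sketch of edge detection using the classic Sobel operator. It is a toy in plain Python, not the code Visual ChatGPT runs; the function name sobel_edges and the tiny 5x5 test image are made up for illustration.

```python
import math

# Toy Sobel edge detector on a grayscale image stored as a list of rows.
# Illustrative only: this is NOT the implementation used by Visual ChatGPT.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def sobel_edges(image):
    """Return the gradient magnitude of each interior pixel (borders stay 0.0)."""
    height, width = len(image), len(image[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            edges[y][x] = math.hypot(gx, gy)
    return edges


# A 5x5 image that is dark on the left and bright on the right.
image = [[0, 0, 0, 255, 255] for _ in range(5)]
edges = sobel_edges(image)
print(edges[2])  # [0.0, 0.0, 1020.0, 1020.0, 0.0]
```

The large values sit exactly where the image jumps from dark to bright, which is all an "edge" really is to the computer.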

How Visual ChatGPT Works

The above diagram shows the inner workings of Visual ChatGPT directly from Microsoft’s GitHub repository.

Now, to almost anyone other than a computer scientist, this won’t make a lot of sense, so for everyone else, here’s a description in layman’s terms:

1. User Input

Visual ChatGPT lets you choose from two input types: text and image.

Using one or both of these gives your input context so the model can generate a more accurate response.

For example, you might use a textual input that describes an image you want to generate, but by adding an image input as well, the model can draw more context from your message and create something better.
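
As a rough sketch, a single chat turn with its two optional input types could be modelled like this. The UserInput class, its field names and the file name beach.png are hypothetical; they are not taken from the Visual ChatGPT code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInput:
    """One chat turn: text, an image path, or both (hypothetical structure)."""
    text: Optional[str] = None
    image_path: Optional[str] = None

    def modalities(self):
        """List which input types this turn actually carries."""
        present = []
        if self.text:
            present.append("text")
        if self.image_path:
            present.append("image")
        if not present:
            raise ValueError("Provide text, an image, or both.")
        return present


turn = UserInput(text="Make the sky purple", image_path="beach.png")
print(turn.modalities())  # ['text', 'image']
```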

2. Textual Encoding

Transformer-based neural networks called text encoders assign meaning to the words in your text prompt so the model can generate an appropriate response.

Drawing on its training data, Visual ChatGPT will analyse your words and make a calculated guess as to what you mean.

This is surprisingly accurate most of the time, thanks to the sheer amount of data these AI models have been trained on. They know a lot.
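
To make "turning words into numbers" a little more concrete, here is a deliberately tiny sketch. It swaps the real thing (a transformer with learned embeddings) for a hand-made vocabulary and simple word counts, so treat the VOCAB table and the function names as assumptions for illustration only.

```python
import re

# Tiny hand-made vocabulary; real models learn tens of thousands of tokens.
VOCAB = {"<unk>": 0, "a": 1, "cat": 2, "dog": 3, "on": 4, "the": 5, "beach": 6}


def tokenize(prompt):
    """Lowercase the prompt and split it into word tokens."""
    return re.findall(r"[a-z']+", prompt.lower())


def encode(prompt):
    """Map each token to its vocabulary id, using <unk> for unknown words."""
    return [VOCAB.get(token, VOCAB["<unk>"]) for token in tokenize(prompt)]


def bag_of_words(prompt):
    """Count how often each vocabulary entry appears in the prompt."""
    counts = [0] * len(VOCAB)
    for token_id in encode(prompt):
        counts[token_id] += 1
    return counts


print(encode("A cat on the beach"))        # [1, 2, 4, 5, 6]
print(bag_of_words("The dog, the zebra"))  # [1, 0, 0, 1, 0, 2, 0]
```

A real text encoder goes much further and captures word order and context, but the first step is the same: text in, numbers out.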

3. Image Encoding

Similar to text encoding in principle, image encoding translates the data in an image into terms the computer model can understand.

This is done by compressing the image and extracting high-level features that the model’s computer vision can identify, then passing the data through to the next stage.
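
The "compress, then pass it on" idea can be sketched with average pooling, one of the simplest ways to shrink an image into a short list of numbers. Real image encoders learn far richer features; the function names and the 4x4 sample image below are invented for this example.

```python
def average_pool(image, block):
    """Shrink a grayscale image by averaging each block x block tile."""
    height, width = len(image), len(image[0])
    pooled = []
    for top in range(0, height - block + 1, block):
        row = []
        for left in range(0, width - block + 1, block):
            tile = [image[top + dy][left + dx]
                    for dy in range(block) for dx in range(block)]
            row.append(sum(tile) / len(tile))
        pooled.append(row)
    return pooled


def encode_image(image, block=2):
    """Flatten the pooled image into a single feature vector."""
    return [value for row in average_pool(image, block) for value in row]


image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [100, 100, 50, 50],
    [100, 100, 50, 50],
]
print(encode_image(image))  # [0.0, 255.0, 100.0, 50.0]
```

Sixteen pixels become four numbers, one per quarter of the picture.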

4. Multimodal Fusion

This is where it gets super technical.

In this stage of processing, the textual input and image input are concatenated (linked together in a chain or series) or added together to create a whole representation of the input.

This is then passed through one or more fusion layers that combine information from the text and image inputs.

In other words, multimodal fusion takes your inputs and transforms the data into a definitive command.
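
For the curious, the "concatenate or add, then fuse" step can be sketched in a few lines. This assumes the text and image have already been turned into short lists of numbers; the vectors, weights and function names are made up, and a real fusion layer would use learned weights and far larger vectors.

```python
def concatenate(text_vec, image_vec):
    """Join the two vectors end to end."""
    return text_vec + image_vec


def add_vectors(text_vec, image_vec):
    """Add the vectors element by element (they must be the same length)."""
    if len(text_vec) != len(image_vec):
        raise ValueError("Vectors must have the same length to be added.")
    return [t + i for t, i in zip(text_vec, image_vec)]


def fusion_layer(vector, weights, bias):
    """One linear layer followed by ReLU, mixing every input into every output."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, vector)) + b)
        for row, b in zip(weights, bias)
    ]


text_vec = [1.0, 2.0]    # stand-in for an encoded prompt
image_vec = [3.0, 4.0]   # stand-in for an encoded image

joined = concatenate(text_vec, image_vec)
weights = [[0.5, 0.5, 0.5, 0.5],
           [1.0, -1.0, 1.0, -1.0]]
bias = [0.0, 0.0]

print(joined)                               # [1.0, 2.0, 3.0, 4.0]
print(add_vectors(text_vec, image_vec))     # [4.0, 6.0]
print(fusion_layer(joined, weights, bias))  # [5.0, 0.0]
```

Each output number depends on both the text and the image values, which is the whole point of fusion.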

5. Decoding

Decoding is the reverse of encoding. Where encoders transform information into machine-readable formats for processing and transmission, decoders convert it back into human-readable language.

Decoders often use a probability method to determine which output will best suit the user input.

It’s kind of like how predictive text guesses what you’re trying to say based on what you’re typing and the conversation history.
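
Here is a small sketch of that probability method, assuming the model has already scored a handful of candidate words. It uses softmax to turn scores into probabilities and then picks the most likely one (often called greedy decoding). The candidate words and scores are invented, and real decoders choose from a vocabulary of thousands of tokens, one step at a time.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def pick_next_word(candidates, scores):
    """Greedy decoding: return the candidate with the highest probability."""
    probabilities = softmax(scores)
    best = max(range(len(candidates)), key=lambda i: probabilities[i])
    return candidates[best], probabilities[best]


candidates = ["sunbathing", "swimming", "skiing"]
scores = [2.0, 1.0, -1.0]  # made-up scores for "He is ..."

word, probability = pick_next_word(candidates, scores)
print(word)  # sunbathing
```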

6. Output

And finally, the output is the actual response you get from Visual ChatGPT.

The model weighs up the candidate responses that suit your query and returns the one with the highest probability.

How to Use Visual ChatGPT

Because Microsoft has made Visual ChatGPT open-source, you can access and use the application from a number of different places.

One option is to run it on your own desktop via Python. Alternatively, you can use a website interface to interact with the AI model without the fuss of installing it.

Let’s look at both options:

Steps to run Visual ChatGPT on your system

Follow these steps from Microsoft’s GitHub repository. You’ll need Python and conda installed first.

# Clone the repo
git clone https://github.com/microsoft/visual-chatgpt.git

# Go to directory
cd visual-chatgpt

# Create a new environment
conda create -n visgpt python=3.8

# Activate the new environment
conda activate visgpt

# Prepare the basic environments
pip install -r requirements.txt

# Prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}

# Prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}

# Start TaskMatrix !
# You can specify the GPU/CPU assignment by "--load", the parameter indicates which
# Visual Foundation Model to use and where it will be loaded to
# The model and device are separated by underline '_', the different models are separated by comma ','
# The available Visual Foundation Models can be found in the following table
# For example, if you want to load ImageCaptioning to cpu and Text2Image to cuda:0
# You can use: "ImageCaptioning_cpu,Text2Image_cuda:0"

# Advice for CPU Users
python visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cpu

# Advice for 1 Tesla T4 15GB (Google Colab)
python visual_chatgpt.py --load "ImageCaptioning_cuda:0,Text2Image_cuda:0"

# Advice for 4 Tesla V100 32GB
python visual_chatgpt.py --load "Text2Box_cuda:0,Segmenting_cuda:0,
    Inpainting_cuda:0,ImageCaptioning_cuda:0,
    Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,
    Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,
    InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,
    SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,
    Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,
    NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3"
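
If the load string looks cryptic, the following sketch shows how a value in that format can be read: models are separated by commas, and each model is separated from its device by an underscore. The parse_load function is written for this article to illustrate the format; it is not copied from the repository, which may handle the string differently.

```python
def parse_load(load):
    """Split 'Model_device,Model_device' into a {model: device} dictionary."""
    assignments = {}
    for entry in load.split(","):
        entry = entry.strip()  # tolerate spaces and line breaks between entries
        if not entry:
            continue
        model, _, device = entry.partition("_")
        if not model or not device:
            raise ValueError(f"Expected Model_device, got {entry!r}")
        assignments[model] = device
    return assignments


print(parse_load("ImageCaptioning_cpu,Text2Image_cuda:0"))
# {'ImageCaptioning': 'cpu', 'Text2Image': 'cuda:0'}
```

So in the Google Colab example, both models end up on the first GPU (cuda:0), while the CPU example keeps everything on the processor.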

Using Visual ChatGPT online

Head to a website that runs the Visual ChatGPT model like Stable Diffusion.
Using the chatbox in the right-hand corner, enter your OpenAI API key, then your text prompt and/or image URL. You might ask for a description of an image or for the model to generate a new image based on your prompt.
Visual ChatGPT will now process your request using the 22 visual foundation models.
Wait for Visual ChatGPT to generate its response.
Wahoo! Gaze in awe at your AI-generated masterpiece.

How Does it Differ From AI Image Generators?

The big difference between Visual ChatGPT and AI image generators is that the former can understand conversational text inputs and highly complex queries.

It can process several tasks simultaneously and give feedback on images upon request, describing and even modifying elements of an image.
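
One way to picture a multi-step request is as a chain of small tools applied one after another. The sketch below is a loose analogy only: the "image" is just a list of object labels, and the tool names (remove, add) and functions are hypothetical stand-ins, not the visual foundation models Visual ChatGPT actually calls.

```python
def remove_object(scene, target):
    """Return the scene without the named object."""
    return [item for item in scene if item != target]


def add_object(scene, target):
    """Return the scene with the named object added."""
    return scene + [target]


def describe(scene):
    """Give a one-line text description of the scene."""
    return "An image containing: " + ", ".join(scene)


# Hypothetical tool registry: each action name maps to one editing function.
TOOLS = {"remove": remove_object, "add": add_object}


def run_request(scene, steps):
    """Apply each (action, target) step in order, then describe the result."""
    for action, target in steps:
        scene = TOOLS[action](scene, target)
    return scene, describe(scene)


scene = ["beach", "towel", "person"]  # labels standing in for a real image
steps = [("remove", "person"), ("add", "dog")]

edited, summary = run_request(scene, steps)
print(summary)  # An image containing: beach, towel, dog
```

A standalone image generator would need a fresh prompt for each of those steps, whereas a chat-driven system can carry the image forward from one instruction to the next.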

What Could Visual ChatGPT Be Used For?

There are many real-world applications for this incredible AI.

Similar to how the advancements in AI-based tools allowed businesses to enhance the chatbot experience across customer service industries, Visual ChatGPT will take this one step further.

1. Customer service teams

How often do companies ask you for images?

You have to find the image, attach it to an email and wait for a reply. Oh, and make sure you send it during working hours, too!

It’s laborious and takes too long.

With Visual ChatGPT, customers can upload images instantly, at any time of the day, and the model will scan their media to offer faster solutions.

2. E-commerce

There’s much that can be done in e-commerce to streamline business operations and fine-tune the customer journey.

For example, a customer could generate an image of a product purely based on a description.

Visual ChatGPT could act as a virtual assistant, probing the customer for suggestions and displaying appropriate products and services, as well as suggesting alternatives based on the chat context and history.

3. Healthcare

It could help in diagnosing patients remotely by analysing images and videos sent by the user.

Visual ChatGPT could identify anomalies or irregularities in a patient’s record using historical imagery or conversation, highlighting areas of concern for doctors and other healthcare professionals.

4. Social Media

Businesses can find suitable collaborators on social media by using Visual ChatGPT to analyse content and visuals.

This can help them determine whether a potential collaborator aligns with the business’s values and brand identity.

The AI tool could also be used to analyse trending topics, patterns and user behaviours to help shape marketing strategies for a target audience.

5. Education

Visual ChatGPT could be used to provide additional learning resources, such as images and videos that illustrate complex ideas.

It could be used to develop students’ language skills, advising on grammar, spelling and word choice.

Or to help teach a new language!

6. Creatives

Photographers, videographers, editors, content creators, writers: much of the creative field could benefit from the ability to edit and share images quickly and for free.

Wrapping Up

Microsoft’s open-source AI tool Visual ChatGPT marks a milestone in technology that will be referenced in years to come.

Combining the wordsmithery of ChatGPT with the image-generating capabilities of VFM is a powerful mix that has plenty of uses for businesses and individuals alike.

As the model develops and receives feedback, it could shape up to be the sort of tech that sees wide-scale adoption across every industry.

I guess we’ll have to wait and see.

FAQs

How does Visual ChatGPT work?
Essentially, the model combines two things. First, it uses ChatGPT to understand a person’s text prompt. Second, it pairs this with visual foundation models (VFMs) to work out how to turn that prompt into an image.
Is Visual ChatGPT free?
Yes, but you do have to obtain an OpenAI API key which does require a bit of faffing. We have a guide on this in the article.
Why is Visual ChatGPT so good?
Because it can handle multiple, highly complex requests at once. It also produces more accurate responses as the model cross-references information from 22 VFMs. Visual ChatGPT can also give real-time feedback on images and make edits on request.