ChatGPT is an advanced language model that uses natural language processing and machine learning to hold remarkably human-like conversations. All you have to do is type a prompt into the chat box, and ChatGPT will reply with a relevant response to your query. Over the last few months, the AI tool has seen massive popularity among everyone from content creators to students to computer programmers. But ChatGPT lacks one feature many people want: the ability to generate digital art.
Visual foundation models (VFMs) scratch this itch, and there are already a few around; you might have seen DALL-E 2, Stable Diffusion or Midjourney.
These are all models capable of receiving a text prompt and transforming it into visual content.
Microsoft recently announced that ChatGPT would also be getting the art generator treatment in a bid to push the boundaries of what is possible with the artificial intelligence chatbot.
This new model has been dubbed Visual ChatGPT. Keep reading to find out what it is, why it exists, what it can do and how to use it.
Visual ChatGPT combines the key features of OpenAI’s ChatGPT with a series of VFMs (22 of them, in fact).
This allows the model to receive and generate images in response to user requests, something that couldn’t be done with ChatGPT alone, which is limited to text inputs and outputs.
ChatGPT could generate a prompt for other AI image generators like Stable Diffusion or Midjourney, but not for itself. Visual ChatGPT changes this and opens up the door for multi-modal interactions.
On top of this, Visual ChatGPT offers features similar to image editing software (like Adobe Photoshop): basic editing tasks such as cropping a photo, changing the background colour or removing an object.
In short, Visual ChatGPT can understand and process both language and images.
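To make the idea concrete, here’s a toy sketch (not Microsoft’s actual code) of how a chatbot might route a request to different visual tools. The tool names and the keyword-based routing below are invented purely for illustration; in the real system, the language model itself decides which of the 22 VFMs to call via carefully engineered prompts.

```python
from typing import Optional

# Stand-in "tools" representing a few visual foundation models.
def generate_image(prompt: str) -> str:
    return f"[new image generated from: '{prompt}']"

def edit_image(image_path: str, instruction: str) -> str:
    return f"[{image_path} edited: '{instruction}']"

def caption_image(image_path: str) -> str:
    return f"[description of {image_path}]"

TOOLS = {
    "text-to-image": generate_image,
    "image-editing": edit_image,
    "image-captioning": caption_image,
}

def handle_request(text: str, image_path: Optional[str] = None) -> str:
    """Crude keyword routing; the real system lets the LLM choose the tool."""
    if image_path is None:
        return TOOLS["text-to-image"](text)
    if any(word in text.lower() for word in ("remove", "change", "crop", "make")):
        return TOOLS["image-editing"](image_path, text)
    return TOOLS["image-captioning"](image_path)

print(handle_request("a cat wearing sunglasses, watercolour style"))
print(handle_request("remove the background", image_path="product.png"))
```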
Visual ChatGPT brings some brand-new capabilities to the AI-powered chatbot, including advanced image editing, making it a very powerful tool.
Let’s look at some of these now:
The above diagram shows the inner workings of Visual ChatGPT directly from Microsoft's GitHub repository.
Now, to almost anyone other than a computer scientist, this won't make a lot of sense, so for everyone else, here's a description in layman’s terms:
Visual ChatGPT lets you choose from two input types: text and image.
Using one or both of these gives your input context so the model can generate a more accurate response.
For example, you might type a description of the image you want to generate; by also attaching an image input, you give the model extra context to draw on, so it can create something closer to what you had in mind.
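As a rough illustration, you can think of each request as a little bundle of text plus an optional reference image. The UserRequest container below is hypothetical, not part of Visual ChatGPT’s API; it just shows the two input types side by side.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserRequest:
    """A hypothetical container for one Visual ChatGPT turn."""
    text: str                         # what you want done, in plain language
    image_path: Optional[str] = None  # optional reference image for extra context

# Text-only request: the model has to imagine everything from scratch.
prompt_only = UserRequest(text="Draw a red bicycle leaning against a brick wall")

# Text + image: the attached photo anchors the style, colours and composition.
with_reference = UserRequest(
    text="Make the bicycle in this photo yellow and add a basket",
    image_path="my_bike.jpg",  # hypothetical file
)
```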
Transformer-based neural networks called text encoders assign meaning to the words in your text prompt so the model can generate an appropriate response.
Using its training data, Visual ChatGPT analyses your words and makes a calculated guess at what you mean.
This is surprisingly accurate most of the time thanks to the sheer amount of data these AI models have been trained on; simply put, they know a lot.
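Here’s a minimal sketch of what text encoding looks like in practice, using the openly available CLIP text encoder from the Hugging Face transformers library as a stand-in. The encoders inside Visual ChatGPT’s own VFMs differ, but the principle of turning words into vectors of numbers is the same.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a watercolour painting of a lighthouse at sunset"
tokens = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Each token becomes a vector; together they capture the prompt's meaning.
    embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # roughly (1, number_of_tokens, 512)
```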
Similar in principle to text encoding, image encoding translates the data in an image into terms the computer model can understand.
This is done by compressing the image and extracting high-level features that the model’s computer vision can identify, then passing that data on to the next stage.
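And here is the image-side equivalent, again using CLIP’s vision encoder purely as an illustrative stand-in for whatever backbone a given VFM actually uses:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("my_bike.jpg")  # hypothetical input image
pixels = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # A single vector summarising the "high-level features" of the photo.
    image_features = vision_encoder(**pixels).pooler_output

print(image_features.shape)  # (1, 768)
```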
This is where it gets super technical.
In this stage of processing, the textual input and image input are concatenated (linked together in a chain or series) or added together to create a whole representation of the input.
This is then passed through one or more fusion layers that combine information from the text and image inputs.
In other words, multimodal fusion takes your separate inputs and merges them into a single representation the model can act on.
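A toy example of the simplest possible fusion, assuming we already have one vector for the text and one for the image: concatenate them and pass the result through a small neural layer. Real systems typically use attention-based fusion, but the shape of the idea is the same.

```python
import torch
import torch.nn as nn

text_vec = torch.randn(1, 512)   # stand-in for the encoded text prompt
image_vec = torch.randn(1, 768)  # stand-in for the encoded image

fusion_layer = nn.Sequential(
    nn.Linear(512 + 768, 512),  # mix the two modalities into one vector
    nn.ReLU(),
)

joint = fusion_layer(torch.cat([text_vec, image_vec], dim=-1))
print(joint.shape)  # (1, 512) -> one fused representation of text + image
```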
Read this for a full breakdown.
Decoding is the reverse of encoding. Where encoders transform information into machine-readable formats for processing and transmission, decoders convert it back into human-readable language.
Decoders often use probabilities to determine which output will best suit the user input.
It's kind of like how predictive text guesses what you’re trying to say based on what you’re typing and the conversation history.
And finally, the output represents the actual written response you get from Visual ChatGPT.
The response you get depends on the algorithm scoring candidate responses against your query and choosing the one with the highest probability.
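In miniature, this probability-based selection looks something like the greedy decoding sketch below (the vocabulary and scores are made up for illustration): the model turns its raw scores into probabilities and picks the most likely option.

```python
import torch

vocabulary = ["a", "photo", "of", "cat", "dog", "<end>"]
logits = torch.tensor([0.1, 2.3, 0.4, 3.1, 1.2, 0.2])  # made-up model scores

# Convert scores to probabilities, then take the single most likely token.
probabilities = torch.softmax(logits, dim=-1)
best = torch.argmax(probabilities).item()

print(vocabulary[best], round(probabilities[best].item(), 2))  # "cat", the top choice
```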
Because Microsoft has made Visual ChatGPT open-source, there are a couple of different ways to access and use the application.
Let’s look at both options:
Follow the setup steps in Microsoft’s GitHub repository; you’ll need to download Python first.
This is a demo of Visual ChatGPT:
The big difference between Visual ChatGPT and standalone AI image generators is that the former can hold a conversation and understand highly complex queries.
It can process several tasks simultaneously and give feedback on images upon request, describing and even modifying elements of an image.
There are many real-world applications for this incredible AI.
Just as advances in AI-based tools allowed businesses to enhance the chatbot experience across customer service industries, Visual ChatGPT will take this one step further.
How often does a company ask you to send over an image?
You have to find it, attach it to an email and wait for a reply. Oh, and make sure you send it during working hours, too!
It’s laborious and takes too long.
With Visual ChatGPT, customers can upload images instantly, at any time of day, and the model will scan the media to offer faster solutions.
There’s much that can be done in e-commerce to streamline business operations and fine-tune the customer journey.
For example, a customer could generate an image of a product purely based on a description.
Visual ChatGPT could act as a virtual assistant, probing the customer for suggestions and displaying appropriate products and services, as well as suggesting alternatives based on the chat context and history.
It could help in diagnosing patients remotely by analysing images and videos sent by the user.
Visual ChatGPT could identify anomalies or irregularities in a patient's record using historical imagery or conversation, highlighting areas of concern for doctors and other healthcare professionals.
Businesses can find suitable collaborators on social media by using Visual ChatGPT to analyse content and visuals.
This can help them determine whether the person aligns with the business’s values and brand identity.
The AI tool could also be used to analyse trending topics, patterns and user behaviours to help shape marketing strategies for a target audience.
Visual ChatGPT could be used to provide additional learning resources, such as images and videos that illustrate complex ideas.
It could also be used to develop students’ language skills, advising on grammar, spelling and word choice.
Or to help teach a new language!
Photographers, videographers, editors, content creators, writers: much of the creative field could benefit from the ability to edit and share images quickly and for free.
Microsoft’s open-source AI tool Visual ChatGPT marks a milestone in technology that will be referenced in years to come.
Combining the wordsmithery of ChatGPT with the image-generating capabilities of VFM is a powerful mix that has plenty of uses for businesses and individuals alike.
As the model develops and receives feedback, it could shape up to be the sort of tech that sees widescale implementation across every industry.
I guess we’ll have to wait and see.
How does Visual ChatGPT work?
Essentially, the model combines two things. First, it uses ChatGPT to understand a person's textual prompt. Second, it passes this to what are called VFMs (visual foundation models) to work out how to transform your text prompt into an image.
Is Visual ChatGPT free?
Yes, but you do have to obtain an OpenAI API key, which requires a bit of faffing. We cover this in the article above.
Why is Visual ChatGPT so good?
Because it can handle multiple, highly complex requests at once. It also produces more accurate responses, as the model cross-references information from 22 VFMs. Visual ChatGPT can also give real-time feedback on images and make edits on request.