This article will answer all of your big questions about AI image captioning, specifically:
- What is image captioning?
- Why is image captioning important in the context of website builders?
- How are we planning to automate this for Zyro customers?
So, let’s get started.
The What
What is image captioning?
Essentially, AI image captioning is a process where you feed an image into a computer program and text pops out describing what is in the image.
More precisely, image captioning is a collection of techniques in Natural Language Processing (NLP) and Computer Vision (CV) that allow us to automatically determine what the main objects in an image are and generate a descriptive text about those objects.
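To make that concrete, here’s a minimal Python sketch of what using an off-the-shelf captioning model can look like. The library and model named below are just one publicly available example (and the file name is a placeholder), not what Zyro uses:

```python
from transformers import pipeline

# Load a publicly available image captioning model
# (an example choice, not the model discussed later in this article).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Feed in an image, get a descriptive text back.
result = captioner("dog_with_ball.jpg")  # hypothetical file name
print(result[0]["generated_text"])       # e.g. "a dog lying on the ground next to a ball"
```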
The Why
Okay, so now we know what image captioning is. But before we delve into concrete techniques, let’s talk about why we would want to do that (considering that Zyro is a website builder).
We can answer that question in three letters – SEO (read: search engine optimization).
In short, SEO is a process that makes your website more visible on search engines such as Google. Implementing the best SEO practices will help your website to rank higher on search engines and increase high-quality traffic to your site.
Here’s a nice picture summarizing the gist of SEO:

If you want to read more about it, check out this ultimate guide to SEO.
Needless to say, if you’re building a website or an online store, one of the prerequisites for success is that people can actually find it.
There are many things that factor into how well a website ranks, such as:
- The speed of your website
- The accessibility of your website
- The content on your website
Of course, there are many more factors, but let’s go back to the topic at hand – image captioning.
If your website has images and those images have alternative text (also called alt text), your website will be more accessible.
Alt text is primarily designed for accessibility purposes, like helping visually impaired people ‘see’ images with screen reading tools.
And while alt text is extremely useful in that regard, it also has real implications for SEO.
Think about it like this: a search engine can’t actually see the images on your website. So, for it to decide whether your website is filled with relevant images or spam, you have to describe what is in each image.
If you do it correctly, the bots at Google that rank websites should take one look at your website and say: “Hey, this website is super cool, let’s rank it higher.”
And search engine ranking is crucial for your business’s success. The catch: if your website has a lot of images (product pictures, for example), editing the alternative text for every one of them individually is a long and tiring task.
That’s where AI comes in.
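To give a feel for what “AI comes in” could mean in practice, here’s a rough Python sketch of automatically filling in missing alt text on a page. The caption_image function here is just a stand-in for the captioning models discussed below, and the HTML handling is deliberately simplified:

```python
from bs4 import BeautifulSoup

def caption_image(image_url: str) -> str:
    # Stand-in for a real image captioning model (see "The How" below).
    # In practice this would fetch the image and run it through the model.
    return "a small dog lying on the pavement with a red ball"

def fill_missing_alt_text(html: str) -> str:
    # Find every <img> tag without alt text and fill it in automatically.
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        if not img.get("alt"):
            img["alt"] = caption_image(img.get("src", ""))
    return str(soup)

print(fill_missing_alt_text('<img src="dog.jpg">'))
# -> the <img> tag now carries a descriptive alt attribute
```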
The How
Now that you know why captioned images are important for your website, let’s talk about how Zyro is planning to automate this task for our customers.
Right off the bat, you need to know one thing:
Image captioning is not a simple task.
Disclaimer: we don’t offer this feature yet. However, we’ve tried out several different methods, have a good working proof of concept (PoC), and are planning to implement it as soon as possible.
Until then, let’s talk about our experience in building this PoC and compare the different open source methods.
Let’s start by thinking about how this process works for us humans.

Consider the image above. If we were asked to describe it, we might say “a dog lying on the pavement” or “a small dog lying with a red ball.”
But, how are we doing this?
While forming the description in our heads, we are doing two things at once: looking at the image and constructing a meaningful sequence of words about it.
Replicating that connection between vision and language, and the understanding of images that comes with it, is a fundamental problem in machine learning.
Now, let’s get our feet wet. The first part (seeing the image) is handled by a neural network designed for images, such as a convolutional neural network (CNN), and the second part (generating the sequence of words) is handled by another kind, such as a recurrent neural network (RNN).
By combining the two, we can build a system that generates captions for images.
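Here’s a minimal, heavily simplified sketch of that CNN-plus-RNN split in TensorFlow/Keras. The vocabulary size, embedding size, and caption length are made-up numbers, and this is only an illustration of the idea, not how any of the models below are actually implemented (real systems add attention, beam search, and much more on top of this skeleton):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Made-up hyperparameters, purely for illustration.
vocab_size, embed_dim, max_len = 10000, 256, 20

# "Seeing" part: a pretrained CNN turns the image into a feature vector.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
cnn.trainable = False

image_input = layers.Input(shape=(299, 299, 3))
image_features = layers.Dense(embed_dim, activation="relu")(cnn(image_input))

# "Talking" part: an RNN predicts the next word of the caption, conditioned
# on the image features (used here as the RNN's initial state) and on the
# words generated so far.
caption_input = layers.Input(shape=(max_len,))
word_embeddings = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
rnn_output = layers.LSTM(embed_dim)(
    word_embeddings, initial_state=[image_features, image_features]
)
next_word = layers.Dense(vocab_size, activation="softmax")(rnn_output)

captioner = Model(inputs=[image_input, caption_input], outputs=next_word)
captioner.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```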
But before we dive in, I should mention how we tackle these problems at Zyro: we first try out the state-of-the-art (SOTA) pre-trained models that are available open source, and then go from there.
If a model needs more training for our purposes, we train it further. Otherwise, we’re good to go.
For image captioning, we tried three SOTA models with differing results.
1. Show and Tell
The first model we tried seemed promising: it had been sitting at the top of many benchmarks for quite a long time.
It was developed at Google and open sourced as a TensorFlow library. The ‘seeing’ part of this model is handled by Inception V3, a convolutional neural network that acts as the image encoder.
The image encoder is the part of the system that tells you what objects are in the image (a dog, a ball, pavement). Those objects then have to be related to one another and their descriptions expanded (a red ball, black pavement, and so on), which is the job of the recurrent neural network.
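For a sense of what the encoder side produces, here’s a small TensorFlow sketch of running an image through Inception V3 and getting back a single feature vector (the file name is a placeholder):

```python
import numpy as np
import tensorflow as tf

# Load Inception V3 without its classification head; global average pooling
# gives one 2048-dimensional feature vector per image.
encoder = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")

img = tf.keras.preprocessing.image.load_img("dog_with_ball.jpg", target_size=(299, 299))
x = tf.keras.preprocessing.image.img_to_array(img)
x = tf.keras.applications.inception_v3.preprocess_input(np.expand_dims(x, axis=0))

features = encoder.predict(x)
print(features.shape)  # (1, 2048) - this vector is what the RNN decoder conditions on
```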

However, the initial results were not that good, as you can tell.

2. Self-critical Sequence Training
This model was very easy to get running, and it is currently competing for first place on the standard benchmarks. The pre-trained model worked quite well out of the box.
It uses a combination of LSTMs (a form of recurrent neural network) for the language part and deep CNNs for the vision part. The key idea is what’s called self-critical sequence training (SCST), a form of reinforcement learning in which the model’s own greedy-decoded caption serves as the baseline, and it greatly improves scores on the benchmark datasets.
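Here’s a rough PyTorch sketch of the SCST training signal. This is a simplified illustration of the idea, not the authors’ code; the rewards would come from a metric such as CIDEr computed against the reference captions:

```python
import torch

def scst_loss(sample_log_probs, sample_reward, greedy_reward):
    """Self-critical sequence training loss (simplified sketch).

    sample_log_probs: (batch, seq_len) log-probabilities of a caption
                      sampled from the model's own distribution.
    sample_reward:    (batch,) CIDEr score of each sampled caption.
    greedy_reward:    (batch,) CIDEr score of the greedily decoded caption,
                      used as the "self-critical" baseline.
    """
    # Advantage: how much better the sampled caption scored than the caption
    # the model would produce at test time. No separate learned baseline
    # (critic) is needed - the model critiques itself.
    advantage = (sample_reward - greedy_reward).unsqueeze(1)  # (batch, 1)

    # REINFORCE-style update: increase the probability of captions that beat
    # the greedy baseline, decrease it for those that don't.
    return -(advantage * sample_log_probs).sum(dim=1).mean()
```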
The results produced were great:


The authors of this model wrote great code that works nearly perfectly straight out of the box, so if we do decide to go with this model, it will certainly save us a lot of time.
3. OSCAR
This final model came from Microsoft and looked set to be one of the main contenders on the benchmarks.
The OSCAR technique uses a pretrained Faster R-CNN model to detect objects in the image and represent them as a set of visual region features. Each of these features then gets an associated object tag.

The caption itself can then be represented as word embeddings using a pretrained BERT model.
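To illustrate, here’s a rough sketch of the kind of input triple OSCAR is trained on. The values and shapes below are made up for illustration and don’t come from the actual implementation:

```python
import torch

# 1. Visual region features from a pretrained Faster R-CNN detector:
#    one feature vector per detected region (random here for illustration).
region_features = torch.randn(3, 2048)

# 2. Object tags predicted by the detector for those regions.
object_tags = ["dog", "ball", "pavement"]

# 3. The caption, tokenized the way BERT expects.
caption_tokens = ["a", "small", "dog", "lying", "with", "a", "red", "ball"]

# OSCAR embeds the caption tokens and object tags with BERT-style word
# embeddings and attends over them together with the region features, so the
# tags act as anchor points linking the text to the image regions.
word_sequence = ["[CLS]"] + caption_tokens + ["[SEP]"] + object_tags + ["[SEP]"]
print(word_sequence)
```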
Crucially, these three pieces of information (the caption words, the object tags, and the region features) are used together to train OSCAR. But even though OSCAR is one of the top models on the benchmarks, the pretrained model did not satisfy our needs.
Here, the caption for the first image was: “A couple of animals are standing in a field.”
Meanwhile, the second picture’s caption read: “A man is standing on a street.”

For the needs of Zyro customers, we would either need to wait for the OSCAR+ pretrained model (which should bring improvements over the current OSCAR model) or train the current model further ourselves.
So, Which Model Did We Choose?
The problem of captioning images correctly is a long-standing problem in AI and it is certainly a fascinating one.
Right now, for our needs, the Self-critical Sequence Training for Image Captioning seems the most promising.
We are very excited about the new research that comes out of this field and can’t wait to offer our customers this handy tool as soon as we can.