A dataset to improve automatic captioning systems

A new dataset can help scientists develop automatic systems that generate richer, more descriptive captions for online charts.


For people with visual impairments, captions that describe or explain charts make the underlying data accessible and can improve comprehension and retention. However, current methods for automatically producing such captions struggle to describe the perceptual and cognitive features that make charts distinctive.

To address this problem, MIT scientists have introduced VisText, a dataset designed to improve automatic captioning systems. With it, researchers can train a machine-learning model to adjust the complexity and type of content in a chart caption to match user requirements.

With VisText, the scientists aim to tackle the thorny problem of chart auto-captioning. Such automated systems could help caption the many online charts that currently lack them, increasing accessibility for people with visual impairments.

Angie Boggust, a graduate student in electrical engineering and computer science at MIT, said, “We’ve tried to embed a lot of human values into our dataset so that when we and other researchers are building automatic chart-captioning systems, we don’t end up with models that aren’t what people want or need.”

VisText is a dataset of charts and associated captions that can be used to train machine-learning models to generate accurate, semantically rich, customizable captions. VisText also represents charts as scene graphs to address the drawbacks of images and data tables: a scene graph, which can be extracted from a chart image, contains all the chart data plus additional image context.

They created a dataset of more than 12,000 charts, each represented as a data table, an image, and a scene graph, with associated captions. Each chart has two captions: a low-level caption that describes the chart's construction (such as its axis ranges) and a high-level caption that discusses statistics, relationships in the data, and complex trends.
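To make the structure concrete, here is a minimal sketch of what one such record might look like. The field names and values are hypothetical illustrations, not the dataset's actual schema:

```python
# Hypothetical sketch of a single VisText-style record: one chart stored as
# an image reference, a data table, a scene graph, and two caption levels.
record = {
    "image": "charts/bar_01.png",             # rendered chart image
    "data_table": [                           # the underlying data values
        {"year": 2020, "sales": 120},
        {"year": 2021, "sales": 150},
    ],
    "scene_graph": {                          # structural description of the chart
        "type": "bar_chart",
        "x_axis": {"title": "year", "range": [2020, 2021]},
        "y_axis": {"title": "sales", "range": [0, 160]},
        "marks": [{"x": 2020, "y": 120}, {"x": 2021, "y": 150}],
    },
    "captions": {
        # Low level: the chart's construction, e.g. axis ranges.
        "low_level": "A bar chart of sales by year; the y-axis ranges from 0 to 160.",
        # High level: statistics, relationships, and trends.
        "high_level": "Sales rose from 120 in 2020 to 150 in 2021.",
    },
}

# The scene graph carries both the chart data and its visual structure,
# context that an image alone or a data table alone would each lack.
print(record["captions"]["low_level"])
print(record["captions"]["high_level"])
```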

The scientists used an automated system to create low-level captions and crowdsourced higher-level captions from human workers.

Fellow graduate student Benny J. Tang said, “Our captions were informed by two key pieces of prior research: existing guidelines on accessible descriptions of visual media and a conceptual model from our group for categorizing semantic content. This ensured that our captions featured important low-level chart elements like axes, scales, and units for readers with visual disabilities while retaining human variability in how captions can be written.”

After collecting chart images and captions, the scientists used VisText to train five machine-learning auto-caption models. They were interested in learning how the quality of the caption varied by the various representations, including the image, data table, and scene graph.

Their findings demonstrated that models trained on scene graphs outperformed those trained on data tables. The researchers contend that, because scene graphs are also easier to extract from existing charts, they may be the more useful representation.

Additionally, they separately trained models using low-level and high-level captions. They were able to teach the model to adjust the intricacy of the caption’s content using a technique known as semantic prefix tuning.
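The prefix idea can be sketched as follows. A short control string prepended to the model input tells a single model which caption level to generate; the prefix strings and function below are illustrative assumptions, not VisText's exact implementation:

```python
def build_model_input(scene_graph_text: str, level: str) -> str:
    """Prepend a semantic prefix so one model can emit either caption level.

    `level` selects the low-level caption (chart construction: axes, scales,
    ranges) or the high-level caption (trends, statistics, relationships).
    The prefix strings here are hypothetical examples.
    """
    prefixes = {
        "low": "caption low-level:",
        "high": "caption high-level:",
    }
    return f"{prefixes[level]} {scene_graph_text}"

# At inference time, swapping the prefix changes the kind of caption produced
# without retraining or switching models.
inp = build_model_input("bar chart; x: year 2020-2021; y: sales 0-160", "high")
print(inp)  # caption high-level: bar chart; x: year 2020-2021; y: sales 0-160
```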

Additionally, they conducted a qualitative analysis of captions created using their top-performing method and identified six categories of frequent mistakes. A directional error, for instance, happens when a model predicts a trend is declining when it is increasing.
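A directional error of this kind can be illustrated with a toy keyword check. This is a simplified sketch for intuition only, not the paper's evaluation protocol:

```python
def direction_of(values) -> str:
    """Return the data's trend from a first-to-last comparison."""
    return "increasing" if values[-1] > values[0] else "decreasing"

def has_directional_error(caption: str, values) -> bool:
    """Flag a caption that states the opposite trend to the data.

    A toy check: it looks for the opposite trend word in the caption
    while the actual trend word is absent.
    """
    actual = direction_of(values)
    opposite = "decreasing" if actual == "increasing" else "increasing"
    return opposite in caption and actual not in caption

# The data rises, but the caption claims it falls: a directional error.
print(has_directional_error("Sales are decreasing over time.", [120, 150]))  # True
```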

Boggust says, “This fine-grained, robust qualitative evaluation was important for understanding how the model made its errors. For example, using quantitative methods, a directional error might incur the same penalty as a repetition error, where the model repeats the same word or phrase. But a directional error could be more misleading to a user than a repetition error. The qualitative analysis helped us understand these types of subtleties.”

“These sorts of errors also expose limitations of current models and raise ethical considerations that researchers must consider as they develop auto-captioning systems.”

Journal Reference:

  1. Benny J. Tang, Angie Boggust, Arvind Satyanarayan. VisText: A Benchmark for Semantically Rich Chart Captioning.