Action recognition has improved dramatically with massive-scale video datasets. Yet, these datasets are accompanied by issues related to curation cost, privacy, ethics, bias, and copyright. So, MIT scientists are turning to synthetic datasets.
These are made by a computer that uses 3D models of scenes, objects, and humans to quickly produce many varying clips of specific actions — without the potential copyright issues or ethical concerns that come with real data.
Is synthetic data as good as real data?
A team of scientists at MIT, the MIT-IBM Watson AI Lab, and Boston University sought to answer this question. They created a synthetic dataset of 150,000 video clips representing a variety of human actions and used it to train machine-learning models. They then showed these models six datasets of real-world videos to test how well they could recognize the actions in those clips.
Scientists found that the synthetically trained models performed even better than models trained on real data for videos that have fewer background objects.
This finding could help researchers use synthetic datasets in ways that make models more accurate on real-world tasks. It could also help them identify which machine-learning applications are best suited to training with synthetic data, reducing some of the ethical, privacy, and copyright concerns associated with real datasets.
Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab, said, “The ultimate goal of our research is to replace real data pretraining with synthetic data pretraining. There is a cost in creating an action in synthetic data, but once that is done, you can generate unlimited images or videos by changing the pose, lighting, etc. That is the beauty of synthetic data.”
Scientists started by compiling a new dataset, Synthetic Action Pre-training and Transfer (SynAPT), from three publicly available datasets of synthetic video clips that captured human actions. It contains almost 150 action categories, with 1,000 video clips per category.
Once the dataset was created, the scientists used it to pretrain three machine-learning models to recognize the actions. Pretraining is the process of teaching a model one task before teaching it another. The pretrained model can use the parameters it has already learned to help it learn a new task with a new dataset faster and more efficiently. This is modeled after how people learn: we reuse what we already know when we learn something new. The pretrained models were then tested on six datasets of real video clips, each capturing classes of actions that were different from those in the training data.
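The pretrain-then-transfer idea can be illustrated with a toy sketch. This is not the researchers' code or their models: the tiny two-layer network, the helper names (`init_layer`, `features`, `train`, `transfer`), and the two-number "clips" are all assumptions made for illustration. It shows only the mechanism described above, where the parameters learned on one task are reused and a fresh output layer is trained for new action classes.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    # Small random weights; one row per output unit.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def features(backbone, x):
    # Shared feature extractor (the part that gets reused after pretraining).
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in backbone]

def predict_probs(head, h):
    # Task-specific classifier: softmax over class scores.
    logits = [sum(w * v for w, v in zip(row, h)) for row in head]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(model, data, epochs, lr, freeze_backbone=False):
    # Plain stochastic gradient descent on cross-entropy loss.
    for _ in range(epochs):
        for x, y in data:
            h = features(model["backbone"], x)
            p = predict_probs(model["head"], h)
            dlogits = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
            # Gradient w.r.t. hidden features, computed before the head changes.
            dh = [sum(dlogits[k] * model["head"][k][j] for k in range(len(p)))
                  for j in range(len(h))]
            for k, row in enumerate(model["head"]):
                for j in range(len(row)):
                    row[j] -= lr * dlogits[k] * h[j]
            if not freeze_backbone:
                for j, row in enumerate(model["backbone"]):
                    g = dh[j] * (1.0 - h[j] ** 2)  # derivative of tanh
                    for i in range(len(row)):
                        row[i] -= lr * g * x[i]
    return model

def transfer(pretrained, n_new_classes, rng):
    # Reuse (copy) the learned backbone; start a new head for the new classes.
    backbone = [row[:] for row in pretrained["backbone"]]
    head = init_layer(len(backbone), n_new_classes, rng)
    return {"backbone": backbone, "head": head}

rng = random.Random(0)
# Hypothetical stand-ins: 2-number "clips" with 3 synthetic action classes.
synthetic_data = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
pretrained = {"backbone": init_layer(2, 4, rng), "head": init_layer(4, 3, rng)}
train(pretrained, synthetic_data, epochs=50, lr=0.1)

# Downstream task with different classes (2 "real" action classes).
real_data = [([1.0, 0.2], 0), ([0.1, 1.0], 1)]
downstream = transfer(pretrained, n_new_classes=2, rng=rng)
train(downstream, real_data, epochs=50, lr=0.1, freeze_backbone=True)
```

Here `freeze_backbone=True` trains only the new output layer on top of the reused features; setting it to `False` would instead fine-tune every parameter on the new data. Which of these the study used is not stated in this article.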
Scientists were surprised to see that all three synthetically trained models outperformed models trained on real video clips on four of the six datasets. Their accuracy was highest for datasets containing video clips with “low scene-object bias.” This means the model cannot recognize the action by looking at the background or other objects in the scene; it must focus on the action itself.
Feris said, “In videos with low scene-object bias, the temporal dynamics of the actions is more important than the appearance of the objects or the background, and that seems to be well-captured with synthetic data.”
“High scene-object bias can act as an obstacle. The model might misclassify an action by looking at an object rather than the action itself. It can confuse the model.”
Co-author Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab, said that, building off these results, the researchers want to include more action classes and additional synthetic video platforms in future work, eventually creating a catalog of models that have been pretrained using synthetic data.
“We want to build models which have very similar or even better performance than the existing models in the literature, but without being bound by any of those biases or security concerns.”
Sooyoung Jin, a co-author and CSAIL postdoc, said the researchers also want to combine their work with research that seeks to generate more accurate and realistic synthetic videos, which could boost the performance of the models.
“We use synthetic datasets to prevent privacy issues or contextual or social bias, but what does the model learn? Does it learn something that is unbiased?”
Co-author Samarth Mishra, a graduate student at Boston University (BU), said, “Despite there being a lower cost to obtaining well-annotated synthetic data, currently, we do not have a dataset with the scale to rival the biggest annotated datasets with real videos. By discussing the different costs and concerns with real videos and showing the efficacy of synthetic data, we hope to motivate efforts in this direction.”
- Yo-whan Kim et al. How Transferable are Video Representations Based on Synthetic Data?