Identifying spam can be a hard task as spammers get better and better at their job. How do you identify spam? The question is neither easy nor hard; it all depends on the situation the spammer is taking advantage of. Researchers at the Georgia Institute of Technology’s School of Interactive Computing have come up with a solution.
They have developed a novel computational approach to help internet communities moderate abusive online comments. The scientists call it Bag of Communities (BoC). The technique leverages large-scale, preexisting data from other internet communities to train an algorithm to identify abusive behavior within a separate target community.
The scientists identified nine different communities. Five of them are rife with abusive behavior from commenters. The other four, like MetaFilter, are heavily moderated.
Using semantic characteristics of comments from these two types of communities, the scientists developed an algorithm that learns from those comments. When a new post is generated within the target community, the algorithm predicts whether it is abusive or not.
Eric Gilbert, an associate professor, said, “MetaFilter is known around the internet as a good, helpful, supportive community. That’s an example of how, if your post is closer to that, it’s more likely that it should stay on the site. Conversely, if your post is closer to 4chan, then maybe it should come off.”
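The "closer to" idea in Gilbert's description can be illustrated with a small sketch. This is not the researchers' implementation: it assumes plain bag-of-words vectors and cosine similarity as a stand-in for whatever semantic features BoC actually uses, and the community names and example posts are hypothetical toy data.

```python
import math
from collections import Counter

def vectorize(text):
    """Turn a post into a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def community_profile(posts):
    """Sum the word counts of all posts from one community."""
    profile = Counter()
    for post in posts:
        profile.update(vectorize(post))
    return profile

# Hypothetical toy data standing in for the source communities.
ABUSIVE = {
    "community_a": ["you are an idiot", "shut up you stupid idiot"],
    "community_b": ["nobody likes you loser", "stupid loser go away"],
}
MODERATED = {
    "community_c": ["thanks for the helpful answer", "great question here is a link"],
    "community_d": ["welcome to the site", "please read the guidelines"],
}

def mean_similarity(post, communities):
    """Average similarity of a post to each community in a group."""
    vec = vectorize(post)
    sims = [cosine(vec, community_profile(posts)) for posts in communities.values()]
    return sum(sims) / len(sims)

def predict_abusive(post):
    """Flag a post if it sits closer to the abusive group than the moderated one."""
    return mean_similarity(post, ABUSIVE) > mean_similarity(post, MODERATED)

print(predict_abusive("you stupid loser"))
print(predict_abusive("thanks for the great answer"))
```

The decision rule here is deliberately crude (a direct comparison of two averages). A fuller version would presumably feed the per-community similarities into a trained classifier.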
The scientists developed two versions of the algorithm: a static model and a dynamic model.
The static model has access only to posts from the other nine communities, and it can accurately predict abusive posts in the target community roughly three-quarters of the time. The dynamic model, on the other hand, mimics scenarios in which newly moderated data arrives in batches. It learns over time and achieves 91.18 percent accuracy after seeing 100,000 human-moderated posts.
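To make the batch-wise learning of the dynamic model more concrete, here is a minimal sketch of a classifier that updates itself each time a new batch of moderator decisions arrives. It assumes a simple incremental Naive Bayes text classifier as a stand-in; the article does not say which learner the researchers used, and the class name, method names, and example posts are all hypothetical.

```python
import math
from collections import Counter

class IncrementalNaiveBayes:
    """Toy Naive Bayes classifier that can be updated batch by batch."""

    def __init__(self):
        # Word and post counts per label: True = removed by a moderator.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def partial_fit(self, batch):
        """Update the counts with a batch of (text, removed) pairs."""
        for text, removed in batch:
            words = text.lower().split()
            self.word_counts[removed].update(words)
            self.doc_counts[removed] += 1
            self.vocab.update(words)

    def _log_score(self, words, label):
        """Log prior plus log likelihood, both with add-one smoothing."""
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        for word in words:
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def predict(self, text):
        """Return True if the post looks like one a moderator would remove."""
        words = text.lower().split()
        return self._log_score(words, True) > self._log_score(words, False)

model = IncrementalNaiveBayes()

# First batch of (hypothetical) moderator decisions.
model.partial_fit([
    ("you are a stupid idiot", True),
    ("thanks for the helpful link", False),
])
before = model.predict("stupid fun")

# A later batch shows that this community tolerates the word "stupid".
model.partial_fit([
    ("that movie was stupid fun", False),
    ("stupid fun times with friends", False),
    ("stupid fun party tonight", False),
])
after = model.predict("stupid fun")

print(before, after)
```

In this toy run the prediction for the same post changes once the second batch is seen, which mirrors the idea that site-specific tolerance is picked up from moderator labels over time.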
Both models outperformed a model trained solely on in-domain data from a major internet community.
Much prior research on detecting abusive online comments has focused on in-domain methods, which face challenges in obtaining enough data to build and evaluate algorithms. In the BoC-based method, by contrast, algorithms leverage out-of-domain data from other existing online communities.
Chandrasekharan said, “Over time, as new moderator labels come in when it has seen examples of things that have been moderated from the site, it can learn more site-specific information. It can learn the type of comments that get moderated, and if there is a level of tolerance that is different from what you see in the static model, it could learn that over time.”
Gilbert said, “This is a core internet problem. So many places struggle with this, and many are shutting comments off because they just don’t want to deal with the trouble they cause.”