Data Scientists Engaged in the Battle Against Data Bias 

Responsible AI, as defined by H2O, is a blend of ethical AI, explainable AI, secure AI, and human-centered machine learning.

By AI Trends Staff 

Data scientists have joined the battle to eliminate or at least identify the bias in datasets used to train AI programs. 

The work is not easy. One person working on the problem is Benjamin Cox of H2O, a firm dedicated to what it calls “responsible AI,” a blend of ethical AI, explainable AI, secure AI, and human-centered machine learning. With a background in data science and experience at Ernst & Young, Nike, and Citigroup, Cox is now a product marketing manager at H2O.

Benjamin Cox, Director of Product Marketing, H2O

“I became deeply passionate about the field of responsible AI after years working in data science and realizing there was a considerable amount of work that needed to be done to prevent machine learning from perpetuating and driving systemic and economic inequality in the future,” Cox said in a recent interview in SearchEnterpriseAI. “In a perfect world, instead of allowing algorithms to perpetuate historical human bias, we can use them to identify that bias and strip it out and prevent it.”

He and his team take a number of steps to identify and neutralize bias in datasets. First the team looks at how the data was collected, to see if there were operational issues that cause bias to enter the dataset.   

It then looks for data imbalances that could lead a model to treat one group unfairly, usually because not enough data is available for that group to make good decisions. The team also checks for proxy variables. One example is the use of zip code in a credit decision model. A zip code can have a 95% correlation with a specific ethnicity. If the team decides ethnicity is not to be used to drive credit decisions, they need to ensure the model is not reverse-engineering ethnicity by way of another attribute.
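The two checks described above, group imbalance and proxy variables, can be sketched in a few lines of Python. This is a minimal illustration, not the H2O team's method: the records, the field names `zip` and `group`, and the helper functions `class_balance` and `proxy_strength` are all hypothetical, and the data is synthetic. The proxy check simply asks how often the protected attribute can be guessed from the other feature alone.

```python
from collections import Counter, defaultdict

def class_balance(records, key):
    """Share of records per value of `key`, to spot under-represented groups."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def proxy_strength(records, feature, protected):
    """Fraction of records whose protected value is recovered by guessing
    the most common protected value seen for their feature value."""
    by_feature = defaultdict(Counter)
    for r in records:
        by_feature[r[feature]][r[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_feature.values())
    return correct / len(records)

# Synthetic toy data (hypothetical): two zip codes, two groups.
records = (
    [{"zip": "Z1", "group": "g1"}] * 9
    + [{"zip": "Z1", "group": "g2"}] * 1
    + [{"zip": "Z2", "group": "g1"}] * 2
    + [{"zip": "Z2", "group": "g2"}] * 8
)

balance = class_balance(records, "group")
baseline = max(balance.values())           # accuracy of always guessing the largest group
strength = proxy_strength(records, "zip", "group")

print("group shares:", balance)
print("guess group without zip:", baseline)
print("guess group from zip:", strength)
```

If guessing from the feature does much better than the baseline, dropping the protected column is not enough, because the model can still reconstruct it. A real audit would use held-out data and a proper classifier rather than this in-sample majority vote.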

Transparency and explainability help to deliver on responsible AI. “We are able to paint a good picture of why the model came to the conclusion it did—and if the reason it came to that conclusion violates regulatory rules or makes common business sense,” Cox stated. 

Automated machine learning tools can drive risk to the extent the developers are not on top of what the algorithms are doing. The team has developed a technical approach to address this. “We automate the entire machine learning interpretability dashboard toolkit and model documentation so users can very easily analyze what the autoML system developed,” he said.  

Machine learning models need to have a degree of statistical bias to find the optimal performing model. “If we have no bias but extremely high variance, we have essentially created an extremely overfit model. This is one of the nuances of data science and understanding the underlying business problem you are trying to solve,” Cox said.  
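The tradeoff Cox describes can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not anything from H2O: it uses synthetic noisy sine data and a hand-written k-nearest-neighbors regressor (`make_data`, `knn_predict`, and `mse` are hypothetical helpers). With k=1 the model memorizes the training set, which is the low-bias, high-variance extreme. With k equal to the training set size it always predicts the overall mean, which is the high-bias extreme.

```python
import math
import random

def make_data(n, rng, noise=0.3):
    """Draw n noisy samples of sin(2*pi*x) on [0, 1) (synthetic data)."""
    data = []
    for _ in range(n):
        x = rng.random()
        data.append((x, math.sin(2 * math.pi * x) + rng.gauss(0.0, noise)))
    return data

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of the k-NN model fitted on `train`, scored on `data`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = make_data(200, rng)
test = make_data(200, rng)

results = {}
for k in (1, 10, len(train)):
    results[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:3d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

The k=1 model scores a training error of exactly zero because each training point is its own nearest neighbor, yet its error on fresh data is not zero. That gap is the overfitting Cox refers to. A moderate k accepts some bias in exchange for predictions that are more stable on unseen data.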

The brittle nature of some machine learning models during the Covid-19 pandemic has generated more skepticism of AI. “What Covid-19 may have done is really shine a light on models that were very overfitted to extremely linear and stable market scenarios,” Cox stated. “The balance between signal and stability is key for creating models that are more resilient to shocks.” 

Expecting that Machine Learning Should Incorporate Neutral Variables 

Christabelle Pabalan, math tutor, graduate of the University of San Francisco, master's in Data Science

Christabelle Pabalan approaches the fight against data bias from a student perspective. A graduate of the University of San Francisco with a master's in data science, she stated in a recent article in Towards Data Science, “The essence of AI is math.” She currently works as a math tutor at AJ Tutoring.

In theory, machine learning should provide a neutral assessment of many variables. But the well-known axiom “garbage in, garbage out” means poor quality input results in poor quality output. Historically, good inputs would be polished, accurate representations of society as it has acted in the past. “However, we can now see that our garbage input could very well be a polished, accurate representation of our society as it has acted in the past,” Pabalan states. Thus, the hazard in machine learning has more to do with humans than with robots.

“When societally-biased data is used to train a machine learning model, the insidious outcome is a discriminatory machine learning model that predicts the societal biases we aim to eliminate,” she writes.

Reimagining machine learning might incorporate considerations from the book “Human Compatible” by Stuart J. Russell, which suggests that the standard model of AI is problematic due to the lack of intervention. The focus is on optimizing an initial set of metrics without any human-in-the-loop supervision. Russell proposes that rather than using AI systems to optimize for a fixed goal, developers should create goals with the flexibility to adapt, programming in a level of uncertainty. One technique associated with this approach is inverse reinforcement learning.

“Inverse reinforcement learning is fed a set of behaviors and it tries to find the optimal reward function,” Pabalan writes. This process may help unveil the ways in which humans are biased. She poses a question: “When inserting algorithms into processes that are already complicated and challenging, are we spending enough time examining the context?” She cited the example of COMPAS, the Correctional Offender Management Profiling for Alternative Sanctions, which has been demonstrated to produce racially discriminatory results. COMPAS is still used by judges for sentencing in several US states.
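One common way to check a risk-scoring tool for this kind of disparity is to compare error rates across groups. The sketch below computes the false positive rate per group: the share of people who did not reoffend but were still flagged as high risk. It is a hedged illustration only. The records are synthetic and are not COMPAS data, and the field names `group`, `flagged`, and `reoffended` and the helper `false_positive_rate` are hypothetical.

```python
def false_positive_rate(records, group):
    """Share of non-reoffenders in `group` who were flagged as high risk.
    Returns None when the group has no non-reoffenders to score."""
    negatives = [r for r in records
                 if r["group"] == group and not r["reoffended"]]
    if not negatives:
        return None
    false_positives = sum(1 for r in negatives if r["flagged"])
    return false_positives / len(negatives)

def record(group, flagged, reoffended):
    return {"group": group, "flagged": flagged, "reoffended": reoffended}

# Synthetic toy records (not real data), chosen only to show the calculation.
records = [
    record("A", True, False), record("A", True, False),
    record("A", False, False), record("A", False, False),
    record("A", True, True),
    record("B", True, False), record("B", False, False),
    record("B", False, False), record("B", False, False),
    record("B", True, True),
]

fpr = {g: false_positive_rate(records, g) for g in ("A", "B")}
for g, rate in fpr.items():
    print(f"group {g}: false positive rate = {rate:.2f}")
```

In this toy set, group A's non-reoffenders are flagged twice as often as group B's. A gap like that on real data would be a signal to examine the context in which the model is used, which is the question Pabalan raises.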

“These models have hurt many people on a large scale while providing a false sense of security and neutrality,” she stated, noting that she will be committing her own efforts to develop fair algorithms.  

Read the source articles in SearchEnterpriseAI and Towards Data Science.