Computers are only as good, or as bad, as the people who program them. And it turns out that many of the people who create machine learning algorithms are, presumably unintentionally, building race and gender bias into them. In part one of a two-part interview, Richard Sharp, CTO of predictive marketing company Yieldify, explains how it happens.
The Enterprisers Project (TEP): Machines are genderless, have no race, and are in and of themselves free of bias. How does bias creep in?
Sharp: To understand how bias creeps in, you first need to understand the difference between programming in the traditional sense and machine learning. With programming in the traditional sense, a programmer analyses a problem and comes up with an algorithm to solve it (basically an explicit sequence of rules and steps). The algorithm is then coded up, and the computer executes the programmer's defined rules accordingly.
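To make the contrast concrete, here is a minimal, hypothetical sketch of the traditional approach in Python. The scenario (whether to show a discount banner) and every threshold in it are invented for illustration, but each rule is something a human wrote down and can inspect.

```python
# Traditional programming: the programmer writes every rule explicitly.
# Hypothetical example: deciding whether to show a discount banner to a visitor.

def should_show_banner(visitor):
    # Each condition is a rule a human decided on and can read or change.
    if visitor["pages_viewed"] >= 3 and visitor["cart_value"] > 50:
        return True
    if visitor["is_returning"] and visitor["cart_value"] > 100:
        return True
    return False

print(should_show_banner({"pages_viewed": 4, "cart_value": 80, "is_returning": False}))  # True
```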
With machine learning, it's a bit different. Programmers don't solve a problem directly by analysing it and coming up with rules themselves. Instead, they just give the computer access to an extensive real-world dataset related to the problem they want to solve. The computer then figures out how best to solve the problem by itself.
For example, say you want to train a computer to recognize faces in photographs. To do this using machine learning, you might first build a large dataset of photographs and manually mark up the locations of faces within them. You would then feed this dataset into a machine learning system. Using examples in the dataset you've given it, the machine learning system would figure out for itself how best to recognize faces in images.
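As a rough illustration of that workflow, here is a minimal sketch using scikit-learn on synthetic stand-in data; a real face-detection pipeline would be far more involved, but the point is only that the programmer supplies labelled examples rather than rules.

```python
# Machine learning: the programmer supplies labelled examples, not rules.
# Synthetic stand-in data; a real system would use actual image patches.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_faces = rng.normal(1.0, 1.0, size=(200, 64))       # stand-in for "face" patches
X_background = rng.normal(0.0, 1.0, size=(200, 64))  # stand-in for "not a face" patches

X = np.vstack([X_faces, X_background])
y = np.array([1] * 200 + [0] * 200)  # 1 = face, 0 = not a face

model = LogisticRegression(max_iter=1000).fit(X, y)

# The "rules" now live in learned weights, not in code anyone wrote by hand.
print(model.predict(X[:5]))
```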
So machine learning is essentially programming with data, or programming by example. In our face recognition example, at no point did the programmer sit down and write a set of rules about how to recognize faces. In fact, the programmers have no idea how the computer is doing the face recognition in practice. They can't say anything other than that it is doing it based on the patterns in the data they gave it.
And therein lies the problem. Biases creep in because they are embedded in the real-world data. Our world is full of discriminatory biases such as gender pay gaps, racial wealth gaps, and so on. So if you feed a machine learning system real-world data and ask it to solve a particular problem — like, say, maximizing revenue by showing specific adverts to specific people — it may well pick up on these biases and exploit them. The programmer has no idea that this is happening, since the computer figured it out for itself. But when you step back and take a look, you find you have a machine learning system that is discriminatory, for instance one that decides not to show adverts for high-paying jobs to women.
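To illustrate the mechanism, here is a deliberately simplified sketch on synthetic data: the historical click data encodes an existing bias, and a model trained to predict clicks picks it up. The single feature and the click rates are assumptions made purely for the example.

```python
# Sketch of how a bias in training data becomes a bias in the model.
# Entirely synthetic data, assumed for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 10_000
is_female = rng.integers(0, 2, size=n)

# Assume the historical data reflects an existing social bias:
# women clicked the high-paying job advert less often than men did.
clicked = rng.random(n) < np.where(is_female == 1, 0.02, 0.05)

# Train a click-through model on that data.
model = LogisticRegression().fit(is_female.reshape(-1, 1), clicked)

# The model now predicts a lower click probability for women, so an ad
# server optimising purely for clicks will show them the advert less often.
print(model.predict_proba([[0], [1]])[:, 1])  # predicted CTR for [male, female]
```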
TEP: Can you give some real-world examples of bias in machine learning?
Sharp: Yes, there are plenty. Some relate specifically to the advertising placement scenario I just described. For example, a study from CMU showed that Google was showing significantly more adverts for jobs paying over $200,000 to men than it was to women. Another study from the FTC revealed that searches for black-identifying names yielded a higher incidence of ads associated with arrest records than searches for white-identifying names.
These issues were almost certainly down to machine learning systems exploiting biases in real-world data. At no point did a sexist Google programmer deliberately write a program encoding the rule: "If this is a job advert and the salary is more than $200K, then target women less frequently." Instead, the dataset the model was trained on probably contained fewer women clicking on high-paying job adverts (due to existing social biases), and the model learned that this was a pattern it could exploit to get closer to its goal.
There are plenty of other examples, too. Amazon recently rolled out same-day delivery to new city areas while conspicuously excluding predominantly black zip codes. And a machine learning system actively in use in US courtrooms significantly overestimated the likelihood of black defendants reoffending.
TEP: What are some of the consequences we face down the road if machines are inherently biased?
Sharp: We are training these systems based on how the world currently is, rather than how we want the world to be. If you ask a machine learning system to optimize advert placement to increase click-through rate, it will do exactly that. If you give the learning system a dataset in which women click on high-paying ads less frequently, it will learn to exploit that fact to get closer to its goal. The consequence is that machine learning systems will perpetuate, or even exacerbate, existing biases.
[In part two of this interview, Sharp will explain what we can do to avoid bias in machine learning.]
Comments
Dr Richard Sharp, Cambridge PhD in Computer Science. Great, I'm very impressed.
The real world is the bad guy: it's prejudiced, impure. The programmer is, like the real world, impure, but an innocent. The computer is the pure, good guy.
OK, this sounds nice, but let's first try to think in a less simplistic way, precisely because learning is about understanding complex problems, i.e. not just solving simplistic problems poorly.
Well, there's a short, easy way to solve machine learning bias: just hand the CEO seat of your machine learning company to a good PhD in the social sciences.
To do so, you just need to remember that programming uses predefined sets of categories and words, and to think of this as one of the most reliable definitions of a prejudiced (biased) mind. Let social scientists do the work of producing thoughts with effective meaning. Effective meaning systematically requires cross-category thinking; this has long been established by the social sciences. You need real scientists, not those who think of knowledge as a dead collection of categories and signs, like the symbolic language sciences, which will always produce, at best, social astrology, because they represent society in a non-scientific way, i.e. like a finite-state machine. This "clock-style" representation of society is in itself very ideologically biased; it is in fact absolutely normative, or totalitarian.
Before speaking about machine learning, just ask yourself how you know what you know; cease being illiterate about your own knowledge. Open a good book on the history of science and/or sociology. Work with effective scientists, who start from the scientifically well-established fact that "clock" representations are never normal but mostly exceptional in society. Unless you want to remain an ideologist, which you may, and then you won't be working for your own knowledge (not to speak of ours) or for science, but just to keep your own social position, and keep it implicitly, which is the most anti-social-scientific way to be.
Daniel Merigoux. Guess: I'm a linguist & a social scientist.
Indeed, there are many machine learning projects that address these societal biases. The best solution would be to remove the biases completely. But given that they still exist, every good data scientist should take them into account when preparing data before training. Data preparation is easily as important as the choice of algorithm or machine learning model; it is still a very human process, and it is largely at the discretion of the data scientist what to feed into the black box that is a machine learning model or combination of models.
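For illustration only, here is a minimal sketch of the kind of data-preparation step described in this comment: re-weighting the training examples so that the historical imbalance does not set the pattern the model learns. The synthetic data and the particular re-weighting scheme are assumptions made for the sake of the example, not a recommendation for any specific system.

```python
# Sketch of one data-preparation step: re-weight training examples so that
# each (group, outcome) combination carries equal total weight.
# Synthetic data, for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 10_000
group = rng.integers(0, 2, size=n)                          # sensitive attribute
clicked = rng.random(n) < np.where(group == 1, 0.02, 0.05)  # biased historical clicks

# "Balanced" weights over the joint (group, outcome) combinations.
combo = group * 2 + clicked.astype(int)        # 4 possible combinations
counts = np.bincount(combo, minlength=4)
weights = (len(combo) / (4 * counts))[combo]

model = LogisticRegression()
model.fit(group.reshape(-1, 1), clicked, sample_weight=weights)

# After re-weighting, the predicted click probabilities for the two groups
# are roughly equal instead of mirroring the historical bias.
print(model.predict_proba([[0], [1]])[:, 1])
```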