Seth Stephens-Davidowitz on data mining
John Thornhill talks to Seth Stephens-Davidowitz, a former Google data scientist, about what our internet searches reveal about who we really are.
Presented by John Thornhill and produced by Fiona Symon
Transcript
[MUSIC PLAYING]
Hello, and welcome back to "Tech Tonic," a podcast that looks at the way technology is changing our lives. I'm John Thornhill, innovation editor at the Financial Times in London. In our last episode, we heard from Mary Lou Jepsen, a scientist with ambitious plans for the future of diagnostic imaging. Today, we hear from an expert in data mining, who tells us what search engines reveal about who we really are.
I think, with more data, if anything, we'll learn how complicated things are a lot quicker. A lot of times, we think we've found some fundamental law of human nature, and then, in a different society, it doesn't hold. And we'll be able to see that much quicker, because all this data is in all different kinds of countries, so we'll be able to test not just does it hold in this one place, but does it hold in every place. And frequently, the answer will be no. And we'll learn that we maybe understand people and society even less than we thought we did. But that's still important.
That's the voice of Seth Stephens-Davidowitz, a former data scientist at Google, and the author of a book called "Everybody Lies" about what the internet reveals about who we really are. He came in to the FT to talk to me during a recent visit to London.
[MUSIC PLAYING]
Seth, let's start with the book's title, "Everybody Lies." Your contention is that everyone tends to tell the truth when they're in search engines. How do you know that?
Yeah. So the basic idea of the book is, traditionally, if we want to know what people are thinking, what people want, what they're going to do, why they do the things they do, we ask them. We conduct a survey, or maybe a focus group. But people tend to be dishonest in these surveys. Frequently, the numbers don't line up with the truth that we know.
And the argument in this book is that there are certain sources, not all sources, where people open up a lot more, and are comfortable telling things that they might not tell anybody else. And by mining this data, we can learn for maybe the first time who we really are as a species.
Can you give us some examples of that? What are the things that really surprised you when you started digging into the data?
The one that I think actually started my research and starts the book was a study of racism. So I'm from the United States. And after Barack Obama was first elected, everyone got really excited that we'd moved beyond racism, that racism was only a small factor now, and that we were maybe a post-racial society. And if you asked people in surveys did you care that Obama was black, everybody says of course not. Like, me? You know, that's crazy, I don't judge people based on their skin colour, I have no animosity towards African-Americans. And I was just shocked by how frequently people were searching for the n-word, the actual n-word, on Google, particularly looking for jokes mocking African-Americans.
And it had nothing to do with rap, as you [INAUDIBLE].
And it had nothing to do with rap. The version ending in -er is usually for jokes making fun of African-Americans. The version ending in -a is for rap lyrics. So I distinguished those two. And you know, and in parts of the country where I never would have guessed, where many people would have thought racism had largely disappeared, and that this racism was predicting a huge number of outcomes against African-Americans, such as opposition to Obama in his elections and support for Trump in his primary. So it predicts lower wages for black people. So it seems very, very clear that there's a huge amount of remaining racism against black people in the United States that predicts important life outcomes.
And I think you explained that Nate Silver then went on to show that there was an incredibly strong correlation between those areas that voted for Trump and the areas where you found those research findings.
Yeah, pretty much the strongest correlation of anything he could find in the data, stronger than economics or trade exposure or cultural factors or anything else. Demographics, education, age, anything. Racism was the-- as revealed by Google, was the strongest predictor.
That, of course, gets to the eternal debate about data, the relationship between correlation and causation. What do you think, in this instance, can you safely say about the findings from that data?
I think you have to definitely be careful and look if there are other outcomes. I think some people throw around the phrase correlation doesn't imply causation anytime they don't like a result. And just means like, oh, we can't know anything, which I think is not true. I think there are tools that data scientists have controlling for other variables. You can say, OK, is there something else about the region that both causes them to have high racism and support for Trump. And then once you start testing all these hypotheses, and none of them explain it away, I think you can be pretty comfortable that this was a big factor, a causal factor.
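To make "controlling for other variables" concrete, here is a minimal sketch on synthetic data. Nothing in it comes from the studies discussed: the variables `x` (a search-based proxy), `y` (an outcome) and `z` (some other regional factor), the effect sizes and the helper names are all assumptions chosen for illustration. It compares a naive regression slope with one that holds `z` fixed by regressing the residuals on each other.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

def residuals(y, x):
    """What is left of y after removing its linear dependence on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def controlled_slope(x, y, z):
    """Slope of y on x holding z fixed: regress residuals on residuals."""
    return slope(residuals(x, z), residuals(y, z))

# Synthetic regions (hypothetical numbers, fixed seed for repeatability).
rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]   # some other regional factor
x = [zi + rng.gauss(0, 1) for zi in z]    # proxy, partly driven by z
# The outcome depends on both; the assumed direct effect of x is 0.5.
y = [2.0 * zi + 0.5 * xi + rng.gauss(0, 1) for xi, zi in zip(x, z)]

naive = slope(x, y)                   # inflated, because z drives both
adjusted = controlled_slope(x, y, z)  # close to the assumed 0.5

print(f"naive slope:    {naive:.2f}")
print(f"adjusted slope: {adjusted:.2f}")
```

In this toy setup the other factor exaggerates the naive estimate but does not explain the relationship away, which is the kind of check described above. It only covers factors that are actually measured and included.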
OK. And you were saying earlier about the different parts of the internet. You talked about the kind of digital truth serum in Google searches, but clearly that's not true in Facebook posts or Twitter comments, in that people are projecting a public persona in one and talking to themselves in the other. Could you explain the difference between these?
Yeah. I think people kind of throw together big data these days. Big data, big data, you know, Facebook data is big data, Google's big data. But they're such different types of data. And Google, I think, is a source where people tend to be really, really honest about things that they might not tell anyone else because they're by themselves, they're not broadcasting their feelings to anybody else, whereas on Facebook or Twitter, you're trying to impress your friends, frequently, so you might exaggerate how intellectual you are, how happy you are, how good your relationship is, how wealthy you are.
And you can actually see that, frequently, Facebook data doesn't match up at all with real world data, that far more people say they own Mercedes and BMWs than actually do and far more people say they read intellectual sources than actually do. And I actually like to compare what people say on social media versus on search. So one of my examples is, if you compare how people describe their husbands on social media and search, the top ways that people complete the phrase my husband is on social media when they're trying to tell who they are to their friends is my husband is the best, my best friend, amazing, awesome, so cute. And on Google, when they're alone, they're by themselves, they're talking about their husband, it's my husband is so mean, a jerk, annoying, obnoxious. So a very different picture of marriage on the two data sources.
And are there other examples you could give of this kind of split between the public and the private persona?
Yeah. I think just in general with health problems is another one. Even when health is talked about on social media, it tends to be certain health conditions that maybe are less embarrassing, whereas on Google you see a lot of health conditions on parts of the body that people don't openly talk about. So that's kind of another area where you see the difference. It kind of tells us what's socially acceptable and what's less acceptable, or what's considered embarrassing.
When you were working at Google, what were you doing precisely?
I was working on both the economics teams and the quantitative marketing team. So a lot of data studies of how to use search data to better make decisions.
And as you say in the book, clearly Google, Bing searches are not the only data source that you're looking at. Can you tell us some of the other data sources you've been scouring, and what they show into different aspects of humanity, as well?
I call Google the most important data set ever collected on the human psyche. And I think that's the most important source, just because it's so comprehensive. Pretty much on any topic, there's something to learn in Google searches, whereas the other data sources tend to be one-off.
So I studied Wikipedia data. You can learn patterns of big success, where the successful people tend to be born. I studied Stormfront data-- that's one of the largest hate sites on the internet-- to kind of teach us what makes people join a hate site. I studied Pornhub data, which teaches us about people's sexual desires, and is another, I think, area where we've had a lot of misinformation because people are very uncomfortable sharing their true sexual feelings.
So you think Sigmund Freud could learn a lot from looking at the kind of data that you were looking on Pornhub, for example, about what people really believe, as opposed to what they lie about?
Oh, definitely. I think so many of these famous scholars would have a field day with this data. They were, in some sense, born before their time. They were kind of theorising based on their intuition, or Freud would talk about his own experiences. So he, you know, was attracted to his mother, and then decided everybody must be attracted to his mother, and then, you know, or a few patients he had, and said, oh, this is some broader pattern. Or Kinsey, I guess, had survey data, but the only people who he thought he could trust were a group of prisoners and prostitutes, so that was not really telling us, I think, too much about the general population. But now, we really have, I think, very reasonably accurate data on a large sample of the population.
One of the things that struck me when I was reading your book was that people would read it and realise the value of the searches that they were putting in, and therefore they would try to gamify the system. If they actually knew that a lot of this evidence was being collected, even if it's on an aggregated basis, rather than an individual basis, that somehow that would taint the source. Are you worried about that at all? Do you think now that you've actually exposed some of the findings that there is a kind of gamification process that could happen?
We'll see. I don't think so. I think that maybe that will happen for, like, a couple minutes after they read my book. And then they'll just go back to the normal way. I think even just Google autocomplete has always been there, so people kind of have to know that, all right, Google has some aggregate data on what people are searching. That's how Google autocomplete works. So I think it's not a mystery that this data is kept in an anonymous and aggregate form, with or without my book.
Now, in the conclusion, you make a very big claim, which I'd like to challenge you on, which is that for years we've known that the social sciences are not real sciences. Economists say that they have physics envy because they envy the mathematical precision of physics. But your claim is that this data revolution is turning those social sciences into real sciences. Can you say more about that?
Well, I think there have been a lot of flaws in the social sciences. So one big flaw is that the data really hasn't been so reliable. Surveys, people lie. And I think it's been hard to use that to get real insights into people. Another weakness is that people are really, really complicated, and we've been relying on these kind of one-off laboratory experiments where slight differences in conditions can give you dramatically different results. So one study will get very well-publicized, and then 10 different studies will fail to replicate it. That happens all the time in all kinds of fields.
And small samples mean that people can cherry pick for statistically significant results, and not get real results, whereas in physics, when it's empirical, there's huge samples. And I think all of that will change with the new data that's now just being left as people go through their lives to I think what I think social science should be, which is following the results of kind of natural experiments in the world, and seeing what people actually do in response to these natural experiments.
So take one area which is obviously very interesting to a lot of FT listeners, economics. How do you think the field of economics is going to be changed by this data revolution? What empirical evidence are we going to have to support some of the hypotheses, or disprove some of the hypotheses by which the economists work?
I think economics, I think the data will get better. A lot of people criticise Google search data and say that you can kind of think of a lot of weaknesses. OK, not everybody uses Google, and you don't know why someone makes a particular search. Even the worst search in the world, someone might have been just curious or doing research. There are lots of different reasons there are obvious flaws in the Google search data.
But I think people forget how flawed the existing data sources are. Once a data source is around for a while and has a fancy name, like consumer sentiment or even GDP or unemployment, people get very excited and say, oh, that's a legitimate measurement, it means something. And a lot of these data sources, when you pull under the hood, have a lot of flaws. So I think it will give a richer view of the economy.
There are lots of things that we don't measure that now we will be able to measure. I talk about leisure, on-the-job leisure or off-the-job leisure. We usually just think, OK, how many people are working, how many people have a job. And there's a definition for that. But there are clearly different types of work. Some people are very engaged in their work and are working all the time and have things to do, some people don't have anything to do and are playing solitaire on their computer all day. And that type of thing can now be measured. So we're going to get a much richer view of the economy.
I do think that one big difference between the social sciences and physics is, in physics, there seem to be simple laws that explain the entire universe that hold in all times and all places. I think that's not true in the social sciences. I think many social scientists have made mistakes by trying to find these universal laws. And I think, with more data, if anything, we'll learn how complicated things are a lot quicker.
A lot of times, we think we've found some fundamental law of human nature. And then in a different society, it doesn't hold. And we'll be able to see that much quicker, because all this data is in all different kinds of countries, so we'll be able to test not just does it hold in this one place, but does it hold in every place. And frequently, the answer will be no. And we'll learn that we maybe understand people and society even less than we thought we did. But that's still important.
But you think it might be possible, say, to have Google-style A/B testing of some economic experiments. I don't know, universal basic income, you could compare one region of America with another.
Now Facebook does more experiments in a single day than the Federal Drug Administration does in an entire year because it's really, really easy in the digital world to set up an experiment. It's a line of code, or you can even have a few lines of code that automate experiments all the time. And I think those will be done in the social sciences, as well, to constantly check what works. Yeah.
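As a rough illustration of how little code a digital experiment can need, the sketch below assigns users to one of two variants and compares outcomes. It is not how Facebook or any other company runs experiments: the function names, the experiment label and the counts are hypothetical, and the comparison is a standard two-proportion z-test.

```python
import hashlib
from math import sqrt

def assign(user_id: str, experiment: str) -> str:
    """Deterministically place a user in variant A or B of an experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(digest, 16) % 2 else "A"

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for the difference in success rates between B and A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: each user always sees the same variant.
groups = [assign(f"user-{i}", "new-layout") for i in range(10_000)]
share_b = groups.count("B") / len(groups)

# Hypothetical results: 100 of 1,000 succeed under A, 130 of 1,000 under B.
z = two_proportion_z(100, 1000, 130, 1000)
print(f"share in B: {share_b:.3f}, z = {z:.2f}")
```

Hashing the user and experiment name together keeps the assignment stable across visits without storing it, and a value of `z` above roughly 1.96 is the conventional threshold for a difference at the 5 per cent level.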
One of the areas that you focus on, as well, is the whole inequality debate. And a lot of the findings that you have present quite a different pattern from the traditional picture we have of this. So could you say more about what are the patterns of inequality that you found, and in particular the kind of success rates of different populations in America? And then you also make the claim that, on the basis of that data, we can use that to improve people's lives. How are we going to do that?
Well, traditionally another advantage of big data is that you can zoom in to small populations. So if you have a survey of 1,000 or 2,000 people, you're not going to have information on every little town or city. You might have one, two, zero people from a particular town or city, so you can't say anything about that town or city. But now, with the big data sets that are automatically collected, we have big samples for every town.
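The arithmetic behind "one, two, zero people from a particular town" can be sketched as follows. The population figures are round, assumed numbers rather than anything from the interview, and the calculation treats the survey as a simple random sample.

```python
def expected_respondents(sample_size: int, town_pop: int, total_pop: int) -> float:
    """Average number of survey respondents expected from one town."""
    return sample_size * town_pop / total_pop

def prob_no_respondents(sample_size: int, town_pop: int, total_pop: int) -> float:
    """Chance that a simple random sample contains nobody from the town."""
    return (1 - town_pop / total_pop) ** sample_size

# Assumed figures: a 2,000-person survey, a town of 50,000,
# and a national population of 300 million.
survey, town, nation = 2_000, 50_000, 300_000_000

expected = expected_respondents(survey, town, nation)
p_zero = prob_no_respondents(survey, town, nation)
print(f"expected respondents from the town: {expected:.2f}")
print(f"chance of no respondents at all:    {p_zero:.2f}")
```

Under these assumptions the survey expects about a third of a person from the town and misses it entirely roughly seven times out of ten, so it cannot say anything about that town.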
So one of the studies I did was I used Wikipedia as a measure of big success. It was what kind of conditions create enormous successes. And you see there's huge variation in the success rates of different places within the United States. I didn't do a study for UK, but the places that have the highest success rates have 20 times the success rate than other places.
And I think the three factors that seem to be most predictive of producing a lot of really, really enormously successful people are cities. Cities do better per capita than rural areas. College towns, they're really successful, I think because their kids are exposed to a lot of innovative ideas and innovation. And immigrants. First of all, the kids of immigrants themselves do well, and then also people around immigrants seem to do well.
So I think the way to maybe get an environment that is conducive to huge success is to attract a lot of immigrants to kind of get that hustle and drive in the area. Definitely promote cities. Don't necessarily have policies that give huge advantages to suburban living. And then, universities. Have a lot of big ideas that kids are exposed to from an early age.
Right. So, so much of the debate about inequality is about class, isn't it? But you're saying in a way that geography is a really important factor, and maybe if we're trying to address inequality, targeted interventions in particular towns or regions are going to be more effective, in a way, than kind of universal blanket assistance to people.
That's right. And I think that's, again, another way where big data changes the conversation. When you have a sample of 2,000 people, you're kind of like-- you just can put people into poor or rich. But when you have a sample of everybody, you can say there are very, very different types of poor people, and which ones make it and which ones don't make it, and what can we do to make sure that more people make it.
Right. Now, it's become a bit of a cliche to say that data is the new oil. I've always thought it's more significant than that. But it's clear that the big tech companies are massive owners of a lot of the data sources. And as you're suggesting, they can use that data in very interesting and profitable ways. Do you think that there is a problem with the fact that data is being harvested and [INAUDIBLE] up by a few companies, or do you think quite the reverse, that there is almost a democratisation process that's happening as a result of this data revolution, and that anyone has access to any of the world's knowledge if they have an internet connection?
It's unclear at this point. I think big data is not good or bad, it's powerful. And it remains to be seen how this plays out. It would be great if the tech companies make their data available for researchers and we can learn a lot more about society. That would be a wonderful outcome. Companies could decide that there aren't PR advantages, and they maybe could sell their data, so there are better ways for them to use this data. And that would be a negative outcome.
So we'll kind of have to see. I do think there are definitely ethical and legal ramifications to the power of this data. Researchers are learning more and more of what predicts various outcomes. So what's going to make someone a good employee, for example. And the researchers found all these strange correlations, that if you like curly fries on Facebook, you're more likely to be intelligent.
So that's kind of a weird correlation, but everything kind of correlates with everything. People who like curly fries are going to be a different population than people who don't like curly fries. So for the purposes of business, it may be wise for them to hire someone who likes curly fries on Facebook because they'll probably be, on average, smarter.
And I think the ethical and legal system we have in place is based on a world where corporations might know five or six points about a person, but now corporations are going to know millions of things about a person. Their Facebook likes, all their social media behaviour, their purchase history, perhaps, all kinds of information from their previous computer use, maybe for previous employers. So we're not really prepared, I think, for this world.
Doesn't that worry you as a citizen that some companies are going to have this phenomenal knowledge about you as an individual?
It does a little bit. So one thing-- another concern I have is differential pricing. Ideally, corporations would like to charge each customer exactly how much that customer is willing to pay. But it's really, really hard for them to do that, and the main reason is they don't know how much a customer is willing to pay, and every customer wants the corporation to think they're willing to pay less.
And they find these clever ways to group people together-- senior citizens versus middle age versus students, or people who are buying plane tickets three weeks in advance versus the day before-- and kind of charge different prices based on which group you're in. But I think with all this data, they will be able to learn to a much higher degree how much everyone's willing to pay, and be able to fleece consumers more than they have in the past.
So on the debate on rent seeking, as it were, you think some of the data companies will be able to understand exactly that willingness of consumers to pay, and then theoretically could gouge them in ways that they wouldn't for others.
Well, you're already seeing it with Uber charging different prices if you're coming back from Goldman Sachs relative to if you're coming back from a different building. You know, you probably have a higher salary, maybe Goldman Sachs is paying for your Uber, so you get charged a different price. But you're going to see more and more of that. And you can predict from someone's mouth movements whether someone is drunk. And you could say, hey, this person is drunk, he or she is probably going to pay a lot more right now for this particular product. So I think gouging consumers is a legitimate fear.
Although it has to be said that, I mean, Uber is, at the moment, still massively loss-making. And in a way, it's a kind of wealth transfer from the VCs to all the riders in the Uber cars, isn't it?
True.
Final question. The world that you're portraying enables us to understand far more about what really motivates us and animates us. And so we should be able to have far more kind of rational decision-making about what we're doing. A lot of people would argue that, with the rise of populism, that we're heading in exactly the opposite direction. How do you explain this mismatch between the ability to have a more rational discourse and decision-making process and what appears, in some instances, to be quite an irrational human process?
I think in some sense, it's easier to learn what is good for us than to do the things that are good for us. So we don't need necessarily big data to know that it's good to exercise or know that it's good to eat healthy and not eat at McDonald's, but a lot of people struggle with that with or without data. I actually had an instance of this. I did a study on the effects of weather on depression. And you can see in Google searches so, so clearly that there's a huge drop in depression searches in nice climates during the winter relative to lousy climates in the winter.
And then right after I did the study, I moved from California to New York. And then I got depressed. So it's often easier to learn what we should do than to actually do it. But I do think that some of this data that we accumulate from the Trump situation and the other populist candidates will eventually be helpful in teaching us how to avoid it the next time as we learn more about voter behaviour.
Although, you know, to be honest, from Trump's perspective, this is a good thing, right, that he's elected. So he could hire-- somebody like Trump could hire the best data scientists and learn what makes people tick, and get power out of it. So it's not clear always that information leads to good outcomes.
Great. Thank you very much, Seth, for a fascinating discussion.
Thanks, John.
We'll be back next week with another episode of "Tech Tonic." In the meantime, if you'd like to comment on today's show or suggest any topics you would like us to cover in future episodes, please email us at techtonic@ft.com. This episode of "Tech Tonic" was produced by Fiona Symon.
[MUSIC PLAYING]