Evelyn Lamb: Welcome to My Favorite Theorem, a podcast about theorems where we ask mathematicians to tell us about theorems. I'm one of your hosts, Evelyn Lamb. I'm a freelance math and science writer in Salt Lake City, Utah. This is your other host.
Kevin Knudson: Hi, I’m Kevin Knudson. I’m a professor of mathematics at the University of Florida. It seems like I just saw you yesterday.
EL: Yeah, but I looked a little different yesterday.
KK: You did!
EL: In between when I talked with you and this morning, I dyed my hair a new color, so I'm trying out bright Crayola crayon yellow right now.
KK: It looks good. Looks good.
EL: Yeah, it's been fun in this, I don't know, like 18 hours I've had it so far.
KK: Well, you know what, it's what it's sort of dreary winter, right, you feel the need to do something to snap you out of it, although it's sunny in Salt Lake, right, it’s just cold? No big deal.
EL: Yeah, we’ve had some sun. We’ve had a lot of snow recently, as our guest knows, because our guest today also lives in Salt Lake City. So I'm very happy to introduce Suresh Venkatasubramanian. So Hi. Can you tell us a little bit about yourself?
Suresh Venkatasubramanian: Hey, thanks for having me. First of all, I should say if it weren't for your podcast, my dishes would never get done. I put the podcast on, I start doing the dishes and life is good.
EL: Glad to be of service.
SV: So I'm in the computer science department here. I have many names. It depends on who's asking and when. Sometimes I'm a computational geometer, I'm a data miner. Sometimes occasionally a machine learner, though people raise their eyebrows at that. But the name I like the most right now is bias detective, or computational philosopher. These are the two names I like the most right now.
EL: Yeah, yeah. Because you've been doing a lot of work on algorithmic bias. Do you want to talk about that a little bit?
SV: Sure. So one of the things that we're dealing with now as machine learning and associated tools go out into the world is that they're being used for not just, you know, predicting what podcast you should listen to, or what music you should listen to, but they're also being used to decide whether you get a job somewhere, whether you get admission to college, whether you get surveilled by the police, what kind of sentences you might get, whether you get a loan. All these things are where machine learning is now being used just because we have lots of data to collect and seemingly make better decisions. But along with that comes a lot of challenges, because what we're finding is that a lot of the human bias in the way make decisions is being transferred to machine bias. And that causes all kinds of problems, both because of the speed at which these decisions get made and the relative obscurity of automated decision making. So trying to piece together, piece out what's going on, how this changes the way we think about the world, how we—the way we think about knowledge about society, has been taking up most of my time of late.
EL: Yeah, and you've got some interesting papers that you've worked on, right, on how people who design algorithms can help combat some of these biases that can creep in.
SV: Yeah, so there are many, many levels of questions, right? One basic question is how do you even detect whether there’s—so first of all, I mean, I think as a mathematical question, how do you even define what it means for something to be biased? What does that word even mean? These are all loaded terms. And, you know, once you come up with a bunch of different definitions for maybe what is a relatively similar concept, how do they interact? How do you build systems that can sort of avoid these kinds of bias the way you've defined it? And what are the consequences of building systems? What kind of feedback loops do you have in systems that you use? There’s a whole host of questions from the mathematical to the social to the philosophical. So it's very exciting. But it also means everyday, I feel even more dumb than I started the day with so.
KK: So I think the real challenge here is that, you know, people who aren't particularly mathematically inclined just assume that because a computer spit out an answer, it must be valid, it must be correct. And that, in some sense, you know, it's cold, that the machine made this decision, and therefore, it must be right. How do you think we can overcome that idea that, you know, actually, bias can be built into algorithms?
SV: So this is the “math is not racist” argument, basically.
SV: That comes up time and time again. And yeah, one thing that I think is an encouraging is that we've moved relatively quickly, in a span of, say, three to four years, from “math is not racist” to “Well, duh, of course, algorithms are biased.”
SV: So I guess that's a good thing. But I think the problem is that there’s a lot of commercial and other incentives bound up with the idea that automated systems are more objective. And like most things, there's a kernel of truth to it in the sense that you can avoid certain kinds of obvious biases by automating decision making. But then the problem is you can introduce others, and you can also amplify them. So it's tricky, I think. You're right, it's getting away from the notion where commercially, there's more incentive to argue that way. But also saying, “Look, it's not all bad, you just need more nuance.” You know, arguments for more nuance tend not to go as well as, you know, “Here's exactly how things work,” so it's hard.
KK: Everything’s black or white, we know this, right?
SV: With a 50% probability everything is either true or not, right.
EL: So we invited you on to hear about your favorite theorem. And what is that?
SV: So it's technically an inequality. But I know that in the past, you've allowed this sort of deviation from the rules, so I'm going to propose it anyway.
EL: Yes, we are very flexible.
SV: Okay, good good good. So the inequality is called Fano’s inequality after Robert Fano, and it comes from information theory. And it’s one of those things where, you know, the more I talk about it, the more excited I get about it. I'm not even close to being an expert on the ins and outs of this inequality. But I just love it so much. So I need to tell you about that. So like all good stories, right, this starts with pigeons.
EL: Of course.
SV: Everything starts with pigeons. Yes. So, you may have heard of the pigeonhole principle.
SV: Ok. So the pigeonhole principle, for those in the audience who may not, basically, if you have ten pigeons and you have nine pigeonholes, someone's going to get a roommate, right? It’s one of the most obvious statements one can make, but also one of the more powerful ones, because if you unpack a little bit, it's not telling you where to find that pigeonhole with two pigeons, it's merely saying that you are guaranteed that this thing must exist, right, it's a sort of an existence proof that can be stated very simply.
But the pigeonhole principle, which is used in many, many parts of computer science to prove that, you know, some things are impossible or not can be restated, can be used to prove another thing. So we all know that if, okay, I need to store a set of numbers, and if the numbers range from 1 to n, that I need something like log(n) bits to store it. Well, why do I need log(n) bits? One way to think about this is this is the pigeonhole principle in action because you're saying, I have n things. If I have log(n) bits to describe a hole, like an address, then there are 2log(n) different holes, which is n. And if I don't have log(n) bits, then I don't have n holes, and therefore by the pigeonhole principle, two of my things will get the same name. And that's not a good thing. I want to be able to name everything differently.
So immediately, you get the simple statement that if you want to store n things you need log(n) bits, and of course, you know, n could be whatever you want it to be, which means that now—in theoretical computer science, you want to prove lower bounds, you want to say that something is not possible, or some algorithm must take a certain amount of time, or you must need to store so many bits of information to do something. These are typically very hard things to prove because you have to reason about any possible imaginary way of storing something or doing something, and that's very hard. But with things like the pigeonhole principle and the log(n) bit idea, you can do surprisingly many things by saying, “Look, I have to store this many things, I'm going to need at least log of that man bits no matter what you do.” And that's great.
KK: So that's the inequality.
EL: No, not yet.
KK: Okay, all right.
SV: I was stopping because I thought Evelyn was going to say something. I'm building up. It's like, you know, a suspense story here.
KK: Okay, good.
KK: Chapter two.
SV: So if you now unpack this log(n) bit thing, what it's really saying is that I can encode elements and numbers in such a way, using a certain number of bits, so that I can decode them perfectly because there aren't two pigeons living in the same pigeonhole. There's no ambiguity. Once I have a pigeonhole, who lives there is very clear, right? It's a perfect decoding. So now you've gone from just talking about storage to talking about an encoding process and a decoding process. And this is where Fano’s inequality comes into play.
So information theory—so going back to Shannon, right—is this whole idea of how you transmit information, how you transmit information efficiently, right, so the typical object of study in information theory is a channel. So you have some input, some sort of bits going into a channel, something happens to it in the channel, maybe some noise, maybe something else, and then something comes out. And one of the things you'd like to do is looking at, so x comes in, y comes out, and given y you'd like to decode and get back to x, right? And it turns out that, you know, you can talk about the ideas of mutual information entropy, they start coming up very naturally, where you start saying if x is stochastic, it's a random variable, and y is random, then the channel capacity, in some sense the amount of information the channel can send through, relates to what is called the mutual information between x and y, which is this quantity that captures, roughly speaking, if I know something about x, what can I say about y and vice versa. This is not quite accurate, but this is more or less what it says.
So information theory at a broader level—and this is where really Fano’s inequality connects to so many things—it’s really about how to quantify the information content that one thing has about another thing, right, through a channel that might do something mysterious to your variable as it goes through. So now what does Fano’s inequality say? So Fano’s inequality you can think of now in the context of bits and decoding and encoding, says something like this. If you have a channel, you take x, you push it into the channel and out comes y, and now you try to reconstruct what x was, right? And let's say there's some error in the process. The error in the process of reconstructing x from y relates to the term that is a function of the mutual information between x and y.
More precisely, if you look at how much, essentially, entropy you have left in x once I tell you y— So for example, let me see if this is a good example for this. So I give you someone's name, let's say it's an American Caucasian name. With a certain degree of probability, you'll be able to predict what their gender is. You won't always get it right, but there will be some probability of this. So you can think of this as saying, okay, there's a certain error in predicting that person's name from the—for predicting the person's gender from the name as you went through the channel. The reason why you have an error is because there's certain one of noise. Some names are sort of gender-ambiguous, and it's not obvious how to tell. And so there's a certain amount of entropy left in the system, even after I've told you the name of the person. There's still an amount of uncertainty. And so your error in predicting that person's gender from the name is related to the amount of entropy left in the system. And this is sort of intuitively reasonable. But what it's doing is connecting two things that you wouldn’t normally expect to be connected. It's connecting a computational metaphor, this process of decoding, right, and the error in decoding, along with a basic information theoretic statement about the relationship between two variables.
And because of that—Fano’s inequality, or even the basic log(n) bits needed to sort n things idea—for me, it's pretty much all of computer science. Because if we want to prove a lower bound on how we compete something, at the very least we can say, look, I need at least this much information to do this computation. I might need other stuff, but I need at least as much information. That clearly will be a lower bound. Can I prove a lower bound? And that has been a surprisingly successful endeavor, in reasoning about the lower bounds for computations, where it would be otherwise very hard to think about, okay, what does this computation do? or what does that computation do? Have I imagined every possible algorithm I can design? I don't care about any of that because Fano’s inequality says it doesn't matter. Just analyze the amount of information content you have in your system. That's going to give you a lower bond on the amount of error you’re going to see.
EL: Okay, so I was reading—you wrote a lovely posts about this, and I was reading that this morning, before we started talking. And I think this is what you just said, but it's one of these things that is very new to me, I'm not used to thinking like a computer scientists or information theorist or anything. So something that was a little bit, I was having trouble understanding, is whether this inequality, how much it depends on the particular relationship you have between the two, x and y that you're looking at.
SV: So one way to answer this question is to say that all you need to know is the conditional entropy of the x given y. That's it. You don't need to know anything else about how y was produced from x. All that you need to know, to put a bound on the decoding error, is the amount of entropy that's left in the system.
KK: Is that effectively computable? I mean, it is that easy to compute?
SV: For the cases where you apply Fano’s inequality, yes. Typically it is. In fact, you will often construct examples where you can show what the conditional entropy is, and therefore be able to reason, directly use Fano’s inequality, to argue for the probability of error. So let me give an example in computer science of how this works.
KK: Okay, great.
SV: Suppose I want to build a—so one of the things we have to do sometimes is build a data structure for a problem, which means you're trying to solve some problem, you want to store information in a convenient way so you can access the information quickly. So you want to build some lookup table or a dictionary so that when someone comes up with the question—“Hey, is this person's name in the dictionary?”—you can quickly give an answer to the question. Ok. So now I want to say, look, I have to process these queries. I want to know how much information I need to store my data structure so that I can answer queries very, very quickly. Okay, and so you have to figure out—so one thing you'd like to do is, okay, I built this awesome data structure, but is it the best possible? I don't know, let me see if I can prove a lower bound on how much information I need to store.
So the way you would use Fano’s inequality to prove that you need a certain amount of information would be to say something like this. You would say, I'm going to design a procedure. I'm going to prove to you that this procedure correctly reconstructs the answer for any query a user might give me. So there's a query, let’s say “Is there an element in the database?” And I have my algorithm, I will prove correctly returns this with some unknown number of bits stored. And given that it correctly returns us answer up to some error probability, I will use Fano’s and equality to say, because of that fact, it must be the case that there is a large amount of mutual information between the original data and the data structure you stored, which is the, essentially, the thing that you've converted the data from through your channel. And so this if that mutual information is large, the amount of bits you need to store this must also be large. And therefore, with small error, you must pay at least these many bits to build this data structure. And so this idea sort of goes through a lot of, in fact, more recent work in lower bounds in data structure design and also communication. So, you know, two parties want to communicate with each other, and they want to decide whether they have the same bit string. And you want to argue that you need these many bits of information for them to communicate, to show that the other same bit string. You use, essentially, either Fano’s inequality or even simpler versions of it to make this argument that you need at least a certain number of bits. This is used in statistics in a very different way. But that's a different story. But in computer science, this is the one of the ways in which the inequality is used.
EL: Okay, getting back to the example that you used, of names and genders. How—can you kind of, I don't know if this is going to work exactly. But can you kind of walk us through how this might help us understand that? Obviously, not having complete information about how, how this does correspond to names, but,
SV: Right, so let's say you have a channel, okay? What’s going into the channel, which is x, is the full information about the person, let's say, including their gender. And the channel, you know, absorbs a lot of this information, and just spits out a name, the name of the person. That's y and now the task is to predict. or reconstruct the person's gender from the name. Okay.
So now you have x, you have y. You want to reconstruct x. And so you have some procedure, some algorithm, some unknown algorithm that you might want to design, that will predict the gender from the name. Okay? And now this procedure will have a certain amount of error, let's call this error p, right, the probability of error, the probability of getting it wrong, basically is p. And so what Fano’s inequality says, roughly speaking—so I mean, I could read out the actual inequality, but on the air, it might not be easy to say—but roughly speaking, it says that this probability p times a few other terms that are not directly relevant to this discussion, is greater than or equal to the entropy of the random variable s, which is you can think of it as drawn from the population of people. So I drew uniformly from the population of people in the US, right, or of Caucasians in the US, because we are limiting ourselves to that thing. So, that's our random variable x. And going through this, I get the value y. So I compute the entropy of the distribution on x conditioned that the gender was, say, female. So I look at how much— and basically that probability of error is greater than or equal to, you know, with some extra constants attached, this entropy.
So in other words, so what it's saying is very intuitive, right? If I tell you, the person is is female— and we're sort of limiting ourselves to this binary choice of gender—but let's just say this person is female, you know—sorry, this person's name is so and so. Right? What is the range of gender? What does the gender distribution look like conditioned on having that name? So let's say the name, let's say is, we think of a name, that would be—let’s say, Dylan, so Dylan is the name. So there's going to be some, you know, probability of the person being male and some property being female. And in the case of a name like Dylan, those probabilities might not be too far apart, right? So you can actually compute this entropy now. You can say, okay, what is the entropy of x given y? You just write the formula for entropy for this distribution. It’s just p1logp(1)+p2log(p2) in this case, where p1+p2=1. And so the probability of error no matter what, how clever your procedure is, is going to be lower bounded by this quantity with some other constants attached to make it all make sense.
SV: Does that help?
SV: I made this completely up on the fly. So I'm not even sure it’s correct. I think it's correct.
KK: It sounds correct. So it's sort of, the noisier your channel, the probability is going to go up, the entropy is going to be higher, right?
SV: Right, yeah.
KK: Well, you're right, that's intuitively obvious, right? Yeah. Right.
SV: Right. And the surprising thing about this is that you don't have to worry about the actual reconstruction procedure, that the amount of information is a limiting factor. No matter what you do, you have to deal with that basic information thing. And you can see why this now connects to my work on algorithmic fairness and bias now, right? Because, for example, one of the things that is often suggested is to say, Oh, you know—like in California they just did a week ago, saying, “You are not allowed to use someone's gender to give them driver’s insurance.”
SV: Now, there are many reasons why there are problems with the policy as implemented versus the intent. I understand the intent, but the policy has issues. But one issue is this, well, just removing gender may not be sufficient, because there may be signal in other variables that might allow me, allow my system, to predict gender. I don't know what it's doing, but it could internally be predicting gender. And then it would be doing the thing you're trying to prohibit by saying, just remove the gender variable. So while the intention of the rule is good, it's not clear that it will, as implemented, succeed in achieving the goals it’s set out. But you can't reason about this unless you can reason about information versus computation. And that's why Fano’s inequality turns out to be so important. I sort of used it in spirit in some of my work, and some people in follow-up have used it explicitly in their work to sort of show limits on, you know, to what extent you can actually reverse-engineer, you know, protected variables of this kind, like gender, from other things.
EL: Oh, yeah, that would be really important to understand where you might have bias coming in.
SV: Right. Especially if you don't know what the system is doing. And that's what's so beautiful about Fano’s inequality. It does not care.
EL: Right. Oh, that's so cool. Thanks for telling us about that. So, is this something that you'd kind of learn in one of your first—maybe not first computer science courses, but very early on in in your education, or did you pick this up along the way?
SV: Oh, no, you don’t—I mean, it depends. It is very unlikely that you will hear about Fano’s inequality in any kind of computer science, theoretical computer science class, even in grad school.
SV: Usually you pick it up from papers, or if you take a course in information theory, it's a core concept there. So if you take any course in information theory, it'll come up very quickly.
SV: But in computer science, it comes up usually when you're reading a paper, and they use a magic lower bound trick. Like, where did they get that from? That's what happened to me. It's like, where do they get this from? And then you go down a rabbit hole, and you come back up three years later with this enlightened understanding of… I mean, Fano’s inequality generalizes in many ways. I mean, there's a beautiful, there's a more geometric interpretation of what Fano’s inequality really is when you go to more advanced versions of it. So there's lots of very beautiful—that’s the thing about a lot of these inequalities, they are stated very simply, but they have these connections to things that are very broad and very deep and can be expressed in different languages, not just in information theory, also in geometry, and that makes them really cool. So there's a nice geometric analog to Fano’s inequality as well that people use in differential privacy and other places.
KK: So what does one pair with Fano’s inequality?
SV: Ah! See, when I first started listening to My Favorite Theorem, I said, “Okay, you know, if one day they ever invite me on, I’m going to talk about Fano’s inequality, and I'm going to think about what to pair with it.” So I spent a lot of time thinking about this.
SV: So see you have all these fans now, that's a cool thing.
So my choice for the pairing is goat cheese with jalapeño jam spread on top of it on a cracker.
SV: And the reason I chose this pairing: because they are two things that you wouldn't normally think go together well, but they go together amazingly well on a cracker.
KK: Well of course they do.
SV: And that sort of embodies for me what Fano’s inequality is saying, that two things that you don't expect to go together go together really well.
KK: No, no, no, the tanginess of the cheese and the salty of the olives. Of course, that's good.
SV: Not olives, jalapeño spread, like a spicy spread.
KK: Oh, okay, even better. So this sounds very southern. So in the south what we do, I mean, I grew up in North Carolina, you take cream cheese and pepper jelly, hot pepper jelly, and you put that on.
SV: Perfect. That's exactly it.
KK: Okay. All right. Good. Okay, delicious.
EL: So do you make your own jalapeños for this, or you have a favorite brand?
SV: Oh boy, no, no. I'm a glutton, not a gourmet, so I'll eat it whatever someone gives it to me, but I don't know what to make these things.
EL: Okay. We recently, we had a CSA last fall, and had a surplus of hot peppers. And I am unfortunately not very spice-tolerant. Or spiciness. I love spices, but, you know, can't handle the heat very well. But my spouse loves it. So I've been making these slightly sweet pickled jalapeños. I made those. And since then, he's been asking me, you know, I've just been going out and getting more and making those, so I think I will be serving this to him around the time we air this episode.
SV: So since you can get everything horrifying on YouTube, one horrifying thing you can watch on YouTube is the world chili-eating competitions.
EL: Oh, no.
SV: Let’s just say it involves lots of milk and barf bags.
KK: I can imagine.
SV: But yeah, I do like my chilies. I like the habañeros and things like that.
EL: Yeah, I just watched the first part of a video where everyone in an orchestra eats this really hot pepper and then they're supposed to play something, but I just couldn't made myself watch the rest. Including the brass and wins and stuff. I was just thinking, “This is so terrible.” I felt too bad for them.
SV: It’s like the Drunk History show.
KK: We’ve often joked about having, you know, drunk favorite theorem.
SV: That would be awesome. That'd be so cool.
KK: We should do that.
EL: Yeah, well, sometimes when I transcribe the episodes, I play the audio slower because then I can kind of keep up with transcribing it. And it really sounds like people are stoned. So we joked about having “Higher Mathematics.”
KK: That’s right.
SV: That’s very good.
EL: Because they’re talking, like, “So…the mean…value…theorem.”
SV: I would subscribe to that show.
EL: Note to all federal authorities: We are not doing this.
KK: No, we’re not doing it.
EL: Yeah. Well, thanks a lot for joining us. So if people want to find you online, they can find you on Twitter, which was I think how we first got introduced to each other. What is your handle on that?
It's @geomblog, so Geometry Blog, that's my first blog. So g e o m, b l o g. I maintain a blog as well. And so that's among the places—I’m a social media butterfly. So you can find me anywhere.
SV: So yeah, but web page is also a good place, my University of Utah page.
EL: We’ll include that, including the link to this post about Fano’s inequality so people can see. You know, it really helped me to read that before talking with you about it, to get some of the actual inequalities of the terms that appear in there straight in my head. So, yeah, thanks a lot for joining us.
KK: Thanks, Suresh.
SV: Thanks for having me.