To the average person, artificial intelligence systems are …unintelligible. We hear a lot of terms bounced around: black boxes, enigmatic, even just plain magic. These complex neural networks, with parameters that can number in the trillions (never mind their outputs, which are even bigger), are all but impossible for even a curious brain to conceptualize. But the Knowing Machines team wants us to understand. You're going to meet several team members today, each tasked with demystifying the black boxes by writing an essay about their favorite dataset. Each essay focuses on different aspects of its dataset: its inception and its limitations, how it structures knowledge, makes predictions, and intervenes in the world. Seen through the lens of an expert, maybe it'll become your favorite dataset too.
Blue Dot Sessions, "Greylock," "Lumber Down," "Turning on the Lights," "The Big Ten," "Dance of Felt," "Angel Tooth," "Dear Myrtle," "Children of Lemuel," "Rafter"
Tamar: From the Engelberg Center on Innovation, Law and Policy at NYU School of Law and USC's Annenberg School for Communication and Journalism, this is Knowing Machines, a podcast and research project about how we train AI systems to interpret the world. Supported by the Alfred P. Sloan Foundation. I'm your host, Tamar Avishai.
Tamar: So artificial intelligence systems are considered kind of unknowable. We hear a lot of terms bounced around: black boxes, enigmatic, magic. I mean, these are complex neural networks with parameters that can number in the trillions, never mind their outputs, which are even bigger. It's kind of like imagining how long dinosaurs were around, or how long ago. It's just not something you can wrap your head around. And yet, no matter how complex or convoluted these systems become, the data used to train them remain one of the most important and human sources of evidence that we can use to trace the histories, practices, and politics of how these systems interpret the world we live in. The Knowing Machines team, several of whom you're about to hear from, want to make the datasets that feed artificial intelligence systems intelligible. So they each sat down and wrote an essay about their favorite one, inviting me and you to not just begin to understand the role these datasets play in how they structure knowledge, make predictions, and intervene in the world, but maybe even like them as much as they do. So without further ado, the Knowing Machines team presents nine ways to look at a dataset.
Will: So my essay is called [00:02:00] Datasets as Sociotechnical Artifacts: The Case of C4.
Christo: Investigating Datasets.
Hamsini: AI Birds.
Kate: Datasets at an Event Horizon.
Sasha: Investigating ImageNet.
Jason: What can LAION teach us about copyright law?
Mike: Datasets as Institutions
Jer: I haven't even titled it yet.
Mike: By Mike Ananny.
Christo: By Christo Buschek.
Kate: By Kate Crawford.
Sasha: By Sasha Luccioni.
Jason: By Jason Schultz.
Will: My name is Will Orr.
Hamsini: I'm Hamsini Sritharan.
Jer: By Jer Thorp.
Will: So C4 is a dataset, a large language dataset created by engineers at Google. Basically, it takes a dataset called Common Crawl, which is a large scrape of the internet, and cleans it so it can be used by a language model they make called T5, which is an open-source model. A lot of audits after the dataset was released found that despite a lot of cleaning, in terms of removing duplicate content and what they thought to be offensive content, a lot of this content still remained. And this is all to say, essentially, that the final dataset perpetuates these power structures that we know too well on the internet, in terms of the dominance of cis white heterosexual voices, often from the global north. And this is particularly important because models that are trained on this dataset often then reinforce these patterns. And so it means that marginalized voices often go excluded and underrepresented in these language models. So I was taking this case of C4, and I was fortunate to speak to one of the creators of C4 and think about why these outcomes occurred. And it was interesting, because they were explaining that these were expected but not intended consequences of the filtering pipeline. So, for instance, the removal of marginalized content, although it was not intended in the filtering, was to be expected because of the constraints they were working in. So essentially this essay [00:04:00] looks at the sociotechnical constraints.
Will: So essentially, the legal and the social aspects that contribute to the production of datasets themselves. This case of C4 is really interesting because dataset creation is often just seen as something that needs to be done, or that is done first, but kind of haphazardly, in order to get the model out; it isn't necessarily given the thought or attention that it really needs in order to make these things well. And they mentioned as well in the interview that the process of cleaning this dataset properly is a very long process, that it requires a lot of external input and probably input from communities, and no one on that team really felt qualified to do that themselves, which is, you know, fair enough. But at the end of the day, what that meant is that they relied on something that was quick and easy and that was ultimately imperfect, and produced this artifact that can potentially produce harms down the line. Datasets are always going to be products of their social environments. That's true. And we're always kind of resisting power in certain ways. But I think that we can think about how we can make these better. And this often, you know, it does require resources, and it requires time. Yeah. And intention. I think we need to act with intention in this space as well. What does responsible dataset creation look like, and how do we safeguard those processes?
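To make the filtering Will describes a bit more concrete, here is a minimal, hypothetical sketch of C4-style heuristic cleaning in Python. It is not Google's actual pipeline; the blocklist contents, thresholds, and function names are invented for illustration. It only shows the kind of blunt, rule-based filters involved, including how a single blocklist hit removes an entire page regardless of context.

```python
import re
from typing import Optional

# Hypothetical, simplified sketch of C4-style heuristic cleaning of web text.
# Not Google's actual pipeline; blocklist, thresholds, and names are invented.

BLOCKLIST = {"badword1", "badword2"}          # stand-in for the real blocklist
TERMINAL_PUNCT = (".", "!", "?", '"')

def clean_page(text: str, seen_lines: set) -> Optional[str]:
    """Return a cleaned page, or None if the whole page should be dropped."""
    kept = []
    for line in (l.strip() for l in text.splitlines()):
        if len(line.split()) < 5:                 # drop very short lines
            continue
        if not line.endswith(TERMINAL_PUNCT):     # keep only sentence-like lines
            continue
        words = set(re.findall(r"\w+", line.lower()))
        if words & BLOCKLIST:                     # one hit drops the entire page
            return None
        if line in seen_lines:                    # crude corpus-level deduplication
            continue
        seen_lines.add(line)
        kept.append(line)
    return "\n".join(kept) if len(kept) >= 3 else None
```

Filters like the blocklist are exactly where "expected but not intended" exclusions creep in: one flagged word, however it is used, and the whole page, and whoever wrote it, disappears from the corpus.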
Jer: So the essay is about the data that gets produced by a community collaborative science platform called iNaturalist, by Jer Thorp. Haven't even titled it yet. It's hugely popular. They have a million and a half active users, and people go out every day, all around the world, and they take photographs of living things. And it started as a platform such that people could do [00:06:00] community-science-based biodiversity surveys of places. So, let's go out in the suburbs of Chicago and try to figure out what species live there. And although there had been efforts to do this kind of thing around specific groups of animals, like birds or fungus or insects, iNaturalist really set out to try to do all of it. And so you can take your iNaturalist app out into your backyard or out to the park, and you can take a photograph of pretty much every living thing that you see, and it will end up in their dataset. And this project was running for quite a long time until it became a project that involved machine learning. And what happened is that because it had been running, I think, for almost a decade before the machine learning came up, they had a whole bunch of really well-tagged images. So people would take these images of a plant.
Jer: And mostly because these were citizen scientists, they would take a guess at what it was. And then other people who maybe had expertise about that plant could come in and they could say, actually, what this is is, you know, insert real name here. And so I was really blown away when I first tried the machine learning version of iNaturalist, because it worked so well. And the reason why it worked so well was this effort of these community members over a really long time. And so the iNaturalist dataset is the training set that they use to teach their model how to recognize pretty much every living thing on the planet. So I use iNaturalist. I'm a proud citizen scientist. But the reason why I picked this essay was because of a conversation that we had with one of the developers of iNaturalist, when he mentioned something which I think is very particular about iNaturalist, and that is that iNaturalist considers humans to be wildlife. Depending [00:08:00] on your philosophy, that might seem either totally obvious or completely ridiculous. And I think there's something about that fact that it kind of sits both ways with people. But what that means is that the iNaturalist database, alongside pictures of frogs and tadpoles, has a whole bunch of pictures of kids and adults, and it also has pictures of the things that humans leave behind.
Jer: One decision that iNaturalist made quite early in the process was that the evidence of animals would also be counted as a record of that species. So if I were walking in the forest and I saw the kind of bed that a deer makes out of cracked branches to sleep at night, I could count that as evidence of seeing the species. So they trickled that down to people. And it means that if I find a gum wrapper, then I'm seeing evidence of humans. And there was something that really resonated with me about that. I live in an urban park, and I think a lot of people maybe are confronted every day by that human evidence, by big piles of garbage from people who've had picnics, or, you know, in the New York Harbor, we get a lot of garbage floating up on the beach. But in amongst all of that are also these quite amazing wildlife experiences. So there was something about that story, and about this dataset itself, that made it feel really personal to me.
Hamsini: So, full confession, I really wasn't a birder until I started looking at this dataset and talking with colleagues who are birders.
Tamar: But you are one now.
Hamsini: Getting there, I think. I don't have binoculars yet. My essay is just called AI Birds, and I'm Hamsini Sritharan. It's [00:10:00] looking at the birds dataset, which was put together by a collaboration of computer scientists and ornithologists at the Cornell Lab of Ornithology. And it's a dataset of 48,000 images of birds commonly found in North America. The researchers are specifically responding to datasets like ImageNet, which are sort of very blunt tools, I guess, and tried to create a dataset that would lend itself to what they call fine-grained classification tasks. Which is, you know, instead of telling a bird from a chair, it's: what species of bird is it? How old is it? What's its sex? Those kinds of more narrow, nuanced categories that machine learning systems are not so great at recognizing. Why are birds so good to think with when it comes to digital technology? I think for me, it really is about the sort of ubiquity and variety that they present, right? They're just beautiful, fascinating little creatures. And they're everywhere, right? If you go outside and you sort of tune your attention to the world around you, no matter where you are, you will probably see birds, because they're so visual as well as so sonic and so expressive. There's something that lends itself to kind of thinking with them and using them to sort of illustrate ideas, and I think there's something really inspiring about them. There's just like a really lovely way to bring back sort of meaning to these datasets by looking at them as a person.
Hamsini: You have this photo of this bird in this really naturalistic setting, and if you look at the birds dataset in this dataset tool, you can see sort of how it's been annotated. But [00:12:00] that's it. You don't get any of the story, right? You don't know who this bird is, when this photo was taken, what this bird's life was like. You know, is it really something that can be taken as an exemplar of its species, right? Or is it like a little freak? If you think about this image, there are just so many layers of story that you don't have access to when you just look at these images, right? From the story of the bird itself, to the person who stopped to take a picture of it and, like, framed it in a specific way, to, you know, the people who are annotating these images and sort of curating them and trying to identify its species. And then you get into the sort of layer of, then, what is the machine seeing and taking away from this, right? What is the algorithm learning? And I think that all of these layers have just so much richness to them. And there are so many histories and narratives that get flattened out. That's not always something that is as clear-cut as you might think, you know.
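As a rough illustration of what "fine-grained" means here, consider the difference between a coarse ImageNet-style label and the kind of per-image annotation Hamsini describes. The sketch below is hypothetical; the field names are invented and are not the actual schema of the birds dataset.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for a single annotated bird photo. Field names are
# invented for illustration, not the birds dataset's real schema.

@dataclass
class BirdAnnotation:
    image_path: str
    species: str                         # e.g. "American Goldfinch", not just "bird"
    age: Optional[str] = None            # e.g. "adult", "juvenile"
    sex: Optional[str] = None            # e.g. "male", "female"
    photographer: Optional[str] = None   # part of the story that usually falls away
    annotator: Optional[str] = None      # who decided what this bird is

coarse_label = "bird"                    # a blunt, ImageNet-style label
fine_label = BirdAnnotation(
    image_path="images/0001.jpg",
    species="American Goldfinch",
    age="adult",
    sex="male",
)
```

Even a record like this captures only a sliver of the layers she lists; the photographer's framing, the annotators' judgment calls, and the bird's own story mostly fall away.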
Christo: Investigating Datasets, by Christo Buschek. For me, it was always about, like, how do you approach a thing like a dataset, and how can you approach it in an investigative way? And I think one of the important things, especially because this work we're doing is so much embedded in an academic world, is that I do try to make a differentiation between investigation and research, because I think they're two different things. And one of the reasons why it's different is that your motivation is a different one. Like, when I'm investigating something, I always try to assume an adversarial position, because I think it's not about being unfair, but it's just that you try to find the contradictions or the weak [00:14:00] spots in whatever it is that you want to investigate. I don't know if it's personal, but it's definitely very, very interesting. I think my background being a journalist definitely shapes how I approach this work. So my background is more in data investigation. I'm not an academic, I'm also a programmer. So I work with data, and I write logic around data in order to produce more data. My goal is not to achieve knowledge, but basically to obtain facts and truth about something. And of course, truth can be a relative thing. So you also have to define what your questions are. The thing that you want to know defines basically the way that you have to look at things. So the questions matter. And you can look at the same information, but with a different question you will get different data out of it. So it's also the lens through which you want to know something: it will define what you will attain.
Jason: What can LAION teach us about copyright law? By Jason Schultz. So this essay is about what it means to look at each individual image in, you know, a dataset of 40 million images, and then think about it from the perspective of copyright law. A lot of people will talk about copyright law and AI, or, you know, whether images are stolen or whether people should be paid. But it's always in this sort of abstract, big-picture aggregate. And what I wanted to do with this essay was go look at specific images and talk about some of the challenges of figuring out, like, is it even copyrightable? Who owns it? How would we know? And could we use it anyway? Is there a permissive use, such as a fair use, or something else? To show it's really complicated. Each individual image requires... [00:16:00] well, each individual image tells us a story and requires its own analysis. And then multiply that by 40 million or 5 billion, and there's no way to do it. I'm a copyright nerd, you could say, if you want. I mean, copyright law is fascinating to me, in part because it tries to take on these big-picture questions of what it means to create something in a society, and what that society owes you. If you create something, does it owe you anything? Does it owe you money? Does it owe you some sort of legal rights? Does it owe you credit, compensation? And so really, you know, copyright law is about all that. It's trying to figure out what is the right societal relationship to creators. And there have been many challenges to that relationship over the years from technology: the printing press, the camera, the internet, and now AI. And, you know, each time you could just try to tweak it.
Jason: But I think when you have big fundamental shifts like we're in right now, which is a pretty big one, you have to also sort of back up and ask, what are we trying to do here? Like, what does society want out of creators, and what does society want out of technology, and how are we going to figure out the right balance between it all? What interests me specifically about this dataset and this tool is that you can answer these questions for particular artists. You can answer these questions from the point of view of the technologists. You can answer them from the point of view of museums. But I just wanted to look at specific stories around specific images and start to ask the questions that I thought were interesting: what do we even know about who made this? Who owns this? You know, we often assume things. We often assume everything is copyrighted, and we often assume someone's getting [00:18:00] screwed if you're copying and not paying them. And that's not true. There's a lot of copying that happens which is super beneficial to society, and no one gets hurt. There's a lot of copying where someone's already made enough money, and this is just helping society do extra interesting things. And then there is copying where it really hurts artists, and they can't make a living. And then you have tougher questions around, well, should we allow that copying to happen or not? And I think breaking it down to a little bit more granularity is a better way to answer those questions. Not trying to answer them at the high level of, like, a newspaper headline. I don't think that's the right way to approach it.
Sasha: You know, it's important to recognize a person as a person. Once you start getting into categories like criminal, stranger, ballbuster, and redneck, are these really categories that you want your model to be using? Investigating ImageNet, by Sasha Luccioni. We're exploring the nine lives of ImageNet and how it continues to be used within the AI community in its different shapes and forms, and how, you know, a lot of people use different versions of it, and there's no single ImageNet that you can use to compare models. And so the fact is that you can't really compare models unless you're using a common point of comparison. And ImageNet is kind of that point of comparison. But in reality it's not, because there's no single ImageNet. So it's kind of a paper about how the way in which we evaluate AI models using ImageNet is flawed. Well, I think that members of the AI community have integrated ImageNet as a given. It's kind [00:20:00] of the go-to dataset if you're doing almost any kind of image classification or computer vision task. So people use it for all sorts of things. And even if, as I say in the essay, they're not using the categories themselves, they'll use the images, because it's considered kind of the dataset: the biggest dataset, the cleanest dataset, the validated one. And people don't take the time to figure out what's in there.
Sasha: They just kind of take it as a given. And once you start scratching at the surface, you realize that the stuff that's in there is maybe not the stuff you want your models to train on. So in the last couple of years, we've entered a paradigm of very general AI models. People call them foundation models. And the idea behind these foundation models is that you can train them on a lot of data, and they learn a general representation of the world, almost. So it can be a lot of text data, in which case they'll learn, you know, how language works, quote unquote. Or it could be a lot of image data, and they'll learn what the objects in the world around us are. And then once they've learned this really general, almost universal knowledge, the hypothesis is that you can then train them for specific tasks with less data. So, for example, if I want to identify the birds in my garden right now, I'm training an AI model to do that. First, I'll start with a model that's already been trained on kind of general objects, and then I'll fine-tune it on pictures of birds, for example.
Sasha: And what they don't realize is that by taking this foundation, by building upon this foundation, you're baking biases into your model. And so it's important to understand what these large datasets contain, because they're reused and reused and reused. Almost any AI product that's going to use computer vision is going [00:22:00] to have ImageNet somewhere inside of it, somehow integrated into it. It was just interesting, kind of thinking through this whole idea of how datasets are not just a single entity that is very well-defined and reproducible. And actually, I kept coming back to this idea of foundation, because it has become such a catchy term in AI lately. People talk about foundation models, but there are also foundation datasets, and ImageNet is definitely one of them. And actually, it's not the most solid foundation, because of the different ways in which people use it and define it. And so I was just thinking about all the different ways over 14 years in which ImageNet has been used. And, you know, we kind of are like, it must be a good dataset, it must be a good model because it's trained on this dataset. But all of those assumptions are not necessarily true.
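For readers who want to see the recipe Sasha describes in code, here is a minimal, hypothetical sketch of fine-tuning an ImageNet-pretrained model on a smaller bird dataset using PyTorch and torchvision. The species count and the bird_loader are assumptions for illustration; the point is that the ImageNet-trained backbone, and whatever is baked into it, gets reused wholesale as the foundation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch: start from an ImageNet-pretrained ResNet-50, freeze the
# backbone, and retrain only a new classification head on bird species.

NUM_BIRD_SPECIES = 400  # assumption for illustration

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the ImageNet-trained backbone: its learned representation
# (including whatever biases it absorbed) is reused as the "foundation".
for param in model.parameters():
    param.requires_grad = False

# Swap the 1000-way ImageNet head for a bird-species head.
model.fc = nn.Linear(model.fc.in_features, NUM_BIRD_SPECIES)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Training loop sketch (bird_loader is assumed to yield (images, labels) batches):
# for images, labels in bird_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```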
Mike: I see journalism and news as something that is always sort of a little dance of interpretations, and it's always a thing that could be otherwise. Datasets as Institutions: The New York Times Annotated Corpus, by Mike Ananny. Institutions are these things that are usually invisible, and they seem really stable, and they seem really routine, and they float into the background. And we have a really hard time understanding how institutions work. There's this old idea defining institutions as, quote unquote, loosely coupled arrays of standardized elements, and that's a very academic, technical way of basically saying institutions are these collisions and combinations of a whole bunch of different things that are what people do. They're what people think they're doing. They're what [00:24:00] people make. And then you look across all of those elements and you figure out how those elements fit together. They're like puzzle pieces. And if you can put those puzzle pieces together, of what people do, what they make, what they think they're making, what they say they're making, if you can put all those puzzle pieces together in a way that makes sense, or that looks coherent, or looks stable, then we've usually called those things institutions. So journalism is one of those institutions, because we have a hard time understanding and defining exactly what journalism is. And, you know, there have been decades and decades of research and scholarship trying to figure out what is journalism, why does it matter, how is it made, how do people make sense of it. Datasets are one way of understanding this institution of journalism, and I want to look at that in this essay through one particular dataset.
Mike: And that's this dataset called The New York Times Annotated Corpus, which again is one of these sort of background, invisible things that makes journalism work, makes computational journalism work. And what The New York Times Annotated Corpus is, is this very, very large collection of articles that were published in The New York Times between 1987 and 2007, and it contains about 1.8 million documents. And these documents, these articles, have had abstracts written about them. So they've extracted it and said, you know, what is this article really about? The articles are categorized into topics, into people, like who appears in this story. Where did this story appear? Who wrote this story? This is all metadata that sort of describes what these articles are. And what's quickly happened is that this has become kind of a de facto standard, or one [00:26:00] of the, quote, best or most authoritative images of what news language is. So if you're building a machine learning model, or you're building some kind of computational system, and you're looking around and you're saying, I need some model or some example of what quote unquote good news is, the New York Times Annotated Corpus is that set of stories. It's the best example you can find. That's what The New York Times would argue. They would say this is the best, because The New York Times is highly invested in it being the best, the top, the most exclusive form of journalism. I had an old advisor tell me that journalism exists in traditions, not nature. There's nothing natural about journalism. It's made up.
Mike: We make it up. Humans make it up. They make it up because they decide what news is. They decide what a good way of describing a story is. They decide what words to use, what they want the reader to walk away from the story having gotten out of it. To me, that is all unstable. It's all a matter of human interpretation. And that's a good thing. That's a good thing because we all bring to it our different ideas of what we think news should be about. So I always get nervous when somebody tells me, no, no, this is what news really is, or this is what good journalism really is. I always have this personal pause of saying, hold on a second. That's just one way to look at the world. Don't tell me that this is the authority or the standard of what news is. So the reason I'm interested in The New York Times Annotated Corpus is that it's becoming, or it has become, the standard, this authority, this seemingly objective example of what a good news article is, or what a, you know, what a good journalistic beat is. And that is seductive, because it looks like objectivity, it looks like authority, and you say, ah, I don't have to worry about what journalism is, because this dataset is going to tell me. To me, that's [00:28:00] really pernicious and a little bit dangerous. If we freeze news as a single thing, if we lock it down, if we stabilize it, then we're limiting our view of the world. We're limiting our view of what journalism could be.
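To make the kind of metadata Mike lists a little more tangible, here is a hypothetical sketch of what a single annotated article record might look like, and how a researcher might query a pile of them. The field names and the stories_about helper are invented for illustration; they are not the corpus's actual distribution format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical representation of the metadata Mike describes for one article.
# The real corpus has its own format; these fields are illustrative only.

@dataclass
class ArticleRecord:
    headline: str
    publication_date: str                              # e.g. "1994-03-17"
    abstract: str                                      # "what is this article really about?"
    topics: List[str] = field(default_factory=list)    # descriptor tags
    people: List[str] = field(default_factory=list)    # who appears in the story
    byline: str = ""                                   # who wrote it
    section: str = ""                                  # where it appeared

def stories_about(records: List[ArticleRecord], topic: str) -> List[ArticleRecord]:
    """Filter a collection of records to articles tagged with a given topic."""
    return [r for r in records if topic in r.topics]
```

The point is less the code than the framing: once news is reduced to records like these, whatever the metadata encodes starts to stand in for what "good news" is.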
Kate: Datasets at an Event Horizon, by Kate Crawford. So this is a personal issue for me, because, you know, really starting to do the work of studying datasets, seeing why they matter, and seeing what's at stake in a training dataset for an AI system had not really become a thing back in around sort of 2015, 2016, when I really started to do this work at scale, and I was working with artist Trevor Paglen. We were really studying the history of training datasets. So we started by looking at some of the earliest ones, you know, these black and white collections of images of irises that were going to be used in medical studies, through to the use of, you know, early portrait data for facial recognition systems sort of being put together in the 1990s, funded by the Department of Defense, all the way through to what happens with the internet, when suddenly we have this moment where there's so much data available online that it can be scraped and harvested and put into these training datasets, often without a person ever really looking at it. We traced that arc, and in doing so, had to also kind of look at the materiality of the data itself: really looking at the images, looking at the text, thinking about the taxonomies of their organization and their ordering. And we [00:30:00] had to make methodologies for this. It wasn't a thing that was being done. Even computer scientists, when they're making models, will often just pull the training data off the shelf and not ever open it up to look at what's inside. And it was in doing this that two things became clear. One, there's so much stuff in training data that is deeply problematic, that is dehumanizing, that is profoundly stereotypical.
Kate: And in many cases just illogical or wrong, you know, just things that are mislabeled, things that make no sense, concepts that aren't visual being given a visual title. I mean, it's a mess in there. Like, totally, as, you know, one of the creators of a dataset recently described it: these things can look like hot garbage. So that was the first thing, to learn and to realize that if that's your foundation, it is going to inform everything that's built upon it. It is the material that every system will use to interpret the world. So it was clear to me from the very outset that training data is essential, overlooked, and beset with problems. The second thing, and this was the work of really doing this, is that it's really important that we actually do this work and come to understand it. And it is extremely difficult. You know, it does take years. It does take a certain amount of just dogged determination to just sit there and say, am I going to understand how this thing works, what's in it, where it comes from, who's represented, who's not represented? To ask these deep epistemic questions about how worlds are being built. I think, for me, that became a personal mission. Like, this really is a thing which isn't being done enough in the field. It will shift how these systems work and how people will be impacted by them. So for me, that is very much, you know, a personal question and an ongoing orientation to why do [00:32:00] this type of research.
Tamar: Beautiful.
Tamar: Next time. On Knowing Machines, we explore the legal implications of AI, from facial recognition to taking the bar exam while being watched by a computer, to the connection between an AI lawsuit, free speech and Oprah.
Oprah: Okay, free speech not only lives, it rocks.
Tamar: We'll see you then.