Engelberg Center Live!

Dataset Creators: The Architects of AI

Episode Summary

How do you make a dataset? And…how do you make a better one? Does a dataset accurately reflect what it claims to, and who should be held accountable if a dataset is misused? These are just a few of the questions posed by Will Orr and Kate Crawford when they embarked on a groundbreaking study of dataset creators, a historically overlooked and undervalued piece of the AI puzzle. In this episode, Will and Kate discuss their findings, the challenges all creators face, and what it means to produce and care for these unstable, unwieldy repositories once they’re out in the world.

Episode Notes

The Blue Dot Sessions, “Arizona Moon,” “Color Country”

Episode Transcription

Knowing Machines

Episode 4: “Dataset Creators”

 

Tamar Avishai: From the Engelberg Center on Innovation, Law and Policy at NYU School of Law and USC's Annenberg School for Communication and Journalism, this is Knowing Machines, a podcast and research project about how we train AI systems to interpret the world. Supported by the Alfred P. Sloan Foundation.

 

I'm your host, Tamar Avishai.

 

So let's talk about datasets. Artificial intelligence is, of course, full of them. It runs on them. It is them. Generative AI models are trained on unimaginably voluminous amounts of data that have been scraped, grouped, and are actively interpreting our world even as I say this. The dataset used to train a generative AI model influences what it does, how it responds to prompts, and what it produces. Small idiosyncrasies within the training data can have huge effects on what it produces. And this data is on literally everything. I mean, think about what ChatGPT is capable of. In the last episode, you heard about the role of data in facial and emotional recognition, and in the next episode, you'll hear about its role in the birder community, in journalism, and in art. If the human brain has conceived of it, you'll find it in a dataset. And today, we're going to talk about how a dataset is made, and specifically who is making them, because yes, they are made by people.

 

We tend to think of data as objective or neutral, conveniently forgetting that both data and datasets are profoundly shaped by the intentions and biases of their creators. So who are these creators? What kinds [00:02:00] of micro and macro constraints are they under, and what kinds of questions do they have to think about and grapple with before they unleash these unwieldy, unstable repositories out into the world? Questions like: when is a dataset ready to train an AI model? Does the dataset accurately reflect what it claims to reflect? Could it be misinterpreted or misused? To talk about these questions, and maybe suss out some answers, I sat down with Will Orr and Kate Crawford, who, as part of the Knowing Machines project, undertook a comprehensive study of dataset creators. This study is the first of its kind, interviewing 18 dataset creators and actually centering their voices, which have historically taken a back seat to those of model designers and developers. What they determined, in particular, were four central challenges faced by creators that shape the construction of datasets. Here's Will:

 

Will Orr: So firstly, scale. Dataset creators felt increasing pressure to scale up their creations, yet this brings with it associated costs, such as the inability to adequately care for their contents. Resources. Creators often negotiated limited computational and financial resources in constructing and making use of datasets, yet these trade-offs can compromise the quality of subsequent datasets while also limiting who can meaningfully make use of them. Shortcuts. Dataset creators can rely on shortcuts and proxies that, when naturalised, lead to systemic failures that can affect those who rely on them downstream. And finally, accountability. Creators communicated an ambivalence regarding where accountability for their creations and their impacts lies, due to the legal constraints they were operating within and their inability to control the uses of their datasets.

 

Tamar Avishai: Okay. Remember these four challenges throughout [00:04:00] this conversation: scale, resources, shortcuts, and accountability. And of course, the importance of centering creators and their voices.

 

Will Orr: So yeah, like you said, we interviewed 18 dataset creators for this project, and for this podcast I met again with two of them: Chandra Bhagavatula, a lead researcher at the Allen Institute for AI, and Robin Jia, a computer science professor at the University of Southern California. You'll hear their voices over the course of this episode, you know, when we really want to drive a point home.

 

Tamar Avishai: Here's my conversation with Will and Kate.

 

So Will and Kate, thank you so much for joining me today. So I want to learn about datasets today more broadly, but I also want to learn about your study more specifically. So let's start broad for people like me who are coming into this subject fresh. And then we'll get to the study. What is a dataset and why is it so important for AI?

 

Will Orr: So datasets are essentially the collection of all the data that is then fed into these machine learning models. And so we have two primary forms of datasets. You have these large training datasets that are used kind of as the foundations of all these machine learning systems, and you have these benchmark datasets, which are used to evaluate how these systems work. And essentially these datasets produce the representations of everything that comes to be represented within these systems. And so when these systems fail, right, and when they produce harms, we can look at these training datasets to kind of understand why that might be the case and what forms of data may have led to these misrepresentations or harms.

 

Tamar Avishai: So what inspired you both to take on this study of dataset creators?

 

Kate Crawford: Well, dataset creators are incredibly important in the ecology of how we make models. In so many ways, datasets really set the epistemic boundaries, the [00:06:00] model of the world that we'll create, you know, how a system interprets the things that it's fed. So in this sense, dataset creators have an enormous amount of power and input into what a model looks like and how it works. Yet there haven't really been many studies that have centered dataset creators. And this was something that I was remarking on with Will when we first started working together, gosh, a year and a half ago now. Um, and so we really shaped this study to essentially go and find dataset creators and to ask them questions around what is this process, this social construction of datasets. You know, it's not just one person sitting in a room deciding what a dataset will be and then building it. It's often in these institutional contexts. Sometimes they're in tech companies, sometimes they're in academic labs, sometimes they're in these sort of wider research conglomerates. So finding out those kinds of questions was, you know, a really major motivation for the study.

 

Will Orr: Yeah. And I'll just add, you know, these dataset creators are often faced with, you know, a myriad of challenges throughout their creation process. And they often have to make decisions on the fly that materially shape the way these datasets are constructed in the end. And there are no clear ways to kind of make these decisions, right. Often they have to rely on ad hoc practices or subjective determinations throughout their practice. And so what we really wanted to do is kind of shine a light on this practice of dataset creation itself. This is labor that's often devalued in the pipeline of machine learning research, and it doesn't get a lot of attention in, say, research papers. And often these dataset creators are learning lessons just by themselves. So we wanted to bring these voices together to see what we could learn from dataset creators as a community.

 

Kate Crawford: Exactly right. And we could think about this as a form of ground truth construction. You know, that's really what [00:08:00] these architects are building. This is the ground truth. They're building the foundation. So for this work to be undervalued and under-theorized, in computer science more specifically, I think is a bit of a problem, because it is so important. And so part of this process is hearing from dataset creators, hearing about these challenges, but also hearing from them about how things could be better, how these challenges could be addressed, how we could actually make, you know, stronger, more ethical, more helpful frameworks and forms of thinking to guide dataset creators.

 

Tamar Avishai: And briefly, can you talk about the inner workings of this study, like the methodologies that you used?

 

Kate Crawford: Sure. So, you know, Will and I started working together about a year and a half ago, and it was very clear to us that dataset creators are a really important constituency if you want to understand how models are being built and the impacts that they have on the world. So we looked at the literature in this space, and we were really surprised to find that there aren't really any major studies that center dataset creators. So it was very clear to us that this was a gap that should be addressed. So we designed a study back in early 2022 where we wanted to really look at a wide range of dataset creators, dataset creators who are making, you know, really specific datasets, in some cases benchmark datasets. And we came up with a set of around 18 dataset creators to interview.

 

Will Orr: Yeah. So the creators that we interviewed were from a wide range of datasets, right, and also from a wide range of contexts. So many of them were from academia and from universities, and we also had some dataset creators that were from nonprofit organizations, and a few that were located within private tech companies as well. The datasets themselves, yeah, there were a range [00:10:00] of benchmark datasets and large scraped datasets, and they were from a wide range of domains as well, including emotion recognition, action recognition, a lot of natural language processing, and personal recommendation. And I think what's important is that the data collection process itself was diverse. So many of them were collecting datasets with crowdworkers, and some were also scraping datasets. And so what we really wanted to do is understand how dataset creation is working as a field, as a whole, and to really uncover the shared challenges that are faced by this community.

 

Kate Crawford: And I'll add to that that we spoke to a few, but not many corporate researchers, because of course, in many cases, researchers inside major tech companies aren't really free to speak about how they're making large scale datasets, which, of course represents another problem for really looking into the often quite opaque processes of how AI systems are built.

 

Tamar Avishai: Okay, so let's move on to the findings of this study. Was there anything glaring that you saw dataset creators not addressing in their work, like any open challenges that aren't being addressed?

 

Will Orr: Yeah, I think there are a number of open challenges in the dataset creation community. I think that, you know, big ones are privacy and consent. So at the moment, the way large datasets are created is by scraping large amounts of data from the internet, and often this is, you know, without getting the consent of the people that are represented within these datasets. And at best you have this kind of opt-out consent. So if you find that your data is within these large datasets, you're able to request that it be removed. Um, what I'd love to see in the future is, you know, moving towards opt-in consent as well. So having positive consent from people who know that their data is being collected and are willing to be represented within datasets. Um, I think there's [00:12:00] also a larger question about privacy, whether people want their data within datasets at all. And, you know, in this new era of generative AI, I think these questions are more pressing than ever.
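
To make the opt-out mechanism Will describes a little more concrete, here is a minimal sketch of what a removal pass over a scraped dataset might look like. The record fields and the opt-out registry are hypothetical placeholders, not any specific creator's pipeline.

```python
# A minimal sketch of an opt-out removal pass; field names and the opt-out
# registry are hypothetical, not any specific creator's actual pipeline.
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint for a piece of scraped content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def apply_opt_outs(records: list[dict], opted_out_urls: set[str], opted_out_hashes: set[str]) -> list[dict]:
    """Drop any record whose source URL or content fingerprint appears in an opt-out registry."""
    kept = []
    for record in records:
        if record["source_url"] in opted_out_urls:
            continue
        if content_hash(record["text"]) in opted_out_hashes:
            continue
        kept.append(record)
    return kept

# Example usage with toy data.
records = [
    {"source_url": "https://example.com/a", "text": "a public blog post"},
    {"source_url": "https://example.com/b", "text": "a post whose author opted out"},
]
filtered = apply_opt_outs(records, opted_out_urls={"https://example.com/b"}, opted_out_hashes=set())
print(len(filtered))  # 1 record kept, the opted-out one removed
```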

 

Tamar Avishai: Okay, so I keep hearing so much about scale. Why is scale so important in the creation of datasets?

 

Will Orr: Yeah, that's a really good question. So scale is such a motivator for many of these dataset creators, right. Um, this kind of comes from the notion that if you have a large enough dataset, you'll inevitably be able to produce novel outputs or interesting findings with your datasets. And this is kind of with the thought that a larger dataset is also seen as more objective, and you're able to get to this more essential truth of the data. So when I talked to Robin about this (remember, he's the professor at USC and dataset creator who I recently spoke to for this podcast), here's what he had to say about scale.

 

Robin Jia: The reason why this large size was so important was that, you know, these neural network models that I mentioned, they're not very good at learning from only, like, a few examples. So in order for them to be good, they really need to see a lot of examples of, like, oh, this is the input you're giving me and what is the correct output I'm supposed to give, right. What is the correct answer to all these different questions? But it turns out if you have 100,000 of these pairs of questions and answers, they can actually learn to do this task quite well.

 

Will Orr: But what we find is that scale brings all these unexpected challenges as well, right? So instead of scale kind of drowning out the noise of datasets, what you see is that scale just exponentially increases it. So we have these large datasets that essentially have very, very poor quality, and that job of cleaning these datasets is always left to the user of the dataset, who may not really have the tools or means to do so. We also see scale impose operational constraints upon dataset creators. Right? [00:14:00] So despite them wanting to create the largest datasets possible, often that means there are technical constraints, such as scrapers breaking down because there's just too much data. There are resource constraints, so scraping a lot of data can be quite costly as well, or, you know, engaging with many crowdworkers can be quite a resource intensive endeavor. And this itself can shape the dataset and mean that it's often not the best artifact that they could have achieved, but rather a compromise with these objectives in mind. When I spoke to Robin, he really highlighted the pressures and trade-offs that an imperative for large scale datasets imposes on dataset creators, particularly regarding the resources needed for their creation.
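
As a rough illustration of the cleanup burden that Will says gets left to dataset users, here is a minimal sketch of the kind of pass someone might run over a scraped corpus. The record fields are hypothetical, and real pipelines involve far more than this (near-duplicate detection, language identification, safety filtering, and so on).

```python
# A minimal sketch of downstream cleaning of a scraped corpus; the record
# fields are hypothetical and real pipelines are far more involved.
import hashlib

def clean_scraped_text(records, min_words=5, max_words=10_000):
    seen_hashes = set()
    cleaned = []
    for record in records:
        text = record["text"].strip()
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue  # drop boilerplate fragments and pathological pages
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # drop exact duplicates, which scraped corpora are full of
        seen_hashes.add(digest)
        cleaned.append({"text": text, "source_url": record.get("source_url")})
    return cleaned
```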

 

Robin Jia: I mean, honestly, like, keeping costs down, I guess, is a big one, or else you run out of money, right? And so it really puts a lot of pressure on you: you have to decide what you want to do, and then you kind of have one shot to do it at large scale with the budget you have, and then that's your final dataset. It does kind of hurt your ability to iterate. It also just makes things slower. For crowdsourcing specifically, there's this balance of, like, obviously if you find more people to work on your task, then your task can get done faster. But then are you, you know, hiring people who maybe aren't doing the task as well as you would like them to do? Right? So there's that sort of trade-off.

 

Will Orr: So we see how resource restrictions can limit creators' control over a dataset. And this can have material impacts on the quality of the dataset downstream.

 

Robin Jia: Anyone who's, like, tried to do stuff with crowdsourcing knows that you have to be very careful about getting actual high quality work, right. So there are a lot of different people on these platforms, and they're going to vary in a lot of different ways, you know: whether they're really trying to do the task or are kind of spamming, or even if they're trying to do the task, some of them might have a better grasp of what [00:16:00] is expected of them, or some of them might not have understood the instructions. Or, you know, there are all sorts of other issues that can arise. And basically that means you have to do a lot of quality control.
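
A minimal sketch of the kind of quality control Robin is describing: score workers against hypothetical gold (attention-check) items, drop low scorers, and aggregate the rest by majority vote. Real platforms and labs use many more signals; this is only an illustration.

```python
# A minimal sketch of crowdsourcing quality control: gold-question accuracy
# plus majority voting. The data shapes here are hypothetical.
from collections import Counter, defaultdict

def score_workers(annotations, gold_labels):
    """annotations: list of (worker_id, item_id, label); gold_labels: item_id -> correct label."""
    correct, attempted = Counter(), Counter()
    for worker_id, item_id, label in annotations:
        if item_id in gold_labels:
            attempted[worker_id] += 1
            correct[worker_id] += int(label == gold_labels[item_id])
    return {w: correct[w] / attempted[w] for w in attempted}

def filter_and_aggregate(annotations, gold_labels, min_accuracy=0.8):
    """Drop workers below the gold-accuracy threshold, then majority-vote the rest."""
    scores = score_workers(annotations, gold_labels)
    trusted = {w for w, acc in scores.items() if acc >= min_accuracy}
    votes = defaultdict(list)
    for worker_id, item_id, label in annotations:
        if worker_id in trusted and item_id not in gold_labels:
            votes[item_id].append(label)
    return {item: Counter(labels).most_common(1)[0][0] for item, labels in votes.items() if labels}
```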

 

Will Orr: Again, this is Chandra Bhagavatula, a lead researcher at the Allen Institute for AI.

 

Chandra Bhagavatula: There is a lot of careful curation of this community of annotators that needs to happen behind the scenes, and it takes a lot of time and effort to actually do that. This doesn't make it into research papers, it's not glamorous, but it is actually a lot of work to initially do that, just to get people to generate the Winograd Schema formatted questions. It took us multiple iterations. Like, my guess is, I don't exactly remember, my guess is about two months of constant iteration and back and forth to actually get to that stage. And that's just the format, like, or the kinds of questions. So there are other challenges that we had to tackle as well.

 

Will Orr: So Robin and Chandra both underscore the challenges that this drive for larger datasets imposes on data collection and quality.

 

Kate Crawford: Exactly. And it's important to remember that with scale comes a type of unwieldy kind of collection. It can really be very difficult for dataset creators to work with these datasets because, you know, as one of the interviewees said to us, at a certain point a dataset is just too large to audit, particularly by any kind of manual means. And what that means is that, fundamentally, creators aren't actually looking at these datasets. They're not opening them up and seeing what's inside them. And that's a real problem, because in so many cases, particularly with these internet sized datasets, there is just a lot of really bad data in there. Sometimes it's just bad quality, sometimes it's really problematic representations. You know, it gets really quite gnarly when you start to open up a dataset like, you know, LAION-5B and you see 5 billion images, [00:18:00] so many of which have been scraped from e-commerce sites and Pinterest, and then these, you know, really grainy, badly described images. And this is the foundation of making sense in AI. So this problem of scale really can also introduce a problem of quality.

 

Tamar Avishai: Okay. So what are some of the constraints that dataset creators are operating under? And, I mean, how do they make the trade-offs they must have to make in order to meet their objectives?

 

Will Orr: Yeah. So there are, you know, a lot of constraints that dataset creators are operating under. A big one is resources, and this isn't just financial resources, but also the time of the dataset creators themselves, the compute, so the computational resources, and also labor. And, you know, creating datasets can require massive amounts of labor throughout the data creation pipeline. And there are also funding deadlines and conference submission deadlines that dataset creators are often racing towards. And these constraints often shape the datasets and their final products. So what this means is that, you know, datasets often exist as a compromise between many of these constraints. So for instance, we had one dataset creator who was really interested in collecting data across all genders, but due to the constraints of their work, they were only able to collect data in a certain period of time, and they were only able to source participants from their computer science department, which was largely skewed towards male participants in their early 20s. And so what this meant is that the dataset itself became overrepresented by males, and females were drastically underrepresented. So while they had this ideal of providing a diverse, representative dataset, in practice you get datasets that are a compromise of their operational constraints.
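
One way to catch this kind of skew before a dataset ships is a simple representation audit. The sketch below assumes a hypothetical self-reported demographic field; it is an illustration of the check, not a recommendation to infer attributes about participants.

```python
# A minimal sketch of a representation audit over a participant pool.
# The demographic field is a hypothetical self-reported attribute.
from collections import Counter

def audit_representation(records, field="self_reported_gender", warn_below=0.25):
    counts = Counter(r.get(field, "unreported") for r in records)
    total = sum(counts.values())
    report = {group: count / total for group, count in counts.items()}
    flagged = [g for g, share in report.items() if g != "unreported" and share < warn_below]
    return report, flagged

# Example: a pool drawn mostly from one computer science department cohort.
records = [{"self_reported_gender": "man"}] * 42 + [{"self_reported_gender": "woman"}] * 8
report, flagged = audit_representation(records)
print(report)   # {'man': 0.84, 'woman': 0.16}
print(flagged)  # ['woman'] falls below the 25% threshold
```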

 

Kate Crawford: And it's interesting, there's always a challenge balancing time, money, and [00:20:00] compute in these processes of dataset creation. And it's compounded by the fact that this work isn't seen as being important, or it's not valued as anywhere near as sexy as being, like, you know, the people who are writing the algorithms and creating the models. So it often isn't prioritized in budgets, it's often not given enough time, and it's, you know, given a kind of small share of compute cycles relative to actually training the model. So because it's at the bottom of the hierarchy, it often is pulled together really quickly, with these big problems around the diversity or the quality of the data. And that goes on to really impact the models and to impact how they work in the world. So this layer is just consistently undervalued and ignored, even though it is so important.

 

Will Orr: Creators also spoke about making trade-offs throughout as solutions to problems they encountered along the way. Dataset creation is rarely a straightforward process, and creators all expressed running up against unexpected challenges during the project that often called for unique solutions. So Chandra spoke about the overrepresentation of easy instances in a dataset and the costs of implementing an algorithm that could filter these out.

 

Chandra Bhagavatula: So one big limitation here is that, just by the nature of the algorithm, since it filters out easy instances and retains hard instances, it is prone to retaining noise as well. What is more difficult, like the most difficult thing to predict, is noise. So we do retain...

 

Will Orr: What do you mean by noise?

 

Chandra Bhagavatula: Yeah. So let's imagine in general most of your dataset is clean, but there are some instances where, going back to the sentiment example, the sentence is clearly positive, but it is labeled negative through some error somewhere in your pipeline. Right. That is a noisy instance, and it is going to be super hard to predict, if you are doing well on all of the other instances, and you're always going to get that instance wrong. So [00:22:00] instances like that, misannotated instances, are an example of noise. So there are definitely chances of noisy instances being present in the final dataset.

 

Will Orr: Yeah, I think it's a really interesting example of, you know, some of the trade-offs that are made throughout the process of dataset creation, and also that, you know, sometimes you need to look for innovative solutions, I guess ad hoc solutions, to the problems that you face. So yeah, I think it's a really fascinating example of that.
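
For readers who want a concrete picture, here is a simplified sketch of the general idea of filtering for hard instances, not the specific algorithm Chandra's team used: score each example with a simple model's out-of-fold confidence in its gold label, then keep the low-confidence examples. It also shows exactly the trade-off he names, since a mislabeled example looks just as "hard" as a genuinely difficult one and tends to survive the filter.

```python
# A simplified sketch of filtering for hard instances, not the specific
# algorithm discussed in the episode: keep examples where a deliberately
# simple model has low confidence in the gold label. Mislabeled (noisy)
# examples also get low confidence, so they tend to be retained too.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def retain_hard_instances(texts, labels, keep_fraction=0.5):
    features = TfidfVectorizer().fit_transform(texts)
    labels = np.asarray(labels)
    # Out-of-fold predicted probabilities from a simple baseline model.
    probs = cross_val_predict(
        LogisticRegression(max_iter=1000), features, labels,
        cv=5, method="predict_proba",
    )
    classes = np.unique(labels)  # sorted class order used for the probability columns
    class_index = {c: i for i, c in enumerate(classes)}
    confidence = probs[np.arange(len(labels)), [class_index[y] for y in labels]]
    # Lowest confidence in the gold label = hardest (or mislabeled) examples.
    order = np.argsort(confidence)
    keep = order[: int(len(labels) * keep_fraction)]
    return [(texts[i], labels[i]) for i in keep]
```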

 

Tamar Avishai: And so these constraints and challenges, like, what impact do they have on this idea of accountability? I mean, who takes accountability for datasets and their impacts?

 

Will Orr: Yeah. So this is a really big challenge in the creation of datasets. So on the one hand, these creators that we spoke to do recognize that they have a material impact on and control over datasets, and the ways these artifacts shape and inform machine learning systems. Right. So every decision that they make essentially changes the dataset and how things become represented. Yet at the same time, they feel constrained due to their context. Right? I think particularly talking to dataset creators in corporate contexts, we hear about how legal departments may dictate the final forms of datasets and really shape the way that these datasets are able to be sent out into the world. For instance, one dataset creator noted that they weren't able to release the dataset at all, and they just had to release the code to reproduce the dataset. And that meant that they had to rely on these publicly available artifacts that were themselves flawed, that they couldn't necessarily change even though they would love to. And so what this means is that you get this dataset that is itself flawed, and the dataset creators don't feel as though they have control over it in the first place. On the other end of the scale, the creators are unable to know how users make use of their dataset in the end. So they might write these very extensive terms of use, for instance, about [00:24:00] how they would like their dataset to be used, but these are only really suggestions. And so creators felt that they were unable to control how the datasets were used after they were released into the world. And that really muddies this question of accountability over datasets.

 

Tamar Avishai: Okay, so in this era of generative AI, what are some ongoing open challenges that dataset creators face or that dataset creation faces?

 

Will Orr: Yeah, it's a big question at the moment, I think. So the dataset creators that we spoke to for this episode, at least, were pointing towards this question of how do we benchmark the capabilities of generative models. You know, previously datasets were created to test a specific task of a machine learning model, and then the capabilities of a certain model were ranked on a leaderboard as a representation of the capabilities of the field. And yet these same techniques don't necessarily work any longer in the era of generative AI, with this free-form text that might not have, like, a single right or wrong answer. So how do we test the capabilities of these generative models? I think that is a really important thing that we're going to need to address quite soon. And along with that, what are the most important aspects to be benchmarking is a really important question to be addressing. So what are the most important capabilities that we really need to evaluate in these models?

 

Robin Jia: Similarly, if you're given a document and you have to summarize it, there are lots of different ways to summarize the document. Right. So we enter this area where there's no one right answer anymore. And this is also where I think, like, the real kind of impact of ChatGPT is, also in these same sorts of situations, right? Where you ask it something, it gives you some very long answer, and it's impossible to enumerate all the correct answers to your question and just check [00:26:00] if its answer is one of those. So this issue has been around for a long time, and it's generally something that makes benchmarking a lot harder. So the kind of gold standard for evaluation in these situations is what people refer to as human evaluation. So basically, you know, you just have to get a person to look at this model's output and judge, you know, how good is this? In fact, we know that OpenAI is literally doing this. This is part of their pipeline of creating ChatGPT: they pay people to look at ChatGPT answers and say if they're good or not.
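
A minimal sketch of what that human-evaluation setup can look like in practice: collect judgments from multiple raters and aggregate them, either as average ratings per output or as pairwise win rates between models. The scale and aggregation choices here are illustrative, not any lab's actual protocol.

```python
# A minimal sketch of aggregating human judgments of model outputs;
# the 1-5 scale and the pairwise-preference format are illustrative choices.
from collections import defaultdict
from statistics import mean

def aggregate_human_ratings(ratings):
    """ratings: list of (judge_id, output_id, score on a 1-5 scale)."""
    by_output = defaultdict(list)
    for judge_id, output_id, score in ratings:
        by_output[output_id].append(score)
    return {output_id: mean(scores) for output_id, scores in by_output.items()}

def pairwise_win_rate(preferences, model_a, model_b):
    """preferences: list of (judge_id, prompt_id, preferred_model); ties excluded upstream."""
    relevant = [p for p in preferences if p[2] in (model_a, model_b)]
    wins = sum(1 for p in relevant if p[2] == model_a)
    return wins / len(relevant) if relevant else None
```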

 

Will Orr: Yeah. They were also speaking about how crowd work might be affected by generative AI systems. So they were saying that while dataset creators often rely on crowd workers to source their data, these crowd workers have started using ChatGPT, for instance, to generate the data. So how does this kind of flow-on effect impact the datasets that we then make? I think that's an important challenge as well.

 

Tamar Avishai: Okay. So if I'm understanding this correctly, one of the issues here is that essentially a generative AI model is pulling from AI generated data. Like that's what's training it.

 

Kate Crawford: Exactly right. You can think of this as a kind of, you know, inception phenomenon, where all of these generated images and texts are now becoming training datasets in their own right, and you're starting to see more and more abstraction away from, again, this idea of what ground truth is. And as we've discussed in other episodes, given the problems with representation, with bias, with sort of dehumanizing ways of depicting particular types of people, imagining how that then becomes this idea of how the world is represented just raises really major challenges for the field. Of course, there are a whole series of technical questions you can ask about how synthetic data will actually create the next generation of models, but so [00:28:00] many of the dataset creators that we spoke to are really concerned about this problem.

 

Tamar Avishai: I was going to say inception, that's where my brain just went. Okay, so how can dataset creation be improved?

 

Kate Crawford: Oh, so many ways.

 

Will Orr: Yeah, exactly. Yeah. I mean, there's plenty. I think one of the big ones is striving for data quality, right. So everyone that we spoke to was really stressing to us the importance of making sure that the data is of a sufficient quality, and to really know what the data is even saying at all. And as we were saying, you know, this notion of scale can sometimes challenge how we can understand the data itself. So particularly, they were saying to really look at the data and make sure that it represents what it intends to in the first place, and to make sure that the task itself is valid, like a valid representation within the dataset. I think they were also mentioning the importance of making mistakes along the way. So a lot of them were saying that it's impossible to know everything that will go wrong in the creation of a dataset, but the important thing is to make those mistakes, to realize them, to check those mistakes, and to then improve. So iteration is so crucial in the creation of datasets, in order to finally get a dataset that you're happy with and that is of a high quality. Um, and as Robin explained to me, high quality data is also important for building trust among users and their motivation to use a dataset.

 

Robin Jia: Like, besides just being large, I think we also did a pretty good job, for the most part, of ensuring that what the dataset says are the correct answers are actually correct. And this is, like, kind of a trivial thing, but I actually think it makes kind of a big difference. If you're someone who's working on a dataset and you just look at some examples, and you see there are a lot of issues with them, it's just kind of demotivating. You're like, oh, you know, how high of an accuracy is even possible if there are all these kinds of mistakes in the dataset itself? So I think just, like, I [00:30:00] don't know, looking at your data, like, trying to think from the perspective of, if someone else is looking at this data, would they actually be motivated to work on it? Or would they feel like, oh, it seems like there are a lot of issues here, I'm going to avoid using this dataset.

 

Kate Crawford: Another really important issue is collaboration. This is something that multiple dataset creators noted: having a team where people will actually pay attention to different aspects of a dataset has been very helpful to them. And clearly, interdisciplinarity here is important too. It's very rare that a dataset creator is working in a team where you have people with a computer science background as well as people with a social science background, or even a humanities background. So you can imagine the ways that some of the problems that we see with these datasets could be addressed much earlier on in the pipeline if you have a really good collaborative team from diverse backgrounds asking those questions at the very beginning.

 

Will Orr: Yeah. And I think, you know, the other one that we've been pointing to throughout this episode is to really strive to value this dataset creation labor as an appropriate and important part of the machine learning pipeline. For too long, this labor has kind of gone unseen and unvalued, but really it needs appropriate resources, it needs the time that is required to make these datasets of a high quality, and it needs the community as well to, yeah, share these understandings and these findings. So when I was speaking to Chandra, he really echoed this sentiment.

 

Chandra Bhagavatula: I feel like as a community, we don't value dataset work as much as we value modeling work. Um, even when you know that high quality data is the one that, like, really fuels a lot of these models, right? And higher quality data can lead to better results with smaller models even. Yeah, I think we need to somewhat try to change that attitude a little bit or [00:32:00] like as a community value dataset work a lot more.

 

Kate Crawford: And I'll share a bigger concern that I have about what's happening with datasets generally. Certainly for many years, since, you know, a group of colleagues and I wrote a paper called Datasheets for Datasets, we've been really worried about the fact that so many datasets don't even have any documentation or information about where that data came from, who's represented in it, what was it intended for in the first instance. And without that kind of baseline information, it becomes really difficult for so many of the people who want to use that dataset. It just means that that information disappears, with all sorts of problems that get produced down the line. So this issue around how we understand what is in a dataset is something that still hasn't really been addressed. There is no clear industry standard around this. And we have a second problem now too, which is that a lot of companies are not releasing information about which datasets they're using to train models. So we know nothing about what datasets were used to train GPT-4, for example; we knew a little bit about GPT-3 and even more about GPT-2. So in a weird way, we're actually going backwards: with each new model, we know less and less about what was actually going into that model, what were the training datasets. And that presents a real problem. It presents a problem for researchers who are trying to improve these models and think about that data more constructively, but it also presents a public problem, which is that you just don't know what information is going in there or why you're getting the results you get. That, I think, creates another type of distancing from accountability in terms of the impact that these models have in the world.
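
As one concrete form this documentation can take, here is a minimal sketch loosely inspired by the questions in Datasheets for Datasets. The fields are an illustrative subset, not the paper's full question list, and the example values are invented.

```python
# A minimal sketch of structured dataset documentation, loosely inspired by
# the questions in "Datasheets for Datasets"; fields and values are illustrative.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Datasheet:
    name: str
    motivation: str            # why was the dataset created, and by whom?
    composition: str           # what do the instances represent? who is represented?
    collection_process: str    # how was the data collected, and with what consent?
    intended_uses: list[str] = field(default_factory=list)
    known_limitations: list[str] = field(default_factory=list)

sheet = Datasheet(
    name="example-sentiment-v1",
    motivation="Benchmark for sentiment classification, built by a university lab.",
    composition="English product reviews; authors are not individually identified.",
    collection_process="Scraped from public review pages; opt-out requests honored.",
    intended_uses=["model evaluation"],
    known_limitations=["English only", "skews toward electronics reviews"],
)
print(json.dumps(asdict(sheet), indent=2))  # ship this alongside the data files
```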

 

Tamar Avishai: Well, Will and [00:34:00] Kate, thank you so much for the time and the research that has gone into this study, and for your time today.

 

Will Orr: Thanks so much for having us, Tamar.

 

Kate Crawford: It's been an absolute pleasure.

 

Tamar Avishai: Next time on Knowing Machines. A bird in the hand is worth two in the large language model.

 

Jer & Hamsini: How are these systems changing the way that we know birds? And then you also have on top of that the birds themselves and how they like f*** with the process of becoming data.

 

Tamar Avishai: We'll see you then.