Engelberg Center Live!

What Lives Inside Our Datasets? An Introduction to Knowing Machines

Episode Summary

AI is everywhere, and everyone - from you and me to corporations to the computers themselves - is scrambling to understand it, and our place in it. In this episode, team leads Kate Crawford and Mike Ananny have a frank and comprehensive conversation about artificial intelligence, from its technical roots to its epistemic and philosophical implications to how much water it takes to keep it cool. What is generative AI? What is a dataset? What are the industries currently most affected, and which industries need to watch their backs? Kate and Mike explain it all and, moreover, ask deeper questions about what it all means for us, and what this series hopes to answer.

Episode Notes

Music used:

The Blue Dot Sessions, “Drone Birch,” “Song at the End of Times,” “Trek VX”

Episode Transcription

Newsclip: Professor Stephen Hawking has said that artificial intelligence could be the biggest event in the history of civilization. Or the last.

 

Newsclip: Could artificial intelligence soon be doing my job?

 

Newsclip: I think every company is going to be an AI company.

 

Newsclip: Artificial intelligence, or AI, is used in many spaces, but for all its advantages, it also has a negative impact on the environment.

 

Newsclip: I wouldn't be surprised if in the next 6 to 12 months we have models that are actually capable of truthfulness.

 

Newsclip: As a person of color, I'm concerned about the inherent biases reflected in AI. A lady in Georgia was arrested by Louisiana State Police through facial recognition.

 

Tamar: From the Engelberg Center on Innovation Law and Policy at NYU School of Law and USC's Annenberg School for Communication and Journalism, this is Knowing Machines, a podcast and research project about how we train AI systems to interpret the world, supported by the Alfred P. Sloan Foundation. I'm Tamar Avishai. I'm an audio producer and a journalist who, probably like you, has been watching the rise of AI and the discourse around it with increasing alarm over the last several years, and particularly the last few months. It feels like it's everywhere, in every headline, but honestly, I don't even understand what I'm hearing enough to know if I'm worried about the right thing. I'm less concerned about being killed by a robot overlord than I am about climate change, or what I put out on social media, or artists being put out of work. But it's hard for me to wrap my mind around how all of this is affected by asking Siri to add dish soap to my shopping list. Fortunately, [00:02:00] we've got the experts at the Knowing Machines project to unpack and demystify the AI black box through a whole range of interdisciplinary stakeholders: humanists, social scientists, engineers, artists, and journalists who are thinking through the longer-term consequences of AI together. This is the podcast where you hear them explain their concerns. This is where we look under the hood and dig into the incentives and ideologies baked into our datasets. And this is where you then decide how you want to move forward. In this episode, I sat down with Professor Kate Crawford and Professor Mike Ananny, two co-leads of the Knowing Machines project, to discuss both the background and current state of AI, how we got here, and how we should think about thinking about it. Here's our conversation. So, Mike and Kate, what is this moment about?

 

Kate: Well, I don't know how you're feeling, Mike, but I would have to say this has been one of the most dramatic years in AI that I've certainly known in my career. It's been a transformative time where generative AI, and specifically transformer models, text-to-image models and large language models, have completely dominated the landscape. We've seen all of the major tech companies really shift towards including generative AI in their products, and we've seen millions of people, in fact now hundreds of millions of people, change the way they search for information, the way they write and the way they produce images. So [00:04:00] really quite an extraordinary economic, cultural and political shift just in the last 12 months. But I'm curious what you think, Mike.

 

Mike: No, I would totally agree. And I think, you know, both of us have seen these things that were for a long time in computer science laboratories or industrial laboratories, sort of hidden from public view, all of a sudden explode and achieve this kind of scale in such a short period of time that I don't think we've seen before. So we're seeing this rapid, widespread change: places that, you know, historically we didn't think it was going to be, all of a sudden it's there and it's shaping those places. I think we've seen some industries do some pretty significant soul searching about, you know, what kind of autonomy, what kind of freedom do they have from these tech companies? How quickly is this change happening, and how much agency or power do they have to control the change, whether that's getting their workforces ready or their customer bases ready, or just even thinking about what they are, sort of fundamentally and existentially. And then I think we've got this issue of regulatory uncertainty, because this pace of change is happening much faster than I think some jurisdictions are ready to deal with. They're just not exactly sure how to move as quickly as some of these industries are. And then one of the other things I think about is this question, and you and I have, you know, talked and written about this for years: what does resistance look like? What does refusal look like? Is that even possible when the ubiquity is just so powerful? Yeah, all the things are happening at once.

 

Kate: And it's kind of extraordinary to think about how much it's changed, and the fact that I think in many cases, the companies that really led the pack on generative AI weren't even ready, I think, for the type of success that they saw. I mean, of course, ChatGPT launches in November last year, and within two months it had broken every [00:06:00] single record for a consumer technology reaching 100 million people. Like, that's faster than Google, it's faster than Spotify, it's faster than TikTok. I mean, there is no other application that has grown that quickly. And I think in many ways it kind of left a lot of industry leaders flat-footed, even the ones who were designing these systems, when it came to saying: what are these actually for? What are these best for? And so what we've seen, I think, is this kind of extraordinary period of cultural adaptation, where people have been trying to figure out, you know, what is a large language model like ChatGPT good for? And so we've seen people adapting it to use it for news stories, with a whole series of disasters that followed very quickly on realizing it creates misinformation, false facts, etcetera. We've seen people using it as a search engine, which is really not what a chatbot is good for; it's not a fact search architecture. We've seen people using it to assist in education, to help write essays. We've seen people using it to try and replace entire services, and at the same time, this extraordinary kind of cross talk that's happening around, well, it's not good for this, but what is it good for? So this sort of powerful moment of uncertainty and almost testing how these systems can be incorporated into everyday life, while at the same time we're seeing companies really just build it in at every juncture. It's in Word, it's in PowerPoint, it's in search, almost sort of searching for its killer app. So it's really been an extraordinary kind of moment of improvisation.

 

Tamar: And is this experience also true for lawmakers as well as industry folks?

 

Kate: Something similar is happening in the regulatory landscape. Of course, the EU has been at the head of the pack with the AI Act, but the Act was well and truly framed before [00:08:00] the generative AI wave. So again, you're seeing improvisation from regulators trying to adapt this very large and unwieldy piece of legislation to a new emergent technology. And of course, you know, that in theory is coming out by the end of this year. So we're about to see some really significant changes on that side as well.

 

Mike: I would echo all that and add as well: so Lucy Suchman, who I know you and I both know, has this wonderful phrase. She talks about AI as what she calls, quote unquote, a floating signifier, this idea that it's this thing that's out there, or a Rorschach test might be another way to think about it. Everybody is scrambling either to see themselves in this moment and see themselves in generative AI, or to figure out that they don't want it. But what the "it" is, is still so sort of multi-headed and all over the place that I think we're still in this moment of trying to figure out what this thing is. And again, the other phrase that comes to mind is, you know, science and technology scholars who you and I have both worked with and talked about call this moment interpretive flexibility. Just a fancy word for saying we're trying to figure out what it is; there's a moment of flexibility in interpretation. And that's both a scary moment, but it's also potentially a very beautiful moment, where if you can rapidly figure out what kind of interventions make sense, what kind of framings make sense, what kind of, you know, powers and forces and people to muster, then there is potentially, you know, a chance to influence this moment in a way. And the last thing I'll say, in terms of what's going on, is this matter of being embedded in a university. I'm in meetings all the time where people are questioning what the workforce of tomorrow needs to be, and do we rapidly need to reshape pretty much every department and school at this university to speak to generative AI? And so there's an excitement there. But I think there's also a bit of pushback of saying, wait a second, [00:10:00] is this really the existential, fundamental moment? So I'm watching that conversation, that debate, play out on my campus as people figure out what this moment is in an educational setting.

 

Kate: And this is actually where I think historians of technology are so helpful in really contextualizing what's going on. And in fact, we could go all the way back to the 1960s and the very first construction of a chatbot by Joseph Weizenbaum at MIT. And, you know, he noticed very early on that even with a very simple set of scripts, which he was building into ELIZA, people would have these interactions and be completely convinced that they were engaging with a sentient, conscious entity. Now, this horrified Weizenbaum. And, you know, he of course then goes on to write Computer Power and Human Reason in 1976 as a tirade against the way that these technologies will actually cause people to become, in his words, kind of delusional about what these technologies can and can't do. But I think what's so interesting is to think about the ways in which we, as users of these technologies, fill in the gaps. We cognitively bridge those kind of, you know, inadequate or awkward or completely incorrect responses from these systems to imagine what's actually going on, to imagine into them a type of greater capacity than what they have.

 

Tamar: And that's a very human imaginative process, where we assume, right, that a system that communicates with us must be intelligent. And it has a long history there. Right?

 

Mike: Yeah. You know, one way of telling that history, I think, is to say, well, the whole field of human-computer interaction sort of followed that with this question of saying, how should we relate to technologies? And, you know, think of Terry Winograd and Lucy Suchman and those early, you know, disagreements [00:12:00] and fights over the politics of categories, and how should you design these systems, and who are the people that are using them, and how do you model people, and how do you model people in computers? And then think about, you know, Reeves and Nass and the model of computers as social actors, which is another kind of theory, CASA: these people who got so angry with their computers, they would, like, throw them out the windows, and they would take action against these physical objects because they saw them as standing in for these feelings, these frustrations. So it's not only a cognitive thing, I think; there's also an affective dimension here, where people have strong feelings about their computers because they have feelings about each other and their moments and these contexts. So I see this exact moment as absolutely wrapped up in and related to these long histories of how to make systems and how to think about them and feel in relation to them.

 

Tamar: So if these definitions and histories and even frustrations are so multifaceted and nebulous for the experts, you can imagine how confusing they must be for the lay among us. Can we take a step back and actually define our terms? What is generative AI?

 

Kate: So there's the narrow definition of what generative AI is. I mean, of course, generative AI is essentially a set of models that produce content, and they do that by using algorithms: traditionally, generative adversarial networks, or GANs, or variational autoencoders, known as VAEs. And these are models that are trained on vast datasets, and they can use that data to generate sort of novel data examples. And so that means synthesizing images, generating text, transferring a style. So where predictive AI was really trying to anticipate potential statistical reality based [00:14:00] on historical patterns, generative AI is much more about creating novel outputs from a vector space that it's generated through learned data. So that's the technical answer. But if we really look at the big picture, it's actually a gigantic planetary system. To make this work you need a huge amount of data, as mentioned, but you also need a huge amount of compute. That means, right now, a whole lot of A100 and H100 chips that come from Taiwan. You need a huge amount of power; the energy required to run these models is just staggering. In fact, predictions have varied between 5 and 50 times more energy intensive than, say, traditional search. And then you need a lot of water. And we've seen from recent studies that the fresh water needed to run something like GPT is actually an astounding amount. In fact, both Microsoft's and Google's water consumption has gone up between 30 and 40 percent in just the last year.
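
To make Kate's technical definition a little more concrete, here is a minimal sketch of "creating novel outputs from learned data." It assumes the open-source Hugging Face transformers library and the small public GPT-2 checkpoint, neither of which is named in the episode; the prompt is invented for illustration.

```python
# A hedged illustration, not any system discussed in the episode: a generative
# text model produces novel output by repeatedly sampling the next token from
# the statistical patterns it learned during training.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation reproducible
generator = pipeline("text-generation", model="gpt2")

prompt = "The data center beside the river"  # invented prompt for illustration
result = generator(prompt, max_new_tokens=30, do_sample=True)
print(result[0]["generated_text"])
```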

 

Tamar: And just to be clear, the water is to cool the systems.

 

Kate: Exactly right. So essentially, because you've got these huge ziggurats of computation and they use so much energy, they're actually really hot. So if you go inside a data center, you know, inside one of these big server rooms, they're producing an enormous amount of heat. So you need water as a coolant. And that's why you often find these big data centers located by big rivers, or in places where they can attempt to cool them in different ways.

 

Mike: So I think generative AI is more than a technological phenomenon. It's more than what a computer scientist might tell you. It's about these social and cultural and sort of political and economic phenomena it's all wrapped up in. When you ask, what is generative AI, I think it's all those things at once. But it also has a relationship to what is sometimes called, quote unquote, traditional AI or predictive AI. [00:16:00] And there's a bunch of different words to describe this shift, where that predictive kind of AI was based on decades of surveillance and capturing data of various kinds and recombining data. And there were these people that were cleaning data and standardizing it and categorizing it, but it was in the service of training predictive models. And these kinds of predictive models were things that could judge inputs against these patterns that were in data sets, so it could decide, you know, are you a good or bad insurance risk, or did you have this medical symptom or not, or more nefarious examples as well in terms of, you know, are you likely to recommit a crime or not, or what are patterns of news. And then they would suggest actions. These predictive models would say, well, this is what the rational or expected or, quote unquote, normal thing to do is in this situation, which is to give somebody a loan or not, or make that diagnosis or not, or write this headline or not.

 

Mike: But it would make these suggestions, and I think these predictive models were, you know, very rightly and absolutely rightly critiqued for questions of discrimination and bias and unaccountability and inscrutability: how do you even know what these systems are? And that alone is a really powerful and important conversation to have. But I think, you know, as Kate alluded to earlier, this moment, this rapid explosion of sort of generative AI, still has a lot of these problems and challenges of, quote unquote, predictive AI. But now we've rapidly added this new layer, which extends and sort of uses these earlier data sets and models to make media: to create text and image and sound and, you know, video games and entire social networks and avatars, all of these things that we traditionally thought of as the product of human craft and human labor. These are now new types of media that are just sort of statistically mimicking and modeling these [00:18:00] data sets.
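
To make the contrast Mike draws with the earlier "predictive" paradigm concrete, here is a tiny hedged sketch using scikit-learn (an assumption; the episode names no particular tooling). The features and labels are entirely made up; the point is only that such a model scores a new input against patterns in historical data and returns a judgment.

```python
# Toy "predictive AI" in the sense described above: fit a model to historical
# examples, then judge a new input against those learned patterns.
# All numbers below are invented for illustration only.
from sklearn.linear_model import LogisticRegression

# Each row: [age, number_of_prior_claims]; label 1 = flagged high risk in past records.
X_train = [[25, 0], [40, 1], [35, 3], [52, 4], [30, 0], [60, 5]]
y_train = [0, 0, 1, 1, 0, 1]

model = LogisticRegression().fit(X_train, y_train)

new_applicant = [[33, 2]]
print(model.predict(new_applicant))        # a yes/no judgment about the new input
print(model.predict_proba(new_applicant))  # the statistical pattern behind that judgment
```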

 

Tamar: So going back to our terms again, what is a training data set, and why is it so important to AI and to this project in particular?

 

Kate: So we could think about training data as sort of the alphabet from which you make meaning in an AI model. It really is the kind of, you know, the alpha to the omega of something like Stable Diffusion or a GPT-4. And it's had a really long and fascinating history. Here I'd point to the work of the historian Xiaochang Li at Stanford, who has studied the really early years of when training data began. And you can really trace this back to the late 1970s and early 1980s at IBM's Continuous Speech Recognition Lab. And this is where they made this really important shift from thinking about artificial intelligence as something that was built on expert analysis, what was called expert systems, where you try and teach computers, say, linguistic principles: you know, what is the structure of a sentence. But at the CSR lab, they made this really interesting decision to stop trying to do that and just to use large-scale statistical analysis. So we shifted from trying to teach computers to understand us, to trying to get a computer to just predict the next word in a sentence.

 

Tamar: So how did they do that? How was their approach in this lab different?

 

Kate: So to make that statistical approach work, that probabilistic AI work, you needed a huge amount of data. Now, back then, data was really hard to come by. In fact, IBM tried to use things like technical manuals and kids' books; it just didn't sound enough like, you know, conversational speech, and they didn't have enough words. So it really took years and years, and there are some fascinating stories about where they [00:20:00] ultimately found it. In fact, IBM was facing a federal antitrust lawsuit that went for over a decade, with a thousand witnesses called, and that created this big corpus where they could start to do this work of, you know, predicting words in a sentence. But it really wasn't until the internet came along that you could make really big training data sets, and you did it by scraping the internet. And so we then started to see this growth of data sets, going from the early Caltech data sets, which might have 10,000 images, to ImageNet, which moved up to 14 million images in 2009, 2010. And now, of course, we have LAION-5B, which is 5 billion images and text captions. So you can see this huge exponential curve towards just getting as much data as possible. And that is the data that is teaching an AI model how to interpret things in the world, be it an object or a cat or a dog or a person or a sentence.
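
As a concrete illustration of the statistical turn Kate describes, here is a self-contained sketch of next-word prediction done purely by counting, with a four-sentence corpus invented for the example (it is not IBM's data or method, just the general idea).

```python
# Count which word follows which in a tiny corpus, then "predict" the next word
# by picking the most frequent continuation. No linguistic rules are encoded.
from collections import Counter, defaultdict

corpus = [
    "the court called the next witness",
    "the court adjourned for the day",
    "the witness answered the next question",
    "the next witness answered the court",
]

follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        follows[current_word][next_word] += 1

def predict_next(word):
    """Return the continuation seen most often in the training data."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

print(predict_next("next"))  # -> 'witness' (seen twice, vs. 'question' once)
print(predict_next("the"))   # -> 'court' (tied with 'next'; the first one counted wins)
```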

 

Tamar: That's a huge shift. And I would imagine it raises a lot of questions. Right?

 

Kate: Exactly right. Some of these questions are personal: this is your data. Some are ethical: you know, should this be how we make AI? And some are legal: are there not some very real questions around copyright, consent, credit, or compensation for the way that all of this data has been extracted and then used to enrich a few companies?

 

Tamar: So how did you start getting interested in training data in the first place?

 

Kate: So I started asking these kinds of questions back in 2016, when I started collaborating with the artist and researcher Trevor Paglen, and we started a two-year study of training datasets, with a particular focus on the highly influential training dataset called ImageNet. And back then, we were doing this very manual kind of labor of sifting [00:22:00] through training data sets, looking at images, cataloging them, putting them into giant Excel sheets. It was the most painstaking manual process that you could imagine. And that works up to a point. But when you're starting to look at a training dataset that is 5 billion images and text captions, it gets a lot harder. So at this point, we have to start asking questions around what kind of tools do we need to actually look at this, to study it, to create platforms and search engines, if you will, to really start to catalog the principles and the aesthetics and the epistemology of these giant constructions. So in Knowing Machines, that really meant creating a data lab. And we've had the great privilege of working with Christo Buschek, the data investigator and Pulitzer Prize winner, to really start to create, essentially, data platforms as observatories to observe and understand and study that data.

 

Tamar: And, Mike, how do you work with training data and how do you understand it?

 

Mike: So another way that I think about the power of training data and what it is, is to think about it as almost a kind of public language. As these training data become this foundation or this basis for all the systems that we've been talking about today, what I sort of think about, what I worry about a little bit, is that these data sets are containing all of these images and these text conversations. To me, these are all things that have been expressed, and they've been expressed in different contexts. And language always exists in a context. It doesn't exist as sort of a floating piece of data that can be transported from one place to another without consequences. And I think about things that are unexpressed: the human condition and sort of how we understand ourselves and how we understand each other, often through things that don't [00:24:00] leave neat and easy and sort of tidy texts or images or videos or things that are easily computationally processed. I think about what kind of language we have to understand ourselves and each other if we're only relying on, you know, large scrapings of social media posts or, you know, books that have been written, or all these kinds of things that have been expressed in particular contexts. So I worry, or I think, about what's left unsaid. What are the silences, what are the absences, what are the things that never make it into a data set, or are very small and sort of hidden in a data set? They're minor. They're not sort of dominating a data set.

 

Mike: That's also part of human language, that's also part of public language and how we understand each other. But if we have this sort of statistical and computational view of what it means to know ourselves through language, I think we run the risk of losing the power of silence, the power of absence, and the power of things that are not said. And that's really sort of a big, important part, I think, of knowing ourselves. And the other point I think about is what's considered sort of normal, or what's considered a statistical pattern, or what's considered a thing that can be reproduced, you know, with a statistical certainty. That's another dimension of the human condition, which we know is really particular. And I don't think we necessarily want to be living in a world where our language is defined by what's normal, what's expected, what's usual, what's standard. Those are all words that sort of matter and belong in a statistical view of what language is. But if that becomes the lingua franca of knowing ourselves and knowing each other, then I think we've lost a chance to think about what the alternatives are. Think about what a queering of a data set would look like. Think about what a different or non-traditional or subversive reading of language [00:26:00] would be. That kind of challenge to normality and power and certainty and statistical stability is something that, at a very high level, philosophically, we run the risk of losing if we give ourselves over to a data set world.

 

Kate: I completely agree with that, Mike. And I think that, you know, in some ways the term large language model is really misdirection. You know, it's not really about language. They should really be thought of as large text models. They're these giant grab bags of text, but they're not really about language: silence, discourse, dialogue, connection and communication. That's a very different set of practices that really aren't well represented by the current definitions of generative AI. And it's interesting to see how I think that term has caused us to assume a deeper linguistic understanding of language and its many practices that simply isn't there. This is a very large type of text prediction engine, and that's a very different thing to a communicative practice and an exchange. And in some ways, I think this is, you know, really the conversations that you and I have had, Mike, for years that really inspired the name of this project. You know, this idea of knowing machines is not just about the fact that these are machines that purport to be knowing things, but also how do we come to know these machines? How do we come to be able to question, to research, and to see what they have, but also what they're missing? So this term, knowing machines, really has that kind of double meaning, because knowing is such a fraught concept right now. You know, we're being presented with text generators and LLMs that supposedly know stuff. But the minute you actually really study them and you look at how they're built, you realize this is about doing next token prediction. [00:28:00] It's not a knowledge model at all. And so this is one of the things that's really inspired our research: sort of digging into that layer, doing that excavation work to see and to understand and to know these machines better for ourselves.
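
Kate's point that an LLM is "a very large type of text prediction engine" rather than a knowledge model can be seen directly by inspecting a model's output: it returns a probability for every possible next token rather than looking up a fact. A minimal sketch follows, assuming the Hugging Face transformers library and the small public GPT-2 checkpoint (not the systems discussed in the episode).

```python
# Hedged illustration of next-token prediction: ask the model what tokens are
# statistically likely to come next, and read off the top few probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of Australia is"  # invented prompt for illustration
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only
probabilities = torch.softmax(logits, dim=-1)

top = torch.topk(probabilities, k=5)
for prob, token_id in zip(top.values, top.indices):
    # The model outputs a distribution over tokens, not a verified fact.
    print(f"{tokenizer.decode(int(token_id))!r:>12}  {prob.item():.3f}")
```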

 

Tamar: And part of understanding what's going on in AI right now means getting outside of this technical bubble entirely, right?

 

Mike: I would agree with that and say, I also think of this project and what we've been talking about as a way to identify spaces for people who are trying to stand outside of these systems, who are trying to challenge the systems and trying to say, no, human culture and the human condition and our social worlds are more complicated. They're more nuanced. They're more varied. They're, honestly, more interesting than what a large language model can give us. And when I think about the power of training data sets, honestly, one of my fears is being bored: living in a world where I'm told stories that are expected, where I'm given images that are very much standard images that I'm not surprised by, or reading a novel that doesn't make me think differently about the world, that just sort of reinforces patterns or expectations that I have. So I think a lot about how, in this moment, can we know these machines? Yes, absolutely. But know them for a reason, know them for a purpose. And that purpose or that reason is much more sort of a deeply felt desire to create change in the world, to think about the world differently, to think about people who are often standing outside of the dominant systems, or standing outside of the ways of knowing the world that are seen as normal or natural or standard. Those are the kinds of perspectives that I think we need to figure out, and we need to figure out how to support them. So for me, that's one of the big driving [00:30:00] motivations of why I do this work at all. It's to say: different worlds are possible, and we need to create those worlds. And it's more complicated than what the large language model technical manual is going to tell you.

 

Kate: And this is why I love that we're also collaborating with artists in this project, that we have artists like Vladan Joler and Annie Dorsen who are really looking at these systems from a critical perspective while building ways of allowing people to experience them. In Annie's case, you know, she does algorithmic theater and writes plays where people can actually see these systems at work on a stage. And with Vladan Joler, we've been building this very large visualization called Calculating Empires, really looking at this moment of generative AI but putting it in a 500-year context, putting it in a much deeper historical trajectory around what does it actually take, what types of capacities, what types of capital, what types of power do you need to construct these kinds of planetary computational systems?

 

Tamar: Yeah. And that's a much bigger question than just, like, how do we make AI fairer? How do we make it less biased?

 

Kate: It's not just about, you know, looking at the models and trying to refine them and slightly improve them. It's about asking, how does this change the world? What are the political, cultural and even ecological stakes of using these systems, of building them into all of the institutions, from health care to education to criminal justice, that we're seeing already? Because the stakes are really large, and it almost requires, I think, a group of people who come from very different positionalities, who are going to see these systems in different ways.

 

Tamar: Okay, so if we take these systematic [00:32:00] ideas, then what professions are seeing the changes?

 

Mike: Education. And as a professor, it means a lot to me to figure out what a good learning environment looks like. And I've seen students, you know, have anxiety about potentially being accused of cheating. I've had them think about what it means to be in a classroom that's led by a strong and thoughtful teacher. And I've thought a lot about, you know, what are the bounds that we have on surveillance of students, and how much are we really allowed to, and is it smart to, or is it ethical to, watch students in different environments? So there's this question swirling around: what is academic integrity? What does it mean to create a good classroom environment? And the other bit to think about is sort of the scale and economics of education, because universities are transforming as an industry. They're often really eager to cut costs, to make classrooms that are bigger with less labor. And AI looks really seductive to them in this moment, because it looks like you can offer large classes at very low cost. But the thing that I worry about is what happens to sort of the humanity of learning in those moments, what happens to what it means to develop a really good relationship with a teacher. So as we get into other parts of this podcast, education and learning and pedagogy and student culture are sort of really big questions in this moment of generative AI.

 

Kate: One of the industries that I've been focusing on a lot recently is the cultural industries. And of course, this has been the focus of the largest strike in history, which is the Hollywood writers' strike and actors' strike over the use of AI in their work. That's certainly been one of the core platforms that they have been striking over. And this question really cuts across all of the [00:34:00] cultural industries. We could think about music, we can think about books, we can think about, you know, voice actors. We can think about illustrators, we can think about movie editors. I mean, honestly, you can just keep naming fields and professions that are being affected by generative AI.

 

Tamar: Well, exactly. And this is one of those core things that really concerns me too. You know, as a producer, as a journalist, you know, given how many people just in these fields are facing job losses.

 

Kate: I'm also thinking about what types of labor are really most at risk here. And one of the things we can see in the culture industries is that there are people sort of in the top tier who have a well-established name and a particular style. And these are the voices that in many ways have been most prominent in the debates, but in some ways are actually the least vulnerable. It's the sort of middle tier, the work-for-hire creative workers who are there to, you know, create a storyboard for a film or to do graphic design for a particular website. These work-for-hire professions, I think, are at much greater risk. And we're starting to see, of course, labor shifts there already, where there are just fewer people being hired. And so there's a key question, I think, for these industries around, what does it mean to create something? What does it mean to say I have copyright in this text or in this image? But that isn't something that is really applying to these models. This is part of the reason, of course, that we have a legal team within Knowing Machines to look at these legal questions. And as we'll hear, it's actually a really complicated and interesting story as to why models can just harvest all this material. But it is raising some really core questions around what it is to be creative, what is our relationship to creative work, and what is our ability to control those creative works once they're out in the world.

 

Tamar: And [00:36:00] Mike, what professions are you focused on?

 

Mike: The big profession that I'm focused on right now is journalism, and I'm interested in how journalism, how newsrooms, how the press as an institution is changing in this moment and in response to this moment. And you'll forgive me, as a journalism professor, if I also sort of put this in historical context and think about the fact that journalism has always been a product of its language, its politics, the technologies of the era. The phenomenon is not new that journalism needs to respond to and use and leverage and resist, and all those good things, the technologies of the moment. I tell my students sometimes about this great example from New England in the 18th century. News organizations actually built fast boats to try to go out and meet the boats coming in from England, because the faster the boat that you could make, the sooner you could get the news that was on those boats coming in from England, and then you had a competitive advantage. Breaking news in the 18th century depended on having a really fast boat. So you had these bizarre experiments where they were making sails and hulls that were different. We also had this sort of crisis of professional confidence in journalism after World War One, when journalists felt they had been lied to by the government and they needed to figure out: what is the profession of journalism, what's our relationship to fact and truth? And they started journalism schools, you know, big famous journalism schools like we have today.

 

Mike: Or think of this perennial, you know, honestly kind of boring question: are bloggers journalists? There was a whole set of people who were just obsessed with that question of is a blogger a journalist, you know, in the early days of the web. But the point is that journalism has always done this. It's always tried to figure out its own identity, its own standards, its own ethics and professionalism in each era of technological change. So today, as news organizations are starting to use generative AI in their newsrooms, as journalists are, you know, sometimes publicly, but sometimes a little sheepishly, [00:38:00] using generative AI to do background research or to come up with good questions for an interview, one of the things I think we're seeing is the institution and the profession of journalism start to collide with these large language models, these datasets, these systems that are about reflecting the world back to people. And so that's the big sort of urgent issue that I see in journalism: what does it mean to have our public stories, our news, the language that we use to know each other, what does it mean when those grow out of, you know, ChatGPT and DALL-E and Midjourney and these systems that are rooted in these really fraught and complicated datasets that we've been talking about in this project?

 

Mike: For me, that's the big moment, because this is what some scholars call epistemic certainty, you know, just a fancy way of saying: how well do I know that I know the world? Can I trust some image I see? Can I trust a piece of text? That's core to our moment of governance right now. And it's not just a, you know, quote unquote, academic question. If we're truly going to respond to climate change, if we're going to respond to political turmoil, if we're going to respond to these planetary, shared, collective crises, we had better have a language of storytelling and of knowing each other that is distinct from, or has some autonomy from, these large language models of the technology companies. So journalism, for me, is one of the big places I'm seeing this play out.

 

Kate: One of the other lovely stories that I like about the relationship between journalism and speed is actually the extraordinary tale of Paul Julius Reuter back in 1850, when he was just thinking about setting up the Reuters news service, and he did so with a fleet of 45 carrier pigeons [00:40:00] that he used to basically shorten the distance between Brussels and Aachen, which was like the end of the telegraph line. And basically the pigeons would carry these messages in like two hours, which would beat the railroad by six hours. And so these kind of tiny, tiny differences were really the basis of an entire news organization. So that combination of thinking of really unusual ways to increase the sort of temporality, to speed up news cadences, has always been, you know, part of the DNA of the business.

 

Mike: Another place that we see journalism and technology sort of colliding is when you think about where telegraph lines went. Historically, there's a reason that we mostly got news and journalism from the east and west coasts of the United States or from large population centers in Europe. And there's a reason that we didn't have very many stories about Africa, about large parts of Asia, about places that were not technologically visible to the world. So it set up this feedback loop: we didn't get stories from that part of the world, people thought that part of the world didn't matter, therefore they didn't invest in that part of the world. And you can just see that through where telegraph lines were; there are great maps of the 1800s showing where telegraph lines were and were not. And when I think about large language models, I think about where is language, where is language not? We've always, always got this interplay between blank spaces, silences, absences, places that were ignored, people that were ignored, and where technological investments have been made. So that's the large sort of big picture that I think about: where is technology, where are stories, where are people, exactly?

 

Kate: And you can see it in the datasets. You can really see where these images are being drawn from. And they're being drawn from these dominant cultures, cultures that have had earlier access to technologies, who've had far larger populations, who've been sharing text and images [00:42:00] for decades. And that means there are entire worldviews, entire cultures that are absent from these datasets, which means that models can't show them. And one of the problematic responses to that has been technology companies saying, well, we should just surveil all the people all of the time and let's get all of your data. But another response is, let's start thinking about why those absences are there, and whether people now have a choice to be able to say, am I in this dataset or am I not in this dataset? And this really brings us to these kinds of questions around how people might be able to opt out of these kinds of structures altogether.

 

Tamar: Can they? I mean, will we get an answer from this series about whether or not we can opt out of this?

 

Kate: Yeah, absolutely. I mean, on the question of datasets right now, one of the things that I think is really interesting is the creation of a platform called Have I Been Trained?. And this was created, interestingly, by artists yet again, not technologists in the first instance. And they created this platform for you to be able to look at a major training dataset and to see if your artworks are in there, to see if it has been using your work to train models. And then if you find it, you can say, I would like to opt this image, this artwork, out of the training dataset. And already, Have I Been Trained? has been responsible for the removal of over 1.5 billion images from LAION-5B, which is the big training dataset that trained Stable Diffusion. So that is a lot of images that have now been opted out. That's people who've made a choice to say that they don't want to be in there; they don't want an AI model being able to imitate their style and apply their style to create an image. It'll be interesting to see if this is going to work more broadly. Of course, [00:44:00] in the US we don't have the same regulatory structures as they do in the EU, but it is, I think, putting a lot of pressure on companies to try and think about how they can really build consent in, how people can start to have a choice here. Because it really does seem to be a fundamental desire that people have: to be able to make a choice, to say, I want in or I want out. And I think that's really reasonable.

 

Tamar: Thank you, Kate and Mike for this conversation and these incredible insights. This is going to be a really fascinating series and I'm so excited to be along for the ride.

 

Kate: Thanks so much, Tamar. This has been wonderful, and I'm excited to talk more.

 

Mike: Thanks, Tamar. This was a fantastic conversation. It's going to be a great series.