Wikipedia already shows signs of huge AI input (Image: Serene Lee/SOPA Images/LightRocket via Getty Images)
The arrival of AI chatbots marks a historical dividing line after which online material can’t be completely trusted to be human-created, but how will people look back on this change? While some are urgently working to archive “uncontaminated” data from the pre-AI era, others say it is the AI outputs themselves that we need to record, so future historians can study how chatbots have evolved.
Rajiv Pant, an entrepreneur and former chief technology officer at both The New York Times and The Wall Street Journal, says he sees AI as a risk to information such as news stories that form part of the historical record. “I’ve been thinking about this ‘digital archaeology’ problem since ChatGPT launched, and it’s becoming more urgent every month,” says Pant. “Right now, there’s no reliable way to distinguish human-authored content from AI-generated material at scale. This isn’t just an academic problem; it’s affecting everything from journalism to legal discovery to scientific research.”
For John Graham-Cumming at cybersecurity firm Cloudflare, information produced before the end of 2022, when ChatGPT launched, is akin to low-background steel. This metal, smelted before the Trinity nuclear bomb test on 16 July 1945, is prized for use in delicate scientific and medical instruments because it doesn’t contain faint radioactive contamination from the atomic weapon era that creates noise in readings.
Graham-Cumming has created a website called lowbackgroundsteel.ai to archive sources of data that haven’t been contaminated by AI, such as a full download of Wikipedia from August 2022. Studies suggest that today’s Wikipedia already bears signs of huge AI input.
“There’s a point at which we did everything ourselves, and then at some point we started to get augmented significantly by these chat systems,” he says. “So the idea was to say – you can see it as contamination, or you can see it as a sort of a vault – you know, humans, we got to here. And then after this point, we got extra help.”
Mark Graham, who runs the Wayback Machine at the Internet Archive, a project that has been archiving the public internet since 1996, says he is sceptical about the efficacy of any new efforts to archive data, given that the Internet Archive already stores up to 160 terabytes of new information every day.
Rather than preserving the pre-AI internet, Graham wants to start creating archives of AI output for future researchers and historians. He has a plan to start asking 1000 topical questions a day of chatbots and storing their responses. And because it is such a massive task, he will even be using AI to do it: AI recording the changing output of AI, for the curiosity of future humans.
“You ask it a specific question and then you get an answer,” says Graham. “And then tomorrow you ask it the same question and you’re probably going to get a slightly different answer.”
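In code, Graham’s plan amounts to something like the sketch below, in which a fixed list of questions is put to a chatbot each day and every answer is stored with a timestamp so later drift can be compared. The chatbot call is a hypothetical placeholder rather than any real API, and the file layout is an assumption made purely for illustration.

```python
# Minimal sketch of a daily chatbot-archiving loop (illustrative assumptions only).
import json
from datetime import datetime, timezone
from pathlib import Path

ARCHIVE_DIR = Path("chatbot_archive")  # hypothetical output location


def query_chatbot(question: str) -> str:
    """Placeholder for a real chatbot API call -- an assumption, not an actual service."""
    return f"[model response to: {question}]"


def archive_daily_answers(questions: list[str]) -> Path:
    """Ask every question once and append timestamped answers to today's archive file."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    today = datetime.now(timezone.utc).date().isoformat()
    out_file = ARCHIVE_DIR / f"{today}.jsonl"
    with out_file.open("a", encoding="utf-8") as f:
        for question in questions:
            record = {
                "asked_at": datetime.now(timezone.utc).isoformat(),
                "question": question,
                "answer": query_chatbot(question),
            }
            f.write(json.dumps(record) + "\n")
    return out_file


if __name__ == "__main__":
    sample_questions = [
        "What caused the 2008 financial crisis?",
        "Who invented the telephone?",
    ]
    print("Wrote", archive_daily_answers(sample_questions))
```

Run once a day, a loop like this produces a dated series of answer files, which is what would let a future historian line up today’s response to a question against tomorrow’s slightly different one.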
Graham-Cumming is quick to point out that he isn’t anti-AI, and that preserving human-created information can actually benefit AI models. That is because low-quality AI output that gets fed back into training new models can have a detrimental effect, leading to what is known as “model collapse”. Avoiding this is a worthwhile endeavour, he says.
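The statistical core of model collapse can be seen in a toy experiment (an assumption-laden sketch, not how production models are actually trained): repeatedly fit a simple model to data sampled from the previous generation’s fit, and the diversity of the original data gradually drains away.

```python
# Toy illustration of model collapse: each generation "trains" (fits a Gaussian)
# on samples drawn from the previous generation's model instead of real data.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 20      # small samples make the effect visible quickly
generations = 200

mean, std = 0.0, 1.0  # generation 0: the original human-made data distribution
for gen in range(1, generations + 1):
    data = rng.normal(mean, std, n_samples)    # sample from the previous model's output
    mean, std = data.mean(), data.std(ddof=1)  # refit the model to that synthetic output
    if gen % 50 == 0:
        print(f"generation {gen:3d}: std = {std:.4f}")

# The fitted spread drifts toward zero over many generations: the tails of the
# original distribution are progressively lost, which is the narrowing of
# variety that "model collapse" describes.
```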
“At some point, one of these AIs is going to think of something we humans haven’t thought of. It’s going to prove a mathematical theorem, it’s going to do something significantly new. And I’m not sure I’d call that contamination,” says Graham-Cumming.