Reddit to Update Web Standard to Prevent Automated Data Scraping Amid AI Startup Concerns

Reddit updates its web standard to block automated data scraping from its website, aiming to prevent AI startups from bypassing the rule to gather content for their systems.

Nitish Verma


Social media platform Reddit has announced that it will update the web standard it uses to block automated data scraping from its website, a move aimed at preventing AI startups from bypassing the standard to gather content for their systems. The decision comes at a time when artificial intelligence firms have been accused of plagiarizing publishers' content to produce AI-generated summaries without giving credit or asking for permission.

The update centers on the Robots Exclusion Protocol, or "robots.txt," a widely accepted standard that tells automated crawlers which parts of a site they may access. Reddit will also continue rate-limiting, a technique that caps the number of requests a single client can make, and will block unknown bots and crawlers from scraping data on its website.
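To illustrate how the Robots Exclusion Protocol works, the snippet below shows hypothetical robots.txt directives; the crawler name and paths are placeholders, not Reddit's actual rules. A site lists which user agents may crawl which paths, and well-behaved crawlers are expected to honor the file.

    # Hypothetical robots.txt directives (placeholder names, not Reddit's actual file)
    User-agent: ExampleAICrawler
    Disallow: /

    # All other crawlers: public pages allowed, private paths off limits
    User-agent: *
    Disallow: /private/
    Allow: /

Rate-limiting, by contrast, is enforced on the server side and works even when a crawler ignores robots.txt. The following is a minimal sketch of the general idea, assuming a simple per-client token bucket in Python; the client identifiers and limits are illustrative, not Reddit's actual configuration.

    import time

    class TokenBucket:
        """Allow up to `rate` requests per second per client, with bursts up to `capacity`."""
        def __init__(self, rate: float, capacity: int):
            self.rate = rate
            self.capacity = capacity
            self.tokens = float(capacity)
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    # One bucket per client (e.g. keyed by IP address or API token);
    # requests from clients that exceed their budget are rejected.
    buckets: dict[str, TokenBucket] = {}

    def is_allowed(client_id: str) -> bool:
        bucket = buckets.setdefault(client_id, TokenBucket(rate=1.0, capacity=10))
        return bucket.allow()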

Why it Matters: The move is significant because it sets a precedent for other websites seeking to protect their content from AI firms that bypass web standards to gather data. The update will help curb the misuse of public content and push AI companies to abide by Reddit's terms and policies.

In recent weeks, there have been reports of AI companies circumventing the web standard to scrape publisher sites. A letter to publishers from the content licensing startup TollBit said that several AI firms were bypassing the standard to gather content. The letter followed a Wired investigation that found AI search startup Perplexity had likely bypassed efforts to block its web crawler via robots.txt.

Earlier in June, business media publisher Forbes accused Perplexity of plagiarizing its investigative stories for use in generative AI systems without giving credit. Reddit's update aims to prevent such content misuse and ensure that AI companies respect its terms and policies.

Key Takeaways:

  • Reddit will update the web standard it uses to block automated data scraping from its website.
  • The move aims to prevent AI startups from bypassing the standard to gather content for their systems.
  • Reddit will update its robots.txt (Robots Exclusion Protocol) rules and continue rate-limiting to control data scraping.
  • The update sets a precedent for other websites looking to protect their content from AI firms.
  • The move is a significant step toward preventing the misuse of public content and ensuring that AI companies respect website terms and policies.