Social networking platform Bluesky recently unveiled a proposal on GitHub that aims to provide users with new options regarding the scraping of their posts and data. This initiative could impact the use of user-generated content for purposes such as generative AI training and public archiving. The discussion gained traction following remarks made by CEO Jay Graber during a presentation at the South by Southwest festival earlier this week. On Friday night, Graber further highlighted the proposal in a post on Bluesky, which caught the attention of users and sparked a wave of reactions.
Many users expressed concerns over Bluesky's new plans, perceiving them as a significant shift from the platform's previous commitment to user privacy. Users recalled Bluesky’s assurances that it would not sell user data to advertisers or utilize user posts for AI training. One user, Sketchette, voiced their discontent, stating, “Oh, hell no! The beauty of this platform was the NOT sharing of information, especially with generative AI. Don’t you cave now.” Graber responded by emphasizing that generative AI companies are already scraping public data from various online sources, including Bluesky itself, since “everything on Bluesky is public like a website is public.”
In light of these user concerns, Graber explained that Bluesky is seeking to establish a “new standard” for data scraping, akin to the robots.txt file that websites use to communicate their crawling preferences to web crawlers. The proposed standard would define a mechanism for expressing user permissions for data scraping, though it would not be legally enforceable.
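The robots.txt analogy is instructive because it shows both how the signal works and why it is only advisory. A minimal sketch using Python's standard-library `urllib.robotparser` (the file contents and crawler name below are illustrative, not from Bluesky's proposal):

```python
from urllib import robotparser

# A minimal robots.txt of the kind Graber's analogy refers to: it tells
# crawlers which paths they may fetch, but compliance is entirely voluntary.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler checks before fetching; a bad actor simply doesn't.
print(rp.can_fetch("SomeCrawler", "https://example.com/posts/1"))    # True
print(rp.can_fetch("SomeCrawler", "https://example.com/private/x"))  # False
```

Nothing in the protocol stops a scraper from ignoring `can_fetch` entirely, which is exactly the enforceability gap critics raise about Bluesky's proposal.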
Bluesky's proposed standard is designed to provide a "machine-readable format" that ethical actors are expected to respect. While the standard carries ethical weight, its enforceability remains a point of contention. The proposal outlines four categories under which users can choose to allow or disallow the use of their Bluesky data: generative AI, protocol bridging (which connects different social ecosystems), bulk datasets, and web archiving (such as the Internet Archive’s Wayback Machine).
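To make the four categories concrete, here is a hypothetical sketch of what a per-user, machine-readable preferences record and a consent check might look like. The field and function names are illustrative assumptions for this article, not the schema from the actual GitHub proposal:

```python
# Hypothetical per-user scraping preferences covering the four categories
# named in Bluesky's proposal. Keys and values are illustrative assumptions.
user_prefs = {
    "generative_ai": False,     # opt out of generative AI training
    "protocol_bridging": True,  # allow bridges to other social ecosystems
    "bulk_datasets": False,     # opt out of bulk dataset inclusion
    "web_archiving": True,      # allow archiving, e.g. the Wayback Machine
}

def may_use(prefs: dict, purpose: str) -> bool:
    """Return the user's stated preference for a given purpose.

    Like robots.txt, this only signals intent. Defaulting to False for an
    unknown purpose is a conservative design choice here, not enforcement.
    """
    return prefs.get(purpose, False)

print(may_use(user_prefs, "generative_ai"))  # a compliant scraper skips this user
print(may_use(user_prefs, "web_archiving"))
```

A scraper that honors the standard would consult such a record before ingesting a user's posts; one that does not would never call the check at all, which is the limitation White and others point to.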
According to the proposal, if a user opts out of their data being used for generative AI training, companies and research teams are expected to respect this choice during data scraping or when transferring data using the protocol itself. Molly White, a writer for the Citation Needed newsletter and the Web3 is Going Just Great blog, commented that the proposal is a “good initiative” and criticized the backlash against Bluesky. White argued that the proposal is less about welcoming AI scraping and more about providing a consent mechanism for users to communicate their preferences regarding the already occurring scraping activities.
However, White also pointed out a potential flaw in this approach, remarking that both Bluesky's proposal and a similar initiative by Creative Commons depend on the assumption that scrapers will adhere to these signals. She noted past instances where companies have ignored robots.txt files or engaged in piracy to scrape data.
The latest proposal by Bluesky underscores the ongoing debate surrounding user privacy, data scraping, and the ethical implications of AI training. As the conversation evolves, it remains to be seen how users will respond to these new options and whether they will be effective in safeguarding user data in an increasingly data-driven world.