Yellowpages Data Scraping: Content Scraping: What’s Private & What’s Public?

The WSJ covers data and content scraping this morning, specifically the scraping of personal and consumer information. Search engines do a variation of the same thing, although site owners can request that pages not be crawled, which gives them a protocol for keeping content out of the index.
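The crawl-exclusion request mentioned above is presumably the robots exclusion protocol (a robots.txt file at the site root). As a rough sketch of how a well-behaved crawler might honor it, the snippet below uses Python's standard `urllib.robotparser`. The robots.txt rules, the `ExampleBot` user agent, the example.com URLs and the helper names `build_parser` and `may_crawl` are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would fetch this file
# from the site root rather than hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set (helper name is illustrative)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A profile page under the disallowed path is off limits to compliant crawlers.
print(may_crawl(parser, "ExampleBot", "https://example.com/members/profile/42"))  # False

# Pages outside the disallowed path may be crawled.
print(may_crawl(parser, "ExampleBot", "https://example.com/about"))  # True
```

Note that the protocol is voluntary: it only restrains crawlers that choose to check it, which is part of why scraping of personal information falls outside the norms that search engines follow.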

Content scraping (especially of personal information) is by and large involuntary and doesn’t operate within the same accepted norms. There are legal, quasi-legal and illegal use cases for the data obtained. According to the WSJ:

The market for personal data about Internet users is booming, and in the vanguard is the practice of “scraping.” Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

I see two angles: privacy and copyright. “Public information” and “facts” cannot be protected by copyright, but these definitions are not as straightforward as they seem. The unprotected status of facts is what allows local business databases to exist: a name, address and phone number are “facts,” although the presentation of that information can be protected.

The scraping of “data” from myriad sites could potentially be defended as the capture of factual information voluntarily exposed on publisher sites (implicating terms of service). But what is a “fact”? And what are the understandings and expectations of people using these sites? Regardless of the terms or what people are legally entitled to expect, people on Facebook don’t actually expect that their personal information is going to wind up in some third-party database and be resold to marketers or employers.

What’s entitled to privacy and what’s public and fair game? One could argue that most online content has a public or quasi-public character by default unless it’s explicitly made private — “secret” groups on Facebook could reasonably be considered private. But what about “closed” groups?

There are lots of thorny issues here, and technology is way ahead of the law. My guess is that if consumers knew about any of the data mining going on, they’d vote to shut it all down.

At some point the lines between what’s private and what’s public online will need to be more clearly defined as a legal matter. There should also be a rule requiring publishers to obtain clear consent before making user information public. However, such a rule will probably need to be imposed on them by a legislature. And while we’re at it, I would also impose civil and even criminal penalties on third parties that collect and use personal data without permission.

Source: http://screenwerk.com/2010/10/12/content-scraping-whats-private-whats-public/


Monday, 27 May 2013
