Saturday, May 1, 2010

Details About Tweets Going to the Library of Congress…& why historians are so excited:

April 30, 2010
When History Is Compiled 140 Characters at a Time
By RANDALL STROSS

TWITTER users now broadcast about 55 million Tweets a day. In just four years, about 10 billion of these brief messages have accumulated.

Not a few are pure drivel. But, taken together, they are likely to be of considerable value to future historians. They contain more observations, recorded at the same times by more people, than ever preserved in any medium before.

“Twitter is tens of millions of active users. There is no archive with tens of millions of diaries,” said Daniel J. Cohen, an associate professor of history at George Mason University and co-author of a 2006 book, “Digital History.” What’s more, he said, “Twitter is of the moment; it’s where people are the most honest.”

Last month, Twitter announced that it would donate its archive of public messages to the Library of Congress, and supply it with continuous updates.

Several historians said the bequest had tremendous potential. “My initial reaction was, ‘When you look at it Tweet by Tweet, it looks like junk,’ ” said Amy Murrell Taylor, an associate professor of history at the State University of New York, Albany. “But it could be really valuable if looked through collectively.”

Ms. Taylor is working on a book about slave runaways during the Civil War; the project involves mountains of paper documents. “I don’t have a search engine to sift through it,” she said.

The Twitter archive, which was “born digital,” as archivists say, will be easily searchable by machine — unlike family letters and diaries gathering dust in attics.

As a written record, Tweets are very close to the originating thoughts. “Most of our sources are written after the fact, mediated by memory — sometimes false memory,” Ms. Taylor said. “And newspapers are mediated by editors. Tweets take you right into the moment in a way that no other sources do. That’s what is so exciting.”

Twitter messages preserve witness accounts of an extraordinary variety of events all over the planet. “In the past, some people were able on site to write about, or sketch, as a witness to an event like the hanging of John Brown,” said William G. Thomas III, a professor of history at the University of Nebraska-Lincoln. “But that’s a very rare, exceptional historical record.”

Ten billion Twitter messages take up little storage space: about five terabytes of data. (A two-terabyte hard drive can be found for less than $150.) And Twitter says the archive will be a bit smaller when it is sent to the library. Before transferring it, the company will remove the messages of users who opted to designate their account “protected,” so that only people who obtain their explicit permission can follow them.
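As a rough check on those storage figures, the short Python sketch below works out the implied size per message and the drive cost. It assumes decimal terabytes (10^12 bytes) and treats the $150 drive price as an upper bound; the variable names are illustrative and not taken from Twitter or the library.

```python
import math

# Figures quoted in the article; decimal terabytes are an assumption.
ARCHIVE_BYTES = 5 * 10**12        # about five terabytes
MESSAGE_COUNT = 10 * 10**9        # about ten billion messages
DRIVE_TERABYTES = 2               # capacity of one consumer hard drive
DRIVE_PRICE_CEILING = 150         # "less than $150" per drive

# Average storage per message, including its supplemental information.
bytes_per_message = ARCHIVE_BYTES / MESSAGE_COUNT

# Whole drives needed to hold the archive, and the most they could cost.
drives_needed = math.ceil(5 / DRIVE_TERABYTES)
cost_ceiling = drives_needed * DRIVE_PRICE_CEILING

print(f"{bytes_per_message:.0f} bytes per message")
print(f"{drives_needed} drives, under ${cost_ceiling}")
```

On these assumptions each message averages about 500 bytes, and the whole archive would fit on three such drives.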

A Twitter user can also elect to use a pseudonym and not share any personally identifying information. Twitter does not add identity tags that match its users to real people.

Each message is accompanied by some tidbits of supplemental information, like the number of followers that the author had at the time and how many users the author was following. While Mr. Cohen said it would be useful for a historian to know who the followers and the followed are, this information is not included in the Tweet itself.

But there’s nothing private about who follows whom among users of Twitter’s unprotected, public accounts. This information is displayed both at Twitter’s own site and in applications developed by third parties whom Twitter welcomes to tap its database.

Alexander Macgillivray, Twitter’s general counsel, said, “From the beginning, Twitter has been a public and open service.” Twitter’s privacy policy states: “Our services are primarily designed to help you share information with the world. Most of the information you provide to us is information you are asking us to make public.”

Mr. Macgillivray added, “That’s why, when we were revising our privacy policy, we toyed with the idea of calling it our ‘public policy.’ ” He said the company would have done so but California law required that it have a “privacy policy” labeled as such.

Even though public Tweets were always intended for everyone’s eyes, the Library of Congress is skittish about stepping anywhere in the vicinity of a controversy. Martha Anderson, director of the National Digital Information Infrastructure and Preservation Program at the library, said, “There’s concern about privacy issues in the near term and we’re sensitive to these concerns.”

The library will embargo messages for six months after their original transmission. If that is not enough to put privacy issues to rest, she said, “We may have to filter certain things or wait longer to make them available.” The library plans to dole out its access to its Twitter archive only to those whom Ms. Anderson called “qualified researchers.”

BUT the library’s restrictions on access will not matter. Mr. Macgillivray at Twitter said his company would be turning over copies of its public archive to Google, Yahoo and Microsoft, too. These companies already receive instantaneously the stream of current Twitter messages. When the archive of older Tweets is added to their data storehouses, they will have a complete, constantly updated, set, and users won’t encounter a six-month embargo.

Google already offers its users Replay, the option of restricting a keyword search only to Tweets and to particular periods. It’s quickly reached from a search results page. (Click on “Show options,” then “Updates,” then a particular place on the timeline.)

A tool like Google Replay is helpful in focusing on one topic. But it displays only 10 Tweets at a time. To browse 10 billion — let’s see, figuring six seconds for a quick scan of each screen — would require about 190 sleepless years.
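The 190-year estimate can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above; the function name and the choice of a 365.25-day year are assumptions made for illustration.

```python
# Figures from the article: 10 billion Tweets, 10 per screen, 6 seconds per screen.
TOTAL_TWEETS = 10_000_000_000
TWEETS_PER_SCREEN = 10
SECONDS_PER_SCREEN = 6
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length


def years_to_browse(total=TOTAL_TWEETS,
                    per_screen=TWEETS_PER_SCREEN,
                    seconds=SECONDS_PER_SCREEN):
    """Return the years of nonstop reading needed to scan every screen."""
    screens = total / per_screen
    return screens * seconds / SECONDS_PER_YEAR


print(round(years_to_browse()))  # prints 190
```

A billion screens at six seconds each comes to six billion seconds, which is where the figure of roughly 190 years comes from.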

Mr. Cohen encourages historians to find new tools and methods for mining the “staggeringly large historical record” of Tweets. This will require a different approach, he said, one that lets go of straightforward “anecdotal history.”

In the end, perhaps quality will emerge from sheer quantity.

Randall Stross is an author based in Silicon Valley and a professor of business at San Jose State University. E-mail: stross@nytimes.com.
