
Distributed Archiving
by Geekus McGeek on 2005-06-24 22:50
Information gets lost, that's just the way it is. In libraries, in companies, on the web, anywhere. There's a strong argument that not all information should be stored -- certainly before the web, when I worked in a library, I could see the impact of keeping too much. But in the last few years the reliance of the media, researchers, and journalists on the ever-in-flux web, on sources not committed to hard media, has, in so many words, begun to creep me out. A current Technology Review article, "The Fading Memory of the State", puts this feeling into perspective by describing the enormity of the National Archives' task: keeping up under a mountain of info in thousands of formats.

Lost info hit home recently, when I was going through the meager archives of headlines on this site and was pretty darn annoyed to find about 60% of them linked to sites or articles no longer available. I was particularly focusing on the April/May period of 2003 when Abu-Ghraib torture was discovered. I'd linked a CNN story, "Six U.S. soldiers reprimanded over alleged abuse", and now, only two years later, the link is a 404. Strangely, all the other CNN links I had from the same period, not torture related, were still live. Read into that what you like. I don't have any idea what happened. I contacted CNN and they, of course, didn't write me back about it.

And yet what do I do if I'd used that story as a source in reportage or a blog post, or if it was just an important story I wanted to revisit, whether to validate something or simply to follow the details of that thread from two years ago? Can I go to the library and find it? No, it's not print. CNN's certainly not going to give me a hand, and honestly, I don't really expect them to.

I can, fortunately, go to the waybackmachine, where I did find it. The waybackmachine is an incredible project, but, as with CNN, there are limitations, either self-imposed or technical. There are many dead links in my own headline and blog archive that I can't find there, which are now, for all intents and purposes, gone forever. Sorta like the Library of Alexandria burning every year. Not that all that info is going to be so valuable, but the degree of value can't be predicted, so the impulse is to save as much as possible.

As good as the waybackmachine is, as good as Google's temporary cache is, as hard as institutions try, it's a crap shoot whether their bot crawling will save the sweet spot. That's why internet archiving, or at the least news article archiving, should be done in a distributed manner, by a method where lots of small groups save what's meaningful to them and index it neatly, with the idea that in the long run all these pockets of meaningfully saved articles will overlap on the important stuff. People, bloggers in particular, could create indexes to the same spec, so that an aggregator can merge them or some other engine can weave them together. You could use the indexes of the same links stored in different places as a kind of versioning as well -- to see if that CNN story changed within a two week period.
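To make the merge-and-versioning idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: no index spec exists yet, so the entry shape (url, fetch date, hash of the saved text), the archive names, and the function names are all hypothetical. It just shows that once several archives publish entries in the same shape, grouping them by URL and comparing hashes is enough to spot a story that changed between saves.

```python
import hashlib
from collections import defaultdict


def content_hash(text):
    """Fingerprint of a saved article's text (assumed field, not a spec)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def merge_indexes(indexes):
    """Merge per-archive indexes into one mapping keyed by URL.

    `indexes` maps a hypothetical archive name to a list of
    (url, fetched_on, text_hash) tuples; fetched_on is an ISO date string.
    """
    merged = defaultdict(list)
    for archive, entries in indexes.items():
        for url, fetched_on, digest in entries:
            merged[url].append((fetched_on, archive, digest))
    for copies in merged.values():
        copies.sort()  # oldest saved copy first
    return dict(merged)


def changed_urls(merged):
    """URLs whose saved copies do not all have the same hash."""
    return sorted(
        url
        for url, copies in merged.items()
        if len({digest for _, _, digest in copies}) > 1
    )
```

An aggregator would call `merge_indexes` on whatever indexes it has collected, then `changed_urls` to list the stories worth a closer look. A differing hash only says the saved text differs; it could be an edit to the story or just a changed page template, so a real version check would need to compare the extracted article text.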

I put together a little perl script (this is by all definitions only a first pass) which rips through my site's RSS feed, gets the recent entries from it, then parses the outside links out of them into a link store. The outside links are retrieved via wget and given a unique filename-id based on the url, and an xml index is created linking the whole thing together. This runs every night, getting whatever is newest on the site, so that there is never drag on the server. I'm not storing images; the text of the articles is really what concerns me. Even if I were deep linking heavily for each entry, say a meg of html for each archive, storage space is so cheap, and web hosting so generous in that regard, that I can easily go a couple of years without ever having to check up on it. Consider this: I can save the entire text of everything I link to within a one year period on something as small as an SD card. If bloggers are concerned about previous web article sources vaporizing, or mysteriously changing under either governmental pressure (say, in China, for instance) or market pressure (the US of course), they can always go to their own personal archive and check the text, and if lots of bloggers are doing the same thing, to an aggregator to check another version if so linked. You can see basically how this works on previous posts, here and here, see the 'linkstore' and 'readstore' links at the top right.
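For anyone who wants to see the shape of that pipeline, here is a rough sketch of the same steps. It is in Python rather than the perl of the actual script, and it is not that script: the element names in the index, the SHA-1 based file id, and all function names are assumptions made up for illustration. It assumes an RSS 2.0 feed whose item descriptions carry the entry's HTML.

```python
import hashlib
import subprocess
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urlparse


class _HrefCollector(HTMLParser):
    """Collects the href of every <a> tag in a chunk of HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def outside_links(feed_xml, own_host):
    """Absolute links in the feed's item descriptions that leave own_host."""
    links = []
    for item in ET.fromstring(feed_xml).iter("item"):
        collector = _HrefCollector()
        collector.feed(item.findtext("description") or "")
        for href in collector.hrefs:
            host = urlparse(href).netloc  # empty for relative links
            if host and host != own_host and href not in links:
                links.append(href)
    return links


def file_id(url):
    """Stable filename derived from the url (hashing is an assumed scheme)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:16] + ".html"


def fetch(url, store_dir):
    """Save one page with wget; -q is quiet, -O names the output file."""
    subprocess.run(["wget", "-q", "-O", f"{store_dir}/{file_id(url)}", url])


def build_index(links, fetched_on):
    """XML index tying each url to its stored file (hypothetical layout)."""
    index = ET.Element("linkstore")
    for url in links:
        entry = ET.SubElement(index, "link", {"fetched": fetched_on})
        ET.SubElement(entry, "url").text = url
        ET.SubElement(entry, "file").text = file_id(url)
    return ET.tostring(index, encoding="unicode")
```

A nightly cron job would read the feed, call `outside_links`, run `fetch` on each result, and write out `build_index`. Deriving the filename from the url means a link that shows up again maps to the same file instead of piling up duplicates.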

06-28-05
I've decided to really work on this. There are a few things to start with:
- make an xml schema for the indexes (schema, dtd, namespace, sometimes the terms confuse me).
- break out the config into an installer
- make it self-standing except for wget. I can't think of a way around that.
- make it able to work out of the box with MT and wordpress, otherwise no one will use it.
- think about how the indexes would be best aggregated and cross-referenced.
- what happens if it's successful and there's a petabyte of data? does it collapse? entropy? loss?

If you're interested in working on this too, you can contact geekus at mcgeek.com.

All work will reside in blog-ish form here. If enough people are interested, I'll go wiki with this.

Comments

  •  

    mcgeek on 2005-07-04 wrote:

    Thanks, I agree completely. The librarian in me is very disturbed at the rate and ease with which written historical markers disappear.

    Hey, even non-coders can help -- I may call upon you to test.

  •  

    dandam on 2005-07-04 wrote:

    Super cool stuff and EXTREMELY necessary in my experience of the web. Too many articles disappear. How accurately we recall info is a reflection of how well we plan our future.

    Couldn't we say that Proust's whole point was that memory/history is the lens through which we experience our present? Without some common, at least partially valid, record, how can we ever say with any confidence how we have arrived at our present circumstances?

    I'd love to help, but I'm not sure how I'd be of use as a non-coder. Let me know if you have some use for me.
