
headlines

project linkstore

pyLinkstore

Got a chance to redo linkstore in Python. And my verdict is: vastly better. I think code-wise it's probably longer than the Perl script, but it's organized better, and it would be easier to expand or edit. I didn't really add any functionality other than the ability to block parts of the HTML from having their links stored (i.e. the same bloggy long-term side links kind of thing). I had no hair-pulling moments, and only one brief frustration, again with XML SAX, but I found a module called ElementTree which is simple and did what I needed. Hear other people rag on XML processing modules and conventions in this thread here. This is not unique to Python. While I'm sure I didn't write this as cleanly as I will be able to with some more experience, as a first stab it's excitingly easy to get things done better and quicker. Download pyLinkstore here.
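
For flavor, here's roughly what reading a store index with ElementTree looks like. These days it ships with Python as xml.etree.ElementTree. The element and attribute names (store, link, id, url) are made up for this sketch and are not necessarily what pyLinkstore writes.

```python
import xml.etree.ElementTree as ET

# Hypothetical index layout; the real pyLinkstore XML may look different.
SAMPLE_INDEX = """<store>
  <link id="abc123" url="http://example.com/a">Story A</link>
  <link id="def456" url="http://example.com/b">Story B</link>
</store>"""


def read_links(xml_text):
    """Return a list of (id, url, title) tuples from an index document."""
    root = ET.fromstring(xml_text)
    return [
        (link.get("id"), link.get("url"), (link.text or "").strip())
        for link in root.iter("link")
    ]


if __name__ == "__main__":
    for link_id, url, title in read_links(SAMPLE_INDEX):
        print(link_id, url, title)
```

No handler classes, no callbacks, which is most of why it felt easier than SAX.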

Python, finally

...getting back to this and getting into python running as cgi, much as perl would. I know mod_python exists, and what it does, but for this simple project I'm just needing simple cgi architecture -- but also wanting the 'cleanness' of python. Since I'll need to output a bunch of stuff, and read from XML (which has become pretty essential to all web apps these days), looks like there's a decent amount of python tools out there. The ease of that, I spose, will be a sort of test.

Google freaking

Haven't worked on this project at all in a while since I'm making big changes to the next version of McGeekCode. But I've been watching the crawlers and I realize I've forgotten something pretty darn important. I see Google hitting the pages in the store, then crawling those links, relationally -- which means those pages' links come up 404. To do: add an 'index, nofollow' robots meta tag to the create_indexhtml.pl file for the store indexes. At least Google is hitting these finally; it took a long time for them to get there, whereas Yahoo was all over it right away. Another problem seems to be Reuters, which will only deliver page content via a browser, not wget -- it's not the user_agent, I've tried that, there's something else going on there.
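
The tag in question is the standard robots meta tag. A minimal sketch of how a store index page might carry it, written in Python rather than the Perl of create_indexhtml.pl; index_page is a hypothetical helper, not something from the actual script.

```python
from html import escape

# "index, nofollow": crawlers may index the page but should not follow its links.
ROBOTS_META = '<meta name="robots" content="index, nofollow">'


def index_page(title, body_html):
    """Wrap body_html in a minimal page that carries the robots meta tag."""
    return (
        "<html><head>"
        f"<title>{escape(title)}</title>"
        f"{ROBOTS_META}"
        "</head><body>"
        f"{body_html}"
        "</body></html>"
    )


if __name__ == "__main__":
    print(index_page("Store index", "<ul><li>archived page</li></ul>"))
```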

Linkstore 0.1 (perl variety)

Here's a tarball of the linkstore scripts. Call me nuts, but I've decided to redo it in Python. Why Python? Cuz I've been looking forward to actually doing something in it, and this seems pretty suited. So instead of me bitching about Perl here, I can bitch about Python. But think of it as a Python newbie crib sheet.

Visualizing Data

One of the things I love about googlemaps is that it allows you to visualize map point info in different ways, beautifully. I've started working with the gmaps API and a geocoder for a project at work, and it occurs to me that if there were a similar web archive map, to visually navigate stores of like info it would make big general searches easy and aesthetically pleasing. What would indexes of indexes look like if their data latitude and longitude could be plotted on a field of the set of info?

OS X, indexer, config options

Tried to start some Perl development on OS X and had hours of miserable failure trying to get XML::Parser and Expat going. Tried cpan, manual compile, forced, etc., no luck. My life is too short for this shit. Sticking w linux and freebsd environs. All other servers I've worked on have been good, don't understand the structural differences of OS X with Perl, but I don't blame Larry on this one. Anyhow, I have the indexer working, which crawls over the directories and creates html indexes from the xml and a master index as needed.
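
The indexer itself is Perl, but the crawl-and-build idea is small enough to sketch in Python. Everything named here is an assumption for illustration: the index.xml file name, the link element with a url attribute, and the index_store function. The master index is left out.

```python
import os
import xml.etree.ElementTree as ET
from html import escape

INDEX_XML = "index.xml"    # hypothetical per-directory XML file name
INDEX_HTML = "index.html"  # static page written next to it


def build_index_html(xml_path):
    """Turn one XML index into an HTML list of links."""
    root = ET.parse(xml_path).getroot()
    items = []
    for link in root.iter("link"):
        url = escape(link.get("url", ""), quote=True)
        title = escape(link.text or "")
        items.append(f'<li><a href="{url}">{title}</a></li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>\n"


def index_store(store_dir):
    """Write an HTML index beside every XML index under store_dir.

    Returns the sorted list of directories that were indexed.
    """
    indexed = []
    for dirpath, _dirnames, filenames in os.walk(store_dir):
        if INDEX_XML in filenames:
            html = build_index_html(os.path.join(dirpath, INDEX_XML))
            out_path = os.path.join(dirpath, INDEX_HTML)
            with open(out_path, "w", encoding="utf-8") as out:
                out.write(html)
            indexed.append(dirpath)
    return sorted(indexed)
```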

Was thinking, feature-wise, about a couple things:
- there should be a settable code (perhaps html comment format) which would allow people to skip blocks of links from archiving, if they want, say, the side navs to other sites which most bloggers use. Again, the person should have the ability to archive what's important to them.
- let users restrict the size of page archival. I.e., I had one blogger in my nav whose page was 275k. 275k. My god, why.
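
Both ideas above can be sketched in a few lines of Python. The marker comments (linkstore:skip / linkstore:endskip) and the 100k cap are invented for this example; nothing has been settled about what the real code or config option would be called.

```python
import re
from html.parser import HTMLParser

# Hypothetical marker names and size cap; neither exists in linkstore yet.
SKIP_BLOCK = re.compile(
    r"<!--\s*linkstore:skip\s*-->.*?<!--\s*linkstore:endskip\s*-->",
    re.DOTALL | re.IGNORECASE,
)
MAX_PAGE_BYTES = 100 * 1024


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def links_to_archive(page_html):
    """Return the links that sit outside any skip block."""
    collector = LinkCollector()
    collector.feed(SKIP_BLOCK.sub("", page_html))
    collector.close()
    return collector.links


def small_enough(page_bytes, limit=MAX_PAGE_BYTES):
    """True if a fetched page is within the archival size limit."""
    return len(page_bytes) <= limit
```

A blogger would wrap the side nav in the two comments, and anything over the limit would just not get stored.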

$library->{store}{link} = beer

Have the basics of the html indexer down, reading from xml using XML::Simple, which is what it says it is. Also, a master indexer. These would run when you want, crawling over the store, building the static human readable pages. Humans, feh. Was again hitting my head against the wall on some Perl stuff after spending all day w easy ol' PHP -- sometimes the hash/array/hashref/etc stuff in Perl drives me nuts. "@{$library->{store}{link}}[$i]" or "$$library{"archive_title"} =~ s/(t|n)//gm;" is not the kind of thing I want to be staring at after a couple beers. And really, if it's not a language you can program in easily after a couple beers, how good is it? (I'm not so sure how serious I am about this.) Perhaps I'm just looking for excuses to finally break out some Python.

Spiders

Looks like Yahoo is already spidering. Since I'm playing this by ear, I'm not really sure how the aggregation of unique ids for urls would float in the various search engines. But it looks like the entry on SC Souter had its linkstore read and the ids indexed. The upshot is that if you had the same url MD5'd the same way and you put that id into Yahoo, it would be on the same page as mine, allowing you to compare indexes and archive stores. I'm imagining a frontend via SOAP to a search engine, which would convert your human-readable url into MD5 for submittal, would be trivial.
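
The id part really is trivial. A minimal sketch in Python, assuming the id is just the hex MD5 of the url string; the whitespace stripping is my own guess at what "MD5'd the same way" would need, and url_id is a made-up name.

```python
import hashlib


def url_id(url):
    """Return the hex MD5 digest of a url, used here as its archive id."""
    # Assumption: surrounding whitespace is stripped so that two people
    # hashing the same url end up with the same id.
    return hashlib.md5(url.strip().encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(url_id("http://example.com/story.html"))
```

Anything beyond that (trailing slashes, case, query strings) would have to be agreed on, or two archives of the same page get different ids.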

Perl-a-licious

Had one of those 3am nights trying to get some Perl modules going. For everything I love about Perl there is something equally despisable about it. XML::LibXML is pretty widely used, and I wanted to try the HTML::Template::XPath stuff, thinking that it might work as a Perl-based archive reader. Not much luck. The template format was too restrictive, I had crazy trouble getting some dependencies resolved for xml::common (still not sure why), and overall it was unproductive. XML::Simple has so far proven to be the most flexible, although I'm not ruling out a from-scratch reader based on libXML and tools. My Perl fluency is not nearly as great as my PHP. But realistically Perl is the bottom-line choice for a web app that can be widely distributed, I think. Wordpress makes me think twice, but MT is Perl, so that's really the 500 pound monkey in the room (is that even the right cliche?).

Lots of archive equals overlap

I'm imagining that Joe's, John's and Susan's archives are for the most part very different. They link to different things for the most part. But there are some things which overlap. It may only overlap for two of them, but a few things will be shared by all three. Each circle generically represents an archive -- it could be on the same server, as different users, could be on the other side of the planet, doesn't matter. When you're sharing xml indexes of unique identifiers of urls, you'll be able to find multiples within a given time period if the xml indexes can be aggregated.
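
Once the indexes are aggregated, finding the overlap is plain set work. A rough Python sketch, assuming each archive boils down to a list of url ids; shared_ids and the sample data are invented, and the time-period filter is not handled here.

```python
def shared_ids(archives, min_archives=2):
    """Map each id held by at least min_archives archives to their names.

    archives: dict of archive name -> iterable of url ids.
    """
    holders = {}
    for name, ids in archives.items():
        for link_id in set(ids):
            holders.setdefault(link_id, set()).add(name)
    return {
        link_id: names
        for link_id, names in holders.items()
        if len(names) >= min_archives
    }


if __name__ == "__main__":
    # Made-up ids standing in for MD5 digests.
    archives = {
        "joe": ["a", "b", "c"],
        "john": ["b", "c", "d"],
        "susan": ["c", "e"],
    }
    print(shared_ids(archives))                  # ids in two or more archives
    print(shared_ids(archives, min_archives=3))  # ids in all three
```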
Distributed Archiving
trying to stop news history from vanishing, one url at a time
Geekus McGeek - 2005-06-24