In case you missed it, Google acquired a company called Appjet on Friday. Appjet’s main product is Etherpad, a multi-user collaborative notepad. The initial press release announced the immediate suspension of new activity on the site, with all content to disappear at the end of March. Since Friday, bowing to massive community pressure, Google/Appjet have announced the eventual open-sourcing of the Etherpad codebase. They promise to keep the service live until then. (Notice that we’ve gone from a concrete timeline to an extremely vague, well-intentioned promise.)
I know of Etherpad because we use it pretty heavily at work. It’s a multi-user version-controlled notepad. Revisions can be saved. Newer pads save every keystroke for posterity. They save which author wrote which characters when. There’s a chat pane where authors can discuss the work. Looking at currently available public pads, you can watch bored 4chaners slander each other, parents help their children craft stories, and people leave little notes in the hopes that someone will stumble on them later.
It appears that I’ve caught Jason Scott’s need to archive this kind of history before it slips into non-existence. Friday night, I started writing code to pull out as much data as possible. This is going to be both much easier and massively more difficult than I initially expected.
There are three major facets of this archival venture. First, grabbing the initial view. Everything in the initial page view (the current revision, the authors, the last 20ish lines of chat) is stored in a JavaScript structure in the page. Easy enough. Second, grabbing the revision history. This is available by branching out of the initial data into other pages. Easy enough. Third, grabbing all the pads. Now, this is where things get tricky. Etherpads are initially assigned a random id. This id is (currently) a 10 character string containing a-z, A-Z, 0-9. That’s base 62 across 10 characters. Some quick math yields 62^10, or roughly 8.4 × 10^17, possible ids. This is where I need help.
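For anyone who wants to check the quick math: 26 lowercase letters plus 26 uppercase letters plus 10 digits is 62 symbols per position. Here is a small Python sketch of the arithmetic. The scan rate in it is a made-up number, purely to give a sense of scale, not a measurement of what the site or the script can actually sustain.

```python
import string

# Alphabet as described above: a-z, A-Z, 0-9.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits
ID_LENGTH = 10  # current length of a randomly assigned pad id

total_ids = len(ALPHABET) ** ID_LENGTH

# Hypothetical scan rate, chosen only for illustration.
REQUESTS_PER_SECOND = 10
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

years_for_one_scanner = total_ids / (REQUESTS_PER_SECOND * SECONDS_PER_YEAR)

print(f"alphabet size: {len(ALPHABET)}")
print(f"possible ids:  {total_ids:,}")
print(f"years for one scanner at {REQUESTS_PER_SECOND} req/s: "
      f"{years_for_one_scanner:,.0f}")
```

Even at that assumed rate, a single scanner would need billions of years to walk the whole space, which is why splitting the work across many starting points matters.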
My first pass is up on Gitorious. Basic usage looks like:
scan_and_archive.pl --output some/directory/
The script drops a status file, which defaults to ‘ep_ids.txt’. This file contains every scanned id and whether it was a valid id. The script also grabs the javascript structure and drops it in the specified dir (defaults to ‘archive’) by id. If interrupted, the script will pick up at the last id it tried. There are a few options, none of which have made it into any useful help output. This sucker is rough. However, it’s a start at grabbing the data. Over the next few days, I’ll be adding code to take those js files and extract their revision history for archival as well as general cleanups.
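The resume behavior can be pictured roughly like this. This is a Python sketch of the idea, not the actual Perl code. It assumes a status file with one line per scanned id in the form `id<TAB>0-or-1`; the real layout of ‘ep_ids.txt’ may well differ, and the function names `last_scanned_id` and `record_result` are invented for illustration.

```python
from pathlib import Path
from typing import Optional


def last_scanned_id(status_path: Path) -> Optional[str]:
    """Return the id on the last non-empty line of the status file.

    Returns None if the file does not exist or is empty.
    Assumed line format (hypothetical): "<id>\t<0 or 1>".
    """
    if not status_path.exists():
        return None
    last = None
    with status_path.open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                last = line.split("\t", 1)[0]
    return last


def record_result(status_path: Path, pad_id: str, valid: bool) -> None:
    """Append one scanned id and whether it was a valid pad."""
    with status_path.open("a") as fh:
        fh.write(f"{pad_id}\t{int(valid)}\n")
```

On startup, a scanner built this way would call `last_scanned_id` and continue from that id if one is found, falling back to the default start otherwise.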
Here’s the thing. We’re looking at a massive id space. If you want to help, I need you to start at a spot other than the very beginning. Here’s the syntax. It’s a two-step process.
scan_and_archive.pl --output some/directory --start aaaa
# Almost immediately, hit Ctrl-C to interrupt
scan_and_archive.pl --output some/directory
The first command starts things off and sets up the history file. The second command keeps things going without starting over at ‘aaaa’ whenever it’s run again. As far as the script is concerned, an id is just a string of characters. Pick a random string and let it run.
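If picking a random string by hand feels too arbitrary, something like the following Python sketch will generate one and print the first command for you. Two assumptions here: that `--start` accepts any mix of a-z, A-Z and 0-9, and that a 4 character value like the ‘aaaa’ example is a sensible length. `random_start` is a name made up for this example.

```python
import secrets
import string

# Same alphabet the pad ids are drawn from: a-z, A-Z, 0-9.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def random_start(length: int = 4) -> str:
    """Return a random string of the given length over ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


# Print the first command of the two-step process with a random start.
print(f"scan_and_archive.pl --output some/directory --start {random_start()}")
```

With enough people starting from independently random spots, overlap between scanners should be rare given the size of the id space.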
If you decide to help, email me at sungo@sungo.us, find me as sungo on #archiveteam on irc.efnet.org, reply on twitter (@sungo), whatever you want. Patches to the codebase are very welcome.
Edit: While it has more perl module requirements, please ponder using scan_and_archive_full.pl. It retrieves all revisions for pads as well.