Internet Archive Resources

This page is deprecated

Bulk Access to OCR for over 1 Million Books
Archive.org API docs
Internet Archive's S3-like server API
Open Publication Distribution System (OPDS): The OPDS Catalog specification is a syndication format for electronic publications based on Atom (RFC 4287) and HTTP (RFC 2616).
- OPDS Spec
- Internet Archive OPDS catalog - useful for getting updates
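
Because an OPDS catalog is an Atom feed, an ordinary XML parser is enough to read one. The sketch below parses a small hand-written feed and lists each entry's title with its links. The sample feed, its URLs, and the function name list_entries are assumptions made for illustration; this is not real Internet Archive catalog data.

```python
import xml.etree.ElementTree as ET

# Atom namespace defined by RFC 4287.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed, written by hand for this example.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example catalog</title>
  <entry>
    <title>First Book</title>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip"
          href="http://example.org/first.epub"/>
  </entry>
  <entry>
    <title>Second Book</title>
    <link rel="http://opds-spec.org/acquisition"
          type="application/epub+zip"
          href="http://example.org/second.epub"/>
  </entry>
</feed>"""


def list_entries(feed_xml):
    """Return a list of (title, [href, ...]) tuples, one per Atom entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        hrefs = [link.get("href") for link in entry.findall("atom:link", ATOM_NS)]
        entries.append((title, hrefs))
    return entries


for title, hrefs in list_entries(SAMPLE_FEED):
    print(title, hrefs)
```

Polling a catalog for updates would amount to fetching the feed periodically and comparing the entries against those already seen.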

*All content below this line was last updated in May 2009.*

Servers

Open Library is hosted at the Internet Archive's data center in San Francisco, along with thousands of other servers that power the Internet Archive's operations, a cluster known as the Petabox. Here are the servers we have (all hostnames have the suffix .us.archive.org):

new servers

12 new servers have been allotted to Open Library.

ia331503 - phr's dev machine
ia331504 - Edward's dev machine
ia331506 - Anand's dev machine, also running staging server
ia3315[07-11] - solr
ia3315[25-26] - production database
ia3315[32-33] - production web nodes
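
The bracket notation above is shorthand for a run of consecutive hostnames. A minimal sketch of how it could be expanded into full hostnames follows; the function name expand_hosts is hypothetical, and the only facts taken from this page are the notation itself and the .us.archive.org suffix.

```python
import re

SUFFIX = ".us.archive.org"


def expand_hosts(spec, suffix=SUFFIX):
    """Expand 'ia3315[07-11]' into full hostnames; pass plain names through."""
    match = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", spec)
    if match is None:
        # No bracketed range: treat the spec as a single hostname.
        return [spec + suffix]
    prefix, start, end = match.groups()
    width = len(start)  # keep leading zeros, e.g. 07, 08, 09
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(start), int(end) + 1)
    ]


for host in expand_hosts("ia3315[07-11]"):
    print(host)
```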

please write how to allocate these machines here

Raj thinks that there should be 2 frontend machines, 2 production db machines, 2 bibliographic search machines, 3 full text search, and 3 test boxes.

Configuration:

Processor: Intel Quad Core
RAM: 8GB?
Disk: 4 * 1.4TB

ia331525 - Production DB master

/0 - root partition, logs
- /var/log/lighttpd - lighttpd log
/1 - data
- /1/var/lib/postgresql - postgres data

old servers

wiki-beta: This is the machine that runs the web server and the application. It also contains the main copies of most of our key data in /1/pharos (which is automatically backed up to pharosdb).

pharosdb: This is the machine that runs our database. It is an IA "black box" (large 2U server, one of a few dozen(?) at the archive) with two single-core Opteron processors (upgradeable to 2x dual core), running 64-bit Linux, with 12GB of RAM and a big RAID array. Pharosdb is named after Pharos, which was the lighthouse at Alexandria built in the 3rd century B.C. Pharos is also the former internal name of the Open Library project, and might appear that way in a few old documents.

apollonius: This is an IA "red box" similar to wiki-beta, with a dual-core Athlon and 4GB of RAM. It's being used mostly for search engine development and testing.

zenodotus: Another red box, hosting one of the two solr instances used for fulltext search. Apollonius and Zenodotus are both named after former Librarians of Alexandria.

pharosdb-bu: Two dual-core Intel Xeon CPUs for a total of 4 cores running at 2.8 GHz, 10 GB RAM, two 750 GB PATA drives. The OS is our standard Ubuntu 7.10 image for homenodes. The plan is for this machine to eventually run the database. It is currently running our main search engine instance.

ia311530: AMD Athlon(tm) X2 Dual Core Processor BE-2350 running at 2.1GHz, 8GB RAM, 4 1TB SATA disks, and gigabit Intel cards. Installed is our standard Ubuntu 7.10 image.

ia311538: AMD Athlon(tm) X2 Dual Core Processor BE-2350 running at 2.1GHz, 8GB RAM, 4 1TB SATA disks, and gigabit Intel cards. Installed is our standard Ubuntu 7.10 image.

[suggestion: list specs for each box here]

Software

Required software on the new servers.

(write the software name and ubuntu package name)

python 2.6
postgres 8.3
lighttpd
git

Python Packages:

psycopg2 - python2.5-psycopg2
simplejson - easy_install simplejson
pymarc - easy_install pymarc
PIL - python-imaging

Petabox

The Petabox is an interesting system. It consists of thousands of standard 1U servers, each running a similar server image that runs daemons like rsync and a simple HTTP server and SSH. They all have NFS-mounted home directories from the machine home.us.archive.org and anyone with a cluster account can log into any of them. Each machine has exactly four large drives, mounted at /0, /1, /2, /3.

When you want to store something in the Archive, you create an "item". You do this by making a request to the Catalog, telling it what you want the item name to be, (optionally) how big the item is, and where the data can be downloaded from. The Catalog finds a drive that has that much free space, creates a directory on that machine with that name, and downloads your data into it.
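
The placement step can be pictured with a toy model. The sketch below picks the first drive with enough free space; the Catalog's real selection policy is not described here, so first-fit, the function name choose_drive, and the sample free-space figures are all assumptions.

```python
def choose_drive(free_bytes, item_size):
    """Return the first (host, mount) with at least item_size bytes free.

    free_bytes maps (host, mount) to free space in bytes.
    Returns None when no drive is large enough.
    """
    for location, free in free_bytes.items():
        if free >= item_size:
            return location
    return None


# Hypothetical free-space table: two drives on two machines.
free_bytes = {
    ("ia301506", "/2"): 500,
    ("ia301508", "/0"): 2000,
}

print(choose_drive(free_bytes, 400))   # fits on the first drive
print(choose_drive(free_bytes, 1000))  # only the second drive is big enough
print(choose_drive(free_bytes, 5000))  # nothing fits
```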

Normal humans can do this using http://www.archive.org/create/. OL programmers who want to do this sort of thing in bulk can use a secret URL; ask Aaron Swartz for details.

So imagine that you want to create an item called ietf_rfc_collection including the file http://example.org/rfcs.tgz. The Catalog will create a directory on a machine like ia301506 called something like /2/items/ietf_rfc_collection/ and download your file into it. You can then access it at http://ia301506.us.archive.org/2/items/ietf_rfc_collection/. Unless you request otherwise, the Catalog also makes a backup of your data to another machine (usually the machine with the next-highest number, so ia301507 in this example) for redundancy.
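
The naming pattern in this example can be written down as two small helpers. This is only a sketch of the convention described above: the function names are hypothetical, and "next-highest number" is the usual case rather than a guarantee.

```python
import re


def item_url(host, mount, item):
    """Build the direct URL of an item stored on a given machine and drive."""
    return f"http://{host}.us.archive.org{mount}/items/{item}/"


def backup_host(host):
    """Return the hostname with the next-highest number, e.g. ia301506 -> ia301507."""
    match = re.fullmatch(r"([a-z]+)(\d+)", host)
    if match is None:
        raise ValueError(f"unexpected hostname: {host!r}")
    prefix, number = match.groups()
    # Preserve the digit count so the name keeps its shape.
    return f"{prefix}{int(number) + 1:0{len(number)}d}"


primary = "ia301506"
print(item_url(primary, "/2", "ietf_rfc_collection"))
print(item_url(backup_host(primary), "/2", "ietf_rfc_collection"))
```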

Now if you visit http://www.archive.org/download/ietf_rfc_collection/, the webserver will do a UDP broadcast to the Petabox asking machines to reply if they have an item named ietf_rfc_collection. The user will get redirected to the URL http://ia301506.us.archive.org/2/items/ietf_rfc_collection/ (or ia301507 if that machine responds first).

Every item is supposed to contain two metadata files. The first, item_name_files.xml (e.g. ietf_rfc_collection_files.xml), is an XML document containing simple metadata about the files in the item, most importantly their format and their hashes. (The hashes can be used to see if data has been corrupted and restore from the backup if necessary.) The program /petabox/sw/bin/files-maker.php on any petabox machine will generate a rudimentary _files.xml. There is also a PHP script on archive.org that lets you do this. (A more comprehensive _files.xml might contain format, size, md5, and sha1 for every file.)

The second, _meta.xml, contains general metadata about the item. This is used to generate a nice looking page for the archive.org web site and search engine and, when written properly, allows you to visit a page like http://www.archive.org/details/ietf_rfc_collection/ and get a nice human-readable page. Again, there is a PHP script on archive.org that lets you do this.
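
To make the idea of per-file metadata concrete, here is a rough sketch that records format, size, md5, and sha1 for each file and then uses the stored sha1 to detect corruption. The element and attribute names, the use of the file extension as the format, and both function names are assumptions; this is not the real _files.xml schema and not what files-maker.php emits.

```python
import hashlib
import xml.etree.ElementTree as ET


def build_files_xml(files):
    """Build an illustrative files listing from a {filename: bytes} mapping."""
    root = ET.Element("files")
    for name in sorted(files):
        data = files[name]
        node = ET.SubElement(root, "file", name=name)
        # Assumption: guess the format from the file extension.
        extension = name.rsplit(".", 1)[-1] if "." in name else "unknown"
        ET.SubElement(node, "format").text = extension
        ET.SubElement(node, "size").text = str(len(data))
        ET.SubElement(node, "md5").text = hashlib.md5(data).hexdigest()
        ET.SubElement(node, "sha1").text = hashlib.sha1(data).hexdigest()
    return ET.tostring(root, encoding="unicode")


def is_corrupted(files_xml, name, data):
    """Compare the sha1 of data against the sha1 recorded for name."""
    root = ET.fromstring(files_xml)
    for node in root.findall("file"):
        if node.get("name") == name:
            return node.findtext("sha1") != hashlib.sha1(data).hexdigest()
    raise KeyError(name)


listing = build_files_xml({"rfcs.tgz": b"example archive contents"})
print(listing)
print(is_corrupted(listing, "rfcs.tgz", b"example archive contents"))
print(is_corrupted(listing, "rfcs.tgz", b"damaged archive contents"))
```

A failed check of this kind is the signal that the copy on the backup machine should be used to restore the file.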

The definitive version of the metadata files is stored in the items but the data is also included online in a MySQL database that can be accessed through a program called the metamgr. Much of it is also in the search engine.

The final cool thing about the petabox is their system called derive. Derive might be thought of as the opposite of make: make takes a request for something to build and works backwards through the dependencies to figure out how to build it. Derive goes through the catalog looking at the different file types (based on the _files.xml) and tries to figure out what it can derive from them. Thus, for example, if you upload a big QuickTime file, derive will try to make smaller QuickTime files and a Flash version so that lots of people can view your movie. Or, if you upload a bunch of raw scanned book images, derive will try to make collections of smaller JPEGs and OCR them to make a text file, and so on.
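
The contrast with make can be shown in a few lines: instead of starting from a target and walking back through its dependencies, the sketch below starts from the formats that are present and applies rules forward until nothing new can be produced. The rule table and format names are hypothetical, loosely modeled on the two examples above, and say nothing about how the real derive system is implemented.

```python
from collections import deque

# Hypothetical rules: source format -> formats that can be derived from it.
DERIVE_RULES = {
    "QuickTime": ["Small QuickTime", "Flash"],
    "Raw Book Images": ["Small JPEGs"],
    "Small JPEGs": ["OCR Text"],
}


def derive(present_formats):
    """Return every format reachable by applying the rules forward."""
    known = list(present_formats)
    queue = deque(present_formats)
    while queue:
        current = queue.popleft()
        for target in DERIVE_RULES.get(current, []):
            if target not in known:
                known.append(target)
                queue.append(target)  # derived files may enable further rules
    # Report only the new formats, not the ones that were uploaded.
    return [fmt for fmt in known if fmt not in present_formats]


print(derive(["QuickTime"]))
print(derive(["Raw Book Images"]))
```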

It doesn't seem likely that we'll be using derive much, but I mention it here because it's cool. As for the rest of it, any data files we want to preserve for posterity (e.g. any raw imports to the system or any generally-useful exports) should be put in the petabox.

History

Created April 11, 2008
34 revisions

December 21, 2022	Edited by Mek	deprecation notice
February 9, 2011	Edited by George	Edited without comment.
February 9, 2011	Edited by George	Edited without comment.
February 9, 2011	Edited by George	Edited without comment.
April 11, 2008	Created by an anonymous user	moving dev docs