Step 1: Get the data
The data we have and the data we want are all listed on the how you can help page. We've got a fair amount of it right now awaiting processing.
Here's where it is on our servers (note: this doesn't include other data listed on the how you can help page):
- Powells crawl: wiki-beta:/1/pharos/crawl/powells (aaronsw, phr)
- OCA full text: fill in (phr)
- Indian Million Books Project: in progress (anand)
- random Princeton University Press files: wiki-beta:/1/princeton (aaronsw; wget mirror)
- random Bibliothèque nationale de France files: wiki-beta:/1/ttk/bnf (bill)
- book covers: wiki-beta:/1/pharos/covers and wiki-beta:/2/pharos/bookcovers (aaronsw)
- bookmooch data: wiki-beta:/1/pharos/covers/bookmooch (aaronsw)
- librarything data: wiki-beta:/1/pharos/covers/librarything (aaronsw)
- librarything data: wiki-beta:/1/pharos/crawl/librarything (aaronsw; wget mirror)
- fred20 data: wiki-beta:/1/pharos/crawl/fred20 (aaronsw)
- Access Copyright data: wiki-beta:/1/pharos/onix/accesscopyright (aaronsw, rejon)
- ONIX data: wiki-beta:/1/pharos/onix/originals (aaronsw, alexis)
- recently-uploaded ONIX data: wiki-beta:/tmp/incoming (aaronsw, alexis)
- [[/dev/docs/data/PSU|PSU data]]: wiki-beta:/1/pharos/onix/psu (aaronsw)
- Stanford copyright data: wiki-beta:/1/pharos/onix/copyright (aaronsw)
- bulk.resource.org data: wiki-beta:/2/pharos/crawl/bulk.resource.org (aaronsw; wget mirror)
- random Powells data: wiki-beta:/2/pharos/crawl/ftp.powells.com (aaronsw; wget mirror)
- partial isbndb data: wiki-beta:/2/pharos/crawl/isbndb.com (aaronsw; wget mirror)
- onlinebooks project data: wiki-beta:/2/pharos/crawl/onlinebooks.library.upenn.edu (aaronsw; wget mirror)
- random book text: wiki-beta:/2/pharos/crawl/www.??????.us (aaronsw; wget mirror)
- ISBNs: wiki-beta:/2/pharos/isbns and, more recently, apollonius:/0/pharos/crawl/more_isbn_project (aaronsw; workspace)
- worldlibrary.net: wiki-beta:/2/pharos/pdfgrab (aaronsw; wget mirror)
- wikipedia dumps: wiki-beta:/2/wikipedia (aaronsw; wget mirror)
- wikipedia dumps: apollonius:/2/feisty/x-home/phr/wiki (phr)
- dbpedia dumps: zenodotus:/2/pharos/crawl/dbpedia/ (aaronsw; wget mirror)
- wikia dumps: zenodotus:/2/pharos/crawl/wikia/ (aaronsw; wget mirror)
- amzn data: apollonius:/3/aaronsw/isbn and wiki-beta:/2/pharos/isbns/crawl (aaronsw)
- MBL/WHOI catalog: wiki-beta:/1/pharos/marc (aaronsw)
- enron emails: apollonius:/3/aaronsw/crawl (aaronsw; wget)
- elsevier covers: wiki-beta:/1/pharos/onix/elsevier (aaronsw; wget)
- LC additional data: apollonius:/0/pharos/crawl/lc_catdir (aaronsw; wget)
- undemocracy: zenodotus:/2/pharos/crawl/dbpedia/undemocracy (aaronsw; wget mirror)
- bowker: apollonius:/3/aaronsw/crawl/bowker (aaronsw)
- nyt front pages: apollonius:/0/pharos/crawl/nytimes_front_page (aaronsw; wget mirror)
- drexel catalog: apollonius:/0/pharos/crawl/drexel (aaronsw; wget)
- laurentian catalog: apollonius:/3/aaronsw/crawl/laurentian (aaronsw; wget)
- orbiscascade identifier mappings: wiki-beta:/2/pharos/crawl/orbiscascade (aaronsw; wget; hasn't been used on amzn yet)
Here's stuff that's been archived in the petabox:
- unc catalog: apollonius:/0/pharos/crawl/unc (aaronsw)
- mit catalog: apollonius:/0/pharos/crawl/mit (aaronsw)
- umich records: apollonius:/0/pharos/crawl/umich_pdscan_records (aaronsw; wget)
- amazon similarity graph: apollonius:/0/pharos/crawl/amazon_similarity_graph (aaronsw; awscrawl.py)
- oregon catalogs: apollonius:/3/aaronsw/crawl/oregon/marc_oregon_summit_records (aaronsw; flash drive)
- uoft catalog: wiki-beta:/2/pharos/crawl/uoft (aaronsw; wget)
- muohio catalog: apollonius:/3/aaronsw/crawl/muohio (aaronsw; wget)
- wwu catalog: apollonius:/3/aaronsw/crawl/wwu/marc_western_washington_univ (aaronsw; cdrom)
- boston college catalog: wiki-beta:/2/pharos/crawl/bostoncollege/ (aaronsw; ftp)
Here's stuff that's in progress:
- lc content (incl. books): apollonius:/1/pharos/crawl/lc (aaronsw; wget mirror)
- stanford books: apollonius:/0/pharos/crawl/stanford (aaronsw; in progress)
Step 2: Process it
Aside from some special cases (e.g. lists of ISBNs, book covers, holdings data), we take each data source, write a processor for it, and output Python dictionaries.
Right now the code we have for this is in our repository in the catalog directory. There is code to process a number of record formats, including MARC and ONIX.
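To give a feel for what a processor does, here is a minimal sketch, not taken from the catalog code: it reads raw records (a hypothetical tab-separated format, chosen only for illustration) and yields one normalized Python dictionary per record. The field names are assumptions.

```python
# Hypothetical processor sketch: turn raw records into Python dicts.
# The tab-separated input format and field names are illustrative only,
# not the format of any actual data source listed above.

def process_records(lines):
    """Yield one dict per raw line of the form: title<TAB>author<TAB>isbn."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records rather than crash mid-run
        title, author, isbn = parts
        yield {
            "title": title.strip(),
            "authors": [author.strip()] if author.strip() else [],
            "isbn": isbn.strip() or None,
        }

raw = [
    "The Adventures of Tom Sawyer\tMark Twain\t0486400778",
    "bad record",  # malformed: silently dropped
]
records = list(process_records(raw))
```

Writing each processor as a generator like this keeps memory flat even on multi-gigabyte dumps, since records stream through one at a time.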
Step 3: Merge it
As records are added, an algorithm detects whether the book is already represented in the database. If it is, new fields from the incoming record (such as additional identifiers, new subjects, and tables of contents) may be added to the existing record. The success of duplicate detection depends on the quality and accuracy of the data in the records. We hope to make it easy to merge duplicates manually through the user interface, so that Open Library users can do what the algorithm cannot.
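The merge step can be sketched roughly as follows. Matching on ISBN alone is a deliberate simplification of real duplicate detection, and the dict-of-dicts "database" is a stand-in for the actual store; both are assumptions for illustration.

```python
# Hypothetical merge sketch: match incoming records against the
# "database" by ISBN and absorb any new fields they bring.
# Real duplicate detection is much fuzzier than an exact ISBN match.

def merge_record(db, incoming):
    """db maps isbn -> record dict; fold incoming into db in place."""
    existing = db.get(incoming.get("isbn"))
    if existing is None:
        db[incoming["isbn"]] = incoming  # genuinely new book
        return
    for field, value in incoming.items():
        if field not in existing:
            existing[field] = value  # e.g. a table of contents we lacked
        elif isinstance(value, list):
            # union list-valued fields (subjects, identifiers) without dupes
            existing[field] += [v for v in value if v not in existing[field]]

db = {"0486400778": {"isbn": "0486400778", "subjects": ["Fiction"]}}
merge_record(db, {"isbn": "0486400778",
                  "subjects": ["Boys"],
                  "toc": ["Chapter 1"]})
```

After the call, the stored record has both subjects and the newly arrived table of contents; scalar fields already present are left untouched, so an earlier (presumably vetted) value always wins.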
Step 4: FRBRize it
Then we need to go through and detect relationships between works (for example: all of these editions of Tom Sawyer are editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works).
This is not implemented yet.
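Since this step is unimplemented, here is one possible starting point, purely an assumption about how it might work: cluster edition records into works by a normalized (title, first author) key. A real implementation would need fuzzier matching than this.

```python
# Hypothetical FRBRization sketch: group editions into works by a
# crude normalized key. This is a starting-point idea, not the plan.
import re
from collections import defaultdict

def work_key(edition):
    """Normalize title + first author into a rough work key."""
    title = re.sub(r"[^a-z0-9 ]", "", edition["title"].lower())
    author = (edition.get("authors") or [""])[0].lower()
    return (" ".join(title.split()), author)

def group_into_works(editions):
    works = defaultdict(list)  # work key -> list of edition records
    for edition in editions:
        works[work_key(edition)].append(edition)
    return works

editions = [
    {"title": "The Adventures of Tom Sawyer", "authors": ["Mark Twain"]},
    {"title": "The Adventures of Tom Sawyer!", "authors": ["mark twain"]},
]
works = group_into_works(editions)
```

Both editions above normalize to the same key, so they land in one work cluster; each cluster would then become a new work object with edition relationships pointing into it.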
Step 5: Import it
Once we have all this, we need to start giving it identifiers and importing it into ThingDB.
Right now there is some simple code for this in the catalog directory, with some silly identifier schemes.
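As an illustration of what "giving it identifiers" means, here is a toy sketch: sequentially number records with Open Library-style keys before handing them to the store. The key format and numbering are illustrative assumptions, not the actual scheme, and the write to ThingDB is omitted.

```python
# Hypothetical import sketch: stamp each record with a sequential
# identifier before it would be written to the backing store.
# The "/b/OL<n>M" pattern mimics Open Library-style keys; the
# numbering here is purely illustrative.

def assign_identifiers(records):
    """Yield records with a newly assigned "key" field."""
    for n, record in enumerate(records, start=1):
        record["key"] = "/b/OL%dM" % n
        yield record

imported = list(assign_identifiers([{"title": "A"}, {"title": "B"}]))
```

A production scheme would of course need identifiers that are stable across re-imports, which simple enumeration is not; that is exactly the "silly identifier schemes" problem noted above.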
History
- Created April 11, 2008
- 4 revisions
- August 17, 2022: edited by Mek (no comment)
- August 17, 2022: edited by Mek (adding curators permission)
- July 21, 2009: edited by Karen Coyle (changed merge step to say we are doing it)
- April 11, 2008: created by an anonymous user (moving dev docs)