Last edited by Karen Coyle, July 21, 2009

The State of Data Import

Step 1: Get the data

The data we have and the data we want are all listed on the how you can help page. We have a fair amount of it awaiting processing right now.

Here's where it is on our servers (note: this doesn't include other data listed on the how you can help page):

Here's stuff that's been archived in the petabox:

in progress

Step 2: Process it

Aside from some special cases (e.g. lists of ISBNs, book covers, holdings data), we take each data source, write a processor for it, and output Python dictionaries.

Right now the code we have for this is in our repository in the catalog directory. There is code to process a number of record formats, including MARC and ONIX.
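The shape of a processor can be sketched roughly like this. This is not the actual code from the catalog directory; the tagged-line input format and the field names (`title`, `authors`, `isbn`) are illustrative assumptions, but real processors such as the MARC one do the same job: parse one raw record and emit a plain Python dictionary.

```python
# A minimal, hypothetical record processor: parse one raw record
# (here, a simple "TAG: value" line format) into a Python dict.
# Field names and input format are assumptions for illustration.

def process_record(raw):
    """Turn one raw record into a dictionary ready for the merge step."""
    record = {}
    for line in raw.splitlines():
        tag, _, value = line.partition(": ")
        if tag == "TI":
            record["title"] = value
        elif tag == "AU":
            record.setdefault("authors", []).append(value)
        elif tag == "ISBN":
            record.setdefault("isbn", []).append(value)
    return record

raw = "TI: Adventures of Tom Sawyer\nAU: Mark Twain\nISBN: 0946595534"
print(process_record(raw))
```

The point of the dictionary stage is that every downstream step (merging, FRBRizing, importing) can work against one common representation instead of each raw format.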

Step 3: Merge it

As records are added, an algorithm detects whether the book is already represented in the database. If it is, some new fields from the incoming record may be added to the record in the database, such as additional identifiers, new subjects, and tables of contents. The success of duplicate detection depends on the quality and accuracy of the data in the records. We hope to make it easy to merge duplicates manually through the user interface so that Open Library users can do what the algorithm cannot.
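The field-merging half of this step can be sketched as follows. This is an illustrative assumption about the policy, not the catalog code itself: scalar fields already in the database win, and list-valued fields (identifiers, subjects) are unioned so the incoming record can only add information.

```python
# A hypothetical merge policy for two record dictionaries:
# keep existing scalar fields, union list-valued fields.

def merge_records(existing, incoming):
    """Fold new fields from an incoming record into the stored record."""
    # Copy lists so we never mutate the stored record in place.
    merged = {k: (list(v) if isinstance(v, list) else v)
              for k, v in existing.items()}
    for key, value in incoming.items():
        if key not in merged:
            merged[key] = value
        elif isinstance(merged[key], list):
            for item in value:
                if item not in merged[key]:
                    merged[key].append(item)
        # Conflicting scalars: keep the existing value.
    return merged

stored = {"title": "Tom Sawyer", "subjects": ["Fiction"]}
incoming = {"title": "The Adventures of Tom Sawyer",
            "subjects": ["Fiction", "Humorous stories"],
            "lccn": ["04030340"]}
print(merge_records(stored, incoming))
```

A real merge also has to decide *whether* two records are the same book at all, typically by comparing identifiers, titles, and authors; that matching logic is the fragile part the text describes.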

Step 4: FRBRize it

Then we need to go through and detect relationships between works (example: all of these editions of Tom Sawyer are editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works).

This is not implemented yet.

Step 5: Import it

Once we have all this, we need to start giving it identifiers and importing it into ThingDB.

Right now there is some simple code for this in the catalog directory with some silly identifier schemes.

History

July 21, 2009 Edited by Karen Coyle changed merge step to say we are doing it
April 11, 2008 Created by an anonymous user moving dev docs