Step 1: Get the data
The data we have and the data we want are all listed on the how you can help page. We've got a fair amount of it right now awaiting processing.
Here's where it is on our servers (note: this doesn't include other data listed on the how you can help page):
- Powells crawl: wiki-beta:/1/pharos/crawl/powells (aaronsw, phr)
- OCA full text: fill in (phr)
- Indian Million Books Project: in progress (anand)
- random Princeton University Press files: wiki-beta:/1/princeton (aaronsw; wget mirror)
- random Bibliothèque nationale de France files: wiki-beta:/1/ttk/bnf (bill)
- book covers: wiki-beta:/1/pharos/covers and wiki-beta:/2/pharos/bookcovers (aaronsw)
- bookmooch data: wiki-beta:/1/pharos/covers/bookmooch (aaronsw)
- librarything data: wiki-beta:/1/pharos/covers/librarything (aaronsw)
- librarything data: wiki-beta:/1/pharos/crawl/librarything (aaronsw; wget mirror)
- fred20 data: wiki-beta:/1/pharos/crawl/fred20 (aaronsw)
- Access Copyright data: wiki-beta:/1/pharos/onix/accesscopyright (aaronsw, rejon)
- ONIX data: wiki-beta:/1/pharos/onix/originals (aaronsw, alexis)
- recently-uploaded ONIX data: wiki-beta:/tmp/incoming (aaronsw, alexis)
- [[/dev/docs/data/PSU|PSU data]]: wiki-beta:/1/pharos/onix/psu (aaronsw)
- Stanford copyright data: wiki-beta:/1/pharos/onix/copyright (aaronsw)
- bulk.resource.org data: wiki-beta:/2/pharos/crawl/bulk.resource.org (aaronsw; wget mirror)
- random Powells data: wiki-beta:/2/pharos/crawl/ftp.powells.com (aaronsw; wget mirror)
- partial isbndb data: wiki-beta:/2/pharos/crawl/isbndb.com (aaronsw; wget mirror)
- onlinebooks project data: wiki-beta:/2/pharos/crawl/onlinebooks.library.upenn.edu (aaronsw; wget mirror)
- random book text: wiki-beta:/2/pharos/crawl/www.??????.us (aaronsw; wget mirror)
- ISBNs: wiki-beta:/2/pharos/isbns and, more recently, apollonius:/0/pharos/crawl/more_isbn_project (aaronsw; workspace)
- worldlibrary.net: wiki-beta:/2/pharos/pdfgrab (aaronsw; wget mirror)
- wikipedia dumps: wiki-beta:/2/wikipedia (aaronsw; wget mirror)
- wikipedia dumps: apollonius:/2/feisty/x-home/phr/wiki (phr)
- dbpedia dumps: zenodotus:/2/pharos/crawl/dbpedia/ (aaronsw; wget mirror)
- wikia dumps: zenodotus:/2/pharos/crawl/wikia/ (aaronsw; wget mirror)
- amzn data: apollonius:/3/aaronsw/isbn and wiki-beta:/2/pharos/isbns/crawl (aaronsw)
- MBL/WHOI catalog: wiki-beta:/1/pharos/marc (aaronsw)
- enron emails: apollonius:/3/aaronsw/crawl (aaronsw; wget)
- elsevier covers: wiki-beta:/1/pharos/onix/elsevier (aaronsw; wget)
- LC additional data: apollonius:/0/pharos/crawl/lc_catdir (aaronsw; wget)
- undemocracy: zenodotus:/2/pharos/crawl/dbpedia/undemocracy (aaronsw; wget mirror)
- bowker: apollonius:/3/aaronsw/crawl/bowker (aaronsw)
- nyt front pages: apollonius:/0/pharos/crawl/nytimes_front_page (aaronsw; wget mirror)
- drexel catalog: apollonius:/0/pharos/crawl/drexel (aaronsw; wget)
- laurentian catalog: apollonius:/3/aaronsw/crawl/laurentian (aaronsw; wget)
- orbiscascade identifier mappings: wiki-beta:/2/pharos/crawl/orbiscascade (aaronsw; wget; hasn't been used on amzn yet)
Here's stuff that's been archived in the petabox:
- unc catalog: apollonius:/0/pharos/crawl/unc (aaronsw)
- mit catalog: apollonius:/0/pharos/crawl/mit (aaronsw)
- umich records: apollonius:/0/pharos/crawl/umich_pdscan_records (aaronsw; wget)
- amazon similarity graph: apollonius:/0/pharos/crawl/amazon_similarity_graph (aaronsw; awscrawl.py)
- oregon catalogs: apollonius:/3/aaronsw/crawl/oregon/marc_oregon_summit_records (aaronsw; flash drive)
- uoft catalog: wiki-beta:/2/pharos/crawl/uoft (aaronsw; wget)
- muohio catalog: apollonius:/3/aaronsw/crawl/muohio (aaronsw; wget)
- wwu catalog: apollonius:/3/aaronsw/crawl/wwu/marc_western_washington_univ (aaronsw; cdrom)
- boston college catalog: wiki-beta:/2/pharos/crawl/bostoncollege/ (aaronsw; ftp)
Here's stuff that's in progress:
- lc content (incl. books): apollonius:/1/pharos/crawl/lc (aaronsw; wget mirror)
- stanford books: apollonius:/0/pharos/crawl/stanford (aaronsw; in progress)
Step 2: Process it
Aside from some special cases (e.g. lists of ISBNs, book covers, holdings data), we take each data source, write a processor for it, and output Python dictionaries.
Right now the code we have for this is in our repository in the catalog directory. There is code to process a number of record formats, including MARC and ONIX.
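To give a feel for what a processor does, here is a minimal sketch, not taken from the catalog code: it reads raw records (a hypothetical tab-separated format, chosen only for illustration) and yields one normalized Python dictionary per record. The field names are assumptions.

```python
# Hypothetical processor sketch: turn raw records into Python dicts.
# The tab-separated input format and field names are illustrative only,
# not the format of any actual data source listed above.

def process_records(lines):
    """Yield one dict per raw line of the form: title<TAB>author<TAB>isbn."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # skip malformed records rather than crash mid-run
        title, author, isbn = parts
        yield {
            "title": title.strip(),
            "authors": [author.strip()] if author.strip() else [],
            "isbn": isbn.strip() or None,
        }

raw = [
    "The Adventures of Tom Sawyer\tMark Twain\t0486400778",
    "bad record",  # malformed: silently dropped
]
records = list(process_records(raw))
```

Writing each processor as a generator like this keeps memory flat even on multi-gigabyte dumps, since records stream through one at a time.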
Step 3: Merge it
As records are added, an algorithm detects whether the book is already represented in the database. If it is, new fields from the incoming record (such as additional identifiers, new subjects, and tables of contents) may be added to the existing record. The success of duplicate detection depends on the quality and accuracy of the data in the records. We hope to make it easy to merge duplicates manually through the user interface, so that Open Library users can do what the algorithm cannot.
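The merge step can be sketched roughly as follows. Matching on ISBN alone is a deliberate simplification of real duplicate detection, and the dict-of-dicts "database" is a stand-in for the actual store; both are assumptions for illustration.

```python
# Hypothetical merge sketch: match incoming records against the
# "database" by ISBN and absorb any new fields they bring.
# Real duplicate detection is much fuzzier than an exact ISBN match.

def merge_record(db, incoming):
    """db maps isbn -> record dict; fold incoming into db in place."""
    existing = db.get(incoming.get("isbn"))
    if existing is None:
        db[incoming["isbn"]] = incoming  # genuinely new book
        return
    for field, value in incoming.items():
        if field not in existing:
            existing[field] = value  # e.g. a table of contents we lacked
        elif isinstance(value, list):
            # union list-valued fields (subjects, identifiers) without dupes
            existing[field] += [v for v in value if v not in existing[field]]

db = {"0486400778": {"isbn": "0486400778", "subjects": ["Fiction"]}}
merge_record(db, {"isbn": "0486400778",
                  "subjects": ["Boys"],
                  "toc": ["Chapter 1"]})
```

After the call, the stored record has both subjects and the newly arrived table of contents; scalar fields already present are left untouched, so an earlier (presumably vetted) value always wins.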
Step 4: FRBRize it
Then we need to go through and detect relationships between works (for example: all of these editions of Tom Sawyer are editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works).
This is not implemented yet.
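Since this step is unimplemented, here is one possible starting point, purely an assumption about how it might work: cluster edition records into works by a normalized (title, first author) key. A real implementation would need fuzzier matching than this.

```python
# Hypothetical FRBRization sketch: group editions into works by a
# crude normalized key. This is a starting-point idea, not the plan.
import re
from collections import defaultdict

def work_key(edition):
    """Normalize title + first author into a rough work key."""
    title = re.sub(r"[^a-z0-9 ]", "", edition["title"].lower())
    author = (edition.get("authors") or [""])[0].lower()
    return (" ".join(title.split()), author)

def group_into_works(editions):
    works = defaultdict(list)  # work key -> list of edition records
    for edition in editions:
        works[work_key(edition)].append(edition)
    return works

editions = [
    {"title": "The Adventures of Tom Sawyer", "authors": ["Mark Twain"]},
    {"title": "The Adventures of Tom Sawyer!", "authors": ["mark twain"]},
]
works = group_into_works(editions)
```

Both editions above normalize to the same key, so they land in one work cluster; each cluster would then become a new work object with edition relationships pointing into it.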
Step 5: Import it
Once we have all this, we need to start giving it identifiers and importing it into ThingDB.
Right now there is some simple code for this in the catalog directory, with some silly identifier schemes.
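As an illustration of what "giving it identifiers" means, here is a toy sketch: sequentially number records with Open Library-style keys before handing them to the store. The key format and numbering are illustrative assumptions, not the actual scheme, and the write to ThingDB is omitted.

```python
# Hypothetical import sketch: stamp each record with a sequential
# identifier before it would be written to the backing store.
# The "/b/OL<n>M" pattern mimics Open Library-style keys; the
# numbering here is purely illustrative.

def assign_identifiers(records):
    """Yield records with a newly assigned "key" field."""
    for n, record in enumerate(records, start=1):
        record["key"] = "/b/OL%dM" % n
        yield record

imported = list(assign_identifiers([{"title": "A"}, {"title": "B"}]))
```

A production scheme would of course need identifiers that are stable across re-imports, which simple enumeration is not; that is exactly the "silly identifier schemes" problem noted above.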
History
- Created April 11, 2008
- 4 revisions
- August 17, 2022: edited by Mek (no comment)
- August 17, 2022: edited by Mek (adding curators permission)
- July 21, 2009: edited by Karen Coyle (changed merge step to say we are doing it)
- April 11, 2008: created by an anonymous user (moving dev docs)