Last edited by Edward Betts
March 31, 2009 | History

Schema

This is out of date

The authoritative Open Library schema -- a specification of the database fields used to represent items like books and authors -- is a Python expression in the source repository, here.

A more readable version may be generated by executing that file; here it is as of 2007-08-30. (Asterisks indicate multi-valued fields. The types "string", "text", "url" and "date" are all currently represented in ThingDB as strings, but could be displayed or edited in different ways.)

edition

Field Type MARC Fields Example (Description)
source_record_loc string* "marc_records_scriblio_net/part01.dat:29834:543" (a locator for the source record data)
source_record_id string* "LC:DLC:00000006" (a record identifier that is globally unique and that also can be constructed consistently from the contents of a record and an identifier for its source catalog)
author_identifier string* 100:abcd, 110:ab, 710:ab, 111:acdn, 711:acdn "Twain, Mark, 1835-1910" (unique author id in some catalog)
contributions string* 700:abcde "Illustrated by: Steve Bjorkman"
title string 245:a clean_name "The adventures of Tom Sawyer"
subtitle string 245:b clean_name "a play in three acts"
by_statement string* 245:c "Herman Melville ; [illustrated by Barry Moser]"
sort_title string "adventures of Tom Sawyer"
other_titles string* 246:a, 730:a-z, 740:apn "Mark Twain's The Adventures of Tom Sawyer"
work_title string 240:amnpr, 130:a-z (The 240 "work title" is used in the OCLC FRBR algorithm. The 130 is also used, and there should be either a 130 or a 240 in a record, but not both. It would be ideal if we could pick up either for the work title.)
edition string 250:ab "2nd. edition" (information about this edition)
publisher string 260:b clean_name "W. W. Norton & Co."
publish_place string* 260:a clean "New York"
publish_date date 008:7-10 "2006"
pagination string 300:a "viii, 383 p. :" (full pagination information)
number_of_pages int 300:a biggest_decimal 383 (largest decimal found)
subjects string* 600:abcd--x--v--y--z, 610:ab--x--v--y--z, 650:a--x--v--y--z, 651:a--x--v--y--z "Runaway children -- Fiction"
subject_place string* 651:a*, 650:z* "Venice (Italy)"
subject_time string* 600:y*, 650:y* "20th century"
genre string* 600:v*, 650:v*, 651:v* "Biography"
series string* 440:av, 490:av, 830:av "Oxford world's classics"
language string 008:35-37 "ISO" tag "ISO: tel" (coded or human-readable description of the text's language)
physical_format string* 245:h
notes string* 5XX!505!520:a-z
description text 520:a
exerpts text*
table_of_contents text* 505:art
cover_image url
scan_contributor string
scan_sponsor string
dewey_number string* 082:a "914.3"
LC_classification string 050:ab "BJ1533.C4 L49"
ISBN string* 020:a normalize_isbn, 024:a normalize_isbn "9780393926033" (13-digit ISBN)
UCC_13 string
UPC string
ISMN string
DOI string
LCCN string 010:a normalize_lccn "2006285320"
GTIN_14 string
oca_identifier string "albertgallatinja00stevrich"
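
The `number_of_pages` row names a `biggest_decimal` step, described only as "largest decimal found". A minimal sketch of what such a step might look like, assuming it scans the 300 $a string for runs of digits and keeps the largest one; this is an illustration, not the code from the Open Library repository:

```python
import re

def biggest_decimal(pagination):
    """Return the largest run of decimal digits in a pagination string, or None."""
    numbers = [int(n) for n in re.findall(r"[0-9]+", pagination)]
    return max(numbers) if numbers else None

print(biggest_decimal("viii, 383 p. :"))  # 383
```

Roman-numeral front matter such as "viii" is ignored under this assumption, because only decimal digits are considered.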

author

Field Type MARC Fields Example (Description)
identifier string* "Twain, Mark, 1835-1910" (unique id in some catalog)
name string "Mark Twain" (human-readable name)
birth_date date "1835"
death_date date "1910"
bio text

Please see Karen Coyle's earlier [notes](http://www.kcoyle.net/temp/PharosSchema+kc.html) on the schema, and also the tables and notes below, all of which inspired the working schema.

Possible new schema
-------------------

EDITION

name type example/description
source_name STRING  
source_record_pos INT  
work ID-REF  
authors ID-REFs Tolkien, J. R. R.
contributors STRINGs "Illustrated by: Steve Bjorkman"
agencies/organizations STRINGs American Civil Liberties Union. Berkeley Chapter
title STRING The adventures of Tom Sawyer
"by" statement STRINGs Herman Melville ; [illustrated by Barry Moser]
sort title STRING adventures of Tom Sawyer
other titles STRINGs Mark Twain's The Adventures of Tom Sawyer
edition STRING 2nd. edition
publisher STRING W. W. Norton & Co.,
publish_place STRING New York :
publish_date DATE c2007.
number_of_pages STRING viii, 383 p. :
subjects STRINGs Runaway children -- Fiction
series STRINGs Oxford world's classics
notes STRINGs  
BISAC_subject_categories STRINGs see definitions here
language_code STRING code from ISO 639-2/B; e.g., "tel"
language STRING human-readable description of the text's language, e.g., "Telugu"
physical_format STRING  
description HTML  
table of contents STRINGs  
Dewey number STRINGs 914.3
LC Classification STRING BJ1533.C4 L49
cover_image URL  
scan_contributor STRING  
scan_sponsor STRING  
ISBN_10 STRING 0393926036
ISBN_13 STRING 9780393926033
UCC_13 STRING  
UPC STRING  
ISMN STRING  
DOI STRING  
LCCN STRING  
GTIN_14 STRING  
oca_identifier STRING "albertgallatinja00stevrich"

New EDITION with MARC and ONIX fields


[1] The 240 "work title" is used in the OCLC FRBR algorithm. The 130 is also used, and there should be either a 130 or a 240 in a record, but not both. It would be ideal if we could pick up either for the work title.

[2] There are two sources in the MARC record for date of publication. The 260 $c may contain characters beyond the year ("c1997" or "1946 [reprinted 1965]"). Positions 07-10 of the 008 field have a normalized date ("1997" or "1946"). The dates as represented in the 260 will not be found outside of library records, so the 008 date can be substituted for it. For ONIX, the publication date often has month and day as well as year. For merging and faceting, only the year should be used.
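
The date handling in note [2] can be sketched as follows. This assumes the 008 field is available as a fixed-width string indexed from zero, so that positions 07-10 correspond to the slice `[7:11]`. The function names are hypothetical, and falling back to the first four-digit run in the 260 $c when the 008 date is unusable is an assumption of this sketch, not something the note prescribes:

```python
import re

def year_from_008(field_008):
    """Return the four-digit year at positions 07-10 of an 008 field, or None."""
    candidate = field_008[7:11]
    return candidate if re.fullmatch(r"[0-9]{4}", candidate) else None

def year_from_260c(subfield_c):
    """Return the first four-digit run in a 260 $c string, or None."""
    match = re.search(r"[0-9]{4}", subfield_c)
    return match.group(0) if match else None

def publish_year(field_008, subfield_c=""):
    """Prefer the normalized 008 date; fall back to the 260 $c (assumption)."""
    return year_from_008(field_008) or year_from_260c(subfield_c)

print(publish_year("970101s1997    nyu", "c1997."))  # 1997
```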

[3] MARC has a wide range of notes that appear in fields that begin with "5". All notes EXCEPT the 505 (table of contents) and 520 (summary) can be placed in a notes field. Notes fields can be repeatable.
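
The selection rule in note [3] (written `5XX!505!520` in the edition table) might be expressed like this. The `(tag, text)` pair layout and both function names are hypothetical, chosen only to keep the sketch self-contained:

```python
def is_general_note(tag):
    """True for 5XX tags other than 505 (table of contents) and 520 (summary)."""
    return len(tag) == 3 and tag.startswith("5") and tag not in ("505", "520")

def collect_notes(fields):
    """Collect note text from an iterable of (tag, text) pairs, keeping repeats."""
    return [text for tag, text in fields if is_general_note(tag)]

fields = [
    ("500", "Includes index."),
    ("505", "Chapter 1 -- Chapter 2"),
    ("520", "A summary of the book."),
    ("504", "Includes bibliographical references."),
]
print(collect_notes(fields))
```

Because notes fields are repeatable, the result is a list rather than a single string.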

[4] The ISBN field is not necessarily "clean": there can be trailing data, as in "0195144953 (alk. paper)". Take only the 10- or 13-character token, which should appear first. The token is all numeric EXCEPT that the final character can be "X".
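
A minimal sketch of the cleaning rule in note [4], assuming the token is separated from any trailing data by whitespace. The name `clean_isbn` is hypothetical; this is not the `normalize_isbn` function referred to in the edition table:

```python
import re

# 10 or 13 characters, all digits except that the last one may be "X".
ISBN_SHAPE = re.compile(r"(?:[0-9]{9}|[0-9]{12})[0-9X]")

def clean_isbn(raw):
    """Return the leading ISBN token of an 020 $a value, or None if it is malformed."""
    tokens = raw.split()
    if not tokens:
        return None
    token = tokens[0].upper()
    return token if ISBN_SHAPE.fullmatch(token) else None

print(clean_isbn("0195144953 (alk. paper)"))  # 0195144953
```

The sketch checks only the shape of the token; it does not verify the check digit.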

[5] There are two possible locations for the ISBN_13 in MARC records. Records from some sources, including LC, will have the ISBN-13 in an 020 field. Many records will have two 020 fields, one with the ISBN-10 and one with the ISBN-13. Records from sources other than LC may have the ISBN-13 in the 024 field. There can be other 13-digit EANs in the 024 field, so the ISBN is identified by a "3" in the first indicator position.
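
The lookup described in note [5] could be sketched as below. The `(tag, first_indicator, subfield_a)` tuple layout and the function name are hypothetical. The sketch accepts a 13-digit token from any 020 field, and from an 024 field only when the first indicator is "3":

```python
import re

# A leading run of exactly 13 digits, optionally preceded by whitespace.
THIRTEEN_DIGITS = re.compile(r"\s*([0-9]{13})(?![0-9])")

def find_isbn13(fields):
    """Return the distinct ISBN-13 values found in 020 and 024 fields, in order."""
    found = []
    for tag, first_indicator, subfield_a in fields:
        if tag == "020" or (tag == "024" and first_indicator == "3"):
            match = THIRTEEN_DIGITS.match(subfield_a)
            if match and match.group(1) not in found:
                found.append(match.group(1))
    return found

fields = [
    ("020", " ", "0393926036"),
    ("020", " ", "9780393926033 (pbk.)"),
    ("024", "3", "9780393926033"),
    ("024", "1", "0123456789012"),
]
print(find_isbn13(fields))  # ['9780393926033']
```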

[6] The LCCN field is not necessarily "clean": there can be trailing data ($a 3400058678 /rev). If you wish to use the LCCN for matching, take only the numeric token from the subfield.
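
Under the reading that "the numeric token" means the first unbroken run of digits in the subfield, note [6] might be sketched as follows. The name `clean_lccn` is hypothetical; this is not the `normalize_lccn` function referred to in the edition table:

```python
import re

def clean_lccn(raw):
    """Return the first run of digits in an 010 $a value, or None if there is none."""
    match = re.search(r"[0-9]+", raw)
    return match.group(0) if match else None

print(clean_lccn(" 3400058678 /rev"))  # 3400058678
```

An LCCN written with a hyphen between the year and the serial number would be cut at the hyphen by this sketch, so real data may need more careful handling.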

[7] This field is a potential facet for display and selection.

History Created April 9, 2008 · 3 revisions

March 31, 2009 Edited by Edward Betts warning, out of date
August 17, 2008 Edited by Karen Coyle added subtitle
April 9, 2008 Created by Alexis Rossi adding page