Last edited by tracey pooh
April 2, 2009 | History

Book URLs - PRELIMINARY

Note: This document is in progress. If you are interested, please send feedback to mang at archive org

Goals

Bookreader URLs have the following goals:

Key-Value

Bookreader URLs are composed of key-value pairs. The keys and values are separated by '/'. We specify a canonical order in which the key-value pairs should occur, but accept them in whatever order the user-agent supplies. A user-supplied URL is remapped to the canonical order when given back to the user (e.g. by redirecting, so that the canonical form appears in the address bar). The purpose of remapping to canonical order is to reduce the number of distinct URLs "out there" on the net that point to the same resource.

If a reader implementation does not understand a given key-value pair, it should ignore that pair.

Functionality

Bookreader URLs support the following functionality:

Example URLs

http://www.archive.org/stream/aliceinwonderlan00carriala/
http://www.archive.org/stream/aliceinwonderlan00carriala/page/23
http://www.archive.org/download/aliceinwonderlan00carriala/page/15/region/10,212,256,256

These two are equivalent and would be remapped to the canonical order:

http://www.archive.org/stream/aliceinwonderlan00carriala/highlight/20,20,30,500/mode/2up/page/23
http://www.archive.org/stream/aliceinwonderlan00carriala/mode/2up/page/23/highlight/20,20,30,500

Canonical order:

http://www.archive.org/stream/aliceinwonderlan00carriala/page/23/mode/2up/highlight/20,20,30,500
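The remapping could be sketched as follows. This is a minimal illustration, not the reader's actual implementation: the CANONICAL_ORDER list is an assumption pieced together from the examples on this page, and the canonicalize function name is hypothetical. Keys not in the list are dropped here, which is one possible reading of "ignored"; if a key is repeated, the last occurrence wins, which is an artifact of the sketch rather than a stated rule.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed canonical key order; this ordering is inferred from the examples only.
CANONICAL_ORDER = ["i", "page", "mode", "search", "highlight"]


def canonicalize(url: str) -> str:
    """Reorder the key-value pairs that follow /stream/{identifier}/."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # segments[0] is 'stream' (or 'download'); segments[1] is the item identifier.
    prefix, rest = segments[:2], segments[2:]
    # Pair up the remaining segments as key/value; a trailing unpaired key is dropped.
    pairs = dict(zip(rest[0::2], rest[1::2]))
    ordered = []
    for key in CANONICAL_ORDER:
        if key in pairs:
            ordered += [key, pairs[key]]
    path = "/" + "/".join(prefix + ordered)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

With this ordering, both of the equivalent URLs above map to the canonical URL shown.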

Referring to pages, leafs, indices

For a book with a set of numbered camera images we do not always have a mapping between these images and the page numbers (as printed in the book). In addition, certain pages are not numbered at all (e.g. a completely blank page may face a figure page, both of which are inserted between consecutively numbered pages). The image stack can also contain images which should not be considered for access (e.g. colour calibration cards).

When the page numbers are available they may be referenced with:

page/{page number}

The page numbers may be either numeric or a string (e.g. 'iii'). Our earlier Scribe 1 books may have Roman numeral pages marked. Books scanned with Scribe 2 do not. String-based page numbers should be compared in lowercase. Named pages (such as the title page) may be referred to using the page name.

Examples:

page/2
page/iv
page/title

The following named pages are supported:
title
cover
first (corresponds to the first page 1, if marked)
last

Question: There exist books (e.g. compilations of articles) which may have more than one page with the same number. How do we handle these?

An external site or embedding should not assume that the page numbers are available or monotonically increasing. There may be foldouts, pages missing (e.g. damaged) or other reasons page numbers are not continuous.

Note: "Leaf" is a concept from the Archive's Scribe scanning software. It corresponds to the image sequence taken during the scanning process. The Archive.org scandata.xml refers to leafs. At the level of the bookreader and user-visible URLs the underlying leaf numbers should not be exposed unless necessary.

"Accessible page index" (i). Each page that should be included in the access formats (bookreader, PDF, etc) is given a monotonically increasing number starting from 0. For the Archive this corresponds to pages with addToAccessFormat true in the scandata.xml.

Examples:

i/0
i/23
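As a rough sketch of how the index could be derived, assume the scandata has already been parsed into a list of per-leaf records. The dictionary layout and the accessible_leafs name below are hypothetical stand-ins, not the scandata.xml schema itself.

```python
def accessible_leafs(leafs):
    """Return leaf numbers in access order; a leaf's position in the list is its index i."""
    return [leaf["leafNum"] for leaf in leafs if leaf["addToAccessFormat"]]


# Hypothetical scan: leaf 0 is a calibration card, leaf 3 is a rejected image.
leafs = [
    {"leafNum": 0, "addToAccessFormat": False},
    {"leafNum": 1, "addToAccessFormat": True},
    {"leafNum": 2, "addToAccessFormat": True},
    {"leafNum": 3, "addToAccessFormat": False},
    {"leafNum": 4, "addToAccessFormat": True},
]
index_to_leaf = accessible_leafs(leafs)
```

Here i/0 would refer to leaf 1 and i/2 to leaf 4, so the index stays contiguous even though the leaf numbers do not.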

For books where there are multiple leafs with the same page number both the index (i) and the page number should be specified.

Example:

i/23/page/4
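One way a reader might resolve such references is sketched below. It assumes a hypothetical page_labels list holding the printed page number (or None) for each accessible page index; the function names are illustrative. When the given index and page number disagree, this sketch simply reports no match, which is an assumption and not a rule stated here.

```python
def indices_for_page(page_labels, page):
    """Return every accessible page index whose label matches, compared in lowercase."""
    wanted = str(page).lower()
    return [idx for idx, label in enumerate(page_labels)
            if label is not None and label.lower() == wanted]


def resolve(page_labels, page, i=None):
    """Resolve page/{page}, optionally disambiguated by i/{i}, to a single index."""
    matches = indices_for_page(page_labels, page)
    if i is None:
        # Without an index, fall back to the first page carrying that number.
        return matches[0] if matches else None
    # With an explicit index, the page number only confirms the choice.
    return i if i in matches else None


# Hypothetical compilation in which page numbers restart for a second article.
labels = [None, "i", "ii", "1", "2", "1", "2"]
```

For this list, page/2 alone is ambiguous (indices 4 and 6), while i/6/page/2 selects the second article's page 2.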

Display options

Display options inform the bookreader how the book should be displayed to the user. The GnuBook reader supports the following options:

Note: If we force region to only use percentages that will provide some future-proofing against resolution changes. (mang)

Searching

Search terms can be highlighted by using search/ followed by the search string. The search string should be URL-escaped, and the slash character ('/') is not allowed. Spaces in the search query should be encoded as the '+' character, and a literal '+' character in a search string should be encoded as '%2b'.

Examples:

Searching for "cats":

 search/cats

Searching for "cheshire cat":

 search/cheshire+cat

Searching for "cheshire+cat":

search/cheshire%2bcat
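The escaping rules above match what Python's standard urllib.parse.quote_plus and unquote_plus do, so a sketch can lean on them. The encode_search and decode_search names are hypothetical. Note that quote_plus emits uppercase hex digits ('%2B'); percent-encoded hex digits are case-insensitive, so this decodes the same as '%2b'.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_search(term: str) -> str:
    """Build the search/ key-value pair for a search string."""
    if "/" in term:
        # The slash is the key-value separator, so it is rejected outright.
        raise ValueError("'/' is not allowed in a search string")
    return "search/" + quote_plus(term)


def decode_search(value: str) -> str:
    """Recover the search string from the value that follows search/."""
    return unquote_plus(value)
```

For example, encode_search("cheshire cat") gives search/cheshire+cat, and decode_search("cheshire%2bcat") gives back cheshire+cat.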

Highlighting

A region of a page can be highlighted using syntax similar to region. When highlight is specified, an i or page should also be specified.

Examples:

Full list of stream key-value pairs in canonical order

Downloading

The stream URL provides access to books in a format designed for online reading. The download URLs allow a book, or a portion of a book, to be downloaded.

Key-value pairs (in canonical order):

Examples:
http://www.archive.org/download/alice/format/txt/page/23
http://www.archive.org/download/alice/format/png/i/24/reduce/2/region/0,0,256,256

Future Work

Fine-grained image scaling

We currently only support power of 2 reductions for download image sub-regions. We make this restriction since power of 2 reductions are efficient with our JPEG2000 image backend. More fine-grained image resolution requests could be supported if the backend was powerful enough to allow it.
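A validity check for the current restriction might look like the sketch below. It assumes the reduce value is the linear reduction factor (so reduce/2 halves both width and height) and that sizes round up; neither detail is confirmed by this document, and the function names are hypothetical.

```python
def is_supported_reduction(reduce: int) -> bool:
    """True when reduce is a positive power of 2 (1, 2, 4, 8, ...)."""
    return reduce >= 1 and (reduce & (reduce - 1)) == 0


def reduced_size(width: int, height: int, reduce: int):
    """Return the (width, height) of an image after reduction, rounding up."""
    if not is_supported_reduction(reduce):
        raise ValueError("only power of 2 reductions are supported")
    # -(-a // b) is integer ceiling division.
    return (-(-width // reduce), -(-height // reduce))
```

Under these assumptions, a 1001 x 500 image requested with reduce/2 would come back as 501 x 250, and reduce/3 would be rejected.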

Zoom to fit

Originally we considered having the URL encode whether to zoom to fit the page. If region is insufficient for this purpose, we could allow a zoom key.

Existing Archive.org book URLs

The following example URLs also exist on Archive.org (as of March 26, 2009). These URLs should continue to be supported. This is not an exhaustive list.

Streaming full text (show the text file inside a basic online viewer):

http://www.archive.org/stream/happyhearts00isleiala/happyhearts00isleiala_djvu.txt

Streaming DJVU using a viewing applet:

http://www.archive.org/stream/happyhearts00isleiala/happyhearts00isleiala.djvu

Downloading files using /download:

http://www.archive.org/download/happyhearts00isleiala/happyhearts00isleiala.djvu

The old flipbook reader, opening at a specific page. In this case we should open the new flipbook reader if possible (the new reader should support #{pagenumber} as a legacy form).

http://www.archive.org/stream/happyhearts00isleiala#56

Questions / Issues for Existing URLs

We need some mechanism to distinguish requests to stream or download a specific file inside the item. Should trailing key/value pairs be allowed after an individual file has been specified following stream or download? It seems we want to support that behaviour. What happens when a directory or file inside the item has the same name as one of the bookreader key names?

Do files inside an item always start with the item identifier in the name? Answer: Not generally true, but it may be true for Scribe-scanned books (unverified).

Other Documents

Ideas from meeting 2009-02-29

Concatenating multiple books (or sections of books):

stream/alice/pageRange/22-25/id/tomsawyer/pageRange/15-20

For canonical order, it should be possible to chop key-value pairs off from the right and have the URL still work (but be less precise).

If a page and index are given and conflict, we take the one to the left when the URL is put into canonical order.

Named pages for beginning and end.

For zoom give fit values (width, height). Do small, medium, large, original like Flickr?

New functionality - croprect/region. Returns an image which could be used in, for example, <img src='download/alice/page/15/region/10,212,256,256'>

All lowercase in URLs.

Use region instead of zoomrect. For region use something similar to djatoka (y,x,height,width)

Key-value pairs:

Drill on keywords and order them.

Title page detection

Feedback from Brewster 2009-03-19

Related documents:

Image tiling

Transclusion

History

May 17, 2018 Edited by tracey pooh use newer https://archive.org urls
December 5, 2011 Edited by mangtronix Edited without comment.
May 27, 2011 Edited by mangtronix Edited without comment.
May 27, 2011 Edited by mangtronix Live examples
April 2, 2009 Created by mangtronix Edited without comment.