Last edited by mangtronix
December 5, 2011 | History

Book URLs

Goals

Bookreader URLs have the following goals:

  • permanency - should be stable over time
  • compactness - short enough to be printed on the cover of a book or included in an academic paper
  • "translucency" - while not being fully descriptive, BookReader URLs should give some indication to a human what they point to
  • resilience - display and other options should be accepted in any order
  • minimal/sufficient features - keep the supported feature set small yet sufficient for core tasks

Key-Value

Bookreader URLs are composed of key-value pairs. The keys and values are separated by '/'. We specify a canonical order in which the key-value pairs should occur, but accept the key-value pairs in whatever order the user-agent specifies. A user-supplied URL will be remapped to the canonical order when given back to the user (e.g. by redirecting, so it appears in the address bar). The purpose of remapping to canonical order is to reduce the number of URLs "out there" on the net that point to the same resource.

A reader implementation should ignore any key-value pair it does not understand.

Functionality

Bookreader URLs support the following functionality:

  • referencing a specific page
  • highlighting search terms
  • specifying display options (zoom level, 2-page view)
  • downloading the book, or subsections

Example URLs

http://www.archive.org/stream/aliceinwonderlan00carriala#15
http://www.archive.org/stream/aliceinwonderlan00carriala#page/23
http://www.archive.org/download/aliceinwonderlan00carriala#page/15/region/10,212,256,256

These two are equivalent and would be remapped to the canonical order:

http://www.archive.org/stream/aliceinwonderlan00carriala#highlight/20,20,30,500/mode/2up/page/23
http://www.archive.org/stream/aliceinwonderlan00carriala#mode/2up/page/23/highlight/20,20,30,500

Canonical order:

http://www.archive.org/stream/aliceinwonderlan00carriala#page/23/mode/2up/highlight/20,20,30,500

Referring to pages, leafs, indices

For a book with a set of numbered camera images we do not always have a mapping between these images and the page numbers (as printed in the book). In addition, certain pages are not numbered at all (e.g. a completely blank page may face a figure page, both of which are inserted between consecutively numbered pages). The image stack can also contain images which should not be considered for access (e.g. colour calibration cards).

When the page numbers are available they may be referenced with:

page/{page number}

The page numbers may be either numeric or a string (e.g. 'iii'). Our earlier Scribe 1 books may have Roman Numeral pages marked. Books scanned with Scribe 2 do not. String-based page numbers should be compared in lowercase. Named pages (such as the title page) may be referred to using the page name.

Examples:

page/2
page/iv
page/title

The following named pages are supported:
  • title
  • cover (automatically chosen "best" image to represent the book)
  • cover0 (first marked cover, or 404 if no cover image was explicitly marked)
  • first (corresponds to first page 1 if marked)
  • last

Question: There exist books (e.g. compilations of articles) which may have more than one page with the same number. How do we handle these?

An external site or embedding should not assume that the page numbers are available or monotonically increasing. There may be foldouts, pages missing (e.g. damaged) or other reasons page numbers are not continuous.

Note: "Leaf" is a concept from the Archive's Scribe scanning software. It corresponds to the image sequence taken during the scanning process. The Archive.org scandata.xml refers to leafs. At the level of the bookreader and user-visible URLs the underlying leaf numbers should not be exposed unless necessary.

"Accessible page index" (n). Each page that should be included in the access formats (bookreader, PDF, etc) is given a monotonically increasing number starting from 0. For the Archive this corresponds to pages with addToAccessFormat true in the scandata.xml.

Examples:

page/n0
page/n23

For books where there are multiple leafs with the same page number the index form can be used to uniquely identify the page.
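The different forms of the page value described above can be told apart with a few checks. The following is a minimal sketch, not the reader's actual code: the function name classify_page and the returned labels are hypothetical, and lowercasing the whole value before matching is an assumption based on the rule that string page numbers are compared in lowercase.

```python
import re

# Named pages listed in this document.
NAMED_PAGES = {"title", "cover", "cover0", "first", "last"}


def classify_page(value: str) -> tuple[str, object]:
    """Return (kind, value); kind is 'index', 'named', 'number' or 'string'."""
    v = value.lower()  # string page numbers are compared in lowercase
    if re.fullmatch(r"n[0-9]+", v):
        # Accessible page index, e.g. page/n23 (counting from 0).
        return ("index", int(v[1:]))
    if v in NAMED_PAGES:
        return ("named", v)
    if re.fullmatch(r"[0-9]+", v):
        # Printed page number that happens to be numeric.
        return ("number", int(v))
    # Anything else is treated as a string page number such as 'iv'.
    return ("string", v)
```

For example, classify_page("n23") yields ("index", 23) and classify_page("IV") yields ("string", "iv"). A caller would still need the book's page data to resolve a number or string to an actual image.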

Display options

Display options inform the bookreader how the book should be displayed to the user. The GnuBook reader supports the following options:

  • mode - can be 1up for single page display or 2up for two page display
  • region - (not yet implemented) the reader should attempt to show the given rectangle, specified in source image coordinates or source image percentages. The rectangle is specified as {leftx, topy, width, height} with the image origin at 0,0 at the top-left corner. The left/top/width/height values may be specified as integer values in post-scaling image pixels or as decimal percentage values (0.5 for 50%). When region is specified, an i (index) or page must also be specified.
    • Examples:
      • page/p20/region/0.1,0.2,0.25,0.5 - Show region with top-left corner 10 percent to the right and 20 percent down from top-left corner, 25% wide and 50% tall
      • page/56/region/10,20,20,500 - Show region with top-left corner 10 pixels to the right and 20 pixels down of top-left corner and 20 pixels wide and 500 pixels tall in the original source image
      • Considered possible alternative syntaxes:
        • region/0.1-0.2-0.3-0.2
        • region/0.1_0.2_0.3_0.2
        • region/x0.1,y0.2,w0.3,h0.2
        • region/10p,20p,30p,20p

Note: If we force region to only use percentages that will provide some future-proofing against resolution changes. (mang)
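Since region is not yet implemented, the following is only a sketch of how a client might parse the value. The names Region, parse_region and to_pixels are hypothetical, and the rule "any decimal point means the whole rectangle is fractional" is an assumption; the text above does not say how the two forms are distinguished.

```python
from dataclasses import dataclass


@dataclass
class Region:
    left: float
    top: float
    width: float
    height: float
    relative: bool  # True when the values are fractions of the image size


def parse_region(value: str) -> Region:
    """Parse 'left,top,width,height' as used after region/."""
    fields = value.split(",")
    if len(fields) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(fields)}")
    # Assumption: a decimal point anywhere marks the rectangle as fractional.
    relative = any("." in f for f in fields)
    numbers = [float(f) if relative else int(f) for f in fields]
    return Region(*numbers, relative)


def to_pixels(region: Region, image_width: int, image_height: int) -> tuple[int, int, int, int]:
    """Resolve a region to integer pixel coordinates for a given image size."""
    if not region.relative:
        return (int(region.left), int(region.top), int(region.width), int(region.height))
    return (
        round(region.left * image_width),
        round(region.top * image_height),
        round(region.width * image_width),
        round(region.height * image_height),
    )
```

With a 1000 by 2000 pixel image, region/0.1,0.2,0.25,0.5 resolves to (100, 400, 250, 1000). Fractional values survive a change in image resolution, which is the future-proofing argument made in the note above.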

Searching

Search terms can be highlighted by using search/ followed by the search string. The search string should be URL escaped and the slash character ('/') is not allowed. Spaces in the search query should be escaped to the '+' character and the '+' character in a search string should be escaped to '%2b'.

Examples:

Searching for "cats":

 search/cats

Searching for "cheshire cat":

 search/cheshire+cat

Searching for "cheshire+cat":

search/cheshire%2bcat
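The escaping rules above match what Python's urllib.parse.quote_plus does, so a client could build the pair as sketched below. The function names search_pair and search_query are hypothetical. Note that quote_plus emits uppercase hex digits (%2B); percent-encoded hex digits are case-insensitive, so this is equivalent to the %2b shown above.

```python
from urllib.parse import quote_plus, unquote_plus


def search_pair(query: str) -> str:
    """Build the 'search/...' key-value pair for a query string."""
    if "/" in query:
        # The slash would be read as a key-value separator.
        raise ValueError("'/' is not allowed in a search string")
    # quote_plus turns spaces into '+' and a literal '+' into '%2B'.
    return "search/" + quote_plus(query)


def search_query(pair: str) -> str:
    """Recover the original query from a 'search/...' pair."""
    key, _, value = pair.partition("/")
    if key != "search":
        raise ValueError("not a search pair")
    return unquote_plus(value)
```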

Highlighting (not yet implemented)

A region of a page can be highlighted using syntax similar to region. When highlight is specified, an i (index) or page should also be specified.

Examples:

  • i/10/highlight/10,20,256,30
  • page/20/highlight/0.1,0.2,1.0,0.1 - highlight region 10% right, 20% down from top-left corner and 10% tall by full width

Full list of stream key-value pairs in canonical order

  • page - show specific page (page number, named page or 'n{index number}')
  • highlight - highlight a rectangular area in the page content
  • region - ensure that specified region is displayed by client
  • search - run the specified search (causes keyword highlighting)
  • mode - single page or two-page mode
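Remapping a URL fragment to canonical order can be sketched as follows. This is an illustration rather than the reader's implementation: the function names are hypothetical, and the key order is taken from the list above and kept in a single constant so it can be changed. The earlier canonical-order example places mode before highlight, so the exact order used here should be treated as an assumption.

```python
# Key order taken from the list above; an assumption of this sketch.
CANONICAL_KEYS = ["page", "highlight", "region", "search", "mode"]


def parse_fragment(fragment: str) -> dict[str, str]:
    """Split 'key/value/key/value' into a dict, ignoring unknown keys."""
    parts = [p for p in fragment.lstrip("#").split("/") if p]
    pairs = {}
    # Walk the parts two at a time; a trailing key with no value is dropped.
    for key, value in zip(parts[0::2], parts[1::2]):
        if key in CANONICAL_KEYS:
            pairs[key] = value
    return pairs


def canonical_fragment(fragment: str) -> str:
    """Re-emit the recognised pairs in canonical key order."""
    pairs = parse_fragment(fragment)
    return "/".join(f"{k}/{pairs[k]}" for k in CANONICAL_KEYS if k in pairs)
```

Both "highlight/20,20,30,500/mode/2up/page/23" and "mode/2up/page/23/highlight/20,20,30,500" map to the same string under this sketch, and a pair with an unknown key is silently dropped, in line with the rule that unrecognised pairs are ignored.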

Downloading / Linking Page Images

The stream URL provides access to books in a format designed for online reading. The download URLs allow a book or portion of a book to be downloaded.

Images of the individual pages will be accessible at the following URLs (not yet implemented):

http://www.archive.org/download/{itemId}(/{path_to_book})/page/{page_specifier}({image_options}).jpg

The page specifier must be one of the following:

  • cover - best image to use to represent the book as a thumbnail or preview. Since not all books on archive.org have an appropriate cover scanned or known this may be the first cover page, title page, or simply the first page of the book.
  • cover0 - the first page marked as a cover, or 404 if not marked
  • title - title page (note, returns 404 if no title page marked)
  • page{page number string} - returns the named page (if the page numbers are known). e.g. pageiv.jpg or page5.jpg. In the case of identical page numbers in the book (e.g. a collection of magazine articles) the first matching page will be returned.
  • n{page index} - return the page at index, starting from 0 for the first page. e.g. n0.jpg is the first page, 'n1.jpg' the second page, etc. Accessing by index guarantees that all pages can be accessed sequentially.
  • leaf{leaf index} - request an image based on the leaf indices in the corresponding scandata.xml. Useful if you are processing a scandata.xml and want to request a page image.

The following image options are supported:

  • _h{height in pixels} - image should be close to height requested

  • _w{width in pixels} - image should be close to width requested

  • _s{scale factor} - image should be scaled down by factor relative to source resolution

  • _(thumb|small|medium|large) - image should be returned close to requested size

    • _thumb - 100 pixels on longest side
    • _small - 256 pixels on longest side
    • _medium - 512 pixels on longest side
    • _large - 2048 pixels on longest side
  • _x{x offset} - Horizontal offset in image from left side, in integer pixels or float (0 to 1)

  • _y{y offset} - Vertical offset in image from top, in integer pixels or float (0 to 1)

  • _rot{0, 90, 180, 270} - Rotate the image in 90 degree increments. The rotation is applied after the requested region has been taken from the source image using the x, y, width, and height values.

  • Examples:

    • _x0.1_y0.2_w0.25_h0.5 - Region with top-left corner 10 percent to the right and 20 percent down from top-left corner and 25% width and 50% tall
    • _x10_y20_w20_h500 - Region with top-left corner 10 pixels to the right and 20 pixels down of top-left corner and 20 pixels wide and 500 pixels tall in the full-resolution page image. E.g. if you request _w200_h100 and a downscale factor of 2 you would likely receive an image approximately 100x50 pixels. In some cases the returned image may be larger than the requested width/height.

The x,y and w,h options are interpreted differently depending on their combinations:

  • For an image close to a certain width, specify only _w
  • For an image close to a certain height, specify only _h
  • Specify both _w and _h to get an image proportionally scaled to a bit larger than that size. For example you could create a grid of square divs in HTML with width=400, height=400 and set the image inside to max-height=400. If you request page images as _w400_h400 the image will show scaled proportionally to fit inside the square div. The image is returned slightly larger than requested since we have limited scaling options (power of 2) with our servers, and going a little bigger ensures the image will be downscaled by the browser.
  • To request a cropped region of the page image, specify _x, _y, _w and _h. The coordinates are found in the full resolution image. You can specify a downscale amount using _s. The downscaling will be applied after the region is cropped from the full resolution image.

If multiple size specifiers are used simultaneously, the result is not defined (e.g. don't use page_w200_thumb.jpg).
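The URL template and image options above can be assembled mechanically. The sketch below is illustrative only: page_image_url and its parameter names are hypothetical, the option order (x, y, w, h, s) follows the region example URL on this page, and placing rot after the others is an assumption. It rejects a named size combined with _w, _h or _s, which is one reading of the "multiple size specifiers" rule.

```python
BASE = "http://www.archive.org/download"
SIZE_NAMES = {"thumb", "small", "medium", "large"}


def page_image_url(item_id, page_specifier, *, sub_prefix=None,
                   x=None, y=None, w=None, h=None, s=None,
                   rot=None, size=None) -> str:
    """Build a page image URL from an item id, page specifier and options."""
    if size is not None:
        if size not in SIZE_NAMES:
            raise ValueError(f"unknown size name: {size}")
        if any(v is not None for v in (w, h, s)):
            # Mixing a named size with _w/_h/_s has no defined result.
            raise ValueError("do not combine a named size with w, h or s")
    options = ""
    for name, value in (("x", x), ("y", y), ("w", w), ("h", h),
                        ("s", s), ("rot", rot)):
        if value is not None:
            options += f"_{name}{value}"
    if size is not None:
        options += f"_{size}"
    # The optional sub-prefix selects a book inside a multi-book item.
    path = f"{item_id}/{sub_prefix}" if sub_prefix else item_id
    return f"{BASE}/{path}/page/{page_specifier}{options}.jpg"
```

For instance, page_image_url("coloritsapplicat00andriala", "cover", size="thumb") produces a URL ending in /page/cover_thumb.jpg.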

Note: In general the size of the returned image will not exactly match the size requested, due to server-side image processing constraints (the closest size that is efficient to process will be returned). The client requesting the image should do the final scaling to the needed size. In general we return the next size larger than the requested size that can be reached by a power of 2 reduction (2x, 4x, 8x, etc.) of the source, since such reductions can be done efficiently when processing our JP2 source images.
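A client that wants to predict roughly what size it will receive can estimate the reduction factor. The function below is a sketch under the assumption that the server picks the largest power-of-2 reduction whose result is still at least as large as the request; the exact rounding used by the server is not specified here, and the name reduction_factor is hypothetical.

```python
def reduction_factor(source_size: int, requested_size: int) -> int:
    """Largest power-of-2 reduction that keeps the result >= requested_size."""
    if source_size <= 0 or requested_size <= 0:
        raise ValueError("sizes must be positive")
    factor = 1
    # Keep doubling while the next reduction would still cover the request.
    while source_size // (factor * 2) >= requested_size:
        factor *= 2
    return factor
```

Under this assumption, a 2000 pixel wide source requested at _w400 would be reduced by 4 and returned about 500 pixels wide, leaving the final downscale to the client. A request larger than the source yields a factor of 1, i.e. the full-resolution image.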

Examples:

http://www.archive.org/download/coloritsapplicat00andriala/page/cover.jpg

  • Cover at full resolution. This may be the actual cover of the book, title page or first page as determined by heuristic.

http://www.archive.org/download/coloritsapplicat00andriala/page/title.jpg

  • Title page at full resolution if known/present, otherwise 404.

http://www.archive.org/download/coloritsapplicat00andriala/page/cover_thumb.jpg

  • Cover as thumbnail (100 pixels on longest side)

http://www.archive.org/download/coloritsapplicat00andriala/page/cover_w200.jpg

  • Cover at roughly 200 pixels wide

http://www.archive.org/download/coloritsapplicat00andriala/page/page35.jpg

  • Page 35 at full resolution - Note: It's only possible to reference by page number if the page numbers have been marked during the scanning process.

http://www.archive.org/download/coloritsapplicat00andriala/page/n25_s4.jpg

  • Page with index 25 at 4x downscaling

http://www.archive.org/download/panamapacificint00moor/page/page15_x280_y292_w668_h1080_s4.jpg

  • A fancy letter "T" using the region options

Books inside sub-directories and multi-book items

For books inside a subdirectory in the item the "sub-prefix" can be specified to indicate which
book is being requested. Here's an example book in a sub-directory:
http://www.archive.org/download/BozorSobirkhonavodaParokandaShud/Bozor_Sobir_Khonavoda_Parokanda_Shud/page/n0.jpg

If the sub-prefix is not specified the first book found (alphabetically, by prefix) is returned.
For items containing multiple books, the sub-prefix must be specified to access books other than the first book
inside the item.

http://www.archive.org/download/SubBookTest/subdir/subsubdir/book3/Rfp008011ResponseInternetArchive-without-resume/page/cover.jpg

  • Cover for book in subdir

http://www.archive.org/download/SubBookTest/subdir/book2/brewster_kahle_internet_archive/page/cover.jpg

  • Cover for another book in a different subdir

http://www.archive.org/download/SubBookTest/book1/GPORFP/page/cover.jpg

  • Yet another book

http://www.archive.org/download/SubBookTest/page/cover.jpg

  • Same as above, since sub-prefix book1/GPORFP is the first in asciibetical order
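The default-book rule can be illustrated with a plain string comparison. This is a sketch, assuming "asciibetical" means ordering by character code, which is how Python compares strings; the function name default_sub_prefix is hypothetical.

```python
def default_sub_prefix(sub_prefixes: list[str]) -> str:
    """Pick the sub-prefix used when the URL does not name one."""
    if not sub_prefixes:
        raise ValueError("item contains no books")
    # min() compares strings by code point, so uppercase sorts before lowercase.
    return min(sub_prefixes)
```

Given the three sub-prefixes of the SubBookTest item above, this returns book1/GPORFP.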

Future Work

Fine-grained image scaling

We currently only support power of 2 reductions for download image sub-regions. We make this restriction since power of 2 reductions are efficient with our JPEG2000 image backend. More fine-grained image resolution requests could be supported if the backend was powerful enough to allow it.

Zoom to fit

Originally we planned for the URL to encode whether to zoom to fit the page. If region is insufficient for this purpose we could allow a zoom key.

  • zoom - the zoom level with these supported values:
    • fit - fit width and height inside container
    • fitwidth - fit width inside container
    • fitheight - fit height inside container

Existing Archive.org book URLs

The following example URLs also exist on Archive.org (as of March 26, 2009). These URLs should continue to be supported. This is not an exhaustive list.

Streaming full text (show the text file inside a basic online viewer):

http://www.archive.org/stream/happyhearts00isleiala/happyhearts00isleiala_djvu.txt

Streaming DJVU using a viewing applet:

http://www.archive.org/stream/happyhearts00isleiala/happyhearts00isleiala.djvu

Downloading files using /download:

http://www.archive.org/download/happyhearts00isleiala/happyhearts00isleiala.djvu

The old flipbook reader, opening at a specific page. In this case we should open the new flipbook reader if possible (the new reader should support #{pagenumber} as legacy).

http://www.archive.org/stream/happyhearts00isleiala#56
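One way the new reader could accept the legacy form is to rewrite a bare fragment into a page pair before normal parsing. The sketch below assumes that any fragment without a '/' is a legacy page number; the function name upgrade_legacy_fragment is hypothetical.

```python
def upgrade_legacy_fragment(fragment: str) -> str:
    """Map a bare '#56' style fragment to 'page/56'; leave others unchanged."""
    body = fragment.lstrip("#")
    # Assumption: a non-empty fragment with no '/' is a legacy page number.
    if body and "/" not in body:
        return f"page/{body}"
    return body
```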

Questions / Issues for Existing URLs

We need some mechanism to detect when streaming or downloading of a specific file inside the item is requested. Should trailing key/value pairs be allowed after an individual file has been specified following stream or download? It seems we want to support that behaviour. What happens when a directory or file inside the item has the same name as one of the bookreader key names?

Do files inside an item always start with the item identifier in the name? Answer: Not generally true, but may be true for Scribe scanned books (unverified).

Other Documents

Ideas from meeting 2009-02-29

Concatenating multiple books (or sections of books):

stream/alice#pageRange/22-25/id/tomsawyer/pageRange/15-20

For canonical order, it should be possible to chop key-value pairs off from the right and have the URL still work (but be less precise).

If a page and index are given and conflict, we take the one to the left when the URL is put into canonical order.

Named pages for beginning and end.

For zoom give fit values (width, height). Do small, medium, large, original like Flickr?

New functionality - croprect/region. Returns an image which could be used in, for example, <img src='download/alice/page/15/region/10,212,256,256'>

All lowercase in URLs.

Use region instead of zoomrect. For region use something similar to djatoka (y,x,height,width)

Key-value pairs:

  • format - txt, jpg, png, jp2

Drill on keywords and order them.

Title page detection

  • Take page with largest text, near start of book, with words similar to title and call it the title page

Feedback from Brewster 2009-03-19

  • Spec what happens when an image region outside the book is requested
  • Should we have a more general scale key even if we only currently support pow 2 size reductions?

Related documents:

Image tiling

Transclusion

History Created April 21, 2010 · 52 revisions

December 5, 2011 Edited by mangtronix Edited without comment.
May 27, 2011 Edited by mangtronix Edited without comment.
May 27, 2011 Edited by mangtronix Live examples
May 27, 2011 Edited by mangtronix Edited without comment.
April 21, 2010 Edited by mangtronix Edited without comment.