Goals
Bookreader URLs have the following goals:
- permanency - should be stable over time
- compactness - short enough to be printed on the cover of a book or included in an academic paper
- "translucency" - while not being fully descriptive, BookReader URLs should give some indication to a human what they point to
- resilience - display and other options should be accepted in any order
- minimal/sufficient features - keep the supported feature set small yet sufficient for core tasks
Key-Value
Bookreader URLs are composed of key-value pairs. The keys and values are separated by '/'. We specify a canonical order in which the key-value pairs should occur, but accept them in whatever order the user-agent supplies. A user-supplied URL will be remapped to the canonical order when given back to the user (e.g. by redirecting, so that it appears in the address bar). The purpose of the remapping is to reduce the number of distinct URLs "out there" on the net that point to the same resource.
If a reader implementation does not understand a given key-value pair, it should be ignored.
Functionality
Bookreader URLs support the following functionality:
- referencing a specific page
- highlighting search terms
- specifying display options (zoom level, 2-page view)
- downloading the book, or subsections
Example URLs
https://archive.org/stream/aliceinwonderlan00carriala#15
https://archive.org/stream/aliceinwonderlan00carriala#page/23
https://archive.org/download/aliceinwonderlan00carriala#page/15/region/10,212,256,256
These two are equivalent and would be remapped to the canonical order:
https://archive.org/stream/aliceinwonderlan00carriala#highlight/20,20,30,500/mode/2up/page/23
https://archive.org/stream/aliceinwonderlan00carriala#mode/2up/page/23/highlight/20,20,30,500
Canonical order:
https://archive.org/stream/aliceinwonderlan00carriala#page/23/mode/2up/highlight/20,20,30,500
Referring to pages, leafs, indices
For a book with a set of numbered camera images we do not always have a mapping between these images and the page numbers (as printed in the book). In addition, certain pages are not numbered at all (e.g. a completely blank page may face a figure page, both of which are inserted between consecutively numbered pages). The image stack can also contain images which should not be considered for access (e.g. colour calibration cards).
When the page numbers are available they may be referenced with:
page/{page number}
The page numbers may be either numeric or a string (e.g. 'iii'). Our earlier Scribe 1 books may have Roman numeral pages marked; books scanned with Scribe 2 do not. String-based page numbers should be compared in lowercase. Named pages (such as the title page) may be referred to using the page name.
Examples:
page/2
page/iv
page/title
The following named pages are supported:
title
cover (automatically chosen "best" image to represent the book)
cover0 (first marked cover, or 404 if no cover image was explicitly marked)
first (the first page numbered 1, if marked)
last
Question: There exist books (e.g. compilations of articles) which may have more than one page with the same number. How do we handle these?
An external site or embedding should not assume that the page numbers are available or monotonically increasing. There may be foldouts, pages missing (e.g. damaged) or other reasons page numbers are not continuous.
Note: "Leaf" is a concept from the Archive's Scribe scanning software. It corresponds to the image sequence taken during the scanning process. The Archive.org scandata.xml refers to leafs. At the level of the bookreader and user-visible URLs the underlying leaf numbers should not be exposed unless necessary.
"Accessible page index" (n). Each page that should be included in the access formats (bookreader, PDF, etc) is given a monotonically increasing number starting from 0. For the Archive this corresponds to pages with addToAccessFormat true in the scandata.xml.
Examples:
page/n0
page/n23
For books where there are multiple leafs with the same page number the index form can be used to uniquely identify the page.
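The three kinds of page value described above (named page, accessible page index, printed page number) can be told apart with a small classifier. The following is only a sketch: the names PageRef and classify_page_value are hypothetical, and treating every value of the form n{digits} as an index is an assumption, since the text does not say how a printed page number that happens to look like "n5" would be handled.

```python
import re
from typing import NamedTuple

# Named pages listed in this document.
NAMED_PAGES = {"title", "cover", "cover0", "first", "last"}


class PageRef(NamedTuple):
    kind: str   # "named", "index", or "number"
    value: str  # page name, index digits, or lowercased page number string


def classify_page_value(value: str) -> PageRef:
    """Classify the value that follows 'page/' (hypothetical helper)."""
    lowered = value.lower()  # string page numbers are compared in lowercase
    if lowered in NAMED_PAGES:
        return PageRef("named", lowered)
    match = re.fullmatch(r"n(\d+)", lowered)
    if match:
        # Assumption: 'n' followed only by digits is always an accessible page index.
        return PageRef("index", match.group(1))
    return PageRef("number", lowered)
```

For example, classify_page_value("n23") yields an index reference, while classify_page_value("IV") yields the page number "iv".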
Display options
Display options inform the bookreader how the book should be displayed to the user. The GnuBook reader supports the following options:
- mode - can be 1up for single page display or 2up for two page display
- region - (not yet implemented) the reader should attempt to show the given rectangle specified in source image coordinates or source image percentages. The rectangle is specified as {leftx, topy, width, height} with the image origin at 0,0 at the top-left corner. The left/top/width/height values may be specified as integer values in post-scaling image pixels or as decimal percentage values (0.5 for 50%). When region is specified an i or page must be specified.
Examples:
- page/p20/region/0.1,0.2,0.25,0.5 - Show region with top-left corner 10 percent to the right and 20 percent down from top-left corner and 25% width and 50% tall
- page/56/region/10,20,20,500 - Show region with top-left corner 10 pixels to the right and 20 pixels down of top-left corner and 20 pixels wide and 500 pixels tall in the original source image
Considered possible alternative syntaxes:
- region/0.1-0.2-0.3-0.2
- region/0.1_0.2_0.3_0.2
- region/x0.1,y0.2,w0.3,h0.2
- region/10p,20p,30p,20p
Note: If we force region to only use percentages that will provide some future-proofing against resolution changes. (mang)
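As a rough illustration of the region syntax, the sketch below parses the four comma-separated values and converts them to pixels for a given page size. The names Region, parse_region and to_pixels are hypothetical. The rule used to tell the two forms apart (a value containing a decimal point is a fraction, otherwise it is a pixel count) is an assumption, as is rejecting a mix of the two; the text above does not define either.

```python
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    left: float
    top: float
    width: float
    height: float
    fractional: bool  # True when the values are fractions of the page size


def parse_region(value: str) -> Region:
    """Parse 'left,top,width,height' (hypothetical helper)."""
    parts = value.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    # Assumption: a decimal point marks a fractional value, e.g. 0.5 for 50%.
    flags = ["." in part for part in parts]
    if any(flags) and not all(flags):
        raise ValueError("mixing pixel and fractional values is not handled here")
    numbers = [float(part) for part in parts]
    if any(number < 0 for number in numbers):
        raise ValueError("region values must not be negative")
    return Region(*numbers, fractional=all(flags))


def to_pixels(region: Region, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Convert a region to integer pixel coordinates for a page of the given size."""
    if not region.fractional:
        return (int(region.left), int(region.top), int(region.width), int(region.height))
    return (
        round(region.left * page_width),
        round(region.top * page_height),
        round(region.width * page_width),
        round(region.height * page_height),
    )
```

If only fractional values were allowed, as the note above suggests, the pixel branch and the mixing check could be dropped.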
Searching
Search terms can be highlighted by using search/ followed by the search string. The search string should be URL escaped, and the slash character ('/') is not allowed. Spaces in the search query should be encoded as the '+' character, and a literal '+' character in a search string should be escaped as '%2b'.
Examples:
Searching for "cats":
search/cats
Searching for "cheshire cat":
search/cheshire+cat
Searching for "cheshire+cat":
search/cheshire%2bcat
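The escaping rules above match what Python's standard urllib.parse.quote_plus and unquote_plus do, so a search key-value pair could be built and read back roughly as follows. The function names encode_search and decode_search are hypothetical, and rejecting a slash with an exception is one possible way of enforcing the rule, not something this document prescribes.

```python
from urllib.parse import quote_plus, unquote_plus


def encode_search(query: str) -> str:
    """Build a 'search/...' key-value pair (hypothetical helper)."""
    if "/" in query:
        # The slash is the key-value separator, so it is not allowed in the query.
        raise ValueError("the slash character is not allowed in a search string")
    # quote_plus turns spaces into '+' and a literal '+' into '%2B'.
    return "search/" + quote_plus(query)


def decode_search(pair: str) -> str:
    """Recover the search string from a 'search/...' key-value pair."""
    key, _, value = pair.partition("/")
    if key != "search":
        raise ValueError(f"not a search pair: {pair!r}")
    return unquote_plus(value)
```

quote_plus emits uppercase hex digits ('%2B'), whereas the examples above use '%2b'. Percent-escapes are case-insensitive, so both decode to the same string.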
Highlighting (not yet implemented)
A region of a page can be highlighted using syntax similar to region. When highlight is specified an i or page should be specified.
Examples:
- i/10/highlight/10,20,256,30
- page/20/highlight/0.1,0.2,1.0,0.1 - highlight region 10% right, 20% down from top-left corner and 10% tall by full width
Full list of stream key-value pairs in canonical order
- page - show specific page (named page or 'n{index number}')
- highlight - highlight a rectangular area in the page content
- region - ensure that specified region is displayed by client
- search - run the specified search (causes keyword highlighting)
- mode - single page or two-page mode
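Remapping a URL fragment to canonical order could look roughly like the sketch below. It assumes the list above is the authoritative order (the canonical example earlier in this document places mode before highlight, so one of the two would need to be reconciled). The name canonicalize_fragment is hypothetical. Keeping the first occurrence of a repeated key and mapping a bare legacy fragment such as "15" to page/15 are also assumptions.

```python
CANONICAL_ORDER = ["page", "highlight", "region", "search", "mode"]


def canonicalize_fragment(fragment: str) -> str:
    """Reorder the key-value pairs of a URL fragment (hypothetical helper)."""
    parts = fragment.lstrip("#").strip("/").split("/")
    # Assumption: a bare value such as "15" is the legacy #{pagenumber} form.
    if len(parts) == 1 and parts[0]:
        return "page/" + parts[0]
    pairs = {}
    # Walk the fragment as (key, value) pairs; a trailing key without a value is dropped.
    for key, value in zip(parts[0::2], parts[1::2]):
        key = key.lower()
        # Unknown keys are ignored; for repeated keys the leftmost one wins (assumption).
        if key in CANONICAL_ORDER and key not in pairs:
            pairs[key] = value
    return "/".join(f"{key}/{pairs[key]}" for key in CANONICAL_ORDER if key in pairs)
```

A server could compare the result with the fragment it received and redirect when they differ, so that only the canonical form ends up in the address bar.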
Downloading / Linking Page Images
The stream URL provides access to books in a format designed for online reading. The download URLs allow a book or portion of a book to be downloaded.
Images of the individual pages will be accessible at the following URLs (not yet implemented):
https://archive.org/download/{itemId}(/{path_to_book})/page/{page_specifier}({image_options}).jpg
The page specifier must be one of the following:
- cover - best image to use to represent the book as a thumbnail or preview. Since not all books on archive.org have an appropriate cover scanned or known this may be the first cover page, title page, or simply the first page of the book.
- cover0 - the first page marked as a cover, or 404 if not marked
- title - title page (note, returns 404 if no title page marked)
- page{page number string} - returns the named page (if the page numbers are known), e.g. pageiv.jpg or page5.jpg. In the case of identical page numbers in the book (e.g. a collection of magazine articles) the first matching page will be returned.
- n{page index} - return the page at index, starting from 0 for the first page, e.g. n0.jpg is the first page, n1.jpg the second page, etc. Accessing by index guarantees that all pages can be accessed sequentially.
- leaf{leaf index} - request an image based on the leaf indices in the corresponding scandata.xml. Useful if you are processing a scandata.xml and want to request a page image.
The following image options are supported:
- _h{height in pixels} - image should be close to height requested
- _w{width in pixels} - image should be close to width requested
- _s{scale factor} - image should be scaled down by factor relative to source resolution
- _(thumb|small|medium|large) - image should be returned close to requested size
  - _thumb - 100 pixels on longest side
  - _small - 256 pixels on longest side
  - _medium - 512 pixels on longest side
  - _large - 2048 pixels on longest side
- _x{x offset} - Horizontal offset in image from left side, in integer pixels or float (0 to 1)
- _y{y offset} - Vertical offset in image from top, in integer pixels or float (0 to 1)
- _rot{0, 90, 180, 270} - Rotate the image in 90 degree increments. The rotation is applied after the requested region has been taken from the source image using the x, y, width, and height values.
Examples:
- _x0.1_y0.2_w0.25_h0.5 - Region with top-left corner 10 percent to the right and 20 percent down from top-left corner and 25% width and 50% tall
- _x10_y20_w20_h500 - Region with top-left corner 10 pixels to the right and 20 pixels down of top-left corner and 20 pixels wide and 500 pixels tall in the full-resolution page image. E.g. if you request _w200_h100 and a downscale factor of 2 you would likely receive an image approximately 100x50 pixels. In some cases the returned image may be larger than the requested width/height.
The x,y and w,h options are interpreted differently depending on their combinations:
- For an image close to a certain width, specify only _w
- For an image close to a certain height, specify only _h
- Specify both _w and _h to get an image proportionally scaled to a bit larger than that size. For example you could create a grid of square divs in HTML with width=400, height=400 and set the image inside to max-height=400. If you request page images as _w400_h400 the image will show scaled proportionally to fit inside the square div. The image is returned slightly larger than requested since we have limited scaling options (power of 2) with our servers, and going a little bigger ensures the image will be downscaled by the browser.
- To request a cropped region of the page image, specify _x, _y, _w and _h. The coordinates are found in the full resolution image. You can specify a downscale amount using _s. The downscaling will be applied after the region is cropped from the full resolution image.
If multiple size specifiers are used simultaneously, the result is not defined (e.g. don't use page_w200_thumb.jpg).
Note: In general the size of the returned image will not exactly match the size requested due to server-side image processing constraints (the closest size that is efficient to process will be returned).
The client requesting the image should do final scaling of the image to the needed size. In general we return the next larger power of 2 reduction (2x, 4x, 8x, etc) compared to the requested size since it can be done efficiently when processing our JP2 source images.
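One reading of the note above is that the server picks the largest power of 2 reduction whose result is still at least as large as the requested size. The sketch below implements that reading only; the function name power_of_two_reduction is hypothetical and the actual server-side selection may differ.

```python
def power_of_two_reduction(source_size: int, requested_size: int) -> int:
    """Return the reduction factor (1, 2, 4, 8, ...) under the reading described above."""
    if source_size <= 0 or requested_size <= 0:
        raise ValueError("sizes must be positive")
    factor = 1
    # Keep halving while the halved image would still cover the requested size.
    while source_size >= requested_size * factor * 2:
        factor *= 2
    return factor
```

Under this reading, a 2000 pixel wide source requested at _w400 would be reduced by a factor of 4 and returned 500 pixels wide, leaving the final scaling to the client.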
Examples:
https://archive.org/download/coloritsapplicat00andriala/page/cover.jpg
- Cover at full resolution. This may be the actual cover of the book, title page or first page as determined by heuristic.
https://archive.org/download/coloritsapplicat00andriala/page/title.jpg
- Title page at full resolution if known/present, otherwise 404.
https://archive.org/download/coloritsapplicat00andriala/page/cover_thumb.jpg
- Cover as thumbnail (100 pixels on longest side)

https://archive.org/download/coloritsapplicat00andriala/page/cover_w200.jpg
- Cover at roughly 200 pixels wide

https://archive.org/download/coloritsapplicat00andriala/page/page35.jpg
- Page 35 at full resolution - Note: It's only possible to reference by page number if the page numbers have been marked during the scanning process.
https://archive.org/download/coloritsapplicat00andriala/page/n25_s4.jpg
- Page with index 25 at 4x downscaling
https://archive.org/download/panamapacificint00moor/page/page15_x280_y292_w668_h1080_s4.jpg
- A fancy letter "T" using the region options
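The example URLs above can be assembled from an item identifier, a page specifier and image options. The following is a minimal sketch, not the Archive's implementation: page_image_url and its parameters are hypothetical names, the option order (x, y, w, h, s, rot, then a named size) simply follows the examples, and refusing to combine a named size with _w, _h or _s is an assumption based on the statement that such combinations are undefined.

```python
BASE = "https://archive.org/download"
SIZE_NAMES = {"thumb", "small", "medium", "large"}


def page_image_url(item_id, page_specifier, *, sub_prefix=None, x=None, y=None,
                   width=None, height=None, scale=None, rotate=None, size=None):
    """Build a page image URL (hypothetical helper)."""
    if size is not None:
        if size not in SIZE_NAMES:
            raise ValueError(f"unknown size name: {size!r}")
        if width is not None or height is not None or scale is not None:
            # The result of mixing size specifiers is undefined, so refuse it here.
            raise ValueError("do not combine a named size with width, height or scale")
    if rotate is not None and rotate not in (0, 90, 180, 270):
        raise ValueError("rotation must be 0, 90, 180 or 270")
    options = ""
    for prefix, value in (("_x", x), ("_y", y), ("_w", width),
                          ("_h", height), ("_s", scale), ("_rot", rotate)):
        if value is not None:
            options += f"{prefix}{value}"
    if size is not None:
        options += f"_{size}"
    # The optional sub-prefix selects a book inside a sub-directory of the item.
    path = f"{item_id}/{sub_prefix}" if sub_prefix else item_id
    return f"{BASE}/{path}/page/{page_specifier}{options}.jpg"
```

For instance, page_image_url("coloritsapplicat00andriala", "cover", size="thumb") reproduces the thumbnail example above.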

Books inside sub-directories and multi-book items
For books inside a subdirectory in the item the "sub-prefix" can be specified to indicate which
book is being requested. Here's an example book in a sub-directory:
https://archive.org/download/BozorSobirkhonavodaParokandaShud/Bozor_Sobir_Khonavoda_Parokanda_Shud/page/n0.jpg
If the sub-prefix is not specified the first book found (alphabetically, by prefix) is returned.
For items containing multiple books, the sub-prefix must be specified to access books other than the first book
inside the item.
- Cover for book in subdir
https://archive.org/download/SubBookTest/subdir/book2/brewster_kahle_internet_archive/page/cover.jpg
- Cover for another book in a different subdir
https://archive.org/download/SubBookTest/book1/GPORFP/page/cover.jpg
- Yet another book
https://archive.org/download/SubBookTest/page/cover.jpg
- Same as above, since sub-prefix book1/GPORFP is the first in asciibetical order
Future Work
Fine-grained image scaling
We currently only support power of 2 reductions for download image sub-regions. We make this restriction since power of 2 reductions are efficient with our JPEG2000 image backend. More fine-grained image resolution requests could be supported if the backend was powerful enough to allow it.
Zoom to fit
Originally we considered having the URL encode whether to zoom to fit the page. If region is insufficient for this purpose we could allow a zoom key.
- zoom - the zoom level with these supported values:
  - fit - fit width and height inside container
  - fitwidth - fit width inside container
  - fitheight - fit height inside container
Existing Archive.org book URLs
The following example URLs also exist on Archive.org (as of March 26, 2009). These URLs should continue to be supported. This is not an exhaustive list.
Streaming full text (show the text file inside a basic online viewer):
https://archive.org/stream/happyhearts00isleiala/happyhearts00isleiala_djvu.txt
Streaming DJVU using a viewing applet:
https://archive.org/stream/happyhearts00isleiala/happyhearts00isleiala.djvu
Downloading files using /download:
https://archive.org/download/happyhearts00isleiala/happyhearts00isleiala.djvu
The old flipbook reader, opening at a specific page. In this case we should open the new flipbook reader if possible (the new reader should support #{pagenumber} as legacy).
https://archive.org/stream/happyhearts00isleiala#56
Questions / Issues for Existing URLs
We need some mechanism to distinguish when streaming or downloading a specific file inside the item is requested. Should it be possible for trailing key/value pairs to occur after an individual file is specified after stream or download? It seems we want to support that behaviour. What happens in the case where a directory or file inside the item has the same name as one of the bookreader key names?
Do files inside an item always start with the item identifier in the name? Answer: Not generally true, but may be true for Scribe scanned books (unverified).
Other Documents
Ideas from meeting 2009-02-29
Concatenating multiple books (or sections of books):
stream/alice#pageRange/22-25/id/tomsawyer/pageRange/15-20
For canonical order, it should be possible to chop key-value pairs off from the right and have the URL still work (but be less precise).
If a page and index are given and conflict, we take the one to the left when the URL is put into canonical order.
Named pages for beginning and end.
For zoom give fit values (width, height). Do small, medium, large, original like Flickr?
New functionality - croprect/region. Returns an image which could be used in, for example, <img src='download/alice/page/15/region/10,212,256,256'>
All lowercase in URLs.
Use region instead of zoomrect. For region use something similar to djatoka (y,x,height,width)
Key-value pairs:
- format - txt, jpg, png, jp2
Drill on keywords and order them.
Title page detection
- Take page with largest text, near start of book, with words similar to title and call it the title page
Feedback from Brewster 2009-03-19
- Spec what happens when an image region outside the book is requested
- Should we have a more general scale key even if we only currently support pow 2 size reductions?
Related documents:
- Citing an online book (Raj)
- Purdue Guide to MLA Formatting and Style Guide: Works Cited: Electronic Source
- Proposal started 2009-01-29
- Proposal at 2009-02-29
Image tiling
Transclusion
- Fine-Grained Transclusion in the Hypertext Markup Language - 1997
- Methods for implementing transclusion of text into HTML pages - 1996
- purple-include - client-side JS library for transclusion using xpath
History
- Created April 2, 2009
- 53 revisions
May 17, 2018 | Edited by tracey pooh | use newer https://archive.org urls |
December 5, 2011 | Edited by mangtronix | Edited without comment. |
May 27, 2011 | Edited by mangtronix | Edited without comment. |
May 27, 2011 | Edited by mangtronix | Live examples |
April 2, 2009 | Created by mangtronix | Edited without comment. |