hardware deployment proposal

Current and proposed deployment of Open Library hardware. NA means the machine does not exist. This proposal adds two new machines (fulltext1 and fulltext2), needed for the projected growth of the fulltext search index. It also rearranges existing services so that:

no production services depend on black boxes, and no primary production servers run development tasks. Runaway dev tasks have accidentally clobbered production machines in the past.
The OL lighttpd web listener is moved to the archive.org www setup so it automatically gets redundancy through the existing round-robin DNS and multiple apaches (Ralf's suggestion). Application redundancy is done through the mod_fastcgi configuration and within the application logic.
main OL services (application fastcgi, solr and postgres) are each replicated to a machine on a separate power circuit from that service's primary machine, so we can operate through a power outage on a single circuit (we have had several such outages due to circuit breakers tripping). We will still be off the air in a larger outage such as a city-wide blackout.
In normal operation, search requests from web and API users are handled by separate machines (i.e. we use the primary and backup bibliographic solr servers in parallel); this mix can also be adjusted in the application software. We should be able to use parallel fastcgi apps the same way, but the performance gain may be less important than for the solrs.
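As a rough illustration of the web/API split in the last point, here is a minimal Python sketch of how the application might choose a bibliographic solr host. The host names come from the table below; the function name `pick_solr`, the `PREFERENCE` mapping, and the idea of passing in a set of hosts known to be up are assumptions made for this example, not existing OL code.

```python
# Hypothetical sketch: choose a bibliographic solr host per request type.
# Preference order follows the proposed table: ia311530 is web primary and
# API backup, ia301443 (zenodotus) is API primary and web backup.
PREFERENCE = {
    "web": ["ia311530", "ia301443"],
    "api": ["ia301443", "ia311530"],
}


def pick_solr(kind, up_hosts):
    """Return the first preferred host that is up, or None if neither is.

    kind     -- "web" or "api"
    up_hosts -- set of host names currently believed to be up (assumed to
                come from some health check that is not shown here)
    """
    for host in PREFERENCE[kind]:
        if host in up_hosts:
            return host
    return None
```

When both servers are up, web and API traffic land on different machines; when one is down, both kinds of request fall back to the survivor. Adjusting the mix would amount to changing the preference lists.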

See this link for description of existing machines.

Machine	Current use	Proposed use
Pharosdb (black box)	Postgres database	development, staging fastcgi, pdregistry solr. This box is big enough to host a full-scale development psql (suitable for load/stress testing) and per-developer mini-dbs, which we don't have now and which would be useful.
Pharosdb-bu (black box)	Bibliographic solr	development. It would be good if we could put another disk in this box.
ia301537 (wiki-beta)	Lighttpd, fastcgi apps for production and staging, plus mercurial repo, plus some development stuff	Production fastcgi, hg repo, fulltext solr shard #2 hot backup, fulltext solr shard #1 cold (static) backup. Lighttpd is replaced by a virtual host in the archive.org apaches. See notes below for explanation of cold backup for fulltext #1.
ia311530	Development, staging solr	Bibliographic solr web primary, bibliographic solr API backup
ia311538	Development, pdregistry solr	Postgres database primary
ia301442 (apollonius)	Development	Upgrade hardware and use for postgres database backup and fastcgi application backup
ia301443 (zenodotus)	Fulltext solr	Bibliographic solr web backup, bibliographic solr API primary. For this purpose, zenodotus should be upgraded to 8 GB RAM (ia311530 already has 8 GB).
Fulltext1 (new)	NA	Fulltext solr shard #1. Preferably on a separate power circuit from fulltext2 (below) and from wiki-beta. It is probably OK to also run a fastcgi app here if running one on the postgres backup machine gets in the way of postgres.
Fulltext2 (new)	NA	Fulltext solr shard #2. Should be on a separate power circuit from wiki-beta and preferably also separate from fulltext1. Can run a backup fastcgi app.
Desktop workstations	Remote access	Remote access + local development
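The primary/backup pairings in the table can be sanity-checked against the rule of thumb given under "Basic theory": petabox hosts whose numbers differ by 15 or more are generally on separate circuits. The Python sketch below applies that rule literally. It assumes that the "host number" is the numeric part of the `ia…` host name; the function names and the `PAIRS` mapping are made up for this example, and a passing check is only a hint, not a substitute for looking at the actual rack wiring.

```python
import re

MIN_GAP = 15  # rule of thumb from this page, not a guarantee

# (primary, backup) per service, taken from the proposed-use column above.
PAIRS = {
    "bibliographic solr": ("ia311530", "ia301443"),
    "postgres": ("ia311538", "ia301442"),
    "fastcgi": ("ia301537", "ia301442"),
}


def host_number(name):
    """Extract the numeric part of a host name such as 'ia311530'."""
    match = re.fullmatch(r"ia(\d+)", name)
    if match is None:
        raise ValueError("not a petabox-style host name: %r" % name)
    return int(match.group(1))


def probably_separate_circuits(host_a, host_b):
    """True if the host numbers differ by at least MIN_GAP."""
    return abs(host_number(host_a) - host_number(host_b)) >= MIN_GAP


def check_pairs(pairs=PAIRS):
    """Map each service to whether its primary and backup pass the rule."""
    return {
        service: probably_separate_circuits(primary, backup)
        for service, (primary, backup) in pairs.items()
    }
```

For example, `probably_separate_circuits("ia311530", "ia311538")` is false (the numbers differ by 8), which is acceptable here because those two machines host different services rather than a primary and its backup.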

Basic theory:

Pharosdb and pharosdb-bu are IA black boxes: powerful but one-off machines that are unsupported by IA Operations. We should stop running production services on them but can keep using them for development, crawling, etc.
We now often run development tasks on primary production servers. This should stop, both for load and reliability reasons. It was probably a dev task that took down wiki-beta a couple of weekends back and caused some inconvenient downtime. It is OK to run dev tasks on the production backup servers, with the understanding that the dev stuff would be bumped in case of a production primary outage.
Zenodotus (current fulltext solr) is somewhat strained, and adding a lot more fulltext may require splitting fulltext search across 2 machines. This table assumes such a split, but we won't know what will really be needed without further experiments. In the worst case it may even take 3 machines, but I think 2 will be enough.
All production services in this proposal have backup servers running on separate power circuits from the primaries. There are 3 or 4 circuits going into each petabox rack and 40 machines per rack. While the distribution of circuits within racks is not always uniform, if two petabox host numbers differ by 15 or more, they will generally be on separate circuits.
Of the two hypothetical fulltext solr servers, shard #1 would be static (containing, say, 55% of the fulltext index) and shard #2 would start with 45%, with the daily updates being added to shard #2 so that it grows slowly. The backup machine (wiki-beta) would keep a cold copy of shard #1 on its hard drive and would run a synchronized mirror of shard #2 in case the shard #2 primary fails. If the shard #1 primary fails, we'd switch the backup to the pre-stored #1 snapshot and put it in service. That way we'd be able to recover from the failure of either of the two shard servers, but not from both failing simultaneously. If wiki-beta, fulltext1, and fulltext2 are on three separate circuits, then we can operate through a power outage on any one of them.
We should package the OL server and a mini-database for easy installation (maybe as an LVM container) on a desktop workstation. That will let us do some development locally instead of tying up cluster nodes running dev services (e.g. the stuff I currently use apollonius for). So the table above lets go of a couple of dev machines. In practice we may be able to run some dev tasks on the production backup servers but we shouldn't run such tasks on production primaries.
I also propose running the bibliographic search API on the backup for the main solr server and vice versa, i.e. when both servers are up we would handle API and web requests on separate machines. That is because of the potentially higher level of API traffic which on the other hand would be less cpu-intensive (since we don't facet API results in the current design). Note that web and API search indexes are the same, so there would be just one solr on each box despite it being used for two sort-of-different services. Note also that API outages are likely to cause more drama than short-term web service outages, so this scheme generally puts more priority on keeping API services running.
In the event that a black box really does fail unrecoverably, it can be replaced with a red box.
This number of machines is getting large enough to start needing more serious automatic management than we currently have. That's outside the scope of this page. Also not addressed is remote replication, which we also ought to think about. Also not currently addressed is improved monitoring (nagios, webmetrics) since it is not a hardware issue, but it is closely enough related that it should probably be added. Other software issues such as failover implementation within the application code are considered separate (maybe add another section for this).
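The fulltext backup scheme described above (static shard #1 on fulltext1, growing shard #2 on fulltext2, wiki-beta holding a cold copy of #1 and a live mirror of #2) can be sketched as a small decision function. This is only an illustration in Python: the function names and the representation of failures as a set of host names are assumptions, and the real failover logic would live in the application code, which this page treats as a separate issue.

```python
BACKUP = "wiki-beta"
PRIMARIES = {"shard1": "fulltext1", "shard2": "fulltext2"}


def fulltext_plan(down):
    """Return a mapping shard -> serving host (or None if unserved).

    down -- set of host names that have failed.

    The backup can stand in for one shard at a time: it already mirrors
    shard #2, and it can be switched to its stored shard #1 snapshot.
    The page does not cover both primaries failing at once, so in that
    case this sketch leaves both shards unserved rather than guess.
    """
    plan = {
        shard: (None if host in down else host)
        for shard, host in PRIMARIES.items()
    }
    lost = [shard for shard, host in plan.items() if host is None]
    if len(lost) == 1 and BACKUP not in down:
        plan[lost[0]] = BACKUP
    return plan


def fully_served(plan):
    """True if every shard has a serving host."""
    return all(host is not None for host in plan.values())
```

With any single one of the three machines down, `fully_served(fulltext_plan(down))` is true, which matches the claim that the service survives a power outage on any one of three separate circuits.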

Further notes:

Anand suggests sharing a single backup between solr and psql, reducing the number of machines by 1, though at the cost of somewhat longer recovery time and some search performance.
Thanks to Anand, Ralf, and DanH for their helpful discussion and suggestions.

New hardware

ia311503 - phr's dev machine
ia311504 - Edward's dev machine
ia331506 - Anand's dev machine and staging server (dev.openlibrary.org)
ia3315[07-11] - solr
ia3315[25-26] - production database
ia3315[32-33] - production webapp (fastcgi)

History

Created July 25, 2008
68 revisions

May 4, 2009	Edited by raj	add new hardware
August 5, 2008	Edited by phr	Edited without comment.
July 25, 2008	Created by 207.241.238.217	new page