class: center, middle, albatross # Building
web-platform-tests.live
2019-08-19 (press the
p
key to view presenter's notes)
This presentation is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
.
Laysan albatross and Midway sunset
is licensed
CC BY-NC 2.0
by
USFWS - Pacific Region
--- class: center, middle, albatross **server layout** / error recovery / submission previews
Laysan albatross and Midway sunset
is licensed
CC BY-NC 2.0
by
USFWS - Pacific Region
--- ### **server layout** / error recovery / submission previews ## Legacy deployment .-- W3C server --. | wptserve | '----------------' `+` publicly accessible `-` TLS certificate maintained manually and at cost `-` regularly offline `-` recovery required human intervention Why? --- ### **server layout** / error recovery / submission previews ## Fundamentally flawed `wptserve` is built on Python's `SimpleHTTPServer` > **Warning:** `SimpleHTTPServer` is not recommended for production. It only > implements basic security checks. https://docs.python.org/2.7/library/simplehttpserver.html --- ### **server layout** / error recovery / submission previews ## Initial redesign .- Amazon EC2 instance --. | systemd | | wptserve Certbot | '------------------------' `+` Free and automated certificate renewal thanks to Let's Encrypt (via Certbot) `+` Improved uptime thanks to systemd `-` Still regularly falling offline (just recovering faster) --- ### **server layout** / error recovery / submission previews ## Second redesign .----- GCP instance -----. .----- GCP instance -----. | .- Docker container -. | | .- Docker container -. | | | wptserve | | | | wptserve | | | '--------------------' | | '--------------------' | '------------------------' '------------------------' .----- GCP instance -----. | .- Docker container -. | | | Certbot | | | '--------------------' | '------------------------' * GCP - Google Cloud Platform `+` Improved uptime further `-` Significantly more complex ??? Each container is actually running a number of processes, all managed by the supervisord init system. That's typically frowned upon by Docker users, so the rationale is included in the project documentation. --- ### **server layout** / error recovery / submission previews ## Distributed certificate management *Let's Encrypt* *GitHub* | | [TLS certificate] [WPT source code] | .------------. | V .-->| wpt server |<---+ .--------------. +++++++++++++++ | '------------' | | cert-renewer |--->+ certificate +---+ | '--------------' + store + | .------------. | +++++++++++++++ '-->| wpt server |<---' '------------' Legend .---. +++++ * * external | | GCE + + object [ ] message service '---' instance +++++ store contents ??? The server is run by multiple Google Compute Engine (or "GCE") instances deployed in parallel. Many of the web-platform-tests concern the semantics of the HTTP protocol, so load balancing is provided at the TCP level in order to avoid interference. In addition to serving the web-platform-tests, each server performs a few tasks on a regular interval. These include: - fetching the latest revision of the web-platform-tests project from the canonical git repository hosted on GitHub.com - fetching TLS certificates from the internally-managed object store When any of these periodic tasks complete, the web-platform-tests server process is restarted in order to apply the changes. A separate Google Compute Engine instance interfaces with the Let's Encrypt service to retrieve TLS certificates for the WPT servers. It integrates with Google Cloud Platform's DNS management in order to prove ownership of the system's domain name. It stores the certificates in a Google Cloud Platform Storage bucket for retrieval by the web-platform-tests servers. --- class: center, middle, albatross server layout / **error recovery** / submission previews
Laysan albatross and Midway sunset
is licensed
CC BY-NC 2.0
by
USFWS - Pacific Region
--- ### server layout / **error recovery** / submission previews ## Layer 1: Process Failure container GCE Instance | | x | err! .---restart---' | | | okay ??? If the WPT server fails (as indicated by its process exiting), then Docker as running in the Google Compute Engine Instance will automatically restart the Docker container. Restarting the container completely refreshes runtime state, and this is expected to resolve many potential problems in the deployment. --- ### server layout / **error recovery** / submission previews ## Layer 2: Machine failure container GCE Instance GCE Managed Group | | | x x | err! .-----restart-----' | | .---restart---' | | | | | okay | | | | | | okay ??? In the case of the web-platform-tests server, an additional layer of error recovery is provided via a Google Cloud Platform "health check." If the Google Compute Engine instance fails to respond to HTTP requests, then it will be destroyed and a new one created in its place. That new instance will subsequently create a Docker container to run the WPT server. This second recovery mechanism guards against more persistent problems, e.g. those stemming from state on disk (since even a running GCE instance will fail HTTP health checks if restarting the Docker container has no effect). --- class: center, middle, albatross server layout / error recovery / **submission previews**
Laysan albatross and Midway sunset
is licensed
CC BY-NC 2.0
by
USFWS - Pacific Region
--- ## The feature ----------------------------- GitHub ------------------------------ | | | | [master] [pr#13451] | | | | [pr#13452] | | | | [ etc. ] v v v v -------------------------- w3c-test.org --------------------------- v v v v editors, implementors & developers WPT contributors ??? w3c-test.org automatically publishes the contents of many patches that are submitted to the project through GitHub. We had to replicate this feature before our system would be considered a viable replacement. It's a fundamentally insecure feature because patches may include arbitrary Python code, and we have to run that. As the canonical location to run the tests on the web, we expect this deployment to be referenced from web specifications, [wpt.fyi](https://wpt.fyi), and more. We want it to be as stable as possible. -- ----------------------------- GitHub ------------------------------ | | | | [master] [pr#13451] | | | | [pr#13452] | | | | [ etc. ] v v v v ---- web-platform-tests.live ----- --- web-platform-tests.pr --- v v v v editors, implementors & developers WPT contributors ??? One of the strongest decisions we made was to deploy it to a separate server. Instability resulting from untrusted patches can only annoy people contributing; it won't diminish availability for the wider audience. --- ### server layout / error recovery / **submission previews** ## Legacy system Contributor GitHub.com w3c-test.org git repository | | | | '---[pull request]---.| | | v | | '--[web hook]--.| | v | '------[git fetch]----. .---------------------' | | V -- Main problems: - trust - state - scalability ??? - trust - in order to prevent exploits or denial-of-service attacks, GitHub and w3c-test.org need to share a secret - state - the set of currently-deployed pull requests is known only to w3c-test.org; if it goes down, then the operator has to perform a fair amount of analysis to determine which pull requests should be mirrored by the replacement - scalability - if the reponsibility of running the server is to be shared by multiple machines, then we'd need still more complexity to ensure that each message from GitHub.com is routed to all machines --- ### server layout / error recovery / **submission previews** ## Redesigned system Contributor GitHub.com git repository web-platform-tests.live | | | | | | .------[git fetch]----' | | '---------------------. '---[pull request]---.| | | v | | '--[git tag]--.| | v | | | .------[git fetch]----' '---------------------. V (fetching continues on a regular interval) --- ### server layout / error recovery / **submission previews** ## A small extension `wpt-server-submissions.Dockerfile`: FROM web-platform-tests-live-wpt-server-tot COPY src/mirror-pull-requests.sh /usr/local/bin/ COPY src/supervisord-pull-requests.conf /etc/supervisor/conf.d/ `+` Easier to maintain than a standalone implementation `+` Safer than branching on runtime flags ??? This server could have been built completely standalone from the "tot" or ("tip-of-tree") server. There would have been a lot of duplication, though, and that's hard to maintain. Alternatively, we could have built a single server that had all functionality, and simply disabled the "submission preview" part in the "tot" deployment. Runtime flags are too easy to toggle, so this would be susceptible to accidental enabling of the "submissions preview" functionality. Docker and supervisord both offer clean extension mechanisms. That allows us to define a distinct image for the submissions server in terms of the "tot" (or "tip-of-tree") server. --- class: center, middle, albatross # Thanks! Source code & documentation: https://github.com/bocoup/web-platform-tests.live
Laysan albatross and Midway sunset
is licensed
CC BY-NC 2.0
by
USFWS - Pacific Region