
Building web-platform-tests.live

2019-08-19

Creative Commons License This presentation is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Laysan albatross and Midway sunset is licensed CC BY-NC 2.0 by USFWS - Pacific Region

server layout / error recovery / submission previews

Legacy deployment

.-- W3C server --.
|    wptserve    |
'----------------'

+ publicly accessible

- TLS certificate maintained manually and at cost

- regularly offline

- recovery required human intervention

Why?

Fundamentally flawed

wptserve is built on Python's SimpleHTTPServer

Warning: SimpleHTTPServer is not recommended for production. It only implements basic security checks.

https://docs.python.org/2.7/library/simplehttpserver.html

Initial redesign

.- Amazon EC2 instance --.
|        systemd         |
|  wptserve    Certbot   |
'------------------------'

+ Free and automated certificate renewal thanks to Let's Encrypt (via Certbot)

+ Improved uptime thanks to systemd

- Still regularly falling offline (just recovering faster)

Second redesign

.----- GCP instance -----. .----- GCP instance -----.
| .- Docker container -. | | .- Docker container -. |
| |      wptserve      | | | |      wptserve      | |
| '--------------------' | | '--------------------' |
'------------------------' '------------------------'
.----- GCP instance -----.
| .- Docker container -. |
| |      Certbot       | |
| '--------------------' |
'------------------------'
* GCP - Google Cloud Platform

+ Improved uptime further

- Significantly more complex

Each container is actually running a number of processes, all managed by the supervisord init system. That's typically frowned upon by Docker users, so the rationale is included in the project documentation.

Distributed certificate management

 *Let's Encrypt*                                         *GitHub*
        |                                                    |
[TLS certificate]                                    [WPT source code]
        |                                  .------------.    |
        V                              .-->| wpt server |<---+
 .--------------.    +++++++++++++++   |   '------------'    |
 | cert-renewer |--->+ certificate +---+                     |
 '--------------'    +    store    +   |   .------------.    |
                     +++++++++++++++   '-->| wpt server |<---'
                                           '------------'
Legend
                .---.            +++++
* * external    |   | GCE        +   + object    [ ] message
    service     '---' instance   +++++ store         contents

The server runs on multiple Google Compute Engine ("GCE") instances deployed in parallel. Many of the web-platform-tests exercise the semantics of HTTP itself, so load balancing is performed at the TCP level to avoid interfering with the traffic under test.

In addition to serving the web-platform-tests, each server performs a few tasks at a regular interval. These include:

  • fetching the latest revision of the web-platform-tests project from the canonical git repository hosted on GitHub.com
  • fetching TLS certificates from the internally-managed object store

When any of these periodic tasks completes, the web-platform-tests server process is restarted in order to apply the changes.
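
The periodic behavior described above can be sketched roughly as follows. This is an illustration only, not the project's actual implementation: the function names, the 60-second default interval, and the idea that each task reports whether it changed anything (so that restarts only happen when needed) are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable


def run_cycle(tasks: Iterable[Callable[[], bool]],
              restart: Callable[[], None]) -> bool:
    """Run every periodic task once and restart the server if needed.

    Each task is a hypothetical callable (for example "fetch the latest
    WPT revision" or "fetch the TLS certificate") that returns True when
    it changed something on disk.
    """
    changed = False
    for task in tasks:
        # Always run the task, even if an earlier one already reported a change.
        changed = bool(task()) or changed
    if changed:
        restart()
    return changed


def run_forever(tasks: Iterable[Callable[[], bool]],
                restart: Callable[[], None],
                interval_seconds: float = 60.0) -> None:
    """Repeat the cycle at a fixed interval (the interval is an assumption)."""
    tasks = list(tasks)
    while True:
        run_cycle(tasks, restart)
        time.sleep(interval_seconds)
```

In a real deployment the tasks would shell out to git and to the object store client, and `restart` would signal the process supervisor; those details are deliberately left out here.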

A separate Google Compute Engine instance interfaces with the Let's Encrypt service to retrieve TLS certificates for the WPT servers. It integrates with Google Cloud Platform's DNS management in order to prove ownership of the system's domain name. It stores the certificates in a Google Cloud Platform Storage bucket for retrieval by the web-platform-tests servers.

Layer 1: Process Failure

container   GCE Instance
    |             |
    x             |
                err!
    .---restart---'
    |             |
    |           okay

If the WPT server fails (as indicated by its process exiting), then Docker, running on the Google Compute Engine instance, automatically restarts the container.

Restarting the container completely refreshes runtime state, and this is expected to resolve many potential problems in the deployment.

Layer 2: Machine failure

container   GCE Instance    GCE Managed Group
    |             |                 |
    x             x                 |
                                  err!
                  .-----restart-----'
                  |                 |
    .---restart---'                 |
    |             |                 |
    |           okay                |
    |             |                 |
    |             |               okay

In the case of the web-platform-tests server, an additional layer of error recovery is provided via a Google Cloud Platform "health check." If the Google Compute Engine instance fails to respond to HTTP requests, then it will be destroyed and a new one created in its place. That new instance will subsequently create a Docker container to run the WPT server.

This second recovery mechanism guards against more persistent problems, e.g. those stemming from state on disk (since even a running GCE instance will fail HTTP health checks if restarting the Docker container has no effect).
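
The decision a health check makes can be sketched as below. This is a simplified model for illustration, not Google Cloud Platform's implementation: the function names, the five-second timeout, and the threshold of three consecutive failures are assumptions chosen for the example.

```python
import urllib.error
import urllib.request
from typing import Sequence


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the server answers an HTTP request with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error statuses all count
        # as a failed probe.
        return False


def should_replace(results: Sequence[bool], unhealthy_threshold: int = 3) -> bool:
    """Decide whether the instance should be destroyed and recreated.

    `results` is the probe history, oldest first. The instance is only
    replaced once the most recent `unhealthy_threshold` probes have all
    failed, so a single dropped request does not trigger a rebuild.
    """
    recent = list(results)[-unhealthy_threshold:]
    return len(recent) == unhealthy_threshold and not any(recent)
```

Because the probe only looks at HTTP responses, it does not matter why the server is unresponsive: a crashed container that Docker cannot revive looks the same as a wedged machine, and both lead to a fresh instance.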

The feature

----------------------------- GitHub ------------------------------
                |                          |        |        |
            [master]                  [pr#13451]    |        |
                |                          |   [pr#13452]    |
                |                          |        |    [ etc. ]
                v                          v        v        v
-------------------------- w3c-test.org ---------------------------
                v                          v        v        v
editors, implementors & developers          WPT contributors

w3c-test.org automatically publishes the contents of many patches that are submitted to the project through GitHub.

We had to replicate this feature before our system would be considered a viable replacement.

It's a fundamentally insecure feature because patches may include arbitrary Python code, and we have to run that.

As the canonical location to run the tests on the web, we expect this deployment to be referenced from web specifications, wpt.fyi, and more. We want it to be as stable as possible.

----------------------------- GitHub ------------------------------
                |                          |        |        |
            [master]                  [pr#13451]    |        |
                |                          |   [pr#13452]    |
                |                          |        |    [ etc. ]
                v                          v        v        v
---- web-platform-tests.live -----    --- web-platform-tests.pr ---
                v                          v        v        v
editors, implementors & developers          WPT contributors

One of the strongest decisions we made was to deploy it to a separate server. Instability resulting from untrusted patches can only annoy people contributing; it won't diminish availability for the wider audience.

Legacy system

Contributor           GitHub.com    w3c-test.org        git repository
     |                     |              |                    |
     '---[pull request]---.|              |                    |
                          v               |                    |
                          '--[web hook]--.|                    |
                                         v                     |
                                         '------[git fetch]----.
                                         .---------------------'
                                         |                     |
                                         V

Main problems:

  • trust - in order to prevent exploits or denial-of-service attacks, GitHub and w3c-test.org need to share a secret
  • state - the set of currently-deployed pull requests is known only to w3c-test.org; if it goes down, then the operator has to perform a fair amount of analysis to determine which pull requests should be mirrored by the replacement
  • scalability - if the responsibility of running the server is to be shared by multiple machines, then we'd need still more complexity to ensure that each message from GitHub.com is routed to all machines
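
To make the "trust" problem concrete: a web hook receiver typically confirms that a request really came from GitHub by checking an HMAC of the request body, computed with the shared secret. The following is a minimal sketch, assuming the "sha256=" followed by a hex digest format that GitHub documents for web hook signatures. The function names are hypothetical, and this is not the code that w3c-test.org ran.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the signature value for a web hook payload."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest


def is_authentic(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check a received signature header against the shared secret.

    compare_digest runs in constant time, which avoids leaking how many
    leading characters of a forged signature were correct.
    """
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_header.encode("utf-8"))
```

Every machine that receives web hooks needs a copy of the secret, which is part of why this design becomes harder to operate as the deployment grows.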

Redesigned system

Contributor           GitHub.com  git repository   web-platform-tests.live
     |                     |             |                    |
     |                     |            .------[git fetch]----'
     |                     |            '---------------------.
     '---[pull request]---.|             |                    |
                          v              |                    |
                          '--[git tag]--.|                    |
                                        v                     |
                                        |                     |
                                        .------[git fetch]----'
                                        '---------------------.
                                                              V
(fetching continues on a regular interval)
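
Because the set of pull requests to preview is now recorded in the git repository itself, any server can work out what it should be mirroring from nothing more than a fetch. A rough sketch of that reconciliation step follows. The tag naming scheme (`refs/tags/preview-pr-<number>`) and the function names are hypothetical, invented for this example; the real project may name its refs differently.

```python
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical naming scheme for the tags that mark a pull request for preview.
PREVIEW_REF = re.compile(r"^refs/tags/preview-pr-(\d+)$")


def wanted_pull_requests(ref_names: Iterable[str]) -> Set[int]:
    """Extract pull request numbers from the refs seen after a fetch."""
    wanted = set()
    for name in ref_names:
        match = PREVIEW_REF.match(name)
        if match:
            wanted.add(int(match.group(1)))
    return wanted


def plan_sync(ref_names: Iterable[str],
              deployed: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (pull requests to start mirroring, pull requests to remove)."""
    wanted = wanted_pull_requests(ref_names)
    current = set(deployed)
    return sorted(wanted - current), sorted(current - wanted)
```

A freshly created server simply starts with an empty `deployed` set, so recovering from a failure requires no operator analysis, and adding more servers requires no message routing: each one polls the repository independently.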

A small extension

wpt-server-submissions.Dockerfile:

FROM web-platform-tests-live-wpt-server-tot
COPY src/mirror-pull-requests.sh /usr/local/bin/
COPY src/supervisord-pull-requests.conf /etc/supervisor/conf.d/

+ Easier to maintain than a standalone implementation

+ Safer than branching on runtime flags

This server could have been built completely independently of the "tot" (or "tip-of-tree") server. There would have been a lot of duplication, though, and that's hard to maintain.

Alternatively, we could have built a single server that had all functionality, and simply disabled the "submission preview" part in the "tot" deployment. Runtime flags are too easy to toggle, so this would be susceptible to accidental enabling of the "submission preview" functionality.

Docker and supervisord both offer clean extension mechanisms. That allows us to define a distinct image for the submissions server in terms of the "tot" server image.
