This is the Borg. Resistance is futile.
A median Borg cell is 10K machines. And all those machines in a cell are served by a logically centralized control: the Borgmaster.
Where is the bottleneck in the centralized Borg architecture? The paper says it is still unclear whether this architecture would hit a practical scalability limit. Anytime Borg was given a scalability target, they managed to achieve it by applying basic techniques: caching, loose-synchronization, and aggregation.
What helped the most for achieving scalability was decoupling the scheduler component from the Borgmaster. The scheduler is loosely-synchronized with the Borgmaster: it operates on a cached cached copy of the cell state and acts as a counsel/advisor to the Borgmaster. If the scheduler makes a decision that is not feasible (because it is based of an outdated state: machine failed, resource gone, etc.), the Borgmaster will not take that advice and ask the scheduler to reschedule the job this time hopefully with better up-to-date state.
To provide high-availability, the Borgmaster is Paxos-replicated over 5 machines. Replicas serve read-only RPC calls to reduce the workload on the Borgmaster leader. In addition to the Paxos log, there is also periodic checkpoints/snapshots to restore the Borgmaster's state to an arbitrary point in the past. A fauxmaster can also use this functionality in debugging of the Borgmaster and scheduling performance.
A Borglet is the local Borg agent on every machine in a cell. (In Mesos this corresponds to the Mesos slave, or in the new terminology the Mesos agent.) Borgmaster replica runs a stateless link shard to handle the communication with some subset of borglets. The link shard aggregates and compresses and reports only diffs to the state machines to reduce update load at the elected master.
Jobs and tasks
A job consists of many tasks (which are same binary programs). 50% of machines run 9+ tasks, and 90%ile machine has ~25 tasks and run ~4500 threads.
Google's Borg workload consists of 2 main categories. Production jobs are long running services serving short user requests and they require low-latency. Batch jobs on the other hand are less-sensitive to performance fluctuations. The workload has dynamic surges: batch jobs come and go, and productions jobs have a diurnal pattern. (A representative Borg workload trace is publicly available.) Borg needs to handle this dynamic demand while providing as high utilization of the cluster machines as possible.
It turns out tight-packing scheduling is not optimal for high-utilization, because it is too strict and fails to accommodate for bursty loads and misestimations from Borg clients. Instead a hybrid packing is used, which provides 5% better packing efficiency than the tight-packing/best-fit policy. Borg uses priorities for tasks. If a machine runs out of resources to accommodate its assigned tasks (e.g., due to burst in demands), lower priority tasks on that machine are killed and added to the scheduler's pending queue for re-placement.
Users operate on jobs by issuing remote procedure calls (RPCs) to Borg, most commonly from a command-line tool or from other Borg jobs. To help users manage their jobs, Borg provides declarative job specification language, and job monitoring/management tools. Borg uses the concept of allocation set for a job, which corresponds to the concept of pod in Kubernetes.
Task startup latency at a machine is about 25seconds, 20 sec of which is package installation time. To reduce the latency from package installation, Borg tries to schedule tasks where the packages are already available. In addition, Borg employs tree and torrent-like protocols to distributes packages to machines in parallel. Finally, Borg also tries to schedule tasks to reduce correlation of failures for a given job.
Almost every task contains a builtin HTTP server that publishes health and performance info. Borg monitors the health-check URL and restarts tasks that fail to respond.