<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Kyle Brandt</title>
    <link>http://kbrandt.com/</link>
    <description>Recent content on Kyle Brandt</description>
    <generator>Hugo -- gohugo.io</generator>
    <copyright>Copyright (c) 2015, Kyle Brandt; all rights reserved.</copyright>
    <lastBuildDate>Wed, 16 Mar 2016 16:29:41 -0400</lastBuildDate>
    <atom:link href="http://kbrandt.com/index.xml" rel="self" type="application/rss+xml" />
    
    <item>
      <title>False Dichotomies of Huge Dickensians</title>
      <link>http://kbrandt.com/post/dickensian/</link>
      <pubDate>Wed, 16 Mar 2016 16:29:41 -0400</pubDate>
      
      <guid>http://kbrandt.com/post/dickensian/</guid>
      <description>&lt;p&gt;I&amp;rsquo;m not talking about the author. Young, and some not-so-young, engineers sometimes have positive associations with being disrespectful, dismissive, or otherwise a &lt;strong&gt;Huge Dickensian&lt;/strong&gt;.&lt;/p&gt;

&lt;div class=&#34;embed video-player center&#34; style=&#34;width: 600px&#34;&gt;
&lt;iframe class=&#34;youtube-player&#34; type=&#34;text/html&#34;
    src=&#34;http://www.youtube.com/embed/79zvkvajQEk?start=70&#34;
    allowfullscreen frameborder=&#34;0&#34;&gt;
&lt;/iframe&gt;
&lt;/div&gt;

&lt;p&gt;This is a shame, because it hurts their careers and makes people around them unhappy. So I&amp;rsquo;d like to dispel some ideas I&amp;rsquo;ve held, or seen other people hold, that are just not true.&lt;/p&gt;

&lt;h1 class=&#34;small-caps&#34;&gt; People will think I&#39;m Smarter if I&#39;m a Jackass to Others&lt;/h1&gt;

&lt;p&gt;In both fiction and reality there are geniuses that have a reputation for being rude to others. To name a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sheldon Cooper (TV: Big Bang Theory)&lt;/li&gt;
&lt;li&gt;Dr House (TV)&lt;/li&gt;
&lt;li&gt;Linus Torvalds (At least at times)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are role models that create stereotypes, and ones that many of us who were into computers growing up can relate to. Based on that stereotype it seems to follow that &amp;ldquo;I&amp;rsquo;m smart, and maybe if I act like these people, people will think I&amp;rsquo;m a genius&amp;rdquo;. It does not. Likely the only people you will fool are the ones that don&amp;rsquo;t matter. Instead, people will think you are just insecure, and will only want to work with you if they have to.&lt;/p&gt;

&lt;h1 class=&#34;small-caps&#34;&gt;I&#39;d rather hire someone smart that is a jerk than someone nice but useless&lt;/h1&gt;

&lt;p&gt;This idea follows from the above stereotype. If you ask people this question you will get different answers and hear some good stories. Faced with that dilemma I would probably have to say &amp;ldquo;It depends.&amp;rdquo; But generally it is a false dilemma. The best places to work are picky, and will hire people who are smart and also not intolerable. In other words, they can get people with both traditional intelligence and emotional intelligence.&lt;/p&gt;

&lt;p&gt;There are of course exceptions, but generally these are extreme and if you think you are one of those exceptions, you are probably wrong.&lt;/p&gt;

&lt;h1 class=&#34;small-caps&#34;&gt;I want to be direct with people, and that often means I have to be mean&lt;/h1&gt;

&lt;p&gt;This one is understandable, as telling hard truths to people without being mean is difficult. However, people are often aware of your motivation when you do this. Generally you want to change some behavior that is making your life difficult. But in addition to that there can be a desire to help that person and see them grow, or a desire to put them down to make yourself feel better. If it is the latter, they will know.&lt;/p&gt;

&lt;p&gt;You don&amp;rsquo;t have to be bubbly and outgoing. People like Spock, but that is because they know he isn&amp;rsquo;t mean: he is analytical and doesn&amp;rsquo;t take pleasure in seeing other people fail. Another way to think about this is the difference between being &lt;a href=&#34;http://www.artofmanliness.com/2013/02/12/how-to-be-assertive/&#34;&gt;assertive vs aggressive&lt;/a&gt;.&lt;/p&gt;

&lt;h1 class=&#34;small-caps&#34;&gt;If you are smart I wouldn&amp;rsquo;t have to explain myself&lt;/h1&gt;

&lt;p&gt;I think the opposite is true. Smart people will think of multiple possible scenarios based on vague communication. People also won&amp;rsquo;t think you are too smart to explain yourself; they will just think you are not articulate, or that you haven&amp;rsquo;t actually thought out your ideas and are just relying on heuristics.&lt;/p&gt;

&lt;p&gt;All of this doesn&amp;rsquo;t apply to people that actually have disorders related to social interaction, but rather reflects some naive thoughts that I bet most of us held at one point. If these thoughts ring true, there are some simple, cheap books that can get you on the right path, such as &amp;ldquo;&lt;a href=&#34;https://en.wikipedia.org/wiki/How_to_Win_Friends_and_Influence_People&#34;&gt;How to Win Friends and Influence People&lt;/a&gt;&amp;rdquo; and &amp;ldquo;&lt;a href=&#34;https://en.wikipedia.org/wiki/Nonviolent_Communication&#34;&gt;Nonviolent Communication&lt;/a&gt;&amp;rdquo;. You&amp;rsquo;re smart, right? So it won&amp;rsquo;t take much time to learn to be nice, and it will help your career.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Conflating The Roles of Alerting and Dashboards</title>
      <link>http://kbrandt.com/post/alert_status/</link>
      <pubDate>Fri, 04 Mar 2016 08:36:42 -0500</pubDate>
      
      <guid>http://kbrandt.com/post/alert_status/</guid>
      <description>

&lt;p&gt;Monitoring dashboards and alerts have distinct roles in monitoring. Problems arise when people fail to form a contradistinction between them. Both are communication tools, but they communicate different things. Monitoring dashboards communicate the current status of systems and their attributes, whereas alerts communicate abnormal status &lt;em&gt;that requires human intervention&lt;/em&gt;. Currently at Stack Overflow, we use &lt;a href=&#34;http://grafana.org&#34;&gt;Grafana&lt;/a&gt; and &lt;a href=&#34;https://github.com/opserver/Opserver&#34;&gt;Opserver&lt;/a&gt; as dashboards and &lt;a href=&#34;http://bosun.org&#34;&gt;Bosun&lt;/a&gt; for alerting. Bosun has a dashboard that shows the current alerts, but when I refer to dashboards I mean things like Grafana and Opserver.&lt;/p&gt;

&lt;h1 id=&#34;mistake-1-using-alerts-as-status-dashboards:e8c548cef83497a5beb46d7f55e2761d&#34;&gt;Mistake 1: Using Alerts as Status Dashboards&lt;/h1&gt;

&lt;p&gt;When alerts are used as if they were status dashboards, it leads to alert desensitization, or &lt;a href=&#34;https://en.wikipedia.org/wiki/Alarm_fatigue&#34;&gt;Alarm Fatigue&lt;/a&gt;. If you have alerts coming in that are just informational, it is human nature to start to ignore them due to &lt;a href=&#34;https://en.wikipedia.org/wiki/Sensory_gating&#34;&gt;Sensory Gating&lt;/a&gt; (and anecdotally, it is not uncommon for people to literally filter them with something like Gmail when alerts arrive as email). Even if discipline is good, people sometimes won&amp;rsquo;t even see them anymore, since our brains naturally filter out repeated information.&lt;/p&gt;

&lt;p&gt;The underlying philosophy is that too many alerts are just as bad as missed alerts. This is expressed well in &lt;a href=&#34;http://www.kitchensoap.com/2013/07/22/owning-attention-considerations-for-alert-design/&#34;&gt;Owning Attention: Considerations for Alert Design&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://kbrandt.com/images/alarm_noise.jpg&#34; style=&#34;width: 50%;&#34; alt=&#34;Alert Noise Diagram&#34;&gt;&lt;/p&gt;

&lt;h2 id=&#34;example-within-this-philosophy:e8c548cef83497a5beb46d7f55e2761d&#34;&gt;Example within this Philosophy&lt;/h2&gt;

&lt;p&gt;As an example, take a 30-node cluster where it is common for a couple of nodes to be down at a time. The number of nodes that are out, and which nodes they are, should be in a dashboard. However, that shouldn&amp;rsquo;t generate alerts, because it is normal behavior. Rather, you should have a dashboard that shows all this information, and only alert when the behavior deviates from its normal pattern and requires human intervention. Ideally the alert notification has most of the contextual information you would see in the status dashboard, but in practice web dashboards provide more capability than most notification mediums, like email, can. Alerting strategies in this case include the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster vs host based alerts: Notifications can be sent per cluster or per node in the cluster (I call this &amp;ldquo;scope&amp;rdquo;). Both scopes are valid; if your nodes tend to fail in groups, a broader scope (cluster based) is generally better, so there are fewer notifications that can lead to alarm fatigue.&lt;/li&gt;
&lt;li&gt;Maintenance aware alerts: If nodes normally only go out during maintenance, alerts should only trigger when nodes are not in maintenance (and maybe if too many nodes are in maintenance). Maintenance can be identified by tracking maintenance as a metric, or by notifying the monitoring system of maintenance.&lt;/li&gt;
&lt;li&gt;Number or percentage of down nodes: You can alert only if N or more nodes are down, if a certain percentage of nodes in the cluster are down, and/or if the number of down nodes is abnormal compared to history.&lt;/li&gt;
&lt;li&gt;Abnormal host downtime: If nodes normally are only down for certain periods of time, you could alert on the node or cluster if a node has been down for longer than the expected duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What combination of the above to use depends on your specific environment. It is also possible that you &lt;em&gt;do&lt;/em&gt; want to alert whenever any node in the cluster is down at all. The right balance is when you actually react to almost all of the alerts. If you are getting alerts that you don&amp;rsquo;t do anything about, they are informational and are more appropriate for dashboards.&lt;/p&gt;
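
&lt;p&gt;As a sketch, the &amp;ldquo;N or more nodes down&amp;rdquo; strategy might look something like the following Bosun rule. The metric, tag, and threshold here are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;alert mycluster.nodes.down {
    # Hypothetical metric: 1 when a node has reported in recently, 0 otherwise
    $up = max(q(&#34;sum:mycluster.node.up{host=*}&#34;, &#34;5m&#34;, &#34;&#34;))
    # Transpose the per-host results into a single series so down hosts can be counted
    $down = sum(t($up &lt; 1, &#34;&#34;))
    # Only page a human when three or more nodes are down at once
    crit = $down &gt;= 3
}
&lt;/code&gt;&lt;/pre&gt;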

&lt;h1 id=&#34;mistake-2-relying-on-alerting-or-dashboards-at-the-exclusion-of-the-other:e8c548cef83497a5beb46d7f55e2761d&#34;&gt;Mistake 2: Relying on Alerting or Dashboards at the Exclusion of the Other&lt;/h1&gt;

&lt;p&gt;The downfall of depending on dashboards to show you abnormal status is that it requires someone to be looking at the dashboard and interpreting it correctly. If you have the human resources for this it can make sense in certain situations, but as a rule of thumb, if you notice something on a dashboard that needed human action and didn&amp;rsquo;t get an alert for it, then an alert needs to be created.&lt;/p&gt;

&lt;p&gt;Dashboards are vital to monitoring, and one shouldn&amp;rsquo;t depend solely on alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visualizations provide a lot of condensed information about systems rapidly&lt;/li&gt;
&lt;li&gt;Dashboards help build operators&amp;rsquo; mental models of systems. By watching normal behavior and internalizing it, operators understand the system better and are more prepared for unexpected emergency situations.&lt;/li&gt;
&lt;li&gt;It is faster to create dashboards that show a lot of information than it is to create and tune alerts that are low in noise.&lt;/li&gt;
&lt;li&gt;Humans are very good at picking up anomalies visually; doing this with alerting code is more difficult. Dashboards can help you discover alerts that are missing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Status dashboards and alerting are essential companions in monitoring, but they need to be used correctly; when they are treated as one-to-one mappings of each other in purpose, you end up limiting both your dashboarding and alerting capabilities.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Stack Overflow&#39;s Bosun Architecture</title>
      <link>http://kbrandt.com/post/bosun_arch/</link>
      <pubDate>Tue, 01 Mar 2016 14:49:12 -0500</pubDate>
      
      <guid>http://kbrandt.com/post/bosun_arch/</guid>
      <description>

&lt;p&gt;Bosun is a significant piece of our monitoring infrastructure at Stack Overflow. In forums such as &lt;a href=&#34;http://bosun.org/slackInvite&#34;&gt;Bosun&amp;rsquo;s Slack Chat Room&lt;/a&gt;, we are frequently asked to describe how Bosun is set up in our environment. This is a detailed post about our architecture surrounding Bosun. It is not the only way to set things up, and may not always be the best way either &amp;ndash; but it is our current setup.&lt;/p&gt;

&lt;h2 id=&#34;the-pieces:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;The Pieces&lt;/h2&gt;

&lt;p&gt;There are a lot of components we use with Bosun, but they all serve their purposes in our monitoring infrastructure. The following table outlines them.&lt;/p&gt;

&lt;table&gt;
    &lt;tr&gt;
        &lt;td rowspan=4&gt;&lt;a href=&#34;http://bosun.org&#34;&gt;Bosun&lt;/a&gt;&lt;/td&gt;
        &lt;td rowspan=4&gt;Alerting System&lt;/td&gt;
        &lt;td&gt;Expression language for evaluating time series from OpenTSDB, Graphite, Elastic, and InfluxDB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Templates for rich notifications: HTML, Graphs, Tables, CSS-inlining&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Web interface for viewing alerts, writing expressions, graphing, creating alerts and templates, and testing alerts over history.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;A store for metric metadata and string data about tags (for example, host IPs, Serial numbers, etc)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td rowspan=2&gt;&lt;a href=&#34;http://bosun.org/scollector/&#34;&gt;scollector&lt;/a&gt;&lt;/td&gt;
        &lt;td rowspan=2&gt;Collection Agent&lt;/td&gt;
        &lt;td&gt;Runs on Windows and Linux. Polls local OS and applications via APIs; also polls external services via SNMP, ICMP, etc.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;With no configuration, monitors anything it auto-discovers (IIS, Redis, Elastic, etc.); configuration is rarely needed.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;https://github.com/bretcope/BosunReporter.NET&#34;&gt;BosunReporter.NET&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;App Metrics&lt;/td&gt;
        &lt;td&gt;Sends application metrics to Bosun.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td rowspan=2&gt;&lt;a href=&#34;http://opentsdb.net/&#34;&gt;OpenTSDB&lt;/a&gt;&lt;/td&gt;
        &lt;td rowspan=2&gt;Time Series Database Daemon&lt;/td&gt;
        &lt;td&gt;Takes time series data and stores it in HBase&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Time series are tagged, so it is a multi-dimensional time series database&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td rowspan=2&gt;&lt;a href=&#34;https://github.com/opserver/Opserver&#34;&gt;Opserver&lt;/a&gt;&lt;/td&gt;
        &lt;td rowspan=2&gt;Monitoring dashboard&lt;/td&gt;
        &lt;td&gt;Pulls from SolarWinds Orion or Bosun&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Also includes in-depth, independent monitoring and control of Redis, SQL Server, HAProxy, and Elastic&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;http://grafana.org&#34;&gt;Grafana&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;User Dashboards&lt;/td&gt;
        &lt;td&gt;Uses the &lt;a href=&#34;https://github.com/grafana/grafana-plugins/tree/master/datasources/bosun&#34;&gt;Grafana Bosun plugin&lt;/a&gt; to create graphs and tables via Bosun expressions&lt;/td&gt;
    &lt;/tr&gt;
        &lt;tr&gt;
        &lt;td&gt;TPS&lt;/td&gt;
        &lt;td&gt;Parses Web Logs&lt;/td&gt;
        &lt;td&gt;Sends raw logs to SQL and summaries in time series format to Bosun.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td rowspan=2&gt;&lt;a href=&#34;https://github.com/bosun-monitor/bosun/tree/master/cmd/tsdbrelay&#34;&gt;tsdbrelay&lt;/a&gt;&lt;/td&gt;
        &lt;td rowspan=2&gt;Relay Time Series and Meta Data&lt;/td&gt;
        &lt;td&gt;Cross replicates time series and meta data across data centers&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Creates denormalized OpenTSDB metrics for faster querying speed&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td rowspan=3&gt;&lt;a href=&#34;https://hbase.apache.org/&#34;&gt;HBase&lt;/a&gt;&lt;/td&gt;
        &lt;td rowspan=3&gt;Where OpenTSDB Stores Data&lt;/td&gt;
        &lt;td&gt;&lt;a href=&#34;http://hadoop.apache.org/&#34;&gt;Hadoop&lt;/a&gt;: The distributed file system (HDFS), MapReduce framework, and Cluster Management (Zookeeper).&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;https://hbase.apache.org/&#34;&gt;HBase&lt;/a&gt;: A clone of Google&#39;s Bigtable that runs on top of Hadoop&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;https://cloudera.com/products/cloudera-manager.html&#34;&gt;Cloudera Manager&lt;/a&gt;: What we use to manage HBase&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;http://www.haproxy.org/&#34;&gt;HAProxy&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;Load Balancer Used at Stack&lt;/td&gt;
        &lt;td&gt;We use it in front of Bosun, OpenTSDB daemons, and tsdbrelay&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;http://redis.io/&#34;&gt;Redis&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;In-memory Store&lt;/td&gt;
        &lt;td&gt;&lt;em&gt;Optionally&lt;/em&gt; used by Bosun to store state, metadata, metric indexes, etc. If Redis is not used, a built-in Go implementation called LedisDB is used&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;a href=&#34;https://www.elastic.co&#34;&gt;Elastic&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;Document Searching&lt;/td&gt;
        &lt;td&gt;Where we send system logs. Can be queried with Bosun expressions. The Bosun Grafana plugin can use this method to graph time series info about the logs as well.&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;

&lt;h3 id=&#34;metric-and-query-flow:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Metric and Query Flow&lt;/h3&gt;

&lt;p&gt;Metrics are collected many ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our homegrown applications calculate their own metrics and send them to Bosun. C# programs
use &lt;a href=&#34;https://github.com/bretcope/BosunReporter.NET&#34;&gt;BosunReporter.NET&lt;/a&gt; and Go programs use the &lt;a href=&#34;https://godoc.org/bosun.org/collect&#34;&gt;Go collect package&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;scollector gathers stats from the OS and from services it auto-discovers.&lt;/li&gt;
&lt;li&gt;Network equipment and other devices must be polled from a third-party machine since they cannot run scollector. We designate two hosts to run scollector with polling mode enabled. They collect data via SNMP (network devices) and ICMP pings, and can be extended via plug-ins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No matter how data is collected, it is not sent directly to Bosun. The collectors send to tsdbrelay instances, which have two purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They relay the data to two different Bosun clusters (more on that later).&lt;/li&gt;
&lt;li&gt;They denormalize certain datapoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The collectors do not actually send directly to tsdbrelay, either. We front-end tsdbrelay with a load balancer (HAProxy) so that we can scale horizontally.&lt;/p&gt;
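
&lt;p&gt;A minimal sketch of what such a front-end might look like in HAProxy (the names, addresses, and port here are hypothetical; our real config is linked under Reference Configs below):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Collectors point at this listener instead of at any single relay
listen tsdbrelay
    bind :4242
    mode http
    balance roundrobin
    server relay1 10.x.x.1:4242 check
    server relay2 10.x.x.2:4242 check
&lt;/code&gt;&lt;/pre&gt;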

&lt;p&gt;We also front-end the Bosun service itself for the same reason. One Bosun server is in read-only mode, and flipping between the two requires us to sync the state and take the other Bosun out of read-only mode. OpenTSDB performance is improved if all write requests go to the same node, and our HAProxy setup allows us to funnel all write requests to a single node when all nodes are up.&lt;/p&gt;

&lt;p&gt;Many different systems query the time series database. We call this the &amp;ldquo;query flow&amp;rdquo;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana: Queries OpenTSDB using Bosun expressions via &lt;code&gt;/api/expr&lt;/code&gt;. It can also query Elastic directly, as well as query Elastic via: Grafana -&amp;gt; Bosun -&amp;gt; Elastic.&lt;/li&gt;
&lt;li&gt;Opserver: Queries Bosun&amp;rsquo;s &lt;code&gt;/api/host&lt;/code&gt; endpoint and also queries OpenTSDB directly.&lt;/li&gt;
&lt;li&gt;Bosun itself: when running alerts, or when users interact with the UI, it queries OpenTSDB or Elastic (if configured)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&#34;http://kbrandt.com/images/metric_query_flow.svg&#34; alt=&#34;Metric and Query Flow Diagram&#34; /&gt;&lt;/p&gt;

&lt;h2 id=&#34;our-opentsdb-and-hbase-setup:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Our OpenTSDB and HBase Setup&lt;/h2&gt;

&lt;p&gt;In our main datacenter in NY we have two HBase clusters. One is a three-node cluster running on NY-TSDB0{1,2,3}. The other is a local replica running on NY-BOSUN01. The local replica is for backups (see How we Backup below).&lt;/p&gt;

&lt;p&gt;In our secondary datacenter (named &amp;ldquo;CO&amp;rdquo;) we have a three-node cluster that is a mirror of the NY-TSDB cluster. The HBase clusters don&amp;rsquo;t replicate across datacenters; rather, tsdbrelay relays the datapoints to each independent data store.&lt;/p&gt;

&lt;p&gt;We use an HDFS replication factor of 3 in the two main HBase clusters. The HBase replica is currently a single machine, so it has no HDFS replication.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://kbrandt.com/images/hbase.svg&#34; alt=&#34;HBase Diagram&#34; /&gt;&lt;/p&gt;

&lt;p&gt;We use Cloudera Manager to manage our OpenTSDB clusters. We have found that HBase &lt;em&gt;can be&lt;/em&gt; stable, but it is difficult for a shop that doesn&amp;rsquo;t happen to use it anywhere else.&lt;/p&gt;

&lt;h3 id=&#34;the-numbers:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;The Numbers&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;3.7 billion datapoints a day (~43,000 datapoints a second) per cluster&lt;/li&gt;
&lt;li&gt;~8 Gigabytes a day of HDFS unreplicated growth per cluster (24GB replicated)&lt;/li&gt;
&lt;li&gt;Currently ~ 7TB of replicated storage consumed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&#34;opentsdb-appends-vs-compaction:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;OpenTSDB Appends vs Compaction&lt;/h3&gt;

&lt;p&gt;HBase was very unstable when we first started using it in 2015.  The main thing that improved our stability was to switch to OpenTSDB&amp;rsquo;s &amp;ldquo;appends model&amp;rdquo; instead of the &amp;ldquo;hourly compactions model&amp;rdquo;.  (Note: These are unrelated to &lt;em&gt;HBase compactions&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;OpenTSDB stores time series as metric+tags in hourly rows. By storing them per hour, OpenTSDB only needs to store the delta (the offset from the base hour) for each datapoint, which is more space-efficient.&lt;/p&gt;
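
&lt;p&gt;As a rough illustration of the layout (field widths and encoding simplified), all datapoints in an hour share one row, and each stores only its offset from the row&amp;rsquo;s base hour:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;row key:  os.cpu + 12:00:00 (base hour) + host=ny-web01
columns:  +72s  -&gt; 13.2   (datapoint at 12:01:12)
          +135s -&gt; 12.9   (datapoint at 12:02:15)
&lt;/code&gt;&lt;/pre&gt;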

&lt;p&gt;In the appends model, storing a new datapoint only requires appending to its row. However, this requires HBase to read the existing data block and then write it out again with the new datapoint appended, which generates a lot of network traffic.&lt;/p&gt;

&lt;p&gt;In the compaction model, OpenTSDB writes the data in a less efficient form, but one that requires less network bandwidth. Once each hour it then rewrites the data in the more compact delta format. As a result, the network activity is reduced to an hourly burst instead of constant reading and rewriting.&lt;/p&gt;

&lt;p&gt;In this graph you can see the Append Model&amp;rsquo;s network traffic load versus Hourly Compaction&amp;rsquo;s greatly reduced network load.&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://kbrandt.com/images/appends_compact.jpg&#34; alt=&#34;Appends vs Compaction Graphs&#34; /&gt;&lt;/p&gt;

&lt;p&gt;This network traffic was unexpectedly large considering that our input into OpenTSDB (gzip-compressed HTTP JSON) is on average 4 Mbit/sec. The appends model can be seen in the hourly seasonality of the above graph: as a row grows over the hour, so does the size of the data being re-written and replicated via HDFS and HBase replication.&lt;/p&gt;

&lt;h3 id=&#34;zookeeper:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Zookeeper&lt;/h3&gt;

&lt;p&gt;We found that for stability we had to greatly increase the timeouts around ZooKeeper:&lt;/p&gt;

&lt;p&gt;ZooKeeper:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;tickTime: 30000
maxSessionTimeout: 600000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;HBase:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;zookeeper.session.timeout: 600000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With low ZooKeeper timeouts, if operations or garbage collection took a long time, HBase servers would eject themselves, taking the cluster down.&lt;/p&gt;

&lt;h3 id=&#34;denormalization-query-speed-and-last:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Denormalization, Query Speed, and Last&lt;/h3&gt;

&lt;p&gt;HBase only has one index. OpenTSDB is optimized around the metric. So for example, querying all the metrics about a host (cpu, mem, etc) is slower than a query against a single metric for all hosts. The &lt;a href=&#34;http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html&#34;&gt;HBase schema for OpenTSDB reflects this&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;00000150E22700000001000001
&#39;----&#39;&#39;------&#39;&#39;----&#39;&#39;----&#39;
metric  time   tagk  tagv
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Since the metric and time fields are the start of the key, and the key is the only thing indexed, the more tags you have on a metric, the more rows that have to be scanned to find your particular host. This reduces query speed.&lt;/p&gt;

&lt;p&gt;However, host-based views are common. In this schema, they have to query multiple metrics, and if those metrics have the &lt;code&gt;host=something&lt;/code&gt; tagset, the queries get slower as you add more hosts. Combine that with the number of rows that need to be scanned for longer durations, and the queries become untenable.&lt;/p&gt;

&lt;p&gt;As an example, &lt;code&gt;os.cpu&lt;/code&gt; has only one tag key (host), with about 500 possible values in our environment. Querying one year of (downsampled) data for a single host in this metric takes about 20 seconds. If you add more tags it only gets worse (e.g. &lt;code&gt;os.net.bytes&lt;/code&gt;, which has host, interface, and direction).&lt;/p&gt;

&lt;p&gt;We work around this query speed issue when generating views in two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Redis stores the last value (the last two for counters) for all series. When generating these views, we can get current values for anything nearly instantly. This data is used in Bosun&amp;rsquo;s &lt;code&gt;/api/host&lt;/code&gt; endpoints and can also be queried directly at &lt;code&gt;/api/last&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Denormalize the metrics&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To denormalize metrics we put the host name in the metric, and only give it a single tag key / value pair (at least one tag key with a value is required). We use the following argument to tsdbrelay to do this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-denormalize=os.cpu__host,os.mem.used__host,os.net.bytes__host,os.net.bond.bytes__host,os.net.other.bytes__host,os.net.tunnel.bytes__host,os.net.virtual.bytes__host
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This results in a metric like &lt;code&gt;__ny-web01.os.cpu&lt;/code&gt;. A year&amp;rsquo;s worth of data (also downsampled) is queried in ~100 milliseconds as compared to 20 seconds for the normalized metric.&lt;/p&gt;
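
&lt;p&gt;In an OpenTSDB/Bosun query, only the metric name changes, but the denormalized form narrows the row scan to a single host up front (the host name and downsample here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Normalized: scans the rows of all ~500 hosts, then filters on the host tag
q(&#34;sum:1h-avg:os.cpu{host=ny-web01}&#34;, &#34;1y&#34;, &#34;&#34;)

# Denormalized: the metric itself identifies the host
q(&#34;sum:1h-avg:__ny-web01.os.cpu{host=ny-web01}&#34;, &#34;1y&#34;, &#34;&#34;)
&lt;/code&gt;&lt;/pre&gt;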

&lt;h3 id=&#34;hbase-replication:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;HBase Replication&lt;/h3&gt;

&lt;p&gt;We have struggled some with HBase replication. The secondary cluster has had to be rebuilt when we found corruption, and also when we upgraded to SSDs. In our most recent case of replication breaking, we found that &lt;a href=&#34;http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/&#34;&gt;force splitting&lt;/a&gt; some of the unbalanced regions on the secondary cluster seems to have fixed it (you can see the size imbalance via the HBase web GUI).&lt;/p&gt;

&lt;p&gt;When replication breaks, the logs grow rapidly, and drain at a much slower rate than they grow:&lt;/p&gt;

&lt;p&gt;&lt;img src=&#34;http://kbrandt.com/images/hbase_replication_space.jpg&#34; alt=&#34;HBase Replication and Space Used&#34; /&gt;&lt;/p&gt;

&lt;p&gt;Since replication backing up uses so much disk space, we have to factor in headroom for this when sizing the HBase clusters.&lt;/p&gt;

&lt;h3 id=&#34;hardware:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Hardware&lt;/h3&gt;

&lt;p&gt;NY-TSDB0{1,2,3} (Per Server)&lt;/p&gt;

&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Area&lt;/th&gt;
            &lt;th&gt;Details&lt;/th&gt;
            &lt;th&gt;Utilization&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tr&gt;
        &lt;td&gt;Model&lt;/td&gt;
        &lt;td&gt;Dell R620&lt;/td&gt;
        &lt;td&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;CPU&lt;/td&gt;
        &lt;td&gt;2x Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz&lt;/td&gt;
        &lt;td&gt;~15%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Ram&lt;/td&gt;
        &lt;td&gt;128 GB Ram (8x 16GB Chips)&lt;/td&gt;
        &lt;td&gt;34 GB RSS,  Rest Used in File Cache&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Disk&lt;/td&gt;
        &lt;td&gt;2 Spinny OS Disks, 8x SSDs in JBOD (INTEL SSDSC2BB480G4)&lt;/td&gt;
        &lt;td&gt;Avg Read: 1.6 MByte/sec, Avg Write: 57 MByte/sec&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Network&lt;/td&gt;
        &lt;td&gt;Redundant 10Gigabit&lt;/td&gt;
        &lt;td&gt; ~500 MBit/sec&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;NY-BOSUN02 (Also HBase Replica, see previous diagram)&lt;/p&gt;

&lt;table&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Area&lt;/th&gt;
            &lt;th&gt;Details&lt;/th&gt;
            &lt;th&gt;Utilization&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tr&gt;
        &lt;td&gt;Model&lt;/td&gt;
        &lt;td&gt;Dell R620&lt;/td&gt;
        &lt;td&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;CPU&lt;/td&gt;
        &lt;td&gt;2x Intel(R) Xeon(R) CPU E5-2643 v2 @ 3.50GHz&lt;/td&gt;
        &lt;td&gt;~15%&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Ram&lt;/td&gt;
        &lt;td&gt;64 GB Ram (4x 16GB Chips)&lt;/td&gt;
        &lt;td&gt;20 GB RSS,  Rest Used in File Cache&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Disk&lt;/td&gt;
        &lt;td&gt;2 Spinny OS Disks, 8x SSDs in JBOD (INTEL SSDSC2BB480G4)&lt;/td&gt;
        &lt;td&gt;Avg Read: 1.1 MByte/sec, Avg Write: 33 MByte/sec&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
        &lt;td&gt;Network&lt;/td&gt;
        &lt;td&gt;Redundant 10Gigabit&lt;/td&gt;
        &lt;td&gt; ~300 MBit/sec&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;

&lt;h3 id=&#34;reference-configs:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Reference Configs&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/kylebrandt/8c09c87fd3147fb51bc3&#34;&gt;opentsdb.conf&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/kylebrandt/14acacd867d67382629a&#34;&gt;tsdbrelay CLI options (In NY)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/kylebrandt/733572edb66c27894f22&#34;&gt;haproxy config (In NY)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/kylebrandt/04c0658828e616b5d8c5&#34;&gt;Cloudera HBase Non-Defaults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/kylebrandt/89a98f30680b799a6317&#34;&gt;Cloudera HDFS Non-Defaults&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/kylebrandt/85218ae078e57a8d3088&#34;&gt;Cloudera Zookeeper Non-Defaults&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&#34;how-we-deploy:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;How we deploy&lt;/h2&gt;

&lt;p&gt;The CI framework we use is called &lt;a href=&#34;https://www.jetbrains.com/teamcity/&#34;&gt;TeamCity&lt;/a&gt;. We use it to build packages for scollector and tsdbrelay, and to deploy Bosun binaries and configurations directly into production. TeamCity calls each configuration a &amp;ldquo;build&amp;rdquo;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scollector and tsdbrelay (Linux): We make RPM packages using &lt;a href=&#34;https://github.com/jordansissel/fpm&#34;&gt;fpm&lt;/a&gt; combined with &lt;a href=&#34;https://github.com/StackExchange/blackbox/blob/master/tools/mk_rpm_fpmdir&#34;&gt;this script&lt;/a&gt; to easily wrap binaries into RPMs. The script reads &lt;code&gt;mk_rpm_fpmdir.*.txt&lt;/code&gt; to manufacture input to FPM.&lt;/li&gt;
&lt;li&gt;scollector (Windows): A TeamCity build produces a binary and does the right thing so that it is distributed to our Windows hosts.&lt;/li&gt;
&lt;li&gt;Bosun is deployed directly by TeamCity. We do this because we tend to build pretty often and use branches, so the lag of RPM would be annoying to our developers.&lt;/li&gt;
&lt;li&gt;Our Bosun configuration is kept in a Git repository. TeamCity triggers on changes to this repo, runs tests, pushes the new config to production, then restarts Bosun.  (Bosun currently doesn&amp;rsquo;t have live configuration reloading).&lt;/li&gt;
&lt;/ul&gt;
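
&lt;p&gt;As a rough sketch, wrapping a single binary into an RPM with fpm boils down to an invocation like the following (the version, prefix, and file layout here are illustrative, not our actual build configuration):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: package a scollector binary as an RPM.
# -s dir reads input from the filesystem; -t rpm produces an RPM.
fpm -s dir -t rpm \
    -n scollector -v 0.0.1 \
    --prefix /opt/scollector \
    scollector
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The script linked above automates this kind of invocation so that adding a new packaged binary is mostly a matter of editing a text file.&lt;/p&gt;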

&lt;h2 id=&#34;how-we-backup:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;How we Backup&lt;/h2&gt;

&lt;p&gt;There are three things in and around Bosun that are backed up:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bosun&amp;rsquo;s Configuration File&lt;/li&gt;
&lt;li&gt;Bosun&amp;rsquo;s State and Metadata Storage&lt;/li&gt;
&lt;li&gt;The Time Series Data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The configuration file is stored in Git, which is backed up independently (we use &lt;a href=&#34;https://about.gitlab.com/&#34;&gt;GitLab&lt;/a&gt; internally). We now use Redis to store Bosun&amp;rsquo;s state and metadata. Bosun falls back to &lt;a href=&#34;https://github.com/siddontang/ledisdb&#34;&gt;ledis&lt;/a&gt; if no Redis server is provided, but we haven&amp;rsquo;t explored how that would be backed up, as it exists for small test setups. In Redis we use both RDB and AOF for &lt;a href=&#34;http://redis.io/topics/persistence&#34;&gt;redis persistence&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDB gives us snapshots of the datastore that can be restored easily (even for local development).&lt;/li&gt;
&lt;li&gt;AOF gives us persistence across restarts of Redis.&lt;/li&gt;
&lt;/ul&gt;
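
&lt;p&gt;As a minimal sketch, the relevant &lt;code&gt;redis.conf&lt;/code&gt; settings for this dual-persistence setup look like the following (the thresholds are illustrative, not our production values):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# RDB: write a snapshot to dump.rdb if at least 1000 keys changed in 60 seconds
save 60 1000

# AOF: log every write command so state survives a restart
appendonly yes
appendfsync everysec
&lt;/code&gt;&lt;/pre&gt;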

&lt;p&gt;Backing up the time series database, in our case OpenTSDB, is one of our pain points. HBase is designed to scale to petabytes, and with that much data a standalone full backup becomes impractical, so HBase isn&amp;rsquo;t really designed for that sort of backup (see &lt;a href=&#34;http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/#export&#34;&gt;Approaches to Backup and Disaster Recovery in HBase&lt;/a&gt;). This is a shame for us: since we don&amp;rsquo;t have petabytes of data, in another system standalone backups might be an option. Therefore we do the following in place of traditional standalone backups:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Datapoint-level replication: tsdbrelay sends all datapoints to both its local and the remote cluster (one in NY, one in CO). Although the two HBase clusters are not exactly consistent with each other (different IDs in the database, missing data when a datacenter is out), they are &amp;ldquo;good enough&amp;rdquo; copies of each other.&lt;/li&gt;
&lt;li&gt;Restore: rotating hourly HBase snapshots.&lt;/li&gt;
&lt;li&gt;Backup cluster in the main datacenter: an HBase replica. (This is distinct from HDFS replication, which happens within a cluster; HBase replication copies data to another cluster.)&lt;/li&gt;
&lt;/ol&gt;
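
&lt;p&gt;For reference, a rotating hourly snapshot comes down to a scheduled HBase shell command roughly like this (the table and snapshot names are illustrative, and pruning old snapshots is scripted separately):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical example: snapshot the OpenTSDB data table once an hour.
echo &#34;snapshot 'tsdb', 'tsdb-hourly-$(date +%Y%m%d%H)'&#34; | hbase shell
&lt;/code&gt;&lt;/pre&gt;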

&lt;p&gt;Without standalone full backups we are frankly a bit afraid of HBase, but we haven&amp;rsquo;t had any significant data losses since we started using it.&lt;/p&gt;

&lt;h2 id=&#34;possible-future-changes:7bc97b814cc5ce0a4508e0dd2bd2e58d&#34;&gt;Possible Future Changes&lt;/h2&gt;

&lt;p&gt;Our largest pain point is using OpenTSDB and HBase. Since we don&amp;rsquo;t use HBase anywhere else, the administrative overhead eats up a lot of time. Also, OpenTSDB has had some long-standing issues that have been difficult for us; for example, &lt;a href=&#34;https://groups.google.com/forum/#!topic/opentsdb/KdgGNK0Y9HQ&#34;&gt;counters don&amp;rsquo;t downsample correctly&lt;/a&gt;, and we keep our data in counter format when possible so we don&amp;rsquo;t lose resolution. That being said, we still feel it is currently our best option. We will be expanding our two main clusters with two more nodes each (running tsdbrelay, OpenTSDB, and region servers) to increase space.&lt;/p&gt;

&lt;p&gt;We are very interested in &lt;a href=&#34;https://influxdata.com/&#34;&gt;InfluxDB&lt;/a&gt;. It is written in Go, which we are familiar with, and has a rapid development cycle. Bosun supports it as a backend, and scollector can send to it since it has an OpenTSDB-style API endpoint. However, we are waiting for InfluxDB to have a more stable clustering story before we consider switching to it in production.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Other writing</title>
      <link>http://kbrandt.com/page/writing/</link>
      <pubDate>Sun, 21 Feb 2016 14:53:00 -0500</pubDate>
      
      <guid>http://kbrandt.com/page/writing/</guid>
      <description>

&lt;h1 id=&#34;server-fault-blog:a65cffa9c73225e9b2db0699034df561&#34;&gt;Server Fault Blog&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2013/11/26/the-importance-of-observability/&#34;&gt;The Importance of Observability&lt;/a&gt; Nov 2013&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2012/07/18/ooda-for-sysadmins/&#34;&gt;OODA For Sysadmins&lt;/a&gt; Jul 2012&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2012/02/07/escaping-the-cycle-of-technical-debt/&#34;&gt;Escaping the Cycle of Technical Debt&lt;/a&gt; Feb 2012&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2011/06/27/per-second-measurements-dont-cut-it/&#34;&gt;Per Second Measurements Don&amp;rsquo;t Cut it&lt;/a&gt; Jun 2011&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2011/05/06/the-limits-of-cost-benefit-analysis-in-it/&#34;&gt;The Limits of Cost-Benefit Analysis in IT&lt;/a&gt; May 2011&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2011/03/16/views-of-the-same-problem-network-admin-dba-and-developer/&#34;&gt;Views of the Same Problem: Network Admin, DBA, and Developer&lt;/a&gt; March 2011&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2011/01/17/know-your-ram/&#34;&gt;Know your RAM&lt;/a&gt; Jan 2011&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://blog.serverfault.com/2010/06/28/747186396/&#34;&gt;Serverfault Trees&lt;/a&gt; Jun 2010&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>talks</title>
      <link>http://kbrandt.com/page/talks/</link>
      <pubDate>Sun, 21 Feb 2016 14:06:24 -0500</pubDate>
      
      <guid>http://kbrandt.com/page/talks/</guid>
      <description>

&lt;h1 id=&#34;presentations:49cbbfcf1124f111a37fcbac0faeb48e&#34;&gt;Presentations&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://vimeo.com/131581326&#34;&gt;Monitorama 2015 (PDX): Bosun Workshop&lt;/a&gt;
This presentation has screencasts of building an alert starting about 15 minutes in. The first 15 minutes explain some of the design principles behind Bosun. After the first screencast of building an alert there are also some example alerts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
&lt;div style=&#34;position: relative; padding-bottom: 56.25%; padding-top: 30px; height: 0; overflow: hidden;&#34;&gt;
  &lt;iframe src=&#34;//player.vimeo.com/video/131581326&#34; style=&#34;position: absolute; top: 0; left: 0; width: 100%; height: 100%;&#34; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;
 &lt;/div&gt;
&lt;br /&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://www.usenix.org/conference/lisa15/conference-program/presentation/brandt&#34;&gt;LISA 2015: Precise Alerting with Bosun&lt;/a&gt; This presentation is similar to Monitorama PDX, but the intro and theme focus more on the importance of monitoring in general.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;a href=&#34;https://www.usenix.org/conference/lisa14/conference-program/presentation/brandt&#34;&gt;LISA 2014: A New Age in Alerting with Bosun: The First Alerting IDE&lt;/a&gt;: This was the first presentation on Bosun, given right after the initial release. It is an introduction to the underlying thoughts on problems with alerting and how Bosun solves them.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&#34;video-tutorials:49cbbfcf1124f111a37fcbac0faeb48e&#34;&gt;Video Tutorials&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.youtube.com/playlist?list=PLWetmRzVkFTdnjRmE-a-JRx2m8qgB6iu9&#34;&gt;Bosun Fundamentals Series&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>projects</title>
      <link>http://kbrandt.com/page/projects/</link>
      <pubDate>Sun, 21 Feb 2016 13:49:54 -0500</pubDate>
      
      <guid>http://kbrandt.com/page/projects/</guid>
      <description>&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://bosun.org&#34;&gt;Bosun&lt;/a&gt; / Time Series Alerting System and IDE developed at Stack Overflow&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://bosun.org/scollector&#34;&gt;scollector&lt;/a&gt; / Monitoring agent developed at Stack Overflow&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/grafana/grafana-plugins/tree/master/datasources/bosun&#34;&gt;Bosun Grafana Plugin&lt;/a&gt; / Allows graphing and tables of Bosun expressions&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>about</title>
      <link>http://kbrandt.com/page/about/</link>
      <pubDate>Sat, 20 Feb 2016 14:44:37 -0500</pubDate>
      
      <guid>http://kbrandt.com/page/about/</guid>
      <description>&lt;p&gt;I&amp;rsquo;m an orginial co-author of the &lt;a href=&#34;http://bosun.org&#34;&gt;Bosun&lt;/a&gt; monitoring system and the &lt;a href=&#34;http://bosun.org/scollector/&#34;&gt;scollector&lt;/a&gt; monitoring agent.&lt;/p&gt;

&lt;p&gt;Currently I work at Stack Overflow as the director of the SRE (Site Reliability Engineering) team.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>