It's the metrics, stupid!

Hubert Behaghel

hubert.behaghel@bskyb.com

Disclaimer

Business value? What do you mean?

Features with fewer bugs

Ready for future changes

That actually make customers happy

By not being slow

And because they let them achieve what they want

Efficiently

When is business value delivered?

  • [ ] When the code is checked in
  • [ ] When the code is checked in and tests all pass
  • [ ] When the code is checked in, tested and deployed
  • [X] When it runs and the KPIs say so

Monitoring fact #1

We need to know what our code does when it runs.

What's Monitoring anyway?

The capability to continuously measure business value and code behaviours in production.

map ≠ territory

mind-the-gap.jpg

Do you know which code is faster?

items.sort_by { |i| i.name }
items.sort { |a,b| a.name <=> b.name }

we don't know

class Array
  def sort_by(&blk)
    sleep(100) # FIXME
    super(&blk)
  end

  def sort(&blk)
    raise "Oh Noes!" # :-)
  end
end

we need to know

measure
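In Ruby, that measurement can be as simple as a run of the stdlib Benchmark module; the items here are hypothetical stand-ins with a `name` attribute, like the ones above.

```ruby
require 'benchmark'

# Don't guess which sort is faster: measure it.
Item = Struct.new(:name)
items = Array.new(10_000) { Item.new(rand(100_000).to_s) }

Benchmark.bm(10) do |x|
  x.report('sort_by:') { items.sort_by { |i| i.name } }
  x.report('sort:')    { items.sort { |a, b| a.name <=> b.name } }
end
```

Run it on your Ruby, on your data: the answer depends on both, which is exactly the point.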

when is it too slow for your users?

measure

Justifying an investment in performance

\begin{equation} \frac{cost_{performance\ improvement}}{\Delta_{KPI} \times \Delta_{value/KPI\ unit}\ per\ day} = ROI\ payback\ (in\ days) \end{equation}
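Reading this as a payback period (the cost of the improvement divided by the extra value per day it unlocks), here is a worked example with entirely made-up numbers:

```ruby
# Hypothetical: a latency fix costs £5,000 of engineering time and is
# expected to lift the KPI by 200 extra orders/day at £2 margin each.
cost          = 5_000.0    # cost of the performance improvement, in £
gain_per_day  = 200 * 2.0  # extra value per day it unlocks, in £/day
payback_days  = cost / gain_per_day
puts payback_days # => 12.5
```

If the fix pays for itself in under two weeks, the investment is easy to justify.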

Why doesn't Google show you more results on each page?

  • customers asked for more
  • Google experimented in 2006, increasing the number from 10 to 30
  • traffic and revenue dropped by 20%
    • page load time went from 0.4s to 0.9s
  • Similar observation at Amazon: +100ms of latency => -1% of revenue.

Monitoring fact #2

System performance affects success.

Mind the gap between what customers say they want and what they actually expect

measure

The forgotten Agile practice?

Agile

  • "working software is the primary measure of progress"
  • shared and always up-to-date view of the project
  • focus on business value
  • continuous deployment

Metrics and Monitoring

  • the only way to assess that software is actually working
  • shared and always up-to-date view of how the product is doing
  • correlates business value with system performance
  • the only effective safety net for true continuous deployment

Monitoring fact #3

Monitoring is more important than testing.

because knowing the territory is more important than knowing the map.

Metrics, the ultimate enabler

Metrics help you make better decisions

Metrics can be shared and communicated

Metrics enable experimentation / personalisation / A/B testing

Metrics enable performance management

Once you are able to measure, you should benchmark your solution

Performance testing

nominal latency/throughput under nominal load

aka

put your app on the treadmill and make it run at normal speed.

Load testing

…make it run fast and secure SLAs.

Stress testing

…make it run faster and faster up until it falls on the treadmill.

Notice the speed at which it happens.

Mitigate the consequences and automate recovery.

Resilience testing

…cut one of its legs while it's running and observe.

Again: mitigate the consequences and automate recovery.

Monitoring enables anti-fragile systems
  • Resilience is actually not the real goal.
  • Regenerative systems
  • warrants another talk :-)

Increase your speed

Morpheus-increase-your-speed.jpg

Monitoring Fact #4

You can't own your code if you don't own its metrics.

Monitoring gives you the ultimate power

matrix-ultimate-control.jpg

Auto-rollback
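A hypothetical sketch of what metrics-driven auto-rollback could look like; `deployer` and `metrics` are stand-ins for your real tooling, not any particular library.

```ruby
# Couple a deploy to its own metrics: if the error rate crosses the
# threshold during the watch window, the metrics (not a human) pull
# the trigger and the release rolls itself back.
def deploy_with_auto_rollback(deployer, metrics, threshold: 0.05, checks: 5, pause: 60)
  deployer.deploy
  checks.times do
    sleep pause                 # let real traffic hit the new version
    if metrics.error_rate > threshold
      deployer.rollback
      return :rolled_back
    end
  end
  :kept
end
```

The same shape works for canary releases: route 1% of traffic, watch the metrics, then decide.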

Availability testing

  • Chaos monkey
  • Couple to your monitoring for total control.

Monitoring fact #5

Nothing can beat the level of control you get from your monitoring.

How to…

monitoring-infra.png

Instrument your code

red-pill-blue-pill.jpg

holistic via runtime instrumentation

  • monitoring == configuration
  • zero assumptions
  • all-in-one
  • overhead: ~7% CPU
  • proprietary software only: dynaTrace, New Relic, AppDynamics

bespoke via ad-hoc code

  • pretty much the opposite benefits / drawbacks from the holistic approach.
  • OSS powered: codahale/metrics, statsd, collectd, graphite
  • Used by all the serious players: Yammer, dotCloud, Amazon …and LWS!
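To show how little the bespoke approach needs: a minimal, hypothetical StatsD-style client. The "name:value|type" datagram format is the real statsd wire protocol; the class itself is an illustrative sketch, not the statsd client library.

```ruby
require 'socket'

# Metrics as fire-and-forget UDP datagrams: cheap to emit, and a dead
# metrics server can never take the application down with it.
class MiniStatsd
  def initialize(host = 'localhost', port = 8125)
    @host, @port = host, port
    @socket = UDPSocket.new
  end

  def increment(name)
    send_packet("#{name}:1|c")        # counter
  end

  def timing(name, ms)
    send_packet("#{name}:#{ms}|ms")   # timer
  end

  def gauge(name, value)
    send_packet("#{name}:#{value}|g") # gauge
  end

  private

  def send_packet(data)
    @socket.send(data, 0, @host, @port)
  rescue SystemCallError
    # monitoring must never break the app: drop the packet silently
  end
end
```

A statsd daemon on the other end aggregates these packets and forwards them to graphite.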

define a common language

Examples taken from CodaHale/metrics lib:

Gauge
the instantaneous value of something
metrics.gauge("orders") { orders.size }
Counter
incrementing/decrementing value
val counter = metrics.counter("connection")
counter.inc()
counter.dec()
Meters
average rate of events over a period of time
val requests = metrics.meter("requests", SECONDS)
requests.mark()
Histograms
the statistical distribution of values in a stream of data
val histogram = metrics.histogram("response-size")
histogram.update(response.completions.size)
  • min, max, avg, std-dev: not good enough
  • quantiles: p75, p90, p95, p99, p99.9…
    • reservoir sampling
Timer
a histogram of durations and meter of calls
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }

At ~300 req/sec, our p90 latency jumps from 13ms to 453ms.
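The reservoir sampling mentioned under Histograms fits in a few lines. This is a sketch of the classic Algorithm R: keep a bounded, uniform sample of an unbounded stream, so quantiles stay cheap no matter how many values flow through.

```ruby
# Every element seen so far ends up in the reservoir with equal
# probability size/seen, so quantiles over the sample approximate
# quantiles over the whole stream.
class Reservoir
  attr_reader :values

  def initialize(size, rng: Random.new)
    @size, @rng, @seen, @values = size, rng, 0, []
  end

  def update(value)
    @seen += 1
    if @values.size < @size
      @values << value                 # fill the reservoir first
    else
      j = @rng.rand(@seen)             # 0...seen
      @values[j] = value if j < @size  # replace with probability size/seen
    end
  end

  def quantile(q)
    sorted = @values.sort
    sorted[(q * (sorted.size - 1)).round]
  end
end

r = Reservoir.new(100)
10_000.times { |i| r.update(i) }
p r.quantile(0.9) # roughly 9_000; varies run to run
```

Production histograms typically use a decaying variant so the sample favours recent values, but the core idea is the same.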

Collect

Aggregate

Monitor and alarm

Watch and correlate

  • dashboards
  • graphing tool and search engine

1 service => ~50 metrics exported

Food for thought

Know what you want to measure

  • Think. Learn.
  • Be good at identifying what needs monitoring and with what kind of metrics.

Monitoring an API

Patterns and schemes
  • Structure your metrics into schemes
  • All REST endpoints that only support GET should capture similar metrics.
  • When captured metrics are the same in 2 different endpoints, they should be recorded with the same name.
    • e.g. use "latency", not "duration" or "elapsed" or …
Client-level
  • if multiple clients, track the per client load on each API.
Expose an Is-It-Down monitor
  • It is good practice to integrate your apps with the monitoring of your dependencies
  • Live adaptation in e.g. your retry policy
  • Show a message to the users to signal the degraded service level
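A hypothetical sketch of such a scheme: every GET endpoint is wrapped the same way, so each one records the same metric names ("latency", per-status counters) and dashboards and alarms can be templated across endpoints. `metrics` stands in for whatever recorder you use.

```ruby
# One wrapper, one vocabulary: <endpoint>.latency and
# <endpoint>.status.<code> for every GET endpoint.
def with_get_metrics(metrics, endpoint)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = yield
  elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
  metrics.timing("#{endpoint}.latency", elapsed_ms)
  metrics.increment("#{endpoint}.status.#{response[:status]}")
  response
end
```

With a per-client label added to the endpoint name, the same wrapper gives you the per-client load tracking above.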

Monitoring a website

UX
  • latency, latency vs traffic
  • journey success rate
Analytics
  • collect arbitrary data to do segmentation (customer faceting?)
  • behaviours / intents
Availability
  • backbone & final mile testing
Usual application metrics
Don't divorce front-end and back-end monitoring

Monitoring a mobile experience

Impact of mobile network
Disconnected experience
Native vs Embedded Web Experience

Build and contribute to status.online.sky.com

  • inspired by status.github.com

You are in charge

Metrics Everywhere!

Vote

Are you a slightly better engineer after this talk?

Appendices

Sources and Inspiration

Tools

NFT / Performance management