It's the metrics, stupid!

Hubert Behaghel

hubert.behaghel@bskyb.com

Disclaimer

Business value? What do you mean?

Features with fewer bugs

Ready for future changes

That actually make customers happy

By not being slow

And because they let them achieve what they want

Efficiently

When is business value delivered?

  • [ ] When the code is checked in
  • [ ] When the code is checked in and tests all pass
  • [ ] When the code is checked in, tested and deployed
  • [X] When it runs and the KPIs say so

Monitoring fact #1

We need to know what our code does when it runs.

What's Monitoring anyway?

The capability to continuously measure business value and code behaviours in production.

map ≠ territory

mind-the-gap.jpg

Do you know which code is faster?

items.sort_by { |i| i.name }
items.sort { |a,b| a.name <=> b.name }

we don't know

class Array
  def sort_by(&blk)
    sleep(100) # FIXME
    super(&blk)
  end

  def sort(&blk)
    raise "Oh Noes!" # :-)
  end
end

we need to know

measure
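In Ruby, that measurement can be as simple as a run of the stdlib Benchmark module; the items here are hypothetical stand-ins with a `name` attribute, like the ones above.

```ruby
require 'benchmark'

# Don't guess which sort is faster: measure it.
Item = Struct.new(:name)
items = Array.new(10_000) { Item.new(rand(100_000).to_s) }

Benchmark.bm(10) do |x|
  x.report('sort_by:') { items.sort_by { |i| i.name } }
  x.report('sort:')    { items.sort { |a, b| a.name <=> b.name } }
end
```

Run it on your Ruby, on your data: the answer depends on both, which is exactly the point.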

when is it too slow for your users?

measure

Justifying an investment in performance

\begin{equation} \frac{cost_{performance\ improvement}}{\Delta_{KPI} \times \Delta_{value/KPI\ unit}\ per\ day} = ROI\ payback\ (in\ days) \end{equation}
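Reading this as a payback period (the cost of the improvement divided by the extra value per day it unlocks), here is a worked example with entirely made-up numbers:

```ruby
# Hypothetical: a latency fix costs £5,000 of engineering time and is
# expected to lift the KPI by 200 extra orders/day at £2 margin each.
cost          = 5_000.0    # cost of the performance improvement, in £
gain_per_day  = 200 * 2.0  # extra value per day it unlocks, in £/day
payback_days  = cost / gain_per_day
puts payback_days # => 12.5
```

If the fix pays for itself in under two weeks, the investment is easy to justify.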

Why doesn't Google show you more results on each page?

  • customers asked for more
  • Google experimented in 2006, increasing the number from 10 to 30
  • traffic and revenue dropped by 20%
    • page load time went from 0.4s to 0.9s
  • Similar observation at Amazon: +100ms of latency => -1% of revenue.

Monitoring fact #2

System performance affects success.

Mind the gap between what customers say they want and what they actually expect

measure

The forgotten Agile practice?

Agile

  • "working software is the primary measure of progress"
  • shared and always up-to-date view of the project
  • focus on business value
  • continuous deployment

Metrics and Monitoring

  • the only way to assess that software is actually working
  • shared and always up-to-date view of how the product is doing
  • correlates business value with system performance
  • the only effective safety net for true continuous deployment

Monitoring fact #3

Monitoring is more important than testing.

because knowing the territory is more important than knowing the map.

Metrics, the ultimate enabler

Metrics help you make better decisions

Metrics can be shared and communicated

Metrics enable experimentation / personalisation / A/B testing

Metrics enable performance management

Once you are able to measure, you should benchmark your solution

Performance testing

nominal latency/throughput under nominal load

aka

put your app on the treadmill and make it run at normal speed.

Load testing

…make it run fast and secure SLAs.

Stress testing

…make it run faster and faster up until it falls on the treadmill.

Notice the speed at which it happens.

Mitigate the consequences and automate recovery.

Resilience testing

…cut one of its legs while it's running and observe.

Again: mitigate the consequences and automate recovery.

Monitoring enables anti-fragile systems
  • Resilience is actually not the real goal.
  • Regenerative systems
  • warrants another talk :-)

Increase your speed

Morpheus-increase-your-speed.jpg

Monitoring Fact #4

You can't own your code if you don't own its metrics.

Monitoring gives you the ultimate power

matrix-ultimate-control.jpg

Auto-rollback
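A hypothetical sketch of what metrics-driven auto-rollback could look like; `deployer` and `metrics` are stand-ins for your real tooling, not any particular library.

```ruby
# Couple a deploy to its own metrics: if the error rate crosses the
# threshold during the watch window, the metrics (not a human) pull
# the trigger and the release rolls itself back.
def deploy_with_auto_rollback(deployer, metrics, threshold: 0.05, checks: 5, pause: 60)
  deployer.deploy
  checks.times do
    sleep pause                 # let real traffic hit the new version
    if metrics.error_rate > threshold
      deployer.rollback
      return :rolled_back
    end
  end
  :kept
end
```

The same shape works for canary releases: route 1% of traffic, watch the metrics, then decide.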

Availability testing

  • Chaos monkey
  • Couple to your monitoring for total control.

Monitoring fact #5

Nothing can beat the level of control you get from your monitoring.

How to…

monitoring-infra.png

Instrument your code

red-pill-blue-pill.jpg

holistic via runtime instrumentation

  • monitoring == configuration
  • zero assumptions
  • all-in-one
  • overhead: ~7% CPU
  • proprietary software only: dynaTrace, New Relic, AppDynamics

bespoke via ad-hoc code

  • pretty much the opposite benefits / drawbacks from the holistic approach.
  • OSS powered: codahale/metrics, statsd, collectd, graphite
  • Used by all the serious players: Yammer, dotCloud, Amazon …and LWS!
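To show how little the bespoke approach needs: a minimal, hypothetical StatsD-style client. The "name:value|type" datagram format is the real statsd wire protocol; the class itself is an illustrative sketch, not the statsd client library.

```ruby
require 'socket'

# Metrics as fire-and-forget UDP datagrams: cheap to emit, and a dead
# metrics server can never take the application down with it.
class MiniStatsd
  def initialize(host = 'localhost', port = 8125)
    @host, @port = host, port
    @socket = UDPSocket.new
  end

  def increment(name)
    send_packet("#{name}:1|c")        # counter
  end

  def timing(name, ms)
    send_packet("#{name}:#{ms}|ms")   # timer
  end

  def gauge(name, value)
    send_packet("#{name}:#{value}|g") # gauge
  end

  private

  def send_packet(data)
    @socket.send(data, 0, @host, @port)
  rescue SystemCallError
    # monitoring must never break the app: drop the packet silently
  end
end
```

A statsd daemon on the other end aggregates these packets and forwards them to graphite.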

define a common language

Examples taken from CodaHale/metrics lib:

Gauge
the instantaneous value of something
metrics.gauge("orders") { orders.size }
Counter
incrementing/decrementing value
val counter = metrics.counter("connection")
counter.inc()
counter.dec()
Meters
average rate of events over a period of time
val requests = metrics.meter("requests", SECONDS)
requests.mark()
Histograms
the statistical distribution of values in a stream of data
val histogram = metrics.histogram("response-size")
histogram.update(response.completions.size)
  • min, max, avg, std-dev: not good enough
  • quantiles: p75, p90, p95, p99, p99.9…
    • reservoir sampling
Timer
a histogram of durations and meter of calls
val timer = metrics.timer("requests", MILLISECONDS, SECONDS)
timer.time { handle(req, resp) }

At ~300 req/sec, our p90 latency jumps from 13ms to 453ms.
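The reservoir sampling mentioned under Histograms fits in a few lines. This is a sketch of the classic Algorithm R: keep a bounded, uniform sample of an unbounded stream, so quantiles stay cheap no matter how many values flow through.

```ruby
# Every element seen so far ends up in the reservoir with equal
# probability size/seen, so quantiles over the sample approximate
# quantiles over the whole stream.
class Reservoir
  attr_reader :values

  def initialize(size, rng: Random.new)
    @size, @rng, @seen, @values = size, rng, 0, []
  end

  def update(value)
    @seen += 1
    if @values.size < @size
      @values << value                 # fill the reservoir first
    else
      j = @rng.rand(@seen)             # 0...seen
      @values[j] = value if j < @size  # replace with probability size/seen
    end
  end

  def quantile(q)
    sorted = @values.sort
    sorted[(q * (sorted.size - 1)).round]
  end
end

r = Reservoir.new(100)
10_000.times { |i| r.update(i) }
p r.quantile(0.9) # roughly 9_000; varies run to run
```

Production histograms typically use a decaying variant so the sample favours recent values, but the core idea is the same.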

Collect

Aggregate

Monitor and alarm

Watch and correlate

  • dashboards
  • graphing tool and search engine

1 service => ~50 metrics exported

Food for thought

Know what you want to measure

  • Think. Learn.
  • Be good at identifying what needs monitoring and with what kind of metrics.

Monitoring an API

Patterns and schemes
  • Structure your metrics into schemes
  • All REST endpoints that only support GET should capture similar metrics.
  • When captured metrics are the same in 2 different endpoints, they should be recorded with the same name.
    • e.g. use "latency", not "duration" or "elapsed" or …
Client-level
  • if multiple clients, track the per client load on each API.
Expose an Is-It-Down monitor
  • It is good practice to integrate your apps with the monitoring of your dependencies
  • Live adaptation in e.g. your retry policy
  • Show a message to the users to signal the degraded service level
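A hypothetical sketch of such a scheme: every GET endpoint is wrapped the same way, so each one records the same metric names ("latency", per-status counters) and dashboards and alarms can be templated across endpoints. `metrics` stands in for whatever recorder you use.

```ruby
# One wrapper, one vocabulary: <endpoint>.latency and
# <endpoint>.status.<code> for every GET endpoint.
def with_get_metrics(metrics, endpoint)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  response = yield
  elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
  metrics.timing("#{endpoint}.latency", elapsed_ms)
  metrics.increment("#{endpoint}.status.#{response[:status]}")
  response
end
```

With a per-client label added to the endpoint name, the same wrapper gives you the per-client load tracking above.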

Monitoring a website

UX
  • latency, latency vs traffic
  • journey success rate
Analytics
  • collect arbitrary data to do segmentation (customer faceting?)
  • behaviours / intents
Availability
  • backbone & final mile testing
Usual application metrics
Don't divorce front-end and back-end monitoring

Monitoring a mobile experience

Impact of mobile network
Disconnected experience
Native vs Embedded Web Experience

Build and contribute to status.online.sky.com

  • inspired by status.github.com

You are in charge

Metrics Everywhere!

Vote

Are you a slightly better engineer after this talk?

Appendices

Sources and Inspiration

Tools

NFT / Performance management