Executive summary

Getting started with Refinery for data sampling involves understanding a few key rules and philosophies for collecting data efficiently. A sample Rules File Template is provided, which includes seven rules that help reduce event volume without much customization. The template follows a philosophy of 'dropping boring data and keeping rare/interesting data' by keeping errors and abnormally long traces while only occasionally keeping fast, successful traces.

The file uses three different samplers (Rules-Based Sampler, EMA Dynamic Sampler, and EMA Throughput Sampler) with rules ordered from most specific to most general. The first match determines what action or sample rate is applied, making it essential to start with keep and drop rules, follow with dynamic sampling rules, and end with a catch-all rule. Each rule serves a unique purpose, such as keeping all status codes >= 500 (Rule 1), dropping health checks (Rule 3), and dynamically sampling non-HTTP requests at a goal rate of 10 (Rule 6).

Getting Started With Refinery: Rules File Template

By Max Aguirre

Sampling is a necessity for applications at scale. We at Honeycomb sample our data through the use of our Refinery tool, and we recommend that you do too. But how do you get started? Do you simply set a rate for all data and a handful of drop and keep rules, or is there more to it? What do these rules even mean, and how do you implement them?

To answer these questions, let’s look at a rules file template that we use for customers when first trying out Refinery.

rules:
  RulesVersion: 2

  Samplers:
    __default__:
      RulesBasedSampler:
        Rules:
          #Rule 1
          - Name: Keep 500 status codes
            SampleRate: 1
            Conditions:
              - Fields: 
                  - http.status_code
                  - http.response.status_code
                Operator: '>='
                Value: 500
                Datatype: int
          #Rule 2
          - Name: Keep where error field exists
            SampleRate: 1
            Conditions:
              - Field: error
                Operator: exists
          #Rule 3
          - Name: drop healthchecks
            Drop: true
            Scope: span
            Conditions:
              - Field: root.http.route
                Operator: starts-with
                Value: /healthz
              - Fields: 
                  - http.status_code
                  - http.response.status_code
                Operator: "="
                Value: 200
                Datatype: int
          #Rule 4
          - Name: Keep long duration traces
            SampleRate: 1
            Scope: span
            Conditions:
              - Field: trace.parent_id
                Operator: not-exists
              - Field: duration_ms
                Operator: ">="
                Value: 5000
                Datatype: int
          #Rule 5
          - Name: Dynamically Sample 200s through 400s
            Conditions:
              - Fields: 
                  - http.status_code
                  - http.response.status_code
                Operator: ">="
                Value: 200
                Datatype: int
            Sampler:
              EMADynamicSampler:
                GoalSampleRate: 10              # This is a sample rate itself
                FieldList:
                  - service.name
                  - root.http.route
                  - http.method
          #Rule 6
          - Name: Dynamically Sample Non-HTTP Request
            Conditions:
              - Field: status_code
                Operator: "<"
                Value: 2
                Datatype: int
            Sampler:
              EMADynamicSampler:
                GoalSampleRate: 10              # This is a sample rate itself
                FieldList:
                  - service.name
                  - grpc.method
                  - grpc.service
          #Rule 7
          - Name: Catchall rule
            Sampler:
              EMAThroughputSampler:
                GoalThroughputPerSec: 500 # This is spans per second for the entire cluster
                UseClusterSize: true # Ensures GoalThroughputPerSec is for the full refinery cluster and not per node
                FieldList:
                  - service.name

This file might look long, but it’s just seven rules that are general enough to help reduce event volume without much customization.

Sampling philosophy

TL;DR: drop boring data, keep rare/interesting data.

Boring data means things like fast, successful requests and health checks. Rare/interesting data would be things like anomalies, errors, unexpectedly slow requests, traces from especially important services or customers, etc. In short, these are things you will take action on, want to alert on, or that otherwise signal a departure from the ideal state.

The above rules file follows this philosophy by keeping errors and abnormally long traces, dropping noisy health checks, and keeping only about one out of every 10 fast, successful traces. That may sound like a lot of data to drop, but because Honeycomb weights query results by each event's sample rate when Refinery is used, your numbers still reflect expected traffic.
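As a concrete illustration: a trace kept with a SampleRate of 10 represents itself plus nine traces that were dropped, so Honeycomb counts it as roughly 10 events in aggregations like COUNT. Keep 100 such traces and your charts still show approximately the 1,000 requests that actually occurred.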


Rules breakdown

This file uses three different samplers (in order): The Rules-Based Sampler, the EMA Dynamic Sampler, and the EMA Throughput Sampler. When a trace is assessed, the rules are reviewed sequentially from top to bottom and the first match (the first rule that applies to the trace) determines what action or sample rate is applied.

As such, we recommend starting with the most specific rules and working down to the most general. This file is no exception: it begins with keep and drop rules, followed by dynamic sampling rules, and ends with a catch-all rule utilizing the EMA Throughput Sampler.
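Ordering matters because evaluation stops at the first matching rule. For example, a trace whose root span returns a 503 is kept by Rule 1 and never reaches the dynamic sampling in Rule 5, even though its status code is also >= 200; if those rules were reversed, that error could be sampled away.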

Rule 1: Errors are (almost) always important. We want to make sure all status codes >= 500 are kept. A SampleRate of 1 means that 1 out of every 1 trace is kept, so traces that match are not sampled. In this rule, the Fields list lets the condition check both http.status_code and http.response.status_code, covering instrumentation that uses either naming convention.

Rule 2: Another error keep rule. We keep all traces that contain an error field, using the exists operator.

Rule 3: Drop all health checks. Health checks are noisy, skew your data, and can lead to event overages, so we want to drop them all. First, we set the scope to span, which means every condition must match on a single span; we do this because we want to be extra careful when dropping data. So, a trace will be dropped if any one span has a root.http.route that starts-with /healthz AND an http.status_code or http.response.status_code of 200. Keep in mind that you may need to modify the field name or the root.http.route value to match your own health check endpoint.
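For example, if your health checks live at a different path, say /ping (used here purely as an illustration), only the Value in the first condition needs to change:

              - Field: root.http.route
                Operator: starts-with
                Value: /ping            # hypothetical path; swap in your service's real health check route

The 200 status condition underneath stays exactly the same.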

Rule 4: Another keep rule (SampleRate: 1). We set the scope to span to require that both conditions be true on a single span. The first condition looks for spans where trace.parent_id does not exist (the not-exists operator), which limits the match to root spans. The next condition is that the duration of that root span (and therefore the entire trace) must be >= 5000ms.

Rule 5: Here, we start to actually sample things! Since status codes >= 500 are already kept by Rule 1, we're going to sample the rest at a goal rate of 10 (meaning roughly one in 10 is kept).

We use the EMA Dynamic Sampler here, which requires us to set a FieldList. This section sets which fields are used to build the sampling key. The key determines when the sampling rate should be increased or decreased based on how frequently it occurs. For example, if key x is represented 100 times in a time window and key y only 10 times, then traces matching key x will be dropped far more often than those matching key y.
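In practice, the key is just the combination of the FieldList values on a trace. With the fields above, a trace from a service named checkout that hit root.http.route /api/cart with http.method POST (values invented for illustration) shares one key, and every other checkout POST to /api/cart is sampled against the same running rate.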

The FieldList should use fields that have some, but not too much, cardinality. When cardinality is too high, everything looks rare and unique, so the sampler will retain more traces than it should. After setting the FieldList, you can check the number of keys being created with VISUALIZE COUNT_DISTINCT(meta.refinery.sampler_key). We advise keeping the cardinality to a combined value of less than 500.
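A quick sanity check is to multiply the rough cardinality of each field: with, say, 20 services, 25 routes, and 5 HTTP methods (illustrative numbers), the theoretical key space is 20 × 25 × 5 = 2,500 combinations. In practice only combinations that actually occur show up, so the observed key count from the query above is usually much lower.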

Rule 6: Same as Rule 5, but with a different condition (non-HTTP requests) and a different FieldList for building the sampling key. These fields better represent the data targeted by the rule.

Rule 7: Lastly, we have our catch-all rule, which uses the EMA Throughput Sampler to target any traces that don't fit the situations above. We use GoalThroughputPerSec here because we don't know what the traffic patterns for the incoming data look like; this works to keep throughput at ~500 spans per second. We also set UseClusterSize to true because we're targeting the entire Refinery cluster rather than each individual node.
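It can help to translate that target into an event budget: 500 spans per second across the cluster works out to 500 × 86,400 ≈ 43.2 million spans per day from the catch-all alone, so adjust GoalThroughputPerSec to fit whatever budget remains after the keep and dynamic rules above.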

Your rules will probably change over time

Refinery rules are meant to evolve as you bring new services online and instrument different parts of your code. While this file is a good place to start, you’ll get the best results by adding and customizing rules to better match your own data.
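As one sketch of that kind of customization: if your spans carry a customer identifier, you could add a keep rule for an especially important customer above the dynamic sampling rules. The field name app.customer_id and value acme-corp below are purely hypothetical; use whatever your own instrumentation emits.

          #Example: keep traces for a high-priority customer
          - Name: Keep traces for an important customer
            SampleRate: 1
            Conditions:
              - Field: app.customer_id   # hypothetical field; match your own instrumentation
                Operator: "="
                Value: acme-corp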

Regularly reviewing your sample rates and the margin of error they introduce, as well as making sure new environments are accounted for, will help keep your sampling in optimal shape.

For help with rules customization or just to chat all things Refinery, check out the #discuss_sampling or #discuss_refinery channels in our community Slack!