Google Cloud Operations, formerly known as Stackdriver Logging and Monitoring, can be confusing to set up. Monitoring something simple is easy, but more complex cases get complicated quickly. One of the more flexible but less obvious types of alert policy in Stackdriver Monitoring is a Logs-Based Metrics policy, which lets you monitor entries in Stackdriver Logging.
Create a Logs-Based Metric
The first step is to go to Stackdriver (Operations) Logging and create a user-defined metric. Go to the Logs Viewer and build a query to return the subset of log entries that you want to monitor (the details of query building are beyond the scope of this article). As of Sept. 2020, I highly recommend enabling the Preview Mode of the Logs Viewer interface, which is much better at guiding you through the process of building the query. If you must use the Classic interface, click the little black arrow at the right side of the filter box, near the top of the Logs Viewer screen. Select “Convert to advanced filter” from the drop-down menu, and create an advanced filter using the query language. Once you have created the right filter (query) that selects the log entries that you want to monitor, proceed to the next step:
- If you’re using Classic mode, click the “Create Metric” button near the top of the screen.
- If you’re using Preview Mode, there is an “Actions” menu just above the “Query Results” menu. Select “Create Metric” from that menu.
Set the name and description for the metric. You can leave “Units” as 1 and “Type” as Counter.
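If you would rather define the metric outside the console, here is a minimal sketch of the request body that, as I understand it, the Logging API's metrics create method accepts for a counter metric. The metric name, description, and filter are hypothetical placeholders, and the field names are my assumption of the LogMetric shape, so check them against the current API reference before relying on this.

```python
import json


def build_counter_metric(name, description, log_filter):
    """Return a LogMetric-style request body for a counter metric.

    Assumption: a counter corresponds to a DELTA / INT64 metric with unit "1",
    mirroring the "Units: 1, Type: Counter" defaults in the console.
    """
    return {
        "name": name,
        "description": description,
        "filter": log_filter,
        "metricDescriptor": {
            "metricKind": "DELTA",
            "valueType": "INT64",
            "unit": "1",
        },
    }


# Hypothetical example: count HTTP 500 responses logged by Compute Engine VMs.
metric_body = build_counter_metric(
    name="http_500_errors",
    description="Count of HTTP 500 responses",
    log_filter='resource.type="gce_instance" AND httpRequest.status=500',
)

print(json.dumps(metric_body, indent=2))
```

The filter string is the same query you built in the Logs Viewer, so you can paste it in unchanged.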
Create an Alert Policy for a Logs-Based Metric
The easiest way to create the alerting policy is from the “Logs-based metrics” page within Stackdriver Logging. Find your new metric in the “User-Defined Metrics” section. Click the three dots to the right and select “Create alert from metric.” You will be taken to the Monitoring interface to set up a condition on a new alert policy. Here’s where things get confusing…
Alert Condition Dialog
Here’s an explanation of what to set for each field:
- Filter: empty. You already set up a filter to select log entries in your user-defined metric, so you don’t need anything here.
- Group By: empty. Again, this is not required for log metrics.
- Aggregator: none or sum. If you’re monitoring a single resource, select “none.” If you have multiple resources logging entries, such as multiple web servers in a cluster, select “sum” so that the metrics for the servers will be added together.
- Period: 1 minute. Since events from different servers don’t happen at the exact same time, the condition aggregates events over a time span. 1 minute is a good starting point for “fast firing” alerts. You may want a longer time span if the condition you’re looking for only shows up over a longer interval.
- Aligner: rate. You generally want to alert on the rate of log entries. For example, you may want to know if some web servers are suddenly returning 500 errors at a much higher rate than usual.
- Secondary Aggregator: none. Not required.
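To make the Period, Aligner, and Aggregator settings more concrete, here is a rough sketch of the arithmetic they describe: count the entries from each server within each period, divide by the period length to get a per-second rate, then add the servers together. This only illustrates the idea and is not how Monitoring is actually implemented; the server names and timestamps are made up.

```python
from collections import defaultdict

PERIOD_SECONDS = 60  # the "Period" field: 1 minute


def aligned_rates(events, period=PERIOD_SECONDS):
    """Group (server, timestamp_seconds) events into fixed windows and
    return {window_start: {server: entries_per_second}} ("rate" aligner)."""
    counts = defaultdict(lambda: defaultdict(int))
    for server, ts in events:
        window_start = (ts // period) * period
        counts[window_start][server] += 1
    return {
        start: {server: n / period for server, n in per_server.items()}
        for start, per_server in counts.items()
    }


def sum_across_servers(rates):
    """Add the per-server rates in each window together ("sum" aggregator)."""
    return {start: sum(per_server.values()) for start, per_server in rates.items()}


# Hypothetical log entries: (server name, seconds since some start time).
events = [
    ("web-1", 5), ("web-1", 20), ("web-2", 30),    # first minute
    ("web-2", 70), ("web-1", 95), ("web-2", 110),  # second minute
]

rates = aligned_rates(events)
combined = sum_across_servers(rates)

for start in sorted(combined):
    print(f"window starting at {start}s: {combined[start]:.3f} entries/sec")
```

With the aggregator set to “none,” you would alert on each server's rate separately instead of on the combined value.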
Finally, set up the alert condition trigger. I also recommend having two alert conditions per policy: a “quick reaction” condition that will alert if the metric goes very high in a short period of time, and a “slow burn” condition that will alert if the metric stays in a “higher-than-average-but-not-critical” state for an extended period of time.
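As a sketch of the two-condition approach, the Python below assembles an alert policy body in the shape I understand the Monitoring API's AlertPolicy resource to use. Every specific here is an assumption for illustration: the metric name, the resource type, and especially the thresholds and durations, which you would need to tune to your own traffic.

```python
import json


def threshold_condition(display_name, metric_name, threshold, duration_s, period_s=60):
    """Build one threshold condition on a user-defined logs-based metric.

    Uses the rate aligner and sum aggregator described in the dialog above.
    """
    return {
        "displayName": display_name,
        "conditionThreshold": {
            # Assumption: user-defined log metrics live under this prefix.
            "filter": (
                f'metric.type="logging.googleapis.com/user/{metric_name}" '
                'AND resource.type="gce_instance"'
            ),
            "aggregations": [
                {
                    "alignmentPeriod": f"{period_s}s",
                    "perSeriesAligner": "ALIGN_RATE",
                    "crossSeriesReducer": "REDUCE_SUM",
                }
            ],
            "comparison": "COMPARISON_GT",
            "thresholdValue": threshold,
            "duration": f"{duration_s}s",
        },
    }


# Hypothetical thresholds, in entries per second.
policy = {
    "displayName": "HTTP 500 error rate",
    "combiner": "OR",  # fire if either condition is met
    "conditions": [
        # Quick reaction: very high rate sustained for one minute.
        threshold_condition("Quick reaction", "http_500_errors", 5.0, 60),
        # Slow burn: moderately elevated rate sustained for thirty minutes.
        threshold_condition("Slow burn", "http_500_errors", 0.5, 1800),
    ],
}

print(json.dumps(policy, indent=2))
```

Combining the conditions with “OR” means either one can open an incident on its own, which is what you want when the two conditions watch for different failure patterns.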