I wanted to index the span tag "error" so that I can filter spans by this tag and create alerts based on it. I tried to add a custom MetricSet. Unfortunately, after I start the analysis, I don't see the check mark action to activate my new MetricSet.
I have followed the instructions on this page:
https://docs.splunk.com/observability/en/apm/span-tags/index-span-tags.html#index-a-new-span-tag-or-...
Hi,
"error" is actually a case where you don't need to index a tag to be able to filter on it. Here is a screen shot of filtering spans where error=true.
And here is an example of filtering traces that contain errors:
PS - The reason it won't allow you to index "error" as an APM MetricSet is that "error" isn't actually a span tag, so there is nothing to index.
Thanks for the quick reply. Yes, I've seen this filter switch in the Trace Analyzer, but I also want to create an alert so that I get notified whenever a trace contains an error span. That isn't possible with the fields that are currently available.
Actually, I have a dashboard where I use the metric traces.count and the auto-generated filter field sf_error:true. I can see the results there, but when I create an alert based on the same metric and filter, it is not triggered.
I use a static threshold condition.
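For reference, the signal and condition are roughly equivalent to this SignalFlow sketch (not my exact saved settings; the metric and the sf_error:true filter are the ones from the dashboard, the "> 0" threshold is just a placeholder):

```
# Sketch of the detector: count error traces and fire on a static threshold.
# 'traces.count' and the sf_error dimension come from the APM dashboard;
# the threshold (> 0) is a placeholder, not the exact saved setting.
errors = data('traces.count', filter=filter('sf_error', 'true')).sum().publish(label='error traces')
detect(when(errors > 0)).publish('Error trace detected')
```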
P.S. You're right, "error" is not a tag. I also tried to index the tag "otel.status_code", but that wasn't possible either.
Hi,
Can you try using service.request.count as your signal (filter by sf_error:true and any other relevant filters) and see if that works?
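In SignalFlow terms that suggestion would look roughly like this (just a sketch; add whatever service or environment filters are relevant for you):

```
# Sketch: same pattern, but using service.request.count as the signal.
# Add your own service/environment filters as needed.
errors = data('service.request.count', filter=filter('sf_error', 'true')).sum().publish(label='errored requests')
detect(when(errors > 0)).publish('Errored requests detected')
```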
This doesn't trigger the alert either. My original alert (with traces.count) was triggered once during my tests, when I had 3 traces with errors within a short time period, but after that it never triggered again.
Is there maybe a better way to create an alert for such single events in Splunk? I think the "static threshold" condition is rather meant for continuous metrics like CPU usage, but I haven't found any other option so far.
Oh, since it triggered for you once but then didn't trigger again, that might be explained by the alert condition never being cleared. This could be even more likely in a test environment with little traffic. The alerts won't fire again until the previous alert condition has been cleared. There is a setting in the alert to automatically clear after X amount of time if that signal isn't reported. You might want to try that setting. Or try generating successful traffic with no errors over the period of time you're detecting on (e.g., past 15 mins).
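If it helps to see it spelled out, that auto-clear idea can also be expressed directly in SignalFlow as an explicit clear condition (a sketch; the metric name, the 15-minute duration, and the zero extrapolation are assumptions on my side):

```
# Sketch: treat periods with no reported error traces as zero so the
# clear condition can actually evaluate, and clear the alert after
# 15 minutes without errors so it can fire again on the next one.
errors = data('traces.count', filter=filter('sf_error', 'true'), extrapolation='zero').sum().publish(label='error traces')
detect(when(errors > 0), off=when(errors < 1, lasting='15m')).publish('Error trace detected')
```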
Yes, I've seen the auto-clear setting and activated it. Still, the alert is not triggered. I think this kind of alert (or alert condition) is not suited for one-off events like "an error occurred in a trace", because there is no metric that goes up and down (like CPU usage). This seems better handled with Log alerts (in Search & Reporting).
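One workaround I'm considering (I'm not sure it's the intended approach): turn the spiky error count into a signal that does go up and down by treating missing data as zero and summing over a rolling window, for example:

```
# Sketch: rolling 15-minute count of error traces. With extrapolation='zero'
# the signal drops back to 0 once the window has passed without errors,
# so the alert can clear and fire again on the next error trace.
errors = data('traces.count', filter=filter('sf_error', 'true'), extrapolation='zero').sum(over='15m').publish(label='error traces (15m)')
detect(when(errors > 0), off=when(errors < 1)).publish('Error trace in the last 15 minutes')
```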
Do you know of another way to create an alert for such single events in Splunk?