
Setting up ‘production-issue alerts’ made my life easier. Here’s how.

<div class="field field-name-body field-type-text-with-summary field-label-above"><div class="field-label"></div><div class="field-items"><div class="field-item even" property="content:encoded">Large companies can run hundreds of microservices, some on virtual machines (VMs) and others on Kubernetes, with new features built and deployed every day. Without efficient monitoring, it becomes hard to identify which specific application or critical API is failing.&#xA0;<br><br>
Let&#x2019;s take the complaints automation system in a ride-hailing application, for instance. This system automatically resolves complaints raised by drivers and customers, such as belongings left in the cab or a driver failing to pick up the customer.<br><br><noscript><img alt="Complaints automation system" class="image-retina_ready" height="291" src="https://insights-images.thoughtworks.com/Complaints20automation20system_39be9d0be5d988b27f51ca598a4986e0.png" width="1357"></noscript><img alt="Complaints automation system" class="image-retina_ready" height="291" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="1357" data-lazy="true" data-src="https://insights-images.thoughtworks.com/Complaints20automation20system_39be9d0be5d988b27f51ca598a4986e0.png"><h5 style="text-align: center;"><em>Complaints automation system</em></h5>

<h3>When not monitored, bad things get worse</h3>
The ride-hailing app described above could be going through deployments that involve database migrations and major code rewrites. If not every scenario has been tested, a production deployment could break the system.&#xA0;<br><br>
This could put automation on pause for several hours, and all complaint tickets would have to be redirected for manual verification. Without appropriate monitoring, it falls to agents to report the production issue to the company.
<h3>Automated monitoring</h3>
By capturing events like the one described above in the automation service, we could:&#xA0;

<ul><li>Keep track of the issues being automated</li>
<li>Capture statistics for how many complaints were successfully resolved and how many failed (with 4XX or 5XX errors)</li>
<li>Keep track of response-time for database queries and API requests</li>
<li>Carefully monitor critical APIs by adding alerts for every single failure</li>
<li>Follow system-related metrics like disk space, memory, CPU usage, etc.&#xA0;</li>
</ul>
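As a minimal sketch of the second point in the list above, the snippet below tallies complaint outcomes by HTTP status class. The function names, outcome labels and status codes are hypothetical and only illustrate the idea; a real service would report these counts to its monitoring agent rather than print them.

```python
from collections import Counter


def outcome_for(status_code: int) -> str:
    """Map an HTTP status code to a coarse outcome label (labels are hypothetical)."""
    status_class = status_code // 100
    if status_class == 4:
        return "failed_4xx"
    if status_class == 5:
        return "failed_5xx"
    return "resolved"


def tally(status_codes) -> Counter:
    """Count how many complaints ended in each outcome."""
    return Counter(outcome_for(code) for code in status_codes)


# Hypothetical status codes returned while automating five complaints.
summary = tally([200, 201, 404, 500, 503])
print(dict(summary))
```

Keeping 4XX and 5XX failures in separate buckets makes it easier to tell bad requests apart from failures in the service itself.<br><br>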
With the necessary monitoring in place, production issues can be found and fixed much more quickly, and we also learn how well new features perform in production after each deployment.<br><br><noscript><img alt="Complaints automation system with monitoring setup" class="image-retina_ready" height="684" src="https://insights-images.thoughtworks.com/Complaints20automation20system20with20monitoring20setup_871716065aa7e00641876c3e4e3f0e50.png" width="1236"></noscript><img alt="Complaints automation system with monitoring setup" class="image-retina_ready" height="684" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="1236" data-lazy="true" data-src="https://insights-images.thoughtworks.com/Complaints20automation20system20with20monitoring20setup_871716065aa7e00641876c3e4e3f0e50.png"><h5 style="text-align: center;"><em>Complaints automation system with monitoring setup</em></h5>

<h3>Building effective monitoring systems</h3>
A monitoring system consists of metrics, monitoring and alerting.&#xA0;<br><br>
To build our own monitoring system, we need a collector to gather metrics, a store to keep them, a visualizer to set up dashboards, and an alerter to notify us when something goes wrong. There are multiple ways to monitor applications and systems.&#xA0;<br><br>
This blog discusses the use of the <strong>TIG (Telegraf, Influx and Grafana) stack</strong>, an end-to-end open-source solution for monitoring applications. It has three components – Telegraf for collecting metrics, Influx database for storing data and Grafana for visualization and alerting.<br><br><noscript><img alt="TIG stack monitoring system" class="image-retina_ready" height="936" src="https://insights-images.thoughtworks.com/TIG20Stack_5e179c9ff27db8b32f7c4d57081cf15c.png" width="1600"></noscript><img alt="TIG stack monitoring system" class="image-retina_ready" height="936" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="1600" data-lazy="true" data-src="https://insights-images.thoughtworks.com/TIG20Stack_5e179c9ff27db8b32f7c4d57081cf15c.png"><h3>Why TIG stack?</h3>
There are many other monitoring systems on the market, such as Prometheus and Datadog. Ideally, the choice between them should depend on the scale of the task at hand, whether the system is open source, whether push- or pull-based monitoring is required, and so on.&#xA0;<br><br>
For instance, Prometheus and the TIG stack are both open source. Prometheus is a pull-based system and is known to work well for large-scale requirements. InfluxDB is a push-based system and supports multiple data types.<br><br>
In this article, I&#x2019;m not comparing monitoring systems; I&#x2019;m picking one, the TIG stack, to explain how monitoring systems work.
<h3>Telegraf</h3>
<noscript><img alt="Telegraf, a metrics collecting agent, in a TIG Stack monitoring system" class="image-retina_ready" height="595" src="https://insights-images.thoughtworks.com/Telegraf_4fc1128da9935d368eea33832ecc20a6.png" width="947"></noscript><img alt="Telegraf, a metrics collecting agent, in a TIG Stack monitoring system" class="image-retina_ready" height="595" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="947" data-lazy="true" data-src="https://insights-images.thoughtworks.com/Telegraf_4fc1128da9935d368eea33832ecc20a6.png"><br><br><a href="https://www.influxdata.com/time-series-platform/telegraf/" target="_blank">Telegraf</a> is a metrics-collecting agent optimized for writing to the Influx database. It runs on a VM, or as a pod or a <a href="https://www.magalix.com/blog/the-sidecar-pattern" target="_blank">sidecar</a> on a Kubernetes cluster, next to the workloads that output metrics. It is written in Go and compiles into a single binary with no external dependencies.&#xA0;<br><br>
It&#x2019;s plugin-driven and supports collecting metrics from 100+ popular services. There are four types of plugins:&#xA0;
<ul><li><strong><a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#input-plugins" target="_blank">Input plugins</a></strong>&#xA0;are&#xA0;used to collect metrics from systems, services and third-party APIs. For example, the <a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#postgresql" target="_blank">PostgreSQL</a> plugin is used to get metrics from the Postgres database</li>
<li><strong><a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#output-plugins" target="_blank">Output plugins</a></strong>&#xA0;are used by Telegraf to write the metrics to various destinations. For example, the InfluxDB output plugin sends metrics to InfluxDB</li>
<li><strong><a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#aggregator-plugins" target="_blank">Aggregator plugins</a></strong>&#xA0;are used&#xA0;to create aggregate metrics. For example, <a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#merge" target="_blank">Merge</a> is used to merge multiple metrics into a single multi-field metric, which yields more compact InfluxDB line protocol</li>
<li><strong><a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#processor-plugins" target="_blank">Processor plugins</a></strong>&#xA0;are&#xA0;used to transform, decorate, and filter metrics. For example, the <a href="https://docs.influxdata.com/telegraf/v1.14/plugins/plugin-list/#regex" target="_blank">Regex</a> plugin transforms data based on regular expressions</li>
</ul>
It&#x2019;s extremely easy to add a plugin in Telegraf. Here&#x2019;s the configuration needed to add the &apos;mem&apos; input plugin, which collects metrics about memory usage. It goes in Telegraf&#x2019;s configuration file.

<pre>
# Read metrics about memory usage
[[inputs.mem]]</pre>
<br>
Telegraf can work with both <a href="https://medium.com/@steve.mushero/push-vs-pull-configs-for-monitoring-c541eaf9e927" target="_blank">pull- and push-based models</a> and has plugins for pulling and pushing metrics.&#xA0;<br><br>
In the pull-based model, the monitoring agent periodically pulls metrics from the systems it watches. It pulls data from its targets, formats the metrics into InfluxDB line protocol, and sends them to InfluxDB.&#xA0;<br><br>
In the push-based model, metrics are pushed to the monitoring agent. Telegraf collects metrics from a system such as a database running on a VM, and also pulls data about the VM itself using plugins like cpu and mem. To receive metrics from an application, we use the statsD plugin, which follows the push-based model.
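<br><br>The difference between the two models comes down to who initiates the transfer. The toy agent below is a hypothetical sketch, not Telegraf&#x2019;s implementation: the class, method names and the fixed memory reading are all made up for illustration.

```python
import time


class MiniAgent:
    """Toy collecting agent; hypothetical, not how Telegraf is implemented."""

    def __init__(self):
        self.pull_inputs = {}   # name -> zero-argument function the agent calls
        self.collected = []     # (name, value, unix_time) tuples awaiting output

    def add_pull_input(self, name, read_value):
        self.pull_inputs[name] = read_value

    def collect_once(self):
        """Pull model: the agent asks each input for its current value."""
        for name, read_value in self.pull_inputs.items():
            self.collected.append((name, read_value(), time.time()))

    def receive(self, name, value):
        """Push model: the application hands a value to the agent."""
        self.collected.append((name, value, time.time()))


agent = MiniAgent()
agent.add_pull_input("mem.used_percent", lambda: 42.0)  # stand-in for a real reading
agent.collect_once()                                    # agent-initiated (pull)
agent.receive("ticket.automation.time", 100)            # application-initiated (push)
print([(name, value) for name, value, _ in agent.collected])
```

In the pull case the agent decides when to read; in the push case the application decides when to report.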
<h3>StatsD</h3>
<a href="https://github.com/statsd/statsd" target="_blank">StatsD</a> is a simple daemon to collect and aggregate application metrics and consists of the client, server and backend.&#xA0;<br><br>
In our application code, we invoke the statsD client to send metrics to the statsD daemon, which runs inside Telegraf. Language-specific statsD client libraries are available. For example, Ruby has a statsD client called <a href="https://github.com/Shopify/statsd-instrument" target="_blank">statsd-instrument</a>.&#xA0;<br><br>
By default, a statsD server aggregates metrics for 10 seconds and then flushes them to a backend such as the Influx database.<br><br>
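The aggregate-then-flush behaviour can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the real daemon: the class name and summary fields are hypothetical, only counters (type code c) and timers (type code ms) are handled, and the flush is called by hand instead of on a 10-second timer.

```python
from collections import defaultdict


class MiniAggregator:
    """Toy StatsD-style aggregation; a sketch, not the real daemon."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timers = defaultdict(list)

    def record(self, name, value, metric_type):
        if metric_type == "c":       # counter: add to the running total
            self.counters[name] += value
        elif metric_type == "ms":    # timer: keep every sample until the flush
            self.timers[name].append(value)
        else:
            raise ValueError(f"unsupported metric type: {metric_type}")

    def flush(self):
        """Summarise the current window, then start a fresh one."""
        summary = {name: {"count": total} for name, total in self.counters.items()}
        for name, samples in self.timers.items():
            summary[name] = {
                "count": len(samples),
                "mean": sum(samples) / len(samples),
                "upper": max(samples),
            }
        self.counters.clear()
        self.timers.clear()
        return summary


aggregator = MiniAggregator()
aggregator.record("ticket.automation.failed", 1, "c")
aggregator.record("ticket.automation.failed", 1, "c")
aggregator.record("ticket.automation.time", 100, "ms")
aggregator.record("ticket.automation.time", 300, "ms")
window = aggregator.flush()  # a real server would do this on a timer
print(window)
```

Aggregating before flushing means the backend receives one summary per metric per window instead of one write per event.<br><br>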
The statsD client communicates with the statsD server over UDP, which is fire and forget: our code does not wait for a response, so sending a metric adds very little overhead. The statsD server then pushes metrics to the backend chosen by the project. Each metric is sent in the following format:&#xA0;
<pre>
&lt;metrics_name&gt;:&lt;metrics_value&gt;|&lt;metrics_type&gt;

example:
ticket.automation.time:100|ms</pre>
<br>
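To make the fire-and-forget behaviour concrete, here is a minimal sketch of a client that builds a line in the format above and sends it over UDP using only the Python standard library. The helper names are hypothetical, and the local socket stands in for the statsD server; in a real setup the target would be the host and port where Telegraf&#x2019;s statsD listener runs (8125 is the conventional default, but treat that as an assumption about your configuration).

```python
import socket


def format_metric(name: str, value, metric_type: str) -> str:
    """Build one StatsD line, e.g. 'ticket.automation.time:100|ms'."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire and forget: send the line over UDP without waiting for a reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Stand-in for the StatsD server: a local UDP socket on an OS-chosen free port.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as server:
    server.bind(("127.0.0.1", 0))
    server.settimeout(2.0)
    port = server.getsockname()[1]
    send_metric(format_metric("ticket.automation.time", 100, "ms"), port=port)
    received = server.recvfrom(1024)[0].decode("ascii")

print(received)
```

In application code a client library such as statsd-instrument would normally replace these helpers; the sketch only shows what travels over the wire.<br><br>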
The metric name is also called a bucket. The metric value is the number associated with that name. The metric type can be one of the following:
<ul><li><strong>Timers</strong>&#xA0;(ms) measure&#xA0;the amount of time taken to complete a task&#xA0;</li>
<li><strong>Counters</strong>&#xA0;(c) count&#xA0;how often an event happens. A counter can be incremented or decremented&#xA0;</li>
<li><strong>Gauges</strong>&#xA0;(g) hold an arbitrary value at a point in time. For example, the number of active database connections</li>
</ul><h3>Influx database</h3>
<noscript><img alt="Influx database, a time-series database in a TIG Stack monitoring system" class="image-retina_ready" height="601" src="https://insights-images.thoughtworks.com/Influxdatabase_95b121d2e778aadea9dca3723a4baa5f.png" width="950"></noscript><img alt="Influx database, a time-series database in a TIG Stack monitoring system" class="image-retina_ready" height="601" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="950" data-lazy="true" data-src="https://insights-images.thoughtworks.com/Influxdatabase_95b121d2e778aadea9dca3723a4baa5f.png"><br><a href="https://www.influxdata.com/products/influxdb-overview/" target="_blank">Influx database</a> is a time-series database and has a <a href="https://www.influxdata.com/blog/simplifying-influxdb-retention-policy-best-practices/" target="_blank">retention policy</a> feature to automatically delete old data. Additionally, it&#x2019;s easy to learn because of its SQL-like query language called <a href="https://docs.influxdata.com/influxdb/v1.8/query_language/spec/" target="_blank">InfluxQL</a>.<br><br><a href="https://v2.docs.influxdata.com/v2.0/reference/syntax/line-protocol/" target="_blank">InfluxDB line protocol</a> is the text-based format for writing points (single data records) to the database. Each point provides a measurement, a tag set, a field set and a timestamp.&#xA0;<br><br><br><noscript><img alt="InfluxDB line protocol where the table name is called measurement, indexed data is called tag set, and non-indexed data is called a field." class="image-retina_ready" height="294" src="https://insights-images.thoughtworks.com/InfluxDB20line20protocol_2014eabfc5e57c8cd9e38f2e59f7e427.png" width="1440"></noscript><img alt="InfluxDB line protocol where the table name is called measurement, indexed data is called tag set, and non-indexed data is called a field." class="image-retina_ready" height="294" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="1440" data-lazy="true" data-src="https://insights-images.thoughtworks.com/InfluxDB20line20protocol_2014eabfc5e57c8cd9e38f2e59f7e427.png"><br><br><br>
In an Influx database, the equivalent of a table is called a measurement, indexed data is stored in the tag set, and non-indexed data is stored in fields.
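<br><br>As a rough sketch of how a point maps onto line protocol, the snippet below assembles the measurement, tag set, field set and timestamp into one line. The function names, measurement, tags and fields are hypothetical, and the escaping covers only the common cases (commas, spaces, equals signs and quoted strings), so treat it as an illustration rather than a complete encoder.

```python
def escape_key(text: str) -> str:
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field_value(value) -> str:
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"        # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'         # strings are double-quoted


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Return 'measurement,tag_set field_set timestamp' for one point."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{escape_key(key)}={escape_key(val)}" for key, val in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(key)}={format_field_value(val)}" for key, val in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


line = to_line_protocol(
    "ticket_automation",                             # measurement
    {"service": "complaints", "status": "success"},  # tag set (indexed)
    {"duration_ms": 100, "retried": False},          # field set (not indexed)
    1600000000000000000,                             # timestamp in nanoseconds
)
print(line)
```

Values we want to filter or group by, such as the status, go in tags; values we want to aggregate, such as the duration, go in fields.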
<h3>Grafana</h3>
<noscript><img alt="Grafana is used to visualize metrics in the dashboard and to set up alerts" class="image-retina_ready" height="586" src="https://insights-images.thoughtworks.com/Grafana_ed07b9d462fcb3d7d83ebbef37bb8834.png" width="1600"></noscript><img alt="Grafana is used to visualize metrics in the dashboard and to set up alerts" class="image-retina_ready" height="586" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="1600" data-lazy="true" data-src="https://insights-images.thoughtworks.com/Grafana_ed07b9d462fcb3d7d83ebbef37bb8834.png"><br><br><a href="https://grafana.com/docs/grafana/latest/guides/what-is-grafana/" target="_blank">Grafana</a> is used to visualize metrics in dashboards and to set up alerts. We can create dashboards and graphs for metrics from data sources like the Influx database, <a href="https://prometheus.io/" target="_blank">Prometheus</a>, <a href="https://www.elastic.co/" target="_blank">Elasticsearch</a>, etc.&#xA0;<br><br>
We can also set a threshold for receiving <a href="https://grafana.com/docs/grafana/latest/alerting/rules/" target="_blank">alerts</a>, which can be <a href="https://medium.com/@_oleksii_/grafana-alerting-and-slack-notifications-3affe9d5f688" target="_blank">sent to Slack</a>, email, pagers, etc.<br><br><br><noscript><img alt="A dashboard in grafana" class="image-retina_ready" height="780" src="https://insights-images.thoughtworks.com/Dashboard20in20Grafana_75341975b105cd5b626dfdb4057b22b1.png" width="1600"></noscript><img alt="A dashboard in grafana" class="image-retina_ready" height="780" src="https://static.thoughtworks.com/images/1×1-transparent.gif" width="1600" data-lazy="true" data-src="https://insights-images.thoughtworks.com/Dashboard20in20Grafana_75341975b105cd5b626dfdb4057b22b1.png"><h5 style="text-align: center;"><em>Dashboard in Grafana</em></h5>
<br>
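Conceptually, a threshold alert reduces a window of recent values to a single number and compares it with a limit. The snippet below is only a sketch of that idea, not Grafana&#x2019;s implementation or API: the function name, state labels, sample values and threshold are all hypothetical.

```python
from statistics import mean


def evaluate_rule(window, threshold):
    """Reduce a window of values to its average and compare with the threshold."""
    if not window:
        return "no_data"
    return "alerting" if mean(window) > threshold else "ok"


# Hypothetical failure counts per minute from the automation service.
recent_failures = [0, 1, 7, 9]
state = evaluate_rule(recent_failures, threshold=3)
print(state)
```

Averaging over a window rather than alerting on a single data point helps avoid notifications for brief spikes, although for critical APIs a stricter rule may be more appropriate.<br><br>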
The simplicity of a monitoring system built on the TIG stack lies in its &#x2018;plug and play&#x2019; nature. Grafana can be replaced by <a href="https://www.influxdata.com/time-series-platform/chronograf/" target="_blank">Chronograf</a> and <a href="https://www.influxdata.com/time-series-platform/kapacitor/" target="_blank">Kapacitor</a>. Similarly, we can use <a href="https://prometheus.io/" target="_blank">Prometheus</a> instead of InfluxDB.&#xA0;<br><br>
Also, Telegraf can collect metrics from different sources. What&#x2019;s more, all the components are open source and are easy to <a href="https://medium.com/@nagaraj.kamalashree/how-to-install-tig-stack-telegraf-influx-and-grafana-on-mac-os-b989b2faf9f8" target="_blank">install</a>.</div></div></div>
