Circonus’ Linux Host Monitoring Speeds and Optimizes Performance Management

Circonus’ new Linux host monitoring dashboard is the most comprehensive host monitoring available in the market – yet also extremely easy to install and use, allowing users to more efficiently and accurately monitor their Linux hosts and diagnose and resolve issues. The turnkey dashboard uses our new Circonus Unified Agent (CUA) – a single collection agent that consolidates all host and services monitoring. CUA is written in Go, has several plug-ins, supports 300 integrations, and runs on multiple platforms, with releases available as Debian and RPM packages for AMD64 and ARM64, and as builds for FreeBSD (ARM64 and x86 64-bit), Linux (ARM64 and x86 64-bit), and Windows (x86 64-bit).

Setup is quick and easy. Simply download and extract the appropriate release on your server, and the single configuration file automatically activates the default plug-ins: CPU, Disk, Memory, Network, and Swap. No editing is required to get started, but the agent is easy to configure if needed, with examples in the configuration file.

Users run just a single command to get started (sbin/circonus-unified-agentd --config=etc/circonus-unified-agent.conf), and the agent will automatically interface with Circonus to immediately start data flowing.

CUA-driven Linux Host Dashboard

The CUA-driven Linux host dashboard is a system-wide performance dashboard with no extra setup required. Overall, the dashboard displays 350-500 metrics (and more data can be collected and stored in our time-series database, IRONdb). It includes a “summary” tab for an overview of your host, as well as multiple in-depth tabs that each group related telemetry metrics for digging deeper into CPU, Memory, Disk, and Network performance. All graphs stream new values in real time, every minute.

System Tab

The System tab provides an at-a-glance overview to confirm that the system is working as it should. It includes:

  • 1/5/15 minute system load averages
  • CPU/memory/uptime stats
  • Processes Forked (too many can mean that bad software isn’t letting child processes exit correctly)
  • Context Switches (too many can mean that some software is multitasking and eating up too much CPU time since context switches require CPU time)
  • Interrupts (too many can mean that some hardware is eating up too much CPU time since handling interrupts requires CPU time)
  • Entropy Available (not enough bits of entropy can stall software which requires truly random numbers from /dev/random)
  • Zombie Processes (quickly-accumulating zombie processes can mean that bad software may exhaust all available process IDs)
  • Users currently logged into the server
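
As a rough illustration of how the load averages above can be read, the sketch below parses a sample line in the format of Linux's /proc/loadavg and divides each average by the number of processors. It is not how CUA collects these values; the function names, the sample line, and the processor count of 4 are assumptions made for this example.

```python
def parse_loadavg(text):
    """Parse one line in /proc/loadavg format into named fields."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total_tasks": int(total),
    }


def load_per_cpu(load, cpu_count):
    """Normalize a load average by processor count."""
    return load / cpu_count


# Hypothetical sample; on a real host, read /proc/loadavg and use os.cpu_count().
sample = "2.00 1.50 0.75 3/412 20817"
stats = parse_loadavg(sample)
per_cpu = {key: load_per_cpu(stats[key], 4) for key in ("load1", "load5", "load15")}
```

A normalized value that stays above 1.0 suggests there is more runnable work than the processors can handle.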

CPU Tab

The CPU tab drills down into how each processor is spending its time, so you can see what kind of work is consuming your CPU. It includes:

  • User (time spent executing user tasks, per processor)
  • System (time spent executing system tasks, per processor)
  • IO Wait (time spent waiting on hardware IO, per processor)
  • Idle (time spent doing nothing, per processor)
  • IRQ (time spent handling hardware interrupts, per processor)
  • Soft IRQ (time spent handling software interrupts, per processor)
  • Guest (time spent running a virtual CPU, per processor. This is included in the User value above)
  • Nice (time spent running tasks at below-normal priority, per processor)
  • Steal (time taken by the hypervisor and given to other virtual machines when running in a virtualized environment, per processor)
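
To make the categories above concrete, here is a minimal sketch of how per-processor percentages can be derived from two readings of a cpu line in the format of Linux's /proc/stat. This is an illustration only, not CUA's implementation; the function names and the two sample lines are assumptions.

```python
# Column order of the per-CPU lines in /proc/stat (values are cumulative ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Return (cpu_name, {field: ticks}) for one 'cpuN ...' line."""
    parts = line.split()
    return parts[0], dict(zip(FIELDS, map(int, parts[1:])))


def cpu_percentages(before, after):
    """Share of elapsed CPU time spent in each category between two samples."""
    deltas = {key: after[key] - before[key] for key in before}
    # Guest time is already counted inside user (and guest_nice inside nice),
    # so leave both out of the total to avoid counting it twice.
    total = sum(v for k, v in deltas.items() if k not in ("guest", "guest_nice"))
    if total == 0:
        return {key: 0.0 for key in deltas}
    return {key: 100.0 * value / total for key, value in deltas.items()}


# Hypothetical samples taken some seconds apart on the same processor.
_, first = parse_cpu_line("cpu0 1000 0 500 8000 100 0 0 0 0 0")
_, second = parse_cpu_line("cpu0 1300 0 600 8500 200 0 0 0 0 0")
pct = cpu_percentages(first, second)
```

Because the counters only ever increase, the difference between two samples is what matters, not the raw values.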

Memory Tab

The Memory tab graphs multiple types of memory, as allocated by the system, so you can identify what may be running out of memory and causing performance issues. It includes:

  • MMapped (memory currently used in mmap’d objects)
  • Available (memory currently available for use)
  • Committed (memory currently committed/promised to all processes, not the amount actually being used. Note that this may exceed total system memory)
  • Commit Limit (how much memory the system is allowed to commit)
  • Dirty (data waiting to be written to disk)
  • Write Back (data currently being written to disk)
  • Used/Free (free, used, cached, & buffered memory)
  • Active/Inactive (active: memory which was used recently, inactive: memory which hasn’t been used in a while and can probably be cleared if necessary)
  • Slab (a cache of commonly-used objects in memory for use by the kernel)
  • Swap (free and used swap space on disk)
  • Virtual Memory (all committed free and used memory, both physical memory and swap space. May be larger than actual available memory + swap space if virtual memory is overcommitted, in which case processes will be killed when they try to use too much memory. More useful for 32-bit systems than 64-bit systems, since 32-bit systems have such a limited memory address space)
  • Huge Pages (number of HugePages being used. When configured, HugePages allows applications to use large multi-MB memory pages instead of the standard 4kB page size)
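
The relationship between Committed and Commit Limit is easier to see with numbers. The sketch below parses a few lines in the format of Linux's /proc/meminfo and checks whether committed memory exceeds the limit. The sample values and function names are made up for illustration, and this is not how the dashboard itself computes its graphs.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style 'Name: value [kB]' lines into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if not rest.strip():
            continue  # skip blank or malformed lines
        values[name.strip()] = int(rest.split()[0])
    return values


# Hypothetical sample; a real host exposes many more fields.
sample = """
MemTotal:       16000000 kB
MemAvailable:    4000000 kB
Committed_AS:   12000000 kB
CommitLimit:    10000000 kB
HugePages_Total:       0
"""

mem = parse_meminfo(sample)
commit_ratio = mem["Committed_AS"] / mem["CommitLimit"]
available_pct = 100.0 * mem["MemAvailable"] / mem["MemTotal"]
overcommitted = commit_ratio > 1.0
```

In this sample the system has promised more memory than its commit limit allows, even though a quarter of physical memory is still available.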

Disk Tab

The Disk tab monitors overall space usage and inode usage, as well as I/O performance (for non-virtualized hosts).

  • Disk Usage (percentage of how full each disk is, per mount point, not physical hardware)
  • Used Space (actual used space for each disk, per mount point)
  • Free Space (actual free space for each disk, per mount point)
  • Used Inodes (count of files and directories on disk, per mount point)
  • Free Inodes (available inodes, per mount point. Each disk can only hold a certain number of files and directories and this is the maximum number allowed minus the used ones)
  • Reads/Writes (the number of physical disk read and write operations. This is only available for non-virtualized hosts)
  • Read/Write Bytes (the number of bytes in physical disk read and write operations. This is only available for non-virtualized hosts)
  • Read/Write Time (the percentage of total available time spent in physical disk read and write operations. This is only available for non-virtualized hosts)
  • IOps in Progress (the number of physical disk read or write operations happening at any given time. This is only available for non-virtualized hosts)
  • IO Time (the percentage of total available time spent performing physical disk access. This is only available for non-virtualized hosts)
  • Weighted IO Time (the percentage of total available time spent performing physical disk access, including operations which haven’t completed yet. This can be a good measure of both IO time and IO backlogs that are accumulating. Only available for non-virtualized hosts)
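
For readers who want to see where space and inode figures like these can come from, the following sketch uses Python's standard os.statvfs call, which reports block and inode counts for a mount point on Unix systems. The helper names and the sample numbers are assumptions for illustration, and CUA may gather these values differently.

```python
import os


def percent_used(used, free):
    """Percentage used, given used and free counts (bytes or inodes)."""
    total = used + free
    return 100.0 * used / total if total else 0.0


def mount_usage(path):
    """Space and inode usage for the filesystem mounted at path (Unix only)."""
    st = os.statvfs(path)
    used_bytes = (st.f_blocks - st.f_bfree) * st.f_frsize
    free_bytes = st.f_bavail * st.f_frsize  # space available to unprivileged users
    used_inodes = st.f_files - st.f_ffree
    free_inodes = st.f_favail
    return {
        "used_bytes": used_bytes,
        "free_bytes": free_bytes,
        "used_inodes": used_inodes,
        "free_inodes": free_inodes,
        "space_pct": percent_used(used_bytes, free_bytes),
        "inode_pct": percent_used(used_inodes, free_inodes),
    }


# Hypothetical counts: a mount that is 75% full by space but out of inodes.
space_pct = percent_used(used=75, free=25)
inode_pct = percent_used(used=1000, free=0)
```

A mount can report plenty of free space yet still refuse new files once its inodes run out, which is why both percentages are worth watching. On a Linux host, calling mount_usage("/") returns the same figures for the root filesystem.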

Network Tab

The Network tab is the most in-depth tab. It breaks down all types of network traffic, allowing users to do a deep dive and identify errors by protocol, such as identifying packet loss in real-time or UDP traffic that has no associated app destination. It includes:

  • Network Traffic and Errors
    • Traffic Received (total network traffic received, per interface)
    • Traffic Sent (total network traffic sent, per interface)
    • Reception Errors (total network errors when receiving data, per interface)
    • Transmission Errors (total network errors when sending data, per interface)
    • Packets Received (total data packets received, per interface)
    • Packets Sent (total data packets sent, per interface)
    • Incoming Packets Dropped (total incoming data packets dropped due to network congestion, per interface)
    • Outgoing Packets Dropped (total outgoing data packets dropped due to network congestion, per interface)
  • ICMP Statistics
    • ICMP Messages (total ICMP messages received and sent)
    • ICMP Errors (total incoming ICMP messages received but having ICMP-specific errors, and total outgoing ICMP messages not sent due to ICMP-specific errors)
    • ICMP Incoming (all incoming ICMP messages, per message type)
    • ICMP Outgoing (all outgoing ICMP messages, per message type)
    • IP Unknown Protocol (the number of incoming packets discarded because the protocol was unsupported or unknown)
    • IP Invalid Address (the number of incoming packets discarded because of an invalid address)
    • IP Header Errors (the number of incoming packets discarded because of IP header errors)
    • IP Forwarded Datagrams (the number of forwarded datagrams)
  • UDP Statistics
    • UDP Datagrams (the number of incoming and outgoing datagrams)
    • UDP Errors (the number of datagrams received with errors or for which there was no application available on the destination port)
  • IP Statistics
    • IP Packet Fragmentation (the number of packets successfully fragmented)
    • IP Incoming (the counts of incoming packets received, incoming packets delivered, and incoming packets discarded for queue overflow or buffer problems)
    • IP Outgoing (the counts of outgoing requests, outgoing fragments dropped after timeouts, and outgoing packets discarded for queue overflow or buffer problems)
    • IP Packet Reassemblies (the counts of packets which could not be reassembled, packets which were reassembled successfully, fragments received which required reassembly, and timeouts which occurred when trying to reassemble packets due to all required fragments not being received)
  • TCP Statistics
    • TCP Opens (the counts of active TCP connections with clients and passive TCP connections open with listeners)
    • TCP Connections (the counts of currently established connections, failed connection attempts, and connections which were reset)
    • TCP Segments (the counts of segments which were received, segments which were transmitted, and segments which were retransmitted)
    • TCP Errors (the counts of TCP reception errors and TCP connections closed abnormally while the client was still sending data)
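
One way to put the TCP segment counters to work is to turn them into a retransmission rate. The sketch below parses the Tcp rows of two samples in the format of Linux's /proc/net/snmp and computes the share of transmitted segments that were retransmissions in between. The sample counter values and function names are assumptions, and this is an illustration rather than the dashboard's own calculation.

```python
def parse_snmp(text, proto):
    """Return {counter: value} for one protocol from /proc/net/snmp-style text."""
    prefix = proto + ":"
    rows = [line.split()[1:] for line in text.splitlines()
            if line.startswith(prefix)]
    header, values = rows[0], rows[1]  # first row names the columns
    return dict(zip(header, map(int, values)))


def retransmit_percent(before, after):
    """Retransmitted segments as a percentage of segments sent between samples."""
    sent = after["OutSegs"] - before["OutSegs"]
    retransmitted = after["RetransSegs"] - before["RetransSegs"]
    return 100.0 * retransmitted / sent if sent else 0.0


HEADER = ("Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
          "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
          "InErrs OutRsts InCsumErrors\n")

# Hypothetical counter values from two samples of the same host.
first = parse_snmp(HEADER + "Tcp: 1 200 120000 -1 100 50 3 2 10 10000 8000 80 0 5 0", "Tcp")
second = parse_snmp(HEADER + "Tcp: 1 200 120000 -1 110 55 3 2 12 11500 9000 130 0 5 0", "Tcp")
rate = retransmit_percent(first, second)
```

A rising retransmission rate is often an early sign of packet loss somewhere along the network path.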

Alerts

Once your Linux host metrics begin flowing into Circonus, you can add alerting for key metrics. You can add rulesets that define when to issue alerts, such as absence alerts, threshold alerts, and group alerts, and you can use pattern-based rulesets to cover all hosts with a single group of rulesets.
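
To show the general idea behind threshold alerts, absence alerts, and pattern-based coverage, here is a small generic sketch. It does not use the Circonus ruleset API or its configuration format; every name, the host pattern, and the sample data are hypothetical.

```python
from fnmatch import fnmatchcase


def threshold_breached(value, maximum):
    """Threshold rule: the latest value is above the allowed maximum."""
    return value > maximum


def is_absent(last_seen, now, max_age_seconds):
    """Absence rule: no data has arrived within the allowed window."""
    return (now - last_seen) > max_age_seconds


def evaluate(pattern, maximum, max_age_seconds, latest, now):
    """Apply both rules to every host whose name matches the pattern.

    latest maps host name -> (timestamp of last sample, last value).
    Returns a list of (host, reason) tuples in host-name order.
    """
    alerts = []
    for host in sorted(latest):
        if not fnmatchcase(host, pattern):
            continue
        seen_at, value = latest[host]
        if is_absent(seen_at, now, max_age_seconds):
            alerts.append((host, "absent"))
        elif threshold_breached(value, maximum):
            alerts.append((host, "threshold"))
    return alerts


# Hypothetical disk-usage percentages with the time each was last reported.
latest = {
    "web-01": (1000, 42.0),
    "web-02": (1000, 97.5),
    "web-03": (600, 10.0),
    "db-01": (1000, 99.0),
}
alerts = evaluate("web-*", 90.0, 300, latest, now=1060)
```

Here one pattern covers every web host, so a newly added host with a matching name would be checked without writing a new rule.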

More Insights, Better Decisions

One of the top issues we hear from companies who are frustrated with their monitoring systems is a lack of clarity into their data and the time it takes to identify and resolve issues. Our Linux host monitoring dashboard directly addresses this challenge. It collects and graphs more metrics than other solutions, giving users immediate, deep insights into their systems. As a result, they can more efficiently make sense of their data, troubleshoot issues, and validate that everything is working as expected.

Contact us to learn more about our Linux host monitoring dashboard, or if you would like a demo of our unified monitoring and analytics platform.