System Monitoring with Xymon/Other Docs/FAQ/Generic Monitoring System Features

= Requirements for a Monitoring System =

Alerts

 * Send (Email/SMS/etc)
 * Acknowledge (display who is working on the issue)
 * Delay
 * Send to certain groups/individuals
 * Escalation Path


 * Ability to set severity levels for each service test (e.g. disk on a production server vs disk on a development server)
 * Different actions for different levels, e.g.
   * Level 1 (disk 95% full): alert Help Desk
   * Level 2 (disk 98% full): alert IT team
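
The level-based routing above can be sketched in a few lines. This is only an illustration under assumptions: the `Level` class, the `route_alert` function and the recipient strings are hypothetical names, not part of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    name: str
    threshold_pct: float  # triggered when disk usage is at or above this value
    notify: str           # hypothetical recipient group


# Levels taken from the example in the list above.
DISK_LEVELS = [
    Level("Level 1", 95.0, "Help Desk"),
    Level("Level 2", 98.0, "IT team"),
]


def route_alert(disk_used_pct, levels=DISK_LEVELS):
    """Return the most severe level triggered by the reading, or None."""
    # Check the highest threshold first so the most severe match wins.
    for level in sorted(levels, key=lambda lv: lv.threshold_pct, reverse=True):
        if disk_used_pct >= level.threshold_pct:
            return level
    return None
```

For example, `route_alert(96.0)` selects Level 1 (Help Desk), while `route_alert(99.0)` selects Level 2 (IT team).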

Display

 * Include or integrate with a real-time display system (with colours: Red, Yellow, Green, Purple, White and Blue)
   * Red:
   * Yellow:
   * Green:
   * White:
   * Purple:


 * Display a time of last check


 * Show a high-level summary of status, e.g. group Unix boxes together and show whether any of them have issues


 * Ability to customise the display, e.g. a summary page for the IT helpdesk, a Unix page for Unix admins, a Network page for the networking team.


 * Ability to restrict access to the monitoring system (we do not want the general community to see everything monitored)


 * Ability to search for a host

Monitor

 * Microsoft Windows: Windows NT, Windows XP, Windows Vista.
   * Be able to process Windows event logs and performance monitoring data
 * UNIX: Solaris, AIX, HP-UX, IRIX, Linux, MacOS X, Tru64.
 * Services (DNS/FTP/SMTP/LDAP/etc)
 * Applications (Outlook, Calendar, Exchange, Certificate Services, Apache, Tomcat, etc)
 * HTTP Application Monitoring
   * Expected content returned
   * Acceptable response time (10 seconds to load a web page is not okay)
 * Simulate a Windows client application, e.g. click on an icon to launch Word, enter some text, save the document to a drive, close Word, and ensure the whole process worked.
 * Service level testing
   * e.g. a web application requires a web server, DNS, LDAP, etc. If the DNS server fails, then so will the web application.
 * Allow for cluster testing (e.g. 1 web server out of a cluster of 5 fails, notify about the web server outage, but not the web service outage)
 * Network File shares
 * SAN Monitoring
 * Citrix Servers and Services
 * Printers
   * Printer errors, e.g. low toner
   * Print queues
 * SNMP Devices
 * Hardware (e.g. Dell DRAC, Sun Solaris), both via hardware card and OS software.
 * UPS
 * Other environmental inputs (temperature, humidity, etc)
 * Nightly backup
   * Warn if backups take longer than expected
   * Alarm if some backups fail
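
The cluster-testing requirement in the list above (one web server out of five fails, so report the server outage but not a service outage) can be sketched as follows. The function name `evaluate_cluster`, the `min_healthy` parameter and the host names are assumptions made for illustration.

```python
def evaluate_cluster(member_up, min_healthy=1):
    """Separate member outages from an outage of the clustered service.

    member_up maps each member's hostname to True (up) or False (down).
    Returns (service_ok, failed_members).
    """
    failed = sorted(host for host, up in member_up.items() if not up)
    healthy = len(member_up) - len(failed)
    # The service counts as up while enough members are still healthy.
    service_ok = healthy >= min_healthy
    return service_ok, failed


# Usage: one of five web servers is down, so the service itself is still OK.
status = {"web1": True, "web2": True, "web3": False, "web4": True, "web5": True}
service_ok, failed_members = evaluate_cluster(status)
```

Here only `web3` would be reported as failed, and no alarm would be raised for the web service as a whole.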

Networking

 * Provide integration with CiscoWorks, or have similar functionality
 * WAN links, LAN links, VLANs, etc
   * Verify link is up
   * Verify bandwidth is not saturated
 * Cisco/Networking hardware
   * CPU load
   * Environmental, e.g. power supplies, temperature alarms, etc
 * Ability to interact with probes (break down traffic to type and size)
 * Capture and track changes to hardware configurations

OS Monitoring

 * Disk
 * Memory
 * Processes
 * Response time
 * CPU Load
 * Hardware failures
 * OS alerts (system event logs and syslog)

Database monitoring

 * Oracle
 * MySQL
 * MSSQL
 * Ingres

File Monitoring

 * File growth, file existence, etc.

Customise

 * Easy to extend and customise with your own tests (an API to integrate with)

Trending

 * Alert on trends, e.g. 10% growth over 1 month might be OK, but 10% growth over 2 hours is not.
 * Provide trending for network bandwidth usage or any data collected
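
The trend requirement above compares a rate of change rather than an absolute value. A minimal sketch, assuming a hypothetical limit of 1% growth per day (the source gives no specific limit) and a made-up function name `growth_too_fast`:

```python
from datetime import timedelta


def growth_too_fast(start_value, end_value, elapsed, max_pct_per_day=1.0):
    """Return True if the value grew faster than max_pct_per_day.

    max_pct_per_day is an assumed threshold chosen for illustration.
    """
    if start_value <= 0:
        raise ValueError("start_value must be positive")
    if elapsed <= timedelta(0):
        raise ValueError("elapsed must be positive")
    growth_pct = (end_value - start_value) / start_value * 100.0
    days = elapsed.total_seconds() / 86400.0
    return growth_pct / days > max_pct_per_day


# 10% growth over a month is tolerated, the same growth over 2 hours is not.
slow = growth_too_fast(100.0, 110.0, timedelta(days=30))
fast = growth_too_fast(100.0, 110.0, timedelta(hours=2))
```

With these inputs `slow` is `False` (about 0.33% per day) and `fast` is `True` (about 120% per day).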

Integration

 * Integrate with a helpdesk/trouble ticket system
   * Automatically submit tickets
   * Automatically update existing tickets


 * Integrate with (or include) an asset management system
   * Display serial number, manufacturer, warranty periods, history of repairs/replacement, etc


 * Integrate with other monitoring systems, e.g. CiscoWorks, Oracle Enterprise Manager, HP, Compaq Insight Manager, etc


 * Integrate with Microsoft Operations Manager (MOM), or offer functionality similar to that available in MOM

Agents

 * Locally installed agent to collect data (and temporarily store data locally)
 * Ability of central polling server to contact agent to get gathered data
 * Local agent has ability to send data to polling server
 * Ability to remotely update agents

Misc

 * History retention


 * Provide reports


 * Must be able to assign multiple IP addresses to each device and test each IP address individually if needed.


 * Minimal impact on service being monitored
 * Minimal effort to monitor (and manage) clients (remote devices)
 * Do not require upgrades to existing infrastructure (e.g. a device should not have to run the latest version of its software before it can be monitored)


 * Ability for remote monitoring servers to report to a central server


 * Dependency aware (if a core router fails, do not send 100 alarms for devices behind it)
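
One way to picture dependency awareness is to record each device's upstream device and only alert on failures that are not explained by a failed ancestor. This is a sketch under assumptions: the `parent` mapping, the function name `alerts_to_send` and the host names are hypothetical.

```python
def alerts_to_send(down, parent):
    """Return the failed hosts that have no failed upstream device.

    down   -- set of hosts currently failing their tests
    parent -- dict mapping each host to its upstream host (None for the root)
    """
    def has_failed_ancestor(host):
        seen = set()  # guards against accidental loops in the mapping
        node = parent.get(host)
        while node is not None and node not in seen:
            if node in down:
                return True
            seen.add(node)
            node = parent.get(node)
        return False

    return sorted(h for h in down if not has_failed_ancestor(h))


# Usage: the core router fails and everything behind it becomes unreachable.
topology = {"core": None, "sw1": "core", "srv1": "sw1", "srv2": "sw1"}
to_alert = alerts_to_send({"core", "sw1", "srv1", "srv2"}, topology)
```

In this example only `core` is reported, because every other failure sits behind it.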


 * Allow for scheduled downtime (disable a test in the future)
   * Require authorisation
   * Require a reason to be displayed


 * Allow for regular maintenance windows (e.g. an application is restarted every Sunday night, so do not send out alarms during that time)
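
A recurring maintenance window can be checked before an alarm is sent. The sketch below assumes naive local datetimes and a hypothetical window of Sunday 23:00 lasting two hours; the function name `in_weekly_window` is made up for illustration.

```python
from datetime import datetime, time, timedelta


def in_weekly_window(now, weekday, start, duration):
    """Return True if `now` falls inside a weekly maintenance window.

    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    The window may run past midnight into the next day.
    """
    # Find the most recent window start at or before `now`.
    days_back = (now.weekday() - weekday) % 7
    window_start = datetime.combine(now.date() - timedelta(days=days_back), start)
    if window_start > now:
        window_start -= timedelta(days=7)
    return window_start <= now < window_start + duration


# Assumed window: every Sunday from 23:00 for two hours.
SUNDAY = 6
suppress = in_weekly_window(
    datetime(2024, 1, 7, 23, 30), SUNDAY, time(23, 0), timedelta(hours=2)
)
```

7 January 2024 is a Sunday, so `suppress` is `True` here and the alarm would be held back.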


 * Ability to delegate testing to other devices (e.g. a tiered management structure)


 * Audit history in the monitoring system (date a server was added, when monitoring was disabled and why, etc.)


 * The system must be able to self-monitor


 * Be able to monitor 1000+ devices


 * Allow variable polling (some tests every 5 mins, some tests every 1 min)


 * Highly Reliable


 * Redundancy (if your main monitoring server fails, have a second server on standby)


 * Apply default thresholds to groups of devices. Allow "one off" exceptions to these thresholds.  e.g. all file systems must be less than 90% full.  For serverX /opt must be less than 94% full since it currently is at 93% and should not change.
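
The default-plus-exception rule above amounts to a lookup with a fallback. A minimal sketch using the figures from the example; the names `DEFAULT_MAX_PCT`, `OVERRIDES`, `max_allowed_pct` and `filesystem_ok` are hypothetical.

```python
DEFAULT_MAX_PCT = 90.0  # all file systems must be less than 90% full

# One-off exceptions keyed by (host, file system), from the example above.
OVERRIDES = {("serverX", "/opt"): 94.0}


def max_allowed_pct(host, filesystem, overrides=OVERRIDES, default=DEFAULT_MAX_PCT):
    """Return the exception for this host and file system, else the default."""
    return overrides.get((host, filesystem), default)


def filesystem_ok(host, filesystem, used_pct):
    """Return True if usage is strictly below the applicable threshold."""
    return used_pct < max_allowed_pct(host, filesystem)
```

With these values, `/opt` on serverX at 93% passes, while any other file system at 93% fails the default 90% rule.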