Database metrics and alerts: Don't freak out – it’s not as bad as you think

    10943747 884899944874767 3896852335649968961 oBill Rehm, ATS Database Lead

    Out of the box, there are a lot of database alerts active and they all have 'recommended' thresholds. You may already know this, but you're going to have to change those alert thresholds. Every database is unique and what might be a problem on one database could be business as usual on another one. You'll need to adjust the alerts several times over the life of the database, but especially after you first install it and then again, after you start applying a production load.

    The general rule for setting alert thresholds is to set them all just a tiny bit above what is normal, if you're looking for a peak value, and for max values you want to give yourself enough notice that you have a chance to fix the situation. For example, at peak times, you run about 900 database connections, so you might want to be alerted when the number of database connections exceeds 1,000. On your data files you might set your first alert to be at 80%, because that gives you enough time to add a new data file without having to scramble at the last second.

    The alerts and thresholds shipped with various databases are set pretty reasonably and you can even activate additional alerts should you want even more information. It really depends on your specific needs and those of the business. Some companies decide that that they need even more flexibility and monitoring, and they might bring in a third-party tool such as GSI's GENIUS or Dell Foglight.

    With third-party monitoring, you introduce a new problem to your alerts and thresholds: Over-monitoring. By default, many packages have every possible alert turned on, including ones that don't matter. Even worse, there are some alerts that DO matter, but only when viewed in context with other alerts and behavior.

    The best example of this is the "full scans/sec" metric on Microsoft SQL databases. On the surface, that alert seems very important. Nobody wants any full table scans. Those are time consuming and can kill performance. If you're getting large volumes of full scan alerts, you must have a problem right?

    Well, maybe not. The full scans/sec doesn't key on only full table scans, and it's not really even a FULL scan. It's actually the number of unrestricted scans per second and they can be table OR index scans. These scans could actually be a full traverse of an entire table, or they could be an index "skip-scan" where the query jumps right to the end of the index.

    To get to the bottom of any alert, someone really needs to get on the database and dig in to the specific alert and see where it's coming from. It could really be a problem with a SQL statement not using indexes, containing NOT IN conditions, or nested joins (or all of them). On the other hand, the database could be doing a very efficient range or skip scan and there's no problem at all. Maybe the database is very busy and MSSQL is driving a high number of I/O requests. That could produce a lot of scans, increases in page lookups and reads, and high page I/O latch waits.

    Notice I mentioned some other alert metrics in there? That's right. Alerts are often only going to make sense when you take in to context other factors. If you only monitor on individual thresholds, you're going to end up constantly racing to a fire only to find it's a false alarm. You just can't get the big picture by looking at one counter.

    Keep your alerts coming, but make sure you understand the context. Not all alerts are bad, and some are only bad when you have additional data points to support them. Just don't act on an alert when you're not sure what it's there or why you're getting it. You might end up making changes that have no effect, or simply making things worse.

    A key tip I want to leave you with regarding alerts:

    Limit who gets database alerts, especially at first. Some people consider any alert a poor reflection of your job as a DBA. It doesn't matter that it's informational or out of context. They don't know that. All they know is that their mailbox is filling up with messages, so you must not be doing your job. Keep database alerts close to your chest until you figure out where you want your thresholds. When the system is stabilized and running smoothly, you'll know what your database is doing and what alerts are important. Then you can start including non-technical people.

    For assistance with JD Edward, please email us at inquiries@GetGSI.com.