System Monitoring Software | Incident Management Tools | IT Management Software

Strategy | The HDPlus Story

Thirty years of working in IT as a CIO, CTO, system administrator, software developer, and electrical engineer has taught me one thing: “use a system to stay out in front of problems and out of trouble.”

The real-life example below shows what can go horribly wrong when system monitoring and the processes behind it are not made a priority. What I observed, as a visitor, actually led to the development of HDPlus. I was on a trip to Europe, visiting one of our manufacturing facilities in Germany. The story is a reminder of why properly configured system monitoring software, backed by sound incident management tools, automated escalations, and workflow, is so important to a well-managed IT department.

When monitoring is not made a priority, the result can be very costly, as this team learned.

When a prominent European tier-one automotive seat manufacturer experienced a hard drive problem, the result was a major system outage that quickly propagated and eventually impacted their customer. Having an internal IT issue is one thing; having it stop your customer’s production is far worse.

This was the product of inadequate support and the almost unbelievable situation of having no system monitoring software in place. Adequate monitoring and alerting is such a fundamental, low-cost benefit that it is hard to believe there are companies running without it. What makes it even more remarkable is that these systems are not expensive and can be up and running in just a few days.

In this case, the local IT staff did not understand the priority of monitoring and alerting.

Because no properly configured monitoring system was in place, no one was notified that a critical business system was down for the entire weekend. That led to a major issue that affected not only their own operation but also their customer, and it ultimately resulted in a costly penalty and damage to the company’s reputation with that customer.

It wasn’t until midnight on Sunday, when the third-shift manufacturing employees arrived for work, that the problem was discovered. The system was completely down and had been all weekend. This led to a call for help in the middle of the night, and when IT staff arrived they discovered that not only was the supply chain management system down, but the storage array was down as well, and two of the disk drives in the array were defective.

The team struggled to explain how two drives could have failed at once. I suspect that one drive had failed weeks earlier and, because the team was flying blind and not monitoring the system, no one knew until the second drive failed. In a RAID array with parity, a drive can fail and the system will continue to function, because the missing data is reconstructed from parity on the remaining drives. With one drive failed the system can limp along this way, but when a second drive fails in a single-parity array, the array cannot rebuild itself, necessitating not only a hardware repair but an array reconstruction and then a full restore. It’s a major outage.
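To make the parity argument concrete, here is a minimal sketch of XOR parity, the scheme used by single-parity RAID levels. The block contents and the four-drive layout are illustrative assumptions, not details of the array in this story.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks striped across three drives (illustrative values).
data = [b"SEAT", b"EDI1", b"ASN2"]
# The parity block lives on a fourth drive.
parity = xor_blocks(data)

# One drive fails (the one holding data[1]): XOR the survivors with parity
# to rebuild the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

# Two drives fail (data[1] and data[2]): the survivors only yield the XOR of
# the two missing blocks, which is not enough to recover either one.
leftover = xor_blocks([data[0], parity])
```

With one drive lost, `rebuilt` equals the missing block exactly. With two lost, `leftover` is only the two missing blocks XORed together, which is why an unnoticed first failure turns the second one into a full outage.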

With the system completely down all weekend, the entire manufacturing plant was caught off guard and unable to meet production schedules and deliveries to their customer. Just-in-time inventory means that orders are placed via EDI transactions, with delivery timed to coincide with vehicles being manufactured in line-sequence order. Thousands of EDI transactions are received on a daily basis. When everything is working properly, trucks back up to the manufacturing facility and, in this case, seats are preconfigured and delivered to match vehicle build orders as each vehicle moves down the assembly line. Everything is dependent on systems being operational and available. The unpleasant reality was that, due to this error and oversight by IT, the plant could not meet delivery schedules for a well-known German vehicle manufacturer, and the customer’s assembly line came to a stop.

Shutting the customer’s production line down is bad news. In the auto manufacturing industry everything is driven by ASN transactions sent via EDI transmissions: just-in-time inventory means just-in-time shipments. All shipment orders, acknowledgements, and delivery notifications are via an ASN transmission. If a failure disrupts the assembly line, a penalty is assessed to compensate the customer for lost revenue, increased labor costs, etc. In the U.S., a comparable manufacturer to the German car company in this example is an automobile manufacturer headquartered in Detroit. This is generally not made public, but its penalty is $60,000 per minute.

Let’s work the math. In this example the line was stopped for about one hour. At $60,000 per minute, the penalty is $1,000 per second, so 1 hour = 3,600 seconds x $1,000 = $3.6M. With a potential loss of nearly $4 million, isn’t it worth making sure monitoring and alerting are working correctly?
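The same arithmetic as a short sketch. The $60,000-per-minute rate is the figure quoted above; the function name and default are only illustrative.

```python
PENALTY_PER_MINUTE_USD = 60_000  # rate quoted above for the comparable U.S. manufacturer

def line_stop_penalty(minutes_stopped, rate_per_minute=PENALTY_PER_MINUTE_USD):
    """Penalty in dollars for a customer line stoppage of the given length."""
    return minutes_stopped * rate_per_minute

per_second = PENALTY_PER_MINUTE_USD // 60   # dollars per second
one_hour = line_stop_penalty(60)            # dollars for a one-hour stoppage
print(f"${per_second:,} per second, ${one_hour:,} per hour")
```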

The design of HDPlus is focused on eliminating costly problems just like this one. It is a system with built-in checks and balances from beginning to end, with workflow and logic enabled from the moment a Helpdesk ticket is created until it is closed. Throughout the lifecycle of the ticket, built-in alerting sends notifications when a ticket remains unattended or an action is required.
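As a rough illustration of what an unattended-ticket check can look like, here is a minimal sketch. It is not HDPlus code: the ticket fields, severity names, and time limits are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical limits: how long an open ticket may sit unassigned before escalating.
ESCALATE_AFTER = {
    "critical": timedelta(minutes=15),
    "warning": timedelta(hours=4),
}

def needs_escalation(ticket, now):
    """Return True if an open, unassigned ticket has waited past its limit."""
    if ticket["status"] != "open" or ticket.get("assignee"):
        return False
    limit = ESCALATE_AFTER.get(ticket["severity"])
    return limit is not None and now - ticket["created"] >= limit

# Example: a critical ticket raised Friday evening, checked 20 minutes later.
ticket = {
    "status": "open",
    "severity": "critical",
    "assignee": None,
    "created": datetime(2024, 1, 5, 18, 0),
}
overdue = needs_escalation(ticket, datetime(2024, 1, 5, 18, 20))
```

Run on a schedule, a check like this is what keeps a Friday-evening failure from sitting unnoticed until Sunday at midnight.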

A critical component of HDPlus, over and above what is found in most incident management tools, is the ability not only to auto-generate tickets but to parse an email from a monitoring system, determine the importance of the ticket, and distribute real-time email alerts based on these criteria to the right departments and staff. As this is happening, management is made aware, so that there are no delays should an escalation be required. This safety net is what makes the difference.
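The parse, classify, and route flow described above might be sketched as follows. This is an illustration only, not how HDPlus is implemented: the keyword rules, severity names, and e-mail addresses are hypothetical, and only Python's standard `email` module is used.

```python
from email import message_from_string

# Hypothetical keyword rules, checked in order; the first match wins.
SEVERITY_RULES = [
    ("critical", ("down", "failed", "unreachable")),
    ("warning", ("degraded", "threshold")),
]

# Hypothetical routing table: who is notified at each severity.
ROUTING = {
    "critical": ["oncall@example.com", "it-manager@example.com"],
    "warning": ["storage-team@example.com"],
    "info": ["helpdesk@example.com"],
}

def build_ticket(raw_email):
    """Turn a plain-text monitoring e-mail into a ticket with severity and recipients."""
    msg = message_from_string(raw_email)
    subject = msg["Subject"] or ""
    payload = msg.get_payload()
    body = payload if isinstance(payload, str) else ""
    text = f"{subject} {body}".lower()
    severity = "info"
    for level, keywords in SEVERITY_RULES:
        if any(keyword in text for keyword in keywords):
            severity = level
            break
    return {"title": subject, "severity": severity, "notify": ROUTING[severity]}

sample = (
    "From: monitor@example.com\n"
    "Subject: RAID array: drive 3 failed\n"
    "\n"
    "Array is running degraded.\n"
)
ticket = build_ticket(sample)
```

Here the word "failed" classifies the message as critical, so both the on-call contact and a manager are notified at once. A real deployment would need rules tuned to the alert formats its monitoring system actually sends.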

Second, and just as important to the design of HDPlus, non-IT people can create tickets with ease. HDPlus follows a logical flow and presents its questions to users in plain English, so one does not need a Computer Science degree to create a ticket. Tickets can even be created via a phone call or from mobile devices.

HDPlus can be deployed in less than one business day, and the cost-effective fee structure is based on the number of users per month. Had HDPlus been deployed at this company, the problem would have been identified and corrected long before the incident occurred, and there would have been no outage and none of the resulting penalties of nearly $4M.

Using well-designed and innovative solutions to catch problems before they cause downtime has side benefits.

Case in point: during an engagement in which I led the carve-out of an IT division from a Fortune 500 company, I needed to deploy a ticket management system for global IT, including the Helpdesk department.

I deployed an HDPlus design that maintained a very cost-effective global IT strategy. The approach did not compromise service level or SLA, and with technologies such as automatic ticket generation, automatic escalation, and engaging responses during ticket creation, our users were able to support themselves. We operated the company without a formal Helpdesk. Gartner’s recommendation was a target IT budget, for a company our size, of 2.2% of GP. Our budget for the 2½ years before the company was snapped up was less than 1% of GP, our service was good, and our users were happy.

I used technology to drive out waste and to drive down cost. Recall the advice from the beginning of this article: “use a system to stay out in front of problems.” Let me modify that slightly…

Use the RIGHT system to stay out in front of problems while reducing cost and improving SLA. Contact HDPlus today!
