Anyone gotten rid of their server monitoring system?

Status
Not open for further replies.
Mar 13, 2016
16
US
Have you gotten rid of your server monitoring system? Do you wish that you could? I am not going to mention any particular monitoring system because all of them seem to suck big time in my experience. Here is a case in point.

Yesterday morning, I noticed some alerts came in via email. Of course, I generally ignore the email alerts because we get so many of them. Eventually, I asked the guy who admins the monitoring system about it. He said that we did have an issue with the monitoring system.

A little later, the pages started coming in. I spent probably 20 minutes just acknowledging them and deleting them from my phone. Of course, we didn't have any problems other than that stupid monitoring system. No users and application owners called us. As far as the end users were concerned, nothing was wrong.

This was just typical of the time that the systems waste for us. I've lost count of how many times a server was "down." Then I find that the server has been turned off on purpose or that someone just happens to be working on it at the time. Most of these alerts are just noise, and often one of these stupid alerts wastes a half hour or more of my time.

Of course, we gain little or nothing by having this alerting. Before we had this system, if something actually was down, we would hear about it from users anyway. And I certainly don't see how an alerting system buys us any time.

Have you gotten rid of your alerting system? I just see an expensive waste of time and money. We even have at least one person who administers the thing full time. I wish we would get rid of ours.

 
A monitoring system is just another tool. If it's not working for you, you're either using it wrong, or you don't actually need it.

If you miss real alerts because you "generally ignore the email alerts because we get so many of them", then you are using it wrong. The only alerts you should see are real actionable problems. If you have that much noise coming from the system, then you're doing it wrong. You need to look at the configuration and what kinds of events actually result in a page/text going out. Fix that first.

johnnygage said:
Before we had this system, if something actually was down, we would hear about it from users anyway.

You must be new. In most companies, the impressions of end users can not only determine the budget that IT gets, but also whether you should be replaced or outsourced. It's not only good for users to not know when your systems have had a problem, but it might also mean your survival. Your goal in life in IT is to not have the users in the business know there was a problem. If their work hasn't been disrupted, then the business is doing whatever it does to bring in money. If you want to go back to using disgruntled users to let you know when the systems are down, go for it. Just keep your resume/CV updated at all times. You will be seen as not doing your job and you will be replaced.

So no, I would not even want to work in any environment that didn't have a system monitoring tool in place. In fact, where I work, there are multiple layers of monitoring tools in place: one used by the data center staff and sysadmins to monitor hardware and OS, one by the database teams to monitor all aspects of the databases, and multiple tools, both purchased and developed in-house, used by the application teams. Across all these tools there is a lot of overlap, but issues are handled quickly. Some teams even collect information from the monitoring tools to do predictive modelling. They sometimes do upgrades or replacements before something fails. It's really easy to justify raises and bonuses for IT when you can show a 99.997% system uptime to the business.

 
sambones said:
A monitoring system is just another tool. If it's not working for you, you're either using it wrong, or you don't actually need it.

If you miss real alerts because you "generally ignore the email alerts because we get so many of them", then you are using it wrong. The only alerts you should see are real actionable problems. If you have that much noise coming from the system, then you're doing it wrong. You need to look at the configuration and what kinds of events actually result in a page/text going out. Fix that first.

I don't control the POS monitoring system. I am simply responsible for responding to the noise. Many of our alerts are for decommissioned servers. Many of them are for servers that are being serviced. Very few alerts are actionable problems.

And you are making a WHOLE LOT OF ASSUMPTIONS.

sambones said:
You must be new. In most companies, the impressions of end users can not only determine the budget that IT gets, but also whether you should be replaced or outsourced. It's not only good for users to not know when your systems have had a problem, but it might also mean your survival. Your goal in life in IT is to not have the users in the business know there was a problem. If their work hasn't been disrupted, then the business is doing whatever it does to bring in money. If you want to go back to using disgruntled users to let you know when the systems are down, go for it. Just keep your resume/CV updated at all times. You will be seen as not doing your job and you will be replaced.

When there actually is a problem, the users find out anyway. The monitoring system may buy us five minutes at the most. But it's very little help.

And I have worked at my current job for two and a half years. I also worked with other POS monitoring systems at another job for over three years. Neither company got the system right. The systems were constantly making noise.

Perhaps you should come to work for my company and show them how to get the monitoring system right?

I can tell you have worked in information technology for a long time.
 
>I don't control the POS monitoring system

Not sure that SamBones was suggesting that you personally fix the monitoring system in place at your company. He mostly seems to have been using 'you' generically (i.e., mentally replace 'you' with 'your company').

>Many of our alerts are for decommissioned servers. Many of them are for servers that are being serviced.

Which just goes to show that the monitoring system isn't being used or controlled properly, not that monitoring systems are inherently 'POS'. For example we have at least two processes designed to reduce/eliminate such noise. Firstly a change management system which will alert the monitoring team to any servers being decommissioned or serviced so that monitoring of those can be halted (or alerts disabled). Secondly, the aforementioned monitoring team, part of whose role is to perform triage on alerts so responders such as you only see valid alerts.
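To make the first process concrete, here is a minimal sketch of suppressing pages for hosts that change management has flagged. It is only an illustration under assumptions: the `Alert` and `MaintenanceWindow` types, the `should_page` function, and the host names are all hypothetical, and a real setup would pull this data from whatever change management and monitoring products are actually in use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Alert:
    host: str
    message: str
    raised_at: datetime


@dataclass
class MaintenanceWindow:
    host: str
    start: datetime
    end: datetime


def should_page(alert, decommissioned, windows):
    """Return True only for alerts on live hosts outside a maintenance window.

    `decommissioned` is a set of host names and `windows` a list of
    MaintenanceWindow records, both assumed to come from change management.
    """
    if alert.host in decommissioned:
        return False
    for window in windows:
        if window.host == alert.host and window.start <= alert.raised_at <= window.end:
            return False
    return True
```

With something like this in front of the pager, a "server down" alert for a retired box or a server inside its scheduled window never reaches the responder.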

>The monitoring system may buy us five minutes at the most

Sure - there are unfortunately always likely to be certain scenarios where the users are indeed affected before certain alerts can be resolved. However, in those circumstances, the monitoring alerts should be able to provide you with more detail (User: "I can't open my Word document", Monitoring Alert: "RAID controller 1 catastrophic failure on MasterFileServer"), and also allow the service desk taking reports to at least say "we know about the issue, it is being worked on", which helps with 'the impressions of end users' that SamBones mentioned. However, there will also be many, many alerts that users will not spot, but which IT do need to be made aware of - for example failure of a node in an HA cluster, loss of a disk in a RAID 5 array, backup failures. There are also alerts of incipient failure (e.g. interrogating SMART for indications that a disk may be likely to fail, or alerting if storage is beginning to run low before it actually runs out).
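As a rough illustration of the "storage beginning to run low" kind of alert, the sketch below checks how full a filesystem is and classifies it before it actually runs out. The function names and the 80% / 95% thresholds are assumptions picked for the example, not values from any particular monitoring product.

```python
import shutil


def classify_usage(used, total, warn=0.80, critical=0.95):
    """Classify a used/total pair as 'ok', 'warn' or 'critical'."""
    if total <= 0:
        raise ValueError("total must be positive")
    fraction = used / total
    if fraction >= critical:
        return "critical"
    if fraction >= warn:
        return "warn"
    return "ok"


def check_path(path, warn=0.80, critical=0.95):
    """Classify the filesystem holding `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_usage(usage.used, usage.total, warn, critical)
```

The point of the "warn" level is that it is raised while there is still time to act, which is exactly the kind of alert a user would never report.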
 
Yes, the 'you' was shorthand for 'your company'. When someone posts on Tek-Tips, it's hard to tell if they are in a one man IT shop, or an organization with 10,000 employees. You did refer to the monitoring systems as 'ours', so I did assume some 'ownership'. And since you were bringing up the possibility of getting rid of it, I assumed you had some level of responsibility for it.

In my world, if I see something that needs fixing, I get the appropriate people and teams together and work on a solution. If something doesn't make sense for some reason, I make a business case for eliminating or replacing it. But it's based on the needs of the business, not the fact that I'm getting too many annoying emails.

And I agree with strongm completely. It sounds like he's spent a lot of time in the trenches too.



 
I apologize for the "You must be new" comment. It was flippant and rude. I didn't mean for it to come off that way. I'm sorry.

johnnygage said:
Perhaps you should come to work for my company and show them how to get the monitoring system right?

I probably could. Where I work, it's being done right. As strongm mentions, we have the monitoring team tied into the change control and asset tracking systems so decommissioned servers are not monitored. We have monitoring blackout windows for known maintenance windows. We even do some predictive analysis to not only replace drives before they fail, but to stop using certain vendors' drives if they start showing a higher rate of early failure. But that kind of thing is overkill for most organizations. For ours it's important.

johnnygage said:
I can tell you have worked in information technology for a long time.

Yes, over 40 years in IT. I'm getting pretty close to retirement. In fact I'm really looking forward to the day I no longer get these fricken system outage alerts! [bigsmile]





 
First of all, maybe I should have mentioned what specific monitoring systems we have used at my organizations. However, I feared it would turn into a debate over which product is better. I wanted to have a discussion about monitoring systems in general.

I am glad to hear that there are companies out there that have actually made these systems work well. My current employer and previous employer were both basically keystone-kops IT departments. The IT department is a joke, and the monitoring systems are also a joke. In a way, it just reflects the department.

I was mainly referring to the many "server down" alerts that we get from SCOM and Nagios. This has been with two employers--one which was a Fortune 500 company and another which is perhaps even bigger. Most of the "server down" alerts are just noise. The server is sometimes being turned off for decommission or is down for maintenance. Some of them are also test servers, so they are up and down all the time.

At my past employer, we had temperature monitors in our server room. Yes, that's something you should have. Of course, our idiot supervisor had it set up so that it would page when one hit 75. Needless to say, this led to a bunch of pages just because a thermometer got knocked out of place or someone simply left a door open. Why not have it page at 85 or 90? That would make too much sense.
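For what it's worth, the fix is only a few lines in principle. The sketch below pages only when the last few readings are all at or above the threshold, so a single spike from a bumped sensor or an open door does not page anyone. The function name, the default of 85 (in the same unspecified units as above), and the "three readings in a row" rule are assumptions for illustration, not how any particular product does it.

```python
def should_page_temperature(readings, threshold=85.0, sustained=3):
    """Return True when the last `sustained` readings are all >= threshold.

    `readings` is a list of temperatures, oldest first.
    """
    if len(readings) < sustained:
        return False
    return all(value >= threshold for value in readings[-sustained:])
```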

Again, it's nice to know that you can get monitoring systems to work. Would either of you guys like to come work at my place? My guess is that you would just bring in too much common sense.
 
Yeah, you were right to not mention which product because for the most part it doesn't really matter.

And I have been in environments where the monitoring system(s) were not configured correctly and we were getting useless noise. The path forward out of that hell is what I mentioned in an earlier post. You have to either fix it, or get rid of it. I would start with collecting stats on how much useless noise there is. That is, find the signal to noise ratio. Then write up a concise report on how the monitoring is not working due to the useless alerts disguising/burying the real alerts. Use real numbers and include suggestions on how to fix it. Put it in terms of impact to the business and include hard dollar impact if you can. Present that to your management as a problem that needs to be fixed. Engage the person responsible for the monitoring and their management too.
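If it helps, collecting those stats does not need to be fancy. Here is a minimal sketch, assuming you can export or hand-tally alerts as (alert type, was it actionable) pairs; the function name and the sample data are made up for illustration.

```python
from collections import Counter


def noise_report(alerts):
    """Summarize noise per alert type.

    `alerts` is an iterable of (alert_type, actionable) pairs, where
    `actionable` is True if someone actually had to do something.
    Returns rows of (alert_type, total, noise, noise_fraction),
    noisiest type first.
    """
    total = Counter()
    noise = Counter()
    for alert_type, actionable in alerts:
        total[alert_type] += 1
        if not actionable:
            noise[alert_type] += 1
    rows = [(t, total[t], noise[t], noise[t] / total[t]) for t in total]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up sample: three bogus "server down" alerts, one real, one real disk alert.
sample = [("server down", False)] * 3 + [("server down", True), ("disk full", True)]
for alert_type, count, noisy, fraction in noise_report(sample):
    print(f"{alert_type}: {noisy} of {count} were noise ({fraction:.0%})")
```

Multiply the noise count by the time each bogus alert costs you and you have the start of the hard-dollar figure for the report.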

I realize a lot of IT shops don't work that way, but it sounds like it's not going to fix itself. My approach is if I'm in a situation like that, work to make it better.
 
Sambones said:
I would start with collecting stats on how much useless noise there is. That is, find the signal to noise ratio. Then write up a concise report on how the monitoring is not working due to the useless alerts disguising/burying the real alerts. Use real numbers and include suggestions on how to fix it. Put it in terms of impact to the business and include hard dollar impact if you can. Present that to your management as a problem that needs to be fixed. Engage the person responsible for the monitoring and their management too.

I might consider that. Who has time for it, of course?
 
johnnygage said:
Who has time for it, of course?

Isn't that like, I don't know.... like YOUR JOB ? [ponder]


---- Andy

There is a great need for a sarcasm font.
 
Andy beat me to it. [bigsmile]

I would also say, who has time for weeding through all those bogus email alerts?

Also, that kind of thing often leads to raises and promotions.

 
johnnygage said:
Who has time for it, of course?

Professionals who like what they do and try to improve the way they do it when it's needed (which seems to be the case here).

Just don't tell the other teams that it is wrong and leave it at that. Being proactive is what is required, and offering solutions even when it's not your job to give them is, at good companies, something that managers like.

Regards

Frederico Fonseca
SysSoft Integrated Ltd

FAQ219-2884
FAQ181-2886
 