I am using the nagios3 v1.1.5 JumpBox with automatic backups to S3. After a few months with no problems, the system started sending emails about backup errors. The email reads:
--
There was an error while performing your JumpBox backup.
Details:
Hostname: nagios3-jb
JumpBox: Nagios 3
Version: 1.1-242
:::STDOUT:::
:::STDERR:::
tar: Removing leading `/' from member names
tar: Removing leading `/' from member names
--
After this email, the JumpBox stops working. I can't connect to the web administration or the Nagios website. When I try to connect to the VMware console, I get the error message:
"Unable to connect to the MKS: Cannot connect to host 10.28.90.1: Unknown error 10060 (0x274c)."
###
The last 3-4 times this happened, I started a new JumpBox and restored from the backup taken the day before the problem occurred. Normally I could use the "new" JumpBox for another 2-3 weeks before the error repeated.
But now I am getting this error about every 3 days, and I am tired of always starting a new JumpBox and restoring from an old backup.
Can you give me any help on how to resolve this problem?
Greetings from Germany
Max
Jumpbox dead after Backup Error
That's no good. This sounds awfully fishy: there is no apparent error in the email, and I have never seen anything like your VMware error except in cases where VMware itself was having problems.
When you say you cannot connect to Nagios or the Web Administration pages, does that mean there is NO HTTP response at all, or that you get the "JumpBox is in maintenance mode" message?
Have you tried rebooting the JumpBox and seeing what happens?
Austin
Jumpbox dead after Backup Error
What shall I do now? Do you need any more information, so you can help me?
Max
Jumpbox dead after Backup Error
Finding out why the Nagios config file is corrupt is probably the key point here.
I haven't had a chance to look up the command, but there is one that validates the Nagios config files. It gives fairly verbose output and tells you which line or lines are wrong. The output from "sudo /etc/init.d/nagios start" might be sufficient ... in fact, the init script itself might have a "validate" or "check" option.
Next time it happens, report that information back here. If I get a chance to look up that command I will, but it's not looking good.
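For reference, the validation mentioned above is most likely the Nagios binary's "-v" (verify config) option. Here is a rough sketch of how that could be wrapped. The binary and config paths are my assumptions based on a typical Debian/Ubuntu Nagios 3 layout, not something I have confirmed on the JumpBox, and the function name check_nagios_config is made up for this example:

```shell
#!/bin/sh
# ASSUMPTION: these paths match a Debian/Ubuntu-style Nagios 3 package.
# They may differ on the JumpBox; override them via the environment if so.
NAGIOS_BIN="${NAGIOS_BIN:-/usr/sbin/nagios3}"
NAGIOS_CFG="${NAGIOS_CFG:-/etc/nagios3/nagios.cfg}"

# Hypothetical helper: run the Nagios pre-flight check (-v) on the main
# config file without starting the daemon, and report the result.
check_nagios_config() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG"; then
        echo "config OK"
    else
        echo "config check failed"
        return 1
    fi
}
```

After sourcing the snippet, run check_nagios_config (with sudo if the config files are not readable by your user). The verbose output printed by the "-v" run above the final status line is the part worth posting back here.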
Austin
Jumpbox dead after Backup Error
But why is the whole JumpBox dead after reboot? This can't just be a Nagios error...
Max
Jumpbox dead after Backup Error
How much of the boot do you see before you see that error?
Do you see the VMware BIOS screen splash up? Do you see any of the JumpBox logo and progress bar? I don't believe the error on the console is coming from the JumpBox itself; it's coming from VMware. It doesn't even sound like a missing virtual disk. It seems lower level than that.
Austin
Jumpbox dead after Backup Error
When the backup error occurs and the email is sent to me, I am still able to open the Nagios website, which then shows the home screen and the Nagios version. But when I click on any link, for example "Tactical Overview", Nagios writes:
--
Whoops!
Error: Could not read host and service status information!
The most common cause of this error message (...) is the fact that Nagios is not actually running. (...)
--
At this point I can still ssh into the box. When I then try to start nagios on the command line, it tells me that the configuration is corrupt (or something like this, I don't remember the exact words).
After rebooting the JumpBox there is no HTTP response any more, and I can't ssh into the system.
I have also rebooted the vmware host system, but that did not resolve the issue. I also have 3 other jumpboxes running on the same host and making backups to S3 without any problems. *knock-on-wood*
Max