DL380 – HP Insight Manager Goodness

Over the weekend I had the opportunity to experience the HP Insight Manager goodness. We were on our way to Melbourne to go shopping for the day, early Christmas shopping, and about halfway there I received an email from one of our servers, “IDM”, which had detected a drive failure. The failed drive is part of a RAID5 array, so we could replace the disk and the array should rebuild successfully, as long as we replaced the drive before another disk failed!

We have seen a few disk failures before on the DL380 servers and have had no issues with replacing the disks and rebuilding them, but it was only recently that we started updating all the servers with the latest firmware and configuring them to send alerts on failures. So this was the first time we’d seen the email alerts for a disk failure, which meant that we could deal with it straight away instead of waiting for someone to notice the red light on the failed disk when they were in the server room.

With the drive failure occurring on the weekend, and as I was an hour or so away, we had to do a quick call out to our IT staff to see who was available to perform the disk swap. As luck would have it, someone was heading in to catch up on a few hours of work and was only a few minutes away. This particular member of staff is our web developer and has pretty good knowledge of hardware but hadn’t had experience with our DL380 servers before. So, over the phone, I talked him through changing the disks over and starting the rebuild process. When we had our first disk failure on a DL380 we found it hard to find documentation on what to do, probably because the process is so easy; we didn’t expect the RAID controller to do so much of the work by itself.

All Jeff had to do was unpack a new drive from its box, remove the failed drive and then insert the new disk. The new disk comes already mounted in its caddy, ready to slide into the server, so there’s no need to find a screwdriver, remove the old disk and fit the new one in the caddy. After the new disk is inserted, the RAID controller detects it, initialises it and then begins the rebuild. This particular RAID array wasn’t too large and took less than an hour to rebuild, and each time the disk status changed, the server detected the change and sent a notification message. So as Jeff was replacing the disk, I was getting the notification messages instantly on my phone.

I’ve included the emails from the Insight Manager below. At the moment we only have a few of our DL380s running the current firmware and Insight Manager, because of a problem we found with the SCSI backplane and the newer firmware: after the latest update, the status LED for one of the SCSI disks in the server fails to light up, green or red. We held off on continuing with the firmware updates but may reconsider; when we get such comprehensive information from the Insight Manager and its alert emails, the gains seem to outweigh the inconvenience of the LED issue.

Initial email – detected drive failure
—————————————————————————————————————————–
From: <ProLiant@>
Date: Sat, 22 Nov 2008 09:31:01 +1100
Subject: Storage Agents: Physical Drive Status Change

The system has detected the following event:
SNMP Trap:      3046
Date time:      11/22/2008  09:31:00 AM
Computer:       IDM
Source:         Storage Agents
Type:           Error
Category:       (4)
Description:
A ‘Physical Drive Status Change’ trap signifies that the agent has detected a change in the status of a drive array physical drive.
Details:
IDA Physical Drive Status ‘FAILED’
Drive Type 2
Location  ‘SCSI Port 1 Drive 3’
Error Code 13
Bus # 1
Controller Slot # 2
Model ‘COMPAQ  BD14689BB9      ‘
Serial Number ‘DAA1P6909WNS0637’
Firmware Revision ‘HPB1’

Second Email – new disk inserted and initialised ‘OK’
—————————————————————————————————————————–
From: <ProLiant@>
Date: Sat, 22 Nov 2008 11:47:07 +1100
Subject: Storage Agents: Physical Drive Status Change

The system has detected the following event:
SNMP Trap:      3046
Date time:      11/22/2008  11:47:06 AM
Computer:       IDM
Source:         Storage Agents
Type:           Informational
Category:       (4)
Description:
A ‘Physical Drive Status Change’ trap signifies that the agent has detected a change in the status of a drive array physical drive.
Details:
IDA Physical Drive Status ‘OK’
Drive Type 2
Location  ‘SCSI Port 1 Drive 3’
Error Code 0
Bus # 1
Controller Slot # 2
Model ‘COMPAQ  BF14684970      ‘
Serial Number ‘        J4W1PB3C’
Firmware Revision ‘HPB5’

Third Email – new disk is being rebuilt in the RAID5 array
—————————————————————————————————————————–
From: <ProLiant@>
Date: Sat, 22 Nov 2008 11:47:07 +1100
Subject: Storage Agents: Logical Drive Status Change

The system has detected the following event:
SNMP Trap:      3034
Date time:      11/22/2008  11:47:06 AM
Computer:       IDM
Source:         Storage Agents
Type:           Warning
Category:       (4)
Description:
A ‘Logical Drive Status Change’ trap signifies that the agent has detected a change in the status of a drive array logical drive.
Details:
IDA Logical Drive Status ‘REBUILDING’
Logical Drive # 2
Controller Slot # 2

Fourth Email – all done, RAID rebuilding complete and disk is OK, back to normal
—————————————————————————————————————————–
From: <ProLiant@>
Date: Sat, 22 Nov 2008 12:37:07 +1100
Subject: Storage Agents: Logical Drive Status Change

The system has detected the following event:
SNMP Trap:      3034
Date time:      11/22/2008  12:37:07 PM
Computer:       IDM
Source:         Storage Agents
Type:           Informational
Category:       (4)
Description:
A ‘Logical Drive Status Change’ trap signifies that the agent has detected a change in the status of a drive array logical drive.
Details:
IDA Logical Drive Status ‘OK’
Logical Drive # 2
Controller Slot # 2
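
Since the alerts arrive as plain text in a fixed layout, they are easy to pick apart with a script. Below is a rough Python sketch, not anything shipped with Insight Manager: the helper names `parse_alert` and `alert_time` are made up for illustration, and the parsing rules simply assume the email bodies look like the ones above (US-style month/day/year timestamps, a status word wrapped in quotes). It pulls out the header fields and the drive status, then works out how long the rebuild took from the third and fourth emails.

```python
import re
from datetime import datetime

# Field names are copied from the alert emails above; the matching rules
# are an assumption about the layout, not a documented format.
HEADER_RE = re.compile(r"^(SNMP Trap|Date time|Computer|Source|Type|Category):\s*(.*)$")
STATUS_RE = re.compile(r"^IDA (Physical|Logical) Drive Status\s+\W?([A-Z]+)\W?$")


def parse_alert(body):
    """Return a dict of the header fields plus the drive kind and status."""
    alert = {}
    for raw in body.splitlines():
        line = raw.strip()
        header = HEADER_RE.match(line)
        if header:
            alert[header.group(1)] = header.group(2).strip()
            continue
        status = STATUS_RE.match(line)
        if status:
            alert["Drive kind"] = status.group(1)  # "Physical" or "Logical"
            alert["Status"] = status.group(2)      # e.g. "FAILED", "REBUILDING", "OK"
    return alert


def alert_time(alert):
    """Parse the 'Date time' field, e.g. '11/22/2008  11:47:06 AM'."""
    stamp = " ".join(alert["Date time"].split())  # collapse the double space
    return datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")


# Trimmed copies of the third and fourth emails above.
REBUILD_STARTED = """SNMP Trap:      3034
Date time:      11/22/2008  11:47:06 AM
Computer:       IDM
Type:           Warning
IDA Logical Drive Status ‘REBUILDING’
"""

REBUILD_DONE = """SNMP Trap:      3034
Date time:      11/22/2008  12:37:07 PM
Computer:       IDM
Type:           Informational
IDA Logical Drive Status ‘OK’
"""

if __name__ == "__main__":
    started = parse_alert(REBUILD_STARTED)
    done = parse_alert(REBUILD_DONE)
    print(started["Computer"], started["Status"], "->", done["Status"])
    print("Rebuild took", alert_time(done) - alert_time(started))  # 0:50:01
```

Going by the timestamps in the emails, the rebuild ran from 11:47:06 to 12:37:07, which is the "less than an hour" mentioned earlier.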

After reading Mick Liubinskas’ post on ‘How I Blog’ I thought I’d try a quick and nasty two-beer post with minimal spell checking and absolutely no grammar checking or proofreading...