Debian Linux and LSI MegaRAID SAS
This HowTo shows how to check the health of hard disks connected to an LSI Logic/Symbios Logic MegaRAID SAS 2108 RAID controller under Linux. Much of it is also useful for other hardware RAID controllers.
First, check that the controller is present in the system:
~] lspci | grep RAID
01:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS 2108 [Liberator] (rev 03)
Bingo! We can work with this one.
Install Linux utilities
LSI provides megacli, a proprietary command-line management utility. A Debian repository with packages of proprietary and open-source tools for many hardware RAID cards is provided by hwraid.le-vert.net (see Sources).
My system runs Debian. Add the repository to the /etc/apt/sources.list file in this format:
deb http://hwraid.le-vert.net/distrib branch main
- distrib: either debian or ubuntu.
- branch: squeeze, wheezy, jessie, stretch, buster or bullseye for Debian; precise, trusty, vivid, wily, xenial, etc. for Ubuntu.
For my server (Debian bullseye), add this as the last line of /etc/apt/sources.list:
deb http://hwraid.le-vert.net/debian bullseye main
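The distrib/branch format described above can be sketched as a small helper that assembles the line. The function name hwraid_repo_line is made up for this illustration; only the URL layout comes from the text.

```shell
# Build the APT source line for the hwraid.le-vert.net repository.
# $1 = distrib (debian or ubuntu), $2 = branch (e.g. bullseye)
# hwraid_repo_line is a hypothetical helper name, not part of any package.
hwraid_repo_line() {
    case "$1" in
        debian|ubuntu) ;;
        *) echo "distrib must be debian or ubuntu" >&2; return 1 ;;
    esac
    [ -n "$2" ] || { echo "branch is required" >&2; return 1; }
    printf 'deb http://hwraid.le-vert.net/%s %s main\n' "$1" "$2"
}
```

For example, hwraid_repo_line debian bullseye prints the line used on my server; as root, its output can be appended to /etc/apt/sources.list with >>.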
The packages are signed, so import the repository key after adding the repository to sources.list:
wget -O - https://hwraid.le-vert.net/debian/hwraid.le-vert.net.gpg.key | sudo apt-key add -
Run apt-get update, then install the MegaCli utility and the megaclisas-status wrapper script:
~] apt-get update
~] apt-get install megacli
~] apt-get install megaclisas-status
megacli
megacli is a proprietary tool by LSI which can perform both reporting and management for MegaRAID SAS cards. However, it is really hard to use because it takes tons of command-line parameters and there is no documentation.
Many MegaCli commands make use of the following parameters:
- -aN: Specifies the adapter. Use -a0 for the first adapter, -a1 for the second, or -aALL for all adapters.
- -PhysDrv [E:S]: Specifies a physical drive. E is the enclosure ID, as returned by MegaCli -EncInfo -aALL, and S is the slot number. If more than one drive has to be specified, the drives are written in the form [E:S, E:S, ...].
- -Lx: Specifies a virtual drive (aka RAID array), where x is a number starting with zero or the string all.
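As an illustration of the -PhysDrv form above, here is a minimal sketch that joins enclosure:slot pairs into the bracketed list. The helper name physdrv_arg is hypothetical. It emits the list without spaces so it stays a single shell word; quoting the result is still advisable, because square brackets are glob characters in the shell.

```shell
# Join enclosure:slot pairs into the bracketed form used by -PhysDrv.
# physdrv_arg is a hypothetical helper name used only for this sketch.
physdrv_arg() {
    list=""
    for pair in "$@"; do
        case "$pair" in
            # reject anything that is not exactly digits:digits
            *[!0-9:]*|*:*:*|:*|*:|"") echo "bad E:S pair: $pair" >&2; return 1 ;;
            *:*) ;;
            *) echo "bad E:S pair: $pair" >&2; return 1 ;;
        esac
        list="${list:+$list,}$pair"
    done
    [ -n "$list" ] || return 1
    printf '[%s]\n' "$list"
}
```

It could then be used as, for example, megacli -PDInfo -PhysDrv "$(physdrv_arg 252:0)" -a0.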
Battery status
root@monitoring:~# megacli -AdpBbuCmd -aALL
BBU status for Adapter: 0
BatteryType: iBBU
Voltage: 3958 mV
Current: 0 mA
Temperature: 29 C
Battery State: Degraded(Need Attention)
A manual learn is required.
BBU Firmware Status:
Charging Status : None
Voltage : OK
Temperature : OK
Learn Cycle Requested : Yes
Learn Cycle Active : No
Learn Cycle Status : OK
Learn Cycle Timeout : No
I2c Errors Detected : No
Battery Pack Missing : No
Battery Replacement required : Yes
Remaining Capacity Low : No
Periodic Learn Required : No
Transparent Learn : No
No space to cache offload : No
Pack is about to fail & should be replaced : No
Cache Offload premium feature required : No
Module microcode update required : No
GasGuageStatus:
Fully Discharged : No
Fully Charged : Yes
Discharging : Yes
Initialized : Yes
Remaining Time Alarm : No
Discharge Terminated : No
Over Temperature : No
Charging Terminated : No
Over Charged : No
Relative State of Charge: 96 %
Charger System State: 49168
Charger System Ctrl: 0
Charging current: 0 mA
Absolute state of charge: 5125 %
Max Error: 100 %
Battery backup charge time : 48 hours +
BBU Capacity Info for Adapter: 0
Relative State of Charge: 96 %
Absolute State of charge: 5125 %
Remaining Capacity: 62267 mAh
Full Charge Capacity: 64754 mAh
Run time to empty: Battery is not being charged.
Average time to empty: Battery is not being charged.
Estimated Time to full recharge: Battery is not being charged.
Cycle Count: 4158
Max Error = 100 %
Remaining Capacity Alarm = 120 mAh
Remining Time Alarm = 10 Min
BBU Design Info for Adapter: 0
Date of Manufacture: 04/15, 2010
Design Capacity: 1215 mAh
Design Voltage: 3700 mV
Specification Info: 33
Serial Number: 11756
Pack Stat Configuration: 0x6480
Manufacture Name: LS1121001A
Firmware Version :
Device Name: 3150301
Device Chemistry: LION
Battery FRU: N/A
Transparent Learn = 0
App Data = 0
BBU Properties for Adapter: 0
Auto Learn Period: 30 Days
Next Learn time: Thu Jun 15 18:57:30 2023
Learn Delay Interval:0 Hours
Auto-Learn Mode: Enabled
Exit Code: 0x00
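The report is long, and for a quick check only two lines usually matter: Battery State and Battery Replacement required. The sketch below pulls those two out of the output. The function name bbu_summary is made up, and treating any state other than Degraded as healthy is an assumption of this example.

```shell
# Summarize BBU health from `megacli -AdpBbuCmd -aALL` output read on stdin.
# Prints "state|replacement-required"; exit status 1 when attention is needed.
# Assumption: only "Degraded" states or a "Yes" replacement flag count as bad.
bbu_summary() {
    awk -F': *' '
        /^[ \t]*Battery State[ \t]*:/                { state = $2 }
        /^[ \t]*Battery Replacement required[ \t]*:/ { repl = $2 }
        END {
            print state "|" repl
            if (state ~ /Degraded/ || repl == "Yes") exit 1
        }'
}
```

Usage: megacli -AdpBbuCmd -aALL | bbu_summary. For the output above it prints Degraded(Need Attention)|Yes and returns a non-zero exit status.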
Array status
To check the status of the whole array, we can query the controller's virtual drives with megacli.
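The array state is reported per virtual drive. A commonly used command for this is megacli -LDInfo -Lall -aALL. Both that command and the output shape parsed below (a "Virtual Drive: N" line followed by a "State : ..." line) are assumptions of this sketch, not taken from the text above, and ld_states is a hypothetical helper name.

```shell
# Print the state of each virtual drive from `megacli -LDInfo -Lall -aALL`
# output on stdin; exit status 1 if any drive is not "Optimal".
# Assumption: each virtual drive block contains a line "State : <value>".
ld_states() {
    awk -F': *' '
        /^Virtual Drive[ \t]*:/ { split($2, a, " "); vd = a[1] }
        /^State[ \t]*:/ {
            print "VD " vd ": " $2
            if ($2 != "Optimal") bad = 1
        }
        END { exit (bad + 0) }'
}
```

Usage: megacli -LDInfo -Lall -aALL | ld_states.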
Replacing Disks
Automatic array rebuild with a new disk
1. Disk status before a disk failure:
disk number | Enclosure Device ID | Slot Number | Firmware state |
---|---|---|---|
disk1 | 252 | 0 | Hotspare, Spun Up |
disk2 | 252 | 1 | Online, Spun Up |
disk3 | 252 | 2 | Online, Spun Up |
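A table like the one above can be condensed from the per-drive listing. The sketch below assumes the listing comes from megacli -PDList -aALL and that each drive block contains the lines Enclosure Device ID, Slot Number and Firmware state, in that order. The helper name pd_table is made up.

```shell
# Condense `megacli -PDList -aALL` output (stdin) into one line per drive:
#   <enclosure> | <slot> | <firmware state>
# Assumption: the three fields appear in this order within each drive block.
pd_table() {
    awk -F': *' '
        /^Enclosure Device ID[ \t]*:/ { enc = $2 }
        /^Slot Number[ \t]*:/         { slot = $2 }
        /^Firmware state[ \t]*:/      { print enc " | " slot " | " $2 }'
}
```

Usage: megacli -PDList -aALL | pd_table.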
2. Identify that a RAID array is degraded
I have a RAID array of 3 disks (one of them a hot spare), and one disk has failed.
3. Find the failed disk
Normally, when a disk fails and the controller detects it, the disk is marked as bad: Firmware state: Failed or Firmware state: Unconfigured(bad). It needs to be replaced; if the replacement is a brand-new disk, the controller starts rebuilding the RAID as soon as the disk is swapped in.
Disk status after one disk failure:
disk number | Enclosure Device ID | Slot Number | Firmware state |
---|---|---|---|
disk1 | 252 | 0 | Online, Spun Up |
disk2 | 252 | 1 | Unconfigured(bad) |
disk3 | 252 | 2 | Online, Spun Up |
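Picking out the bad drive can be automated by filtering on the two firmware states named above. This is a sketch under the same assumption about the megacli -PDList -aALL output format (Enclosure Device ID, Slot Number and Firmware state lines per drive); failed_drives is a hypothetical name.

```shell
# Print [E:S] for every drive whose firmware state is Failed or
# Unconfigured(bad). Reads `megacli -PDList -aALL` output on stdin.
failed_drives() {
    awk -F': *' '
        /^Enclosure Device ID[ \t]*:/ { enc = $2 }
        /^Slot Number[ \t]*:/         { slot = $2 }
        /^Firmware state[ \t]*:/ {
            if ($2 ~ /^Failed/ || $2 ~ /^Unconfigured\(bad\)/)
                print "[" enc ":" slot "]"
        }'
}
```

The printed [E:S] value is already in the form that -PhysDrv expects.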
4. Check the automatic rebuild process:
Identify the disk that is being rebuilt (here the former hot spare in slot 0, which took over from the failed disk). Using its Enclosure Device ID and Slot Number, we can form the following command to check the rebuild progress in percent:
root@monitoring:~# megacli -PDRbld -ShowProg -PhysDrv [252:0] -aALL
Device(Encl-252 Slot-0) is not in rebuild process
Exit Code: 0x00
The hot spare disk is now Online, Spun Up and the rebuild process has finished. You can replace disk2 now :-)
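To wait for a rebuild instead of checking by hand, the "is not in rebuild process" message shown above can serve as the stop condition. A minimal sketch follows; rebuild_done is a hypothetical helper name.

```shell
# Return 0 once the drive reports that it is no longer rebuilding.
# Reads `megacli -PDRbld -ShowProg ...` output on stdin and looks for the
# message shown in the transcript above.
rebuild_done() {
    grep -q 'is not in rebuild process'
}
```

A polling loop could then look like: until megacli -PDRbld -ShowProg -PhysDrv '[252:0]' -aALL | rebuild_done; do sleep 60; done. Note that the loop never ends if megacli itself fails, for example when it is run without root privileges.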
megaclisas-status
megaclisas-status is a wrapper script around megacli that reports a summarized RAID status and can run periodic checks. It is available in the same package repository.
The package comes with a Python wrapper around megacli and an init script that periodically runs this wrapper to check the status. It keeps a file with the latest status and is thus able to detect RAID status changes and/or breakage. It logs a line to syslog when something fails and sends you a mail. Until the arrays are healthy again, a reminder is sent every 2 hours.
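The idea of keeping the last status in a file to detect changes can be sketched as follows. This is not the package's own implementation, only an illustration of the technique; the function name status_changed and the state file location are hypothetical.

```shell
# Compare the current status text (stdin) with the last saved status.
# Prints "changed" or "unchanged" and then stores the new status.
# $1 = path of the state file (hypothetical, not the package's own file).
status_changed() {
    state_file=$1
    new=$(cat)
    old=""
    [ -f "$state_file" ] && old=$(cat "$state_file")
    printf '%s\n' "$new" > "$state_file"
    if [ "$new" = "$old" ]; then
        echo unchanged
    else
        echo changed
    fi
}
```

For example, megaclisas-status --nagios | status_changed /var/tmp/raid.state reports "changed" on the first run and whenever the one-line status differs from the previous run; a cron job could pass such changes to logger to write them to syslog.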
megaclisas-status output examples
Wrapper output example
~] megaclisas-status
-- Controller information --
-- ID | H/W Model | RAM | Temp | BBU | Firmware
c0 | ServeRAID M5014 SAS/SATA Controller | 256MB | N/A | REPL | FW: 12.15.0-0199
-- Array information --
-- ID | Type | Size | Strpsz | Flags | DskCache | Status | OS Path | CacheCade |InProgress
c0u0 | RAID-1 | 136G | 128 KB | RA,WT | Disabled | Optimal | /dev/sda | None |None
-- Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID
c0u0p0 | HDD | IBM-ESXSST9146852SS B62C3TB1H60G0324B62C | 135.9 Gb | Online, Spun Up | 6.0Gb/s | 35C | [252:1] | 10
c0u0p1 | HDD | IBM-ESXSST9146852SS B62C3TB1H5J50324B62C | 135.9 Gb | Online, Spun Up | 6.0Gb/s | 32C | [252:2] | 9
-- Unconfigured Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID | Path
c0uXpY | HDD | IBM-ESXSST9146852SS B62C3TB19TDM0324B62C | 135.9 Gb | Hotspare, Spun Up | 6.0Gb/s | 33C | [252:0] | 13 | N/A
icinga2 integration
The script can be called with the --nagios parameter. This forces single-line output and returns exit code 0 if everything is fine, or 2 if at least one thing is wrong. These are the return codes Nagios expects for OK and CRITICAL.
~] megaclisas-status --nagios
RAID OK - Arrays: OK:1 Bad:0 - Disks: OK:3 Bad:0
~] echo $?
0
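For reference, the Nagios plugin convention maps exit codes to states as in this small sketch (nagios_state is a hypothetical helper name; megaclisas-status itself only uses 0 and 2, as described above).

```shell
# Translate a Nagios plugin exit code into its state name.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}
```

Usage: megaclisas-status --nagios; nagios_state $?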
create symlink to megaclisas-status
# find full path to megaclisas-status script
~] which megaclisas-status
/usr/sbin/megaclisas-status
# go to nagios plugins directory
~] cd /usr/lib/nagios/plugins/
# create symlink with name check_megaclisas_status
~] ln -s /usr/sbin/megaclisas-status check_megaclisas_status
run megaclisas-status as root
megaclisas-status needs root privileges to run. Go to the /etc/sudoers.d/ directory and create a file named monitoring with this content:
Cmnd_Alias CMD_MONITORING = /usr/lib/nagios/plugins/check_megaclisas_status, /usr/sbin/megaclisas-status
nagios ALL=(ALL) NOPASSWD: CMD_MONITORING
Check that it works:
~] su - nagios
~] sudo /usr/sbin/megaclisas-status --nagios
RAID OK - Arrays: OK:1 Bad:0 - Disks: OK:3 Bad:0
~] sudo /usr/lib/nagios/plugins/check_megaclisas_status --nagios
RAID OK - Arrays: OK:1 Bad:0 - Disks: OK:3 Bad:0
create icinga2 check command definition
Create megaclisas_status.conf file in your icinga2 config directory with this content:
object CheckCommand "megaclisas_status" {
  command = [ "sudo", "/usr/lib/nagios/plugins/check_megaclisas_status" ]
  arguments = {
    "--nagios" = {
      required = true
    }
  }
}
create service config and add service to server
Go to the icinga2 config directory, create a file with the service definition, and add the service to the server:
object Service "megaraid" {
  host_name = "monitoring.secar.cz"
  check_command = "megaclisas_status"
  check_interval = 1m
  retry_interval = 30s
  max_check_attempts = 5
}
- host_name: the host this service is added to (here monitoring.secar.cz).
restart icinga2 service
Check the integrity of the icinga2 configuration files, then restart the service:
~] icinga2 daemon -C
~] /etc/init.d/icinga2 restart
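The two commands can be chained so that the restart only happens when the configuration check passes. This is a sketch; the function name restart_if_valid and the ICINGA2 and RESTART variables are assumptions of this example, introduced so the logic can be tried without touching a live service.

```shell
# Restart icinga2 only when the configuration validates.
# ICINGA2 and RESTART are hypothetical override points for dry runs.
ICINGA2=${ICINGA2:-icinga2}
RESTART=${RESTART:-/etc/init.d/icinga2 restart}

restart_if_valid() {
    # $ICINGA2 and $RESTART are intentionally unquoted so that a value such
    # as "/etc/init.d/icinga2 restart" splits into command and argument.
    if $ICINGA2 daemon -C; then
        $RESTART
    else
        echo "icinga2 configuration check failed, not restarting" >&2
        return 1
    fi
}
```

Running restart_if_valid as root keeps a broken configuration from taking the monitoring service down.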
Sources:
- https://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS
- https://tipstricks.itmatrix.eu/checking-the-health-of-lsi-logic-symbios-logic-megaraid-sas-2108-raid-controller/
- https://hwraid.le-vert.net/wiki/DebianPackages
- https://hwraid.le-vert.net/debian/