A9K-RSP5-SE DIE_DIMM3 in failure state

Hello all. On our RSP5-SE we keep getting a DIE_DIMM3 error, and show alarms gives us this:

RP/0/RSP0/CPU0:ios#sh alarms detail system active
Wed Sep 30 10:38:32.055 UTC

Active Alarms

Description:             DIE_DIMM3: in a failure state.
Location:                0/RSP0
AID:                     SM/HW_ENVMON_SENSOR_ALARM/4
Tag String:              FAM_FAULT_TAG_HW_ENVMON_NM_SENSOR_FAULT
Module Name:             N/A
EID:                     CHASSIS/LCC/1:CONTAINER/CC/1:MODULE/RP/1:MODULE/MOTHER_BOARD/1:SENSOR/TEMP/8
Reporting Agent ID:      50
Pending Sync:            false
Severity:                Minor
Status:                  Set
Group:                   Environ
Set Time:                09/30/2020 10:28:41 UTC
Clear Time:              -
Service Affecting:       NotServiceAffecting
Transport Direction:     NotSpecified
Transport Source:        NotSpecified
Threshold Value:         -
Current Value:           -
Bucket Type:             NotSpecified
Event Type:              Default
Interface:               NIL
Alarm Name:              sensor in a failure state



After doing some digging I learned that the RSP5-SE comes with 40GB of ECC DIMMs.  I removed the metal protective plate to see if moving the DIMMs around would fix this error.  What I learned is:

There are 6 DIMM slots and only 5 of them are populated.  The memory modules are 8GB each, so 5 x 8GB adds up to the 40GB that Cisco ships the RSP5-SE with.  But leaving a DIMM slot open constantly triggers this DIMM3 error.  Is this a bug in IOS XR x64 6.5.1?  Is this normal/acceptable?  I realize the error is merely "cosmetic", but it is still an error.  Any input will be greatly appreciated.


Right, having the alarm set won't cause any impact on the system.



This looks to be fixed via CSCvq17023.


RSP5: Add support for displaying/hiding DIMM. RSP5 TR can have 16G or 24G memory(8+8+8) or (16+8).

Depending on number of DIMMs physically present on board, user should only see corresponding DIMM sensor.


The bug notes also say that the sensors were tested on -SE cards like the one you have.


Integrated-releases: 06.06.03 07.00.02 07.01.01 07.02.01


Let me double check with the folks that fixed this to make sure it will fix your condition.




Thank you Sam.  I would have expected to see sensors only for the DIMMs that are actually installed on the RSP.  I moved the DIMMs around to see whether the alarm would follow the empty slot, and it did.  For example, after I moved DIMM5 to DIMM3, DIMM5 showed the error.  I moved the DIMMs around again, left DIMM2 empty, and got the same error on DIMM2.  Below is the show env all output from when DIMM3 was empty.  The temp sensor shows "-" (null), I assume because the RSP can't read a temperature when there is no DIMM in slot DIMM3.



Location  TEMPERATURE                 Value   Crit Major Minor Minor Major  Crit
          Sensor                    (deg C)   (Lo)  (Lo)  (Lo)  (Hi)  (Hi)  (Hi)

          DIE_FabArbiter0                53    -10    -5     0   115   125   140
          DIE_FabSwitch0                 62    -10    -5     0   115   125   140
          DIE_FabSwitch1                 58    -10    -5     0   115   125   140
          DIE_CPU                        46    -10    -5     0    90    95   110
          DIE_PCH                        49    -10    -5     0    87   100   115
          DIE_DIMM0                      41    -10    -5     0    80    85   100
          DIE_DIMM2                      41    -10    -5     0    80    85   100
          DIE_DIMM3                       -    -10    -5     0    80    85   100
          DIE_DIMM4                      37    -10    -5     0    80    85   100
          DIE_DIMM5                      36    -10    -5     0    80    85   100
          SKYBLT0_Inlet                  43    -10    -5     0    80    85   100
          SKYBLT1_Inlet                  39    -10    -5     0    80    85   100
          High_Power                     58    -10    -5     0    80    85   100
          AIR_Outlet                     48    -10    -5     0    80    85   100
          Inlet                          36    -10    -5     0    70    85   100
          Hotspot                        53    -10    -5     0    90    93    95
          DIE_Aldrin                     61    -10    -5     0   100   110   125
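
For anyone who wants to spot this condition across several routers, here is a rough Python sketch that scans saved temperature output like the table above and lists sensors whose Value column is "-". It assumes the row layout shown here (sensor name, value, six thresholds); the function name and the sample text are made up for illustration, and the layout may differ on other releases.

```python
def sensors_without_reading(text):
    """Return sensor names whose Value column is '-' (no reading).

    Assumes each data row looks like the table in this thread:
    <sensor> <value> <6 integer thresholds>. Header and blank
    lines are skipped because they do not match that shape.
    """
    def is_int(token):
        try:
            int(token)
            return True
        except ValueError:
            return False

    missing = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly 8 fields and 6 integer thresholds.
        if len(fields) != 8 or not all(is_int(f) for f in fields[2:]):
            continue
        name, value = fields[0], fields[1]
        if value == "-":
            missing.append(name)
    return missing


# Hypothetical sample copied from the output above.
SAMPLE = """\
Location  TEMPERATURE                 Value   Crit Major Minor Minor Major  Crit
          Sensor                    (deg C)   (Lo)  (Lo)  (Lo)  (Hi)  (Hi)  (Hi)

          DIE_DIMM2                      41    -10    -5     0    80    85   100
          DIE_DIMM3                       -    -10    -5     0    80    85   100
          DIE_DIMM4                      37    -10    -5     0    80    85   100
"""

if __name__ == "__main__":
    print(sensors_without_reading(SAMPLE))
```

A sensor listed here only means the value could not be read; by itself it does not tell you whether the slot is empty or the sensor is faulty.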



I moved the memory out of DIMM3 and now the error is gone, but the system is showing less than 40GB of memory:


RP/0/RSP0/CPU0:ios#sh memory summary
Sun Oct 4 09:32:33.728 UTC

node: node0_RSP0_CPU0

Physical Memory: 35123M total (31072M available)
Application Memory : 35123M (29793M available)
Image: 4M (bootram: 0M)
Reserved: 0M, IOMem: 0M, flashfsys: 0M
Total shared window: 131M
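
To put a number on the difference, here is a small Python sketch that pulls the physical memory figure out of the summary above and compares it with the installed DIMM capacity. It assumes five 8GB DIMMs are installed and that the "M" in the output means MiB; neither is confirmed by the output itself, and the function name is made up for illustration.

```python
import re


def physical_memory_mib(summary_text):
    """Extract the 'Physical Memory: <n>M total' figure, or None if absent."""
    match = re.search(r"Physical Memory:\s*(\d+)M total", summary_text)
    return int(match.group(1)) if match else None


# Assumption: five 8GB DIMMs installed, and 1GB = 1024M.
INSTALLED_MIB = 5 * 8 * 1024

SUMMARY = "Physical Memory: 35123M total (31072M available)"

if __name__ == "__main__":
    reported = physical_memory_mib(SUMMARY)
    gap = INSTALLED_MIB - reported
    print(f"installed: {INSTALLED_MIB}M, reported: {reported}M, gap: {gap}M")
    print(f"reported is about {reported / 1024:.1f}G of {INSTALLED_MIB // 1024}G")
```

This only measures the gap. It says nothing about where the missing memory went.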



So the bug I quoted only fixes the -TR version of the card. I have just raised a new bug to fix the -SE version (CSCvw01617).

I do not have an ETA at this time.




Thanks so much Sam. I’m guessing this might be just “cosmetic” and is not service affecting. 


Right, having the alarm set won't cause any impact on the system.



I also noticed that no matter where I position the DIMMs, only 36GB is recognized when I issue show memory summary. Maybe that's a bug too?


So I found this today


Guess this explains why only X amount of memory is being shown on show memory summary