Solved: LMS 4.x DFM alerts: operationally down vs. unresponsive

secureIT · ‎06-15-2013

Hi,

A few of the interfaces on our routers and switches went down, and the alarm was not displayed on the LMS 4.2.3 DFM activities pages.

My observation was that unresponsive alarms were generated and, as far as I know, until that interface becomes reachable again via ICMP and SNMP, no further alarms will be generated if the interface goes down.

Should it generate an operationally down alert before the unresponsive alarm?

In case A, I used to get an operationally down alarm and, 3 minutes later, an unresponsive alarm.

In case B, I get only the unresponsive alarm.

Can someone clarify my doubt?

regards

Rajesh

Michel Hegeraat · ‎07-28-2013

Good luck with that Rajesh,

I still would like to see the trace, just to see what goes on between LMS and the device.

You can edit out sensitive stuff like the community string, etc. with Wireshark.

Or send it to me privately.

Cheers,

Michel


Michel Hegeraat · ‎06-16-2013

Hi Rajesh,

DFM can generate an operationally down alert only if the device is SNMP reachable.

Otherwise "unresponsive" is all you get.

Therefore you'd better make sure both sides of each important link are "managed".

Cheers,

Michel

secureIT · ‎06-17-2013

Could you please give an example of each of the two alarms?

Michel Hegeraat · ‎06-17-2013

Not sure what you mean by an example here.

If you manage your device properly on a loopback interface, it can still be reachable even if several interfaces are down.

It can therefore report the "down" state of these interfaces back to DFM (giving operationally down).

If a device is unreachable because the Fa0 on which the management IP is situated is down, it can no longer report that state back to DFM because there is no IP connectivity to do so. (giving unresponsive)
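To make the two cases concrete, here is a minimal Python sketch of the rule described above. It is only a toy model, not DFM's actual logic; the function name and its parameters are made up for illustration.

```python
def expected_dfm_alarm(mgmt_ip_reachable: bool,
                       interface_managed: bool,
                       interface_up: bool) -> str:
    """Toy model of the rule described in this post (not DFM's real code)."""
    if not mgmt_ip_reachable:
        # No ICMP/SNMP path to the management IP, so the device cannot
        # report any interface state back to the server.
        return "unresponsive"
    if interface_managed and not interface_up:
        # The device still answers SNMP, so it can report the interface as down.
        return "operationally down"
    return "no alarm"


# Managed on a loopback, one monitored interface down:
print(expected_dfm_alarm(True, True, False))
# Management IP sits on the interface that went down:
print(expected_dfm_alarm(False, True, False))
```

The first call models the loopback case, the second the case where the management IP itself is cut off.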

Cheers,

Michel

secureIT · ‎06-17-2013

Hi Michel,

Let me explain the problem...

The router has a loopback IP, and the same IP is configured in Common Services. One interface, Gig 1/1 for example, was down for more than 5 minutes, but no alarm came up in DFM. According to "show ip route Lms-ip" executed on the router, the router does not learn the LMS IP via Gig 1/1. I got unresponsive alarms and not an operationally down one. From another router I got an operationally down alarm, and after 3 minutes an unresponsive alarm. As far as I know, if an interface is not pingable then an unresponsive alarm should come. Please correct me if I am wrong.

Regards

Rajesh

Michel Hegeraat · ‎06-17-2013

> According to my knowledge if interface is not pingable then unresponsive alarm should come

This is true.

But you only receive an operationally down alarm when the device is still reachable on its management IP address, and only for interfaces that are monitored.

So what you should find out is:

If Gi 1/1 is down, can DFM still query the loopback?

It needs to talk to the loopback to get the status of the other monitored interfaces.

cheers,

Michel

secureIT · ‎06-25-2013

Hi Michel,

I'm facing a crazy issue in LMS...

The loopback IP is reachable via ICMP, and I get a response from an snmpwalk for the TenGig2/1 interface via the loopback IP.

So SNMP gets through and ping is fine...

But I am still seeing the interface status as Unresponsive in DFM.

What could be the reason?

LMS is not learning the device through this interface; in other words, this interface is not a single point of failure.

Because of this I am not getting any interface flap alerts in DFM. The interface went down 3 times in 5 minutes, but still no flap alerts showed up.

The LMS version is 4.2.3.

Please reply.

Michel Hegeraat · ‎06-25-2013

So, from the LMS server you can launch a ping or an SNMP walk on the management interface and this works?

That is strange. Maybe DFM uses another address, or maybe the interface is not managed.

You should check in the DFM "detailed device view" whether the interface is in a managed state.

You can try to remove and re-add the device from LMS, to make DFM rediscover it.

You'd better enable link traps on the devices, to be sent to LMS, for the flapping. I don't recall the threshold; maybe 3 in 5 min is not enough.
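As a rough way to reason about "N downs in 5 minutes", here is a small Python sketch that counts link-down events in a sliding window. It is an illustration only: the real DFM flap threshold is not stated here, so `threshold` is a parameter you supply, and the function name and timestamps are hypothetical.

```python
from collections import deque


def flap_threshold_crossed(down_times, threshold, window_seconds=300):
    """Return True if at least `threshold` link-down events fall within
    any span of `window_seconds`. Timestamps are in seconds."""
    recent = deque()
    for t in sorted(down_times):
        recent.append(t)
        # Drop events that are older than the window relative to this one.
        while t - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


# Three link-downs within 5 minutes, checked against two assumed thresholds:
print(flap_threshold_crossed([0, 100, 250], threshold=3))
print(flap_threshold_crossed([0, 100, 250], threshold=4))
```

If the configured threshold turns out to be higher than the number of downs you actually saw in the window, no flap alert would be expected.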

Cheers,

Michel

secureIT · ‎06-25-2013

Dear Michel

Please find my comments inline.

> So, from the LMS server you can launch a ping or an SNMP walk on the management interface and this works?

Yes, I am able to ping and snmpwalk the management interface.

> That is strange. Maybe DFM uses another address, or maybe the interface is not managed

DFM uses the same IP as Common Services, which is the same IP configured on the router (here it is a VLAN IP and not a loopback, sorry).

> You should check in the DFM "detailed device view", if the interface is in a managed state

The interface is managed and shows as True.

> You can try to remove and re-add the device from LMS, to make DFM rediscover it.

I already tried deleting and re-adding it in Common Services.

> You better enable link trap on the devices to be sent to LMS for the flapping. I don't recall the threshold, maybe 3 in 5 min is not enough

Link traps are enabled on the router with "snmp-server enable traps snmp auth linkdown linkup coldstart warmstart", and the threshold has been lowered to 2 counts in 5 minutes.

regards

Rajesh

Michel Hegeraat · ‎06-26-2013

Can you do a trace of the communication between the LMS server and this host? That should provide some clues.

You can use tcpdump on the Linux or Solaris version and Wireshark on the Windows version of LMS.

Cheers,

Michel

secureIT · ‎06-27-2013

Hi,

When capturing with Wireshark, I do not see any ifOperStatus request going out from LMS automatically within 5 minutes.

I checked the same for another router and there I see the ifOperStatus requests properly.

Rgds

Rajesh

Michel Hegeraat · ‎06-28-2013

What traffic do you see for the router in question Rajesh?

Nothing at all? Just ICMP?

I've come across something similar. It was a problem in the dfm.rps file.

Depending on how much customization you have done, you can consider this: disable fault management, stop the system, move or rename the dfm.rps files, start the system, enable fault management.

This will create a new rps file that will hopefully make DFM poll correctly.

secureIT · ‎06-28-2013

I'm seeing ICMP as well as SNMP requests and responses, bulk SNMP requests, etc.

I searched and did not find any ifOperStatus SNMP requests on UDP port 161. I have seen several traps from the routers on port 162. When I compared this with a working router, I could see that several ifOperStatus OIDs are polled on the working router and not on the problematic one.
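For comparing the two captures, something like the following Python sketch can help. It assumes you have already exported the requested OIDs from each capture into a plain list of strings; the sample lists below are made up. The ifOperStatus column in IF-MIB is 1.3.6.1.2.1.2.2.1.8, and the helper name is hypothetical.

```python
IF_OPER_STATUS = "1.3.6.1.2.1.2.2.1.8"  # IF-MIB ifOperStatus column


def if_oper_status_requests(requested_oids):
    """Return the OIDs that are the ifOperStatus column or an instance of it."""
    hits = []
    for oid in requested_oids:
        normalized = oid.strip().lstrip(".")
        # Match the column itself (as used in GETNEXT/GETBULK) or column.ifIndex.
        if normalized == IF_OPER_STATUS or normalized.startswith(IF_OPER_STATUS + "."):
            hits.append(normalized)
    return hits


# Hypothetical OID lists exported from two captures:
working_router = ["1.3.6.1.2.1.1.3.0", "1.3.6.1.2.1.2.2.1.8.3"]
problem_router = ["1.3.6.1.2.1.1.3.0"]

print(if_oper_status_requests(working_router))
print(if_oper_status_requests(problem_router))
```

An empty result for a device over a full polling cycle matches what the trace shows here: the server talks SNMP to the device but never asks for interface status.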

You may be right about the dfm.rps file being corrupted...

To do this, do we have to follow the steps below?

- net stop crmdmgtd

- delete the dfm.rps file or rename it to dfm.rps.old

- then net start crmdmgtd

What is the location of the .rps files, and if we delete them and start the services, will they be created automatically?

I have been troubleshooting DFM and HUM for the last 4-5 years and still find some problems of this kind unresolvable, and I get stuck.

Several devices are affected by this, but some of them give the correct operationally down alert, so I can't figure out why only some of the devices go to the unresponsive state.

Michel Hegeraat · ‎06-28-2013

Well, traces don't lie... something is not working as it should.

Yes, if you rename the file in objects\smarts\local\repos\ then a new one will be created when you start the service again.
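If you prefer to script the rename, here is a minimal Python sketch. It assumes the services are already stopped, and it takes the repos directory as a parameter because the full install path is not given in this thread; the function name is made up. It renames rather than deletes, so the old files can be put back.

```python
from pathlib import Path


def set_aside_rps_files(repos_dir, suffix=".old"):
    """Rename every *.rps file under repos_dir to *.rps.old.

    Run this only while the services are stopped. Returns the new paths.
    """
    renamed = []
    # sorted() collects the matches first, so renaming does not disturb the walk.
    for rps in sorted(Path(repos_dir).rglob("*.rps")):
        target = rps.with_name(rps.name + suffix)
        if target.exists():
            raise FileExistsError(f"{target} already exists; move it away first")
        rps.rename(target)
        renamed.append(target)
    return renamed
```

Call it with the path to your own repos directory, then start the services again so that a fresh rps file gets created.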

I too have been re-initialising databases and rps files a lot for silly things that won't go away any other way than that.

Sometimes the database can be repaired, but never without help from TAC, since the database is not documented and one can only guess how it is used.

Cheers,

Michel

secureIT · ‎07-01-2013

Thanks Michel,

I will have this checked at the earliest and get back to you with the result.

regards

Rajesh P