Re: DFM problem in LMS 3.1

p.imre · 02-12-2009

Hello,

I have a DFM issue in LMS 3.1. I upgraded from LMS 3.0 2007 Update to LMS 3.1, and after a while DFM started to misbehave. It "stops" sending alerts and notifications, and it is unable to load the Alerts and Activities page. All the DFM-related services seem to be running. I have already reinitialized the DFM database and installed the DFM 3.1.1 patch. Each of these fixed the issue, but after 1-2 days it came back.

The only thing that seems strange to me is that the memory usage is 5.8GB, whereas in LMS 3.0 it was only about 4GB. The first time the problem occurred, there were "java.lang.OutOfMemoryError: Java heap space" messages in adapterserve.log, and the AdapterServe and AdapterServe1 processes were stopped.

Any ideas what could be causing the problem?

Thanks,

Imre

Joe Clarke · 02-12-2009

How many devices are you managing in DFM? How many alerts do you typically have in AAD?

p.imre · 02-13-2009

There are ~250 known devices in DFM, and ~1500 records per day in Fault History. There are a lot of BackupActivated alerts, because there are Voice/Data E1 lines, and in LMS 3.0 I had a filter for this type of message. But it now seems that this filter (disabling BackupActivated messages) somehow disappeared during the upgrade. Is it possible that the huge number of alerts is causing my problem? As far as I can remember, in LMS 3.0 I used DFM without filtering for a while, and there was no such problem. I will try to recreate the filter (if I can work out what my solution was half a year ago...)

Joe Clarke · 02-13-2009

It's certainly possible. The AdapterServer is responsible for shuttling events from the backend DfmServer to the EPM database. If there are a huge number of events, it could exhaust memory (note: alerts can contain multiple events).

Certain events can be disabled within DFM > Configuration > Polling and Thresholds > Managing Thresholds. You can also unmanage certain interfaces under DFM > Device Management > Device Details. There are even steps documented here for unmanaging interfaces in bulk.
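To decide which events are worth disabling, it can help to see which event types dominate. Here is a minimal Python sketch, assuming you can get one event name per Fault History record into a list by some means (the function name and the idea of an export are hypothetical; DFM's actual export format is not covered in this thread):

```python
from collections import Counter

def top_event_types(event_names, limit=5):
    """Return the most frequent event names with their counts, most common first."""
    # Trim whitespace and skip blank entries before counting.
    cleaned = (name.strip() for name in event_names if name.strip())
    return Counter(cleaned).most_common(limit)
```

Calling `top_event_types(names)` on a day's worth of records should show quickly whether one or two event types account for most of the volume.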

p.imre · 02-14-2009

Thanks jclarke, that helped a lot.

I will check which interfaces and events are unimportant, and will unmanage/disable them. It is a planned task for me, but because there were no such problems with LMS 3.0, it was not so urgent. Just one more question: are there any differences between 3.0 and 3.1 in the way they handle DFM alerts/events? Because I am sure that there was the same amount of alerts in 3.0 without problems...

Thanks again for the very fast response,

Regards,

Imre

Joe Clarke · 02-14-2009

The event handling piece is shared code between Cisco and EMC. We don't have complete visibility into all the backend pieces of DFM. Therefore, I cannot say for certain what the engine changes were between DFM 3.0 and 3.1.

That said, since the OutOfMemoryError was only seen once, memory may not be the root cause. Without debugging logs, it's hard to know exactly why you're seeing daemon crashes.

p.imre · 02-17-2009

Ok, I see. Well, I would like to clarify what exactly happens. I know that under CS I can set the log level to debugging. Is this what you mean by "debugging logs"? Which log files can help me find out what happens? I know the function of a few log files, but not all of them. Thanks,

Imre

Joe Clarke · 02-17-2009

The debugging is enabled under DFM > Configuration > Other Configurations > Logging. You need to enable Event Promulgation Module and Event Processing Adapters debugging. The logs are under NMSROOT/dfmLogs/EPM and epa.
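As a rough illustration of going through those logs once they are collected, here is a minimal Python sketch that walks a log directory and reports every line containing the heap error quoted earlier in the thread. The function name, the directory argument, and the choice to scan every file are assumptions for illustration; point it at wherever NMSROOT/dfmLogs resolves on your server.

```python
import os

# Error text quoted earlier in this thread; change it to search for other messages.
OOM_MARKER = "java.lang.OutOfMemoryError"

def find_marker_lines(log_dir, marker=OOM_MARKER):
    """Return (path, line_number, text) for every line under log_dir containing marker."""
    hits = []
    for root, _dirs, files in os.walk(log_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            # errors="replace" keeps the scan going if a log has undecodable bytes.
            with open(path, "r", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if marker in line:
                        hits.append((path, number, line.rstrip("\n")))
    return hits
```

Reading line by line keeps memory use low even when the debug logs grow large.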

p.imre · 02-19-2009

I enabled debugging for EPM and EPA today. Two days ago I disabled the BackupActivated and HighDiscard rate alerts, so only a few alerts remained.

There were a few huge log files, so I ran Logrot. But this morning DFM "died" again. The last alert is from about 3am, and if I click on any event ID or try to run Fault History, it doesn't work.

I will reload LMS and try to find out from the EPA and EPM logs what happened.

Joe Clarke · 02-19-2009

It would have been more useful to troubleshoot the server while the problem was occurring. You might also try opening a TAC service request the next time these daemons die so that some live analysis can be done.

p.imre · 02-23-2009

It crashed again on the 21st of Feb. Debugging for EPM and EPA was running, so I got the log files. I think I will open a TAC case. Thanks for your help.

Regards,

Imre