Solved: Re: *** Conflicting device support info for DFM 3.2 ***

schm196 · ‎08-26-2010

Hellows... ;-)

The helpful folks at TAC have been trying to troubleshoot one of my last and biggest pending items: the perceived inability of DFM to manage the devices on our network. This was a rather puzzling issue, as the other LMS components (CS, CM, RME, etc.) had no apparent issues whatsoever doing everything I asked them to do. After countless hours trying to troubleshoot DFM discovery errors (devices end up "Questioned" with an SNMP timeout, despite the fact that all other LMS modules manage the same devices perfectly fine), an alert TAC engineer finally asked whether these devices were, in fact, supported by DFM 3.2. Lo and behold, a can of worms opened up!

The best current guess is rather confusing to me: there is a Cisco document out there suggesting that NONE of our devices are among those supported by DFM, while I did find another Cisco document that somewhat contradicts that notion. I'd like to think that this must be confusing (or at least very little known) to TAC as well, since nobody over there considered this a potential culprit for almost the first three weeks of troubleshooting around the globe during countless WebEx sessions. We basically went through everything imaginable: process monitoring with full debugging, complete removal and reinstallation of DFM alone, a complete clean-up and re-initialization of all module databases (which tore down most of my configurations and settings in the process), and a midnight conference call with developers in India.

The end result appears to be that DFM functionality will not be available to me – please confirm. What are the alternatives? Any rhyme or reason to Cisco not supporting these device types? Any plans to ever do so?

I run a variety of devices on my network, most of them being 3560G, 3560E and 6504E switches, pretty much bread-and-butter variety of basic Cisco devices. Why on earth would there even be a question that these are or are not supported by all LMS modules?

Argument AGAINST support in DFM 3.2:

http://www.cisco.com/en/US/docs/net_mgmt/ciscoworks_lan_management_solution/3.2/device_support/table/lms32sdt.html#3.2table

Argument IN FAVOR of support in DFM 3.2:

http://www.cisco.com/en/US/docs/net_mgmt/ciscoworks_device_fault_manager/3.2/device_support/table/dfm3_2os.html

According to that list, our 6504E with IOS is fully supported by DFM 3.2 with LMS 3.2, as are the 3560G and 2950 series switches and the 2500 series router. However, the 3560E series switches are not listed as supported.

Are we seeing ghosts here or have other people had device support issues with DFM?

Thanks,

Matthias

Joe Clarke · ‎09-02-2010

But what is the Questioned reason? DFM will first attempt to ping the IP address of the device as it is in DCR. If that fails, the device will be put into a questioned state. Only after the ping succeeds does DFM attempt SNMP. Since I'm not seeing any SNMP traffic, I'm thinking the ping may be failing.
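The order of checks described above can be sketched roughly as follows. This is only a model of the described behaviour, not DFM's actual code: the function name, the two probe callables, and the "ping failed" reason string are hypothetical, and "Known" is assumed to be the state a device reaches when both checks pass.

```python
from typing import Callable, Tuple


def classify_device(
    ip: str,
    ping: Callable[[str], bool],
    snmp_query: Callable[[str], bool],
) -> Tuple[str, str]:
    """Return (state, reason) using the ping-then-SNMP order described above.

    `ping` and `snmp_query` are hypothetical probes supplied by the caller;
    each returns True on success.
    """
    # Step 1: ping the address as it is stored in DCR.
    if not ping(ip):
        # SNMP is never attempted when the ping fails.
        return ("Questioned", "ping failed")
    # Step 2: only after a successful ping is SNMP tried.
    if not snmp_query(ip):
        return ("Questioned", "SNMP timeout")
    return ("Known", "")


if __name__ == "__main__":
    # Simulated probes: ping works, SNMP does not answer.
    print(classify_device("192.168.1.10", lambda ip: True, lambda ip: False))
```

If this model holds, a capture that shows no SNMP packets at all would point at the first branch rather than the second.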

schm196 · ‎09-02-2010

That is what is so puzzling... *all* devices return with the same reason - SNMP timeout. However, we tested the SNMP walk several times and it appears to be fine. Again, only DFM has this issue - everything else we use (CS, CM, RME, etc.) appears to be working just fine.

Joe Clarke · ‎09-02-2010

Expand your packet capture filter to capture all traffic to the IP address of one of the problem devices as it appears in DCR (i.e. filter all traffic to the management IP of the device). Rediscover the device in DFM, then post the capture file.

schm196 · ‎09-03-2010

Attached are the two captures, one with LMS and the other one with Wireshark, both taken during the same rediscovery attempt for all devices, both without filters. Note that all network devices in DCR have 192.168.1.x IP addresses, while 10.10.2.9 is the LMS server address and 10.16.5.25 is the address of my desktop during the remote connection, in case you'd like to filter out some garbage.
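For anyone post-processing the unfiltered captures, the addresses above are enough to separate LMS-to-device traffic from the rest. Here is a minimal sketch, assuming the packets have already been reduced to source/destination address pairs by some other tool; the function name is hypothetical, and treating the device range as 192.168.1.0/24 is an assumption based on the 192.168.1.x addresses mentioned.

```python
import ipaddress

# Addresses taken from the post above.
DEVICE_NET = ipaddress.ip_network("192.168.1.0/24")  # assumed /24 for 192.168.1.x
LMS_SERVER = ipaddress.ip_address("10.10.2.9")


def is_lms_device_traffic(src: str, dst: str) -> bool:
    """True if the packet travels between the LMS server and a managed device."""
    a = ipaddress.ip_address(src)
    b = ipaddress.ip_address(dst)
    return (a == LMS_SERVER and b in DEVICE_NET) or (
        b == LMS_SERVER and a in DEVICE_NET
    )


if __name__ == "__main__":
    packets = [
        ("10.10.2.9", "192.168.1.20"),  # LMS server to a device: keep
        ("10.16.5.25", "10.10.2.9"),    # remote desktop session: drop
        ("192.168.1.20", "10.10.2.9"),  # device reply to the server: keep
    ]
    for src, dst in packets:
        print(src, dst, is_lms_device_traffic(src, dst))
```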

Joe Clarke · ‎09-04-2010

All of these 192.168.1.X devices moved to a Questioned state?

schm196 · ‎09-05-2010

Yes, all of them return to the "Questioned" state, from the smallest wireless access point through switches and routers all the way to 6500-series devices. All network devices are in the 192.168.1.0/24 subnet, and there are only about 60 of them.

Joe Clarke · ‎09-05-2010

Okay. I have a feeling the DFM Servers may be in a bad state. If you haven't done so already, install the consolidated patch for CSCtb87449 from http://tools.cisco.com/support/downloads/go/ImageList.x?relVer=3.2.0&mdfid=282640771&sftType=CiscoWorks+Device+Fault+Manager+Patches&optPlat=Windows&nodecount=2&edesignator=null&modelName=CiscoWorks+Device+Fault+Manager+3.2&treeMdfId=268439477&treeNa... . Then REBOOT the server.

When the server comes back, try to rediscover your devices. If that fails, post the DFM.log and DFM1.log under NMSROOT/objects/smarts/local/logs.

schm196 · ‎09-07-2010

Hi Joseph -

Thanks for your assistance thus far. I applied the consolidated patch for CSCtb87449 successfully, restarted the LMS server, and performed another rediscovery. Unfortunately, all devices still go from Questioned to Learning and back again to Questioned state. As before, all devices still cite "SNMP timeout" as the reason.

Attached are the DFM logs.

Matthias

Joe Clarke · ‎09-07-2010

Follow the instructions at https://supportforums.cisco.com/docs/DOC-8796 to reinitialize the DFM databases (dfmEpm, dfmInv, dfmFh, and delete the two rps files). When LMS starts back up, add one device to DFM and verify it goes to a Known state. If it does, sync the rest of your devices from DCR.

schm196 · ‎09-08-2010

I clearly recall that TAC already went through the reinitialization of databases, first for DFM only, then for all LMS databases, and the subsequent attempts to add just a single device. No success.

Martin Ermel · ‎09-09-2010

You reported that you have successfully tested SNMP RO access to the devices in question. Can you run this test with the DFM built-in snmpwalk tool and enable "debug snmp packets" on the device? That should show whether the packets make their way to the device and whether the sm_snmpwalk program is working (hopefully this is not only a CLI program but also the code used internally):

Step 1 Go to NMSROOT/objects/smarts/bin

Step 2 Enter the following command for SNMP v1 and SNMP v2 devices:

For Solaris: ./sm_snmpwalk --community=<community> <deviceIp>

For example: ./sm_snmpwalk --community=cisco 4.1.1.1

For Windows: sm_snmpwalk --community=<community> <deviceIp>

For example: sm_snmpwalk --community=cisco 4.1.1.1

The above command will generate three files, xxxxx.walk, xxxxx.mimic, and xxxxx.snap (where xxxxx is the device IP), in the same location, that is, in NMSROOT/objects/smarts/bin.
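A quick way to confirm that the walk actually produced all three files is sketched below. It assumes the files are named exactly as described above (device IP plus the .walk, .mimic, and .snap extensions); the helper names are hypothetical, and the directory argument has to be replaced with the real NMSROOT/objects/smarts/bin path on your server.

```python
from pathlib import Path
from typing import List

# Extensions as listed in the post above.
WALK_EXTENSIONS = (".walk", ".mimic", ".snap")


def expected_walk_files(device_ip: str, bin_dir: str) -> List[Path]:
    """Build the three file paths sm_snmpwalk is described as writing."""
    # Append the extension to the IP string directly; Path.with_suffix would
    # treat the last octet of the IP as an existing suffix and replace it.
    return [Path(bin_dir) / f"{device_ip}{ext}" for ext in WALK_EXTENSIONS]


def missing_walk_files(device_ip: str, bin_dir: str) -> List[Path]:
    """Return the expected files that are not present on disk."""
    return [p for p in expected_walk_files(device_ip, bin_dir) if not p.is_file()]


if __name__ == "__main__":
    # "." is a stand-in; point this at NMSROOT/objects/smarts/bin.
    for path in missing_walk_files("4.1.1.1", "."):
        print("missing:", path)
```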

schm196 · ‎09-09-2010

Thanks for your reply. TAC performed these tests already; attached are the results for one of the devices.

Joe Clarke · ‎09-12-2010

The devices are responding to SNMP, but DFM doesn't appear to be querying them during the Discovery process. When you reinitialized the databases before, did you destroy the DFM rps files?

schm196 · ‎09-13-2010

I am not 100% certain but I watched TAC perform these steps and I believe I recall them also addressing the need to delete these rps files... so my answer would be yes.

Joe Clarke · ‎09-13-2010

There appears to be a problem with the DFM engines. This is not an issue with device support. If your devices were unsupported, you would not be seeing the symptoms you are seeing. You'll need to go back to TAC, as I'm betting EMC will need to get involved to look into the DFM server operation.