But what is the Questioned reason? DFM will first attempt to ping the IP address of the device as it is in DCR. If that fails, the device will be put into a questioned state. Only after the ping succeeds does DFM attempt SNMP. Since I'm not seeing any SNMP traffic, I'm thinking the ping may be failing.
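To make that order of operations concrete, here is a rough sketch of the logic as described above. It is only a simplified model for reasoning about a packet capture, not DFM's actual code; the function name, the yes/no arguments, and the state strings are made up for illustration.

```shell
#!/bin/sh
# Simplified model of the discovery order described above; NOT DFM's actual code.
# Arguments are hypothetical "yes"/"no" flags for whether each probe got a reply.
discovery_state() {
    ping_ok="$1"
    snmp_ok="$2"
    if [ "$ping_ok" != "yes" ]; then
        # Ping failed: SNMP is never attempted, so no SNMP packets show up in a capture.
        echo "Questioned (ping failed)"
    elif [ "$snmp_ok" != "yes" ]; then
        # Ping worked, SNMP was attempted but got no usable answer.
        echo "Questioned (SNMP timeout)"
    else
        echo "Known"
    fi
}

discovery_state no no
discovery_state yes no
discovery_state yes yes
```

In this model, a capture with no SNMP packets at all points at the first branch, which is why the ping is the first thing to check.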
That is what is so puzzling... *all* devices return with the same reason - SNMP timeout. However, we tested the SNMP walk several times and it appears to be fine. Again, only DFM has this issue - everything else we use (CS, CM, RME, etc.) appears to be working just fine.
Expand your packet capture filter to capture all traffic to the IP address of one of the problem devices as it appears in DCR (i.e. filter all traffic to the management IP of the device). Rediscover the device in DFM, then post the capture file.
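If it helps, a capture limited to one device could be set up roughly like this. This is a sketch that assumes tshark (the Wireshark command-line tool) is available on the LMS server; the device IP 192.168.1.10, the interface number, and the output file name are placeholders, not values from this thread. Note that the `host` filter matches traffic both to and from the address, so replies are captured as well.

```shell
#!/bin/sh
# DEVICE_IP is a placeholder; use the management IP exactly as it appears in DCR.
DEVICE_IP="${DEVICE_IP:-192.168.1.10}"

capture_filter() {
    # "host <ip>" is standard libpcap filter syntax: it matches every protocol
    # (ICMP, SNMP, ...) sent to or from that address.
    printf 'host %s' "$1"
}

# Print the tshark command instead of running it, so it can be reviewed first.
# -i: capture interface (placeholder), -f: capture filter, -w: output file (placeholder)
echo "tshark -i 1 -f \"$(capture_filter "$DEVICE_IP")\" -w dfm_rediscover.pcap"
```

The same `host ...` expression can be pasted into the capture filter field of the Wireshark GUI.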
Attached are the two captures, one with LMS and the other one with Wireshark, both taken during the same rediscovery attempt for all devices, both without filters. Note that all network devices in DCR have 192.168.1.x IP addresses, while 10.10.2.9 is the LMS server address and 10.16.5.25 is the address of my desktop during the remote connection, in case you'd like to filter out some garbage.
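For anyone opening the unfiltered captures, a display filter along these lines should keep the device subnet and drop the remote desktop session. This is only a sketch: it uses the addresses given above, assumes the network devices sit in 192.168.1.0/24, and the capture file name is a placeholder.

```shell
#!/bin/sh
# Addresses taken from the post above.
DEVICE_NET="192.168.1.0/24"   # network devices in DCR
DESKTOP_IP="10.16.5.25"       # remote desktop session, not relevant to DFM

display_filter() {
    # Keep packets involving the device subnet, drop anything involving the desktop.
    printf 'ip.addr == %s && !(ip.addr == %s)' "$1" "$2"
}

# capture.pcap is a placeholder file name; -r reads a file, -Y applies a display filter.
echo "tshark -r capture.pcap -Y '$(display_filter "$DEVICE_NET" "$DESKTOP_IP")'"
```

The filter expression on its own also works in the Wireshark display filter bar.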
Yes, all of them return to the "questioned" state - from the smallest wireless access point, through switches and routers, all the way to 6500-series devices. All network devices are in the 192.168.1.0/24 subnet, and there are only about 60 of them.
Okay. I have a feeling the DFM Servers may be in a bad state. If you haven't done so already, install the consolidated patch for CSCtb87449 from http://tools.cisco.com/support/downloads/go/ImageList.x?relVer=3.2.0&mdfid=282640771&sftType=CiscoWorks+Device+Fault+Manager+Patches&optPlat=Windows&nodecount=2&edesignator=null&modelName=CiscoWorks+Device+Fault+Manager+3.2&treeMdfId=268439477&treeNa... . Then REBOOT the server.
When the server comes back, try to rediscover your devices. If that fails, post the DFM.log and DFM1.log under NMSROOT/objects/smarts/local/logs.
Hi Joseph -
Thanks for your assistance thus far. I applied the consolidated patch for CSCtb87449 successfully, restarted the LMS server, and performed another rediscovery. Unfortunately, all devices still go from Questioned to Learning and back again to Questioned state. As before, all devices still cite "SNMP timeout" as the reason.
Attached are the DFM logs.
Follow the instructions at https://supportforums.cisco.com/docs/DOC-8796 to reinitialize the DFM databases (dfmEpm, dfmInv, dfmFh, and delete the two rps files). When LMS starts back up, add one device to DFM and verify it goes to a Known state. If it does, sync the rest of your devices from DCR.
I clearly recall that TAC already went through the reinitialization of databases, first for DFM only, then for all LMS databases, and the subsequent attempts to add just a single device. No success.
You reported that you have successfully tested SNMP RO access to the devices in question. Can you repeat this test with the DFM built-in snmpwalk tool and enable "debug snmp packets" on the device? That should show whether the packets make their way to the device and whether the sm_snmpwalk program is working (hopefully it is not only a CLI program but also the code used internally):
Step 1 Go to NMSROOT/objects/smarts/bin
Step 2 Enter the following command for:
Snmp v1 and snmp v2 devices:
For Solaris: ./sm_snmpwalk --community=<community_string> deviceIp
For example: ./sm_snmpwalk --community=cisco 18.104.22.168
For Windows: sm_snmpwalk --community=<community_string> deviceIp
For example: sm_snmpwalk --community=cisco 22.214.171.124
The above command will generate three files,
[where xxxxx is the device IP] in the same location, that is in NMSROOT/objects/smarts/bin.
The devices are responding to SNMP, but DFM doesn't appear to be querying them during the Discovery process. When you reinitialized the databases before, did you destroy the DFM rps files?
I am not 100% certain but I watched TAC perform these steps and I believe I recall them also addressing the need to delete these rps files... so my answer would be yes.
There appears to be a problem with the DFM engines. This is not an issue with device support. If your devices were unsupported, you would not be seeing the symptoms you are seeing. You'll need to go back to TAC, as I'm betting EMC will need to get involved to look into the DFM server operation.