LMS 4.0 new install - memory leak and way too many cwjava.exe processes

chharris41
Level 1

I recently did a new install of LMS 4.0 in a VM running Win2008 R2 on ESXi 4.1 (used the R2 patch from CCO to do the install).  The server has dual quad-core processors and 16GB of RAM dedicated to the VM - well over all the minimum requirements.

I have the following outstanding issues and TAC has given no proposed solution, the case is going nowhere.  Any help or insight would be greatly appreciated:

1) Main issue - the VM is using 7-7.5GB of memory, and that's when it's relatively idle.  There are over 40 cwjava.exe*32 processes running!

2) CPU is erratic - it goes from low usage to 80-90%, where it will stay for a while, causing the GUI to hang frequently.

3) Topology Services has no devices except those that show in the Layer 2 view (RME@ devices don't even show anywhere).  Everything in the L2 view is RED despite being shown as reachable elsewhere.

There appears to be a serious memory leak; I've tried rebooting and stopping/starting the daemons multiple times.  The services also take forever to come up.  Tried upgrading to 4.0.1 as well - no better off than before...

Generally speaking, I consider this a downgrade from my production LMS 3.2, which has been running smoothly for quite some time.

Mr. Clarke, I would greatly appreciate your involvement on this.

Thanks all for your input.

Charles

11 Replies

Marvin Rhoads
Hall of Fame

My experience with CiscoWorks' processes and memory utilization is similar.  I believe that is by design.  It has a lot going on and uses a lot of memory and processes to accomplish that.

I wouldn't characterize it as a leak, as the memory increases to what's required as the processes spin up and then remains relatively constant (in my experience: LMS 4.0.1 on a physical server with a single dual-core Xeon CPU running Windows Server 2008 R2, managing about 300 nodes).  My box is sitting at 5.50 GB physical memory utilization (12 GB installed) and pretty much flatlines there.  A leak would show increasing memory utilization until all available memory was exhausted.  My box is running 43 cwjava.exe processes.

Re CPU, I see utilization spike during the ANI discovery process, but it doesn't by any means cause the GUI to freeze.

Your topology services issues need a deeper drill down to resolve.

Hope this helps.

Maybe memory leak isn't the best description, but considering that Cisco's recommended spec for total memory is 8GB, I'd say either the app is using memory very inefficiently or they need to bump up the spec.  Fortunately I knew from experience and allocated 16GB to the VM.

I counted my cwjava.exe processes and also have 43 - whether or not this is by design is up to Cisco to answer - seems like a bit much to me.

TAC spent some time looking things over yesterday; it turns out that the default install turns on monitoring/polling for pretty much everything, and the UPM database was over 2GB.  They turned off a bunch of default jobs to bring down the CPU spikes, but still no "solution" or defined problem.

I've asked Cisco this question many times - maybe somebody can explain to me why someone thinks I want to know about the current status of every virtual interface in a fully converged VoIP environment?  This is default behavior, and the only way to prevent it is to manually go in and change the monitored state for every one of the thousands of virtual interfaces on a VoIP router (not to mention all of the PRI interfaces that alert when the phone goes on/off hook!).  Don't even get me started on the bulk manage/unmanage scripts and the adventure they want me to go on with that "tool".

The LMS team should have to spend a day here setting up the interfaces that I actually do care about - I guarantee they would rethink things if they did.

Bottom line - don't expect anything to be easier in this version.  New front end - kinda nice.  Same tired back end - lame.  Use the "Legacy" menu and good luck finding things you knew in 3.x.

Does anyone know if RME device views are no longer available in 4.x?  RME@{server_name} >All Devices is the main topology map we use in our NOC and several other displays in 3.2.  If this is gone what is the replacement?

Yes, I am frustrated and overall not happy with 4.x thus far.

I share your frustration.

I've been around long enough to remember fondly the old CiscoWorks 'classic' product - before LMS, CiscoWorks2000, or CWSI. I still think that was the cleanest implementation.

Yes, the monitoring stuff is a bit over the top. It's really the result of the continued integration of DFM/HUM/Smarts products into one "seamless" product. As those of us who've dug into the product know, the "seamlessness" isn't quite that under the covers. When it all works well, it's very nice. When something (anything) breaks, it's darn difficult to pull out the right piece of code for fixing it.

The Fault Management default management scheme is extremely simplistic (to a fault).  All interfaces are managed by default, and all inter-switch link ports are managed by default.  The thought was that these were likely to be critical links.  Clearly that holds true for most such interfaces, but not all.  You can modify NMSROOT/objects/smarts/discovery/tpmgr-param.conf and specify a sysObjectID and IFDescription pattern to omit interfaces matching a certain pattern.  There are quite a few examples in there to get you started.
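For illustration only - the commented examples already in the file are the authoritative syntax, and both the sysObjectID and the regex below are made up - an exclusion entry pairs a device-type sysObjectID with an interface-description pattern, shaped roughly like:

```
# HYPOTHETICAL entry - copy the exact syntax from the commented examples
# shipped in NMSROOT/objects/smarts/discovery/tpmgr-param.conf.
# <device-type sysObjectID>  <keyword>         <regex of ifDescr values to omit>
.1.3.6.1.4.1.9.1.525         IFDescrPattern    .*(Voice Over IP|EFXS|Foreign Exchange).*
```

The idea is that any interface whose ifDescr matches the pattern on a device of that type is simply never brought under Fault Management, rather than having to be unmanaged one at a time afterwards.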

In LMS 4.0, the application boundaries have been mostly dissolved.  The All Devices group is still there; it's now simply "All Devices".  If you chose the default auto-allocation mode, then all devices are managed by what used to be RME.

One point on navigation: we find that most new LMS users like the mega-menu style navigation, whereas existing users became very comfortable finding things in the application-centric model.  I fall into the latter category, personally.  What I like to use rather than the Legacy menu is the new Site Map (the link is found at the top right of the screen).  This lists all of the tasks in LMS and can be searched using "Find in page."  The new unified search interface in LMS can also be used to search tasks.  It works pretty well, too.

Joe Clarke
Cisco Employee

How many vCPUs have you allocated to the LMS VM?  When it comes to vCPUs, I typically go well under the physical CPU recommendations.  Unless you really have NOTHING else running on this physical server, I would not go above four vCPUs for LMS.  Memory usage could be expected depending on the managed device count.  Many of the LMS JVMs can allocate up to 1 GB of RAM, and some (e.g. ANIServer) can allocate more.

When CPU is high, what processes are taking up the CPU?  Correlate the PIDs to the LMS processes using the pdshow command.  You may find that these spikes correspond to jobs running (e.g. Data Collection, config collection, etc.).  If multiple jobs are running at once, consider staggering them for better performance.

LMS 4.0 includes a lot of processes.  Seeing 40 cwjava.exe processes is not uncommon.  Most (but certainly not all) can be tied back to LMS daemons using the pdshow command to map the PIDs.  You can find my LMS daemon cheat sheets at https://supportforums.cisco.com/docs/DOC-8798 .  This will give you an idea of the number of LMS daemons that can spawn and how to tell if they are healthy.
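As a sketch of that PID-to-daemon correlation: the sample text below only approximates the shape of pdshow output (the field names and PIDs are made up), and on the Windows server itself you would run pdshow directly and use "find" in place of grep.

```shell
# HYPOTHETICAL sample shaped like pdshow output; on a real LMS server,
# generate the actual output with:  pdshow  (from NMSROOT\bin)
cat > pdshow_sample.txt <<'EOF'
Process= DFMServer
State  = Running normally
Pid    = 4212

Process= ANIServer
State  = Running normally
Pid    = 5120
EOF
# Which daemon owns PID 4212 (a placeholder PID taken from Task Manager)?
# Print the two lines before the matching Pid line, keep only the name.
grep -B 2 'Pid    = 4212' pdshow_sample.txt | grep 'Process='
```

With the daemon name in hand, the cheat sheets linked above tell you which log file to check for that process.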

On Windows 2008, accessing the server from the server's own web browser can be problematic.  This is due to IPv6 being enabled by default in Windows 2008, but LMS does not yet have IPv6 client support.  If you are trying to access LMS from the server, be sure to point to the IPv4 address and not the hostname.

LMS 4.0 greatly benefits from fast I/O.  Make sure your I/O channels are high-speed and congestion-free.  Local SAS drives and 10 Gb FC have been shown to give good results.

If you have LMS 4.0.1, you may be getting bit by CSCto06189.  Port channels on Cat5K, Cat2K, or Cat3K switches can crash Data Collection resulting in red devices and incomplete topology data.  A patch is available by contacting TAC.  The ani.log can be used to confirm the bug.

If daemons take forever to come up, you may have a failing process.  The output of the pdshow command will confirm if anything is not starting properly.

Joe,

The VM is configured with 4 vCPUs; the host machine has 2 x quad-core processors, so I figured half of that goes to LMS.  Should I decrease the number of vCPUs?  Currently there are no other VMs on this host machine; at most we'd put one other lighter-weight VM on it, so there won't be much contention for resources.  The sm_server.exe process is consistently the highest in CPU and memory usage.

I'm not using the GUI at all from the server itself, so that's definitely not an issue here.  Things have settled down some since TAC deleted a bunch of default jobs and polling - I guess we'll see what happens when I configure the polling/monitoring that I want.  I read your post about editing the tpmgr-param.conf file to exclude additional interface types I don't want.  Are there any other docs on what/where to get the IFDescrPattern info for specific interfaces (like an ISDN bearer channel interface, for example)?  I don't want or need to know when these interfaces go up or down, but they are managed by default if they happened to be UP when they were initially discovered.  I'd prefer they didn't even get discovered and wasted fewer resources in LMS.

Tried your tip using the site map, appreciate that tip and plan on making that my default navigation tool.

Thanks very much for your input - let me know about the above when you have a minute.

Charles

I think four vCPUs is probably fine.  The sm_server processes are the DfmServers.  How many devices are you managing in Fault Management?  Are the sm_server processes taking up constant high CPU, or are there spikes seen periodically?  By default, polling is done every four minutes for many objects.  You could see CPU spikes during these cycles.

IfDescrPatterns are based on ifDescr from the devices themselves.  You may be better off using ifType to match specific voice interface types.  You can use SNMP Walk of the ifTable to confirm descriptions and types.
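A sketch of that workflow: the device address, community string, and walk output below are all fabricated, but the output follows net-snmp's usual rendering and the ifType enum values are the standard IANA assignments.

```shell
# In practice, generate this file from the device itself, e.g.:
#   snmpwalk -v2c -c <community> <device> IF-MIB::ifType
# The sample below is MADE-UP output in net-snmp's usual format.
cat > iftype_sample.txt <<'EOF'
IF-MIB::ifType.1 = INTEGER: ethernetCsmacd(6)
IF-MIB::ifType.20 = INTEGER: voiceFXS(102)
IF-MIB::ifType.21 = INTEGER: voiceOverIp(104)
EOF
# Print the ifIndex of every voice-type interface.  IANA ifType values:
# voiceEM=100, voiceFXO=101, voiceFXS=102, voiceEncap=103, voiceOverIp=104
grep -E 'voice(EM|FXO|FXS|Encap|OverIp)\(' iftype_sample.txt |
  sed 's/^IF-MIB::ifType\.\([0-9]*\).*/\1/'
```

Walking IF-MIB::ifDescr the same way and matching up the ifIndex values gives you the corresponding descriptions to build patterns from.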

Same thing here.  We had LMS 3.2 and upgraded to LMS 4.0.  I did a clean install and didn't even migrate the old data from 3.2.  The CPU has been constantly at 100%, and memory was 4 GB, which I upgraded to 6 GB.  The memory is better now, but the CPU is still at 100% all the time.

The tasklist command can tell you what process(es) is/are taking the CPU.

There is also a little tool you can download called pv.exe (google it) to get the CPU info.

When you know the PID of the process, run pdshow -brief | find "<pid>" (substituting the actual PID for the <pid> placeholder).

Then check the logs of the process that has an issue.

I've used this batch file to log the CPU usage of DFM for a customer:

@echo off
rem Log DFM (sm_server.exe) CPU usage roughly every 50 seconds
:loop
echo %time% >> sm_CPU.log
rem pv prints PID, CPU% and process name; keep only the sm_server lines
pv -o"%%i\t%%c10000%%%%\t%%n" pv.exe sm_server.exe | find "sm_" >> sm_CPU.log
rem cmd has no native sleep; 50 pings to localhost waits about 49 seconds
ping localhost -n 50 > nul
goto loop

Cheers,

Michel

Michel's tips on troubleshooting high CPU for LMS are a good start.  From your output, it appears to be DFM that is taking all of the CPU.  This could be due to too many devices/interfaces being managed, too many incoming traps, or database corruption.  Please start a new thread for your particular case and include details about the server and the number of managed devices.
