05-19-2010 08:32 PM
Hi,
I have noticed that our archive purge job is failing completely, for all devices, and it appears to have been failing for some time now.
Looking at the message appearing in the Job Details for one of the devices, I get:
*** Device Details for mydevice *** |
Protocol ==> Unknown / Not Applicable |
Unable to get results of job execution for device. Retry the job after increasing the job result wait time using the option:Resource Manager Essentials -> Admin -> Config Mgmt -> Archive Mgmt ->Fetch Settings |
The Maximum time to wait for job results per device is set to 120 seconds.
This may explain why the job takes several days to run if every device is timing out after 2 minutes.
My assumption, obviously incorrect, is that the purge would just go through the database and look for any archives older than a year, which is what we have set the purge options to be. This message implies that RME is trying to connect to each device.
Is it trying to connect to each device?
Next question would be where do I start looking so that I can fix this.
Regards
Jeff
05-20-2010 07:39 PM
Edit NMSROOT/MDC/tomcat/webapps/rme/WEB-INF/classes/JobManager.properties, and change ConfigJobManager.heapsize to 384m. Then delete and reschedule the purge job, and it should complete successfully.
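For anyone who prefers to script the change, here is a minimal Python sketch of the edit described above. It is not a Cisco tool: the `set_property` helper is hypothetical, and it assumes the file uses plain `key=value` lines like the ones quoted in this thread. Take a backup copy of the file before trying anything like this.

```python
from pathlib import Path


def set_property(path, key, value):
    """Set key=value in a simple key=value properties file.

    Lines other than the matching key are written back unchanged.
    Returns True if the key already existed, False if it was appended.
    """
    lines = Path(path).read_text().splitlines()
    updated, found = [], False
    for line in lines:
        name = line.split("=", 1)[0].strip()
        # Skip comment lines; replace only the exact key.
        if not line.lstrip().startswith("#") and name == key:
            updated.append(f"{key}={value}")
            found = True
        else:
            updated.append(line)
    if not found:
        updated.append(f"{key}={value}")
    Path(path).write_text("\n".join(updated) + "\n")
    return found


# Example call (the path is whatever NMSROOT resolves to on your server):
# set_property(r"<NMSROOT>\MDC\tomcat\webapps\rme\WEB-INF\classes\JobManager.properties",
#              "ConfigJobManager.heapsize", "384m")
```

The job still has to be deleted and rescheduled afterwards, as noted above.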
05-23-2010 09:21 PM
Hi Joe
Thanks for the response. I have made the change as suggested, then disabled and re-enabled the job, so I now have it running as 8597 instead of 1106.
JobManager.properties:
ConfigJobManager.jobFileStorage=/rme/jobs
ConfigJobManager.debugLevel=0
ConfigJobManager.enableCorba=true
ConfigJobManager.heapsize=384m
I set this running at 9:50 this morning, and with the time now at 4:14 in the afternoon, there still does not seem to be a lot happening. If I look at the job details, all devices are in the Pending list, with none under Successful or Failed.
If I look at the ..\CSCOpx\files\rme\jobs\ArchivePurge\8597\1\log file, I have:
[ Mon May 24 09:50:08 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutor,getJobExecutionImpl,160, Executor implementation class com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor
[ Mon May 24 09:50:08 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,setJobInfo,167,DcmaJobExecutor: Initializing 8597
[ Mon May 24 09:50:10 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutor,
[ Mon May 24 09:50:10 NZST 2010 ],INFO ,[8597],com.cisco.nm.rmeng.config.ccjs.executor.dmgtJobRunner,run,31, DMGT Job Listener running..
[ Mon May 24 09:50:10 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,583,Notification policy is ON, recipients :
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,603,Execution policy is : Parallel Execution
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,616,Getting managed devices list from DM
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.util.db.DatabaseConnectionPool,getConnection,59,Inside ICSDatabaseConnection, MAX_COUNT =20
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[Thread-4],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,run,55,ESS Message listener started
[ Mon May 24 09:50:25 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,626,Num Threads = 1, Task = Purge Archive, Num Devices = 2696
[ Mon May 24 09:50:25 NZST 2010 ],ERROR,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,sendMail,949,sendEmailMessage: Null recipient list
[ Mon May 24 09:50:25 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,run,212,Job initialization complete, starting execution
[ Mon May 24 09:50:25 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,handleMultiDeviceExecution,143,JobExecutorThread - MultiDeviceExec DcmaJobExecThread 0 : Running
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,purgeArchive,706,Purging Archive....
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,177,PURGE SETTINGS: ByVersion = false ByAge = true PurgeLabelledFiles = false
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,201,Purge files between start time and Fri May 29 09:50:26 NZST 2009
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,207,Num Versions to keep = 0
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,208,Purge files older than 12 Months
[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,handleMultiDeviceExecution,149,Completed executeJob(), updating Results
[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,getNumCyclesToPoll,1018,getNumCyclesToPoll Function Started.
[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,updateMultiDeviceExecResults,781,Awaiting Job results: req Id = 0 Poll time = 2022 min(s)
[ Mon May 24 12:39:42 NZST 2010 ],INFO ,[Tibrv Dispatcher],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Mon May 24 12:42:48 NZST 2010 ],INFO ,[Tibrv Dispatcher],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
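To see where the time goes in a log like the one above, the long silences between entries can be pulled out with a short script. This is only a sketch: the line layout (`[ timestamp ],LEVEL,[thread],...`) is inferred from the excerpt above, the function names are made up for illustration, and the time-zone token is dropped rather than interpreted.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt: [ timestamp ],LEVEL ,[thread],rest
LINE_RE = re.compile(
    r"^\[ (?P<ts>.+?) \],(?P<level>\w+)\s*,\[(?P<thread>[^\]]*)\],(?P<rest>.*)$"
)


def parse_line(line):
    """Return (datetime, level, thread, rest) or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    parts = m.group("ts").split()
    if len(parts) != 6:
        return None
    # Drop the time-zone token (e.g. NZST); strptime cannot parse it portably.
    stamp = " ".join(parts[:4] + parts[5:])
    when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return when, m.group("level"), m.group("thread"), m.group("rest")


def find_gaps(lines, threshold_seconds=60):
    """List (gap_seconds, start, end) for consecutive entries at least threshold apart."""
    entries = [e for e in map(parse_line, lines) if e]
    gaps = []
    for prev, cur in zip(entries, entries[1:]):
        delta = (cur[0] - prev[0]).total_seconds()
        if delta >= threshold_seconds:
            gaps.append((delta, prev[0], cur[0]))
    return gaps
```

Run against the excerpt, this would flag the silence between 09:50:26 and 10:33:50 and the one between 10:33:50 and 12:39:42, which are the two stretches worth asking about.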
Is this normal? Most of the other jobs at least give you an updated success or failure for the devices.
Regards
Jeff
05-23-2010 11:30 PM
If there are a lot of devices and/or archive revisions, the job can take a long time. However, this could still be memory-related. If the job times out again, try jacking up the memory to 1536m. Then reschedule it.
05-24-2010 08:29 PM
Job still running now. These used to take a couple of days, so I'll give it till tomorrow before trying to stop the job, and upgrade the value once again.
We have just reached the 2,700 device mark. Not sure if that is a lot of devices or not :-)
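As a rough sanity check on the "couple of days" figure, here is the arithmetic for the worst case in which every device waits out the full 120-second job result timeout, one device after another. That serial behaviour is an assumption (the job log reports Num Threads = 1, but nothing in the thread confirms that each device is actually waited on in turn), and the function name is made up for illustration.

```python
def worst_case_days(num_devices, wait_seconds_per_device):
    """Total wait in days if each device times out in turn (assumed serial)."""
    total_seconds = num_devices * wait_seconds_per_device
    return total_seconds / 86400  # seconds per day


# 2696 devices (from the job log) at 120 seconds each:
# 2696 * 120 = 323,520 seconds, a little under 3.75 days.
print(round(worst_case_days(2696, 120), 2))
```

Under that assumption, the run time is of the same order as the multi-day runs seen here.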
Regards
Jeff
05-25-2010 04:06 PM
Job finished, and failed as before. I have increased the parameter to 1536m and have disabled and re-enabled the job. All sitting ready and waiting to go again.
One thing I have noticed, and maybe my memory is playing tricks on me here, is that all the devices I have checked in the Config Version Tree have only 1 Primary/Startup configuration, for example, 1/May 25 2010 22:10:57.
They have Running configs starting at, for example, 53/Nov 22 2007 20:12:18 and going through to 1129/May 25 2010 22:11:24
The VLAN/Running configs starting at 321/Dec 10 2007 22:10:58 and going through 1129/May 25 2010 22:11:24.
I suppose I am wondering what has happened to all the Startup configs!
In theory we are supposed to be keeping the archives for 1 year, but it looks like we haven't been keeping any startup ones, and the others aren't being purged.
Regards
Jeff
05-26-2010 09:08 PM
Only one rev of the startup config is ever kept (i.e. the latest). Troubleshooting archive purge can be tricky. Last time I had to do it (which led me to the memory issue), I needed to provide some custom code to get to the root of the problem. Hopefully the increased JVM heap size will help.
05-26-2010 09:47 PM
Well, definitely different results this time.
In the job browser I now see the following, which is much quicker than the previous 2 or so days. :-) However, I think it is a little too quick.
6. | 8623.1 | ArchivePurge | Failed | Default Archive Purge Job | admin | May 26 2010 23:20:00 | May 26 2010 23:20:00 | Weekly |
Looking at the Job results I see the following in the Work Order (the only entry with any details):
Name: | Archive Mgmt Job Work Order |
Summary: |
General Info
----------------------------------------------------------------------------------------------
JobId: 8623.1
Owner: admin
Description: Default Archive Purge Job
Schedule Type: Weekly
Job Type: Purge Archive
Job Policies
----------------------------------------------------------------------------------------------
E-mail Notification:
Job Based Password: Disabled
Device Details
----------------------------------------------------------------------------------------------
So is the heap size now too much?
Regards
Jeff
05-26-2010 09:54 PM
A heap of 1536m is usually the highest you can safely do on a 32-bit system. You could try reducing to 1280m. The job log or jrm.log may show an issue with being unable to start the JVM. That would confirm a heap size overflow.
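One quick way to check whether a given heap size is the problem is to ask the JVM to start with it and do nothing else, using the standard `-Xmx` and `-version` options. The sketch below wraps that in Python. It assumes you point `java` at the same 32-bit JRE that CiscoWorks launches the job with (its location is not given in this thread), and the helper names are made up for illustration. The result is only indicative, since it depends on the memory available at the moment you run it.

```python
import subprocess


def heap_check_command(heap, java="java"):
    """Build the command that starts a JVM with the given max heap and exits."""
    return [java, f"-Xmx{heap}", "-version"]


def jvm_starts_with_heap(heap, java="java"):
    """Return True if the JVM starts cleanly with -Xmx<heap>.

    A JVM that cannot reserve the requested heap exits with a non-zero
    status and prints an error about the object heap.
    """
    result = subprocess.run(
        heap_check_command(heap, java), capture_output=True, text=True
    )
    return result.returncode == 0


# Example: compare the two values discussed here.
# for size in ("1280m", "1536m"):
#     print(size, jvm_starts_with_heap(size, java=r"<path to the CiscoWorks JRE>\bin\java.exe"))
```

If 1536m fails and 1280m succeeds here, that supports the heap size explanation without waiting for another scheduled run.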
05-27-2010 08:54 PM
Hi Joe,
I have changed the heap size back to 1280m and the job is still running.
At this stage I will have to leave it like this as today is the last day of my current contract. Not sure when (or if) I will be back, although there are plans afoot to upgrade CiscoWorks to LMS3.2 so it could be possible.
Until then, thanks for your patience and advice.
Regards
Jeff
05-30-2010 10:35 AM
Upgrading to 3.2 would be a good thing to do. LMS 2.6 is winding down, and it will be end of support around this time next calendar year.
I can't say for certain that your purge problem will be fixed in 3.2 (memory really may be the issue here), but it would certainly give you a lot of new features and other bug fixes.