Re: RME 4.0.6 - Archive Purge not working

Jeff Law · ‎05-19-2010

Hi,

I have noticed that our archive purge job is failing completely, for all devices, and it appears to have been failing for some time now.

Looking at the message appearing in the Job Details for one of the devices, I get:

*** Device Details for mydevice ***

Protocol ==> Unknown / Not Applicable

Unable to get results of job execution for device. Retry the job after increasing the job result wait time using the option:Resource Manager Essentials -> Admin -> Config Mgmt -> Archive Mgmt ->Fetch Settings

The Maximum time to wait for job results per device is set to 120 seconds.

This may explain why the job takes several days to run if every device is timing out after 2 minutes.

My assumption, obviously incorrect, is that the purge would just go through the database and look for any archives older than a year, which is what we have set the purge options to be. This message implies that RME is trying to connect to each device.

Is it trying to connect to each device?

Next question would be where do I start looking so that I can fix this.

Regards

Jeff

Joe Clarke · ‎05-20-2010

Edit NMSROOT/MDC/tomcat/webapps/rme/WEB-INF/classes/JobManager.properties, and change ConfigJobManager.heapsize to 384m. Then delete and reschedule the purge job, and it should complete successfully.
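For anyone who would rather script this change than hand-edit the file, here is a minimal sketch in Python. It assumes JobManager.properties is a plain key=value properties file, as the excerpt later in this thread suggests. The helper names (set_property, update_file) and the .bak backup are illustrative only and not part of RME.

```python
from pathlib import Path


def set_property(text: str, key: str, value: str) -> str:
    """Return properties text with `key` set to `value` (appended if absent)."""
    lines = text.splitlines()
    found = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Skip comments and anything that is not a key=value pair.
        if stripped.startswith("#") or "=" not in stripped:
            continue
        name = stripped.split("=", 1)[0].strip()
        if name == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


def update_file(path: Path, key: str, value: str) -> None:
    """Rewrite the file in place, keeping a .bak copy of the original first."""
    original = path.read_text()
    path.with_suffix(path.suffix + ".bak").write_text(original)
    path.write_text(set_property(original, key, value))


# Hypothetical usage; substitute your own install root for NMSROOT:
# update_file(Path("NMSROOT/MDC/tomcat/webapps/rme/WEB-INF/classes/"
#                  "JobManager.properties"),
#             "ConfigJobManager.heapsize", "384m")
```

The purge job still has to be deleted and rescheduled afterwards, as described above.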

Jeff Law · ‎05-23-2010

Hi Joe

Thanks for the response. I have made the change as suggested and disabled and then re-enabled the job, so I now have it running as 8597 instead of 1106.

JobManager.properties:

ConfigJobManager.jobFileStorage=/rme/jobs
ConfigJobManager.debugLevel=0
ConfigJobManager.enableCorba=true
ConfigJobManager.heapsize=384m

I set this running at 9:50 this morning, and with the time now at 4:14 in the afternoon, there does not seem to be a lot happening still. If I look at the job details, all devices are in the Pending list - no Successful, or failed.

If I look at the ..\CSCOpx\files\rme\jobs\ArchivePurge\8597\1\log file I have:

[ Mon May 24 09:50:08 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutor,getJobExecutionImpl,160, Executor implementation class com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor
[ Mon May 24 09:50:08 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,setJobInfo,167,DcmaJobExecutor: Initializing 8597
[ Mon May 24 09:50:10 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutor,,119,Job listener for daemon manager messages started
[ Mon May 24 09:50:10 NZST 2010 ],INFO ,[8597],com.cisco.nm.rmeng.config.ccjs.executor.dmgtJobRunner,run,31, DMGT Job Listener running..
[ Mon May 24 09:50:10 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,583,Notification policy is ON, recipients :
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,603,Execution policy is : Parallel Execution
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,616,Getting managed devices list from DM
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.util.db.DatabaseConnectionPool,getConnection,59,Inside ICSDatabaseConnection, MAX_COUNT =20
[ Mon May 24 09:50:11 NZST 2010 ],INFO ,[Thread-4],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,run,55,ESS Message listener started
[ Mon May 24 09:50:25 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,initJobPolicies,626,Num Threads = 1, Task = Purge Archive, Num Devices = 2696
[ Mon May 24 09:50:25 NZST 2010 ],ERROR,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,sendMail,949,sendEmailMessage: Null recipient list
[ Mon May 24 09:50:25 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecutor,run,212,Job initialization complete, starting execution
[ Mon May 24 09:50:25 NZST 2010 ],INFO ,[main],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,,113,Constructing ExecutorThread DcmaJobExecThread 0
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,handleMultiDeviceExecution,143,JobExecutorThread - MultiDeviceExec DcmaJobExecThread 0 : Running
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,purgeArchive,706,Purging Archive....
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,177,PURGE SETTINGS: ByVersion = false ByAge = true PurgeLabelledFiles = false
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,201,Purge files between start time and Fri May 29 09:50:26 NZST 2009
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,207,Num Versions to keep = 0
[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,208,Purge files older than 12 Months
[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,handleMultiDeviceExecution,149,Completed executeJob(), updating Results
[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,getNumCyclesToPoll,1018,getNumCyclesToPoll Function Started.
[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,updateMultiDeviceExecResults,781,Awaiting Job results: req Id = 0 Poll time = 2022 min(s)
[ Mon May 24 12:39:42 NZST 2010 ],INFO ,[Tibrv Dispatcher],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :
[ Mon May 24 12:42:48 NZST 2010 ],INFO ,[Tibrv Dispatcher],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :

Is this normal? Most of the other jobs at least give you an updated success or failure for the devices.
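One way to see where a job log like the one above goes quiet is to parse the timestamps and list the largest gaps between consecutive entries. The sketch below assumes every line follows the layout seen in this excerpt (bracketed timestamp, level, thread, class, method, line number, message) and that day and month names are in English. The time zone token is ignored, and the function names are illustrative.

```python
import re
from datetime import datetime, timedelta
from typing import List, NamedTuple, Optional, Tuple

LINE_RE = re.compile(
    r"^\[ (?P<stamp>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<tz>\S+) (?P<year>\d{4}) \],"
    r"(?P<level>\w+)\s*,\[(?P<thread>[^\]]*)\],"
    r"(?P<cls>[^,]*),(?P<method>[^,]*),(?P<line>\d+),(?P<message>.*)$"
)


class Entry(NamedTuple):
    when: datetime
    level: str
    thread: str
    message: str


def parse_line(line: str) -> Optional[Entry]:
    """Parse one job log line; return None if it does not match the layout."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(
        f"{match['stamp']} {match['year']}", "%a %b %d %H:%M:%S %Y"
    )
    return Entry(when, match["level"], match["thread"], match["message"].strip())


def largest_gaps(lines: List[str], top: int = 3) -> List[Tuple[timedelta, Entry, Entry]]:
    """Return the `top` longest silences between consecutive parsed entries."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    gaps = [(b.when - a.when, a, b) for a, b in zip(entries, entries[1:])]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Three lines copied from the log above.
SAMPLE = [
    "[ Mon May 24 09:50:26 NZST 2010 ],INFO ,[Thread-8],com.cisco.nm.rmeng.dcma.client.ConfigArchivePurger,purgeConfigs,208,Purge files older than 12 Months",
    "[ Mon May 24 10:33:50 NZST 2010 ],INFO ,[Thread-7],com.cisco.nm.rmeng.dcma.jobdriver.DcmaJobExecThread,handleMultiDeviceExecution,149,Completed executeJob(), updating Results",
    "[ Mon May 24 12:39:42 NZST 2010 ],INFO ,[Tibrv Dispatcher],com.cisco.nm.rmeng.config.ccjs.executor.CfgJobExecutionESSListener,onMessage,66,Listener waiting for message :",
]

for gap, before, after in largest_gaps(SAMPLE):
    print(gap, "| quiet until:", after.thread, "-", after.message)
```

On these lines, the two silences fall between "Purge files older than 12 Months" and "Completed executeJob()", and then before the first ESS listener message, which is where the job appears to sit waiting for results.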

Regards

Jeff

Joe Clarke · ‎05-23-2010

If there are a lot of devices and/or archive revisions, the job can take a long time. However, this could still be memory-related. If the job times out again, try jacking up the memory to 1536m. Then reschedule it.

Jeff Law · ‎05-24-2010

Job still running now. These used to take a couple of days, so I'll give it till tomorrow before trying to stop the job, and upgrade the value once again.

We have just reached the 2,700 device mark. Not sure if that is a lot of devices or not :-)
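As a rough sanity check on the "several days" run time, the worst case can be worked out from figures already in this thread: about 2,700 devices, a 120 second result wait per device, and "Num Threads = 1" in the job log. The sketch assumes every device waits out the full timeout one after another. Whether the purge job really behaves that way is exactly the open question here, so treat it as an upper-bound estimate only.

```python
from datetime import timedelta


def worst_case_duration(devices: int, wait_seconds: int, threads: int = 1) -> timedelta:
    """Upper bound if every device waits the full timeout, `threads` at a time."""
    batches = -(-devices // threads)  # ceiling division
    return timedelta(seconds=batches * wait_seconds)


print(worst_case_duration(2700, 120))  # 3 days, 18:00:00
```

That is 90 hours, which is at least in the same range as jobs that "used to take a couple of days".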

Regards

Jeff

Jeff Law · ‎05-25-2010

The job finished and failed as before. I have increased the parameter to 1536m and have disabled and re-enabled the job. It is all ready and waiting to go again.

One thing I have noticed, and maybe my memory is playing tricks on me here, is that all devices I have checked in the Config Version Tree have only 1 Primary/Startup configuration, for example, 1/May 25 2010 22:10:57.

They have Running configs starting at, for example, 53/Nov 22 2007 20:12:18 and going through to 1129/May 25 2010 22:11:24

The VLAN/Running configs start at 321/Dec 10 2007 22:10:58 and go through to 1129/May 25 2010 22:11:24.

I suppose I am wondering what has happened to all the Startup configs!

In theory we are supposed to be keeping the archives for 1 year, but it looks like we haven't been keeping any startup ones, and the others aren't being purged.
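The cutoff printed in the earlier job log ("Purge files between start time and Fri May 29 09:50:26 NZST 2009", for a job that started Mon May 24 09:50:26 NZST 2010) is 360 days before the start time. That is consistent with "12 Months" being counted as twelve 30-day months. This is an inference from those two timestamps, not documented RME behaviour. The function name and the days_per_month parameter below are illustrative.

```python
from datetime import datetime, timedelta


def purge_cutoff(job_start: datetime, months: int, days_per_month: int = 30) -> datetime:
    """Cutoff date, ASSUMING a month is counted as a fixed number of days."""
    return job_start - timedelta(days=months * days_per_month)


start = datetime(2010, 5, 24, 9, 50, 26)   # job start time from the log
cutoff = purge_cutoff(start, 12)
print(cutoff.strftime("%a %b %d %H:%M:%S %Y"))  # Fri May 29 09:50:26 2009

# A Running config version mentioned above, dated Nov 22 2007 20:12:18:
oldest_running = datetime(2007, 11, 22, 20, 12, 18)
print(oldest_running < cutoff)  # True: older than the cutoff
```

By that cutoff, the 2007 Running and VLAN versions should have been eligible for purging, which fits the observation that the purge has not been completing.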

Regards

Jeff

Joe Clarke · ‎05-26-2010

Only one rev of the startup config is ever kept (i.e. the latest). Troubleshooting archive purge can be tricky. Last time I had to do it (which led me to the memory issue), I needed to provide some custom code to get to the root of the problem. Hopefully the increased JVM heap size will help.

Jeff Law · ‎05-26-2010

Well, definitely different results this time.

In the job browser I now see the following, which is much quicker than the previous couple of days. :-) However, I think it is a little too quick.

6. | 8623.1 | ArchivePurge | Failed | Default Archive Purge Job | admin | May 26 2010 23:20:00 | Weekly

Looking at the Job results I see the following in the Work Order (the only entry with any details):

Name: Archive Mgmt Job Work Order

Summary:

General Info
----------------------------------------------------------------------------------------------
JobId: 8623.1
Owner: admin
Description: Default Archive Purge Job
Schedule Type: Weekly
Job Type: Purge Archive

Job Policies
----------------------------------------------------------------------------------------------

E-mail Notification:
Job Based Password: Disabled

Device Details
----------------------------------------------------------------------------------------------

None

So is the heap size now too much?

Regards

Jeff

Joe Clarke · ‎05-26-2010

A heap of 1536m is usually the highest you can safely do on a 32-bit system. You could try reducing to 1280m. The job log or jrm.log may show an issue with being unable to start the JVM. That would confirm a heap size overflow.
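A quick way to check the logs for this is to search them for the messages a HotSpot JVM typically prints when it cannot allocate the requested heap. This is only a sketch. The strings below are standard JVM startup errors, but whether they actually end up in the job log or jrm.log on a given server is an assumption, and the function names are illustrative.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Common HotSpot startup failure messages (matched case-insensitively).
JVM_START_ERRORS = (
    "Could not reserve enough space for object heap",
    "Could not create the Java virtual machine",
    "Error occurred during initialization of VM",
    "Invalid maximum heap size",
)


def find_jvm_start_errors(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines that look like a JVM start failure."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(pattern.lower() in lowered for pattern in JVM_START_ERRORS):
            hits.append((number, line.rstrip()))
    return hits


def scan_file(path: Path) -> List[Tuple[int, str]]:
    """Scan one log file, tolerating odd bytes in the log."""
    with path.open(errors="replace") as handle:
        return find_jvm_start_errors(handle)


# Hypothetical usage; point it at your own job log or jrm.log:
# for number, text in scan_file(Path("jrm.log")):
#     print(number, text)
```

No hits would not rule out a heap problem. It would only mean these particular messages were not logged.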

Jeff Law · ‎05-27-2010

Hi Joe,

I have changed the heap size back to 1280m and the job is still running.

At this stage I will have to leave it like this as today is the last day of my current contract. Not sure when (or if) I will be back, although there are plans afoot to upgrade CiscoWorks to LMS3.2 so it could be possible.

Until then, thanks for your patience and advice.

Regards

Jeff

Joe Clarke · ‎05-30-2010

Upgrading to 3.2 would be a good thing to do. LMS 2.6 is winding down, and it will reach end of support around this time next calendar year.

I can't say for certain that your purge problem will be fixed in 3.2 (memory really may be the issue here), but it would certainly give you a lot of new features and other bug fixes.