3626 Views, 0 Helpful, 2 Replies

ICM error messages

cisco steps
Level 1

Hi,

I am not an expert and don't really know how ICM works, but I was asked to look up some of the messages that come into our monitoring tool. I am asking for your expertise to help me gain the knowledge and understand the output of these error messages. Thank you.

Connected To Client on.*using port
Device .* path changing to idle state
Device .* path changing to active state
Distributor node process .* successfully started
Logger.*node process .* successfully started
Peripheral is off-line and not visible to the Peripheral Gateway. Routing to this site is impacted
Peripheral .* is on-line
PG has reported that peripheral .* is not operational
PG has reported that peripheral .* is operational
Side .* process is OK
Process .* on .* went down for unknown reason. Exit code .* It will be automatically restarted
HeartBeat Event for Side
node restarting process .* after having delayed restart for 10 seconds
node process .* successfully reinitialized after restart
Logger or HDS on connection .* using .* is either out of service or communication has broken
Peripheral Gateway .* is not connected to the Central Controller or is out of service. Routing to this site is impacted
Process {proc} at the Central Site side X is down

Thank you

Accepted Solution

geoff
Level 10

OK, I'll have a crack at this. Please note what I say at the end of this post about the necessity of receiving proper training.

The essential architecture of ICM is a routing engine (Call Router) and a database (Logger) connected to a set of "peripherals" through systems called Peripheral Gateways (PGs), thus making a contact center.

The peripherals are channel controllers. For the traditional inbound voice channel, the peripherals are twofold:

1. phone switches: the VoIP switch Cisco UCM, and a supported set of TDM ACDs

2. Voice Response Units: the VoIP VRUs Cisco IPIVR and Cisco CVP, and a certified set of non-Cisco TDM IVRs that connect to ICM via a specific protocol.

For other channels, the peripherals are:

1. Cisco EIM (Email)

2. Cisco WIM (Web collab)

3. two Cisco Outbound peripherals: the Dialer and the SIP Dialer

4. a certified set of non-Cisco peripherals that connect to ICM via a specific protocol.

These systems constitute a route request and response paradigm.

The peripheral makes a route request via its PG to the Call Router, which answers asynchronously with a target (called a "label") back to the PG, which passes it on to the peripheral in a form it can understand for delivery.

The peripheral is responsible for queuing the contact while waiting for the label, because the Call Router is not a queuing platform. The peripheral may queue the contact itself, as a TDM ACD does; it may employ another peripheral because it cannot queue on its own (as CUCM uses IPIVR); or it may use a voice gateway (as with CVP).
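
To make that request/response paradigm a little more concrete, here is a deliberately over-simplified Python sketch. None of these names correspond to real ICM processes or APIs; the two functions just stand in for the roles of the Call Router and a peripheral, and show the label coming back asynchronously while the contact waits in queue.

import queue
import threading

def call_router(requests, responses):
    # Stand-in for the Call Router: answer each route request with a "label".
    while True:
        call_id, dialed_number = requests.get()
        responses.put((call_id, "label-for-" + dialed_number))

def peripheral(requests, responses):
    # Stand-in for a peripheral: queue the contact, send a route request via
    # the PG, and deliver the call once the label arrives asynchronously.
    requests.put(("call-1", "18005551234"))
    print("call-1 is queued on the peripheral, waiting for a label")
    call_id, label = responses.get()
    print(call_id, "delivered using", label)

requests, responses = queue.Queue(), queue.Queue()
threading.Thread(target=call_router, args=(requests, responses), daemon=True).start()
peripheral(requests, responses)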

Users of the system (agents) talk either directly to processes on the system, or by proxy through their channel controller. Clients send requests and receive responses, and also receive asynchronous events indicating state changes. The client must process these events to keep in sync with the agent state map held in the memory of the Call Router, which the Router needs to do its job correctly.

The general term for a component of ICM is "node". A node comprises several software processes under the control of a Node Manager. The Node Manager knows what software processes should be running and is responsible for starting and monitoring them. Should one fail, the Node Manager will restart the failed process. The Node Manager itself is kept running by the Node Manager Manager, which is tied to an automatic Windows service. Restarts by the Node Manager are controlled through semi-random delays to avoid firestorms. This results in messages like the following (a small matching sketch follows the list):

Process .* on .* went down for unknown reason. Exit code .* It will be automatically restarted

node restarting process .* after having delayed restart for 10 seconds

node process .* successfully reinitialized after restart

Distributor node process .* successfully started

Logger.*node process .* successfully started
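
For someone feeding these into a monitoring tool, the patterns above can be used as regular expressions more or less as written. Here is a minimal Python sketch of that idea; the severity levels are my own guesses rather than anything defined by Cisco, and the sample line at the end is made up.

import re

NODE_MANAGER_PATTERNS = [
    (r"Process .* on .* went down for unknown reason\. Exit code .*", "WARNING"),
    (r"node restarting process .* after having delayed restart", "INFO"),
    (r"node process .* successfully reinitialized after restart", "INFO"),
    (r"Distributor node process .* successfully started", "INFO"),
    (r"Logger.*node process .* successfully started", "INFO"),
]

def classify(message):
    # Return the severity of the first pattern that matches, if any.
    for pattern, severity in NODE_MANAGER_PATTERNS:
        if re.search(pattern, message):
            return severity
    return "UNCLASSIFIED"

# Hypothetical example line, just to show the call:
print(classify("Process foo on PG1A went down for unknown reason. Exit code 1. It will be automatically restarted."))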

This self-healing nature of a node is strengthened immeasurably by a "duplex" implementation. A node can be created as a pair of peers (side A and side B) that talk to each other over a private network. Now the node has fault tolerance, because the server (hardware) could fail on one side, but the other side will still be running. Communication between the sides will result in messages like

HeartBeat Event for Side

When the duplexed node is a PG, the Call Router delivers events to both sides; the "active" side knows to act upon the events, while the "idle" side knows to discard them. Sometimes the sides exchange roles because a process critical to the node fails. Although the Node Manager will restart the process, a restart alone would not provide true fault tolerance, since events would be lost while the process was down. But not so with the ICM system: through a complex architecture of "synchronized zones" and a clever Message Delivery System, a side can "fail over" to the other side and no events will be lost.
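
If you wanted to do something with the heartbeat messages beyond logging them, one simple idea (my own sketch, not a Cisco recommendation) is to remember when each side was last heard from and flag a side that has gone quiet. The exact text that follows "HeartBeat Event for Side" is not shown in this thread, so the capture group and the threshold below are assumptions.

import re
from datetime import datetime, timedelta

HEARTBEAT = re.compile(r"HeartBeat Event for Side\s*([AB])")
last_seen = {}

def note_heartbeat(timestamp, message):
    # Record the time of the most recent heartbeat message per side.
    m = HEARTBEAT.search(message)
    if m:
        last_seen[m.group(1)] = timestamp

def quiet_sides(now, max_age=timedelta(minutes=5)):
    # Sides we have not heard from within max_age (the threshold is arbitrary).
    return [side for side, seen in last_seen.items() if now - seen > max_age]

note_heartbeat(datetime.now(), "HeartBeat Event for Side A")  # hypothetical line
print(quiet_sides(datetime.now()))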

Together, the Call Router and Logger are known as the Central Controller. The Call Router knows about the PGs connected to it and, in the case of duplexed PGs, knows which path is the "active path" and which is the "idle path". As the duplexed PG changes sides, the Router will show messages like the ones below (a small tracking sketch follows them):

Device .* path changing to idle state

Device .* path changing to active state
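
In a monitoring tool, these two messages are a natural pair for tracking the current routing path per device. A minimal sketch, assuming the device identifier is the single token after "Device" (the device names below are hypothetical):

import re

PATH_STATE = re.compile(r"Device (\S+) path changing to (idle|active) state")
current_path = {}

def update_path(message):
    # Remember the last reported path state for each device.
    m = PATH_STATE.search(message)
    if m:
        device, state = m.groups()
        current_path[device] = state

update_path("Device PG1A path changing to idle state")
update_path("Device PG1B path changing to active state")
print(current_path)  # {'PG1A': 'idle', 'PG1B': 'active'}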

A PG interfaces to the peripheral through a Peripheral Interface Manager (PIM). If the communication between the peripheral and the PIM is broken, the peripheral cannot make route requests or receive routing labels. The Call Router will know about this, and you will see messages like:

Peripheral is off-line and not visible to the Peripheral Gateway. Routing to this site is impacted

PG has reported that peripheral .* is not operational

Peripheral Gateway .* is not connected to the Central Controller or is out of service. Routing to this site is impacted

and when the connection is re-established, messages like the following (a small sketch pairing the down/up messages follows them):

Peripheral .* is on-line

PG has reported that peripheral .* is operational.
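
Because the "not operational" and "operational" messages arrive as a pair, a monitoring tool can use them to record how long routing to a peripheral was impacted. A minimal sketch, assuming the peripheral is identified by the token after "peripheral" (the timestamps and peripheral number are made up):

import re
from datetime import datetime

DOWN = re.compile(r"PG has reported that peripheral (\S+) is not operational")
UP = re.compile(r"PG has reported that peripheral (\S+) is operational")
outage_started = {}

def handle(timestamp, message):
    # Note when a peripheral goes down; report the outage length when it returns.
    m = DOWN.search(message)
    if m:
        outage_started[m.group(1)] = timestamp
        return
    m = UP.search(message)
    if m and m.group(1) in outage_started:
        downtime = timestamp - outage_started.pop(m.group(1))
        print("peripheral", m.group(1), "was down for", downtime)

handle(datetime(2012, 5, 1, 9, 0), "PG has reported that peripheral 5000 is not operational")
handle(datetime(2012, 5, 1, 9, 4), "PG has reported that peripheral 5000 is operational")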

The Central Controller can also be implemented as a duplexed system on two servers (colocated Call Router and Logger for each of side A and side B) or even on four servers (Call Router and Logger on separate servers for each of side A and side B). Now an even more complicated synchronization architecture comes into play, in order to keep the Call Routers synchronized and able to deal with node failure with no loss of route requests or calls in queue.

Finally, the system requires a way of changing configuration on the Central Controller, a way of offloading real-time data (for efficiency), and a way of offloading historical data from the Loggers, as the Loggers have a limited retention period.

In order to change configuration, a node called a Distributor (you can have more than one) has to be created. It holds a transient copy of the configuration held on the Logger and establishes a real-time feed to the Router, so that the configuration can be changed and all Distributors in the platform will be updated and synchronized.

Hence messages like:

Distributor node process .* successfully started

The Distributor also receives real-time data and updates its database with real-time changes, so that pseudo real-time reporting can be implemented.

The Distributor can also have an additional role as a Historical Data Server (HDS), receiving replicated historical data from the Logger. If you have duplexed Loggers you would want two Historical Data Servers, but you can have up to four to spread the reporting load. Historical data is replicated through SQL commands via replication processes running on the Logger and the Distributor. This could give rise to messages like:

Logger or HDS on connection .* using .* is either out of service or communication has broken

An ICM system that has been correctly implemented with suitable redundancy is a fault-tolerant and self-healing model that will fix itself under hundreds of different fault conditions. When it does this, it will spit out messages as things go down, processes get restarted, sides swap over, and peripherals go offline and come back. In general, there will be no interruption of service.

But let me say this. You say "I was asked to look up some of the messages that come into our monitoring tool." If this is going to lead to more than a casual comment to the powers that be, and they are going to ask you to maintain this system, you MUST get the correct training.

There is no substitute. The Product Training course and the Admin Training course should be taken as soon as possible, and you should follow up by learning on the lab or test system at your contact center. And if you don't have a test system, then move heaven and earth to build one that matches production.

These things are mission critical and cannot be approached in a non-systematic way.

Regards,

Geoff


2 Replies


Wow, wow!!! This by itself is a class. Thank you very much for taking the time to answer my questions, Geoff. I totally agree about the training and the lab. The funny thing is that we have people who are responsible for the ICM product, but we do the monitoring and call them to take a look at the messages when they get triggered... Is there any way you can put together a Visio or a simple design diagram that I can follow to understand this better? Your explanation is very helpful; I am sure I will refer to it and revisit it many times. Again, thank you very much, and my hat is off to you.