12-23-2010 08:27 AM
This is going to sound like a seriously noob question because it is.
We have two IronPort C160 devices configured in a cluster in our environment. Both seem to purr along nicely, but one unit has "crashed" a couple of times in the past few months. I'm trying to figure out what to check on the device(s), in terms of basic diagnostics and troubleshooting, to find out what has happened. I'd like to have some basic ideas before calling support for assistance. Any thoughts? Are there log files that might indicate a problem, or something else I can look at to see what is going on?
Any input would be greatly appreciated.
12-27-2010 11:59 AM
Greetings,
This is actually a great question. Performing some initial diagnosis (information gathering) prior to contacting support can go a long way toward speeding up determination of the root cause. It can also improve the accuracy of the analysis. The first thing we have to do is determine what "crash" means. Did the system become totally unresponsive, or did it just become unresponsive to network requests? The reason this is important is that a system that is not responding to commands (locked up) may be encountering a different type of fault than a system that is just not responding to network requests. For example, a system that is totally locked up and not responding to any commands from the console may be encountering a problem with the RAID controller or memory (hardware), while a system that is accessible via the console but not via SSH or HTTPS may simply be overburdened with connections or may be experiencing DNS-related issues.
Before powering down or rebooting the appliance, try connecting to the serial port first. In the majority of cases that we see, the appliance is operational but has encountered a network-related issue that is preventing or limiting connections, including those to the GUI or via SSH. Console access will help you determine whether the issue goes further than just a networking problem. The system logs can be helpful in a situation like this. The status logs can also help you look at trends in a number of parameters, such as CPU, memory, connections in and out, and disk I/O.
Going further, if you do contact support, we prefer that you do so prior to rebooting the appliance. If you have to power cycle the appliance, it's possible to lose any logging details related to the event, so we would prefer to see the appliance in the fault state.
I am including some detailed information below that outlines some basic diagnostic procedures that we recommend for events like this. If you are seeing such events on a regular basis it would be advisable to contact support so we can perform a more detailed analysis.
Environment: Cisco IronPort Email Security Appliance (ESA), Security Management Appliance (SMA), all versions of AsyncOS
Symptoms: You are unable to connect to your ESA or SMA appliance over the network. You have attempted to connect using the web interface and the CLI via SSH and the appliance does not appear to be answering the requests.
In a majority of cases the appliance is not actually locked up. It may simply be in a state that is preventing it from responding to network requests in the usual manner. Below are some guidelines that can help you diagnose the problem, and possibly get your system back up and running or at least in a state you can work with.
It is very important that you do not power cycle the system unless advised to do so by technical support. Power cycling the appliance can cause data corruption, which can result in lost messages, database corruption, lost logging data, and damage to the file system. When you power cycle the appliance, it is not able to unmount the file systems cleanly. For this reason you should always use the 'shutdown' or 'reboot' command from the CLI, or the Shutdown/Reboot option listed under the System Administration tab in the GUI.
So what if you rebooted the appliance correctly and still cannot gain access via the network?
In many cases simply swapping out the network cable or moving to another port on the switch can resolve the connectivity issue.
A network crossover cable will allow you to connect directly to the Ethernet ports on the appliance. You will, however, have to configure the connecting host to be on the same subnet as the interface you're connecting to. Using a network crossover cable can be helpful in diagnosing situations related to your LAN. One such issue is having another host with the same IP address on the same subnet.
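As a rough illustration of the "same subnet" requirement above, here is a minimal Python sketch that checks whether an address you plan to give the connecting host would work with an appliance interface. The function name is made up for this example, and the 192.168.1.33/24 value is simply the Management interface from the sample interfaceconfig output later in this post; substitute your own values.

```python
import ipaddress

def usable_host_address(appliance_interface: str, host_ip: str) -> bool:
    """Return True if host_ip is on the appliance interface's subnet
    and does not collide with the appliance's own address."""
    iface = ipaddress.ip_interface(appliance_interface)
    host = ipaddress.ip_address(host_ip)
    # Must be inside the same network, but must not duplicate the appliance IP.
    return host in iface.network and host != iface.ip

# Example values only; 192.168.1.33/24 is taken from the sample output in this post.
print(usable_host_address("192.168.1.33/24", "192.168.1.50"))
print(usable_host_address("192.168.1.33/24", "192.168.2.50"))
print(usable_host_address("192.168.1.33/24", "192.168.1.33"))
```

The last call shows the duplicate-address case: the host is on the right subnet but would conflict with the appliance itself.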
If your system is not responding to network requests and immediate access is needed, you can connect to the serial port located on the rear of your appliance. This port is a standard DB9 connector and can be used with the serial cable that came with your appliance. If you do not have the serial cable that came with your appliance, you will need to obtain one that is configured as a null modem cable. Alternatively, you can use a standard serial cable with a null modem adapter. Once you have connected the cable to the appliance, connect the other end to another system, such as a laptop. You will need a terminal program like HyperTerminal or Procomm, configured for 9600 baud, 8N1. Once you have started your terminal program, you should be able to connect and get a login prompt. If the serial port is not responding, verify that the cable is connected and the unit is powered on. If you still cannot get a login prompt, it is advisable to contact customer support for further assistance.
If you are able to obtain access via the serial port, issue the status command (status detail in the example below) and check that the appliance is listed as being "Online".
mail.example.com > status detail
Status as of: Mon Jan 04 12:48:31 2010 CST
Up since: Tue Jul 14 16:50:50 2009 CDT (173d 20h 57m 41s)
Last counter reset: Never
System status: Online
Oldest Message: 24 weeks 16 hours 30 mins 48 secs
Feature - Centralized Tracking: 833 days
Feature - Centralized Reporting: 833 days
Feature - IronPort Centralized Configuration Manager: 60 days
Feature - Incoming Mail Handling: Perpetual
Feature - Centralized Spam Quarantine: 833 days
If the status detail command does not respond or produces an error, contact customer support.
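If you capture the console session to a text file, a check like the one above can be done with a small script. This is only a sketch, assuming the output keeps the "System status: ..." line format shown in the sample; the function name and the idea of saving the output to a string are my own for illustration, not part of AsyncOS.

```python
from typing import Optional

def system_status(status_output: str) -> Optional[str]:
    """Return the value of the 'System status' line, or None if it is absent."""
    for line in status_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "system status":
            return value.strip()
    return None

# A few lines copied from the sample 'status detail' output in this post.
sample = """Status as of: Mon Jan 04 12:48:31 2010 CST
Up since: Tue Jul 14 16:50:50 2009 CDT (173d 20h 57m 41s)
Last counter reset: Never
System status: Online
"""

status = system_status(sample)
print("appliance online" if status == "Online" else f"check appliance: {status}")
```

A missing line returns None, which is worth treating the same as a failed command.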
Use the "version" command to check the RAID status.
mail.example.com > version
Current Version
===============
Model: M660
Version: 6.5.2-101
Build Date: 2009-05-28
Install Date: 2009-07-14 17:04:32
Serial #: 002C999999-J999999
BIOS: 2.4.3I
RAID: 1.21.02-0528, 2.01.00, 1.02-014B
RAID Status: Optimal
RAID Type: 10
BMC: 1.77
If the RAID is degraded, it's possible the appliance is encountering other issues that may or may not be related to the apparent lock-up. If the version command does not respond or does not provide any data, contact customer support.
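For anyone keeping captured console output, the RAID check can be sketched the same way. This assumes the "Key: value" layout from the sample version output above and treats anything other than "Optimal" as worth a closer look; the helper names are hypothetical and only for illustration.

```python
def parse_key_values(output: str) -> dict:
    """Turn 'Key: value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def raid_needs_attention(version_output: str) -> bool:
    """True if the RAID Status line is missing or is anything other than 'Optimal'."""
    return parse_key_values(version_output).get("RAID Status") != "Optimal"

# Lines copied from the sample 'version' output in this post.
sample = """Current Version
===============
Model: M660
Install Date: 2009-07-14 17:04:32
RAID Status: Optimal
RAID Type: 10
"""

print(raid_needs_attention(sample))
```

Splitting on the first colon only keeps values such as the install timestamp intact.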
Check your network configuration using the command etherconfig.
mail.example.com > etherconfig
Choose the operation you want to perform:
- MEDIA - View and edit ethernet media settings.
- VLAN - View and configure VLANs.
- LOOPBACK - View and configure Loopback.
- MTU - View and configure MTU.
[]> media
Ethernet interfaces:
1. Data 1 (Autoselect: )) 00:22:19:b0:03:c4
2. Data 2 (Autoselect: )) 00:22:19:b0:03:c6
3. Management (Autoselect: <1000baseTX full-duplex>) 00:10:18:4e:29:88
Choose the operation you want to perform:
- EDIT - Edit an ethernet interface.
[]>
Choose the operation you want to perform:
- MEDIA - View and edit ethernet media settings.
- VLAN - View and configure VLANs.
- LOOPBACK - View and configure Loopback.
- MTU - View and configure MTU.
[]> MTU
Ethernet interfaces:
1. Data 1 default mtu 1500
2. Data 2 default mtu 1500
3. Management default mtu 1500
Choose the operation you want to perform:
- EDIT - Edit an ethernet interface.
[]>
Recent network changes can have an impact on connectivity to the appliance.
Use the command "interfaceconfig" to verify your interface settings.
mail.example.com > interfaceconfig
Currently configured interfaces:
1. Management (192.168.1.33/24 on Management: downside.hometown.net)
2. outbound_gloop_ISQ_notify (192.168.1.34/24 on Management: inside.hometown.net)
Choose the operation you want to perform:
- NEW - Create a new interface.
- EDIT - Modify an interface.
- GROUPS - Define interface groups.
- DELETE - Remove an interface.
[]>
Try flushing all of the network-related caches.
mail.example.com > diagnostic
Choose the operation you want to perform:
- RAID - Disk Verify Utility.
- DISK_USAGE - Check Disk Usage.
- NETWORK - Network Utilities.
- REPORTING - Reporting Utilities.
- TRACKING - Tracking Utilities.
[]> network
Choose the operation you want to perform:
- FLUSH - Flush all network related caches.
- ARPSHOW - Show system ARP cache.
- SMTPPING - Test a remote SMTP server.
- TCPDUMP - Dump ethernet packets.
[]> flush
Flushing LDAP cache.
Flushing DNS cache.
Flushing system ARP cache.
10.92.152.1 (10.92.152.1) deleted
10.92.152.18 (10.92.152.18) deleted
Network reset complete.
Choose the operation you want to perform:
- FLUSH - Flush all network related caches.
- ARPSHOW - Show system ARP cache.
- SMTPPING - Test a remote SMTP server.
- TCPDUMP - Dump ethernet packets.
[]>
If any of the network related commands fail to respond, contact customer support.
Once you have performed these steps, if you are still unable to gain access via the network it would be advisable to contact customer support for further assistance.
Christopher C Smith
CSE
Cisco IronPort Customer Support
12-24-2010 12:22 AM
In the GUI, you can go to "System Administration > Alerts" and configure your IronPort to send system alerts and hardware alerts via e-mail.
Also, in the CLI you can read the "system_logs". The next example uses "grep" to show only the "system_logs" entries containing the word "Sep":
ironport> grep -e "Sep" system_logs
Wed Sep 1 2010 Critical: Could not issue an SNMP trap: Cannot find module (SNMPv2-MIB): At line 0 in (none)
Wed Sep 1 2010 Warning: Received an invalid DNS Response: rcode=ServFail data="'\\xbd\\x8b\\x81\\x82\\x00\\
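If you copy the logs off the appliance, you can also sort through them offline. Below is a minimal Python sketch that pulls out Critical and Warning entries. It assumes each entry looks like the two lines above (date, severity, colon, message); real system_logs entries may carry more fields, so treat the pattern as an assumption. The second sample line is shortened on purpose because the original is cut off.

```python
import re

# Assumed layout, based only on the two example lines in this post:
# "<Day> <Mon> <D> <YYYY> <Severity>: <message>"
ENTRY = re.compile(r"^(\w{3} \w{3} +\d{1,2} \d{4}) (\w+): (.*)$")

def entries_by_severity(lines, levels=("Critical", "Warning")):
    """Return (date, severity, message) tuples for lines matching the given levels."""
    found = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match and match.group(2) in levels:
            found.append(match.groups())
    return found

sample = [
    "Wed Sep 1 2010 Critical: Could not issue an SNMP trap: Cannot find module (SNMPv2-MIB): At line 0 in (none)",
    "Wed Sep 1 2010 Warning: Received an invalid DNS Response: rcode=ServFail",
]

for date, severity, message in entries_by_severity(sample, levels=("Critical",)):
    print(date, severity, message)
```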
12-29-2010 08:25 AM
Thanks for the feedback and suggestions Eduardo.
12-29-2010 08:37 AM
Thanks Christopher! This is exactly what I was looking for.
12-30-2010 09:25 AM
Chris, AWESOME post. Thank you for the time invested in writing it. I added it to my personal notes, not only for how to troubleshoot issues but also on how to write troubleshooting documentation. : ) Emoticons not working on my IE 9 browser. : (
Happy Holidays IronPort Nation!
Jason