SG350X-24P - Crash with constant reboot with Firmware 2.5.9.16

Hello everyone,

I'm using an SG350X-24P and upgraded from firmware 2.5.8.15 to 2.5.9.16. After my workstation with a Mellanox 10G card (connected via fiber to the uplink ports) came up, the switch immediately crashed with a hard reboot in an endless loop until the fiber connection was removed.

The following error was logged via syslog to my small server system:

2023-05-15T21:22:41.387964+02:00 octopus.mgmt.siski.de %LINK-I-CHNGCOMBOMEDIA: Media changed from copper media to fiber media on port te1/0/2.   
2023-05-15T21:34:00.980374+02:00 octopus.mgmt.siski.de %SYSLOG-F-OSFATAL: mtdSoftwareReset(((rel_ifIndex < (64 * 2))?EXTHWP_SF_phy_port_db_ARR[rel_ifIndex]->mtd_object:EXTHWP_SF_phy_port_db_ARR[0]->mtd_object), HALP_config_phy_port_db[rel_i
fIndex].external_phyId, sleep_time_ms) failed with 0x1  ***** FATAL ERROR *****   Reporting Task: HCLT.  Software Version: 2.5.9.16 (date Feb 27 2023 time 16:53:52)  base_address=0x00444000   
2023-05-15T21:34:00.980966+02:00 octopus.mgmt.siski.de ros(+0x798c98)[0xbdcc98]  ros(HOSTG_fatal_error+0x14)[0xbe0130]  ros(OSSYSG_fatal_error+0x258)[0x10f3370]  ros(OSSYSG_fatal_error_formatted+0x44)[0x10f3510]  ros(+0x1046f9c)[0x148af9c]  
ros(EXTHWP_SF_set_power_modules_2+0x268)[0x148b2e8]  ros(EXTHWG_SF_dispatch+0x78)[0x148cef4]  ros(HALP_config_phy_set_power_modules_4+0x14)[0x1444748]  ros(HALP_config_phy_perform_phy_operation+0xe8)[0x1430e08]  ros(HALC_config_phy_perform
_phy_operation+0xfc)[0x1445588]  ros(+0xfcb458)[0x140f458]  ros(HALC_config_if_dispatch+0x230)[0x1418eac]  ros(+0xfdd12c)[0x142112c]  ros(+0xfdd350)[0x1421350]  ros(HALP_config_main_copy_big_dev_data+0x0)[0x14213d0]  /lib/libp2linux.so.1(ta
sk_run+0xf4)[0xb6f3a840]    ***** END OF FATAL ERROR *****    
2023-05-15T21:34:00.981567+02:00 octopus.mgmt.siski.de %SYSLOG-F-OSFATAL: mtdSoftwareReset(((rel_ifIndex < (64 * 2))?EXTHWP_SF_phy_port_db_ARR[rel_ifIndex]->mtd_object:EXTHWP_SF_phy_port_db_ARR[0]->mtd_object), HALP_config_phy_port_db[rel_i
fIndex].external_phyId, sleep_time_ms) failed with 0x1  ***** FATAL ERROR *****   Reporting Task: HCLT.  Software Version: 2.5.9.16 (date Feb 27 2023 time 16:53:52)  base_address=0x004c5000   
2023-05-15T21:34:00.982076+02:00 octopus.mgmt.siski.de ros(+0x798c98)[0xc5dc98]  ros(HOSTG_fatal_error+0x14)[0xc61130]  ros(OSSYSG_fatal_error+0x258)[0x1174370]  ros(OSSYSG_fatal_error_formatted+0x44)[0x1174510]  ros(+0x1046f9c)[0x150bf9c]
ros(EXTHWP_SF_set_power_modules_2+0x268)[0x150c2e8]  ros(EXTHWG_SF_dispatch+0x78)[0x150def4]  ros(HALP_config_phy_set_power_modules_4+0x14)[0x14c5748]  ros(HALP_config_phy_perform_phy_operation+0xe8)[0x14b1e08]  ros(HALC_config_phy_perform
_phy_operation+0xfc)[0x14c6588]  ros(+0xfcb458)[0x1490458]  ros(HALC_config_if_dispatch+0x230)[0x1499eac]  ros(+0xfdd12c)[0x14a212c]  ros(+0xfdd350)[0x14a2350]  ros(HALP_config_main_copy_big_dev_data+0x0)[0x14a23d0]  /lib/libp2linux.so.1(ta
sk_run+0xf4)[0xb6ea9840]    ***** END OF FATAL ERROR *****    
2023-05-15T21:34:00.982566+02:00 octopus.mgmt.siski.de %SYSLOG-F-OSFATAL: mtdSoftwareReset(((rel_ifIndex < (64 * 2))?EXTHWP_SF_phy_port_db_ARR[rel_ifIndex]->mtd_object:EXTHWP_SF_phy_port_db_ARR[0]->mtd_object), HALP_config_phy_port_db[rel_i
fIndex].external_phyId, sleep_time_ms) failed with 0x1  ***** FATAL ERROR *****   Reporting Task: HCLT.  Software Version: 2.5.9.16 (date Feb 27 2023 time 16:53:52)  base_address=0x00499000   
2023-05-15T21:34:00.983063+02:00 octopus.mgmt.siski.de ros(+0x798c98)[0xc31c98]  ros(HOSTG_fatal_error+0x14)[0xc35130]  ros(OSSYSG_fatal_error+0x258)[0x1148370]  ros(OSSYSG_fatal_error_formatted+0x44)[0x1148510]  ros(+0x1046f9c)[0x14dff9c]
ros(EXTHWP_SF_set_power_modules_2+0x268)[0x14e02e8]  ros(EXTHWG_SF_dispatch+0x78)[0x14e1ef4]  ros(HALP_config_phy_set_power_modules_4+0x14)[0x1499748]  ros(HALP_config_phy_perform_phy_operation+0xe8)[0x1485e08]  ros(HALC_config_phy_perform
_phy_operation+0xfc)[0x149a588]  ros(+0xfcb458)[0x1464458]  ros(HALC_config_if_dispatch+0x230)[0x146deac]  ros(+0xfdd12c)[0x147612c]  ros(+0xfdd350)[0x1476350]  ros(HALP_config_main_copy_big_dev_data+0x0)[0x14763d0]  /lib/libp2linux.so.1(ta
sk_run+0xf4)[0xb6f73840]    ***** END OF FATAL ERROR *****    
2023-05-15T21:34:00.983690+02:00 octopus.mgmt.siski.de %SYSLOG-F-OSFATAL: mtdSoftwareReset(((rel_ifIndex < (64 * 2))?EXTHWP_SF_phy_port_db_ARR[rel_ifIndex]->mtd_object:EXTHWP_SF_phy_port_db_ARR[0]->mtd_object), HALP_config_phy_port_db[rel_i
fIndex].external_phyId, sleep_time_ms) failed with 0x1  ***** FATAL ERROR *****   Reporting Task: HCLT.  Software Version: 2.5.9.16 (date Feb 27 2023 time 16:53:52)  base_address=0x004fb000   
2023-05-15T21:34:00.984208+02:00 octopus.mgmt.siski.de ros(+0x798c98)[0xc93c98]  ros(HOSTG_fatal_error+0x14)[0xc97130]  ros(OSSYSG_fatal_error+0x258)[0x11aa370]  ros(OSSYSG_fatal_error_formatted+0x44)[0x11aa510]  ros(+0x1046f9c)[0x1541f9c]
ros(EXTHWP_SF_set_power_modules_2+0x268)[0x15422e8]  ros(EXTHWG_SF_dispatch+0x78)[0x1543ef4]  ros(HALP_config_phy_set_power_modules_4+0x14)[0x14fb748]  ros(HALP_config_phy_perform_phy_operation+0xe8)[0x14e7e08]  ros(HALC_config_phy_perform
_phy_operation+0xfc)[0x14fc588]  ros(+0xfcb458)[0x14c6458]  ros(HALC_config_if_dispatch+0x230)[0x14cfeac]  ros(+0xfdd12c)[0x14d812c]  ros(+0xfdd350)[0x14d8350]  ros(HALP_config_main_copy_big_dev_data+0x0)[0x14d83d0]  /lib/libp2linux.so.1(ta
sk_run+0xf4)[0xb6f56840]    ***** END OF FATAL ERROR *****    
2023-05-15T21:34:00.986547+02:00 octopus.mgmt.siski.de %SYSLOG-N-LOGGING: Logging started.   
2023-05-15T21:34:05.070855+02:00 sw3-2l.mgmt.siski.de %BOOTP_DHCP_CL-I-DHCPCONFIGURED: The device has been configured on interface Vlan 30 , IP 172.16.1.23, mask 255.255.255.0, DHCP server 172.16.1.17    
2023-05-15T21:34:05.234446+02:00 sw5-1l.mgmt.siski.de %BOOTP_DHCP_CL-I-DHCPCONFIGURED: The device has been configured on interface Vlan 30 , IP 172.16.1.22, mask 255.255.255.0, DHCP server 172.16.1.17    
2023-05-15T21:34:26.954738+02:00 octopus.mgmt.siski.de %LINK-I-Up:  gi1/0/24   
2023-05-15T21:34:27.618387+02:00 octopus.mgmt.siski.de %LINK-W-Down:  gi1/0/24   
2023-05-15T21:34:30.401167+02:00 octopus.mgmt.siski.de %LINK-I-Up:  gi1/0/24   
2023-05-15T21:34:32.090383+02:00 octopus.mgmt.siski.de %LINK-I-Up:  gi1/0/16   
2023-05-15T21:35:00.719544+02:00 octopus.mgmt.siski.de %LINK-W-Down:  gi1/0/24   
2023-05-15T21:35:10.431903+02:00 octopus.mgmt.siski.de %LINK-I-CHNGCOMBOMEDIA: Media changed from copper media to fiber media on port te1/0/2.  
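
For reference, this output was only captured because the switch forwards its log to an external syslog host; the on-switch RAM log is lost with every reboot. A minimal sketch of that switch-side setting, assuming the syslog server sits at 192.0.2.10 (substitute your own address and verify the syntax against your firmware's CLI guide):

configure terminal
! forward log messages to the external syslog server (placeholder address)
logging host 192.0.2.10
end
! save the setting so it survives the next crash/reboot
copy running-config startup-config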

There is no issue with firmware 2.5.8.15. The network card on the other side that triggers the crash with firmware 2.5.9.16 is a Mellanox Technologies MT27710 Family [ConnectX-4 Lx]. It uses the mlx5 driver on Ubuntu 22.04.

The Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] card does not cause problems and also works with the newer firmware (but uses the mlx4 driver on Debian Buster).
The SFP+ modules (Finisar and similar brands) have been running in this switch for years and have never caused any issues up to this date.
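
For reference, the NIC model and driver can be confirmed on the Linux side with standard tools; the interface name enp65s0f0 below is only a placeholder for whatever name the 10G port has on your system:

lspci -nn | grep -i mellanox   # shows the PCI ID and model (MT27710 / ConnectX-4 Lx here)
ethtool -i enp65s0f0           # shows the driver (mlx5_core), driver version and NIC firmware version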

A detailed description can be provided on request.

Regards

12 Replies

vistalba
Level 1

I ran into a similar issue as well. Two Supermicro servers are connected to the switch with 10G SFP+ modules.
After a power outage (the servers were shut down by the UPS) the systems never came back up. I started analyzing and saw that the SG350X-24P switch was in an infinite boot loop.
I tried disconnecting the ports and it came back up. So I started plugging the cables back in one by one. As soon as I attached one of the Supermicro servers, it crashed. Merely inserting the SFP+ module doesn't matter; it crashes as soon as the link comes up, i.e. when the cable is connected on both ends.
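
For anyone hitting the same thing: the active firmware and the logged crash entries can be checked from the CLI in the short window before the next crash. A rough sketch; the exact output format may differ between firmware versions:

! show the active and inactive firmware images
show version
! dump the log buffer, including the FATAL ERROR entries
! (the RAM buffer is cleared on reboot, so an external syslog host is more reliable)
show logging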

Log:

 

 

%SYSLOG-F-OSFATAL: mtdSoftwareReset(((rel_ifIndex < (64 * 2))?EXTHWP_SF_phy_port_db_ARR[rel_ifIndex]->mtd_object:EXTHWP_SF_phy_port_db_ARR[0]->mtd_object), HALP_config_phy_port_db[rel_ifIndex].external_phyId, sleep_time_ms) failed with 0x1 ***** FATAL ERROR *****  Reporting Task: HCLT. Software Version: 2.5.9.16 (date Feb 27 2023 time 16:53:52) base_address=0x0048a000 ros(+0x798c98)[0xc22c98] ros(HOSTG_fatal_error+0x14)[0xc26130] ros(OSSYSG_fatal_error+0x258)[0x1139370] ros(OSSYSG_fatal_error_formatted+0x44)[0x1139510] ros(+0x1046f9c)[0x14d0f9c] ros(EXTHWP_SF_set_power_modules_2+0x268)[0x14d12e8] ros(EXTHWG_SF_dispatch+0x

 

 

This issue occurs at least on 2.5.9.15 and 2.5.9.16. The switch had been running fine for some months in the exact same configuration. I am now trying to downgrade to 2.5.8.15 as mentioned in the original post.

Edit: one more thing: the SFP+ modules are connected to the combo ports while the corresponding copper ports are empty.

Update: After downgrading to 2.5.8.15, the SFP+ ports and links have now been stable for more than 24 hours without any issues.
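
For anyone who needs to do the same downgrade: after uploading 2.5.8.15 as the inactive image, the switch has to be told to boot from it. A rough sketch from memory; the image-selection command (boot system image-1 below) is an assumption and may be named differently on your firmware, or the active image can be swapped in the web UI under the file/firmware management pages instead:

! check which image slot holds 2.5.8.15 and which one is active
show version
! select the slot that contains 2.5.8.15 (the slot number here is an assumption)
boot system image-1
! reboot into the selected image
reload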

 

Hello, I have a similar problem. More details at this link: https://community.cisco.com/t5/switches-small-business/system-parameters-reset-to-zero-sg350-28mp/m-p/5240168#M29248.

Update: After downgrading to version 2.5.8.15, the SFP+ ports and links have now been stable for over 24 hours without any problems.

I only have access to the interface for 1 minute. In PuTTY, via the console cable, I can't enter the username.

This is what I got after plugging in the console cable:
***** FATAL ERROR *****
Reporting Task: DH6C.
Software Version: 2.5.0.83 (date Jun 18 2019 time 16:44:23)
base_address=0xb4733000
ros(+0x78a5f0)[0xb4ebd5f0]
ros(HOSTG_fatal_error+0x10)[0xb4ebfbf4]
ros(OSSYSG_fatal_error+0x2a0)[0xb54895dc]
ros(+0xb58348)[0xb528b348]
ros(+0x849270)[0xb4f7c270]
ros(+0x84ba74)[0xb4f7ea74]
ros(+0x84857c)[0xb4f7b57c]
ros(+0x84d8c4)[0xb4f808c4]
ros(DHCPV6CLIENTP_task+0x3ec)[0xb4f84638]
/lib/libp2linux.so.1(task_run+0xf4)[0xb46c3818]

My biggest concern: have you managed to get the system running again?

As long as I stay on firmware version 2.5.8.15, the switch boots normally with all interfaces and SFP+ modules connected.

As soon as I try a 2.5.9.x version, the issue reappears. So for now I have to stick with 2.5.8.15.

 

When the switch kept restarting, how did you update it?

In my case, I only have access to the web interface for 30 seconds. When I upload the file 'image_tesla_hybrid_2.5.9.54_release_cisco_signed.bin', the transfer stops partway through.

As mentioned, the issue seems to be the SFP+ links coming up in combination with the 2.5.9.x versions. So I just disconnected the SFP+ modules for the downgrade. With the SFP+ disconnected, the switch came up normally on 2.5.9.x.
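
Instead of physically pulling the fibres, the affected SFP+ ports can also be shut down administratively before the firmware change and re-enabled afterwards. A minimal sketch, using te1/0/2 from the log above as the example port; repeat for each affected port:

configure terminal
interface te1/0/2
! administratively disable the port so the link cannot come up
shutdown
end
! save, so the port stays down across the reboot into the new firmware
copy running-config startup-config
! ... perform the firmware change and reload ...
! afterwards, on the stable firmware:
configure terminal
interface te1/0/2
no shutdown
end
copy running-config startup-config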

I don't have an SFP+ module on the switch

This is what I got after pressing the 'reset' button for 10 seconds with no RJ45 port connected:

 

To perform reset to factory defaults do not release the button for 10 seconds.

Resetting device to factory defaults.

 

**************************************************

*****************  SYSTEM RESET  *****************

**************************************************

Restarting system.

 

BootROM 1.41

Booting from NAND flash

 

General initialization - Version: 1.0.0

Serdes initialization - Version: 1.0.2

PEX: pexIdx 0, detected no link

DDR3 Training Sequence - Ver TIP-1.56.0

DDR3 Training Sequence - Switching XBAR Window to FastPath Window

Updated Physical Mem size is from 0x20000000 to 10000000

DDR3 Training Sequence - Ended Successfully

BootROM: Image checksum verification PASSED

 

ROS Booton: May 26 2019 14:16:26

 

Press x to choose XMODEM...

Booting from NAND flash

 

Running UBOOT...

 

U-Boot 2013.01 (Jun 18 2019 - 16:47:02) Marvell version: 2014_T3.0_eng_dropv6 2.5.18

 

Loading system/images/active-image ...

secure boot not supported

Uncompressing Linux... done, booting the kernel.

I2C frequency 100 kHz (Tclk 200 MHz, freq_m 12, freq_n 3)

 

MAC address   :  ac:4a:56:77:4a:9f.

 

Autoboot in 2 seconds - press RETURN or Esc. to abort and enter prom.

 

*******************************************************************

*** Running  SW  Ver. 2.5.0.83  Date Jun 18 2019  Time 16:44:23 ***

*******************************************************************

 

HW version is V07

Serial Number is DNIxxxxxxxWG

Base Mac address is: ac:aa:aa:aa:4a:9f

Dram size is  : 512M bytes

Flash size is: 256M

18-Jun-2019 04:45:19 %CDB-I-LOADCONFIG: Loading running configuration.

18-Jun-2019 04:45:19 %CDB-I-LOADCONFIG: Loading startup configuration.

Device configuration:

Slot 1 - SG350-28MP

Device 0: CPSS_98DX3235 (AlleyCat3)

CPLD version is: 0x03

CPU speed: 800 MHz

 

------------------------------------

-- Unit Factory Default           --

------------------------------------

 

18-Jun-2019 04:45:33 %INIT-I-InitCompleted: Initialization task is completed

 

>

-----------------------------------

-- Unit Number 1  Master Enabled --

-----------------------------------

 

18-Jun-2019 04:45:43 %Environment-W-RPS-STAT-MSG: Power supply source changed to Main Power Supply.

18-Jun-2019 04:45:43 %MLDP-I-MASTER: Switching to the Master Mode.

18-Jun-2019 04:45:46 %Entity-I-SEND-ENT-CONF-CHANGE-TRAP: entity configuration change trap.

18-Jun-2019 04:45:46 %SNMP-I-CDBITEMSNUM: Number of running configuration items loaded: 0

18-Jun-2019 04:45:46 %SNMP-I-CDBITEMSNUM: Number of startup configuration items loaded: 0

The SSH Server is generating a default RSA key.

This may take a few minutes, depending on the key size.

18-Jun-2019 04:45:47 %NT_poe-I-PoEPowerSourceChange: Active power source set to PS for unit 1

The SSH Server is generating a default DSA key.

This may take a few minutes, depending on the key size.

18-Jun-2019 04:45:51 %Environment-I-FAN-STAT-CHNG: FAN# 1 status changed to operational.

18-Jun-2019 04:45:51 %Environment-I-FAN-STAT-CHNG: FAN# 2 status changed to operational.

The SSH Client is generating a default RSA key.

This may take a few minutes, depending on the key size.

The SSH Client is generating a default DSA key.

This may take a few minutes, depending on the key size.

18-Jun-2019 04:46:00 %SSL-I-SSLCTASK: Starting autogeneration of self-signed certificate - 2048 bits

Generating RSA private key, 2048 bit long modulus

18-Jun-2019 04:46:13 %SSL-I-SSLCTASK: Autogeneration of self-signed certificate was successfully completed

Generating RSA private key, 2048 bit long modulus

>lcli

Console baud-rate auto detection is enabled, press Enter twice to complete the detection process

User Name :

Detected speed: 115200

 

User Name:cisco

Password:*****

 

Please change your username AND password from the default settings.

Change of credentials is required for better protection of your network.

Please note that new password must follow password complexity rules.

Enter new username: az

Enter new password: ********

Confirm new password: ********

Username and password were successfully updated.

switch774a9f#24-Dec-2024 19:00:58 %DHCPV6CLIENT-I-ADDR: DHCPv6 Address :: received on vlan 1 from DHCP Server fe80::6a3f:7dff:fe3d:6ef0 was renewed

24-Dec-2024 19:00:58 %DHCPV6CLIENT-I-ADDR: DHCPv6 Address :: received on vlan 1 from DHCP Server fe80::6a3f:7dff:fe3d:6ef0 was renewed

24-Dec-2024 19:00:58 %DHCPV6CLIENT-I-ADDR: DHCPv6 Address :: received on vlan 1 from DHCP Server fe80::6a3f:7dff:fe3d:6ef0 was renewed

24-Dec-2024 19:00:58 %DHCPV6CLIENT-I-ADDR: DHCPv6 Address :: received on vlan 1 from DHCP Server fe80::6a3f:7dff:fe3d:6ef0 was renewed

24-Dec-2024 19:00:58 %DHCPV6CLIENT-I-ADDR: DHCPv6 Address :: received on vlan 1 from DHCP Server fe80::6a3f:7dff:fe3d:6ef0 was renewed

 

24-Dec-2024 19:01:09 %DHCPV6CLIENT-F-HASHINCONS: Hash table inconsistancy, table - DHCPV6CLIENTP_update_address

 

***** FATAL ERROR *****

Reporting Task: DH6C.

Software Version: 2.5.0.83 (date Jun 18 2019 time 16:44:23)

base_address=0xb46e0000

ros(+0x78a5f0)[0xb4e6a5f0]

ros(HOSTG_fatal_error+0x10)[0xb4e6cbf4]

ros(OSSYSG_fatal_error+0x2a0)[0xb54365dc]

ros(+0xb58348)[0xb5238348]

ros(+0x849270)[0xb4f29270]

ros(+0x84ba74)[0xb4f2ba74]

ros(+0x84857c)[0xb4f2857c]

ros(+0x84d8c4)[0xb4f2d8c4]

ros(DHCPV6CLIENTP_task+0x3ec)[0xb4f31638]

ros(+0x84d8c4)[0xb4f2d8c4]

ros(DHCPV6CLIENTP_task+0x3ec)[0xb4f31638]

/lib/libp2linux.so.1(task_run+0xf4)[0xb4670818]

 

***** END OF FATAL ERROR *****

 

**************************************************

*****************  SYSTEM RESET  *****************

**************************************************

Restarting system.

I can't believe it's the DHCPv6 on the Livebox that's making everything crash.

That could make sense for my case too.

My Supermicro server is hosting multiple VMs. One of the VMs is a firewall that provides IPv6 to the connected networks.

There's a good chance, because since IPv6 was deactivated I've had no more problems.

The question is rather why Cisco is not fixing the issue at all, as it has existed for more than 1.5 years now.

I retested to be sure: with DHCPv6 activated, the switch restarts permanently, even with version image_tesla_hybrid_2.5.9.54_release_cisco_signed.bin.
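
For completeness, the workaround that matches this observation is to stop the switch's own management interface from acting as a DHCPv6 client until a fixed firmware is available. A rough sketch for VLAN 1; the command names (ipv6 address dhcp / ipv6 enable) are given from memory of the 350-series CLI guide, so treat them as an assumption and verify against your firmware:

configure terminal
interface vlan 1
! stop requesting an IPv6 address via DHCPv6 on the management VLAN
no ipv6 address dhcp
! or, more drastically, disable IPv6 processing on the interface entirely
no ipv6 enable
end
copy running-config startup-config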