<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Cisco Secure Network Analysis DSN Failure in Security Analytics</title>
    <link>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5168942#M1096</link>
    <description>&lt;P&gt;Hello, we currently have two failed data nodes in a 9-node DSN cluster. We already tried to start the failed nodes, but SMC returns the following error:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Unable to read database catalogs - cannot start database&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;First, I would appreciate it if anyone could share troubleshooting commands for this issue.&lt;BR /&gt;Second, should I delete these nodes (2 of 9) and add fresh nodes in their place, and can I expect no data loss?&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
    <pubDate>Sun, 01 Sep 2024 10:59:40 GMT</pubDate>
    <dc:creator>navidn</dc:creator>
    <dc:date>2024-09-01T10:59:40Z</dc:date>
    <item>
      <title>Cisco Secure Network Analysis DSN Failure</title>
      <link>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5168942#M1096</link>
      <description>&lt;P&gt;Hello, we currently have two failed data nodes in a 9-node DSN cluster. We already tried to start the failed nodes, but SMC returns the following error:&lt;BR /&gt;&lt;BR /&gt;&lt;STRONG&gt;Unable to read database catalogs - cannot start database&lt;/STRONG&gt;&lt;BR /&gt;&lt;BR /&gt;First, I would appreciate it if anyone could share troubleshooting commands for this issue.&lt;BR /&gt;Second, should I delete these nodes (2 of 9) and add fresh nodes in their place, and can I expect no data loss?&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Sun, 01 Sep 2024 10:59:40 GMT</pubDate>
      <guid>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5168942#M1096</guid>
      <dc:creator>navidn</dc:creator>
      <dc:date>2024-09-01T10:59:40Z</dc:date>
    </item>
    <item>
      <title>Re: Cisco Secure Network Analysis DSN Failure</title>
      <link>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5308372#M1204</link>
      <description>&lt;P&gt;You're facing a serious issue with your Cisco Secure Network Analytics (formerly Stealthwatch) Data Store Node (DSN) cluster: two failed data nodes in a 9-node cluster, with the error:&lt;/P&gt;&lt;P&gt;"Unable to read database catalogs - cannot start database"&lt;/P&gt;&lt;P&gt;Let’s walk through (1) troubleshooting steps and (2) best practices for recovery, including whether it’s safe to delete and re-add those nodes.&lt;/P&gt;&lt;P&gt;1. Troubleshooting Commands (on Failed Data Nodes)&lt;BR /&gt;You need to assess why the database won't start on the failed nodes. Start with these CLI checks (as the sysadmin or root user):&lt;/P&gt;&lt;P&gt;Check Disk Space / Mounts&lt;BR /&gt;df -h&lt;BR /&gt;lsblk&lt;BR /&gt;mount&lt;BR /&gt;Look for:&lt;/P&gt;&lt;P&gt;/data, /var, or /opt not being mounted&lt;/P&gt;&lt;P&gt;Full partitions (especially /data)&lt;/P&gt;&lt;P&gt;Bad or missing block devices&lt;/P&gt;&lt;P&gt;Check Database Status&lt;BR /&gt;sudo su - postgres&lt;BR /&gt;psql -d dscdb -c "\l"&lt;BR /&gt;or, if that fails:&lt;/P&gt;&lt;P&gt;sudo systemctl status dscdb&lt;BR /&gt;sudo journalctl -xe | grep postgres&lt;BR /&gt;An "Unable to read database catalogs" error often indicates:&lt;/P&gt;&lt;P&gt;Corrupted database&lt;/P&gt;&lt;P&gt;Improper shutdown&lt;/P&gt;&lt;P&gt;Missing or damaged WAL logs&lt;/P&gt;&lt;P&gt;Check Logs&lt;BR /&gt;grep dscdb /var/log/messages&lt;BR /&gt;cat /var/log/dscdb/dscdb.log&lt;BR /&gt;Also check:&lt;/P&gt;&lt;P&gt;/var/log/postgresql/postgresql-*.log&lt;BR /&gt;You're looking for:&lt;/P&gt;&lt;P&gt;"could not open file"&lt;/P&gt;&lt;P&gt;"missing WAL segment"&lt;/P&gt;&lt;P&gt;"catalog missing or corrupted"&lt;/P&gt;&lt;P&gt;"FATAL: 
database system is corrupted or needs recovery"&lt;/P&gt;&lt;P&gt;2. What Happens if You Remove and Rebuild the Nodes?&lt;BR /&gt;In a 9-node cluster, you typically have 3 shards with 3 replicas each (i.e., triple replication). Losing 2 nodes is survivable only if every shard still has healthy replicas among the other 7 nodes.&lt;/P&gt;&lt;P&gt;But there is no guarantee until you validate the current cluster state.&lt;/P&gt;&lt;P&gt;Check Cluster Health (on SMC/Manager):&lt;BR /&gt;From the SMC CLI, run:&lt;/P&gt;&lt;P&gt;sudo /lancope/bin/lswnoded status&lt;BR /&gt;Or from the UI:&lt;/P&gt;&lt;P&gt;Central Management → Data Store → Look at the replica sets and shard health.&lt;/P&gt;&lt;P&gt;You’re looking for:&lt;/P&gt;&lt;P&gt;Which shards are missing?&lt;/P&gt;&lt;P&gt;Are all replica sets still showing 2 or more healthy nodes?&lt;/P&gt;&lt;P&gt;If each shard still has a majority of its replicas healthy, you can safely remove the failed nodes and re-add them. 
The system will rebalance data to restore full replication.&lt;/P&gt;&lt;P&gt;Do NOT delete nodes blindly unless you're sure their data is already replicated elsewhere.&lt;/P&gt;&lt;P&gt;If Data Corruption is Confirmed on Failed Nodes&lt;BR /&gt;If logs confirm Postgres catalog corruption (which sounds likely), and you:&lt;/P&gt;&lt;P&gt;Can't recover the DB&lt;/P&gt;&lt;P&gt;Can confirm the remaining nodes have healthy replicas&lt;/P&gt;&lt;P&gt;Then proceed with removal + rebuild:&lt;/P&gt;&lt;P&gt;Recommended Recovery Steps&lt;BR /&gt;From SMC: Deregister Failed Nodes&lt;/P&gt;&lt;P&gt;Central Management → Data Store → Select failed nodes → Remove&lt;/P&gt;&lt;P&gt;Reinstall the Data Store Image on New Appliances&lt;/P&gt;&lt;P&gt;Must match version (7.5.2)&lt;/P&gt;&lt;P&gt;Reconfigure basic network, NTP, DNS, etc.&lt;/P&gt;&lt;P&gt;Register New Nodes to the Cluster&lt;/P&gt;&lt;P&gt;Use SMC → Central Management → Add Node&lt;/P&gt;&lt;P&gt;Cluster Will Auto-Rebalance&lt;/P&gt;&lt;P&gt;Takes time (~hours depending on size)&lt;/P&gt;&lt;P&gt;Monitor the rebalance under the Data Store tab or using:&lt;/P&gt;&lt;P&gt;sudo /lancope/bin/lswnoded status&lt;/P&gt;&lt;P&gt;Optional Recovery Attempt (If You Want to Try Fixing DB)&lt;BR /&gt;You can attempt to reinitialize the database on the node manually:&lt;/P&gt;&lt;P&gt;sudo su - postgres&lt;BR /&gt;mv /data/db/dscdb /data/db/dscdb.bak&lt;BR /&gt;initdb -D /data/db/dscdb&lt;BR /&gt;This resets the node's local DB, so it is useful only if you're rejoining a node to a healthy cluster, not if you're trying to preserve local data.&lt;/P&gt;&lt;P&gt;Summary&lt;BR /&gt;Action: Recommended?&lt;BR /&gt;Use CLI to check DB, disk, and logs &lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Yes&lt;BR /&gt;Remove &amp;amp; replace nodes if replication is intact &lt;span class="lia-unicode-emoji" 
title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Yes&lt;BR /&gt;Expect safe recovery if shards have 2+ replicas &lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Yes&lt;BR /&gt;Manually repair corrupted DB on node &lt;span class="lia-unicode-emoji" title=":warning:"&gt;⚠️&lt;/span&gt; Only if needed&lt;BR /&gt;Delete nodes blindly without checking &lt;span class="lia-unicode-emoji" title=":cross_mark:"&gt;❌&lt;/span&gt; No&lt;/P&gt;</description>
      <pubDate>Fri, 11 Jul 2025 15:59:24 GMT</pubDate>
      <guid>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5308372#M1204</guid>
      <dc:creator>wajidhassan</dc:creator>
      <dc:date>2025-07-11T15:59:24Z</dc:date>
    </item>
    <item>
      <title>Re: Cisco Secure Network Analysis DSN Failure</title>
      <link>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5329985#M1243</link>
      <description>&lt;P&gt;Thanks for your comprehensive answer. However, how are these troubleshooting steps possible in version 7.5+, where root access is tokenized?&lt;/P&gt;</description>
      <pubDate>Sun, 14 Sep 2025 14:40:08 GMT</pubDate>
      <guid>https://community.cisco.com/t5/security-analytics/cisco-secure-network-analysis-dsn-failure/m-p/5329985#M1243</guid>
      <dc:creator>navidn</dc:creator>
      <dc:date>2025-09-14T14:40:08Z</dc:date>
    </item>
  </channel>
</rss>

