Cisco Secure Network Analysis DSN Failure

navidn — Sun, 01 Sep 2024 10:59:40 GMT

Hello, two data nodes in our 9-node DSN cluster have failed. We already tried to start the failed nodes, but the SMC returns the following error:

Unable to read database catalogs - cannot start database

First, I would appreciate it if anyone could share troubleshooting commands for this issue.
Second, should I delete these nodes (2 of 9) and add fresh nodes in their place, and can I expect to do so without losing data?

Thanks in advance.

Re: Cisco Secure Network Analysis DSN Failure

wajidhassan — Fri, 11 Jul 2025 15:59:24 GMT

You're facing a serious issue with your Cisco Secure Network Analytics (formerly Stealthwatch) Data Store Node (DSN) cluster — specifically, two failed data nodes in a 9-node cluster, with errors like:

"Unable to read database catalogs - cannot start database"

Let’s walk through (1) troubleshooting steps and (2) best practices for recovery, including whether it’s safe to delete and re-add those nodes.

1. Troubleshooting Commands (on Failed Data Nodes)
You need to assess why the database won't start on the failed nodes. Start with these CLI checks (as the sysadmin or root user):

Check Disk Space / Mounts
    df -h
    lsblk
    mount

Look for:

/data, /var, or /opt not being mounted

Full partitions (especially /data)

Bad or missing block devices
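
To make the "full partitions" check less manual, here is a minimal sketch that filters `df -P` output for mount points at or above a usage threshold. The function name `full_partitions` and the 90% default are illustrative assumptions, not anything defined by the appliance; it also assumes mount points contain no spaces.

```bash
#!/usr/bin/env bash
# Print mount points whose usage is at or above a threshold (default 90%).
# Reads `df -P` style output on stdin so it can also be run on saved output.
# NOTE: the function name and the threshold are hypothetical examples.
full_partitions() {
  local threshold="${1:-90}"
  awk -v limit="$threshold" '
    NR > 1 {                      # skip the df header line
      use = $5                    # "Capacity" column, e.g. "97%"
      sub(/%/, "", use)           # strip the percent sign
      if (use + 0 >= limit) print $6 " " use "%"
    }'
}

# Usage on a data node:
#   df -P | full_partitions 90
```

If /data shows up in the output, free space or fix the mount before attempting another database start.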

Check Database Status
    sudo su - postgres
    psql -d dscdb -c "\l"

or if that fails:

    sudo systemctl status dscdb
    sudo journalctl -xe | grep postgres

An error such as "cannot read database catalogs" often indicates:

Corrupted database

Improper shutdown

Missing or damaged WAL logs

Check Logs
    grep dscdb /var/log/messages
    cat /var/log/dscdb/dscdb.log

Also check:

    less /var/log/postgresql/postgresql-*.log

You're looking for:

"could not open file"

"missing WAL segment"

"catalog missing or corrupted"

"FATAL: database system is corrupted or needs recovery"

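Rather than reading each log by eye, the strings above can be searched for in one pass. This is a rough sketch: the function name `scan_db_logs` is made up for illustration, the patterns are simply the four messages listed above, and the log paths you pass in are whatever actually exists on your node.

```bash
#!/usr/bin/env bash
# Print every line in the given log files that contains one of the error
# strings listed above, prefixed with file name and line number.
# NOTE: scan_db_logs is a hypothetical helper, not a vendor tool.
scan_db_logs() {
  local patterns='could not open file
missing WAL segment
catalog missing or corrupted
database system is corrupted or needs recovery'
  # -F: fixed strings (one per line), -i: ignore case,
  # -n: show line numbers, -H: always show the file name
  grep -F -i -n -H -e "$patterns" -- "$@"
}

# Usage (paths taken from the commands earlier in this post):
#   scan_db_logs /var/log/dscdb/dscdb.log /var/log/postgresql/postgresql-*.log
```

The function returns a non-zero exit status when nothing matches, which is handy if you call it from a larger script.
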
2. What Happens if You Remove and Rebuild the Nodes?
In a 9-node cluster, you typically have 3 shards with 3 replicas each (i.e., triple replication). Losing 2 nodes might be survivable if every shard still has healthy replicas among the other 7 nodes.

But there's no guarantee until you validate the current cluster state.

Check Cluster Health (on SMC/Manager):
From the SMC CLI, run:

    sudo /lancope/bin/lswnoded status

Or from the UI:

Central Management → Data Store → Look at the replica sets and shard health.

You’re looking for:

Which shards are missing?

Are all replica sets still showing 2 or more healthy nodes?

If each shard still has a majority of replicas healthy, you can safely remove the failed nodes and re-add them. The system will rebalance data to restore full replication.

Do NOT just delete nodes blindly unless you're sure their data is already replicated elsewhere.
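
The "2 or more healthy replicas per shard" rule can be checked mechanically once you know which nodes hold which shard. The sketch below assumes you have written that layout down yourself as one line per shard (shard name followed by its nodes); the layout format, the node names, and the function name `replica_health` are all hypothetical, and the real mapping has to come from your own cluster.

```bash
#!/usr/bin/env bash
# Report how many healthy replicas each shard keeps after some nodes fail.
# stdin:     one line per shard: "<shard> <node> <node> ..."
# arguments: names of the failed nodes
# A shard is flagged "AT RISK" when fewer than 2 replicas remain healthy.
# NOTE: layout format and names are assumptions for illustration only.
replica_health() {
  awk -v failed="$*" '
    BEGIN {
      n = split(failed, f, " ")
      for (i = 1; i <= n; i++) down[f[i]] = 1
    }
    NF >= 2 {
      healthy = 0
      for (i = 2; i <= NF; i++) if (!($i in down)) healthy++
      status = (healthy >= 2) ? "OK" : "AT RISK"
      printf "%s %d/%d %s\n", $1, healthy, NF - 1, status
    }'
}

# Usage with a made-up layout file:
#   replica_health dn2 dn3 < shard_layout.txt
```

With a made-up 3x3 layout, losing dn2 and dn3 from the same shard leaves that shard at 1/3 and flagged, while losing one node from each of two different shards leaves both at 2/3. That difference is why the placement of the two failed nodes matters more than the raw count.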

If Data Corruption is Confirmed on Failed Nodes
If the logs confirm Postgres catalog corruption (which sounds likely), and you:

Can't recover the DB

Can confirm the remaining nodes have healthy replicas

then proceed with removal and rebuild:

Recommended Recovery Steps
Step 1: From the SMC, deregister the failed nodes

  Central Management → Data Store → Select failed nodes → Remove

Step 2: Reinstall the Data Store image on new appliances

  Must match version (7.5.2)

  Reconfigure basic network, NTP, DNS, etc.

  Use SMC → Central Management → Add Node

Step 3: The cluster will auto-rebalance

  Takes time (~hours depending on size)

  Monitor rebalance under Data Store tab or using:

    sudo /lancope/bin/lswnoded status

Optional Recovery Attempt (If You Want to Try Fixing DB)
You can attempt to manually recover the database:

    sudo su - postgres
    mv /data/db/dscdb /data/db/dscdb.bak
    initdb -D /data/db/dscdb

But this resets the node's local DB. It is useful only if you're rejoining a node to a healthy cluster, not if you're trying to preserve local data.

Summary
Action                                             Recommended
Use CLI to check DB, disk, and logs                ✅ Yes
Remove & replace nodes if replication is intact    ✅ Yes
Expect safe recovery if shards have 2+ replicas    ✅ Yes
Manually repair corrupted DB on node               ⚠️ Only if needed
Delete nodes blindly without checking              ❌ No

Re: Cisco Secure Network Analysis DSN Failure

navidn — Sun, 14 Sep 2025 14:40:08 GMT

Thanks for your comprehensive answer. However, how are these troubleshooting steps possible in version 7.5+, where root access is tokenized?
