09-01-2024 03:59 AM
Hello, we are currently facing the failure of two data nodes in a 9-node DSN cluster. We have already tried to start the failed nodes, but the SMC returns the following error:
Unable to read database catalogs - cannot start database
First, I would appreciate it if anyone could share troubleshooting commands for this issue.
Second, should I delete these nodes (2 of 9), add fresh nodes in their place, and expect no data loss?
Thanks in advance.
07-11-2025 08:59 AM
You're facing a serious issue with your Cisco Secure Network Analytics (formerly Stealthwatch) Data Store Node (DSN) cluster — specifically, two failed data nodes in a 9-node cluster, with errors like:
"Unable to read database catalogs - cannot start database"
Let’s walk through (1) troubleshooting steps and (2) best practices for recovery, including whether it’s safe to delete and re-add those nodes.
1. Troubleshooting Commands (on Failed Data Nodes)
You need to assess why the database won't start on the failed nodes. Start with these CLI checks (as the sysadmin or root user):
Check Disk Space / Mounts
df -h
lsblk
mount
Look for:
/data, /var, or /opt not being mounted
Full partitions (especially /data)
Bad or missing block devices
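The "full partitions" check above can be scripted. This is a minimal sketch, not a Cisco-supplied tool: the function name flag_full_filesystems and the 90% default threshold are assumptions made for illustration. It reads POSIX-format df output on stdin so it can also be run against saved output from a failed node.

```bash
# Print each filesystem whose usage is at or above a threshold (percent).
# Expects `df -P` output on stdin; the threshold defaults to 90 (an assumed value).
flag_full_filesystems() {
  local threshold="${1:-90}"
  # Skip the header row, strip the % sign from column 5, compare numerically.
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= t) print $6, use "%"
  }'
}

# Usage on a live node:
#   df -P | flag_full_filesystems 90
```

Mount points containing spaces are not handled by this sketch.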
Check Database Status
sudo su - postgres
psql -d dscdb -c "\l"
or if that fails:
sudo systemctl status dscdb
sudo journalctl -xe | grep postgres
An "Unable to read database catalogs" error often indicates:
Corrupted database
Improper shutdown
Missing or damaged WAL logs
Check Logs
grep dscdb /var/log/messages
cat /var/log/dscdb/dscdb.log
Also check the log files matching this path:
/var/log/postgresql/postgresql-*.log
You're looking for:
"could not open file"
"missing WAL segment"
"catalog missing or corrupted"
"FATAL: database system is corrupted or needs recovery"
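Searching for those messages by hand is tedious on large logs, so here is a small sketch that scans one log file for all four patterns at once. The function name scan_log_for_db_errors is hypothetical, and the patterns are copied from the list above; the exact wording in the logs on a real node may differ, so treat the pattern list as a starting point.

```bash
# Print matching lines, with line numbers, from the given log file.
# Returns non-zero when nothing matches (standard grep behaviour).
scan_log_for_db_errors() {
  local logfile="$1"
  grep -n -i -E \
    'could not open file|missing WAL segment|catalog missing or corrupted|database system is corrupted or needs recovery' \
    "$logfile"
}

# Usage (log path as given in this thread):
#   scan_log_for_db_errors /var/log/dscdb/dscdb.log
```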
2. What Happens if You Remove and Rebuild the Nodes?
In a 9-node cluster, you typically have 3 shards with 3 replicas each (i.e., triple replication). Losing 2 nodes might be survivable if every shard still has enough healthy replicas on the other 7 nodes.
But there is no guarantee until you validate the current cluster state.
Check Cluster Health (on SMC/Manager):
From the SMC CLI, run:
sudo /lancope/bin/lswnoded status
Or from the UI:
Central Management → Data Store → Look at the replica sets and shard health.
You’re looking for:
Which shards are missing?
Are all replica sets still showing 2 or more healthy nodes?
If each shard still has a majority of replicas healthy, you can safely remove the failed nodes and re-add them. The system will rebalance data to restore full replication.
Do NOT just delete nodes blindly unless you're sure their data is already replicated elsewhere.
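The majority rule above can be written down as a quick check. This sketch assumes you have already collected, by hand or from the UI, one line per shard in the form "name healthy_replicas total_replicas"; that input format and the function name check_shard_majority are made up for illustration and are not output of any Cisco tool.

```bash
# Read "shard healthy total" lines on stdin.
# Prints every shard that does NOT have a strict majority of healthy replicas
# and returns 1 if at least one such shard exists, 0 otherwise.
check_shard_majority() {
  awk '
    { if ($2 * 2 <= $3) { print $1; bad = 1 } }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage:
#   printf 'shard1 2 3\nshard2 1 3\nshard3 3 3\n' | check_shard_majority
# Here shard2 is reported, because 1 of 3 replicas is not a majority.
```

If the function prints nothing and returns 0, every shard meets the "2 or more healthy nodes" condition described above.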
If Data Corruption is Confirmed on Failed Nodes
If the logs confirm Postgres catalog corruption (which sounds likely), and you:
Cannot recover the database
Can confirm that the remaining nodes have healthy replicas
Then proceed with removal and rebuild:
Recommended Recovery Steps
From SMC: Deregister Failed Nodes
Central Management → Data Store → Select failed nodes → Remove
Reinstall the Data Store Image on New Appliances
Must match version (7.5.2)
Reconfigure basic network, NTP, DNS, etc.
Register New Nodes to the Cluster
Use SMC → Central Management → Add Node
Cluster Will Auto-Rebalance
Takes time (~hours depending on size)
Monitor rebalance under Data Store tab or using:
sudo /lancope/bin/lswnoded status
Optional Recovery Attempt (If You Want to Try Fixing DB)
You can attempt to manually recover the database:
sudo su - postgres
mv /data/db/dscdb /data/db/dscdb.bak
initdb -D /data/db/dscdb
But this resets the node's local DB — useful only if you're rejoining a node to a healthy cluster, not if you're trying to preserve local data.
Summary
Action | Recommended
Use CLI to check DB, disk, and logs | Yes
Remove & replace nodes if replication is intact | Yes
Expect safe recovery if shards have 2+ replicas | Yes
Manually repair corrupted DB on node | Optional (resets the node's local DB)
Delete nodes blindly without checking | No