Cisco DNAC Backup Issues

jamesytn
Level 1

Hi,

I'm trying to set up a new backup for our DNAC and consistently get this error message:

Error during _process_backup(): Internal server error: {"error":{"root_cause":[{"type":"snapshot_creation_exception","reason":"[ndp:cba0c672-c478-49d3-b394-c4d0489cc69f.000/lSBGG91JTXC_lkwxq01JDw] failed to create snapshot"}],"type":"snapshot_creation_exception","reason":"[ndp:cba0c672-c478-49d3-b394-c4d0489cc69f.000/lSBGG91JTXC_lkwxq01JDw] failed to create snapshot","caused_by":{"type":"access_denied_exception","reason":"/var/data/es/snapshots/meta-lSBGG91JTXC_lkwxq01JDw.dat"}},"status":500}

The backup gets to 50% and does seem to copy data, but it always fails at this step. I have rebuilt the destination NFS server and also rebooted the DNAC; neither has made any difference.

Thanks

7 Replies

marce1000
VIP

 

- FYI: https://bst.cloudapps.cisco.com/bugsearch/bug/CSCwd08262

 M.



-- Each morning when I wake up and look into the mirror I always say 'Why am I so brilliant?'
    The mirror will then always respond to me with 'The only thing that exceeds your brilliance is your beauty!'

Thanks for the link.

I ran a chmod 777 -R on the root of the share. It seemed to help, as more data was passed; I can see 11G was transferred to the NFS server. However, rather than just failing at 50%, it flapped between 50% and 60% and eventually failed with the same error as before.

Error during _process_backup(): Internal server error: {"error":{"root_cause":[{"type":"snapshot_creation_exception","reason":"[ndp:c2254c92-8ed4-4177-a3ab-04363c924afb.000/VzykC4S-SBGZZoKi59jKvw] failed to create snapshot"}],"type":"snapshot_creation_exception","reason":"[ndp:c2254c92-8ed4-4177-a3ab-04363c924afb.000/VzykC4S-SBGZZoKi59jKvw] failed to create snapshot","caused_by":{"type":"access_denied_exception","reason":"/var/data/es/snapshots/meta-VzykC4S-SBGZZoKi59jKvw.dat"}},"status":500}

Here's the relevant output from the NFS server that I'm using:

administrator@nfshost:/mnt/sdb$ ls -l DNAC/
total 8
drwsrwsrwx 6 nobody nogroup 4096 Dec  6 10:08 backups
drwsrwsrwx 2 nobody nogroup 4096 Dec  6 09:32 nfs

administrator@nfshost:/mnt/sdb$ ls -l DNAC/backups/
total 16
drwxrwsrwx 5 administrator nogroup 4096 Dec  6 09:34 fusion.postgres
drwxrwsrwx 5 administrator nogroup 4096 Dec  6 09:33 maglev-system.credentialmanager
drwxrwsrwx 5 administrator nogroup 4096 Dec  6 09:33 maglev-system.glusterfs
drwxrwsrwx 5 administrator nogroup 4096 Dec  6 09:33 ndp.redis

On DNAC, /nfs is the NFS path and /backups is the share configured under the '(Remote Host)' option. Only /nfs is exported as an NFS share.
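For what it's worth, a quick sanity check you can run on the NFS server before retrying. This is just a sketch: `stat -c` assumes GNU coreutils (standard on Ubuntu), and the path is taken from the listing above.

```shell
# check_export_perms: warn if a directory is not writable by "other".
# With all_squash, DNAC's writes arrive as the anonymous user (nobody/nogroup
# here), so the export directory must be other-writable or owned by that user.
check_export_perms() {
    perms=$(stat -c '%a' "$1" 2>/dev/null)   # octal mode, e.g. 777 or 755
    case "$perms" in
        *[2367]) echo "ok: $1 is writable by other (mode $perms)" ;;
        *)       echo "warn: $1 is NOT writable by other (mode $perms)" ;;
    esac
}

check_export_perms /mnt/sdb/DNAC/nfs
```

If this warns on the export directory, the snapshot writer will hit the same access_denied_exception shown in the error above.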

 

 - Check whether the share has enough free space. Also check the NFS-related and networking-related logs on the NFS server.

 M.




Initially I had multipath errors; I managed to get rid of those by adding the following to the multipath configuration (/etc/multipath.conf) on the NFS server:

defaults {
    user_friendly_names yes
}

blacklist {
    device {
        vendor "VMware"
        product "Virtual disk"
    }
}
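For anyone following along: after editing the multipath configuration, the daemon has to re-read it before the blacklist takes effect. Something along these lines should do it (commands are an assumption for a systemd-based host with multipath-tools installed, and require root):

```shell
sudo systemctl restart multipathd   # re-read /etc/multipath.conf
sudo multipath -r                   # rebuild the device maps
sudo multipath -ll                  # confirm the VMware virtual disk no longer appears
```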

This removed the errors, but the logs now show:

administrator@host:/mnt/sdb$ tail /var/log/syslog
Dec  6 14:13:37 host kernel: [85329.354152] NFSD: end of grace period
Dec  6 14:13:37 host kernel: [85329.354155] NFSD: laundromat_main - sleeping for 90 seconds
Dec  6 14:15:07 host kernel: [85419.464417] NFSD: laundromat service - starting
Dec  6 14:15:07 host kernel: [85419.464435] NFSD: end of grace period
Dec  6 14:15:07 host kernel: [85419.464437] NFSD: laundromat_main - sleeping for 90 seconds
Dec  6 14:15:36 host systemd-timesyncd[764]: Timed out waiting for reply from 185.125.190.56:123 (ntp.ubuntu.com).
Dec  6 14:15:36 syt-penm-ibs01 systemd-timesyncd[764]: Initial synchronization to time server 185.125.190.57:123 (ntp.ubuntu.com).
Dec  6 14:16:38 host kernel: [85509.575411] NFSD: laundromat service - starting
Dec  6 14:16:38 host kernel: [85509.575413] NFSD: end of grace period
Dec  6 14:16:38 host kernel: [85509.575415] NFSD: laundromat_main - sleeping for 90 seconds
administrator@syt-penm-ibs01:/mnt/sdb$

 There isn't anything special in my /etc/exports entry:

/mnt/sdb/DNAC/nfs *(rw,all_squash,sync,no_subtree_check)
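One variant worth trying (an assumption, not something confirmed in this thread): with all_squash, every client write is mapped to the anonymous UID/GID, so pinning that explicitly to the directory owner can avoid the access_denied_exception above. 65534 is nobody/nogroup on most Ubuntu hosts; check /etc/passwd on yours.

```shell
# /etc/exports -- map every squashed DNAC write to a fixed anonymous UID/GID
/mnt/sdb/DNAC/nfs *(rw,all_squash,anonuid=65534,anongid=65534,sync,no_subtree_check)
```

Then run `sudo exportfs -ra` so the server re-reads /etc/exports.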

 

Hi jamesytn,

I'm currently facing the same challenge. Did you manage to find a solution for this?

James,

What version of Cisco Catalyst Center (formerly Cisco DNA Center) are you running?

Looking at your directory ownership, this may be the issue. I ran into the same symptoms you are reporting, and I had the same ownership on the NFS directory. I believe this matched the original configuration requirements in the past, but I cannot validate that, since our documentation has been scrubbed across the different releases.


/mnt/sdb/DNAC/nfs
drwsrwsrwx 2 nobody nogroup 4096 Dec 6 09:32 nfs

Chapter: Backup and Restore
https://www.cisco.com/c/en/us/td/docs/cloud-systems-management/network-automation-and-management/dna-center/2-3-5/admin_guide/b_cisco_dna_center_admin_guide_2_3_5/b_cisco_dna_center_admin_guide_2_3_5_chapter_0110.html#Cisco_Task_in_List_GUI.dita_d361...


So, I performed the following to fix the issue:

  • Deleted existing Assurance backups, if any (optional)
  • Removed the NFS configuration settings from Cisco Catalyst Center (formerly Cisco DNA Center)
  • Removed the contents and directory from the remote NFS server
  • Added the directory back to the remote NFS server
  • Changed ownership of the directory
  • Refreshed the exportfs for the NFS directory
  • Checked and verified the available disk space
  • Added the backup configuration back to Cisco Catalyst Center (formerly Cisco DNA Center)
  • Performed a backup of type "All Data"
  • Verified a successful backup of All Data
$ sudo chown nfsnobody:nfsnobody /home/cx1/nfs   # match the squashed NFS user
$ sudo exportfs -r                               # refresh the export table
$ df -h                                          # verify available disk space

/home/cx1/nfs
drwxr-xr-x. 2 nfsnobody nfsnobody 6 Sep 12 12:55 nfs

Note: Since this will be the initial "All Data" backup, this task/job may take
multiple hours to complete, depending on the amount of Assurance data you have
accumulated in your cluster. You will see the backup "appear" to stall at 40-50%.
Up to that percentage, the RSYNC/automation data is backed up; after that we start
backing up your Assurance data, which can be large or very large.

You can monitor the progress on the Remote NFS Server by watching the NFS Directory status during the backup.

For Example:
------------
$ watch -d -n 0.5 "tree /home/cx1/nfs | grep files"

You will see the file and directory counts continue to increment even while the progress percentage stays static during the backup.
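If `tree` isn't installed on the NFS host, a plain `find` does the same job. A small sketch (the /home/cx1/nfs path is taken from the example above and is an assumption for your layout):

```shell
# count_backup_files: print how many files currently exist under a directory.
# A cheap progress indicator while the DNAC progress bar sits still.
count_backup_files() {
    find "$1" -type f 2>/dev/null | wc -l
}

# e.g. run it under watch:  watch -n 5 'find /home/cx1/nfs -type f | wc -l'
count_backup_files /home/cx1/nfs
```

A rising count tells you the snapshot copy is still moving even when the GUI percentage hasn't changed.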

 

The comment about appearing to stall helped me out

Fix for me was:

  1. Align folder permissions (assurance folder wasn't quite right)
  2. Remove Scheduled backups > Remove NFS config
  3. Reapply NFS config
  4. Run full backup

I thought it had got stuck until I read your comment. 27 hours later, it succeeded at ~820 GB.
