We are doing some scalability tests with the new Typhoon-based hardware (24 x 10G - SE), and when configuring sub-interfaces with an egress service-policy applied we got the following error message:
% Failed to commit one or more configuration items during a pseudo-atomic operation. All changes made have been reverted. Please issue 'show configuration failed' from this session to view the errors
RP/0/RSP0/CPU0:A9K-LAB05(config-subif)#show configuration failed
Fri Jul 11 09:44:59.268 WEST
!! SEMANTIC ERRORS: This configuration was rejected by
!! the system due to semantic errors. The individual
!! errors with each failed configuration command can be
!! found below.
service-policy input SCH_IN_parent_L3_NG1_100M
!!% 'prm_server' detected the 'warning' condition 'An operation that was requested was aborted - data integrity may be compromised.'
The service-policy configuration is the following:
shape average 100032 kbps
priority level 1
police rate 30 mbps
priority level 2
police rate percent 45
bandwidth remaining percent 40
random-detect 128 kbytes 256 kbytes
bandwidth remaining percent 30
random-detect 128 kbytes 256 kbytes
bandwidth remaining percent 20
random-detect 64 kbytes 128 kbytes
bandwidth remaining percent 10
random-detect 32 kbytes 64 kbytes
If we configure more sub-interfaces without the service-policy, the configuration is accepted. Right now we have around 17K sub-interfaces configured:
RP/0/RSP0/CPU0:A9K-LAB05#sh int summary location 0/2/CPU0
Fri Jul 11 11:45:45.005 WEST
Interface Type      Total     UP   Down   Admin Down
--------------      -----     --   ----   ----------
ALL TYPES           17503  17485      0           18
IFT_TENGETHERNET       24      7      0           17
IFT_VLAN_SUBIF      17479  17478      0            1
We know that we are not hitting queue limits, but we don't know what kind of limit, if any, we are reaching. Can anyone help us understand which limit this is?
This error message indicates a failure while programming the TCAM: the software and hardware values do not match, hence the data-integrity error.
A sub-interface can be committed without a QoS policy because only features such as ACLs and QoS policies take up entries in the TCAM.
Can you open an SR and ask for me?
Thanks for your time, Pedro.
For closure in the community, here is what we found:
There were 17476 sub-interfaces with the same ingress and egress policies applied. Because the ingress policy had 8 classes and the egress policy had 7 (including class-default), we had (7+8)*17476 = 262140 records allocated in the NP QOS_INTF data structure, out of 262144 available. Adding either service-policy to one more sub-interface would push us over the limit, hence the error message.
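The arithmetic above can be double-checked with a short Python sketch (the 262144-record NP QOS_INTF limit and the class counts are taken from the explanation above):

```python
# NP QOS_INTF data-structure accounting for the failing commit.
NP_QOS_INTF_LIMIT = 262144      # records available per NP (from the analysis above)

ingress_classes = 8             # classes in the ingress policy
egress_classes = 7              # classes in the egress policy, incl. class-default
sub_interfaces = 17476          # sub-interfaces with both policies applied

used = (ingress_classes + egress_classes) * sub_interfaces
print(used)                     # 262140 records in use
print(NP_QOS_INTF_LIMIT - used) # only 4 records of headroom left

# Applying either policy to one more sub-interface needs 7 or 8 more
# records, which exceeds the remaining headroom and fails the commit.
print(used + egress_classes > NP_QOS_INTF_LIMIT)   # True
```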
Looking at the chunks for QoS-EA and WFQ: because the classes did not have large configurations, these resources were not yet exhausted. Different scale numbers exist depending on the line card.
In the TCAM a service-policy is programmed only once (it takes up resources once, no matter how many interfaces it is applied to), which is why the TCAM was not exhausted either.
Because of the limit we are hitting, the only real ways to alleviate the issue are to use fewer sub-interfaces, move some of them to another NP, or use fewer classes.
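As a rough what-if, the same record arithmetic shows how trimming classes raises the ceiling. This is only a sketch: it assumes the 262144-record NP limit from above and that every sub-interface carries both policies.

```python
NP_QOS_INTF_LIMIT = 262144  # NP QOS_INTF records per NP (from the analysis above)

def max_subifs(ingress_classes, egress_classes, limit=NP_QOS_INTF_LIMIT):
    """Largest number of sub-interfaces whose class records fit in one NP."""
    return limit // (ingress_classes + egress_classes)

print(max_subifs(8, 7))   # current policies -> 17476 sub-interfaces per NP
print(max_subifs(6, 5))   # hypothetical smaller policies -> 23831 per NP
```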
*More details on QoS HW resource consumption*
The first QoS configuration in a class creates a TCAM entry and an NP Struct record (including class-default).
Every copy of a QoS service-policy that is applied creates more NP Struct records, but the TCAM is programmed only once.
The NP Struct is essentially the aggregate of all the classes applied in the queuing ASICs; as long as a class has some QoS configuration, we must allocate the appropriate resources:
- Priority levels 1, 2, and 3
- Shape, guaranteed bandwidth, and bandwidth remaining
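The allocation rules above can be expressed as a toy model (a hypothetical helper, not a Cisco tool): TCAM cost accrues once per class per unique policy, while NP Struct cost accrues per class for every application of the policy.

```python
def qos_resource_usage(policies):
    """policies: list of (classes_in_policy, times_applied) tuples.

    Returns (tcam_entries, np_struct_records) under the model described
    above: each class is programmed into the TCAM once per policy, but an
    NP Struct record is allocated for every application of that policy.
    """
    tcam = sum(classes for classes, _ in policies)
    np_struct = sum(classes * applied for classes, applied in policies)
    return tcam, np_struct

# The lab scenario: an 8-class ingress and a 7-class egress policy,
# each applied to 17476 sub-interfaces.
print(qos_resource_usage([(8, 17476), (7, 17476)]))
# -> (15, 262140): a tiny TCAM footprint, but the NP Struct is exhausted
```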
These resources can be checked with the following commands: