Python VM not starting in NSO Docker container on M1 Mac

Similar to this post https://community.cisco.com/t5/nso-developer-hub-discussions/python-vm-not-starting-from-darwin-ncs-install-on-m1-macbook/m-p/4646305/highlight/true#M7218, I am having an issue with NSO on the M1 Mac.

 

My second issue is running a Linux Docker container on my M1. Docker Desktop on the M1 (arm64) Mac uses QEMU user-mode emulation to let x86_64 images run. Docker Desktop itself runs fine, and co-workers are using containers they had run on their previous Macs prior to moving to the M1 chipset. The NSO Docker image starts up fine, reports as healthy, and I can access the ncs CLI. Java works fine; however, the Python VM fails to load.
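For reference, a quick sanity check that the container really is running under emulation (a minimal sketch, run with the image's own python3):

import platform

# Inside the container this reports the emulated architecture (x86_64),
# even though the host CPU is arm64; the ps output further down, where
# every process is wrapped in /usr/bin/qemu-x86_64, shows the same thing.
print(platform.machine())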

admin@c647657c6c41# show packages package oper-status | tab
                                                                                                                   PACKAGE
                                     PROGRAM                                                                       META     FILE
                                     CODE     JAVA           PYTHON         BAD NCS  PACKAGE  PACKAGE  CIRCULAR    DATA     LOAD   ERROR
NAME                             UP  ERROR    UNINITIALIZED  UNINITIALIZED  VERSION  NAME     VERSION  DEPENDENCY  ERROR    ERROR  INFO
------------------------------------------------------------------------------------------------------------------------------------------
aaa                              X   -        -              -              -        -        -        -           -        -      -
apply-template-interfaces        -   -        -              X              -        -        -        -           -        -      -
apply-to-devices                 -   -        -              X              -        -        -        -           -        -      -
...
root@c647657c6c41:/# cat log/ncs-python-vm-apply-template-interfaces.log
<INFO> 07-Jul-2022::14:17:40.339 apply-template-interfaces MainThread: - Python 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0]
<INFO> 07-Jul-2022::14:17:40.346 apply-template-interfaces MainThread: - Starting...
<INFO> 07-Jul-2022::14:17:40.356 apply-template-interfaces MainThread: - Started
root@c647657c6c41:/#

But each one exits early. 

root@c647657c6c41:/# cat log/ncs-python-vm.log
<INFO> 7-Jul-2022::14:17:39.516 c647657c6c41 ncs[100]: Started PyVM: <<"apply-to-devices">> , Port=#Port<0.99> , OSpid="1276"
<INFO> 7-Jul-2022::14:17:39.614 c647657c6c41 ncs[100]: Started PyVM: <<"apply-template-interfaces">> , Port=#Port<0.101> , OSpid="1287"
...
<INFO> 7-Jul-2022::14:17:39.658 c647657c6c41 ncs[100]: apply-to-devices :: Starting python3 -u /opt/ncs/ncs-5.5.2.12/src/ncs/pyapi/ncs_pyvm/startup.py -l info -f /log/ncs-python-vm -i apply-to-devices
<INFO> 7-Jul-2022::14:17:39.794 c647657c6c41 ncs[100]: apply-template-interfaces :: Starting python3 -u /opt/ncs/ncs-5.5.2.12/src/ncs/pyapi/ncs_pyvm/startup.py -l info -f /log/ncs-python-vm -i apply-template-interfaces
...
<INFO> 7-Jul-2022::14:19:33.355 c647657c6c41 ncs[100]: apply-template-interfaces :: PYTHONPATH is now ['/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi/ncs_pyvm', '/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi', '/nso/run/state/packages-in-use/1/apply-template-interfaces/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/opt/ncs/nso-build/lib/pylib']
...
<INFO> 7-Jul-2022::14:19:33.365 c647657c6c41 ncs[100]: apply-to-devices :: PYTHONPATH is now ['/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi/ncs_pyvm', '/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi', '/nso/run/state/packages-in-use/1/apply-to-devices/python', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages', '/opt/ncs/nso-build/lib/pylib']
...
<INFO> 7-Jul-2022::14:20:06.021 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.101>
<INFO> 7-Jul-2022::14:20:06.040 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.99>
<INFO> 7-Jul-2022::14:20:06.056 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.98>
<INFO> 7-Jul-2022::14:20:06.074 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.97>
<INFO> 7-Jul-2022::14:20:06.088 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.94>
<INFO> 7-Jul-2022::14:20:06.107 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.91>
<INFO> 7-Jul-2022::14:20:06.124 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.89>
<INFO> 7-Jul-2022::14:20:06.146 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.87>
<INFO> 7-Jul-2022::14:20:06.164 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.86>
<INFO> 7-Jul-2022::14:20:06.181 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.85>
<INFO> 7-Jul-2022::14:20:06.198 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.83>
<INFO> 7-Jul-2022::14:20:06.216 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.81>
<INFO> 7-Jul-2022::14:20:06.233 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.79>
<INFO> 7-Jul-2022::14:20:06.251 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.78>
<INFO> 7-Jul-2022::14:20:06.273 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.77>
<INFO> 7-Jul-2022::14:20:06.290 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.75>
<INFO> 7-Jul-2022::14:20:06.308 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.74>
<INFO> 7-Jul-2022::14:20:06.325 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.73>
<INFO> 7-Jul-2022::14:20:06.343 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.72>
<INFO> 7-Jul-2022::14:20:06.360 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.71>
<INFO> 7-Jul-2022::14:20:06.383 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.66>
<INFO> 7-Jul-2022::14:20:06.400 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.62>
<INFO> 7-Jul-2022::14:20:06.418 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.61>
<INFO> 7-Jul-2022::14:20:06.435 c647657c6c41 ncs[100]: Got EXIT on Port= #Port<0.60>
root@c647657c6c41:/#

From digging around, it looks like the failure happens when the Python VM starts for the Python action itself.

 

It's clear that QEMU is running Python: QEMU runs Java fine, but Python is having issues. I'm not certain it's Python itself, since other members on my team who don't use NSO run the same containers with Python in them without issue. I can also run Python interactively, so I think it's specific to some library or file being used by NCS.

 

root@c647657c6c41:/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi# ps aux | grep java
root 977 2.2 10.6 13014104 3501964 ? Ssl 15:24 7:16 /usr/bin/qemu-x86_64 /usr/bin/java java -XX:+UseG1GC -XX:+UseStringDeduplication -classpath :/opt/ncs/ncs-5.5.2.12/java/jar/* -Dhost=127.0.0.1 -Dport=4569 -Djvm.restart.enabled=false -Djvm.restart.errCount=3 -Djvm.restart.duration=60 -Djava.security.egd=file:/dev/./urandom -Dfile.encoding=UTF-8 -Dorg.apache.logging.log4j.simplelog.StatusLogger.level=OFF com.tailf.ncs.NcsJVMLauncher
root 5571 0.0 0.0 147308 7720 pts/1 Sl+ 20:43 0:00 /usr/bin/qemu-x86_64 /bin/grep grep java
root@c647657c6c41:/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi#

root@c647657c6c41:/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi# python3
Python 3.7.3 (default, Jan 22 2021, 20:04:44) [GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
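Since the bare interpreter starts fine, one isolation test (a minimal sketch; I have not verified how it behaves under emulation) is to import the NCS Python API by hand, using the same pyapi path the VMs get on their PYTHONPATH, and exercise the C bindings by opening a MAAPI session and loading the schemas:

import sys

# Path taken from the PYTHONPATH entries in ncs-python-vm.log above.
sys.path.insert(0, '/opt/ncs/ncs-5.5.2.12/src/ncs/pyapi')

import ncs.maapi as maapi

# Maapi() connects to the local NSO daemon (127.0.0.1:4569 by default);
# load_schemas() pulls the schema information into this Python process.
m = maapi.Maapi()
m.load_schemas()
print('MAAPI connect and schema load OK')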

I found the file that starts the Python VM for the Python action class and modified it with some debug statements to trace the logic, but I got to this point and am now lost. I'm pretty sure the self.main._run() call goes to the action's main.py file, where it enters the main class of the NCS application, class Main(ncs.application.Application):

 

<DEBUG> 16-Feb-2022::16:57:03.33 apply-template-interfaces MainThread: - --- Init NcsPyVM object
<DEBUG> 16-Feb-2022::16:57:03.33 apply-template-interfaces MainThread: - Created class: Main
<DEBUG> 16-Feb-2022::16:57:03.33 apply-template-interfaces MainThread: - Created Thread: apply-template-interfaces.main.Main
<DEBUG> 16-Feb-2022::16:57:03.35 apply-template-interfaces ComponentThread:main: - Starting Main Class: apply-template-interfaces.main.Main
<DEBUG> 16-Feb-2022::16:57:03.35 apply-template-interfaces ComponentThread:main: - ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_daemon_name', '_finish', '_logger', '_mk_daemon_name', '_ncs_callback_channel', '_ncs_cname', '_ncs_i', '_ncs_id', '_ncs_pname', '_ncs_user_threads', '_ready', '_run', '_start', '_state_manager', '_state_mgr', '_trans_cb_cls', 'add_running_thread', 'create_daemon', 'critical', 'dd', 'debug', 'del_running_thread', 'error', 'info', 'log', 'reg_finish', 'register_action', 'register_fun', 'register_nano_service', 'register_service', 'register_trans_cb', 'setup', 'teardown', 'unreg_finish', 'wait', 'warning']
<DEBUG> 16-Feb-2022::16:57:03.36 apply-template-interfaces ComponentThread:main: - HERE1


# From the NSO pyapi startup code, with my dbg() traces ('HERE1'/'HERE2') added:
	...
        # Get the 'main class' from module
        self.dbg('Getattr ' + classname + ' from: ' + modname)
        main_cls = getattr(mod, classname)
    ...
    def run(self):
        self.dbg('Starting Main Class: ' + self.main_class)
        try:
            self.dbg(dir(self.main))
            # per the debug log above, the package's Main has _run(),
            # so the 'HERE1' branch is the one taken
            if hasattr(self.main, '_run'):
                self.dbg('HERE1')
                self.main._run()
            else:
                self.dbg('HERE2')
                self.main.run()
            self._component_threads.thread_finished(self.n)

# The Main class from the package's main.py that _run() ends up in:
class Main(ncs.application.Application):
    def setup(self):
        self.log.info('Main RUNNING')
        self.register_action('apply-template-interfaces-action', ApplyTemplateInterfaces)

After all of this, I am still lost as to what to do next.

 

6 Replies

Alex Stevenson
Cisco Employee

 

Hello samuel.coome@secunetics.com,

 

This might be a known bug. It's best to check the Cisco Bug Search Tool to see if there is a workaround: https://bst.cloudapps.cisco.com/bugsearch

 

Hope this helps!

 

cealves
Cisco Employee

Hello, as far as I can see the problem is coming from the C bindings code. I know for sure when it fails, but the reason cannot be traced on the Python side because Python calls the C API (https://developer.cisco.com/docs/nso/api/#nso-api-5-6-nso-python-api-_ncs/_ncs.mmap_schemas). Basically it fails under a heavy packages reload: if you redeploy a single package you'll see that the VM starts fine, but when you do a full packages reload the MAAPI call to NSO seems to get stuck.

It fails when

_tm.maapi.load_schemas(self.msock)

is called inside load_schemas() in /ncs/maapi.py:

# convenience
def load_schemas(self, use_maapi_socket=False):
    """Load the schemas to Python (using shared memory if enabled).

    If 'use_maapi_socket' is set to True, the schemas are loaded through
    the NSO daemon via a MAAPI socket.
    """
    if Maapi._schemas_loaded:
        return

    with Maapi._schemas_loaded_lock:
        if not Maapi._schemas_loaded:
            if use_maapi_socket:
                _tm.maapi.load_schemas(self.msock)
                Maapi._schemas_loaded = True
            else:
                try:
                    path = _tm.maapi.get_schema_file_path(self.msock)
                    _tm.mmap_schemas(path)
                    Maapi._schemas_loaded = True
                except _tm.error.Error:
                    # Shared memory not enabled?
                    # Fallback to schema load over socket!
                    _tm.maapi.load_schemas(self.msock)  # <--------- HERE !
                    Maapi._schemas_loaded = True

I saw that use_maapi_socket = False, so it goes into the else branch, where _tm.mmap_schemas(path) raises an error and it falls back to the except clause:

<INFO> 27-Jul-2022::09:05:09.918 ncs.maapi ComponentThread:main: - Traceback (most recent call last):
File "/opt/ncs/ncs-5.7.2.1/src/ncs/pyapi/ncs/maapi.py", line 448, in load_schemas
_tm.mmap_schemas(path)
_ncs.error.Error: system call failed (24): Failed to mmap 95993856 bytes at 0x4010277000: mmap() returned 0xffff73eb4000
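My reading of that error (an assumption on my part, not something from the NSO sources) is that the schema shared memory has to end up at one specific address, and under qemu-x86_64 the guest address space is laid out differently, so mmap() returns a different address than the one asked for. A rough stand-alone illustration of that mismatch, using ctypes against glibc (constants are the usual x86_64 Linux values, nothing NSO-specific):

import ctypes
import ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library('c'), use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT_READ, PROT_WRITE = 0x1, 0x2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20

hint = 0x4010277000   # address requested in the traceback above
size = 95993856       # size requested in the traceback above
got = libc.mmap(hint, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)

# If the kernel (or the emulator) cannot honour the hint, the two addresses
# differ, which matches the "Failed to mmap ... mmap() returned ..." pattern.
print('asked for %#x, got %#x' % (hint, got))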

After _tm.maapi.load_schemas(self.msock) is called, all the VM processes start to die. In ncs.log we can see that a timeout is reached, so all the VMs are terminated:

<ERR> 27-Jul-2022::09:05:34.464 nso_docker_1 ncs[3744]: - The NCS Python VM acl, acl-reconcile, clear-counters, ebgp, ebgp-reconcile, eth-cfm-edd, eth-cfm-edd-reconcile, eth-cfm-pe, eth-cfm-pe-reconcile, eth-monitor, eth-monitor-reconcile, ic, ic-reconcile, ipsla, ipsla-reconcile, keychain, keychain-reconcile, l2access, l2access-reconcile, l2sic, l2sic-reconcile, l3multicast, l3multicast-reconcile, l3sic, l3sic-reconcile, massive-configs, prefix-set, prefix-set-reconcile, pt-common, pt-reconcile-common, rip, rip-reconcile, route-policy, route-policy-reconcile, sr, sr-reconcile, vlan, vlan-reconcile, vll, vll-reconcile, vpls, vpls-reconcile, vrf, vrf-reconcile startup timeout reached, terminating

I wasn't able to determine which timeout is being triggered or where the processes are being killed. Again, this makes me think it's happening in the C code.

Even if I increase the packages reload timeout the problem persists:

admin@ncs# packages reload max-wait-time 5000                    

reload-result {
    package acl
    result false
    info Python VM failed to start
}
reload-result {
    package acl-reconcile
    result false
    info Python VM failed to start
}
......
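One isolation step that might be worth trying (a sketch only, I have not verified that it behaves any differently) is to exercise the socket-based schema load directly from a stand-alone script, since that is the path the code above falls back to:

import sys

# pyapi path assumed from the traceback above (NSO 5.7.2.1 in this setup).
sys.path.insert(0, '/opt/ncs/ncs-5.7.2.1/src/ncs/pyapi')

import ncs.maapi as maapi

m = maapi.Maapi()
# Skip the mmap path entirely and load the schemas over the MAAPI socket.
m.load_schemas(use_maapi_socket=True)
print('schema load over MAAPI socket finished')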

So honestly this has to be investigated by the BU, because we hit a roadblock once it starts calling the C API.

The parent process and its children are definitely being killed abruptly, probably with a SIGKILL, because no exception or message is logged. And if we run top on our machine:

 

[Screenshot: top output showing the Python VM processes]

You can see that the Python VMs are being created and then, out of nowhere, they are killed.

And no, it's not the OOM killer, because I ran dmesg and it comes back clean, so the OS is not killing our processes:

[admin@nso_docker_1 ncs]$ sudo dmesg -T | egrep -i 'killed'
[admin@nso_docker_1 ncs]$ 
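For completeness, the per-cgroup OOM counters can also be read from inside the container (a sketch; which file exists depends on whether the host uses cgroup v1 or v2):

import os

# cgroup v2 exposes an oom_kill counter in memory.events,
# cgroup v1 in memory/memory.oom_control.
for path in ('/sys/fs/cgroup/memory.events',
             '/sys/fs/cgroup/memory/memory.oom_control'):
    if os.path.exists(path):
        print(path)
        with open(path) as f:
            print(f.read())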

I've run out of options.

nandhinr
Cisco Employee

NSO on the Mac M1 is not supported at the moment. There is an existing feature request for this, and it is under consideration. I will update the thread when I know more.

Thank you. I look forward to the update.

Are there any updates on this issue? It has been months now, so I'm checking in for an update.

 

@lwarrin1964 this is supposed to be fixed in NSO 6.1 (tentative release date: April).