
This guide explains how to enable observability on NSO and process the collected data with data science mechanisms, with the help of Splunk. We have split the guide into two parts, published as two separate blog posts.
Part 1 focuses on how to enable observability and start data collection from NSO. Part 2 focuses on data processing, visualization, and machine learning prediction, and on how one can use the prediction data to take action. For the action part, we will introduce a predictive service design that can predict the RSS (physical memory consumption), Commited_AS (memory allocation), and time consumption before the service is executed. This helps prevent OOM scenarios and gives the NSO operator an overview of how long the operation will take. The goal here is to achieve “dry-run without commit dry-run”, that is, to enable forecasting in an NSO service by obtaining an overview before the service is executed.
For part 2, we will expand the diagram from part 1 by adding multiple data processing components.
With this design, we complete our data processing pipeline: data collection, data processing, and proactive action based on the data processing results.
Most of the activity described in this article takes place in Splunk, except for the Predictive Service Design, which comes with a code example. We strongly recommend cloning this example and going through the README before reading the Predictive Service Design chapter, so that you can follow the journey in this guide by trying it yourself. Learning by doing is always easier than just reading.
To visualize the data from the data storage, one needs to create a Search String and then let a dashboard visualize the data returned by that Search String. We start by entering the Search & Reporting tab under Apps.
From here, one can start testing and constructing a Search String inside the Search box.
We will start by constructing a Search String that acts as a Progress Trace View on the data that comes from the Observability Exporter. With this Search String, we will create a plot that compares the performance of the “order-by-user list” between 6.1.3 and 6.3.2. The Search String below is an example of doing so; we will dissect it line by line to see what each line means.
(index="events" attributes.tid=132) OR (index="events" attributes.tid=2331)
| eval diff=end_time-start_time
| chart values(diff) over name by attributes.tid
| rename "132" as "6.3.2"
| rename "2331" as "6.1.3"
In the first line, we limit the data used in this diagram to the data in the “events” index with either Transaction ID (tid) 132 or 2331. The tid 132 is the per-phase time consumption data of the transaction from 6.3.2, while 2331 is the data from 6.1.3. Therefore, in the last two lines of the Search String, we rename the names used when the Observability Exporter (OE) streamed to Splunk, “132” and “2331”, to the corresponding NSO versions. Back to line 2: what is useful for the diagram is the time difference, but the Observability Exporter only streams the start and end time of each transaction phase. Therefore, we instruct Splunk to compute the difference by subtracting the start time from the end time. Finally, before renaming, we construct the chart that Splunk will plot by taking the value of the time difference over the name of each phase in the transaction (name) and splitting the data per transaction phase by Transaction ID (attributes.tid). The result is a chart with the NSO version as the column name and the transaction phase as the row name, as shown below.
By clicking on the Visualization tab at the top, one can see the diagram preview below. However, because the preview is too small, the y-axis does not show its values.
Therefore, let’s create a full-scale dashboard by clicking on the Dashboards tab at the top and then “Create New Dashboard”. For this step, I will use the existing dashboard that I created, “NSO metrics”.
Then click on Add Panel and choose Bar Chart.
Inside the New Bar Chart tab, enter the Search String we constructed before and choose the correct Time Range.
After you click on “Add to Dashboard”, the Progress Trace Viewer diagram will show up like the one below. If the column names are still not shown, drag the diagram to make it bigger.
In phase 1, we streamed the live memory consumption of "ncs.smp" to Splunk. In this chapter, we will use this data to create a live dashboard in Splunk. First of all, let’s construct the Search String. In the Search String below, we start by limiting the search to the ncssmp index, since that is where we store the data that comes from CollectD. At the same time, we are only interested in the physical memory consumption (ps_rss) from the processes CollectD plugin, and we specify that the process we are interested in is “ncs.smp”. By doing so, we get the physical memory consumption of “ncs.smp” from the processes plugin. In line 2, we construct a chart with time as the row name and physical memory consumption as the value. Finally, we rename the physical memory consumption column to "ncs.smp-RSS" instead of "values(values{})".
index="ncssmp" plugin=processes type=ps_rss plugin_instance="ncs.smp"
| chart values(values{}) over time
| rename "values(values{})" as "ncs.smp-RSS"
Afterward, we use this Search String to construct the dashboard with a method similar to the one above. However, to make this a live dashboard, we need to set the Auto Refresh Delay. In the example below, we set the Auto Refresh Delay to 30 seconds, which makes the dashboard refresh every 30 seconds with the latest data from the last 24 hours.
As a result, the following diagram will be created. The plot will of course look different depending on the operation of your NSO.
Inside the Search String, we can request Splunk to trigger machine learning on a specific dataset. The machine learning engine can reside inside Splunk as the “Splunk Machine Learning Toolkit (Splunk_ML_Toolkit)” or externally inside a Docker container (MLTK-container) via the “Splunk App for Data Science and Deep Learning”. In this article, we focus on the internal one. To construct a Search String with machine learning, we enter the “Splunk Machine Learning Toolkit” under Apps as shown below.
Then we click on “Experiments” in the top bar and choose “Create New Experiment” to start a new Search String construction. Existing experiments can be found in the table shown in the diagram below. What we are trying to forecast is the RSS, that is, the physical memory. Therefore, we choose “Smart Forecasting”, which is based on state space forecasting.
After clicking on “Create New Experiment”, we need to specify the dataset we want to forecast by constructing a Search String. Here, we specify the following Search String to limit the index to "events_perf", which is the index we used to store performance-related data from the “Lux-based Performance Data Collector”. Then we create a chart with the values of physical memory (attributes.mem), using the length of the order-by-user list (attributes.x) as the row name. Finally, we rename each of the parameters to a meaningful name by setting “attributes.x” to "Element Count" and values(attributes.mem) to "Memory".
index="events_perf"
| chart values(attributes.mem) over attributes.x
| rename attributes.x as "Element Count"
| rename values(attributes.mem) as "Memory"
By putting this Search String in the search box, we can see the chart we just constructed in the “Data Preview”. If you would like to play around with Splunk sample data, click on “Datasets” instead of constructing your own Search String and choose a built-in sample dataset for testing. Afterward, we click “Next” to proceed with the forecast parameter configuration.
The diagram below shows a sample configuration for the forecast. We set the column to forecast to “Memory” and forecast 3 points ahead (Future timespan). At the same time, we set a holdback of 1 (Holdback period), which removes the last measurement, the one with 450000 Element Count, and turns it into a forecast point. By doing so, we get 3 (Future timespan) - 1 (Holdback period) = 2 future values predicted. However, one will notice that a preview of the forecast plot is not possible after clicking on the “Forecast” button. This is because the data we entered is not actual time series data with “_time” as the index; our actual index is the “Element Count”. We can, however, proceed with the rest of the process manually and display the plot in our dashboard.
By clicking on the SPL button in the diagram above, one can see the Search String that was constructed for the current forecast session, as shown below. It also comes with an explanation on the right as a comment. We copy out this Search String as a foundation and build upon it.
Our final goal is the Search String below. We will dissect the meaning of each of the changes we make. First of all, we do not want holdback in our experiment, so we start by setting the holdback to 0; this gives us 3 future prediction points instead of 2. From the previous search preview, we can also see that the predicted rows have no value in Element Count. Therefore, we add those missing Element Counts before the prediction starts (lines 5-7). Our search data contains “_time”, but that is only the time when the data arrived in Splunk; we want to use "Element Count" as the index. Therefore, we delete the “_time” column and sort the data by "Element Count" (lines 8-9). This gives us a fully functional chart. Finally, we feed this chart to the internal machine learning engine to produce the forecast values (line 10) and set a reasonable name for each of the columns (lines 11-13).
index="events_perf"
| chart values(attributes.mem) over attributes.x
| rename attributes.x as "Element Count"
| rename values(attributes.mem) as "Memory"
| append [| makeresults | eval "Element Count"=500000 ]
| append [| makeresults | eval "Element Count"=550000 ]
| append [| makeresults | eval "Element Count"=600000 ]
| fields - _time
| sort "Element Count"
| fit StateSpaceForecast "Memory" output_metadata=true holdback=0 forecast_k=3 into "app:ordermemcon"
| rename predicted(Memory) as forecast
| rename lower95(predicted(Memory)) as lower95
| rename upper95(predicted(Memory)) as upper95
By doing so, we get the table below without any missing index, which means it is ready to be plotted.
If we copy the Search String above and use it to generate a diagram in our dashboard, we can see that the forecast values (dotted line) are very close to the collected dataset (green line). This means that the forecast done after the last “Element Count” should be fairly accurate. However, one can also see that the further ahead we forecast, the bigger the uncertainty, visible in the widening gap between “lower95” and “upper95” (the 95% confidence interval).
However, the diagram you initially created probably has no dotted line. In that case, we can style the chart by editing its source code. Click on the “Source” button at the top of the page beside “Edit Dashboard” and “UI” and add the following options. Make sure the column names are correct. In the example below, I set “forecast”, “lower95”, and “upper95” as dashed lines (shortDash), while “Memory” is a solid line. By doing so, one gets the same diagram as shown above.
<option name="charting.chart">line</option>
<option name="charting.fieldDashStyles">{"forecast":"shortDash", "lower95":"shortDash", "upper95":"shortDash","Memory":"solid"}</option>
To make things more interesting, we can also try to predict some more complicated data, for example the live memory consumption of ncs.smp from CollectD (from phase 1). The diagram below shows one of the experiments we ran in NSO by adding multiple long order-by-user lists.
We set the range of this search by specifying ‘time>"1727784056.823" time<"1727788486.821"’ and then use the method described previously to create the following Search String. This gives us the forecast diagram shown above.
index="ncssmp" plugin=processes type=ps_rss plugin_instance="ncs.smp" time>"1727784056.823" time<"1727788486.821"
| chart values(values{}) over time
| rename "values(values{})" as "ncs.smp-RSS"
| append [| makeresults | eval "time"=1727788486]
| append [| makeresults | eval "time"=1727788496]
| append [| makeresults | eval "time"=1727788506]
| append [| makeresults | eval "time"=1727788516]
| append [| makeresults | eval "time"=1727788526]
| append [| makeresults | eval "time"=1727788536]
| append [| makeresults | eval "time"=1727788546]
| append [| makeresults | eval "time"=1727788556]
| append [| makeresults | eval "time"=1727788566]
| append [| makeresults | eval "time"=1727788576]
| fields - _time
| sort "time"
| fit StateSpaceForecast "ncs.smp-RSS" output_metadata=true forecast_k=10 holdback=0 into "app:forecast"
| rename predicted(ncs.smp-RSS) as forecast
| rename lower95(predicted(ncs.smp-RSS)) as lower95
| rename upper95(predicted(ncs.smp-RSS)) as upper95
At this stage, we have the original data in the Splunk data storage, predicted data from the Splunk machine learning engine, and dashboards to visualize the data. We now have the full gear with us, so let’s take some proactive action based on this data. Before we continue, we recommend cloning the source mentioned in the “Code Example that Used in this Blog” chapter. In the repository, one can find the demo service package for the “Predictive Service Design”. The demo package adds an “order-by-user list” with the length specified in the configuration “predictive_service test2 max-length <length>”. If the operation is predicted to push Commited_AS over CommitLimit when “overcommit_memory” is 2, or RSS over 90 percent of the total physical memory, the service will be aborted as below.
admin@ncs(config)# predictive_service test2 max-length 500000
admin@ncs(config-predictive_service-test2)# commit dry-run
Aborted: Python cb_pre_modification error. Abort the service execution due to Potention OOM risk
Otherwise, the execution will proceed. In the Python VM log, one can see that the service code triggers Splunk to run three forecasts: “memory (RSS)”, “time”, and “Commited_AS”.
<INFO> 17-Oct-2024::11:48:57.67 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Service premod(service=/predictive_service:predictive_service{test6})
<INFO> 17-Oct-2024::11:48:57.68 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - ***Start Splunk Forecasting***
<INFO> 17-Oct-2024::11:48:57.68 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Fetching memory data from Splunk
<INFO> 17-Oct-2024::11:48:57.68 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Contacting Splunk Datastore: https://10.147.40.101:8089
<INFO> 17-Oct-2024::11:48:57.138 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Creating job with sid: 1729158537.15277
<INFO> 17-Oct-2024::11:49:07.256 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Fetching time data from Splunk
<INFO> 17-Oct-2024::11:49:07.256 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Contacting Splunk Datastore: https://10.147.40.101:8089
<INFO> 17-Oct-2024::11:49:07.317 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Creating job with sid: 1729158547.15279
<INFO> 17-Oct-2024::11:49:17.428 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Fetching Commited_AS data from Splunk
<INFO> 17-Oct-2024::11:49:17.429 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Contacting Splunk Datastore: https://10.147.40.101:8089
<INFO> 17-Oct-2024::11:49:17.515 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Creating job with sid: 1729158557.15280
After the data is received, the service evaluates whether the execution could cause an OOM (Out of Memory) scenario by allocating more than CommitLimit allows. This evaluation starts with checking whether “overcommit_memory” is equal to 2. If not, the following log will show:
<INFO> 17-Oct-2024::11:49:27.647 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Evaluation will based on RSS. Ignore OOM Action due to overcommit_memory is: 0
Otherwise, the service compares Commited_AS against CommitLimit. If Commited_AS is over CommitLimit, the following log will show.
<ERROR> 12-Oct-2024::11:09:43.60 predictive_service ncs-dp-3719214-predictive_service:main:0-1-th-60123: - Commited_AS over CommitLimit
At the same time, the service will be terminated.
<INFO> 12-Oct-2024::11:09:43.60 predictive_service ncs-dp-3719214-predictive_service:main:0-1-th-60123: - Expect time consumption 0:00:39.626565, RSS/Memory Limit: 261.93kb/14625824.4kb, Commited_AS/CommitLimit: 9972206.053970432/8143884, Recommended Action: Abort
<ERROR> 12-Oct-2024::11:09:43.60 predictive_service ncs-dp-3719214-predictive_service:main:0-1-th-60123: - Abort the service execution due to Potention OOM risk
Traceback (most recent call last):
File "/home/leeli4/Tail-f-workenv/Tail-f-env/nso/nso_store/6.3.2/src/ncs/pyapi/ncs/application.py", line 875, in wrapper
pl = fn(self, tctx, op, kp, root, proplist)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/leeli4/Tail-f-workenv/test/git/leeli4/splunk-example---predictive/nso/ncs-run/state/packages-in-use/1/predictive_service/python/predictive_service/main.py", line 46, in cb_pre_modification
action=forecast(service.max_length, self.log)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/leeli4/Tail-f-workenv/test/git/leeli4/splunk-example---predictive/nso/ncs-run/state/packages-in-use/1/predictive_service/python/predictive_service/splunk_api.py", line 155, in forecast
raise Exception(f'Abort the service execution due to Potention OOM risk')
Exception: Abort the service execution due to Potention OOM risk
Otherwise, the service will proceed.
<INFO> 17-Oct-2024::11:49:27.647 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Expect time consumption 0:15:00.709549, RSS/Memory Limit: 1.42mb/14625824.4mb, Commited_AS/CommitLimit: 11731622.436961522/8143884, Recommended Action: Proceed
<INFO> 17-Oct-2024::11:49:27.653 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Service create(service=/predictive_service:predictive_service{test6})
<INFO> 17-Oct-2024::11:49:27.653 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Expected memory consumption not close to the critical limit. Proceed with service execution.
In the log above, one can also see entries like the one below. This log shows the forecast data obtained from Splunk and what it is compared against. This is useful for whoever would like an overview of the rest of the process before cb_create is executed.
<INFO> 17-Oct-2024::11:49:27.647 predictive_service ncs-dp-3920865-predictive_service:main:0-2-th-62057: - Expect time consumption 0:15:00.709549, RSS/Memory Limit: 1.42mb/14625824.4mb, Commited_AS/CommitLimit: 11731622.436961522/8143884, Recommended Action: Proceed
A general overview of how the functionality from the previous chapter works can be found in the diagram below. The service code calls into the “Usage” part of “splunk_api”, and “Usage” does the decision making by extracting data from Splunk using the “Toolset” part of “splunk_api”. The “Toolset” is built on top of the Splunk REST API.
As the diagram shows, all the checking happens in the pre-modification callback by importing and calling the “forecast” function from the “splunk_api” module.
from .splunk_api import *

@Service.pre_modification
def cb_pre_modification(self, tctx, op, kp, root, proplist):
    self.log.info('Service premod(service=', kp, ')')
    _ncs.dp.action_set_timeout(tctx.uinfo, 6000)
    if op == _ncs.dp.NCS_SERVICE_CREATE:
        service = ncs.maagic.cd(root, kp)
        action = forecast(service.max_length, self.log)
        if not action:
            raise Exception("Expected memory consumption close to the critical limit. Aborting service execution.")
The “forecast” function is the main function in the “splunk_api” module. Before we look at what it does, let’s look at how the “splunk_api” module works in general. “splunk_api” includes the API to trigger a forecast on the Splunk side and pull the data back to NSO. One can find these APIs under the “Toolset” comment in “python/predictive_service/splunk_api.py”. “get_splunk_data” is the main function of the “Toolset”. It calls “splunk_create_job” to create the forecast job and collects the result via “splunk_get_result” when the forecast is ready. The output of “get_splunk_data” is a Pandas data frame containing the data just acquired.
def get_splunk_data(query):
    #print(os.getcwd())
    global splunk_ip
    # Read Splunk credentials and connection settings from the package config
    with open('packages/predictive_service/python/predictive_service/config/splunk_config.json') as f:
    #with open('config/splunk_config.json') as f:
        d = json.load(f)
    user = d["user"]
    password = d["pass"]
    max_retry = d["max_retry"]
    splunk_ip = d["splunk_ip"]
    global_log.info("Contacting Splunk Datastore: " + splunk_ip)
    # Create a search job
    sid = splunk_create_job(user, password, query)
    global_log.info("Creating job with sid: " + sid)
    # Check status of a search
    data = splunk_get_result(user, password, sid, max_retry)
    #datetime.timedelta(seconds=output)
    df_method1 = pd.read_csv(StringIO(data.decode("utf-8")))
    #print(df_method1)
    return df_method1
“splunk_create_job” sends a POST request to Splunk on "/services/search/jobs/":
def splunk_create_job(user, password, query):
    response = requests.post(splunk_ip + "/services/search/jobs/", auth=HTTPBasicAuth(user, password), data={"search": "search " + query}, verify=False)
    #print(response.content)
    if response.status_code >= 200 and response.status_code < 300:
        root = ElementTree.fromstring(response.content)
        # The search ID (sid) is the first element of the returned XML
        sid = root[0].text
    else:
        raise Exception("HTTP Response Error on splunk_create_job: " + str(response.status_code))
    return sid
while “splunk_get_result” collects the result by querying "/services/search/jobs/"+sid+"/results/" on Splunk. If Splunk comes back with an empty string, the forecast is not ready yet and we retry after 5 seconds.
def splunk_get_result(user, password, sid, max_retry):
    output = ""
    retry_counter = 0
    # Poll the job result until Splunk returns data or we run out of retries
    while len(output) == 0 and retry_counter < max_retry:
        time.sleep(5)
        response = requests.get(splunk_ip + "/services/search/jobs/" + sid + "/results/", auth=HTTPBasicAuth(user, password), data={"output_mode": "csv"}, verify=False)
        if response.status_code >= 200 and response.status_code < 300:
            output = response.content
        else:
            raise Exception("HTTP Response Error on splunk_get_result: " + str(response.status_code))
        retry_counter += 1
    return output
Based on this toolset, we construct the logic under “Usage”. The functions there act as wrappers around “get_splunk_data” by specifying different filters.
For example, “get_splunk_commited_as_data” sets the Search String in “query” and then passes “query” to “get_splunk_data”.
def get_splunk_commited_as_data():
    query = """
index="events_perf" | chart values(attributes.commited_as) over attributes.x | rename attributes.x as "Element Count" | rename values(attributes.commited_as) as "Commited_AS"
| append [| makeresults | eval "Element Count"=500000 ]
| append [| makeresults | eval "Element Count"=550000 ]
| append [| makeresults | eval "Element Count"=600000 ]
| fields - _time
| sort "Element Count"
| fit StateSpaceForecast "Commited_AS" output_metadata=true holdback=0 forecast_k=3 into "app:commitas"
| rename predicted(Commited_AS) as forecast
| rename lower95(predicted(Commited_AS)) as lower95
| rename upper95(predicted(Commited_AS)) as upper95
| chart values("Commited_AS") values(lower95) values(forecast) values(upper95) BY "Element Count"
| rename values("Commited_AS") as "Commited_AS"
| rename values(lower95) as lower95
| rename values(upper95) as upper95
| rename values(forecast) as forecast
"""
    data = get_splunk_data(query)
    return data
The following functions then parse the data obtained from the wrappers above and output the data that is needed.
For example, “get_commited_as_data” calls “get_splunk_commited_as_data” and takes the Commited_AS forecast for the length specified in the configuration “predictive_service test2 max-length <length>” entered from the northbound.
def get_commited_as_data(length):
    global_log.info("Fetching Commited_AS data from Splunk")
    as_df = get_splunk_commited_as_data()
    # Pick the row that matches the requested list length
    row = as_df.loc[as_df['Element Count'] == length]
    as_data = row['forecast'].item()
    return as_data
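The memory and time helpers used later by “forecast” (“get_mem_data” and “get_time_data”) follow the same pattern. As a rough, hypothetical sketch only (the wrapper name “get_splunk_time_data” and the exact return type are assumptions, and the module is assumed to import datetime; the repository version may differ), “get_time_data” could look like this:
def get_time_data(length):
    # Hypothetical sketch, following the same pattern as get_commited_as_data
    global_log.info("Fetching time data from Splunk")
    time_df = get_splunk_time_data()  # assumed wrapper analogous to get_splunk_commited_as_data
    # Pick the forecast row that matches the requested list length
    row = time_df.loc[time_df['Element Count'] == length]
    seconds = row['forecast'].item()
    # Return a timedelta so it renders like "0:15:00.709549" in the log above
    return datetime.timedelta(seconds=seconds)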
Eventually, we call all the functions above in “forecast” and decide whether the service should be terminated. If it should, we proceed with a proactive action. In this example, we raise an exception to terminate the service evaluation before the service reaches “cb_create”. However, there are better proactive actions than throwing an exception, such as raising an alarm or proceeding with a smarter “cb_create” that adds items to the order-by-user list chunk by chunk instead of all at once.
def forecast(length, log):
    global global_log
    global_log = log
    global_log.info("***Start Splunk Forecasting***")
    # Fetch the forecasts for memory (RSS), time consumption, and Commited_AS
    mem_data = get_mem_data(length)
    time_data = get_time_data(length)
    commited_as_data = get_commited_as_data(length)
    (action, mem_free_kb, mem_total_kb, commited_limit, mem_dec, commited_dec) = get_action(mem_data[2], time_data, commited_as_data)
    global_log.info(f'Expect time consumption {time_data}, RSS/Memory Limit: {mem_data[0]}{mem_data[1]}/{mem_free_kb}{mem_data[1]}, Commited_AS/CommitLimit: {commited_as_data}/{commited_limit}, Recommended Action: {action}')
    if action == "Abort":
        if not mem_dec:
            raise Exception('Abort the service execution due to RSS close to the Memory Limit')
        elif not commited_dec:
            raise Exception('Abort the service execution due to Potential OOM risk')
        else:
            raise Exception('Abort the service execution due to Unknown Decision from Splunk ML Engine')
        return False
    else:
        return True
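The “get_action” helper that “forecast” relies on is not shown above. As a rough, hypothetical sketch inferred only from how “forecast” calls it (the signature, the 90 percent RSS threshold, and the /proc lookups are assumptions; the actual code in the repository may differ), it could look something like this:
def get_action(rss_forecast_kb, time_data, commited_as_data):
    # Hypothetical sketch: decide Proceed/Abort from the forecast values.
    # Read the kernel overcommit policy and memory counters.
    with open('/proc/sys/vm/overcommit_memory') as f:
        overcommit_memory = int(f.read().strip())
    meminfo = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            meminfo[key] = float(value.split()[0])  # values are in kB
    mem_total_kb = meminfo['MemTotal']
    mem_limit_kb = mem_total_kb * 0.9  # 90 percent of physical memory as the RSS limit
    commited_limit = meminfo['CommitLimit']
    mem_dec = rss_forecast_kb < mem_limit_kb  # True when the RSS check passes
    if overcommit_memory == 2:
        commited_dec = commited_as_data < commited_limit
    else:
        # Mirrors the "Evaluation will based on RSS" log entry shown earlier
        global_log.info("Evaluation will based on RSS. Ignore OOM Action due to overcommit_memory is: " + str(overcommit_memory))
        commited_dec = True
    action = "Proceed" if (mem_dec and commited_dec) else "Abort"
    return (action, mem_limit_kb, mem_total_kb, commited_limit, mem_dec, commited_dec)
The real implementation may of course gather these limits from the CollectD data in Splunk instead of reading /proc directly; the point is simply that the decision reduces to comparing the forecast values against the RSS and CommitLimit thresholds described earlier.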