This page is under development

Introduction

TAD (Test And Diagnostic) monitors the amount of free memory available in the system at run time. When free memory drops below a configurable threshold, it triggers the memory/Resource Reclamation (RR) process in TDM. The RR process can also be triggered by a memory allocation failure, which results in a notification being sent to TDM to try to reclaim memory.

Self-heal is another feature implemented in the Test And Diagnostic component.

Self-heal:

Monitors resources (memory and CPU usage) and processes.
Stores the Reset Count and Reboot Count.
Takes the required action, such as rebooting the device or restarting the required process, based on predefined conditions.
Performs a connectivity test.


Feature

Selfheal – Resource Monitoring

Monitors the resources periodically (e.g., every 15 minutes). If "Average Memory Used" reaches the threshold value, the necessary action is executed.

The "resource_monitor.sh" script is used for monitoring memory and CPU usage.
It is located on the device at "/fss/gw/usr/ccsp/tad/resource_monitor.sh".
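
The threshold check described above can be illustrated with a minimal sketch. This is not the content of resource_monitor.sh. The function names mem_used_percent and threshold_reached are hypothetical, and the sketch takes a single sample from a /proc/meminfo-style file rather than computing an average.

```shell
#!/bin/sh
# Hypothetical sketch; not the contents of resource_monitor.sh.

# mem_used_percent FILE
# Prints used memory as an integer percentage, computed from the
# MemTotal and MemFree fields of a /proc/meminfo-style file.
mem_used_percent() {
    awk '/^MemTotal:/ { total = $2 }
         /^MemFree:/  { free = $2 }
         END { if (total > 0) printf "%d\n", (total - free) * 100 / total }' "$1"
}

# threshold_reached USED THRESHOLD
# Succeeds (exit status 0) when USED is at or above THRESHOLD.
threshold_reached() {
    [ "$1" -ge "$2" ]
}

# Example use on a Linux system (the threshold value is illustrative):
#   used=$(mem_used_percent /proc/meminfo)
#   if threshold_reached "$used" 100; then
#       echo "memory threshold reached: ${used}%"
#   fi
```

Taking the meminfo path as an argument keeps the calculation easy to exercise against a sample file instead of the live system.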

Selfheal – Process Monitoring

Monitors the processes periodically (e.g., every 15 minutes) based on process ID (pid). Depending on whether the pid is present, the required action is taken, such as restarting the process or rebooting the device.

The "task_health_monitor.sh" script is used for monitoring RDK-B processes. It is located at "/fss/gw/usr/ccsp/tad/task_health_monitor.sh". Any RDK-B process can be monitored by adding the process pid to this script.
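
A pid-based liveness check of the kind described above might look like the following sketch. It is not taken from task_health_monitor.sh. The function names is_running and check_process are hypothetical, and the sketch only reports what it would do instead of restarting anything.

```shell
#!/bin/sh
# Hypothetical sketch; not the contents of task_health_monitor.sh.

# is_running PID
# Succeeds when PID is non-empty and signal 0 can be delivered to it.
# Signal 0 performs the existence check without sending a signal.
# Assumes the caller may signal the process (e.g. it runs as root).
is_running() {
    [ -n "$1" ] && kill -0 "$1" 2>/dev/null
}

# check_process NAME PID
# Prints the state of the process and the action a monitor might take.
check_process() {
    if is_running "$2"; then
        echo "$1: running"
    else
        echo "$1: not running, restart needed"
    fi
}

# Example use:
#   check_process myself "$$"
```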

Self-heal stores Reset Count and Reboot Count

Selfheal – Connectivity Test

Self-heal performs a connectivity test. The ping test is run against a server IP/URI, which must be configured. If the server IP/URI is not configured, the ping test is not executed and no action is taken. If the server is configured and the ping test fails, the reboot action is executed.

The "self_heal_connectivity_test.sh" script is used for the ping test.
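
The decision logic described above (no server configured: no action; ping fails: reboot) can be sketched as follows. This is an illustration, not the content of self_heal_connectivity_test.sh. The names ping_server and connectivity_action, the ping count of 3, and the printed action strings are all assumptions. The function prints the chosen action and never reboots anything itself.

```shell
#!/bin/sh
# Hypothetical sketch; not the contents of self_heal_connectivity_test.sh.

# ping_server SERVER
# Succeeds when SERVER answers ICMP echo requests (count is illustrative).
ping_server() {
    ping -c 3 "$1" >/dev/null 2>&1
}

# connectivity_action SERVER [PING_CMD]
# Prints "none" when no server is configured or the ping succeeds,
# and "reboot" when the server is configured but the ping fails.
# PING_CMD defaults to ping_server and can be replaced for dry runs.
connectivity_action() {
    server=$1
    ping_cmd=${2:-ping_server}
    if [ -z "$server" ]; then
        echo "none"      # server IP/URI not configured: test is skipped
    elif "$ping_cmd" "$server"; then
        echo "none"      # server reachable
    else
        echo "reboot"    # server configured but unreachable
    fi
}

# Example use (192.0.2.1 is a documentation-only address):
#   connectivity_action 192.0.2.1
```

Passing the ping command as an optional argument makes it possible to try the three branches without network access.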

Selfheal – Action

Self-heal takes the required action through the "corrective_action.sh" script, which implements the actions.

Some of the actions are:

rebootNeeded - Reboots the device
resetNeeded - Restarts the required process
storeInformation - Stores Memory and CPU usage
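
As a rough illustration of dispatching on the action names listed above, the following dry-run sketch maps each name to a description. It is not the content of corrective_action.sh. The function name describe_action is hypothetical, and how the real script receives or handles these actions is not shown here.

```shell
#!/bin/sh
# Hypothetical dry-run sketch; not the contents of corrective_action.sh.

# describe_action ACTION
# Prints what the named action does; fails for an unknown name.
describe_action() {
    case "$1" in
        rebootNeeded)     echo "reboot the device" ;;
        resetNeeded)      echo "restart the required process" ;;
        storeInformation) echo "store memory and CPU usage" ;;
        *)                echo "unknown action: $1" >&2; return 1 ;;
    esac
}

# Example use:
#   describe_action rebootNeeded
```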

On Raspberry Pi, the functionality of the self-heal feature is provided by systemd.


Code Flow

Resource Monitoring - resource_monitor.sh




Process Monitoring - task_health_monitor.sh


Connectivity Test - self_heal_connectivity_test.sh


Objects

Self-heal objects in the DML layer:

Device.SelfHeal.X_RDKCENTRAL-COM

Self-heal can be enabled or disabled through the data model parameter below. By default, it is enabled.

$ dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.X_RDKCENTRAL-COM_Enable
Execution succeed.
Parameter    1 name: Device.SelfHeal.X_RDKCENTRAL-COM_Enable
               type:       bool,    value: true


Verify the running status of the self-heal feature

$ ps -Af | grep -i self
 4449 root       0:00 {self_heal_conne} /bin/sh /usr/ccsp/tad/self_heal_connectivity_test.sh
18921 root       0:00 grep -i self

Resource monitoring

The DM parameter below is used to check the average CPU threshold. By default, the value is set to 100.

$ dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
               type:       uint,    value: 100


The DM parameter below is used to check the average memory threshold. By default, the value is set to 100.

$ dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
               type:       uint,    value: 100
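
When these parameters are read from a script, the value has to be extracted from the dmcli output shown above. The helper below is a minimal sketch that assumes the "type: ..., value: ..." line layout seen in these transcripts. The function name dmcli_value is hypothetical and is not part of dmcli.

```shell
#!/bin/sh
# Hypothetical helper; assumes the dmcli getv output layout shown above.

# dmcli_value
# Reads dmcli getv output on stdin and prints the value field of the
# first parameter, with trailing whitespace removed.
dmcli_value() {
    sed -n -e 's/[[:space:]]*$//' -e 's/^.*value: *//p' | head -n 1
}

# Example use on a device:
#   threshold=$(dmcli eRT getv \
#       Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold \
#       | dmcli_value)
#   echo "memory threshold: $threshold"
```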