Introduction
TAD (Test And Diagnostic) monitors the amount of free memory available in the system at run time. It triggers the memory Resource Reclamation (RR) process in TDM when the amount of free memory drops below a configurable threshold. The RR process can also be triggered by memory allocation failures, which result in a notification being sent to TDM to try to reclaim memory.
Self-heal is another feature implemented in the Test And Diagnostic component.
Self-heal monitors:
- CPU usage
- Memory Usage
- Critical RDK-B processes
Self-heal stores the Reset Count and the Reboot Count.
Self-heal takes the required action, such as rebooting the device or restarting a process, based on predefined conditions.
Self-heal runs a connectivity test.
Feature
Selfheal – Resource Monitoring
Monitors resources periodically (e.g. every 15 minutes). If "Average Memory Used" reaches the threshold value, the necessary action is executed.
The "resource_monitor.sh" script monitors memory and CPU usage.
It is located on the device at "/usr/ccsp/tad/resource_monitor.sh".
Selfheal – Process Monitoring
Monitors processes periodically (e.g. every 15 minutes) based on process ID (pid). Depending on whether the pid is present, the required action is taken, such as restarting the process or rebooting the device.
The "task_health_monitor.sh" script monitors RDK-B processes. It is located at "/usr/ccsp/tad/task_health_monitor.sh". Any RDK-B process can be monitored by adding a pid check for it to this script.
Self-heal stores the Reset Count and the Reboot Count.
Selfheal – Connectivity Test
Self-heal runs a connectivity test. A ping test is run against a server IP/URI, which needs to be configured. If the server IP/URI is not configured, the ping test is not executed and no action is taken. If the server is configured and the ping test fails, the reboot action is executed.
The "self_heal_connectivity_test.sh" script is used for the ping test.
Selfheal – Action
Self-heal takes the required action through the "corrective_action.sh" script, which implements the actions.
Some of the actions are:
rebootNeeded - Reboots the device
resetNeeded - Restarts the required process
storeInformation - Stores memory and CPU usage
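The mapping from action name to behaviour can be pictured as a simple dispatcher. The sketch below is a dry run that only prints what it would do. The three action names come from the list above; the function name, its arguments, the messages and the process name "example_process" are hypothetical and are not taken from "corrective_action.sh".

```shell
#!/bin/sh
# Dry-run sketch of an action dispatcher (hypothetical, not the real script).
# $1 = action name, $2 = process name (only used by resetNeeded).
corrective_action() {
    action="$1"
    target="$2"
    case "$action" in
        rebootNeeded)     echo "would reboot the device" ;;
        resetNeeded)      echo "would restart process: $target" ;;
        storeInformation) echo "would store memory and CPU usage" ;;
        *)                echo "unknown action: $action" >&2; return 1 ;;
    esac
}

corrective_action resetNeeded example_process
```

Keeping every action behind one entry point means the monitoring scripts only need to decide which action name to pass.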
On Raspberry Pi, the functionality of the self-heal feature is provided by systemd.
Code Flow
[Diagram: CODE FLOW 1]
Resource Monitoring - resource_monitor.sh
[Diagram: RESOURCE MONITORING DIAG]
- resource_monitor.sh monitors memory and CPU usage
- The average memory and CPU thresholds are read from syscfg.db (defaults: avg_cpu_threshold 100, avg_memory_threshold 100)
- Memory Usage Monitor
  - Gets the total, free and used memory using the free command
  - AvgMemUsed = usedMem*100 / totalMem
  - If AvgMemUsed > memory_threshold, the device is rebooted
- CPU Usage Monitor
  - Active CPU is the sum of the user, system, iowait, irq, softirq and steal CPU times
  - The CPU usage difference is sampled every 30 seconds over a period of 5 minutes, and the result is taken as Curr_CPULoad_Avg
  - If Curr_CPULoad_Avg > cpu_threshold, corrective action is taken
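The arithmetic above can be sketched in plain shell. This is an illustration under stated assumptions, not an excerpt from resource_monitor.sh: the function names are hypothetical, the CPU counters are assumed to come from the "cpu" line of /proc/stat, total CPU time is assumed to be the sum of the first eight counters, and Curr_CPULoad_Avg is assumed to be the mean of the per-interval loads.

```shell
#!/bin/sh
# AvgMemUsed = usedMem * 100 / totalMem (integer arithmetic).
avg_mem_used() {
    echo $(( $1 * 100 / $2 ))
}

# Active CPU time from one "cpu ..." line in /proc/stat layout:
#   cpu user nice system idle iowait irq softirq steal ...
# Active = user + system + iowait + irq + softirq + steal.
cpu_active() {
    set -- $1            # split the line into positional fields
    echo $(( $2 + $4 + $6 + $7 + $8 + $9 ))
}

# Assumption: total time is the sum of the first eight counters.
cpu_total() {
    set -- $1
    echo $(( $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9 ))
}

# Load percentage between two samples (e.g. taken 30 seconds apart).
cpu_load_pct() {
    a1=$(cpu_active "$1"); t1=$(cpu_total "$1")
    a2=$(cpu_active "$2"); t2=$(cpu_total "$2")
    dt=$(( t2 - t1 ))
    if [ "$dt" -eq 0 ]; then echo 0; return; fi
    echo $(( (a2 - a1) * 100 / dt ))
}

# Assumption: the 5-minute figure is the mean of the per-interval loads.
average() {
    sum=0; n=0
    for v in "$@"; do sum=$(( sum + v )); n=$(( n + 1 )); done
    [ "$n" -gt 0 ] || return 1
    echo $(( sum / n ))
}

# Sample values, not measurements from a device.
s1="cpu 100 0 50 800 10 5 5 0"
s2="cpu 160 0 80 900 20 10 10 0"
echo "memory used: $(avg_mem_used 450 1000)%"   # prints 45%
echo "cpu load:    $(cpu_load_pct "$s1" "$s2")%" # prints 52%
```

Note that with integer percentages and the default threshold of 100, a strict "greater than" comparison on memory usage cannot succeed, since used memory never exceeds total memory. Under this reading, the memory check would only trigger once the threshold is lowered.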
Process Monitoring - task_health_monitor.sh
[Diagram: process monitoring]
- task_health_monitor.sh periodically monitors the status of various tasks and takes corrective action
- The default monitoring interval is 15 minutes and can be changed using resource_monitor_interval in syscfg.db
- Monitors
  - Health of the peer processor, in case of dual core processors
  - Other tasks added as part of the script
- New tasks can be added by editing the script
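A new task check added to the script might follow the pattern below. This is a minimal sketch, not code from task_health_monitor.sh: the helper names, the task name "example_task" and its path are hypothetical, and the restart is only printed rather than executed. On a device the pid would typically be looked up first (for example with pidof) and passed in.

```shell
#!/bin/sh
# Report whether a pid refers to a live process.
# kill -0 sends no signal; it only checks that the process exists.
task_status() {
    pid="$1"
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo missing
    fi
}

# Check one task and print the restart command if it is gone (dry run).
# $1 = task name, $2 = pid (may be empty), $3 = restart command
check_task() {
    name="$1"; pid="$2"; restart_cmd="$3"
    if [ "$(task_status "$pid")" = "missing" ]; then
        echo "$name is not running; would run: $restart_cmd"
    fi
}

# An empty pid stands for "no such process found".
check_task example_task "" "/usr/bin/example_task"
```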
Connectivity Test - self_heal_connectivity_test.sh
[Diagram: connectivity test]
- self_heal_connectivity_test.sh runs ping and DNS tests
- ConnTest_PingInterval in syscfg.db specifies the interval between connectivity tests
  - If nothing is specified, it defaults to 60 seconds
- runPingTest
  - Gets the IP (default_router IP) from syscfg.db
  - If no IP is specified, it tries to ping the default gateway
  - If the ping fails, it takes the corrective action, which is none by default
- runDNSPingTest
  - Disabled by default; can be enabled with selfheal_dns_pingtest_enable in syscfg.db
  - Gets the urlToVerify from syscfg.db
  - If nslookup fails, it takes the corrective action, which is none by default
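The fallback rules for the ping test can be sketched as follows. The function names and the sample addresses are hypothetical; how the real script reads syscfg.db or discovers the default gateway is not shown here, so both values are simply passed in as arguments.

```shell
#!/bin/sh
# Interval between tests: the configured value, or 60 seconds if empty.
ping_interval() {
    echo "${1:-60}"
}

# Ping target: the configured IP, or the default gateway if none is set.
# $1 = configured IP (may be empty), $2 = default gateway
ping_target() {
    echo "${1:-$2}"
}

# One round of the ping test. Defined for illustration only and not called
# here, because it needs network access.
run_ping_test() {
    target="$1"
    if ping -c 3 "$target" >/dev/null 2>&1; then
        echo "ping to $target ok"
    else
        echo "ping to $target failed"
    fi
}

# Sample values from the documentation address range, not from a device.
echo "interval: $(ping_interval "")s"
echo "target:   $(ping_target "" 192.0.2.1)"
```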
Objects
Self-heal objects in the DML layer:
Device.SelfHeal.X_RDKCENTRAL-COM
Self-heal can be enabled or disabled through the data model parameter below. By default, it is enabled.
$ dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.X_RDKCENTRAL-COM_Enable
Execution succeed.
Parameter 1 name: Device.SelfHeal.X_RDKCENTRAL-COM_Enable
type: bool, value: true
Verify that the self-heal feature is running:
$ ps -Af | grep -i self
4449 root 0:00 {self_heal_conne} /bin/sh /usr/ccsp/tad/self_heal_connectivity_test.sh
18921 root 0:00 grep -i self
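As the output above shows, the grep command itself matches the pattern. A common shell idiom, given here as an optional sketch, is to wrap one letter of the pattern in brackets so that the grep process line no longer matches. The function name is hypothetical and the sample lines are modelled on the output above.

```shell
#!/bin/sh
# Filter ps output read on stdin. The regex [s]elf matches "self", but the
# grep command line contains the literal text "[s]elf", which does not match.
filter_selfheal() {
    grep -i '[s]elf'
}

# On a device: ps -Af | filter_selfheal
# Sample input for illustration:
printf '%s\n' \
    '4449 root 0:00 {self_heal_conne} /bin/sh /usr/ccsp/tad/self_heal_connectivity_test.sh' \
    '18921 root 0:00 grep -i [s]elf' | filter_selfheal
```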
Resource monitoring
The DM parameter below reports the average CPU threshold. By default, the value is 100.
$ dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
Execution succeed.
Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
type: uint, value: 100
The DM parameter below reports the average memory threshold. By default, the value is 100.
$ dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
Execution succeed.
Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
type: uint, value: 100
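When these thresholds are needed in a script, the value can be pulled out of the dmcli output shown above. The helper below is a sketch with a hypothetical name. It assumes the output contains a "type: ..., value: ..." line as in the examples, and it returns only the first whitespace-delimited token of the value.

```shell
#!/bin/sh
# Extract the parameter value from dmcli getv output read on stdin.
dmcli_value() {
    sed -n 's/.*value:[[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# On a device:
#   dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold | dmcli_value
# Sample input copied from the output above:
printf '%s\n' \
    'Execution succeed.' \
    'Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold' \
    'type: uint, value: 100' | dmcli_value
```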