...
resource_monitor.sh
task_health_monitor.sh
corrective_action.sh
self_heal_connectivity_test.sh
Use the commands in the code snippet below to verify whether the Self Heal module is enabled:
Code Block | ||||
---|---|---|---|---|
| ||||
root@Filogic-GW:/rdklogs/logs# dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.SelfHeal.X_RDKCENTRAL-COM_Enable
type: bool, value: true
root@Filogic-GW:/rdklogs/logs#
root@Filogic-GW:/rdklogs/logs# ps -alx | grep reso
4 0 4531 1 20 0 3496 2744 do_wai S ? 0:00 /bin/sh /usr/ccsp/tad/resource_monitor.sh
0 0 16777 10009 20 0 2244 808 pipe_w S+ pts/0 0:00 grep reso
root@Filogic-GW:/rdklogs/logs# ps -alx | grep self
4 0 4528 1 20 0 3628 2836 do_wai S ? 0:00 /bin/sh /usr/ccsp/tad/self_heal_connectivity_test.sh
4 0 4539 1 20 0 4288 3468 do_wai S ? 0:00 /bin/sh /usr/ccsp/tad/selfheal_aggressive.sh
0 0 16816 10009 20 0 2244 824 pipe_w S+ pts/0 0:00 grep self |
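When this check is scripted, the value has to be pulled out of the dmcli output. The following is a minimal sketch, assuming the output always contains a line of the form "type: <type>, value: <value>" as in the snippet above; the helper name dmcli_value is hypothetical and is not part of RDK-B.

```shell
#!/bin/sh
# Hypothetical helper: read dmcli getv output on stdin and print the value
# field of the first "type: <type>, value: <value>" line.
dmcli_value() {
    sed -n 's/.*type:[[:space:]]*[a-z]*,[[:space:]]*value:[[:space:]]*\(.*\)$/\1/p' \
        | sed 's/[[:space:]]*$//' \
        | head -n 1
}

# On the device (assumes dmcli is on PATH):
#   dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable | dmcli_value
```

The same helper works for the uint and string parameters shown later on this page, because it only relies on the "type: ..., value: ..." layout.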
Self Heal is enabled by default and is active from boot-up.
It periodically performs the following actions.
1. By default, the average memory threshold is set to 100. This value is stored in the syscfg database. To change the default average memory threshold, follow the steps in the code snippet below:
Code Block | ||||
---|---|---|---|---|
| ||||
root@Filogic-GW:/# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
type: uint, value: 100
root@Filogic-GW:/rdklogs/logs# dmcli eRT setv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold uint 200
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
root@Filogic-GW:/# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
type: uint, value: 200
|
2. By default, the average CPU threshold is set to 100. This value is stored in the syscfg database. To change the default average CPU threshold, follow the steps in the code snippet below:
Code Block | ||||
---|---|---|---|---|
| ||||
root@Filogic-GW:~# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
type: uint, value: 100
root@Filogic-GW:/rdklogs/logs# dmcli eRT setv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold uint 200
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
root@Filogic-GW:~# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
type: uint, value: 200 |
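The get, set, get sequence used for both thresholds can be wrapped in one function. This is a sketch under stated assumptions: the function name set_and_verify_uint and the DMCLI override variable are hypothetical, and the result is judged by reading the value back rather than by the exit status of dmcli.

```shell
#!/bin/sh
# DMCLI is an assumed override hook for dry runs; on the device it is dmcli.
DMCLI=${DMCLI:-dmcli}

# Hypothetical helper: set a uint parameter, read it back, and succeed only
# if the value that is read back equals the requested one.
set_and_verify_uint() {
    param=$1
    want=$2
    $DMCLI eRT setv "$param" uint "$want" >/dev/null || return 1
    got=$($DMCLI eRT getv "$param" \
        | sed -n 's/.*value:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1)
    [ "$got" = "$want" ]
}

# On the device:
#   set_and_verify_uint Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold 200
```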
3. Once usage reaches the threshold value, the device reboots automatically.
Observations in /rdklogs/logs/SelfHeal.txt.0:
For Memory Threshold
241106-10:22:51.210979 RDKB_SELFHEAL : Used memory in system is 153740 at timestamp 2024:11:06:10:22:51
241106-10:22:51.212450 RDKB_SELFHEAL : Free memory in system is 3754608 at timestamp 2024:11:06:10:22:51
241106-10:22:51.213869 RDKB_SELFHEAL : AvgMemUsed in % is 3
241106-10:22:51.267398 <128>CABLEMODEM[Mediatek]:<99000006><2024:11:06:10:22:51><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Memory threshold reached
241106-10:23:21.732742 RDKB_SELFHEAL : Today's reboot count is 1
241106-10:23:21.734210 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:23:21><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Rebooting device as part of corrective action
241106-10:23:21.735754 Setting Last reboot reason as MEM_THRESHOLD
241106-10:23:21.737264 Setting rebootReason to MEM_THRESHOLD and rebootCounter to 1
241106-10:23:21.789361 RDKB_REBOOT : Rebooting device due to MEM threshold reached
After reboot,
Code Block | ||||
---|---|---|---|---|
| ||||
root@Filogic-GW:~# dmcli eRT getv Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
type: string, value: MEM_THRESHOLD |
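The "AvgMemUsed in %" figure in the memory log can be reproduced from the used and free values logged just before it. The sketch below assumes that total memory is approximately used + free and that the percentage is truncated to an integer; the monitoring script itself may take the total from elsewhere, so treat this as an illustration of the arithmetic only.

```shell
#!/bin/sh
# Integer percentage of used memory, assuming total = used + free.
mem_used_percent() {
    used=$1
    free=$2
    echo $(( used * 100 / (used + free) ))
}

# Values taken from the memory log above: prints 3
mem_used_percent 153740 3754608
```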
For CPU Threshold
241106-10:54:09.546821 RDKB_SELFHEAL : Today's reboot count is 3
241106-10:54:09.548312 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:54:09><d2:33:17:da:85:e4><BananapiBPI-R4> RM Rebooting device as part of corrective action
241106-10:54:09.549727 Setting Last reboot reason as CPU_THRESHOLD
241106-10:54:09.551169 Setting rebootReason to CPU_THRESHOLD and rebootCounter to 1
241106-10:54:09.603327 RDKB_REBOOT : Rebooting device due to CPU threshold reached
<128>CABLEMODEM[Mediatek]:<99000005><2024:11:06:10:53:39><d2:33:17:da:85:e4><BananapiBPI-R4> RM CPU threshold reached
[2024-11-06:10:53:39:083104] Setting Last reboot reason
After reboot,
Code Block | ||||
---|---|---|---|---|
| ||||
root@Filogic-GW:~# dmcli eRT getv Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter 1 name: Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
type: string, value: CPU_THRESHOLD |
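The reboot reason reported by dmcli can be cross-checked against the Self Heal log. A minimal sketch, assuming the log contains lines of the form "Setting Last reboot reason as <REASON>" as shown above; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the reason from the last
# "Setting Last reboot reason as <REASON>" line in the given log file.
last_reboot_reason_from_log() {
    sed -n 's/.*Setting Last reboot reason as \([A-Z_]*\).*/\1/p' "$1" | tail -n 1
}

# On the device:
#   last_reboot_reason_from_log /rdklogs/logs/SelfHeal.txt.0
```

The printed reason should match the value of Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason after the reboot.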
If Self Heal detects that any process is not running, it automatically restarts that component.
Take the CcspLMLite component as an example:
1. Run a ps command to verify that CcspLMLite is up and running, and note its process ID:
ps aux | grep Ccsp
2. Kill the CcspLMLite process using the command below:
kill -9 <CcspLMLite PID>
3. Verify that the CcspLMLite process was killed using the command below:
ps aux | grep Ccsp
4. After 60 seconds (the default), Self Heal automatically restarts the process. Check that CcspLMLite is running again with a different PID.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
root@Filogic-GW:~# ps -alx | grep CcspLM
5 950 4397 1 20 0 574768 7360 hrtime Ssl ? 0:00 /usr/bin/CcspLMLite -subsys eRT.
0 0 31853 9847 20 0 2244 820 pipe_w S+ pts/0 0:00 grep CcspLM
root@Filogic-GW:~# systemctl status CcspLMLite
● CcspLMLite.service - CcspLMLite service
Loaded: loaded (/lib/systemd/system/CcspLMLite.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2022-04-28 17:43:00 UTC; 2 years 6 months ago
Process: 4353 ExecStart=/usr/bin/CcspLMLite -subsys $Subsys (code=exited, status=0/SUCCESS)
Main PID: 4397 (CcspLMLite)
CGroup: /system.slice/CcspLMLite.service
└─ 4397 /usr/bin/CcspLMLite -subsys eRT.
2022 Apr 28 17:43:00 Filogic-GW systemd[1]: Starting CcspLMLite service...
2022 Apr 28 17:43:00 Filogic-GW systemd[1]: Started CcspLMLite service.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.lmlite start to check eRT.com.cisco.spvtg.ccsp.psm status
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.psm is ready, eRT.com.cisco.spvtg.ccsp.lmlite continue
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: PSM module done.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: Conf file /etc/debug.ini open success
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_dyn_log_initg_dl_socket = 3 __progname = CcspLMLite
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_logger_init /etc/debug.ini Already Stack Level Logging processed... not processing again.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
root@Filogic-GW:~#
root@Filogic-GW:~# kill -9 4397
root@Filogic-GW:~# systemctl status CcspLMLite
× CcspLMLite.service - CcspLMLite service
Loaded: loaded (/lib/systemd/system/CcspLMLite.service; enabled; vendor preset: enabled)
Active: failed (Result: signal) since Wed 2024-11-06 09:22:16 UTC; 1s ago
Process: 4353 ExecStart=/usr/bin/CcspLMLite -subsys $Subsys (code=exited, status=0/SUCCESS)
Process: 32297 ExecStopPost=/bin/sh -c echo "`date`: Stopping/Restarting CcspLMLite" >> ${PROCESS_RESTART_LOG} (code=exited, status=0/SUCCESS)
Main PID: 4397 (code=killed, signal=KILL)
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.lmlite start to check eRT.com.cisco.spvtg.ccsp.psm status
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.psm is ready, eRT.com.cisco.spvtg.ccsp.lmlite continue
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: PSM module done.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: Conf file /etc/debug.ini open success
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_dyn_log_initg_dl_socket = 3 __progname = CcspLMLite
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_logger_init /etc/debug.ini Already Stack Level Logging processed... not processing again.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
2024 Nov 06 09:22:16 Filogic-GW systemd[1]: CcspLMLite.service: Main process exited, code=killed, status=9/KILL
2024 Nov 06 09:22:16 Filogic-GW systemd[1]: CcspLMLite.service: Failed with result 'signal'.
root@Filogic-GW:~# ps -alx | grep CcspLM
0 0 32501 9847 20 0 2244 820 pipe_w S+ pts/0 0:00 grep CcspLM
self heal logs :
241106-09:22:55.245084 RDKB_SELFHEAL : <128>Ethwan Gateway[Mediatek]:<99000007><2024:11:06:09:22:53><e6:72:eb:94:4f:2e><BananapiBPI-R4> RM CcspLMLite process not running , restarting it
241106-09:22:55.246875 RDKB_SELFHEAL : Resetting process CcspLMLite
root@Filogic-GW:~# ps -alx | grep CcspLM
5 950 33247 1 20 0 443700 7364 hrtime Ssl ? 0:00 /usr/bin/CcspLMLite -subsys eRT.
0 0 33796 9847 20 0 2244 804 pipe_w S+ pts/0 0:00 grep CcspLM
root@Filogic-GW:~# |
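Instead of re-running ps by hand, the restart check can be polled. This is a sketch, assuming pidof is available on the device; the function names get_pid and wait_for_restart and the 90-second default timeout are illustrative choices and are not part of Self Heal.

```shell
#!/bin/sh
# PID lookup, kept in its own function so it can be swapped out off-device.
get_pid() {
    pidof "$1" 2>/dev/null | awk '{print $1}'
}

# Poll once per second until the named process runs with a PID different
# from old_pid. Prints the new PID and returns 0, or returns 1 on timeout.
wait_for_restart() {
    name=$1
    old_pid=$2
    timeout=${3:-90}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        new_pid=$(get_pid "$name")
        if [ -n "$new_pid" ] && [ "$new_pid" != "$old_pid" ]; then
            echo "$new_pid"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# On the device, after "kill -9 4397":
#   wait_for_restart CcspLMLite 4397
```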
If the connectivity test fails and corrective action is enabled, the device reboots.
Validation: use the steps below to validate the connectivity test.
Unplug the Ethernet LAN cable, or run ifconfig erouter0 down.
1. Use the SelfHeal logs to troubleshoot run-time errors. The SelfHeal logs are created at the path below:
/rdklogs/logs/SelfHeal.txt.0
...
2. Resource Monitor sample Logs,
MEM :
241106-10:22:51.210979 RDKB_SELFHEAL : Used memory in system is 153740 at timestamp 2024:11:06:10:22:51
241106-10:22:51.212450 RDKB_SELFHEAL : Free memory in system is 3754608 at timestamp 2024:11:06:10:22:51
241106-10:22:51.213869 RDKB_SELFHEAL : AvgMemUsed in % is 3
241106-10:22:51.267398 <128>CABLEMODEM[Mediatek]:<99000006><2024:11:06:10:22:51><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Memory threshold reached
241106-10:23:21.732742 RDKB_SELFHEAL : Today's reboot count is 1
241106-10:23:21.734210 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:23:21><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Rebooting device as part of corrective action
241106-10:23:21.735754 Setting Last reboot reason as MEM_THRESHOLD
241106-10:23:21.737264 Setting rebootReason to MEM_THRESHOLD and rebootCounter to 1
241106-10:23:21.789361 RDKB_REBOOT : Rebooting device due to MEM threshold reached
CPU:
241106-10:54:09.546821 RDKB_SELFHEAL : Today's reboot count is 3
241106-10:54:09.548312 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:54:09><d2:33:17:da:85:e4><BananapiBPI-R4> RM Rebooting device as part of corrective action
241106-10:54:09.549727 Setting Last reboot reason as CPU_THRESHOLD
241106-10:54:09.551169 Setting rebootReason to CPU_THRESHOLD and rebootCounter to 1
241106-10:54:09.603327 RDKB_REBOOT : Rebooting device due to CPU threshold reached
<128>CABLEMODEM[Mediatek]:<99000005><2024:11:06:10:53:39><d2:33:17:da:85:e4><BananapiBPI-R4> RM CPU threshold reached
[2024-11-06:10:53:39:083104] Setting Last reboot reason
3. Process Monitor Sample Logs,
LMLite Process :
241106-09:22:55.245084 RDKB_SELFHEAL : <128>Ethwan Gateway[Mediatek]:<99000007><2024:11:06:09:22:53><e6:72:eb:94:4f:2e><BananapiBPI-R4> RM CcspLMLite process not running , restarting it
241106-09:22:55.246875 RDKB_SELFHEAL : Resetting process CcspLMLite
...
191007-09:00:13.899713 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
191007-09:00:13.909201 [RDKB_SELFHEAL] : IPv4 GW Address is:192.168.60.1
191007-09:00:13.918684 [RDKB_SELFHEAL] : IPv6 GW Address is:
191007-09:00:13.972966 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
191007-09:00:14.119985 DNS Response: fail to resolve this URL www.google.com
191007-09:00:14.152808 RDKB_SELFHEAL : Taking corrective action
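To see at a glance how the connectivity test went, the relevant messages can be counted in the log. This is a rough sketch that matches only the message strings shown in the sample above (including the log's own spelling "Successfull"); the function name connectivity_summary is hypothetical, and other firmware builds may word these messages differently.

```shell
#!/bin/sh
# Hypothetical helper: count connectivity-test messages in a SelfHeal log.
# "|| true" keeps the function usable under "set -e", because grep -c exits
# with status 1 when the count is zero.
connectivity_summary() {
    log=$1
    gw_ok=$(grep -c 'GW IP Connectivity Test Successfull' "$log" || true)
    dns_fail=$(grep -c 'DNS Response: fail to resolve' "$log" || true)
    actions=$(grep -c 'RDKB_SELFHEAL : Taking corrective action' "$log" || true)
    echo "gw_ok=$gw_ok dns_fail=$dns_fail corrective_actions=$actions"
}

# On the device:
#   connectivity_summary /rdklogs/logs/SelfHeal.txt.0
```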