1. Introduction

Self Heal is a monitoring and recovery module.

It continuously monitors system resources such as CPU and memory, as well as the critical processes running on the device.

Self Heal also performs connectivity tests.

When it detects a problem, Self Heal takes corrective action based on predefined conditions, such as rebooting the device or restarting the affected process.

Self Heal stores the Reset Count and Reboot Count.

2. Environment Setup

Self Heal functionality is handled by a set of scripts. These scripts are available in the RDK build by default.

Ensure that the Self Heal scripts below are present on the device at the path "/usr/ccsp/tad".

  • resource_monitor.sh

  • task_health_monitor.sh

  • corrective_action.sh

  • self_heal_connectivity_test.sh
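The presence check can be scripted. The following is a minimal sketch, not part of the RDK build: the function name missing_selfheal_scripts is hypothetical, and the connectivity script name is taken from the /usr/ccsp/tad directory listing shown later on this page.

```shell
#!/bin/sh
# Print the name of every expected Self Heal script that is missing from a directory.
# missing_selfheal_scripts is a hypothetical helper, not an RDK command.
missing_selfheal_scripts() {
    dir="${1:-/usr/ccsp/tad}"
    for script in resource_monitor.sh task_health_monitor.sh \
                  corrective_action.sh self_heal_connectivity_test.sh; do
        # Report only the files that do not exist as regular files.
        [ -f "$dir/$script" ] || echo "$script"
    done
}

# Example (prints nothing when all four scripts are present):
#   missing_selfheal_scripts /usr/ccsp/tad
```

An empty result means all four scripts were found; any printed name points to a script missing from the image.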

Use the commands below to verify that the Self Heal module is enabled by default and that its scripts are running:

SH_Enabling
root@Filogic-GW:/rdklogs/logs# dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.X_RDKCENTRAL-COM_Enable
               type:       bool,    value: true 

root@Filogic-GW:/rdklogs/logs# 
root@Filogic-GW:/rdklogs/logs# ps -alx | grep reso
4     0    4531       1  20   0   3496  2744 do_wai S    ?          0:00 /bin/sh /usr/ccsp/tad/resource_monitor.sh
0     0   16777   10009  20   0   2244   808 pipe_w S+   pts/0      0:00 grep reso
root@Filogic-GW:/rdklogs/logs# ps -alx | grep self
4     0    4528       1  20   0   3628  2836 do_wai S    ?          0:00 /bin/sh /usr/ccsp/tad/self_heal_connectivity_test.sh
4     0    4539       1  20   0   4288  3468 do_wai S    ?          0:00 /bin/sh /usr/ccsp/tad/selfheal_aggressive.sh
0     0   16816   10009  20   0   2244   824 pipe_w S+   pts/0      0:00 grep self
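When this check is automated, the value has to be picked out of the dmcli output. The sketch below assumes the output layout shown in the snippet above ("type: ..., value: ..." on its own line); dmcli_value is a hypothetical helper name, not a dmcli option.

```shell
#!/bin/sh
# Read "dmcli eRT getv <parameter>" output on stdin and print the value of the
# first parameter reported. dmcli_value is a hypothetical helper.
dmcli_value() {
    # Keep only lines of the form "type: <type>, value: <value>", drop everything
    # up to "value:", and strip the trailing spaces that dmcli prints.
    sed -n '/type:.*value:/{s/^.*value: *//;s/ *$//;p;}' | head -n 1
}

# Example on the device:
#   dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable | dmcli_value
```

The helper only reads text from standard input, so it can also be run against saved dmcli output.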


          

3. Executing System

Self Heal is enabled by default and is active at the time of boot up.

It periodically performs the actions below.

  • Resource monitoring: Monitors memory and CPU usage. If usage goes beyond the threshold, the device is rebooted.
  • Process monitoring: Periodically monitors the status of the critical processes.
    • Ccsp processes: If any of these processes crashes, Self Heal restarts it.
    • "CcspCrSsp": If this process crashes, the device is rebooted.
    • "syseventd": If syseventd crashes, the device is rebooted.
  • Connectivity test: If DNS or WAN_IP is down, the device stops the LAN functionality.
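The process-monitoring rules above can be summarised as a small decision function. This is an illustrative sketch only: selfheal_action_for is a hypothetical name, and the real scripts may apply extra conditions (such as reboot-count limits) that are not modelled here.

```shell
#!/bin/sh
# Map a crashed process name to the corrective action described in the list above.
# selfheal_action_for is a hypothetical helper, not taken from the Self Heal scripts.
selfheal_action_for() {
    case "$1" in
        CcspCrSsp|syseventd) echo reboot ;;   # a crash of these reboots the device
        Ccsp*)               echo restart ;;  # other Ccsp processes are restarted
        *)                   echo unknown ;;  # not covered by the rules on this page
    esac
}

# Example:
#   selfheal_action_for CcspLMLite
```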
Selfheal_DM
DM parameters
=============

root@Filogic-GW:/# dmcli eRT getv Device.SelfHeal.
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.X_RDKCENTRAL-COM_FreeMemThreshold
               type:       uint,    value: 0 
Parameter    2 name: Device.SelfHeal.X_RDKCENTRAL-COM_MemFragThreshold
               type:       uint,    value: 0 
Parameter    3 name: Device.SelfHeal.X_RDKCENTRAL-COM_CpuMemFragInterval
               type:       uint,    value: 0 
Parameter    4 name: Device.SelfHeal.X_RDKCENTRAL-COM_Enable
               type:       bool,    value: true 
Parameter    5 name: Device.SelfHeal.X_RDKCENTRAL-COM_MaxRebootCount
               type:       uint,    value: 3 
Parameter    6 name: Device.SelfHeal.X_RDKCENTRAL-COM_MaxResetCount
               type:       uint,    value: 3 
Parameter    7 name: Device.SelfHeal.X_RDKCENTRAL-COM_NoWaitLogSync
               type:       bool,    value: false 
Parameter    8 name: Device.SelfHeal.X_RDKCENTRAL-COM_LogBackupThreshold
               type:       uint,    value: 0 
Parameter    9 name: Device.SelfHeal.X_RDKCENTRAL-COM_DiagnosticMode
               type:       bool,    value: false 
Parameter   10 name: Device.SelfHeal.X_RDKCENTRAL-COM_DiagMode_LogUploadFrequency
               type:       uint,    value: 1440 
Parameter   11 name: Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable
               type:       bool,    value: false 
Parameter   12 name: Device.SelfHeal.X_RDKCENTRAL-COM_DNS_URL
               type:     string,    value: www.google.com 
Parameter   13 name: Device.SelfHeal.CpuMemFragNumberOfEntries
               type:       uint,    value: 2 
Parameter   14 name: Device.SelfHeal.CpuMemFrag.1.DMA
               type:     string,    value:  
Parameter   15 name: Device.SelfHeal.CpuMemFrag.1.DMA32
               type:     string,    value:  
Parameter   16 name: Device.SelfHeal.CpuMemFrag.1.Normal
               type:     string,    value:  
Parameter   17 name: Device.SelfHeal.CpuMemFrag.1.Highmem
               type:     string,    value:  
Parameter   18 name: Device.SelfHeal.CpuMemFrag.1.FragPercentage
               type:       uint,    value: 0 
Parameter   19 name: Device.SelfHeal.CpuMemFrag.2.DMA
               type:     string,    value:  
Parameter   20 name: Device.SelfHeal.CpuMemFrag.2.DMA32
               type:     string,    value:  
Parameter   21 name: Device.SelfHeal.CpuMemFrag.2.Normal
               type:     string,    value:  
Parameter   22 name: Device.SelfHeal.CpuMemFrag.2.Highmem
               type:     string,    value:  
Parameter   23 name: Device.SelfHeal.CpuMemFrag.2.FragPercentage
               type:       uint,    value: 0 
Parameter   24 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_PingInterval
               type:       uint,    value: 60 
Parameter   25 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_NumPingsPerServer
               type:       uint,    value: 3 
Parameter   26 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_MinNumPingServer
               type:       uint,    value: 1 
Parameter   27 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_PingRespWaitTime
               type:       uint,    value: 1000 
Parameter   28 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_CorrectiveAction
               type:       bool,    value: false 
Parameter   29 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_LastReboot
               type:       uint,    value: 0 
Parameter   30 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_RebootInterval
               type:        int,    value: 0 
Parameter   31 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_CurrentCount
               type:        int,    value: 0 
Parameter   32 name: Device.SelfHeal.ConnectivityTest.PingServerList.IPv4PingServerTableNumberOfEntries
               type:       uint,    value: 0 
Parameter   33 name: Device.SelfHeal.ConnectivityTest.PingServerList.IPv6PingServerTableNumberOfEntries
               type:       uint,    value: 0 
Parameter   34 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_UsageComputeWindow
               type:       uint,    value: 15 
Parameter   35 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
               type:       uint,    value: 100 
Parameter   36 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
               type:       uint,    value: 100 
Parameter   37 name: Device.SelfHeal.CPUProcAnalyzer.Enable
               type:       bool,    value: false 
Parameter   38 name: Device.SelfHeal.CPUProcAnalyzer.SleepInterval
               type:       uint,    value: 60 
Parameter   39 name: Device.SelfHeal.CPUProcAnalyzer.TimeToRun
               type:       uint,    value: 600 
Parameter   40 name: Device.SelfHeal.CPUProcAnalyzer.DynamicProcess
               type:       bool,    value: false 
Parameter   41 name: Device.SelfHeal.CPUProcAnalyzer.MonitorAllProcess
               type:       bool,    value: false 
Parameter   42 name: Device.SelfHeal.CPUProcAnalyzer.MemoryLimit
               type:       uint,    value: 1536 
Parameter   43 name: Device.SelfHeal.CPUProcAnalyzer.ProcessList
               type:     string,    value:  
Parameter   44 name: Device.SelfHeal.CPUProcAnalyzer.SystemStatsToMonitor
               type:     string,    value: cpu,memory,fd,loadavg,cliconnected
 
Parameter   45 name: Device.SelfHeal.CPUProcAnalyzer.ProcessStatsToMonitor
               type:     string,    value: cpu,memory,fd,thread
 
Parameter   46 name: Device.SelfHeal.CPUProcAnalyzer.TelemetryOnly
               type:       bool,    value: false 

root@Filogic-GW:/# 

root@Filogic-GW:/usr/ccsp/tad# ls
CcspTandDSsp	       corrective_action.sh  log_buddyinfo.sh	  self_heal_connectivity_test.sh  selfheal_reset_counts.sh
TestAndDiagnostic.XML  cpumemfrag_cron.sh    resource_monitor.sh  selfheal_aggressive.sh	  task_health_monitor.sh



3.1. Resource Monitor - Monitors CPU and MEMORY

  1. By default, the average memory threshold is set to 100. This value is stored in the syscfg database. To change the default average memory threshold, follow the steps in the snippet below:

Memory_threshold
root@Filogic-GW:/# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
               type:       uint,    value: 100 

root@Filogic-GW:/rdklogs/logs# dmcli eRT setv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold uint 200
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.

root@Filogic-GW:/# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
               type:       uint,    value: 200 
 


  2. By default, the average CPU threshold is set to 100. This value is stored in the syscfg database. To change the default average CPU threshold, follow the steps in the snippet below:

CPU
root@Filogic-GW:~# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold                                            
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
               type:       uint,    value: 100 

root@Filogic-GW:/rdklogs/logs# dmcli eRT setv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold uint 200
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.

root@Filogic-GW:~# dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold                                            
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
               type:       uint,    value: 200  


  3. Once usage reaches the threshold value, the device is rebooted automatically.
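The threshold comparison can be illustrated with a little arithmetic. The sketch below is an assumption about how a usage percentage might be derived and compared; it is not the formula from resource_monitor.sh, and both function names are hypothetical. The sample figures are from this page's logs: 153740 used, and 4023440 total as reported in one of the connectivity log excerpts.

```shell
#!/bin/sh
# Integer percentage of memory in use, given used and total memory in the same unit.
# Hypothetical helper; the real script may compute its average differently.
mem_used_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Exit status 0 when the usage percentage has reached the configured threshold.
# Whether the real script uses ">=" or ">" is an assumption here.
threshold_reached() {
    [ "$1" -ge "$2" ]
}

# Example:
#   pct=$(mem_used_pct 153740 4023440)
#   threshold_reached "$pct" 100 && echo "would reboot"
```

With these figures the integer percentage is 3, in line with the "AvgMemUsed in % is 3" entry in the memory log sample on this page.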

Observations in /rdklogs/logs/SelfHeal.txt.0:

For Memory Threshold

241106-10:22:51.210979 RDKB_SELFHEAL : Used memory in system is 153740 at timestamp 2024:11:06:10:22:51
241106-10:22:51.212450 RDKB_SELFHEAL : Free memory in system is 3754608 at timestamp 2024:11:06:10:22:51
241106-10:22:51.213869 RDKB_SELFHEAL : AvgMemUsed in % is  3
241106-10:22:51.267398 <128>CABLEMODEM[Mediatek]:<99000006><2024:11:06:10:22:51><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Memory threshold reached

241106-10:23:21.732742 RDKB_SELFHEAL : Today's reboot count is 1 
241106-10:23:21.734210 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:23:21><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Rebooting device as part of corrective action
241106-10:23:21.735754 Setting Last reboot reason as MEM_THRESHOLD
241106-10:23:21.737264 Setting rebootReason to MEM_THRESHOLD and rebootCounter to 1
241106-10:23:21.789361 RDKB_REBOOT : Rebooting device due to MEM threshold reached

After reboot,

MEM_CHECK
root@Filogic-GW:~# dmcli eRT getv Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason                                                   
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
               type:     string,    value: MEM_THRESHOLD 


For CPU Threshold

241106-10:54:09.546821 RDKB_SELFHEAL : Today's reboot count is 3 
241106-10:54:09.548312 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:54:09><d2:33:17:da:85:e4><BananapiBPI-R4> RM Rebooting device as part of corrective action
241106-10:54:09.549727 Setting Last reboot reason as CPU_THRESHOLD
241106-10:54:09.551169 Setting rebootReason to CPU_THRESHOLD and rebootCounter to 1
241106-10:54:09.603327 RDKB_REBOOT : Rebooting device due to CPU threshold reached

<128>CABLEMODEM[Mediatek]:<99000005><2024:11:06:10:53:39><d2:33:17:da:85:e4><BananapiBPI-R4> RM CPU threshold reached
[2024-11-06:10:53:39:083104] Setting Last reboot reason

After reboot,

CPU_CHECK
root@Filogic-GW:~# dmcli eRT getv Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason                                                    
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
               type:     string,    value: CPU_THRESHOLD    


     

3.2. Process Monitor - Monitors Processes Periodically Based on Process IDs

If it detects that a process is not running, it automatically restarts that component.

Take the CcspLMLite component as an example:

  1. Run a ps command to verify that CcspLMLite is running, and note its process ID.

         ps aux | grep Ccsp

  2. Kill the CcspLMLite process using the command below.

         kill -9 <CcspLMLite PID>

  3. Verify that the CcspLMLite process was killed using the command below.

         ps aux | grep Ccsp

  4. After 60 seconds (the default), Self Heal automatically restarts the process. Check that CcspLMLite is running again with a different PID.

process_monitor
root@Filogic-GW:~# ps -alx | grep CcspLM
5   950    4397       1  20   0 574768  7360 hrtime Ssl  ?          0:00 /usr/bin/CcspLMLite -subsys eRT.
0     0   31853    9847  20   0   2244   820 pipe_w S+   pts/0      0:00 grep CcspLM
root@Filogic-GW:~# systemctl status CcspLMLite
● CcspLMLite.service - CcspLMLite service
     Loaded: loaded (/lib/systemd/system/CcspLMLite.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2022-04-28 17:43:00 UTC; 2 years 6 months ago
    Process: 4353 ExecStart=/usr/bin/CcspLMLite -subsys $Subsys (code=exited, status=0/SUCCESS)
   Main PID: 4397 (CcspLMLite)
     CGroup: /system.slice/CcspLMLite.service
             └─ 4397 /usr/bin/CcspLMLite -subsys eRT.

2022 Apr 28 17:43:00 Filogic-GW systemd[1]: Starting CcspLMLite service...
2022 Apr 28 17:43:00 Filogic-GW systemd[1]: Started CcspLMLite service.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.lmlite start to check eRT.com.cisco.spvtg.ccsp.psm status
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.psm is ready, eRT.com.cisco.spvtg.ccsp.lmlite continue
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: PSM module done.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: Conf file /etc/debug.ini open success
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_dyn_log_initg_dl_socket = 3 __progname = CcspLMLite
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_logger_init /etc/debug.ini Already Stack Level Logging processed... not processing again.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
root@Filogic-GW:~# 
root@Filogic-GW:~# kill -9 4397
root@Filogic-GW:~# systemctl status CcspLMLite
× CcspLMLite.service - CcspLMLite service
     Loaded: loaded (/lib/systemd/system/CcspLMLite.service; enabled; vendor preset: enabled)
     Active: failed (Result: signal) since Wed 2024-11-06 09:22:16 UTC; 1s ago
    Process: 4353 ExecStart=/usr/bin/CcspLMLite -subsys $Subsys (code=exited, status=0/SUCCESS)
    Process: 32297 ExecStopPost=/bin/sh -c echo "`date`: Stopping/Restarting CcspLMLite" >> ${PROCESS_RESTART_LOG} (code=exited, status=0/SUCCESS)
   Main PID: 4397 (code=killed, signal=KILL)

2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.lmlite start to check eRT.com.cisco.spvtg.ccsp.psm status
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: eRT.com.cisco.spvtg.ccsp.psm is ready, eRT.com.cisco.spvtg.ccsp.lmlite continue
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: PSM module done.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: Conf file /etc/debug.ini open success
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_dyn_log_initg_dl_socket = 3 __progname = CcspLMLite
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: rdk_logger_init /etc/debug.ini Already Stack Level Logging processed... not processing again.
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
2022 Apr 28 17:43:00 Filogic-GW CcspLMLite[4397]: mq == (mqd_t)-1: Invalid argument
2024 Nov 06 09:22:16 Filogic-GW systemd[1]: CcspLMLite.service: Main process exited, code=killed, status=9/KILL
2024 Nov 06 09:22:16 Filogic-GW systemd[1]: CcspLMLite.service: Failed with result 'signal'.
root@Filogic-GW:~# ps -alx | grep CcspLM
0     0   32501    9847  20   0   2244   820 pipe_w S+   pts/0      0:00 grep CcspLM

Self Heal logs:

241106-09:22:55.245084 RDKB_SELFHEAL : <128>Ethwan Gateway[Mediatek]:<99000007><2024:11:06:09:22:53><e6:72:eb:94:4f:2e><BananapiBPI-R4> RM CcspLMLite process not running , restarting it
241106-09:22:55.246875 RDKB_SELFHEAL : Resetting process CcspLMLite
               
root@Filogic-GW:~# ps -alx | grep CcspLM
5   950   33247       1  20   0 443700  7364 hrtime Ssl  ?          0:00 /usr/bin/CcspLMLite -subsys eRT.
0     0   33796    9847  20   0   2244   804 pipe_w S+   pts/0      0:00 grep CcspLM
root@Filogic-GW:~# 
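To confirm from the log which processes Self Heal restarted, the "process not running" entries can be filtered. The sketch below assumes the log line layout shown above ("RM <name> process not running , restarting it"); restarted_processes is a hypothetical helper name.

```shell
#!/bin/sh
# Read Self Heal log lines on stdin and print the name of each process that
# Self Heal reported as not running. restarted_processes is a hypothetical helper.
restarted_processes() {
    # Capture the word between " RM " and " process not running".
    sed -n 's/^.* RM \([^ ]*\) process not running.*$/\1/p'
}

# Example on the device:
#   restarted_processes < /rdklogs/logs/SelfHeal.txt.0
```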


3.3. Connectivity Test - Ping Functionality

If the connectivity test fails and corrective action is enabled, the device reboots.

Validation: Use the steps below to validate the connectivity test.

  Unplug the Ethernet LAN cable, or run "ifconfig erouter0 down".

Note: By default, the DNS ping test and corrective action are disabled.

PING_TESTING
Use the commands below to check the DNS ping test and corrective action parameters and to enable them.

For DNS testing:

root@Filogic-GW:/rdklogs/logs# dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable
               type:       bool,    value: false 

root@Filogic-GW:~# dmcli eRT getv Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_CorrectiveAction                                    
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_CorrectiveAction
               type:       bool,    value: false 

root@Filogic-GW:/rdklogs/logs# 
root@Filogic-GW:/rdklogs/logs# 
root@Filogic-GW:/rdklogs/logs# dmcli eRT setv Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable bool true
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.

root@Filogic-GW:/rdklogs/logs# dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable
               type:       bool,    value: true 

root@Filogic-GW:/rdklogs/logs# 

To enable corrective action,

dmcli eRT setv Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_CorrectiveAction bool true

root@Filogic-GW:/rdklogs/logs# dmcli eRT getv Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
Execution succeed.
Parameter    1 name: Device.DeviceInfo.X_RDKCENTRAL-COM_LastRebootReason
               type:     string,    value: PING_Connectivity_Test_Failure 

root@Filogic-GW:/rdklogs/logs# 

root@Filogic-GW:/rdklogs/logs# tail -f SelfHeal.txt.0

Failure :
======
241104-10:16:38.221777 RDKB_SELFHEAL : No IPv4 Gateway Address detected
241104-10:16:38.223798 RDKB_SELFHEAL : No IPv6 Gateway Address detected
241104-10:16:38.239881 RDKB_SELFHEAL : Taking corrective action

241104-10:16:39.782027 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
241104-10:16:39.821478 DNS Response: fail to resolve this URL www.google.com
241104-10:16:39.837748 RDKB_SELFHEAL : Taking corrective action

241104-10:14:35.589120 RDKB_SELFHEAL : <128>Ethwan Gateway[Mediatek]:<99000007><2024:11:04:10:14:34><ea:a2:ae:1a:b7:63><BananapiBPI-R4> RM PIt
diff and last_reboot -960 and 28800
PING_LATENCY_GWIPv4:1.00,2.42,1.07
241104-10:14:59.740762 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
241104-10:14:59.742689 [RDKB_SELFHEAL] : IPv4 GW  Address is:192.168.2.254
241104-10:14:59.744620 [RDKB_SELFHEAL] : IPv6 GW  Address is:fe80::da3a:ddff:fe0d:b86c
241104-10:14:59.771527 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
241104-10:14:59.808810 DNS Response: fail to resolve this URL www.google.com
241104-10:14:59.824618 RDKB_SELFHEAL : Taking corrective action

241104-11:01:01.277374 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
241104-11:01:01.312725 DNS Response: fail to resolve this URL www.google.com
241104-11:01:01.328165 RDKB_SELFHEAL : Taking corrective action
241104-11:01:01.406788 RDKB_SELFHEAL : Total memory in system is 4023440
241104-11:01:01.408277 RDKB_SELFHEAL : Used memory in system is 143808
241104-11:01:01.409853 RDKB_SELFHEAL : Free memory in system is 3773984
241104-11:01:02.457602 RDKB_SELFHEAL : Current CPU load is 0
241104-11:01:02.459152 RDKB_SELFHEAL : Top 5 tasks running on device with resource usage are below
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   3735.6 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
      1 root      20   0   94000   8100   5744 S   0.0   0.2   0:06.29 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
      3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
241104-11:01:02.716353 RDKB_SELFHEAL : 2.4GHz radio is operating on  channel
241104-11:01:02.732741 RDKB_SELFHEAL : 5GHz radio is operating on  channel
241104-11:01:02.734334 RDKB_SELFHEAL : MoCA stats are not available due to MoCA crash
241104-11:01:02.800183 RDKB_SELFHEAL : <128>Ethwan Gateway[Mediatek]:<99000007><2024:11:04:11:01:01><7e:22:72:d9:02:13><BananapiBPI-R4> RM PIt
diff and last_reboot 55 and 50
Ping reset Router
241104-11:01:02.881393 RDKB_SELFHEAL : DNS Information :

Success:
=======
241106-11:23:00.787843 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
241106-11:23:00.789414 [RDKB_SELFHEAL] : IPv4 GW  Address is:192.168.2.254
241106-11:23:00.791062 [RDKB_SELFHEAL] : IPv6 GW  Address is:fe80::1dde:7669:fc2e:fe43
fe80::da3a:ddff:fe09:f505
fe80::532e:c128:f66b:79f1
fe80::f485:c03c:fcf3:c75
241106-11:23:00.817414 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
241106-11:23:01.021947 DNS Response: Got success response for this URL www.google.com
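The two "setv" commands used above can be wrapped in a small helper so that both settings are applied together. This is a sketch under assumptions: enable_connectivity_selfheal and the DMCLI override variable are hypothetical, and the sketch does not assume that dmcli reports failures through its exit status. Setting DMCLI=echo prints the commands instead of running them.

```shell
#!/bin/sh
# Command used to reach the data model; override with DMCLI=echo for a dry run.
# The DMCLI variable is a convention of this sketch, not of dmcli itself.
DMCLI="${DMCLI:-dmcli}"

# Enable the DNS ping test and the connectivity-test corrective action,
# using the same parameters and types as the commands shown above.
enable_connectivity_selfheal() {
    "$DMCLI" eRT setv Device.SelfHeal.X_RDKCENTRAL-COM_DNS_PINGTEST_Enable bool true
    "$DMCLI" eRT setv Device.SelfHeal.ConnectivityTest.X_RDKCENTRAL-COM_CorrectiveAction bool true
}

# Example dry run:
#   DMCLI=echo; enable_connectivity_selfheal
```

After running it on a device, re-read both parameters with "dmcli eRT getv" to confirm they are set to true.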


4. Troubleshooting

                    1.  Use the Self Heal logs to troubleshoot run-time errors. The Self Heal logs are created at the path below:

                                                 /rdklogs/logs/SelfHeal.txt.0

                    2.  Resource Monitor sample Logs, 

                            MEM :

                            241106-10:22:51.210979 RDKB_SELFHEAL : Used memory in system is 153740 at timestamp 2024:11:06:10:22:51
                            241106-10:22:51.212450 RDKB_SELFHEAL : Free memory in system is 3754608 at timestamp 2024:11:06:10:22:51
                            241106-10:22:51.213869 RDKB_SELFHEAL : AvgMemUsed in % is  3
                           241106-10:22:51.267398 <128>CABLEMODEM[Mediatek]:<99000006><2024:11:06:10:22:51><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Memory threshold reached

                           241106-10:23:21.732742 RDKB_SELFHEAL : Today's reboot count is 1 
                           241106-10:23:21.734210 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:23:21><ea:4f:a0:5d:06:99><BananapiBPI-R4> RM Rebooting device as part of corrective action
                           241106-10:23:21.735754 Setting Last reboot reason as MEM_THRESHOLD
                           241106-10:23:21.737264 Setting rebootReason to MEM_THRESHOLD and rebootCounter to 1
                           241106-10:23:21.789361 RDKB_REBOOT : Rebooting device due to MEM threshold reached

                         CPU:

                                 241106-10:54:09.546821 RDKB_SELFHEAL : Today's reboot count is 3 
                                 241106-10:54:09.548312 RDKB_SELFHEAL : <128>CABLEMODEM[Mediatek]:<99000000><2024:11:06:10:54:09><d2:33:17:da:85:e4><BananapiBPI-R4> RM Rebooting device as part of corrective action
                                 241106-10:54:09.549727 Setting Last reboot reason as CPU_THRESHOLD
                                 241106-10:54:09.551169 Setting rebootReason to CPU_THRESHOLD and rebootCounter to 1
                                 241106-10:54:09.603327 RDKB_REBOOT : Rebooting device due to CPU threshold reached

                               <128>CABLEMODEM[Mediatek]:<99000005><2024:11:06:10:53:39><d2:33:17:da:85:e4><BananapiBPI-R4> RM CPU threshold reached
                                 [2024-11-06:10:53:39:083104] Setting Last reboot reason

                   3. Process Monitor Sample Logs,

                         LMLite Process :

                                      241106-09:22:55.245084 RDKB_SELFHEAL : <128>Ethwan Gateway[Mediatek]:<99000007><2024:11:06:09:22:53><e6:72:eb:94:4f:2e><BananapiBPI-R4> RM CcspLMLite process not running , restarting it
                                      241106-09:22:55.246875 RDKB_SELFHEAL : Resetting process CcspLMLite

                 4. Connectivity Test Sample Logs ,

                              Successful Scenario :

                                        190924-08:56:43.577621 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
                                        190924-08:56:43.583217 [RDKB_SELFHEAL] : IPv4 GW Address is:192.168.30.1
                                        190924-08:56:43.588370 [RDKB_SELFHEAL] : IPv6 GW Address is:
                                        190924-08:56:43.622618 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
                                        190924-08:56:43.730057 DNS Response: Got success response for this URL www.google.com

                             Failure Scenario :

                                         191007-09:00:13.899713 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
                                         191007-09:00:13.909201 [RDKB_SELFHEAL] : IPv4 GW Address is:192.168.60.1
                                         191007-09:00:13.918684 [RDKB_SELFHEAL] : IPv6 GW Address is:
                                         191007-09:00:13.972966 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
                                         191007-09:00:14.119985 DNS Response: fail to resolve this URL www.google.com
                                          191007-09:00:14.152808 RDKB_SELFHEAL : Taking corrective action
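When troubleshooting, it can help to reduce SelfHeal.txt.0 to the entries that mark a Self Heal decision. The sketch below filters on message fragments that appear in the sample logs above; selfheal_events is a hypothetical helper, and the fragment list is illustrative rather than exhaustive.

```shell
#!/bin/sh
# Print only the log lines that record a threshold hit, a reboot, a process
# restart, a corrective action, or a DNS result. selfheal_events is a hypothetical helper.
selfheal_events() {
    grep -E 'threshold reached|Rebooting device|process not running|Taking corrective action|DNS Response' \
        "${1:-/rdklogs/logs/SelfHeal.txt.0}"
}

# Example on the device:
#   selfheal_events /rdklogs/logs/SelfHeal.txt.0
```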

5. References

RDKBACCL-303