1. Introduction

Self Heal is a monitoring and recovery module.

It continuously monitors system resources such as CPU and memory, as well as the critical processes running on the device.

Self Heal also performs connectivity tests.

When it detects a problem, Self Heal takes corrective action based on predefined conditions, such as rebooting the device or restarting the affected process.

Self Heal also stores the Reset Count and Reboot Count.

2. Environment Setup

Self Heal functionality is handled by a set of scripts. These scripts are available in the RDK build by default.

Ensure that the following Self Heal scripts are present on the device under "/usr/ccsp/tad":

  • resource_monitor.sh

  • task_health_monitor.sh

  • corrective_action.sh

  • self_heal_connectivity.sh
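The presence check above can be scripted. The following is a minimal sketch, not part of the Self Heal scripts themselves: missing_selfheal_scripts is a hypothetical helper name, and the directory defaults to the path given above.

```shell
#!/bin/sh
# Print the name of every expected Self Heal script that is missing
# from the given directory (default: /usr/ccsp/tad).
# Prints nothing and returns 0 when all four scripts are present.
missing_selfheal_scripts() {
    dir="${1:-/usr/ccsp/tad}"
    rc=0
    for script in resource_monitor.sh task_health_monitor.sh \
                  corrective_action.sh self_heal_connectivity.sh; do
        if [ ! -f "$dir/$script" ]; then
            echo "$script"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling missing_selfheal_scripts with no argument checks /usr/ccsp/tad; any file name it prints is absent from the device.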

Refer to the screenshot below to verify whether the Self Heal module is enabled.

          

3. Executing System

Self Heal is enabled by default and is active at the time of boot up.

It periodically performs the following actions:

  • Resource monitoring: Monitors memory and CPU usage and reboots the device if usage exceeds the configured threshold.
  • Process monitoring: Periodically checks the status of the critical processes.
    • Ccsp processes: If any of these processes crashes, Self Heal restarts it.
    • "CcspCrSsp": If this process crashes, the device is rebooted.
    • "syseventd": If syseventd crashes, the device is rebooted.
  • Connectivity test: If DNS or WAN_IP is down, the device stops LAN functionality.

3.1. Resource Monitor - Monitors CPU and Memory

  1. By default, the average CPU threshold is set to 100. This value is stored in the syscfg database. To change the default average CPU threshold, refer to the attached screenshot for the steps.

  2. By default, the average memory threshold is set to 100. This value is stored in the syscfg database. To change the default average memory threshold, refer to the attached screenshot for the steps.

  3. Once usage reaches the threshold value, the device is rebooted automatically.
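Because the thresholds live in the syscfg database, they can also be changed from a shell on the device. The following is a minimal sketch under stated assumptions, not the documented procedure: set_selfheal_threshold is a hypothetical helper, the 1-100 range check is an assumption, and the syscfg key name is passed in as an argument because the exact key names are not listed on this page.

```shell
#!/bin/sh
# Store a new Self Heal threshold in the syscfg database.
# $1 = syscfg key name (assumption: check the key used on your build)
# $2 = new threshold, a whole number from 1 to 100 (assumed valid range)
set_selfheal_threshold() {
    key="$1"
    value="$2"
    # Reject empty or non-numeric values before touching the database.
    case "$value" in
        ''|*[!0-9]*) echo "threshold must be a whole number" >&2; return 1 ;;
    esac
    if [ "$value" -lt 1 ] || [ "$value" -gt 100 ]; then
        echo "threshold must be between 1 and 100" >&2
        return 1
    fi
    # Write the value, then commit so it persists.
    syscfg set "$key" "$value" && syscfg commit
}
```

For example, set_selfheal_threshold avg_cpu_threshold 90 would lower the CPU threshold to 90, assuming the key is named avg_cpu_threshold on your build. Confirm the key name against the screenshot or the device's syscfg database before use.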

Observation in /rdklogs/logs/SelfHeal.txt.0:

RDKB_SELFHEAL : Total memory in system is 949444 at timestamp 2019:09:24:10:17:08
RDKB_SELFHEAL : Used memory in system is 148772 at timestamp 2019:09:24:10:17:08
RDKB_SELFHEAL : Free memory in system is 800768 at timestamp 2019:09:24:10:17:08
RDKB_SELFHEAL : AvgMemUsed in % is 15
190924-10:17:09.055074 <128>CABLEMODEM[Raspberry]:<99000006><2019:09:24:10:17:09><B8:27:EB:50:C1:CF><ARMv7> RM Memory threshold reached
 RDKB_SELFHEAL : Total memory in system is 949444
 RDKB_SELFHEAL : Used memory in system is 148752
 RDKB_SELFHEAL : Free memory in system is 800792
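The AvgMemUsed figure in the log can be reproduced from the total and used memory values. The following is a minimal sketch of that arithmetic, assuming a plain integer percentage; mem_used_percent and threshold_reached are hypothetical helper names, and the actual resource_monitor.sh may average several samples before comparing.

```shell
#!/bin/sh
# Integer percentage of memory in use.
# $1 = total memory, $2 = used memory (same unit for both).
mem_used_percent() {
    total="$1"
    used="$2"
    echo $(( used * 100 / total ))
}

# Succeeds (returns 0) when usage has reached the threshold.
# $1 = usage in percent, $2 = threshold in percent.
threshold_reached() {
    [ "$1" -ge "$2" ]
}
```

With the values from the log above, mem_used_percent 949444 148772 prints 15, which matches the "AvgMemUsed in % is 15" line.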

     

3.2. Process Monitor - Monitors Processes Periodically Based on Process IDs

If Self Heal detects that a monitored process is not running, it automatically restarts that component.

Take the CcspLMLite component as an example:

  1. Run the ps command to confirm that CcspLMLite is running, and note its process ID.

         ps aux | grep Ccsp

  2. Kill the CcspLMLite process, replacing <CcspLMLite PID> with the process ID noted in step 1.

         kill -9 <CcspLMLite PID>

  3. Verify that the CcspLMLite process was killed.

         ps aux | grep Ccsp

  4. After 60 seconds (the default), Self Heal automatically restarts the process. Run the ps command again and check that CcspLMLite is running with a different process ID.
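The manual steps above can be combined into one check. The following is a minimal sketch, not part of the Self Heal scripts: was_restarted and verify_selfheal_restart are hypothetical helper names, pidof is used in place of "ps aux | grep" to look up the process ID, and the 90-second default wait is an assumption chosen to leave some margin over the 60-second interval.

```shell
#!/bin/sh
# Succeeds when a process came back with a new PID: the new PID must be
# non-empty and different from the PID recorded before the kill.
was_restarted() {
    [ -n "$2" ] && [ "$2" != "$1" ]
}

# Kill a process by name, wait, and report whether it was restarted.
# $1 = process name, $2 = seconds to wait (assumed default: 90).
verify_selfheal_restart() {
    name="$1"
    wait_seconds="${2:-90}"
    old_pid=$(pidof "$name") || { echo "$name is not running" >&2; return 1; }
    # Left unquoted on purpose: pidof may print more than one PID.
    kill -9 $old_pid
    sleep "$wait_seconds"
    new_pid=$(pidof "$name" || true)
    if was_restarted "$old_pid" "$new_pid"; then
        echo "$name restarted: PID $old_pid -> $new_pid"
    else
        echo "$name was not restarted within ${wait_seconds}s" >&2
        return 1
    fi
}
```

On the device, verify_selfheal_restart CcspLMLite kills the process and reports the old and new process IDs once Self Heal has restarted it.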

3.3. Connectivity Test - Ping Functionality

If the connectivity test fails, the device reboots.

Validation: To trigger a connectivity test failure, unplug the Ethernet LAN cable or bring the erouter0 interface down:

         ifconfig erouter0 down

4. Troubleshooting

  1. Use the Self Heal logs to troubleshoot run-time errors. The logs are written to the following path:

         /rdklogs/logs/SelfHeal.txt.0

  2. Resource Monitor sample logs:

     Memory:

         RDKB_SELFHEAL : Total memory in system is 949444 at timestamp 2019:09:24:10:17:08
         RDKB_SELFHEAL : Used memory in system is 148772 at timestamp 2019:09:24:10:17:08
         RDKB_SELFHEAL : Free memory in system is 800768 at timestamp 2019:09:24:10:17:08
         RDKB_SELFHEAL : AvgMemUsed in % is 15
         190924-10:17:09.055074 <128>CABLEMODEM[Raspberry]:<99000006><2019:09:24:10:17:09><B8:27:EB:50:C1:CF><ARMv7> RM Memory threshold reached
         RDKB_SELFHEAL : Total memory in system is 949444
         RDKB_SELFHEAL : Used memory in system is 148752
         RDKB_SELFHEAL : Free memory in system is 800792

     CPU:

         190924-10:17:09.055074 <128>CABLEMODEM[Raspberry]:<99000006><2019:09:24:10:17:09><B8:27:EB:50:C1:CF><ARMv7> RM CPU threshold reached

  3. Process Monitor sample logs:

     CcspLMLite process:

         RDKB_SELFHEAL : <128>CABLEMODEM[Raspberry]:<99000007><2019:09:24:09:20:34><B8:27:EB:50:C1:CF><ARMv7> RM CcspLMLite process not running , restarting it

         RDKB_SELFHEAL : Resetting process CcspLMLite

  4. Connectivity Test sample logs:

     Successful scenario:

         190924-08:56:43.577621 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
         190924-08:56:43.583217 [RDKB_SELFHEAL] : IPv4 GW Address is:192.168.30.1
         190924-08:56:43.588370 [RDKB_SELFHEAL] : IPv6 GW Address is:
         190924-08:56:43.622618 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
         190924-08:56:43.730057 DNS Response: Got success response for this URL www.google.com

     Failure scenario:

         191007-09:00:13.899713 [RDKB_SELFHEAL] : GW IP Connectivity Test Successfull
         191007-09:00:13.909201 [RDKB_SELFHEAL] : IPv4 GW Address is:192.168.60.1
         191007-09:00:13.918684 [RDKB_SELFHEAL] : IPv6 GW Address is:
         191007-09:00:13.972966 RDKB_SELFHEAL : Ping server lists are empty , not taking any corrective actions
         191007-09:00:14.119985 DNS Response: fail to resolve this URL www.google.com
         191007-09:00:14.152808 RDKB_SELFHEAL : Taking corrective action
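When troubleshooting, it helps to pull only the event lines out of the Self Heal log. The following is a minimal sketch: selfheal_events is a hypothetical helper name, and the search patterns are taken from the sample logs above, so other builds may word these messages differently.

```shell
#!/bin/sh
# Print only the Self Heal log lines that record a threshold breach,
# a process restart, a DNS failure, or a corrective action.
# $1 = log file (default: /rdklogs/logs/SelfHeal.txt.0)
selfheal_events() {
    logfile="${1:-/rdklogs/logs/SelfHeal.txt.0}"
    grep -E 'threshold reached|process not running|fail to resolve|Taking corrective action' "$logfile"
}
```

Running selfheal_events with no argument reads the default log path; pass another file name to inspect a copy of the log taken off the device.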





      

