Troubleshooting & Debugging - Terms & Steps

Notes and basics from the Google Automation Troubleshooting & Debugging Coursera course

  • Troubleshooting - fixing a problem in a system
  • Debugging - fixing a problem in program code
  • Cache - stores data in a form that's faster to access than its original form.
  • Memory leak - when memory that is no longer needed is not released.
  • WEEK 2 Videos - review for terms and steps!

Debugging Process

  • Figuring out a problem and its solution requires some creativity.
  • We need to come up with new ideas about what could be failing, and ways to check for that.
  • Once we know what's failing, we need to imagine how to solve it.
  • To take it a step further, once we've solved a problem, we can start thinking about how to prevent it from happening again.

Steps to solve a problem

  1. Gathering information
    • A super important resource for solving a problem is the reproduction case, which is a clear description of how and when the problem appears.
  2. Finding the root cause
    • Usually the most difficult step.
    • Key: get to the bottom of what's going on, what triggered the problem, and how we can change that.
  3. Performing the necessary remediation
    • Might include:
      • an immediate remediation to get the system back to health,
      • and then a medium- or long-term remediation to avoid the problem in the future.

User Reports: It doesn't work!

To be able to reproduce the problem and start getting to its root cause,
ASK:
  • What were you trying to do?
  • What steps did you follow?
  • What was the expected result?
  • What was the actual result?
Then try to recreate the problem yourself, so you can work on it without the user.
Recreating it also rules out the user or the user's computer as the cause of the problem.
Next, you might try other apps on the same server. SSH into the server and run top to see how busy it is relative to its number of cores.
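
The "how busy relative to the number of cores" check can also be scripted. This is a minimal sketch, not something from the course: the function names are made up for illustration, and the 1.0 threshold (load average equal to the core count) is only a rule of thumb assumed here.

```python
import os


def load_ratio(load_average: float, cores: int) -> float:
    """Load average divided by the core count."""
    return load_average / max(cores, 1)


def is_overloaded(load_average: float, cores: int, threshold: float = 1.0) -> bool:
    """True when the load per core is above the threshold (assumed rule of thumb)."""
    return load_ratio(load_average, cores) > threshold


def current_load_report() -> str:
    """Describe the 1-minute load of the machine this script runs on."""
    one_minute, _five, _fifteen = os.getloadavg()  # Unix-like systems only
    cores = os.cpu_count() or 1  # cpu_count() can return None
    status = "overloaded" if is_overloaded(one_minute, cores) else "ok"
    return f"load {one_minute:.2f} on {cores} cores: {status}"


if __name__ == "__main__":
    print(current_load_report())
```

A load of 8 on a 4-core server suggests more work is queued than the cores can handle, while the same load on a 16-core server is unremarkable.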

Creating a reproduction case


A reproduction case is a way to verify whether the problem is present or not.
If you can recreate the problem, it is not specific to the user's system.
If you can't, suspect the user's environment or configuration.
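
One way to make a reproduction case concrete is a small script that runs the reported steps and compares the actual result with the expected one. The sketch below is only an illustration: `problem_reproduces` and the quantity-parsing report are hypothetical, not taken from the course.

```python
from typing import Any, Callable


def problem_reproduces(action: Callable[[], Any], expected: Any) -> bool:
    """Return True if the problem is present: the action fails or gives an unexpected result."""
    try:
        actual = action()
    except Exception as exc:
        print(f"Reproduced: action raised {exc!r}")
        return True
    if actual != expected:
        print(f"Reproduced: expected {expected!r}, got {actual!r}")
        return True
    print("Not reproduced: actual result matches expected result")
    return False


# Hypothetical user report: "entering 1,000 as a quantity crashes the form".
def parse_quantity() -> int:
    return int("1,000")  # int() does not accept a thousands separator


if __name__ == "__main__":
    problem_reproduces(parse_quantity, 1000)
```

Once the script reports the problem reliably, the same script can be rerun after a fix to confirm that the problem is gone.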

Look at logs for errors:

  • Linux: system logs like /var/log/syslog and user-specific logs like the .xsession-errors file located in the user's home directory.
  • Mac: system logs and the logs stored in the Library/Logs directory
  • Windows: use the Event Viewer tool (see "6 Ways to Open Event Viewer in Windows 10", isunshare.com)
If there are no errors, ask where the problem happens: on one computer or across an office area? With one file or a whole directory?
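
Looking through a log for errors can be sketched in a few lines. The keyword list below is an assumption (real logs use many different wordings), and `find_error_lines` and `scan_log` are illustrative names, not part of any tool mentioned above.

```python
import re
from typing import Iterable, List

# Assumed keywords; adjust them to what the application actually logs.
ERROR_PATTERN = re.compile(r"error|fail|critical", re.IGNORECASE)


def find_error_lines(lines: Iterable[str], pattern: "re.Pattern[str]" = ERROR_PATTERN) -> List[str]:
    """Return the lines that match the error pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_log(path: str) -> List[str]:
    """Scan a log file on disk; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as log_file:
        return find_error_lines(log_file)
```

For example, `scan_log("/var/log/syslog")` would list the matching lines from the Linux system log, provided the file exists and the current user is allowed to read it.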

How do we go about finding the actual root cause of the problem? We generally follow a cycle: look at the information we have, come up with a hypothesis that could explain the problem, and then test that hypothesis. If we confirm our theory, we've found the root cause. If we don't, we go back to the beginning and try a different possibility.

Dealing with Intermittent Issues

Log more info
Look at:
  • Load on the computer
  • Processes running at the same time
  • Use of the network
If observing the problem causes the error to go away, it is a heisenbug.
If turning it off and on makes the error go away, it is probably a resource problem, since a restart resets many things.
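
For "log more info", one option is to wrap the flaky operation so that every run records some context, and failures record it again with the traceback. This is a rough sketch under assumptions: the `snapshot` fields and the `run_with_context` helper are made up for illustration, and a real setup would also log whatever is specific to the application (network use, other running processes).

```python
import logging
import os
import time
from typing import Any, Callable, Dict

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("intermittent")


def snapshot() -> Dict[str, Any]:
    """Collect cheap context about the machine at this moment."""
    info: Dict[str, Any] = {"pid": os.getpid(), "cores": os.cpu_count()}
    try:
        info["load_1m"] = os.getloadavg()[0]  # not available on every platform
    except (AttributeError, OSError):
        pass
    return info


def run_with_context(operation: Callable[[], Any]) -> Any:
    """Run the operation, logging context before it starts and again if it fails."""
    name = getattr(operation, "__name__", repr(operation))
    logger.debug("starting %s with context %s", name, snapshot())
    started = time.monotonic()
    try:
        return operation()
    except Exception:
        elapsed = time.monotonic() - started
        logger.exception("%s failed after %.3fs, context %s", name, elapsed, snapshot())
        raise
```

Because the context is written on every run, the log still has something to compare against when the error shows up only once in a while.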

