Today’s topic is pretty simple, I think. Every business owner will come across it in their career: how to solve a problem. I know it sounds simple, and everybody says, “I should be able to solve problems,” but most of the time people don’t solve problems, they just put a Band-Aid on them. A lot of business owners do this. I’m telling you today: do not Band-Aid problems. You always want to find the root cause. I’ll say it again: do not Band-Aid problems, find the root cause. Today we’re going to give you the five steps that we use every time to find the root cause of a problem, so that our customers don’t experience the same issues over and over. Okay, let’s go through them.
Number one: define the problem. You want to note the time that it happens, collect the data on it, and check the logs for that time. For example, when somebody keeps getting disconnected from the internet and we get the call, we always ask, “Hey, can you write down the day and the time that this happened?” Then we go check the logs to see if we can line up the time of the issue with something that is being captured in the logs. So that, again, is defining the problem: when is it happening, who is it happening to, and what applications are they running? You narrow in on the problem by noting these little bits of information and then looking in the logs.
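To make the "line up the time with the logs" idea concrete, here is a minimal Python sketch. It assumes a made-up log format where every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp; the function name `lines_near`, the sample log lines, and the 5-minute window are all illustrative assumptions, not the format of any particular router or firewall.

```python
from datetime import datetime, timedelta

# Assumed (hypothetical) log format: "2024-03-04 12:15:02 message text"
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"

def lines_near(log_lines, reported_at, window_minutes=5):
    """Return the log lines stamped within window_minutes of reported_at."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in log_lines:
        stamp_text = line[:19]  # the first 19 characters hold the timestamp
        try:
            stamp = datetime.strptime(stamp_text, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - reported_at) <= window:
            matches.append(line)
    return matches

# Invented sample data for illustration only.
sample_log = [
    "2024-03-04 11:02:10 dhcp lease renewed",
    "2024-03-04 12:14:58 wan link down",
    "2024-03-04 12:19:47 wan link up",
    "2024-03-04 15:30:00 config saved",
]
reported = datetime(2024, 3, 4, 12, 15)  # the day and time the user wrote down
for line in lines_near(sample_log, reported):
    print(line)
```

With the sample data above, only the two "wan link" entries fall inside the window, which is the kind of match you are hoping to find between the user's report and the logs.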
Number two: capture the problem. This is like capturing a wild animal on a video surveillance camera: you want to capture that Chupacabra in the wild. So try to line up the logs with when the problem is happening. You can ask your user, “Hey, what time did it happen?” “Oh, it’s 12:15 every day, the internet disconnects for 5 minutes.” That’s a good time to go look at in the logs. You might also ask, “Hey, can you do a screen capture of the error message you get when this happens?” or of whatever symptom the user is seeing. They can send the screen capture to you, and that is how you capture a problem. That’s number two: you can’t solve a problem if you can’t capture it.
Number three: break the system on command. Say you find out, “Hey, if it’s my lunch break and I start watching a bunch of videos, a few minutes later the internet connection goes down.” That may be a repeatable step. So you want to find out how you can break the system: I do this and it breaks, I do this again and it breaks again. If you know what breaks the system, you have a much better chance of finding out what is actually going to fix it. And once you can reproduce the problem, you know you’re on the right track to finding the root cause.
Number four: implement a fix. This works like an if-then statement: you implement a fix, then you retest the steps that broke the system on command. If the fix doesn’t work and the system is still breaking, you simply go back to number two and continue capturing the problem in the logs and from the user’s experience, and you keep digging for the root cause. If the problem is resolved, you go on to step number five.
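Since step four is described as an if-then statement, it can be written out as one. The Python below is only a sketch of that loop under stated assumptions: `troubleshoot`, `reproduce`, `capture_evidence`, and the two example fixes are hypothetical names standing in for whatever manual steps you actually perform.

```python
def troubleshoot(reproduce, candidate_fixes, capture_evidence):
    """Try fixes one at a time, retesting the breaking steps after each.

    reproduce()        -> True if the problem still occurs (step three)
    candidate_fixes    -> list of (name, apply_fix) pairs (step four)
    capture_evidence() -> gather more logs and user reports (step two)
    """
    for name, apply_fix in candidate_fixes:
        apply_fix()
        if not reproduce():
            return name  # no longer breaks: move on to step five
        capture_evidence()  # still breaking: go back to step two
    return None  # nothing worked yet, keep digging for the root cause

# Hypothetical scenario: streaming video at lunch knocks the connection offline.
state = {"video_limited": False}
evidence = []

def still_breaks():
    return not state["video_limited"]

fixes = [
    ("restart the router", lambda: None),  # a Band-Aid that changes nothing
    ("limit video bandwidth", lambda: state.update(video_limited=True)),
]
result = troubleshoot(still_breaks, fixes, lambda: evidence.append("more logs"))
print(result, len(evidence))
```

In this made-up run the first fix does not stop the breakage, so evidence is captured once more before the second fix passes the retest.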
Number five: confirm the fix. There are times when you fix something and it’s fixed for 5 minutes, but it’s not permanently fixed. To confirm that you have resolved the problem and fixed it for good, you want to set up a monitor. If the problem was creating a log entry every time it happened, you want a monitor on that log that sends you an alert whenever the entry shows up, so you know right then that it is happening again. Then you want to go back periodically: a couple of days later, a couple of weeks later, and a couple of months later. Most of the time, if you go back a couple of months later and the problem has not returned, you have confirmed that you found the root cause and that the fix you implemented resolved the issue permanently.
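A monitor like that can be very small. The Python sketch below is one possible shape for it, not a specific monitoring product: `check_log`, `follow_up_dates`, the `wan link down` pattern, and the exact intervals (2 days, 2 weeks, 60 days) are assumptions chosen to mirror "a couple of days, weeks, and months."

```python
import re
from datetime import date, timedelta

def check_log(log_lines, pattern, alert):
    """Call alert(line) for every log line matching pattern; return the match count."""
    matcher = re.compile(pattern)
    hits = 0
    for line in log_lines:
        if matcher.search(line):
            alert(line)
            hits += 1
    return hits

def follow_up_dates(fixed_on):
    """Dates to come back and confirm the fix (assumed intervals)."""
    return [
        fixed_on + timedelta(days=2),   # a couple of days later
        fixed_on + timedelta(weeks=2),  # a couple of weeks later
        fixed_on + timedelta(days=60),  # roughly a couple of months later
    ]

# Invented log lines; a real alert would send an email or text
# instead of appending to a list.
alerts = []
new_lines = [
    "2024-03-10 09:00:01 dhcp lease renewed",
    "2024-03-10 12:15:03 wan link down",
]
hits = check_log(new_lines, r"wan link down", alerts.append)
print(hits, alerts)
print(follow_up_dates(date(2024, 3, 4)))
```

If `hits` stays at zero on each of the follow-up dates, that is good evidence the fix addressed the root cause rather than just covering the symptom.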
These are the five steps that we use all the time. I highly recommend that you use them too. Try them out, and I think you’ll find that a lot more of your problems get resolved and a lot fewer come back.
To recap:
Number one, you want to define the problem: when is it happening, who is it happening to, and what applications are they running?
Number two, you want to capture that problem. You want to capture it in a screen capture or in the logs, so that you can search on what those errors are saying.
Number three, you want to be able to break the system on command, so you can say, “Hey, every time I do this it breaks the system.”
Number four, you implement a fix, then go back and repeat the steps that used to break it. If it isn’t breaking anymore, then you know that you’ve actually fixed the root cause of the problem.
Number five, you monitor for the problem and go back to confirm over a period of time. If you go back over a couple of months and it’s still not happening, you know that you have fixed that problem.