CompTIA CySA+ CS0-002 – Automation Concepts and Technologies part 2

  1. Machine Learning (OBJ 3.4)

Machine learning. In this lesson we’re going to talk about machine learning and a couple of related concepts: artificial intelligence, or AI; machine learning, or ML; and deep learning. First, let’s talk about artificial intelligence. Artificial intelligence is the science of creating machines with the ability to develop problem-solving and analysis strategies without significant human direction or intervention. Essentially, we want to have a machine that can think for itself. Now, there are a lot of great things that we can do with artificial intelligence, especially in cybersecurity. The original AI systems were expert systems, which used if-then-else statements to make decisions based on a limited data set, using knowledge bases and set rules.

But modern AI can think for itself, and that’s really where the benefit comes. We’re going to talk about this in terms of machine learning, because machine learning is a component of AI that enables machines to develop strategies for solving a given task when they are given a labeled data set, where the features have been manually identified, but without further explicit instructions. The concept here is that you have to train the machine. If you don’t teach the machine what you want it to know, it’s not going to know how to categorize things. And machine learning works really well when you’re dealing with labels or categorizations.

So for example, if I go through a data set and say this is malware, this is not, this is malware, this is not, and I train the machine, it can then take over using its behavioral engine and machine learning to identify on its own what is and what is not malware. Now, this isn’t a rule-based set, but by training it with a large data set, over time it can start learning on its own. Let me give you a real-world example of this. One of the earlier machine learning case studies was training a machine to identify what was a party and what wasn’t. And so they started showing it images. So for example, if I showed the computer an image like this, I would categorize it and say this is a party. There’s a bunch of people there, they’re having a good time, they’re playing with some confetti.

Looking at it as a human, I would say yes, this is a party. Then they would show it another image and the human would sit there and look at it and say no, that doesn’t look like a party to me. Looks like they’re at the office working. No, that’s not a party. And the human would categorize it and they would keep doing this with images. The next one here, is this a party? No, it looks like they’re at a conference, they’re at a work event. They’re smiling, which is usually a sign of a party. But they’re obviously not at a party. And so I would say, no, this is not a party. Then I go to the next one. What about this one? There’s a couple of ladies dancing. That looks like a party, right? They’re probably having a good time either at a club or at a friend’s house.

And they have a drink in their hand and they’re having a good time at a party. So I would say, yes, that’s a party. And then I go to another one. Here’s one where people are sitting around a table, they’re eating, they’re having a good time. And there’s a lot of different people at this table. So it looks like they’re having a dinner party. So I can categorize that as a party and say, yes, it is. Now, what’s the problem with what I just did? I went through five images, which is a very limited data set. But let’s say I did this with 5,000 images. That would be enough for a computer to start making its own decisions on what’s a party and what isn’t. So, what is the problem that just happened when I used these images to train this machine learning engine? Well, the problem is I just trained this computer to be racist.

That’s right, because if you look back at these images I just went through, all the ones that were parties had men and women, but only people who were white. The only image that had somebody of a darker complexion, an African American or a Black person, happened to be the one at that business conference. So this computer has now just learned that for a party to exist, it has to have white people. This is a problem with machine learning, because if you give it a bad data set, you can train these machines to be racist, to be discriminatory, or to simply misclassify things and miss things. So you have to be very careful with the data sets you provide these machines so they can learn. Now, this is the danger with machine learning.

Machine learning is only as good as the data sets that are used to train it. So you have to keep this in mind when you’re creating your data sets. If you’re trying to train it on what malware looks like, you need to make sure that you properly identify what is malware and what isn’t as you’re feeding it those data sets. And this is the same thing we deal with for images or any other type of data set you’re feeding it. Now, the next concept we need to talk about is an artificial neural network, or ANN. This is an architecture of input, hidden, and output layers that can perform algorithmic analysis of a data set to achieve outcome objectives. Essentially, an artificial neural network is the set of pathways that are created based on that learning. As it’s learning, it’s starting to make its own feedback loops of what the right if-thens are.

If I see somebody holding a glass of champagne, that’s a party. If I see people eating food, smiling and having a good time, that’s a party. If I see people dancing, that’s a party. That’s all part of this neural network, and it’s all being developed on the fly by the computer based on what it’s learning. Now, a machine learning system can adjust its neural networks over time, and it does this to reduce errors and optimize the objectives, because it’s always trying to get better at identifying what you want it to identify. In my example, that’s identifying what is and is not a party. So at this point, we’ve talked about artificial intelligence and machine learning, and now we’re going to dive a little bit deeper and go into deep learning.
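To make the idea of a network adjusting itself to reduce errors a little more concrete, here is a minimal sketch of a single artificial neuron (a perceptron) written in Python. It is a deliberate simplification: a real ANN has hidden layers and far more inputs. The three features (confetti, dancing, at a desk) and the tiny data set are made up for illustration.

```python
# Toy perceptron: one "neuron" that nudges its weights whenever it is wrong.
# Hypothetical features per image: [confetti, dancing, at_desk], each 0 or 1.
samples = [
    ([1, 0, 0], 1),  # confetti          -> party
    ([0, 1, 0], 1),  # dancing           -> party
    ([1, 1, 0], 1),  # confetti, dancing -> party
    ([0, 0, 1], 0),  # sitting at a desk -> not a party
    ([0, 0, 0], 0),  # nothing notable   -> not a party
]

def predict(weights, bias, features):
    """Return 1 (party) if the weighted sum is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0

def train(samples, epochs=20, rate=1.0):
    """Classic perceptron rule: shift weights in the direction of the error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

weights, bias = train(samples)
for features, label in samples:
    print(features, "predicted:", predict(weights, bias, features), "labeled:", label)
```

Each wrong guess shifts the weights a little, which is the same error-reducing feedback loop described above, just on a very small scale.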

Now, when we talk about deep learning, this is a refinement of machine learning that enables a machine to develop strategies for solving a task given a labeled data set. Now, all of that so far sounds like machine learning. But here’s the difference: it does this without further explicit instructions. So I can just hand it a data set and it will start making its own determinations. I don’t have to do all the categorization for it. That’s the difference with deep learning. Deep learning uses complex classes of knowledge, defined in relation to simpler classes of knowledge, to make more informed determinations about an environment.

So I might start out giving it that simple data set and saying, this is a party, this isn’t a party. But then I turn it over to the machine and it can learn from there much better on its own what is and is not a party based on its own observations. Basically, it’s like a child. And when it starts out, it doesn’t know much. But as it learns and grows, it creates deeper and deeper connections inside its neural networks to make better decisions. So to help solidify what the difference is between machine learning and deep learning, let me give you an example that applies to the cybersecurity world. Let’s say I have network traffic and I’m going to take that as my input and I want to be able to categorize that and say this is benign or this is malicious, this is okay and this is something that’s bad and needs to be flagged.

Now, if I’m dealing with machine learning, I have a human who has to determine what those malicious factors are. Just like I sat there and said, this is a party, this is not a party, I would have to sit there and start training that system. So you might have a one-week period or a one-month period or even a six-month period where you have analysts who are actually going through and categorizing the traffic that you’re seeing as malicious or benign. And that is going to start training the computer on what each one looks like, and then the computer can take over. Now, when you deal with deep learning, you don’t even have a human there. You just send it the network traffic, and over time, it’s going to make its own decisions on what is benign and what is malicious, training itself.
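For the machine learning half of that comparison, where analysts label traffic first and the computer takes over afterward, a rough sketch might look like the following. It uses a very simple nearest-centroid classifier in plain Python. The two features (kilobytes sent out and number of distinct ports contacted) and all of the values are invented for illustration, and a real system would use many more features and would scale them.

```python
from statistics import mean

# Hypothetical analyst-labeled flows: ((kb_out, distinct_ports), label)
labeled_flows = [
    ((12.0, 2), "benign"),
    ((8.0, 1), "benign"),
    ((15.0, 3), "benign"),
    ((900.0, 40), "malicious"),
    ((750.0, 55), "malicious"),
    ((1100.0, 35), "malicious"),
]

def train_centroids(rows):
    """Average the feature vectors for each label the analysts assigned."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }

def classify(centroids, features):
    """Pick the label whose centroid is closest to the new flow."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: squared_distance(centroids[label], features))

centroids = train_centroids(labeled_flows)
print(classify(centroids, (10.0, 2)))    # a small, quiet flow
print(classify(centroids, (800.0, 50)))  # a large flow touching many ports
```

The important part is the order of operations: humans supply the labels, the model summarizes them, and only then does it classify new traffic on its own.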

And so we have those deeper connections that really start figuring out what the things are that make something malicious. Now, how would the computer know? Well, maybe it’s able to see your whole network, and it sees one computer that you took offline, reimaged, and put back online. It now knows there was something bad on that system. And based on that, it can start looking into those logs and figure out what it saw that may have been an indicator of malicious traffic. And so these things can learn over time. Now, are we there yet? Are we 100% there with deep learning, able to do all of this on its own without people? Not yet. But we are getting better and better all the time.

Now, a lot of people worry this is going to put humans out of jobs, but I will tell you that’s not going to be the case, because we still need people to make decisions. We still need people to look at those things. All the system is doing in this deep learning scenario is labeling the data. It’s saying this is bad or this isn’t bad, but then a human is going to look at it, verify that it is, and take follow-on actions. Now, some of the newer systems that are being built try to take the human out of the loop completely. But that is a very dangerous thing to do, because you’re relying solely on the computer’s decision, and it can take follow-on actions like removing that system from the network, reimaging the machine, and other things. So you have to keep those things in mind, too, when you’re deciding how far you want to go with machine learning and deep learning.

  1. Data Enrichment (OBJ 3.4)

Data enrichment. In this lesson, we’re going to talk about data enrichment and how we can use machine learning to help us with that. One of the best things that machine learning can assist us with is data correlation, because there is so much data across all of our systems, and even if we throw it all into our SIEM, it’s still a lot of data for us as analysts to go through. So by using machine learning, we can have it surface to the top what it thinks is the most important thing for us to look at. Now, when we do this, one of the things that it can also do to help us is data enrichment. Data enrichment is the process of incorporating new updates and information into an organization’s existing database to help improve its accuracy.

So if I have my SIEM with all of the things I saw in my network, but I don’t have any threat information from third-party sources, I may not catch something. But by taking machine learning and doing data enrichment, bringing all that data together, I can have different open-source feeds, other partner systems, and my own systems telling me what I’m seeing. And that can help me figure it out. Now, AI-based systems can help combine all of these indicators from multiple threat feeds to help reduce the false positives and false negatives inside of our systems. So it makes our systems even better by using this as an enrichment technique. Again, our goal here isn’t to eliminate the person. It’s to make sure that the person has the best information to make the best human decisions.
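As a rough illustration of what enrichment looks like in practice, here is a small Python sketch that tags SIEM events with hits from two threat feeds and then sorts them so that the events flagged by the most feeds rise to the top. The feed contents, field names, and scoring are all assumptions made up for this example (the IP addresses come from documentation ranges), and the sketch is a plain lookup rather than machine learning.

```python
# Hypothetical threat feeds: indicator (IP address) -> description
open_source_feed = {"203.0.113.7": "known C2 server", "198.51.100.23": "scanner"}
partner_feed = {"203.0.113.7": "botnet controller"}

# Hypothetical events already collected in the SIEM
siem_events = [
    {"src": "10.0.0.5", "dst": "203.0.113.7"},
    {"src": "10.0.0.8", "dst": "192.0.2.10"},
    {"src": "10.0.0.9", "dst": "198.51.100.23"},
]

def enrich(events, feeds):
    """Attach matching feed names to each event and rank by number of hits."""
    enriched = []
    for event in events:
        hits = [name for name, feed in feeds.items() if event["dst"] in feed]
        enriched.append({**event, "feed_hits": hits, "score": len(hits)})
    # Events confirmed by more feeds are surfaced first for the analyst.
    return sorted(enriched, key=lambda e: e["score"], reverse=True)

feeds = {"open_source": open_source_feed, "partner": partner_feed}
results = enrich(siem_events, feeds)
for event in results:
    print(event)
```

An indicator that shows up in more than one independent feed is less likely to be a false positive, which is why the sketch ranks by the number of matching feeds.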

Now, let me give you an example. Let’s say that I needed to create malware signatures. That was my job. I’m a malware analyst doing reverse engineering, and I’m going to go through and take apart those binaries, decompile them, look at the hex code, and figure out exactly what they’re doing. And from that, I can create a signature. So here on the screen, you can see an example of an old piece of malware called the ILOVEYOU bug. Now, in this one, there were certain pieces of code that you can pull out as your signature. Now, that’s great, but what happens if somebody changes one of those bytes of code? Well, now it has a different signature, and that means I can miss it. And if somebody keeps doing that by changing the code, it’s going to make my job as a malware analyst really, really hard, because I’m going to have to keep going back and creating new signatures all the time.

If I use AI, I can give it a basic signature and say, this is what I’m looking for. Now, if you find any variations that look similar to this, go ahead and flag those as well, because somebody might have changed one bit here or one bit there. And by doing that, we can integrate AI and machine learning to help us still identify what is malicious, even if they change the code. Now, AI-based systems can really help us as we’re trying to identify this malware that’s been jumbled up. And they can do it a lot better than their human counterparts. People like me just can’t do it nearly as fast as a computer can.
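To show why an exact signature is so brittle, and what "flag anything that looks similar" might mean, here is a small Python sketch. It slides a signature across a sample and reports the best fraction of matching bytes, so a sample with one changed byte still scores high. The signature bytes, the samples, and the 0.9 threshold are placeholders invented for this example, and this is simple similarity matching rather than a trained model.

```python
def best_match_ratio(signature: bytes, sample: bytes) -> float:
    """Slide the signature over the sample; return the best fraction of matching bytes."""
    n = len(signature)
    if n == 0 or len(sample) < n:
        return 0.0
    best = 0.0
    for start in range(len(sample) - n + 1):
        window = sample[start:start + n]
        matches = sum(a == b for a, b in zip(signature, window))
        best = max(best, matches / n)
    return best

def is_variant(signature: bytes, sample: bytes, threshold: float = 0.9) -> bool:
    """Flag the sample if some window is at least `threshold` similar to the signature."""
    return best_match_ratio(signature, sample) >= threshold

# Placeholder signature and samples, not taken from any real malware.
signature = b"EXAMPLE-SIGNATURE-BYTES"
exact = b"...." + signature + b"...."
mutated = b"....EXAMPLE-SIGNATURE-BYTEZ...."   # one byte changed
unrelated = b"an ordinary harmless document"

print(signature in mutated)            # the exact-match check misses the variant
print(is_variant(signature, mutated))  # the similarity check still flags it
print(is_variant(signature, unrelated))
```

An exact match fails the moment one byte changes, while the similarity score barely moves. A real AI-based engine would learn which differences matter instead of relying on a fixed threshold.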

These AI-based systems do have their struggles, though. They’re not really good at identifying things like malicious administrative actions. So if somebody went in there and did something malicious as an administrator, that’s really hard for them to detect, because what is malicious about somebody changing a password or creating a new user account or changing a policy setting or moving files around? All of those are normal actions we do on a daily basis. What makes them malicious is the intent behind them. And so if somebody’s creating a new user account to have an extra admin account, so that if they get kicked off the system they can log back in, that’s malicious. But just creating a new admin account, that’s not necessarily malicious.

That might happen just because I hired a new system administrator. And so you have to think about these things. And this is where humans are better at identifying things than AI-based systems. Now, when you’re dealing with machine learning, remember, as I said before, machine learning is only as good as the data sets that you use during training. If you give it really bad data sets, you’re going to have a lot of false positives. If you have really good data sets, you’re going to be able to catch a lot more malware and a lot more threats. So keep that in mind. And this is a place where using the right training and the right data sets really does pay off for your machine learning.

  1. SOAR (OBJ 3.4)

SOAR. Now, when I talk about SOAR, I’m not talking about soaring like a bird. No, SOAR is an acronym, and it stands for security orchestration, automation, and response. This is a class of security tools that helps facilitate incident response, threat hunting, and security configuration by orchestrating and automating runbooks and delivering data enrichment. Basically, think about this as SIEM version 2.0. Now, SOAR is primarily used for incident response, but there is a large part of it that’s used for threat hunting as well. But really, the number one place you’re going to see SOAR used is incident response, because it can automate so many of your actions. Now, as I said, I like to think about this as SIEM 2.0.

Essentially, it’s a next-generation SIEM. This takes a security information and event management system and integrates it with SOAR. And when you put those two together, this really does become your next-generation SIEM, just like with next-generation firewalls. They took you from dealing with layer three and layer four and brought you all the way up to layer seven. It made it just so much better and so much more capable. Same thing here. When you integrate a SOAR with a SIEM, you get this really awesome product. It’s going to give you the ability to scan security and threat data to identify different things. You can then analyze it using machine learning.

And then you can also automate the process of doing data enrichment to make the data inside that SIEM even more powerful for you as an analyst to use. And finally, you can do incident response. So you can provision new resources. That means you can create new accounts, you can create new VMs. If you’re using VDI, you can actually delete somebody’s infected box and then create a new virtual machine for them to use. And all this can be done using automated playbooks if you use this SOAR capability. Now, I just mentioned the word playbook. What exactly is that? Well, a playbook is essentially a checklist of actions that you’re going to perform to detect and respond to a specific type of incident.

So you might say, hey, if I have an alert that says there is a phishing campaign and somebody clicked a link on this machine, we’re going to do steps one through ten, and then we’re going to reimage their machine and give them a new computer. Those might be your steps. So, for example, if you have somebody who clicked on a link in a phishing campaign, you might have steps one through five, which say: go to the machine, isolate it from the network, do a virus scan to make sure they haven’t infected themselves, check the registry to make sure there’s nothing in there for persistence, and then back up all the user data, reformat the computer, reinstall it, and put their data back on.

These might be the actions you’re going to do. Now, these could be manual or automated, but in the case of a playbook, usually you’re talking about just the steps involved. Now, if I can automate a lot of that, that becomes a runbook. A runbook is an automated version of a playbook, and it leaves clearly defined interaction points for human analysis. For example, my SOAR might say, if somebody clicks a link in a phishing email, do these steps one through five.

When you get to step two, pause and send it to an analyst, who will then say reimage the machine or don’t reimage the machine. These are the ways that we can use these things. And they all work together to create a better environment and to help reduce the workload of our analysts. Because again, we only have so many cybersecurity professionals, and if we’re having them waste their time on very minor things that we can automate, that’s not very helpful to us. So instead, we want to automate what we can, and SOAR allows us to do a lot of that automation.
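Here is a minimal Python sketch of the runbook idea: automated steps with one clearly defined interaction point where an analyst decides whether to reimage. Nothing here calls a real SOAR product. The function name, the host name, the step wording, where the pause sits, and the `approve` callback that stands in for the analyst are all assumptions for illustration, and the steps only record what would be done.

```python
def run_phishing_runbook(host, approve):
    """Run the automated steps, pausing once for an analyst decision.

    `approve` is a callable standing in for the analyst; it receives the host
    and the actions taken so far and returns True to reimage the machine.
    """
    actions = []
    # Automated steps: no human needed.
    actions.append(f"isolate {host} from the network")
    actions.append(f"run virus scan on {host}")
    actions.append(f"check registry on {host} for persistence")

    # Human interaction point: the runbook pauses here for a decision.
    if approve(host, actions):
        actions.append(f"back up user data on {host}")
        actions.append(f"reimage {host}")
        actions.append(f"restore user data on {host}")
    else:
        # Assumed alternative path when the analyst declines to reimage.
        actions.append(f"return {host} to the network")
    return actions

# Example: an analyst who approves, and one who declines.
approved = run_phishing_runbook("ws-042", lambda host, actions: True)
declined = run_phishing_runbook("ws-042", lambda host, actions: False)
print(approved)
print(declined)
```

The routine work runs without anyone touching it, and the analyst’s time is spent only on the one decision that actually needs human judgment.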

 

 
