Cyber Network Data Processing



Description

In this video, which plays from the beginning to 00:23:28, Dr. Vijay Gadepally discusses anomaly detection and data conditioning in machine learning.

Instructor: Vijay Gadepally

[SQUEAKING]

[RUSTLING]

[CLICKING]

 

VIJAY GADEPALLY: I hope you all are enjoying the class so far. It's always interesting-- we have people with a variety of backgrounds here, so it's just great to kind of hear the questions as we're going along.

So I'm going to give a quick example to begin the class right now that drives home some of the concepts we talked about in the first class. But this is real rather than a cartoon neural network-- it's an actual research result that we have.

Before I begin, I'd like to thank Emily Do, one of my graduate students, who actually did all the work. I just put some of the slides together-- not even all of them, just a few. So really, the credit for this work goes to Emily. Anything I misrepresent is my fault, not hers. She graduated, so I'll take the blame for anything that's not interesting. So with that, we'll begin.

All right, so the overall goal of this project was really to detect and classify network attacks from real internet traffic. And in order to do this, as many of you can imagine, we had to find a data set that was of interest. So this is probably a problem many of you are currently thinking about, which is what data set should I get my hands on? Right, so we have a variety of data sets. Some are sensitive in nature, right? So think of internal network traffic that we're trying to collect. No one's going to let us hand this over to graduate students to kind of work on, and then more importantly, publish.

So the first thing that we wanted to do was look for a data set that was open and out there. And we're fortunate that there is a group in Japan called the MAWI working group-- which stands for, and I'll use my cool tool here, the Measurement and Analysis of Wide-area Internet traffic working group-- that has a 10-gigabit link that they've tapped using a network tap. And they actually make this data available. It's continuously updated, even to today.

So that's really cool. And within that, there's a data set called the Day-in-the-Life of the Internet, which has been going on for multiple years. The data set is reasonably large when you convert it into what we call an analyst-friendly form, which is something that you or I could read and make some sense out of. It's about 20 terabytes in size. So it's a reasonably large data set, not going to fit on a single computer or a single node, so certainly a good use case for high performance computing or supercomputing.

And one additional piece that we should note-- and this is something that, as you're starting off your projects, you may also think about-- IP addresses are often seen as reasonably sensitive. So you can see where traffic is coming and going from. So in this particular data set, they've actually deterministically anonymized each of the IP addresses. And as you're downloading the data, you have to, essentially, sign a user agreement saying that you will not attempt to kind of go back and figure out what the original IP addresses were.

So as you're coming up with your data sets-- and I think Jeremy is going to talk a lot more about this in the next part-- you might think about, you know, are there certain fields in the data that I'm collecting that might be deemed sensitive, either today or in the future? And if so, are there simple techniques that I could use to anonymize the data? And then, with a decent end user agreement, you might be able to enforce some level of people not trying to break it and go back. It's not impossible to do, but any legitimate researcher who is using the data really shouldn't care about what the original IP addresses are in this particular case, or about other such data within whatever you're collecting.

All right, so just a quick definition of anomaly detection. I really love this definition of what an outlier is from a paper by Hawkins. So, "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." So when we're looking at network traffic, that's what we're looking for. Right?

We're looking for these mechanisms that are present in network traffic that deviate so much from normal behavior that it arouses suspicion. And it could be for a variety of reasons. It could be somebody uploaded this cool video of somebody falling flat on their face, or it could be a botnet, or DDoS attack, or something like that.

So an outlier is sometimes referred to as an anomaly, surprise, or exception. And within the context of cyber networks, these mechanisms can be botnets, C&C (command and control) servers, insider threats, or any other network attacks, such as distributed denial of service or port scan attacks.

There are a number of general techniques to deal with outlier detection. The first one is to look for changes-- essentially, to use statistics. You look at what's going on in your mechanism, you look for statistical anomalies from that, you highlight those anomalies, and then you dive deep into them. You could do clustering. So if you're looking at unsupervised learning, you could cluster your data, and things that are really far away from existing, or known, clusters are probably something that you want to take a look at.

Similar to that are distance-based techniques. These are closely related: you look for observations that vary significantly from other observations in some sort of a feature space. And finally, you could use a model-based technique, where you attempt to model the behavior of "normal"-- and I use air quotes for the word normal-- and then you look for things that deviate from this background model. So all four of these techniques are approaches to anomaly detection that are related to each other; there's no hard and fast rule about how they differ. A small sketch of the simplest, the statistical approach, follows.
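As a minimal illustration of the statistical approach-- an editorial sketch, not anything from this project-- here is a z-score test that flags observations far from the mean. The packet counts and threshold are made-up values.

```python
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Flag observations more than `threshold` standard
    deviations from the mean: the simplest statistical test."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.where(np.abs(z) > threshold)[0]

# Toy example: packets per minute, with one obvious spike.
counts = [980, 1010, 995, 1005, 990, 1000, 1015, 985, 1005, 995,
          9500,
          1000, 990, 1010, 1005, 995, 985, 1000, 1010, 990]
print(zscore_anomalies(counts))  # -> [10], the spike
```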

But given the popularity and the complexity of network traffic, we decided to go with a model-based technique, because we found that any other means of trying to represent the data would be difficult. This was a top-down approach to how we thought about it: let's look at network traffic, let's think about anomalies in network traffic, and let's come up with a good approach to address that.

So when we talk about network attacks-- not to go into detail-- there is a wide variety of network attacks out there. Every day we see news articles about new ones. The focus of this work was to look at just a couple of these. One was in this section that we call probing and scanning, and another was in this resource usage or resource utilization area. And specifically, we looked at port scanning and distributed denial of service attacks.

As a quick example of what happens in one of these network attacks-- this is an attack that often falls within the bucket of probing and scanning-- essentially, you have an attacker that attempts to find out what ports are open on a victim or target system. It does this by sending requests, similar to pings, to that system. If the target system acknowledges one of these, you can say, oh, this particular port is open. And then one may go and look at what software typically runs on those ports and attempt to use some of the known vulnerabilities of that software.

So a lot of software packages have specific ports that they tend to use. So you can say, oh, you know, port-- I'm making up a number, here-- port 10,000 is open, and I know that's used by Microsoft SQL server or any other piece of software. And then you can say, well, here is a known vulnerability that I could use, and then I'll try attacking with that. So this is just a simple technique that a lot of attackers use, very low bar, very easy to do. You could write one by mistake. But just to find out what ports are open, and then try to use known vulnerabilities on different ports.
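To make the mechanics concrete, here is a minimal sketch of that kind of TCP connect scan-- an editorial illustration, not the tooling used in this work. Only run it against machines you own or have permission to test.

```python
import socket

def scan_ports(host, ports, timeout=0.5):
    """Try a TCP connection to each port; an accepted connection
    (connect_ex returns 0) means the port is open."""
    open_ports = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) == 0:
                open_ports.append(port)
    return open_ports

# Scanning your own machine is the safe way to try this out.
print(scan_ports("127.0.0.1", range(8000, 8010)))
```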

Now, the least common denominator, the easiest piece of data to collect, is what's called the network packet. I'm sure many of you are familiar with the concept of a network packet. For those that are not, think of if you're sending a letter, it's all the material on the envelope. So it's the address, where it's coming from, who it's going to, plus a little bit of information of sort of what's in the package as well.

The reason that we often use network packets-- and I'll open this up to the class. A lot of cyber research actually focuses on network packets, but those often form a very small percentage of the actual data; the payloads are where a lot of the data actually is. Can anyone guess why we tend to use network packets rather than payloads? Not Albert?

AUDIENCE: Header.

VIJAY GADEPALLY: Sorry, the network header. Yes, the network header, not the whole packet. Can anyone guess why we tend to use the header rather than the payload of a packet?

AUDIENCE: Encrypted.

VIJAY GADEPALLY: Yeah, encryption, exactly. So very often, especially these days, the payload of the packet is typically encrypted, so there's not too much that you can do with it. The header is not encrypted, right? It's the outside of the envelope. The inside of the envelope is something that you cannot see by holding it up to the light. So that's why we tend to use the header of the packet.

So this is what a packet looks like, on the left side. And if you convert just the header information into a human readable form, this is the type of data that you get out of it. It gives you the things you would expect: What was the IP address that I started from? What was the port that I started from? What was the destination IP address? What was the destination port? Plus a bunch of other information: certain flags that were set, what direction the traffic is moving in, what type of packet it is, what protocol the packet is using, et cetera.
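As a small aside, here is a sketch of pulling exactly those header fields out of a capture using the scapy library; the file name "sample.pcap" is a placeholder for whatever capture you have.

```python
from scapy.all import rdpcap, IP, TCP  # pip install scapy

# Extract the header fields discussed above from a capture file.
for pkt in rdpcap("sample.pcap"):
    if pkt.haslayer(IP) and pkt.haslayer(TCP):
        ip, tcp = pkt[IP], pkt[TCP]
        print(ip.src, tcp.sport,    # source IP address and port
              ip.dst, tcp.dport,    # destination IP address and port
              ip.proto, tcp.flags)  # protocol number and TCP flags
```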

All right, so just as a reminder, the data we're using is from the MAWI working group, but there is a lot of other data out there that you might be able to get your hands on. So one of the challenges-- how many people here have worked in cybersecurity or done some research in cybersecurity? So, a handful of people.

One of the huge challenges-- and this is something that you may run into in your data sets-- is ground truth. If you're trying to find out when there was an attack in your data, there needs to be some ground truth associated with that. As much as we'd all love to sit and write DDoS attacks and port scan attacks against ourselves, getting that at scale can be very difficult.

So for the purpose of this work, we used a synthetic data generator, in which we took real internet traffic data but then injected synthetic attacks into it. And that was sort of our way of looking for-- of establishing ground truth.

Now, the MAWI working group actually does publish ground truth based on a number of detectors that they have; a lot of these are heuristics-based detectors. In our use of that ground truth, it was really difficult to actually understand when an attack was occurring.

It was sort of like, somewhere in this hour an attack occurred. Which, if you're trying to train a machine learning classifier on that or an anomaly detector, that's kind of vague. Because there's a lot happening in an hour of internet traffic, and so if it's somewhere in that hour an attack occurred, that can be very difficult to find. So that's sort of a reason that we focus on using a synthetic attack generator. Yeah?

AUDIENCE: Can you say a little bit more about what kind of data gets injected? You know, is it just additive, where you're showing some new compromised host that is injecting data to some destination? Are there also, like, return flows in there?

VIJAY GADEPALLY: Yep.

AUDIENCE: OK.

VIJAY GADEPALLY: Yeah, so the question that was asked is, I guess, how sophisticated is this tool, and how detailed can you make the injections? For the exact details of that, I would forward you to Emily, but from looking at what was being done, these are reasonably sophisticated. So it's not as simple as handing it a text file of all the IP addresses, source and destination-- but you do have to set some parameters.

So what you can do, for example, if it's a DDoS attack-- that's maybe the easiest one to describe-- is you would give it a range of IP addresses and ports that you'd like to see injected, you'd give it a flow rate, and then you could tell it what type of packets you'd like introduced. And this particular tool, at least, allows you to pick a number of these different features that it then injects into the original pcap file.
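Since the tool itself isn't named here, the following is a purely hypothetical sketch of that kind of parameterized injection: many random sources aimed at one victim, at a chosen rate, over a chosen window. None of these parameter names come from the actual generator.

```python
import random

def synth_ddos_flows(victim_ip, victim_port, n_sources,
                     start, duration, rate):
    """Hypothetical generator of synthetic DDoS flow records
    (many random sources, one victim) to mix into real traffic.
    Parameter names are illustrative, not any real tool's API."""
    flows = []
    for _ in range(n_sources):
        src = "10.%d.%d.%d" % (random.randint(0, 255),
                               random.randint(0, 255),
                               random.randint(1, 254))
        flows.append({
            "src_ip": src,
            "src_port": random.randint(1024, 65535),
            "dst_ip": victim_ip,
            "dst_port": victim_port,
            "start": start + random.uniform(0, duration),
            "packets": int(rate * duration / n_sources),
        })
    return flows

attack = synth_ddos_flows("192.0.2.7", 80, n_sources=500,
                          start=0.0, duration=60.0, rate=2000.0)
```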

If you're looking for some of the other attacks-- this is just a list of the attacks that the tool currently supports, as of a few months ago, and they might be adding to it. We focused on a few of these that were also easy for us to reason about, so we could go back into the actual data we collected and see whether the injection made sense to us or not. But there are a number of different parameters that you can pick when you're doing that.

So that's just a little bit on that. You know, as we talked about, in the data conditioning phase, labeling data is often a pretty difficult process. In the world of cyber anomalies, in the world of network packets and network security, going back and trying to label is an extremely time consuming task, very difficult, and up to a lot of interpretation as to what one calls an attack. So while we did manually inspect a few of these, that's certainly not scalable, and we decided to go with a synthetic attack generator. And I'm sure for some of your projects, you may also think about whether that's a route that you need to take, at least to start off with.

OK, so this is the overall pipeline of the system. As I mentioned the first time we chatted, we spent a majority of our time in the first step here, which is data conditioning. Within data conditioning, we take the raw data, which means downloading it from the internet. This comes to you in a binary format, which is pcap-- for those who are familiar with packet capture outputs, that's the binary format that comes from, say, tcpdump. We then convert that pcap file into flows, parse those into human readable form, do feature engineering, train a machine learning classifier, do the same thing with the synthetic data, and then get good results. Or we hope that we get good results.

So step one of data conditioning was to download the actual raw data. The data was downloaded in a compressed binary format, which is probably what a lot of you will get your hands on. A typical compressed file is about two and a half gigabytes. When you decompress that, it goes to about 10 gigabytes, so about a 4x expansion. And a single pcap file, about 10 gigabytes, corresponds to about 15 minutes' worth of traffic.

So if you're trying to do multiple days, you can imagine how this starts to explode in terms of volume. And those 15 minutes correspond to about 150,000 packets per second, on average. So this is a reasonable amount of network traffic that we're getting our hands on, but not even close to the types of volume that large scale providers have to deal with on a regular basis.

Once we have the pcap file downloaded, the network packets alone don't really tell us much; they're tough to do things with. So what we want to do is convert this into a representation that's useful for analysis. A lot of work has been done using network flows, and a network flow is, essentially, a sequence of packets from a single source to a destination.

So as you can imagine, when you're streaming a YouTube video, for example, it's not one big packet that comes to you, but a series of packets. And a flow essentially tries to detect that and say, OK, all of these packets were one person watching this video. So traffic from this source to this destination within some time window is defined as a network flow.

And so we convert each of these 15 minute pcap files into a network flow representation using a tool called yaf, which stands for Yet Another Flowmeter. That helps us establish these network flows. The size of about 15 minutes' worth of flows comes to about two gigabytes, and we set the flow timeout to be about five minutes. A small sketch of the flow idea follows.
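This is a minimal sketch of the 5-tuple-plus-timeout aggregation that flow meters like yaf perform; it is an editorial illustration, not yaf's actual implementation.

```python
FLOW_TIMEOUT = 300.0  # five minutes, as in the talk

def packets_to_flows(packets):
    """Group (timestamp, src, sport, dst, dport, proto, nbytes)
    records into flows keyed by the 5-tuple; a silence longer
    than FLOW_TIMEOUT starts a new flow."""
    active, finished = {}, []
    for ts, src, sport, dst, dport, proto, nbytes in sorted(packets):
        key = (src, sport, dst, dport, proto)
        flow = active.get(key)
        if flow is None or ts - flow["last"] > FLOW_TIMEOUT:
            if flow is not None:
                finished.append(flow)
            flow = {"key": key, "first": ts, "last": ts,
                    "packets": 0, "bytes": 0}
            active[key] = flow
        flow["last"] = ts
        flow["packets"] += 1
        flow["bytes"] += nbytes
    return finished + list(active.values())
```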

OK, once we have the flows, they're still in a binary format; we need to convert them into a representation that we can look at. So yaf comes with an ASCII converter called yafscii, and that essentially converts the data from this flow representation into a tabular form.

And if there's one lesson that I think Jeremy will talk to you a lot about next, it is tables, tables, tables. People love looking at tabular data, it's very easy to understand. I wouldn't recommend it, but you could open this up in Excel. It would be huge, but it would open up. But it's very easy to, kind of, look at, especially when we're trying to establish truth in the data and go back and take a look at it.

So for our pipeline, we take the yaf files and convert them into text files. Each of these ASCII tables-- again, about 15 minutes' worth of data-- is about eight gigabytes in size. And we record a number of different data pieces per flow. I won't go into the details, but if you're interested, you can look at the various entities or features that come up in a network flow.
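Once the data is tabular, loading and inspecting it is a one-liner with pandas; the column names below are placeholders, not yafscii's exact field names.

```python
import pandas as pd

# Illustrative column names only -- check the yafscii docs for
# the real field layout of your output.
cols = ["start", "end", "duration", "proto",
        "src_ip", "src_port", "dst_ip", "dst_port",
        "packets", "bytes", "flags"]
flows = pd.read_csv("flows.txt", sep=r"\s+", names=cols)
print(flows.head())
```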

So now that we have network flows, that's sort of like part one of data conditioning, right? We've converted it into some representation that makes sense. We have it in human readable form. So, step one data conditioning done.

Now, the next thing is to convert it into something that actually makes sense for our machine learning algorithm. And this falls into the area of feature engineering, when we think back on it. Feature engineering is really, really important; you'll spend a lot of time doing this on new data sets.

Each of these flows contains about 21 different features, and we need to figure out which of these features make sense to pass into a machine learning model. Otherwise, we're just going to end up training on noise.

So we had to use some domain knowledge, where we talked to experts in cybersecurity and said, well, you know, the IP address is kind of an important thing, but maybe this other flag is not as important. And then we used a lot of trial and error, and luck as well, to help pick some of these features.

Once we picked a set of features that we were interested in, we looked back at the anomaly detection literature, where a lot of work has been done on this concept: you have these flows, but you need to convert them into some other format in which it makes sense to look for anomalies. So it's about converting into a feature space in which anomalies make sense.

And so we looked at using entropy. The basic intuition behind this is that if we look at the entropy of the various fields across flows, it should be roughly similar over time-- about the same, not exactly equal-- and it should be relatively unchanging when there's no big mechanism, in the sense of that outlier definition, at work changing it.

So for example, just to make this more clear, if you're having a DDoS attack, this is typically a number of different source IP addresses attempting to talk to a single destination IP address. So you would expect to see an increase in the entropy of the source IP field in that particular example. And in port scan attacks, you would probably see something on the destination port entropy would go up. Right, that's a little bit of intuition, it's a little bit more complicated than that. But that's a very high level view of what's going on.

And for entropy, we just used the standard measure-- those of you who are familiar with information theory will know Shannon entropy. And we compute the entropy associated with each of the features that we picked from the network flows before. So from the feature engineering, we have a subset of features, and we compute an entropy for each.
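Concretely, for a window of flows you tally the values of one field and apply Shannon's formula. This small sketch, with made-up IP values, shows why a DDoS window drives source-IP entropy up.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy H = -sum(p * log2(p)) over the empirical
    distribution of `values`, e.g. all source IPs in a window."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

# Toy windows: a DDoS brings many distinct source IPs, so the
# source-IP entropy jumps.
normal = ["1.1.1.1"] * 50 + ["2.2.2.2"] * 50
attack = [f"10.0.0.{i}" for i in range(100)]
print(shannon_entropy(normal))  # 1.0
print(shannon_entropy(attack))  # log2(100), about 6.64
```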

We all remember what neural networks are-- this is just here for completeness. So we take these entropies associated with the network flows and use them as input features for a neural network model. You'll recall from the neural network talk that we had last week: we have inputs and outputs, we have weights associated with them, and then we have this nonlinear equation that represents inputs and outputs at each of the layers. That was the equation. I can see Barry giving me the cue that I'm walking towards the slides.

AUDIENCE: Oh no.

VIJAY GADEPALLY: So the features that we have over here correspond to the various entropies that we've computed in the previous step. And the outputs of this network correspond to a class of attack. So in particular, we focused on, obviously, no attack, DDoS attacks, port scans, network scans, and P2P network scans.

The model itself had three hidden layers with 130 and 100 nodes, respectively, and an output layer. And we used ReLU activation. I'm happy to talk in more depth about why we selected these, but I'll save that for another time.
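A minimal PyTorch sketch of this kind of classifier: entropy features in, five class scores out. The 130- and 100-node layers follow the numbers in the talk, but the input size, the use of only two hidden layers here, and the training details are illustrative assumptions, not the actual model.

```python
import torch
import torch.nn as nn

n_features = 8  # one entropy per selected flow field (assumed)

# Entropy vector -> class scores for the five classes mentioned:
# no attack, DDoS, port scan, network scan, P2P scan.
model = nn.Sequential(
    nn.Linear(n_features, 130), nn.ReLU(),
    nn.Linear(130, 100), nn.ReLU(),
    nn.Linear(100, 5),  # train with nn.CrossEntropyLoss
)

x = torch.randn(32, n_features)  # a batch of entropy vectors
print(model(x).shape)            # torch.Size([32, 5])
```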

And going through this process, we actually had very good results. So that's the short form of the research work. But what I wanted to really emphasize is the amount of effort that went into the data conditioning: the importance of collecting data, anonymizing data, making it human readable, and converting it into a format that people understand. It was certainly an iterative process. As much as this is a very short form of what we did, it took a long time to get to these results.

And what we're presenting over here is the attack type and the intensity. So this axis is the ground truth from the synthetic data generator, and this is the prediction that our model makes, shown as a confusion matrix. A higher number, or a darker blue, indicates that we did a really good job. And we've also used varying intensities of these attacks. So certainly, for strong attacks, which means we've injected about 20,000 packets in the 15 minutes, we can detect those really, really well. As you have weaker and weaker attacks, our system doesn't do as well. It's still quite reasonable, but certainly an area of ongoing research.
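For reference, a confusion matrix like that is one call in scikit-learn; the label names and toy vectors below are illustrative, not the project's actual outputs.

```python
from sklearn.metrics import confusion_matrix

labels = ["none", "ddos", "port_scan", "net_scan", "p2p_scan"]

# y_true would come from the synthetic generator's ground truth,
# y_pred from the trained model; these toy vectors are made up.
y_true = ["ddos", "ddos", "none", "port_scan", "none", "net_scan"]
y_pred = ["ddos", "ddos", "none", "none", "none", "net_scan"]

# Rows are truth, columns are predictions; a dark diagonal is good.
print(confusion_matrix(y_true, y_pred, labels=labels))
```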

So with that, I wanted to hand it over to Jeremy. But this is some initial work on, and results of, detecting and classifying network attacks. The key point was that data conditioning was where we ended up spending the majority of the time: cleaning up the collected data and figuring out how to label it. We spent a lot of time trying to figure out how to label this data because we were just unable to really do it, and finally we ended up going the synthetic generator path. But we did spend a significant amount of time trying to figure that out. And then we spent a lot of time on feature engineering, which consisted of a lot of trial and error and things like that. OK, thank you.