Monday, May 18, 2015

Getting the Data for Data Driven Security, Part One


Building on my earlier post, "Questioning your security data", I'm now going to dive into looking at active attacks.

The IT Operations team recently complained about a ramp-up of SSH attacks.  These particular services are protected by a variety of controls (which I won't get into). One of these is a detective control that generates alerts when the service is attacked.

Historically, I've found this particular control has a very low false positive rate, which makes me happy. The bad news is that the alert goes via email, which makes it hard to suck into my ELK stack. The good news is that I've got a nice historical archive of these alerts. I've got about 37k logs of attacks against 4 different SSH servers in 3 different locations from Jan 2013 to present.

Well, I've had Jay & Bob's book sitting around so now I figured it was time to do some Data Driven Security.

Note: the purpose of this post is to give you ideas and pointers to learn some new techniques.  I'm not going to define an exact step-by-step on how to do this.  If you want to learn more about specific commands, follow the links or drop a comment.  Also, I know my code is crude and lame; if you've got pointers to improve it, comment away.

First off, the data exists as email alerts.  I was able to simply do an export from Outlook to text (file->save-as) into one large msg file. So the contents of the file look like:





What's nice is that we've got a real-time GEOIP lookup, which is handy when reviewing data that's a few years old and IP ownership might have shifted.  The question is how to get this into a form usable for analysis.  The answer for me was AWK.  This quick and ugly AWK script tears through the text file:

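# Turn the exported alert text into CSV rows: target,attacker,date time,country
BEGIN { FS = " " }
# "Sent:" header: grab month, day, year and time; strip the comma after the day
$1 == "Sent:" { month = $3; day = $4; year = $5; sub(/,/, "", day); time1 = $6; time2 = $7 }
# Convert the month name to its number
month == "January" { month = "1" }
month == "February" { month = "2" }
month == "March" { month = "3" }
month == "April" { month = "4" }
month == "May" { month = "5" }
month == "June" { month = "6" }
month == "July" { month = "7" }
month == "August" { month = "8" }
month == "September" { month = "9" }
month == "October" { month = "10" }
month == "November" { month = "11" }
month == "December" { month = "12" }
# "Subject:" header: field 2 is the targeted system, field 5 is the attacker
$1 == "Subject:" { who = $5; where = $2 }
# GEOIP result present: print one CSV row with the upper-cased country
$1 == "country:" { print where "," who "," month "/" day "/" year " " time1 " " time2 ", " toupper($2) }
# GEOIP result missing: put the attacker in the country column instead
$1 == "missing" { print where "," who "," month "/" day "/" year " " time1 " " time2 ", " who }
END { print " " }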
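For anyone who wants to try the idea without the real alerts, here is a minimal sketch of the same conversion run from a shell. It is not the exact script above: it swaps the twelve month rules for a lookup array (the `names` and `monthnum` arrays are my own naming) and drops the trailing blank line. The two sample alerts and the file names `sample_alerts.txt` and `sample_alerts.csv` are invented for illustration, shaped only to match the fields the script reads; the real alert layout may differ.

```shell
# Hypothetical sample input: the real alert text is not reproduced here,
# so these lines are made up to match the fields the AWK rules look at.
cat > sample_alerts.txt <<'EOF'
From: alerts@example.com
Sent: Monday, May 18, 2015 10:15 AM
Subject: srv-a attack from 203.0.113.7
country: us
From: alerts@example.com
Sent: Tuesday, May 19, 2015 11:02 PM
Subject: srv-b attack from 198.51.100.23
missing GEOIP data
EOF

awk '
BEGIN {
    # Build a month-name to month-number table once, up front
    n = split("January February March April May June July August September October November December", names, " ")
    for (i = 1; i <= n; i++) monthnum[names[i]] = i
}
# Sent: header gives the date and time; strip the comma after the day
$1 == "Sent:"    { month = monthnum[$3]; day = $4; sub(/,/, "", day); year = $5; time1 = $6; time2 = $7 }
# Subject: header gives the target (field 2) and the attacker (field 5)
$1 == "Subject:" { where = $2; who = $5 }
# One CSV row per alert; fall back to the attacker when GEOIP is missing
$1 == "country:" { print where "," who "," month "/" day "/" year " " time1 " " time2 ", " toupper($2) }
$1 == "missing"  { print where "," who "," month "/" day "/" year " " time1 " " time2 ", " who }
' sample_alerts.txt > sample_alerts.csv

cat sample_alerts.csv
```

With the invented input above, this prints two rows: `srv-a,203.0.113.7,5/18/2015 10:15 AM, US` and `srv-b,198.51.100.23,5/19/2015 11:02 PM, 198.51.100.23`.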


You can see that my script works through the email headers and mostly just converts the date into a format that's easier for a machine to read. I shove my file through this script and it gives me a nice CSV that looks like this:




The columns are the targeted SSH system (I'm giving them some cryptic names here based on asset classification), the attacker's IP, the date/time of the alarm and the GEOIP lookup.   Nice.  I could pull this into Excel and bang away at it, but instead I'll yank it into R for some deep analysis.

A quick aside: I'm using RStudio for this example.
At the R command prompt, I pull the CSV file into a data frame with:
alertlist <- read.csv(file="alerts.csv")

And name the columns:
colnames(alertlist) <- c("Target", "Attackerip", "Eventdate", "Country")


Oh, and I'd better convert the dates into R format:
alertlist$Eventdate <- as.Date(alertlist$Eventdate, "%m/%d/%Y")
which yields
Now I'm all set to do some analysis.  A quickie summary already tells me lots of interesting things:


Stay tuned for Part 2 where I do some of that.
