After spending tens of thousands of dollars on commercial security solutions that did not meet our needs, our security team opted for a DIY approach. One of the first tools we wanted was a decent DLP. We were deeply disappointed in the DLP solutions available, especially when it came to tracking confidential data elements across both Linux and Windows file systems. Many were hard to use, difficult to configure, and/or dragged along an infrastructure of servers, agents, and reporting systems. We wanted something small, flexible, and dead simple. At this point, we were either looking at going back to the well for more resources to get the job done or coming up with something crafty ourselves. None of us were coders beyond some basic sysadmin scripting, but we decided to give it a shot.
The problem was that we potentially had confidential data lying around on several large file repositories. Nasty stuff like name + SSN, birthdate, credit card numbers, etc. We tried several commercial and open source DLP scanners and they missed huge swaths of it. What was particularly vexing was that our in-house apps were generating some of this data, but in our own format. It was pure ASCII text, but the actual formatting of the data made it invisible to the DLP tools. It was structured, just not in a way that any other tool could deal with. Most of the tools didn't offer much flexibility in terms of configuration, and those that did were limited to single-pass regex.
Our second problem was that we also wanted a way to cleanly scrub the data we found. Not delete it, not encrypt it, but excise it like a tumor, with the precision of a surgeon. We were tearing through log files and test data load files used by developers. Some of these files came directly from customers who did not know to scrub out their own PII. We had the blessing of management to clip the Personal out of PII and anonymize it in place. No tool on the market did that.
Luckily, we knew what we were looking for, how it was structured, and what we wanted to do with it. That allowed us to do contextual analysis... when you see these indicators, look here for these kinds of files. Using Python, some hints from OpenDLP (one of the tools we had looked at), and a little Luhn test, we did a first pass.
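The original code isn't public, but a minimal sketch of that kind of first pass might look like the following: a regex that catches digit runs shaped like card numbers, filtered through the standard Luhn checksum. The regex, thresholds, and function names here are our own illustrative assumptions, not the actual tool.

```python
import re

def luhn_ok(digits: str) -> bool:
    """Standard Luhn checksum: weeds out strings that merely look like card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# First pass: 13-16 digits, optionally separated by spaces or dashes.
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def first_pass(text: str):
    """Yield (offset, digits) for every candidate that passes the checksum."""
    for m in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            yield m.start(), digits
```

Even this crude filter helps: only about one in ten arbitrary digit strings passes the Luhn check, so it kills a lot of random numeric noise before a human ever looks at the hits.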
We got a ton of stuff back. Almost none of it good. This was not unexpected, as it matched our experience with a lot of the DLP tools.
So we then started a second pass of contextual and content analysis. We dove in, looked at these false positives, and found what made them false. The second-pass scan would weed out those cases with pattern matching and algorithms. We lathered, rinsed, and repeated with bigger and bigger data sets until we were hitting exactly what we wanted with no false positives.
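To make the second pass concrete, here is the flavor of filter we mean, with rules invented for illustration (the production rules came from reviewing our own false positives): discard filler values, discard vendor-published test numbers, require plausible SSN structure, and look for corroborating context nearby.

```python
# Hypothetical second-pass rules; the real ones were specific to our data.
TEST_NUMBERS = {"4111111111111111", "5555555555554444"}  # well-known test cards

def plausible_ssn(ssn: str) -> bool:
    """SSA never issues area 000, 666, or 900-999, group 00, or serial 0000."""
    area, group, serial = ssn[:3], ssn[3:5], ssn[5:]
    return (area not in ("000", "666") and not area.startswith("9")
            and group != "00" and serial != "0000")

def looks_real(digits: str, context: str) -> bool:
    """Keep a hit only if it survives both content and context checks."""
    if len(set(digits)) == 1:        # 000... / 999... is filler, not PII
        return False
    if digits in TEST_NUMBERS:       # canned test data, not a leak
        return False
    keywords = ("ssn", "social", "card", "acct", "dob")
    return any(k in context.lower() for k in keywords)  # corroborating context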
Next we added a scrub routine that replaced the exact piece of PII in a file with a unique nonsense data element. For example, some of these files were being used as test loads by developers. If we just turned all credit card numbers into 9's, their code would fail. They also needed unique numbers for data analysis. If you turn a table of SSNs into every single entry being 99999, the test will fail. So we selectively changed digits but maintained uniqueness. I can't get into too much detail without giving away proprietary code, but you can read all about it here.
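Since the actual scrub logic is proprietary, this is only a generic sketch of the uniqueness-preserving idea, assuming a keyed hash (HMAC) to derive same-length replacement digits. The key name, data structures, and collision handling are all our own placeholders.

```python
import hashlib
import hmac

# Generic illustration only; the real scheme is proprietary.
SECRET = b"per-run-secret"      # hypothetical key, rotated each scrub run

_mapping: dict[str, str] = {}   # original value -> replacement
_used: set[str] = set()         # replacements already handed out

def scrub(value: str) -> str:
    """Swap a PII value for same-length nonsense digits, deterministically,
    while guaranteeing that distinct inputs get distinct outputs."""
    if value in _mapping:
        return _mapping[value]
    counter = 0
    while True:
        msg = value.encode() + counter.to_bytes(4, "big")
        digest = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
        fake = "".join(str(int(c, 16) % 10) for c in digest)[: len(value)]
        if fake not in _used:   # preserve uniqueness for the test loads
            _mapping[value] = fake
            _used.add(fake)
            return fake
        counter += 1            # rare collision: derive a fresh candidate
```

Deriving the replacement deterministically means the same SSN scrubs to the same fake value in every file it appears in, which keeps referential integrity intact across a developer's test load.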
We also kept a detailed log of what was changed to what, so that we could un-ring the bell if the tool ever misfired. And of course, we protected those log files, since they now contained confidential data elements themselves.
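A sketch of what such a reversible audit record might look like, assuming a JSON-lines log created with owner-only permissions; the file name and field names are hypothetical.

```python
import json
import os
import time

def log_change(path: str, offset: int, original: str, replacement: str,
               logfile: str = "scrub_audit.jsonl") -> None:
    """Append one reversible change record. The log itself now holds PII,
    so create it owner-read/write only (0600) and treat it as confidential."""
    fd = os.open(logfile, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    with os.fdopen(fd, "a") as f:
        f.write(json.dumps({"ts": time.time(), "file": path, "offset": offset,
                            "orig": original, "new": replacement}) + "\n")
```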
What we ended up with was a single script that, given a file path, would just go to town on the files it found. No agents, no back-end databases, no configuration. Just point and shoot.
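For flavor, a point-and-shoot interface in that spirit could be as small as this. All names are illustrative, and `process_file` stands in for the detection and scrub passes sketched above.

```python
import argparse
import pathlib

def process_file(path: pathlib.Path, scrub: bool) -> None:
    # Stand-in for the detection and scrub passes above.
    print(("scrubbing" if scrub else "scanning"), path)

def main() -> None:
    ap = argparse.ArgumentParser(
        description="Point-and-shoot PII scanner/scrubber (illustrative).")
    ap.add_argument("root", help="directory tree to walk")
    ap.add_argument("--scrub", action="store_true", help="anonymize hits in place")
    args = ap.parse_args()
    for p in sorted(pathlib.Path(args.root).rglob("*")):
        if p.is_file():
            process_file(p, args.scrub)

if __name__ == "__main__":
    main()
```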
The beauty was that we knew what we were willing to trade off: speed for precision. Our goal was less manual labor and better assurance. Our code was clunky, ran in a slow interpreted language, and took hours to complete. But it was also easy to modify, easy to pass around to team members, and the logic was very clear. Adopting the release-early-and-often approach, we had something usable within weeks that proved more functional than the products on the market.
The tool proved to be laser-precise in hunting down the unique PII data records in our environment, preventing costly and embarrassing data leaks. After showing it around, we were given precious developer resources to clean up our code, add functionality, and fix a few little bugs. It's been so successful as an in-house tool that our management will soon be releasing it as a software utility to go along with our product.