Continuously Monitoring the Haystack (Needle Inventory Report)

In the previous two posts (Part 1 here and Part 2 here) I first shot down the much over-used “Finding a Needle in a Haystack” analogy by showing that the problem facing IT security professionals is far more complex, and then defined a new approach that will find the <unknown> in an <ill-defined, shifting maelstrom>.  I closed the second post by adding a “but wait, there is more”, so here it is: what if I told you that the approach I described not only solves the needle in the haystack problem, it also provides an innovative approach to continuous monitoring and situational awareness.  Without changing the core process one iota.

Think of it – this is like inventing cold fusion and finding out that the solution repairs the ozone layer.  Or inventing a beer that tastes great and is less filling.  Or something like that.

To recap, my thesis is that the analogy assumes that you are looking for a known thing in a well-defined and completely homogenous population, when in fact we do not know what we are looking for in most cases, and the machine population that doubles as our proverbial haystack is anything but well-defined and completely homogenous.  Therefore, the proper expression of “Finding a Needle in a Haystack” is in fact “Finding an <unknown> in an <ill-defined, shifting maelstrom>”.

I then outlined how you could solve the problem by first building a normalized view of the endpoint population to create a baseline of the ill-defined, shifting maelstrom, effectively giving you a haystack.  Then you could continuously monitor machines for changes under the assumption that any change may be someone introducing something to your haystack.  By analyzing and grouping the changes and using the baseline as context, you could then assess the impact of those changes to the machine and determine if the changes represented a needle (malicious attack) or just more hay.

Now for the continuous monitoring part.  Because I do not know what I am looking for, I have no choice but to scan everything persistent on the host machine.  Since I am building my normalized model on the server, I have to move the raw data to the server and store that data in some form of data repository.  Logic dictates that the repository that I use to power the process of finding needles in the haystack can be used as the data source for all forms of situational awareness and continuous monitoring activities.

It gets better!  I can perform those activities without incurring any additional burden on the endpoints or on the network.  I can ask questions of the repository without the need to collect additional data from the endpoints.  Most solutions use an agent that scans only segments of the data, or use an agentless scan to collect segments of the data.  If either is not currently scanning the data needed, the process has to be altered and repeated.  For example, a new scan script might have to be pushed to the agents.  Furthermore, organizations often run multiple scans using different tools, each dutifully thrumming the endpoints for data and oftentimes collecting much of the same information collected by the scan that ran an hour earlier.

Of course, I must continuously refresh the data in my repository to keep it accurate.  Luckily, I already thought of that, and I am using my agent-based precision to detect and cache changes on each host machine in a very efficient way.  I then send those changes to the server once per day and use the changes to refresh the repository.  Given I have a large number of data attributes that rarely change, sending only the changes across the wire keeps the network burden to a minimum.  Obviously, I have to move a full scan across the wire when I initially deploy the agent, but for subsequent updates the change-data-capture approach results in comparatively small answer sets per machine per day.
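
To make the change-data-capture idea concrete, here is a minimal Python sketch under the assumption that each scan is a flat dictionary of attribute values.  The attribute names are invented, and this illustrates the general pattern, not the actual implementation inside the Triumfant agent.

```python
# Hypothetical change-data-capture sketch: send only the attributes that changed
# since the last daily upload, rather than re-shipping the full scan every time.

def diff_snapshots(previous, current):
    """Return only the attributes that were added, removed, or modified."""
    changes = {}
    for key in previous.keys() | current.keys():
        old, new = previous.get(key), current.get(key)
        if old != new:
            changes[key] = {"old": old, "new": new}
    return changes

def apply_changes(repository_record, changes):
    """Server side: fold the daily delta into the stored image of the machine."""
    for key, change in changes.items():
        if change["new"] is None:
            repository_record.pop(key, None)   # attribute disappeared from the machine
        else:
            repository_record[key] = change["new"]
    return repository_record

yesterday = {"patch_KB123": "installed", "service_foo": "running"}
today     = {"patch_KB123": "installed", "service_foo": "stopped", "new_tool.exe": "present"}

delta = diff_snapshots(yesterday, today)          # only two entries cross the wire
server_copy = apply_changes(dict(yesterday), delta)
assert server_copy == today
```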

My result?  I have a complete repository of the granular data for each endpoint machine, efficiently collected and managed and available for reporting and analysis.  I can feed my highly sophisticated information portals, scorecarding processes, and advanced analytics.  I can build integrations to feed other systems and applications in my security ecosystem.  I can create automated data feeds such as the feed required to comply with the FISMA CyberScope monthly reporting requirement.  Best of all, this activity does not require me to go back to the endpoint ceaselessly for more information.

I have implemented true continuous monitoring and comprehensive situational awareness without running any incremental data collection processes.  I am continuously scanning every persistent attribute on every machine.  My data collection routine to find the needles in my haystack is the only data collection process required!  You want situational awareness?  From this data I can readily produce reports for application inventories, patch inventories, vulnerabilities, and non-compliance with policies and configurations.  I can tell you the machines that have had a USB key plugged into them in the past week.
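
As an illustration of that kind of question, here is a hedged sketch that models the repository as a small SQLite table and asks the USB question above.  The table layout, attribute name, and dates are all invented for the example; they are not Triumfant’s schema.

```python
# Hypothetical situational-awareness query against the central repository,
# here modeled as a small SQLite table of attribute observations per machine.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE attributes (
    machine TEXT, name TEXT, value TEXT, observed_on TEXT)""")
conn.executemany(
    "INSERT INTO attributes VALUES (?, ?, ?, ?)",
    [("host-1", "usb_device_seen", "true", "2011-03-15"),
     ("host-2", "usb_device_seen", "false", "2011-03-15"),
     ("host-3", "usb_device_seen", "true", "2011-03-01")],
)

# "Which machines had a USB key plugged into them in the past week?"
rows = conn.execute("""
    SELECT DISTINCT machine FROM attributes
    WHERE name = 'usb_device_seen' AND value = 'true'
      AND observed_on >= date('2011-03-16', '-7 days')
""").fetchall()
print([r[0] for r in rows])   # -> ['host-1'], answered without touching an endpoint
```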

I can keep a history of my machine scans and show you the detail of the granular elements of what I am scanning.  Think of the impact for investigating incidents on a machine.  You could pull up the snapshot for any given day, or select two dates and generate a summary of the diffs between the two images so you could see exactly what changed on the machine.
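
Here is a hedged sketch of what that investigation workflow could look like if the stored daily deltas are replayed to rebuild a snapshot for any date and then two dates are diffed.  The dates and attribute names are made up, and this is an illustration of the idea, not the product’s actual storage format.

```python
# Hypothetical incident-investigation helpers: rebuild a machine's snapshot for
# any stored date by replaying daily deltas, then diff two dates to see exactly
# what changed in between.

def snapshot_on(initial_scan, daily_deltas, as_of):
    """Replay deltas (keyed and sorted by date) up to and including 'as_of'."""
    state = dict(initial_scan)
    for date in sorted(d for d in daily_deltas if d <= as_of):
        for attr, new_value in daily_deltas[date].items():
            if new_value is None:
                state.pop(attr, None)     # attribute was removed on that day
            else:
                state[attr] = new_value
    return state

def diff_dates(initial_scan, daily_deltas, date_a, date_b):
    before = snapshot_on(initial_scan, daily_deltas, date_a)
    after = snapshot_on(initial_scan, daily_deltas, date_b)
    return {k: (before.get(k), after.get(k))
            for k in before.keys() | after.keys()
            if before.get(k) != after.get(k)}

initial = {"service_foo": "running"}
deltas = {
    "2011-02-01": {"dropper.exe": "present"},
    "2011-02-02": {"service_foo": "stopped"},
}
print(diff_dates(initial, deltas, "2011-01-31", "2011-02-02"))
# -> {'dropper.exe': (None, 'present'), 'service_foo': ('running', 'stopped')}
```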

I am just getting started.  Since I have my normative baseline in place to interpret the changes I detect on each machine, I can also provide reports on anomalous applications and other activity that is exceptional.

To recap, the data collection processes required to implement my approach to finding the <unknown> in an <ill-defined, shifting maelstrom> provide true continuous monitoring and broad situational awareness.  Or you could say that my approach is a continuous monitoring solution that can identify the <unknown> in an <ill-defined, shifting maelstrom>.  Either way, what results is an accurate picture of my <ill-defined, shifting maelstrom> without having to run any additional scans or data collection, so I get the benefits without incremental burden to the endpoint machines or the network.

But wait, there is more.

Needle in a Haystack? How to Find an Unknown in an Ill-Defined, Shifting Maelstrom

In the March 17, 2011, post, I demolished the “Finding a Needle in a Haystack” analogy by pointing out that in IT Security we don’t know what we are looking for (the needle) and our haystack is not a homogenous pile of hay but is instead a continuously changing, utterly non-homogenous population of one-off configurations and application combinations.  We went from “Finding a Needle in a Haystack” to “Finding an <unknown> in an <ill-defined, shifting maelstrom>”.

I ended by promising you a solution and that is where I begin.

The first step toward a solution is getting your hands around the “ill-defined, shifting maelstrom” that is your endpoint population.  To find what is unwanted or anomalous in that population, you first need a way to establish what is normal for that population.  You could build and dictate normal, and then enforce that normal in a total lockdown.  That is expensive and hard to do, and in my many travels, I have seen exactly two such environments.  The alternative is to monitor the machines in that population and accurately create a baseline learned from the environment itself.  One that captures all of the exceptions and disparity in all of their glory.  The end result is a normalized, well-defined representation of your ill-defined, shifting maelstrom.  A normalized haystack, as it were.
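
As a rough illustration of learning normal from the environment itself, the sketch below computes how prevalent each attribute observation is across a toy population.  The attribute names and values are invented, and the logic is deliberately simplistic compared to a real baseline.

```python
# Hypothetical sketch of "learning normal" from the population itself: record how
# prevalent each (attribute, value) observation is across machines, exceptions and all.
from collections import Counter

def learn_baseline(machines):
    """machines: {machine_id: {attribute: value}} -> prevalence of each observation."""
    observations = Counter()
    for attributes in machines.values():
        observations.update(attributes.items())
    total = len(machines)
    return {obs: count / total for obs, count in observations.items()}

machines = {
    "host-1": {"browser": "IE8", "java": "1.6_24"},
    "host-2": {"browser": "IE8", "java": "1.6_24"},
    "host-3": {"browser": "Firefox", "java": "1.6_24"},  # a legitimate one-off
}
baseline = learn_baseline(machines)
print(baseline[("browser", "Firefox")])   # ~0.33 -> rare, but captured as part of "normal"
```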

Easy, right?  Not really.  You have to remember that your target is unknown, so you have no idea where it will appear and in what form.  You must also consider that whoever is putting the unknown in your haystack does not want it to be found, and will so design the unknown to evade detection.  Zero day attacks don’t show up as shiny needles.  You can assume nothing; therefore, you must monitor everything as part of your normalized haystack.  You must also remember that the population shifts (wanted change) and drifts (unwanted change) by the moment, so you will need to keep it current.

In short, you will need continuous monitoring that is comprehensive and granular.  Not the kind the scanner vendors sell you that sees some of the machines in weekly or monthly increments, or the kind the AV vendors sell you that sees parts of the machine and not the entire picture.  You will need comprehensive and truly continuous monitoring.

In yesterday’s post, I noted that if you had a homogenous haystack you could remove everything that was hay, and what is left should be the thing you are looking for, even if you do not know what that thing was.  Our haystack is not homogenous, but now we have created a baseline that provides the next best thing.  We can’t throw out the hay, so we need a slightly modified approach that uses changes to the machines as our potential indicators of compliance issues and malicious attacks.

If we are smart, we can use this approach to our advantage because once we establish our normative haystack we can continuously monitor the machines and identify changes.  This fuels our detection process and drives efficiency in managing the shift (we want to control the drift, but that is another post) in the population.  By capturing changes, we can keep the image of the population current with minimal drag on the endpoints and the network by moving only the changes across the wire.  No need to move large images when small incremental change captures will do.

Once we identify the changes, we will need analytics that assess the impact of those changes to the associated machine.  These analytics will leverage the context provided by the normalized model of the haystack to identify those changes that are anomalous.  Changes identified as anomalous are further analyzed to gauge their effect on the state of the machine and identify those changes believed to be malicious.  We can use the context and other analytic processes to group changes so that we see the malicious code and all of the damage done to the machine by the malware.
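
A minimal sketch of that analysis step, assuming the baseline from the previous sketch is a simple prevalence map: score each change against the population, then bundle the anomalous changes that landed on the same machine at the same time so the malware and its collateral damage show up as one incident.  The threshold and field names are illustrative only, not the actual analytics.

```python
# Hypothetical sketch of the analysis step: score each detected change against the
# learned baseline, then group anomalous changes by machine and detection window.

def is_anomalous(change, baseline, threshold=0.05):
    """A change is anomalous if its new value is essentially unseen in the population."""
    prevalence = baseline.get((change["attribute"], change["new_value"]), 0.0)
    return prevalence < threshold

def group_incidents(changes, baseline):
    """Bundle anomalous changes so an attack and its damage are reported together."""
    incidents = {}
    for change in changes:
        if is_anomalous(change, baseline):
            key = (change["machine"], change["detected_on"])
            incidents.setdefault(key, []).append(change)
    return incidents

baseline = {("autostart", "c:\\temp\\svch0st.exe"): 0.0, ("service_foo", "stopped"): 0.4}
changes = [
    {"machine": "host-9", "detected_on": "2011-03-17", "attribute": "autostart",
     "new_value": "c:\\temp\\svch0st.exe"},
    {"machine": "host-9", "detected_on": "2011-03-17", "attribute": "open_port",
     "new_value": "6667"},
    {"machine": "host-2", "detected_on": "2011-03-17", "attribute": "service_foo",
     "new_value": "stopped"},   # common change across the population: just more hay
]
print(group_incidents(changes, baseline))   # one incident on host-9 with two grouped changes
```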

We have successfully identified the unknown in our ill-defined, shifting maelstrom, which, like I said yesterday, is infinitely harder than finding a needle in a haystack.  We did not just find the unknown, we have detailed its composition, analyzed its effect on the machine, and identified its path of destruction.

I think we are onto something here.  This could revolutionize malware detection, creating a detection capability that is agnostic to attack type, vector, and delivery.

But wait, there is more.

Finding a Needle in a Haystack – Child’s Play!

“Finding a Needle in a Haystack” is without doubt one of the most overused analogies in IT security.  After seeing it repeatedly at RSA I offer the following analysis of the analogy:

Finding a needle in a haystack is child’s play.  A walk in the park.  Enormous oversimplification.  Luxury (yes, this is a reference to Monty Python’s “Four Yorkshiremen” sketch – look it up, it fits).

First, the “needle” component is all wrong because it presumes what we are looking for is known.  The problem in malware detection is in seeing those attacks for which there is no prior knowledge.  We don’t know what we are looking for, but we are expected to find it – whatever “it” is.  Unfortunately, the vast majority of malware detection software relies upon prior knowledge to detect malware, leaving a wide detection gap for unknown attacks such as zero day attacks and the advanced persistent threat.  This is problematic as unknown attacks increase in volume and complexity daily, and, as we saw in the recent report of the Nasdaq breach, some needles remain undetected in the haystack for over a year.

The analogy has now degraded to “Finding an <unknown something> in a Haystack”.  Obviously not knowing what it is you are looking for causes some complication, but this problem is addressable given that we are looking for an unknown thing in a well-defined, consistent population, namely the hay in the haystack.  Logic dictates that if you were able to remove everything that was hay, what is left should be the thing you are looking for, even if you do not know what that thing was.

The trouble with that solution is that the people hiding stuff in your haystack do not want you to find it.  So they will make their stuff look, feel, and smell like hay, making it very difficult to readily distinguish what is hay and what is the unknown.  You also face the very real possibility of discovering multiple non-hay unknowns after the hay is removed.  Do you assume they are all undesirable?  If not, how do you differentiate the benign unknowns from the undesirable unknowns?

None of that matters anyway, because the analogy breaks down further: endpoint populations are most definitely not haystacks.  In fact, unless you have instituted the most epically draconian lock-down process of all time, there is very little homogeneity in any endpoint population.  Consequently, you are not looking for something in a well-defined, consistent population, you are looking for something in a confusing maelstrom of one-off configurations that changes daily.  It is the furthest possible opposite of homogenous.

We are now left with “Finding an <unknown> in an <ill-defined, shifting maelstrom>”.  Makes you yearn for the luxury of “Finding a Needle in a Haystack”, doesn’t it?

Finally, no one really suffers from a needle in a haystack, unless you metaphorically jump into the metaphorical hay and through a turn of enormous bad luck get inadvertently, metaphorically stuck.  Needles do not exfiltrate intellectual property or financial data.  Needles do not turn computers into spam-spewing Conficker zombies.  Moreover, needles do not land your organization on the front page of the New York Times and create reputational risk that can lower the market cap of the company.  Of course, the <unknown> in your <ill-defined, shifting maelstrom> certainly can.

The obvious question: how do I find an <unknown> in my <ill-defined, shifting maelstrom>?  I will offer you one solution tomorrow.

The Nasdaq Breach Illustrates the Need for Continuous Monitoring

Dear Nasdaq, call me.  I am here to help.

The Wall Street Journal reported late Friday that Nasdaq had discovered that they had been hacked.  The hackers never made it to the trading programs, but instead infiltrated an area of Nasdaq called Directors Desk where directors of publicly traded firms share documents about board meetings.

What caught my eye was the following quote from the AP story filed about the attack: “…Nasdaq OMX detected “suspicious files” during a regular security scan on U.S. servers unrelated to its trading systems and determined that Directors Desk was potentially affected.”

People, people, people.  You have got to get on the continuous scanning bandwagon.  Seriously.

Connect the dots.  The story says that “the hackers broke into the service repeatedly over more than a year”.  Notice that the scans that found the suspicious files were “regular,” meaning periodic.  Monthly?  Quarterly?  How many of these regular scans were run before the activity was discovered?  I understand the need for network based, agentless scans.  I also know their limits, and deep down inside in a place most IT security people don’t want to admit, so do you.  “Regular” is not continuous.

Don’t stop yet, because the story says that the scan determined that the systems were “potentially affected”.  The diagnosis was partial because agentless scans, even credentialed scans, only get part of the story and therefore can only point out “potential” exploitation.

I have zero data about the actual attack and therefore am speaking in general terms.  But I am confident that a granular, continuous scanning tool should have been able to detect enough anomalous and exceptional artifacts on the Nasdaq servers to spot an attack like this.  The story says that suspicious files were ultimately discovered, so we know that there were persistent artifacts created by the attack.

This is a prime example of why you must have continuous, granular monitoring of endpoints and servers.  Periodic scans, while effective, leave too many blind spots.  A continuous scanning tool should have found the artifacts.  And if the tool used change detection like Triumfant, it would have flagged the files as anomalous at a minimum within 24 hours of the attack.

Don’t throw the shield argument at me here.  These attacks went on for over a year.  Triumfant would have spotted the artifacts in 24 hours or less.  If you can’t see that difference and want to live the lie of the perfect shield, you are on the wrong blog.  In fact, if those files triggered our continuous scan that looks for malicious actions (an autostart mechanism, opening a port, etc.), Triumfant would have flagged the files within 60 seconds.
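
To illustrate the kind of trigger I mean, here is a hedged sketch of a tiny filter that watches incoming change records for those actions.  The action names and record format are invented for the example and are not a description of Triumfant’s rule engine.

```python
# Hypothetical "malicious action" trigger: flag change records that match actions
# worth an immediate look (new autostart entries, newly opened ports, and so on).

SUSPICIOUS_ACTIONS = {"autostart_added", "port_opened", "service_installed"}

def triage(change_records):
    """Return the change records that should raise an immediate alert."""
    return [r for r in change_records if r["action"] in SUSPICIOUS_ACTIONS]

records = [
    {"machine": "srv-7", "action": "autostart_added", "detail": "c:\\temp\\upd.exe"},
    {"machine": "srv-7", "action": "file_added", "detail": "c:\\temp\\upd.exe"},
    {"machine": "srv-8", "action": "port_opened", "detail": "tcp/4444"},
]
for alert in triage(records):
    print("ALERT", alert["machine"], alert["action"], alert["detail"])
```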

Regardless of which of our continuous scans would have detected the incident, Triumfant would have performed a deep analysis of the files and been able to show any changes to the affected machine that were associated with the placement of the suspicious files on the machine.  You likely could have deleted the word “potentially” from the conversation almost immediately.  I would also add that we would have built a remediation to end the attack.

Strong words for someone who has no details?  Perhaps.  But I would bet the farm that we would have found this attack in less than a year.

I don’t understand how we have arrived at a place where organizations don’t implement continuous scanning.  Innovative solutions like Triumfant get throttled by old predispositions and the disconnect between IT security and the operations people who manage the servers and endpoints.  The security teams are forced to use agentless tools because the ops people refuse to consider a new agent, even if that agent is unobtrusive and allows them to remove other agents in a trade of functionality.  As a result, the IT security people are left to protect machines with periodic scans that cannot possibly see the detail available when an agent is used.

Machines get hacked, the organization is placed at risk, countless hours and dollars are spent investigating the problem and then more hours and dollars are spent putting useless spackle over the cracks.  This is worth dismissing even the consideration of an agent?

Let me put it a different way.  We allow users to run whatever they want on endpoint machines, yet block IT security from deploying granular, continuous scanning tools that can actually detect attacks such as the one we saw at Nasdaq.

What am I missing here?

Dear Nasdaq, call me.  Don’t rinse, repeat and be in the WSJ again.  I can help.  Promise.

Triumfant and Situational Awareness – The Google Model

I have written in this blog that while Triumfant is useful, innovative technology, I often struggle to come up with word pictures or analogies that help others grasp how useful and innovative it really is.  Thankfully, we employ lots of smart people and one of our developers came up with what I think is an exceptional analogy.

Because Triumfant assumes nothing, it scans just about every persistent attribute on every machine in the endpoint population and sends this to the server for analysis.  Since the majority of the state data on each machine rarely changes, after the first snapshot is collected the Triumfant agent performs change data capture and only sends changes up the wire for subsequent scans.  This is, of course, the proven, prudent and efficient way to monitor large amounts of data that is predominantly static.  Otherwise, you end up moving large answer sets across the wire needlessly.  The data is available at the server level in a repository to power all forms of situational awareness.

The analogy suggested by our developer is the Google approach.  Google does not know what questions will be asked of its search engine, so it uses crawlers to traverse the World Wide Web to collect data and store it in anticipation of any question.  Google puts that raw data through correlation and pattern matching algorithms to further speed the search process.  The logic is simple – a search against the open Internet would be grossly inefficient and utterly preposterous.  By gathering the data before the question is asked, Google actually returns answers while you are asking the question.

Triumfant does essentially the same thing as Google for endpoint state data, because like Google, we do not know the question until it is asked.  Triumfant does not rely on prior knowledge and instead detects malware and configuration problems by monitoring change.  We use our agent to continuously monitor over 200,000 attributes per machine and then collect that data at the server level.  Queries, online views, and data feeds execute against the repository data at the server and require no interaction with the endpoints.  Contrast this with other tools that have to get the data from the endpoint for every question asked.

Suppose a new vulnerability is announced.  Triumfant’s repository can be queried directly and a report produced in hours (more likely minutes, but I don’t like to show off).  You would know almost immediately how many machines have the new vulnerability and therefore be able to assess the risk to your organization.  It would not matter which machines were connected at that time, nor would the query impact the network or the endpoints.  Why?  Because like Google, the hard work of gathering and processing the raw data is done and the data is readily available.  Best of all, the Triumfant agent performs its continuous monitoring unobtrusively and efficiently, and only sends back changes across the wire once a day.  You get faster access to the data with no impact to the endpoints or your network.
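
Here is a hedged example of that kind of repository query, with the repository again modeled as a small SQLite table and a made-up vulnerable version string standing in for the real advisory.  Neither the schema nor the attribute names are Triumfant’s.

```python
# Hypothetical repository query: when a new vulnerability is announced for, say,
# a specific application version, count the affected machines from data already collected.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE attributes (machine TEXT, name TEXT, value TEXT)")
conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)", [
    ("host-1", "reader_version", "9.4.0"),   # vulnerable in this toy scenario
    ("host-2", "reader_version", "9.4.2"),
    ("host-3", "reader_version", "9.4.0"),
])

affected = conn.execute(
    "SELECT machine FROM attributes WHERE name = 'reader_version' AND value = '9.4.0'"
).fetchall()
print(len(affected), "machines exposed")   # answered without touching a single endpoint
```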

With other tools, you would either have to initiate an agentless scan of the machines to collect the required information, or push some new query or script to the endpoint agents for execution.  Either way, this activity places a burden on the endpoint and on the network as potentially large answer sets are returned across the wire.  The necessary data would then be collected in some repository and evaluated over time.  I was recently at a prospect that I would judge to be progressive and perceptive, and that prospect told me that it takes two weeks to identify machines affected by a new vulnerability for a population that is not large by most standards.

One hour versus two weeks.  Impressive.  Most Impressive.

But wait, there is more.  Most vulnerabilities have a short term mitigation strategy that involves setting some registry keys to temporarily disable the vulnerability until a patch is created and applied.  With Triumfant, a simple policy can enforce the temporary fix and have it applied in less than 24 hours.  Since there is likely no signature for an attack that quickly moves to leverage the new vulnerability, Triumfant will see those attacks and build a remediation to stop the attack.  Triumfant sees the new vulnerability, effectively closes the vulnerability, and detects anything that attempted to exploit the vulnerability.
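
As a sketch of how such a policy check might look against data already in the repository, the example below compares each machine’s recorded registry attributes to a required workaround value and flags the machines that still need the fix.  The registry key, the value, and the check itself are invented for illustration; the actual enforcement mechanism is Triumfant’s, not this toy.

```python
# Hypothetical policy check for a temporary mitigation: compare each machine's
# recorded registry attributes against the required workaround value and flag
# the machines that are not yet compliant.

MITIGATION_POLICY = {r"HKLM\Software\Example\KillBit": "1"}   # made-up workaround key

def needs_mitigation(machine_attributes):
    """True if any required workaround value is missing or wrong on the machine."""
    return any(machine_attributes.get(key) != required
               for key, required in MITIGATION_POLICY.items())

repository = {
    "host-1": {r"HKLM\Software\Example\KillBit": "1"},
    "host-2": {r"HKLM\Software\Example\KillBit": "0"},
    "host-3": {},   # key not present at all
}
to_fix = [machine for machine, attrs in repository.items() if needs_mitigation(attrs)]
print(to_fix)   # -> ['host-2', 'host-3'] get the policy-driven fix
```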

The concept of accessing the central repository rather than continuously interrogating the endpoint machines works for all forms of situational awareness, business intelligence, data mining and analysis, and external feeds.  For example, Triumfant stores SCAP attributes for CCEs, CPEs and CVEs in the repository, so when the organization wants to build a CyberScope feed (Triumfant is a certified CyberScope provider), it does so from the repository without intrusion on the endpoint or consumption of network bandwidth.

So there you go.  Triumfant is like a web crawling search engine for the state data of your endpoint population.  The data is there so you can ask questions and get the situational awareness your organization needs to keep pace.  Gartner and others have been talking with increasing frequency about the importance of situational awareness and Enterprise Security Intelligence.  I cannot think of a more efficient and detailed source for endpoint state data than Triumfant.

The Yin and Yang of Triumfant – Agent Based Precision With Network Level Analytical Context

Yesterday I was in a conversation with Dave Hooks, our CTO, and a very smart person from the intelligence community, and, as often happens when I engage with people smarter than myself, I had an epiphany:

Triumfant provides agent level precision, with network level analytical context.

There is a set of trade-offs when working with endpoint security tools based on their perspective and architecture.  Agent-based solutions allow for monitoring at very granular levels, but there are limitations to the amount of analysis they can perform.  That is because when the analysis only happens in the context of the machine, the lack of broader context creates far too many false positives to make the analytic processes effective.  In most tools, the agent uses prior knowledge to detect, remediate or both, resulting in the need to continuously update the prior knowledge on the agent, creating a network and administrative burden.

In contrast, a server-based agentless tool trades a lack of intrusiveness for a lack of precision.  Even the most efficient scanning tools using credentialed scans cannot see the levels of detail needed to be absolutely sure about many potential problems, whether it be malicious activity or vulnerabilities or compliance.  For example, a credentialed scan can point out machines that may have a specified vulnerability, while Triumfant can probe deeply to say without question if a given machine has a vulnerability.  Agentless scanning also tends to gather large answer sets, which places a burden on the network.

Which leads me to my epiphany – Triumfant’s approach provides the best of both worlds while eliminating the drawbacks of each.  Triumfant has achieved harmonic balance between what appear to be opposing forces – a true Yin/Yang relationship.

The Triumfant agent performs continuous scanning at a level of precision that I have not seen in any other tool – over 200,000 attributes per machine.  The agent recognizes changes and sends only changes to the Triumfant server for analysis, minimizing network burden through an effective application of change data capture.  The agent uses no prior knowledge, and therefore requires no regular updates of signature files or remediation scripts.  No network impact outbound, very low network impact inbound.

Triumfant performs the analysis of the detailed data collected at the machine level on the Triumfant server, empowering Triumfant’s analytics to view changes in the context of the broader population, driving analytical accuracy and eliminating false positives.  The context also empowers Triumfant’s patent-pending donor technology that uses the population as a donor pool to build remediations that address missing and corrupted attributes.  When a new attack is identified, the context allows for investigation of broader attack patterns which will ultimately provide the IT security team the information they need to proactively protect the organization from other similar attacks.
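
To give a feel for the donor-pool idea (and only a feel; the patent-pending approach is certainly more sophisticated than this), here is a naive sketch that picks the peer machine whose other attributes best match the victim and borrows the missing attribute from it.  The machine names, attributes, and matching logic are invented for illustration.

```python
# Hypothetical sketch of the donor-pool idea: when a machine is missing an attribute
# (or has a corrupted value), find a healthy peer whose other attributes match closely
# and borrow the needed value from it.

def find_donor(victim, population, missing_attribute):
    """Pick the peer that shares the most attribute values and has the missing piece."""
    best, best_overlap = None, -1
    for machine_id, attributes in population.items():
        if missing_attribute not in attributes:
            continue                       # peer cannot donate what it does not have
        overlap = sum(1 for k, v in victim.items() if attributes.get(k) == v)
        if overlap > best_overlap:
            best, best_overlap = machine_id, overlap
    return best

population = {
    "host-1": {"os": "WinXP SP3", "office": "2007", "critical.dll": "hash-abc"},
    "host-2": {"os": "Win7", "office": "2010", "critical.dll": "hash-def"},
}
victim = {"os": "WinXP SP3", "office": "2007"}      # critical.dll missing or corrupted
print(find_donor(victim, population, "critical.dll"))   # -> 'host-1' is the best donor
```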

The context that I speak of in the previous paragraph is unique to Triumfant and is at the heart of our patents.  The context takes the detailed attribute data collected by the agent and builds a normative, rule-based model of the endpoint population.  Again the Yin/Yang relationship is manifested: the context thrives because of the detail provided by the agent, but logically and logistically can only be implemented at the server level.

By using the agent to do what it does best, and using the server to perform the heavy lifting of analysis, Triumfant captures the best of both worlds.  The agent is extremely unobtrusive and efficient, and requires near-zero maintenance.  Using change detection means that you can assume nothing, and must therefore monitor everything, which would be impossible to do efficiently and accurately without an agent.  Equally impossible is the task of making sense of detected changes without a broader context.  That is why performing the analysis at the server level is critical.  It is important to note that the analysis is only as good as the data provided, and the server’s analysis would not have the depth and accuracy it generates without the granular data that could only be obtained through the agent.

So there you have my epiphany – Triumfant harnesses the data collection power of an agent based approach with the analytical power and contextual perspective of a server based approach.  Triumfant uses the power of each to neutralize the weaknesses of the other to create a solution that is unique and certainly powerful.  We can detect, analyze and assess the impact of changes to identify malicious attacks that evade other defenses, and build a contextual remediation to repair that attack.  We can continuously enforce security policies and configurations.  And we can provide deep insight into the endpoint population.