For the last decade-plus, the default first step in starting a security monitoring program has been “Collect all your logs and store them in a central place”. The logic behind it tracks. If you want to know what’s happening in your environment, you need to review the logs. If you want to constantly watch for conditions in your logs that could indicate a breach or an attack, you need them in a predictable place and format. Makes sense, right?
When this logic first emerged and security practitioners started pushing it, most companies were not using public cloud services or SaaS platforms; they primarily built and managed their own infrastructure. The problem is that there have always been significant hurdles to this approach. In the days of on-prem everything, it was a difficult task, but not an impossible one.
Log Centralization pre-Cloud
The primary way to accomplish log centralization in an on-prem organization is to essentially force configurations onto your company’s devices. Setting standards in the images used to spin up new servers is relatively easy, because most organizations tend to standardize on the servers they’ll run in production. Whether it’s Windows or a specific flavor of Linux, you could solve the problem of configuring automatic log shipping one to three times and rest comfortably knowing you’ll get visibility into nearly every device. This assumes a centralized way to deploy new infrastructure, which was the norm in pre-AWS days, when most companies had a process by which users would request new servers. As part of that process, the team building the servers would use a “Gold Image” with all the approved settings and configurations.
If for some reason this was not how servers were being built, most companies employed Active Directory, which would allow them to set specific configurations for log shipping and enforce them through Group Policy.
The problems begin to show up when you start to store this data. If you’re operating at scale, this very quickly becomes terabytes or petabytes of data, which means you need to buy sufficient storage. And don’t forget, if you’re storing this data in your own data centers, you’ll need to mirror it elsewhere for redundancy. Also, make sure you back this data up regularly. Additionally, be careful not to over-permission people’s access to this data, as it could be very sensitive. You’ll also need to review this data regularly to see if anything unexpected is coming through. What if you are collecting application logs and, unbeknownst to the security team, your primary application is writing highly sensitive data to error logs?
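To make the storage math concrete, here is a quick back-of-envelope sketch. The server count, per-server log volume, replication factor, and retention period are all hypothetical figures chosen for illustration, not numbers from this article:

```python
# Back-of-envelope estimate of raw log storage at steady state.
# All figures below are hypothetical assumptions for illustration.
SERVERS = 500
GB_PER_SERVER_PER_DAY = 1.0
REPLICAS = 2          # primary copy + one mirror for redundancy
RETENTION_DAYS = 365

# Total written per day across all copies
daily_gb = SERVERS * GB_PER_SERVER_PER_DAY * REPLICAS

# Steady-state footprint once retention fills up, in TB
retained_tb = daily_gb * RETENTION_DAYS / 1024

print(f"{daily_gb:.0f} GB/day ingested, ~{retained_tb:.0f} TB held at steady state")
```

Even with these modest assumptions, a mid-sized fleet lands in the hundreds of terabytes before backups are counted.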
This is just scratching the surface; we haven’t even talked about how to do detection on this log data. Once you’re ready to start doing that, new questions emerge:
Do I have the right data?
If I don’t have the right data, how do I get it?
What data do we collect that we are certain we do not need?
How much of this data is actively being used to drive detections?
How will we efficiently index and query this data?
Obviously, a good SIEM will help you solve some of these problems, but that’s another cost to add to an already very expensive endeavor. In my experience, it’s not uncommon to have very low utilization of the data being collected. Even if you’re actively utilizing 50% of the data, there is still a tremendous cost associated with the unutilized 50%, which makes it very hard to show any real return on this investment.
Sure, maybe you keep the logs for incident response and investigation in case something happens, but it’s a difficult sell to convince the business to spend tens or hundreds of thousands of dollars “just in case”.
Cloud Native Log Centralization
I’ve been working at a startup that is 100% “in the cloud” for all of its tech, including all of the operations tools needed by various parts of the company to do their work. I think most startups today find themselves in this category. This environment brings some serious new challenges to the already troubling standard of collecting and centralizing all logs.
First, not all SaaS vendors even offer audit logs to begin with! And if you choose a vendor that does, get ready to pay big money for them. I’ve seen costs from a few thousand dollars to as high as “you need to upgrade to this SKU” which is 4x the price of our current SKU. This is why audit-logs.tax exists.
Let me also quickly say, I don’t have a problem with charging more for audit logs. It takes effort to build them into your app and it carries a legitimate cost. But I don’t think it’s fair to use a customer’s desire to be proactive as an opportunity to get much larger contracts.
With that out of the way, let’s discuss the problems that exist once you are working with cloud products that do make logs available to their customers. These are generally accessible through an API. I have seen some vendors only provide audit logs in the product’s web GUI, which leaves you with no realistic way to gather them programmatically. Those are in the minority, though.
Depending on what products you’ve chosen to use for monitoring your environment, your SIEM provider may have integrations with the other tools and products you are using. Great! But what if they don’t have an integration for another tool your company uses? The most you can do is usually submit a feature request and sit and wait. What if their integration is collecting way more data than you need, resulting in increased costs? Some will have the ability to filter this, but not all. What if the integration doesn’t collect data you need? Well, again, you’re left waiting.
Some SIEMs will have the ability to create custom integrations that will allow you to ingest data from arbitrary apps. Generally, this works by configuring the SIEM to gather the data from a specific location, like an S3 bucket or maybe a data lake. There are some SIEMs on the market that will abstract this part away, but not all. So in the case where you need to put this custom integration data somewhere for your SIEM to analyze, you are probably paying for this data multiple times. Once for storage in your infrastructure and then again for the SIEM to pick up, analyze, and store it on their side. If you have a large SaaS footprint, these costs compound very quickly. Remember, most SIEMs will charge based on the volume of data ingest.
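As a rough sketch of what such a custom integration involves, here is a minimal Python outline. The vendor API call is stubbed out and the event shape is invented, since every vendor’s audit API differs; a real version would page through an authenticated HTTP endpoint and upload the payload to the staging bucket your SIEM polls:

```python
import json
from datetime import datetime, timezone

def fetch_audit_events(since):
    """Stand-in for a vendor's audit-log API call (hypothetical).
    `since` is where incremental windowing would go; it is unused
    in this stub, which returns canned sample events."""
    return [
        {"ts": "2024-05-01T12:00:00Z", "actor": "alice", "action": "login"},
        {"ts": "2024-05-01T12:05:00Z", "actor": "bob", "action": "export_report"},
    ]

def stage_for_siem(events):
    """Serialize events as newline-delimited JSON, the shape many SIEM
    bucket pollers expect. Returns the payload instead of uploading it,
    to keep the sketch self-contained."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)

payload = stage_for_siem(fetch_audit_events(since=datetime.now(timezone.utc)))
print(payload)
```

Note that this payload is now stored twice: once wherever you stage it, and again inside the SIEM after ingest, which is exactly where the double cost comes from.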
Control Ingest Costs
One way to control those costs is by taking control of how your data is queried and stored. I have seen security teams at engineering-focused companies with several people dedicated to just collecting, shaping, and safely storing this data. This means modern detection engineering teams are not only required to have the security knowledge to effectively detect malicious behavior; they also need to be data analysis experts.
Vendors across the SaaS landscape rarely use a common schema with their logs and frankly, you’re lucky if they all use the same format. Most use JSON, but there are always those odd ones out of the bunch that will throw you a giant wrench. This requires spending more time and effort to create a custom solution. Regardless, there are several steps needed to begin ingesting data in this fashion.
The workflow looks like this:
Detection Engineering determines they need to detect X action in Y application.
An engineer reviews the application and determines whether audit logs are available.
Once logs are located, the engineer will need to review documentation to understand the structure of the logs and the events that create logs.
Identify which events are written to the logs to detect X action.
Play around with the API and determine how to get the events required without over-ingesting and causing excess cost for storing data.
Implement production code that effectively gathers the required logs and normalizes them into an expected format for querying.
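The normalization step at the end of this workflow can be sketched as follows. Both vendor payloads and the common schema here are hypothetical examples; the point is that each source needs its own mapping onto one expected shape before querying is practical:

```python
# Two hypothetical vendor payloads describing the same kind of event,
# with different field names and nesting -- typical of real SaaS audit logs.
vendor_a_event = {
    "created_at": "2024-05-01T12:00:00Z",
    "user_email": "alice@example.com",
    "event_type": "role_changed",
}
vendor_b_event = {
    "time": "2024-05-01T12:05:00Z",
    "principal": {"name": "bob@example.com"},
    "operation": "role.update",
}

def normalize_vendor_a(event):
    # Vendor A uses a flat schema; map its keys onto our common shape.
    return {"timestamp": event["created_at"],
            "actor": event["user_email"],
            "action": event["event_type"]}

def normalize_vendor_b(event):
    # Vendor B nests the actor and uses different key names entirely.
    return {"timestamp": event["time"],
            "actor": event["principal"]["name"],
            "action": event["operation"]}

normalized = [normalize_vendor_a(vendor_a_event),
              normalize_vendor_b(vendor_b_event)]
print(normalized)
```

Every new application means writing (and maintaining) another one of these mappings, which is where the ongoing engineering cost lives.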
In order to do this in a scalable, consistent way, you’ll need some kind of system to handle making these API calls and normalizing this data. You can definitely build scalable systems to accomplish this, but there are also many vendors that offer this kind of product. However, this adds yet another tool to your tech stack, one that generally is not a “set and forget” type of tool.
This is all time, effort, and budget that are going to activities that do not immediately move the needle on security monitoring. Sure, you have the data after all this, but you still need to use it effectively and strike the balance between spending on data storage vs. detecting threats against your company. It is very easy to end up using way more storage than you expect or plan for and increasing costs in an unmanageable way. Throw in the complexity of being at a company experiencing very strong growth, and it doesn’t take long to recognize this is not sustainable.
Most detection engineering teams simply do not have the bandwidth to be good at handling these data problems AND detecting threats against their organization.
All the problems
Let’s recap all the issues with log centralization:
Data storage is expensive
Moving data at scale is expensive
Querying the data in a performant way is expensive
Sometimes, the data you need just isn’t available
You might have to pay vendors more for the logs
You need dedicated data pipelines to scale collection and normalization
It takes a significant amount of time to build and manage these pipelines
You’ll likely need several products or custom built tools to tackle most of these problems
The amount of data generated by your organization will likely go up every year
You may be under utilizing the data, resulting in hard to justify costs
You still have to secure this infrastructure and data, on top of the infra and data you were setting out to protect in the first place
I’m sad to say, I don’t have a solid answer to this problem. What I know is that security teams in a SaaS-driven world need to be able to query data from their various purchased applications to check for conditions they deem risky. Where the underlying data lives isn’t as important; the logs are just a means to an end, in my opinion. What matters is being able to routinely check what’s happened in an application and alert relevant teams if something occurred that should not have, or that is indicative of a problem. Here are a few ideas for ways to at least reduce the pain, though none fully solve this problem.
Federated Search
A new product category that is emerging, and that has my interest, is “Federated Search”. Essentially, this is a product that uses data where it already lives and queries it as needed. This approach could significantly reduce the need to store data and potentially result in large cost savings. It is not foolproof, though, as I see several ways it could be challenging.
Some vendors will likely take the path of abstracting the work away from security teams by building integrations with popular applications and handling all the API calls themselves. The problem with this approach was mentioned above: if the search product does not include logic to search for an event you need, you may be stuck waiting for the vendor to add it. And if one of the applications you are integrating with changes its API, or its response to your pre-defined API call, fixing that will require time. Depending on how the product is built, I could still see this as an improvement. If you are operating your own data pipeline, the app changing its API behavior is already part of the maintenance you’re doing anyway, but you get the advantage of not needing to do loads of data storage and shaping. There will likely still be pockets of centralization, based on the platforms you need to monitor. For most organizations, you’re likely to end up with 3 or 4 primary data stores, plus maybe pockets of one-off data stores for those oddball applications.
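As a rough illustration of the federated idea, here is a minimal Python sketch of fanning a single query out across sources. The adapters and events are hypothetical, and a real implementation would call each vendor’s API at query time rather than searching an in-memory list; the key property is that nothing is copied into a central store:

```python
class SaaSAdapter:
    """Hypothetical per-application adapter. In production, search()
    would hit the vendor's audit-log API; here it filters an in-memory
    sample so the sketch is self-contained and runnable."""
    def __init__(self, name, events):
        self.name = name
        self.events = events

    def search(self, predicate):
        # Tag each hit with its source so results stay attributable.
        return [dict(e, source=self.name) for e in self.events if predicate(e)]

def federated_search(adapters, predicate):
    # Query each source where the data already lives; no central copy.
    hits = []
    for adapter in adapters:
        hits.extend(adapter.search(predicate))
    return hits

idp = SaaSAdapter("idp", [{"actor": "alice", "action": "mfa_disabled"}])
crm = SaaSAdapter("crm", [{"actor": "bob", "action": "bulk_export"}])

risky = federated_search([idp, crm],
                         lambda e: e["action"] in {"mfa_disabled", "bulk_export"})
print(risky)
```

The trade-off is visible even in the sketch: every query now depends on each adapter working, so a vendor API change breaks search rather than ingest.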
Ideally, this would result in a leaner process for creating detections for your applications. It would still require effort and several of the steps I mentioned above, but you’re lowering the burden and time to deliver, increasing the security value of that effort, and increasing overall capacity to create more detections or focus engineering efforts elsewhere.
Managed Detection and Response (MDR)
One thing I’ve seen several security teams do with positive results is hire an MDR provider to manage parts of their detection program. Generally, this is done for the “IT” systems the company uses for everyday operations, so the internal detection team can focus solely on the company’s product and securing it effectively. The advantage here is that you get a laser-focused detection program. Part of what makes this effective is that MDR providers can build out their data collection pipelines once and then monetize them by reusing those pipelines over and over to protect their customers.
The downside is that you lose some control and visibility into the rest of your infrastructure. I personally have not worked with an MDR in this way, but I am sure this could be managed with an effective relationship and collaboration. This still leaves you with a lot of the problems I mentioned above, but with a more narrowly defined focus, it is a much easier problem to solve. A lot of the complexity comes in when you’re trying to manage data from several vendors, apps, cloud hosting platforms, etc.
Go All In on Data Analysis Skills
Another way to solve this is to admit this is not just a “security” problem and form a team with a foundational understanding of creating and managing data pipelines. If you can build a team that becomes very good at that, it can act as a service to the detection engineering team and essentially provide a platform for detection and threat hunting, backed by a data lake. One way to do this would be custom-built systems on a public cloud platform, with services like data lakes, OpenSearch, etc. There are some SIEMs on the market that are a good fit here too. There are downsides, though. It’s still expensive to store this data, and it would be expensive to hire data engineers. You will also be left writing most queries/detections in SQL, which is not a common skill set for a lot of security engineers. It’s not impossible to learn and skill up, but it would take time and commitment.
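As a small illustration of what a SQL-backed detection looks like, here is a sketch using Python’s built-in SQLite as a stand-in for a data lake query engine. The table, sample events, and the rule itself (several failed logins followed by a success) are invented for the example:

```python
import sqlite3

# Toy "data lake" table; a real deployment would query Athena,
# OpenSearch SQL, or similar over normalized audit events.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE audit_events (ts TEXT, actor TEXT, action TEXT)")
con.executemany(
    "INSERT INTO audit_events VALUES (?, ?, ?)",
    [
        ("2024-05-01T12:00:00Z", "alice", "login_failed"),
        ("2024-05-01T12:00:30Z", "alice", "login_failed"),
        ("2024-05-01T12:01:00Z", "alice", "login_failed"),
        ("2024-05-01T12:02:00Z", "alice", "login_success"),
        ("2024-05-01T12:03:00Z", "bob", "login_success"),
    ],
)

# Hypothetical detection: actors with 3+ failed logins and a later success.
# SQLite treats boolean expressions as 0/1, so SUM() counts matches.
DETECTION_SQL = """
SELECT actor
FROM audit_events
GROUP BY actor
HAVING SUM(action = 'login_failed') >= 3
   AND SUM(action = 'login_success') >= 1
"""
flagged = [row[0] for row in con.execute(DETECTION_SQL)]
print(flagged)  # alice trips the rule; bob does not
```

Writing and tuning rules like this is a data-analysis exercise as much as a security one, which is exactly the skill set such a team would need.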
At the end of the day, you have to pick which trade-offs make the most sense for your organization. But I do know that most security teams are perpetually pressed for budget, so efficiency is nearly always a factor.
I’d love to hear what others are doing around this problem! If you know of a way to better handle this that I’ve missed, I’d love to hear more!