Case study: The Importance of Operating Your Platform

Background

The client, a large organisation relying on Splunk for its security operations, encountered a significant issue when a key member of their team went on long-term sick leave. This individual was the primary point of knowledge and oversight for their Splunk platform. This led to a situation where the platform's health began to decline, with critical security functions being neglected. As a result, the platform fell into disrepair, creating a potential security risk for the organisation.

At this critical juncture, our team was brought in to assess the situation, stabilise the platform, and ensure that the client’s security posture was restored and enhanced. This case study outlines the importance of operating your platform properly.

Initial Assessment

During Apto’s initial engagement, which consisted of resolving and advising on a backlog of items, it became apparent that the client’s platform was performing inefficiently and that data mapping was misconfigured. This created gaps in their security event coverage and overall visibility.

We onboarded the client onto our Operate service and have since detected critical systems going down, storage running low on key security data aggregation servers, and vital hosts that had stopped forwarding expected events.
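
As an illustration of the kind of check our Operate monitoring relies on, the following is a minimal SPL sketch for spotting hosts that have stopped forwarding events. The index scope and the one-hour threshold are illustrative assumptions and would be tuned per environment.

    | metadata type=hosts index=*
    ``` flag hosts whose most recent event is older than an hour ```
    | eval hours_since_last_event = round((now() - lastTime) / 3600, 1)
    | where hours_since_last_event > 1
    | sort - hours_since_last_event
    | table host hours_since_last_event

A scheduled alert built on a search like this surfaces silent forwarders long before they become blind spots.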

Identifying Gaps in Data Health

One of the first issues we uncovered was a significant gap in data health. The platform was missing some data models, which led to blind spots in their security coverage. This posed a serious risk: the client believed certain security use cases were functioning correctly when, in fact, they were not running at all.

The absence of data models is particularly dangerous because it leaves attack vectors unmonitored. Had the client been aware of these blind spots, they could have acted accordingly; instead, the lack of awareness meant that potentially dangerous security gaps went unchecked. Our team quickly identified the missing data models and mitigated the risks by re-establishing the data flow, ensuring that security use cases were fully operational again.
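
A quick way to confirm whether a detection's underlying data model is actually receiving events is to count against it over a recent window. The sketch below uses the CIM Authentication data model purely as an example; the client's own data models would be substituted in.

    ``` a zero count means detections built on this data model are searching nothing ```
    | tstats count from datamodel=Authentication where earliest=-24h@h latest=now

Running this routinely for each data model that a security use case depends on turns a silent failure into a visible one.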

Failing Saved Searches

Another critical issue involved saved searches that were failing to run. These searches were linked to security use cases, meaning that their failure resulted in a lack of visibility across areas of the organisation’s security posture. The cause was a misconfiguration of macros embedded within numerous security alerts and detections. A further review of the alert and detection catalogue identified additional skipped and failed searches, which we diagnosed as symptoms of wider platform issues: over-use of compute resources, with expensive searches scheduled to run over overlapping periods.
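
Skipped scheduled searches are recorded in Splunk's own scheduler log, so a short SPL sketch along the following lines can quantify the problem. It assumes access to the _internal index; the seven-day window is an illustrative choice.

    index=_internal sourcetype=scheduler status=skipped earliest=-7d
    ``` group skips by the search that was skipped and the scheduler's stated reason ```
    | stats count AS skips BY savedsearch_name, reason
    | sort - skips

The reason field typically points at the underlying cause, such as concurrency limits being hit when too many expensive searches run at once.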

Data Ingest Volume

We also performed a detailed analysis of the client’s ingest volume, identifying trends in their ingest against their licence. The client was using only around half of their available Splunk licence, which is measured on daily data ingest rather than other factors. This presented two options (a sketch of the trend analysis follows the list below):

  1. Maximise Licence Usage: We advised the client on how to maximise their licence by onboarding additional data from their estate, which could enhance their security use cases.
  2. Reduce Costs: Alternatively, we suggested potential cost savings by cutting back on unused portions of their licence, providing them with a clear picture of trends and usage patterns to support future cost optimisation.
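
The trend analysis itself can be sketched with a search over Splunk's licence usage log. This assumes the search runs where license_usage.log is indexed (typically the licence manager) and that _internal retention covers the period of interest; the thirty-day window is illustrative.

    index=_internal source=*license_usage.log* type=Usage earliest=-30d
    ``` sum the bytes indexed per day and convert to gigabytes ```
    | timechart span=1d sum(b) AS bytes
    | eval daily_GB = round(bytes / 1024 / 1024 / 1024, 2)
    | fields _time daily_GB

Comparing the resulting daily figures against the licensed quota is what made both the headroom and the cost-saving option visible.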

Addressing Recurring System Errors

Recurring system errors were another issue that had gone unnoticed due to the absence of key team members. The platform had suffered from a lack of ownership, which exacerbated these issues. Our team systematically identified and addressed these errors, ensuring that the platform was running smoothly once again.
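
A simple recurring health check of this kind can be sketched as a summary of splunkd errors by component, which makes repeat offenders obvious at a glance. The 24-hour window is an illustrative assumption.

    index=_internal sourcetype=splunkd log_level=ERROR earliest=-24h
    ``` recurring errors cluster under the same component name ```
    | stats count BY component
    | sort - count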

This situation highlighted the risk of relying too heavily on a single individual for critical platform management. When that person is unavailable, as in the case of the long-term sick leave, the platform becomes vulnerable to oversights and errors.

Missing and Misconfigured Lookups

Lookups, which are key to running searches against specific security data (such as IP addresses flagged as threats), were also found to be missing or incorrectly configured. This caused searches to run without the necessary reference data, rendering them ineffective and creating further blind spots in the client’s security posture.

We investigated the root causes of the missing lookups and corrected them, ensuring that all security use cases were referencing the correct data. This significantly improved the client’s visibility and closed potential security gaps.
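
A lightweight way to verify that a detection's reference data is in place is to read the lookup back and count its entries before trusting the detections that depend on it. The lookup name below, threat_ip_list.csv, is purely hypothetical.

    ``` an error or a zero count here means dependent detections are matching against nothing ```
    | inputlookup threat_ip_list.csv
    | stats count AS entries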

Outdated Apps and Forwarders

During our review of the platform, we identified numerous outdated apps and forwarders. These outdated components posed risks to the platform’s performance and security, so we created a detailed change management table to guide the client through necessary updates. We performed updates that did not require maintenance windows and scheduled more complex updates at times convenient for the client, minimising disruption to their operations.
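
An app version inventory of the kind that fed the change management table can be sketched via Splunk's REST endpoint for locally installed apps; the versions returned are then compared against the latest releases. This assumes REST access on the search head.

    ``` list enabled apps and their installed versions on this instance ```
    | rest /services/apps/local splunk_server=local
    | search disabled=0
    | table label title version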

This work ties into load balancing, which is a key factor to consider when distributing scheduled workloads across the platform.

We reviewed the scheduling of searches across the platform. The client had scheduled many security searches to run at the same time each day, creating significant performance spikes, which led to search skips and further blind spots. By rescheduling searches to run at different times throughout the day, we balanced the load on the platform and prevented these issues from recurring. This also improved the overall health and responsiveness of the platform.
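
Finding the collisions is straightforward to sketch: scheduled searches that share the same cron expression are the candidates for staggering. The threshold of five searches per schedule below is an arbitrary illustrative cut-off.

    | rest /servicesNS/-/-/saved/searches splunk_server=local
    | search is_scheduled=1 disabled=0
    ``` group scheduled searches by their cron expression to expose pile-ups ```
    | stats count AS searches, values(title) AS titles BY cron_schedule
    | where searches > 5
    | sort - searches

Spreading the heaviest of these across different minutes and hours is what removed the skips and the associated blind spots.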

Conclusion

Our ongoing support through Operate allows the client’s internal Splunk team to focus on higher-level tasks while we handle day-to-day maintenance, giving them peace of mind and freeing them to pursue more strategic security initiatives.

See how we can build your digital capability,
call us on +44(0)845 226 3351 or send us an email…