In 2020, there were an estimated 59 trillion gigabytes of data in the world, most of which was created in the latter half of the 2010s, and the figure continues to grow. To convert this raw, chaotic data into valuable intelligence, we use data mining tools and analytical techniques. Digital Shadows (now ReliaQuest) routinely uses several of these tools and techniques to help its clients identify trends within the cyber threat landscape, including in our recent blogs on vulnerability intelligence and initial access brokers (IABs). In this blog, we’ll detail some of the common techniques used to support our research.

The guiding principle behind data analysis is the data-to-information-to-intelligence pipeline. Raw data comes in many forms, including text, numbers, dates, and booleans. To be useful, it must be converted into information. This is achieved by cleaning the data and then running statistical tests and/or analytical algorithms on it. The results of the analysis are then interpreted to present an overall intelligence picture. Ultimately, this process exists to give anyone who needs to make security decisions the ability to make more informed ones.

This blog is the first of a two-part series diving into how we use data analysis in threat intelligence here at Digital Shadows (now ReliaQuest). Today we’ll focus on the initial steps we take before working on our data, along with some use cases related to data visualization and basic analysis. These cover the fundamentals; the second part will delve into more advanced techniques.

First Step: Extracting and Cleaning the Data

Data can come from many sources. The first step in data mining is finding where the data is located and determining whether it is appropriate and sufficient for the intended analysis. This involves studying the data’s schema and structure (or lack thereof) and gaining a clear understanding of how the data was collected, which is fundamental for determining what conclusions the data can theoretically support. The following factors must also be considered: scope for sample bias, collection gaps, and sample size.

Depending on the analysis being performed, sample bias doesn’t necessarily make a dataset unsuitable, provided the conclusions are heavily caveated: an incomplete conclusion drawn from a slightly biased sample can still offer useful insights. Data from multiple sources can also be combined into an integrated analysis; when doing so, it is vital to consider how the datasets are related.
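As an illustration of combining related datasets, the sketch below joins two small, hypothetical tables on a shared CVE identifier using pandas. The column names and values are assumptions made for the example, not our production schema.

```python
import pandas as pd

# Hypothetical example: vulnerability records from one source and
# exploitation reports from another, related by a shared CVE identifier.
vulns = pd.DataFrame({
    "cve_id": ["CVE-2021-0001", "CVE-2021-0002", "CVE-2021-0003"],
    "cvss": [9.8, 5.4, 7.2],
})
exploits = pd.DataFrame({
    "cve_id": ["CVE-2021-0001", "CVE-2021-0003"],
    "exploited_in_wild": [True, True],
})

# A left join keeps every vulnerability and flags those with a matching
# exploitation report; records without a match are not silently dropped.
combined = vulns.merge(exploits, on="cve_id", how="left")
combined["exploited_in_wild"] = combined["exploited_in_wild"].fillna(False)
print(combined)
```

Choosing the join type (left, inner, outer) is part of understanding how the datasets are related: an inner join here would silently discard every vulnerability without an exploitation report.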

After the dataset has been extracted and is known to be appropriate for the task at hand, it must be cleaned. Data cleaning actions will depend on the type of data in each column, but common steps include the following (illustrated in the sketch after this list):

  • Removing missing values
  • Removing illogical values (e.g. numbers outside an expected range)
  • Converting text to all lowercase and removing punctuation
  • Ensuring that data is stored as the correct type within the tool being used
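The sketch below shows what these steps might look like in pandas on a small, hypothetical table of access listings; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw export; the columns and values are illustrative only.
df = pd.DataFrame({
    "access_type": ["RDP", "WebShell!", None, "rdp "],
    "price_usd": [500, 3000, 250, -10],
    "first_seen": ["2021-01-04", "2021-02-17", "not a date", "2021-03-02"],
})

# Remove rows with missing values in the columns the analysis depends on.
df = df.dropna(subset=["access_type", "price_usd"])

# Remove illogical values, e.g. prices outside an expected range.
df = df[(df["price_usd"] > 0) & (df["price_usd"] < 100_000)]

# Normalize free text: lowercase, strip punctuation and surrounding whitespace.
df["access_type"] = (
    df["access_type"]
    .str.lower()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.strip()
)

# Ensure columns are stored as the correct types for the tool being used.
df["price_usd"] = df["price_usd"].astype(float)
df["first_seen"] = pd.to_datetime(df["first_seen"], errors="coerce")

print(df)
```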

After finding, extracting, joining and cleaning the data, the analysis can commence.

A Universal Truth: The Normal Distribution

When sampling a continuous variable, most values will be found near the average; as one moves away from the average in either direction, the frequency of values decreases, and extreme values are rare. This is because many continuous variables are effectively the sum of multiple contributing factors, and there are more combinations of factors that produce middle values than extreme ones.

To model these values, we often use the “normal distribution”: a probability distribution used to describe phenomena that have a default behavior and cumulative deviations from that behavior. Wherever one sees a mean value (i.e., the arithmetic average of a set of values), there is very often a bell curve behind it, and this assumption underpins many common statistical analysis techniques.

In a normal distribution, the mean is equal to the median and the mode. Roughly 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three. Real-world distributions that approximate the normal curve can also exhibit skew, in which the peak of the curve is shifted in a particular direction, and kurtosis, in which the tails are heavier or lighter than a true bell curve.
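As a quick sanity check of the 68/95/99.7 rule, the sketch below draws a large synthetic sample from a normal distribution with NumPy and measures how much of it falls within one, two, and three standard deviations of the mean; the distribution parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample standing in for any roughly normal variable.
sample = rng.normal(loc=6.0, scale=1.5, size=100_000)

mean, std = sample.mean(), sample.std()
for k in (1, 2, 3):
    within = np.mean(np.abs(sample - mean) <= k * std)
    print(f"within {k} standard deviation(s): {within:.1%}")
# Prints roughly 68.3%, 95.4%, and 99.7% for a normal sample.
```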

Normal Distribution of CVSS scores sourced from the MITRE CVE database.

Normal distributions appear everywhere, and visualizing them can help us determine the overall spread of the data. For example, the figure above shows the distribution of CVSS scores: while it isn’t a perfectly smooth bell curve, it does show that most vulnerabilities score between 5 and 7 and that extreme CVSS scores are rare.
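For readers who want to reproduce this kind of view, the sketch below plots a histogram with a fitted normal curve overlaid. Note that it uses simulated scores clipped to the valid 0–10 CVSS range, not the MITRE data behind the figure above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
# Simulated stand-in for CVSS scores, clipped to the valid 0-10 range.
scores = np.clip(rng.normal(loc=6.0, scale=1.6, size=5_000), 0, 10)

fig, ax = plt.subplots()
ax.hist(scores, bins=30, density=True, alpha=0.6, label="scores")

# Overlay the normal curve implied by the sample mean and standard deviation.
x = np.linspace(0, 10, 200)
ax.plot(x, stats.norm.pdf(x, scores.mean(), scores.std()), label="normal fit")
ax.set_xlabel("CVSS score")
ax.set_ylabel("density")
ax.legend()
plt.show()
```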

Visualizing Data: The Box Plot

The graphs used to visualize patterns in data come in many forms. The best graphs are simple and intuitive while conveying as much useful information as possible; they should make their point without requiring more than a couple of lines of explanatory text.

The most appropriate graph type depends on the analysis being performed and the conclusions one wishes to convey. Bar graphs are mainly used to compare categories or show a trend over time. Line graphs are normally used to compare continuous variables, but can also depict several trends over time without overcrowding the page. Beyond these commonly used graphs, there are also more specialized plots.

One example, used in a Digital Shadows (now ReliaQuest) blog on initial access brokers, is the box plot. This plot is used to visualize and compare several distributions side by side. The figure below depicts a box plot comparing the price distributions of several types of initial access.

A box plot depicting the price distributions of several types of initial access

The box indicates the interquartile range, where the middle half of the data points lie. The whiskers on either end of the box extend to the extremes of the distribution, with each whisker covering roughly a quarter of the data. Outliers are usually plotted separately, but have been omitted from this particular analysis.
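A minimal matplotlib sketch of this kind of plot is shown below. The access-type labels echo the figure, but the prices are simulated (and the “VPN” category is added purely for illustration), and the whiskers are set to span the full range so no outliers are drawn separately.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
# Simulated listing prices per access type; illustrative only.
prices = {
    "WebShell": rng.lognormal(mean=7.5, sigma=1.0, size=200),
    "RDP": rng.lognormal(mean=5.5, sigma=0.6, size=200),
    "VPN": rng.lognormal(mean=6.5, sigma=0.8, size=200),
}

fig, ax = plt.subplots()
# whis=(0, 100) extends the whiskers to the minimum and maximum,
# so no points are drawn separately as outliers.
ax.boxplot(list(prices.values()), labels=list(prices.keys()), whis=(0, 100))
ax.set_ylabel("Listing price (USD)")
ax.set_title("Price distributions by access type (simulated data)")
plt.show()
```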

From this graph we can see that WebShell is the most valuable access type on average and shows the greatest spread. We can infer that WebShell typically grants a threat actor the highest level of access, but that this level of access also varies greatly. RDP shows the lowest average price despite offering effectively a “hands-on-keyboard” level of access to a machine, from which we can infer that most RDP offerings are for low-privileged machines and are of little value to threat actors.

Test Your Hypotheses: Null Hypothesis Testing

Null hypothesis statistical testing is a framework used to determine whether a relationship observed between samples reflects the wider population or is simply down to sampling error. The starting assumption, called the null hypothesis, is that there is no significant relationship between the samples and that any relationship observed is due to chance alone. The appropriate test depends on the type of relationship being tested for, the type of data within each sample, and whether population statistics are known. The table below summarizes some of the most commonly used tests.

Commonly Used Statistical Tests

The output of these statistical tests is converted into a p-value: the probability, expressed as a decimal, of observing results at least as extreme as those seen if the null hypothesis is correct. If this value is less than 0.05 (i.e., there is less than a 5% chance of seeing the result if there is no real difference between the populations), we reject the null hypothesis and treat the results as statistically significant.

An example of ANOVA in action can be seen in the IAB graph above: the p<0.05 noted in the plot title indicates a significant difference between the types of access, which tells us we can draw meaningful conclusions from the graph.
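For reference, a one-way ANOVA can be run in a few lines with SciPy, as in the sketch below. The samples here are simulated stand-ins for per-access-type prices rather than the data behind the plot above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulated prices for three access types; stand-ins for real listings.
webshell = rng.normal(loc=3000, scale=800, size=60)
rdp = rng.normal(loc=1200, scale=400, size=60)
vpn = rng.normal(loc=2000, scale=600, size=60)

# One-way ANOVA: the null hypothesis is that all groups share the same mean.
f_stat, p_value = stats.f_oneway(webshell, rdp, vpn)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis.")
```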

Conclusion

This blog covered the early stages of the data analysis process, spanning data collection through to basic analytical techniques. Some data problems, however, require more advanced solutions, often involving machine learning methods, which will be covered in the second part of this blog series. If you’d like to access Digital Shadows’ (now ReliaQuest) constantly updated threat intelligence library, providing insights on a wide range of cyber threats, sign up for a demo of SearchLight (now ReliaQuest’s GreyMatter Digital Risk Protection) here.