Bad Data: Is Cybersecurity Data Any Good at All?

Cybersecurity data is notoriously unreliable. What should we look for when we want trustworthy insights?

Guest Commentary

June 7, 2023

8 Min Read

Certain principles apply when presenting data. Whenever possible, datasets should be transparent: Where did the data come from? How was it gathered? Who was involved, and how was it analyzed? In short, research should be presented empirically if it is to be trusted.

When was the last time you saw a report issued by a cybersecurity organization that fully abided by these principles? There is a surfeit of intriguing data, to be sure. Some of it is presented in compelling -- and beautifully illustrated -- fashion. But most organizations are vague about their data sources and opaque when it comes to their methodology. Proprietary concerns trump transparency in the vast majority of cases.

Thus, we are left with an etiolated stream of information. There are efforts to rectify this problem through governmental information-sharing programs such as the Netherlands’ National Detection Network (NDN) and standardized frameworks such as Common Vulnerabilities and Exposures (CVE). Despite their best intentions, these programs offer piecemeal data in inconsistent fashion.
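
Even the most standardized of these sources illustrates the problem: CVE records arrive one vulnerability at a time, and each consumer must stitch them into something analyzable. As a rough sketch only -- assuming Python with the requests library and the public NVD CVE API 2.0, whose endpoint and field names may differ from what is shown here -- pulling a single record looks something like this:

```python
# A rough, hypothetical sketch: pull one record from the public NVD CVE API 2.0.
# The endpoint and response fields are assumptions based on NVD's documented API
# and may change; this is not drawn from any report discussed in the article.
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def fetch_cve(cve_id):
    """Fetch a single CVE record from the National Vulnerability Database."""
    resp = requests.get(NVD_URL, params={"cveId": cve_id}, timeout=30)
    resp.raise_for_status()
    vulns = resp.json().get("vulnerabilities", [])
    return vulns[0]["cve"] if vulns else {}

cve = fetch_cve("CVE-2021-44228")  # Log4Shell, used only as a well-known example
english = next((d["value"] for d in cve.get("descriptions", []) if d.get("lang") == "en"), "")
print(cve.get("id"), cve.get("published"))
print(english[:200])
```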

In the meantime, we must rely on occasional, usually generalized government reports and the scant academic literature. And we must sort through scads of seemingly well-researched reports from the private sector that are riddled with methodological flaws.

As the authors of a 2016 workshop paper note, “The data quality of shared threat intelligence plays an important role as inaccurate data can result in undesired effects.” This is not an insignificant problem.

Cybersecurity data experts Yuval Wollman, president of CyberProof and chief cyber officer at UST; Rahul Telang, a professor of information systems at Carnegie Mellon University; and Fred Rica, a partner at BPM, share their thoughts on the integrity of cybersecurity data -- and how to improve it.

The State of Cybersecurity Data

The cybersecurity space is deluged with data. But how much of it is reliable? Juicy, headline-generating conclusions abound. But dig a little into most reports issued by cybersecurity companies and you’ll find tenuous conclusions, sloppy methodology, and a lack of data transparency -- even when reports are based on publicly available information.

“There is a lack of consistent data upon which one can make decisions,” Telang laments.

That is not to say we should not examine these reports -- or take them seriously. Even conclusions based on flawed data can be useful, as long as they are understood as such. For all the discussion of cybersecurity as a general concern for businesses, real data about the causes of breaches, who is affected, and what can be done to prevent them in the future is but a trickle. We have to take what we can get.

“In general, I tend to see data as ‘directionally correct,’” says Rica. That is, the crude data and vague conclusions offered in most reports do indicate general trends we should take notice of. However, as Telang observes, these data sets do not have the integrity that defines our understanding of demographics, crime, and other statistical categories.


Rica urges observers to take heed of the source. “Data published by independent sources such as trade organizations or other third parties is probably more accurate,” he advises. “Vendor data and vendor claims should be closely scrutinized in order to fully understand context, meaning and applicability.”

Wollman concurs, adding that “data from reputable sources such as government agencies, academic institutions, and established cybersecurity companies may be higher quality.”

The problem, of course, is that this data is often very general and thus not necessarily applicable to individual industries. These more rigorous reports rely on public reporting requirements -- and nearly all companies release exactly the amount of data mandated by law.

The Proprietary Veil

Corporations have every incentive to offer only the data they must. Much of the available cybersecurity data is the result of mandatory reporting in the wake of breaches.

“Private entities share their data once in a while, but not often enough to conduct consistent analysis,” Telang observes.

“There is much data that is not shared. There is a gap there,” Wollman agrees. “When it comes to cyber, in particular, one issue that impacts the quality of the data is that there is a general lack of trustworthy datasets because although companies collect huge quantities of data, they are generally reluctant to share it.”


“These companies are not eager to share their data,” Telang says. “They’re forced to.”

And in many cases, organizations don’t have much of a grasp of their own data in the first place.

“Researchers have observed that many organizations are more focused on eradication and recovery and less on security incident learning,” claims a 2019 conference paper.

“I don’t believe this is because of any attempt to deceive or mislead,” says Rica. “Breach and vulnerability data is a bit of a ‘moving target’ for most organizations. The full extent of a breach or other incident may remain unknown for an extended period of time.”

Wollman is slightly more skeptical. “One issue to consider,” he says, “is potential conflicts of interest or biases in the private sector. Could these issues impact the reliability of a particular report?”

“The data regarding losses that happen because of security incidents is really sketchy,” Telang confides. Still, he allows, “It's really hard to be a distributor of security information.”

“When companies refuse to disclose their methodology or data sources, it becomes difficult to evaluate the reliability and accuracy of their findings,” adds Wollman. “Lack of transparency can create confusion and uncertainty, leading organizations to struggle in making informed decisions about how to protect their assets. Bottom line: This means the shared reports or information are not necessarily as helpful as they could be in preventing the next attack.”

How Data is Treated

“In general, I consider data to be more trustworthy when it comes from established and reputable cybersecurity companies or organizations that rely on rigorous methodology, and clear and concise presentation of findings,” Wollman says. “A good data report is also timely and relevant, providing insights into the latest cybersecurity threats and trends to inform decision-making.”

Still, the authors of the 2016 paper, from the University of Innsbruck in Austria, warn, “Some types of data quality problems will get worse as the number of participants and integrated data sources increases. As security data is shared between stakeholders, aggregated from different sources and linked to other data already present in the data sets, the number of base errors also increases.”

Data is also often highly generalized, making it difficult to apply conclusions about security threats to individual industries. As the researchers from the University of Innsbruck note, “Combining short-lived shared threat intelligence from disjunct industries makes the important intelligence hard to find.”

Even more specific data is not cataloged in usable fashion, according to the authors of the 2019 paper. They note that human error plays a huge role here: dates are not recorded in a consistent format, and events are not documented using rigorous taxonomic categorization.
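
The date problem alone is enough to stall an analysis. The sketch below is purely illustrative -- the field values and formats are hypothetical, not drawn from any real incident dataset -- but it shows the kind of normalization that inconsistent cataloging forces on anyone trying to compare incidents:

```python
# Hypothetical illustration: incident records logged with inconsistent date
# formats must be normalized before any cross-incident analysis is possible.
from datetime import datetime
from typing import Optional

# Formats one might encounter across records; this list is invented for illustration.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%Y-%m-%dT%H:%M:%S"]

def normalize_date(raw: str) -> Optional[str]:
    """Try each known format; return an ISO 8601 date, or None for manual review."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # better to flag than to guess

observed = ["2023-03-14", "03/14/2023", "14 Mar 2023", "sometime in March"]
print([normalize_date(d) for d in observed])
# ['2023-03-14', '2023-03-14', '2023-03-14', None]
```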

A Plague of Surveys

Take a look at the slick reports issued by cybersecurity companies and you will frequently find them rife with survey data. This is not necessarily a bad thing; if properly designed and executed, surveys can offer interesting general insights and indicate noteworthy trends. But these findings are less useful in cases where sample sizes are small and selection bias is high.

“Many times, cybersecurity surveys are more anecdotal, rather than comprehensive. Information shared is based on speaking to a small number of CISOs, for example,” says Wollman.

Does a finding based on data from ten companies, whose executives are known to the researchers, really offer a realistic perspective?
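
Basic sampling arithmetic suggests not. The sketch below uses the standard normal-approximation margin of error -- it is not drawn from any of the reports discussed here -- to show how wide the uncertainty is at that scale:

```python
# Illustrative arithmetic only: the 95% margin of error for a surveyed proportion
# shrinks roughly with the square root of the sample size (normal approximation).
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for an observed proportion p from n respondents."""
    return z * sqrt(p * (1 - p) / n)

for n in (10, 100, 1000):
    moe = margin_of_error(0.5, n)  # worst case: half of respondents answer "yes"
    print(f"n={n:4d}: +/- {moe * 100:.0f} percentage points")
# n=  10: +/- 31 percentage points
# n= 100: +/- 10 percentage points
# n=1000: +/- 3 percentage points
```

And that is before selection bias, which no amount of additional respondents will cure on its own.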

When presenting data, Wollman advises, “be transparent in your intention, your survey methodology, your sample size and your analysis algorithms.”


“The reliability of survey data depends on the quality of the methodology, including the sampling technique, questionnaire design, and data analysis,” Wollman claims. “Surveys that use random sampling techniques, have large sample sizes, and employ rigorous data analysis techniques are more likely to produce reliable results.”

“Survey data can be a good indicator of trends, but it is a ‘reader beware’ situation,” Rica cautions. “It is important to understand the scope of the survey, the sample size and the collection technique in assessing a survey’s applicability to your intended purpose.”

“Well-known names like Forrester and Gartner share their methodology and the number of companies they have interviewed,” Wollman notes. “This obviously makes their work more valuable compared to others.”

Improving Cybersecurity Data Reporting

Experts seem to concur that the current state of data reporting and exchange is not tenable.

“Organizations need to enhance the quality of data generated during security incident response investigations,” argue the collaborators on the 2019 paper.

“Best research principles (not specific to cyber) include transparency in methodology and data sources, rigor in data collection and analysis, clear presentation of findings, and acknowledgement of any limitations or biases in the research,” Wollman exhorts.

The scientists from the University of Innsbruck suggest that “currently available threat intelligence sharing tools lack the customization, filters, news stream aggregation capabilities, and search capabilities required for their daily work.”

Telang hopes for the formation of an effective government entity that collects cybersecurity data, analyzes it, and disseminates it in an organized fashion. That way, he says, “everybody can believe in it and form some reasonable opinions.”

In the meantime, a stricter application of basic empirical principles -- with a particular emphasis on transparent data sets -- and the construction of usable interfaces for accessing data seem like worthy goals in future cybersecurity reporting.

