Sepio | Blog

The Importance of Diversity and Quality of Data

With vast amounts of data now available, companies in nearly every industry, cybersecurity included, are focused on exploiting data for competitive advantage. We live in the era of Big Data: the volume and variety of data have far outstripped the capacity of manual analysis and, in some cases, of conventional databases, demanding ever more processing power. At the same time, computers have become far more powerful, networking is ubiquitous, and algorithms have been developed that can connect datasets to enable broader and deeper analyses. As a result, companies are turning to Data Science and its vast potential.

Machine Learning and Artificial Intelligence

Machine Learning and Artificial Intelligence have become everyday terms in the workplace. A 2020 Deloitte survey found that 67% of companies are using machine learning, and 97% are using or planning to use it in the next year.

In 1959, Arthur Samuel defined Machine Learning as the subfield of Artificial Intelligence that “gives computers the ability to learn without being explicitly programmed”. Over the last quarter of a century, Machine Learning has become one of the most important parts of the IT revolution impacting our lives.

Although ML dates from the early days of Artificial Intelligence in the late 1950s, it underwent its first resurgence when the concept of data mining began to take off approximately 20 years ago. Data mining algorithms look for patterns in information. Machine Learning does the same thing but goes one step further: the program changes its behavior based on what it learns.

Only as good as the data they learn from

Machine Learning starts with data: numbers, photos, text and more. Data of every imaginable type is collected from various sources and prepared for use as training data, the information the Machine Learning model will be trained on. In general, the more diverse the training data, the better the model will perform on data it has not seen before.
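
As a rough illustration of what "preparing" training data can involve, the sketch below drops incomplete and duplicate records and then measures how varied the remaining labels are. It is a minimal example under stated assumptions, not a description of any particular pipeline: the record fields (`device`, `vendor`, `label`), the helper names `clean` and `label_entropy`, and the use of Shannon entropy as a diversity measure are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def clean(records):
    """Drop records with missing fields and exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for rec in records:
        # Skip records where any field is missing or empty.
        if any(value is None or value == "" for value in rec.values()):
            continue
        # Use the sorted (field, value) pairs as a hashable fingerprint of the record.
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def label_entropy(records, label_key):
    """Shannon entropy (in bits) of the label distribution.

    0.0 means every record carries the same label; higher values mean
    the labels are spread more evenly across classes.
    """
    counts = Counter(rec[label_key] for rec in records)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical sample records, invented for this example.
records = [
    {"device": "keyboard", "vendor": "A", "label": "benign"},
    {"device": "keyboard", "vendor": "A", "label": "benign"},   # exact duplicate
    {"device": "mouse", "vendor": None, "label": "benign"},     # missing vendor
    {"device": "usb-hub", "vendor": "B", "label": "malicious"},
]

cleaned = clean(records)
print(len(cleaned), "records kept")
print("label entropy (bits):", label_entropy(cleaned, "label"))
```

A low entropy value would suggest the dataset is dominated by one class, which is one simple warning sign that the training data may not be diverse enough.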

But although Machine Learning algorithms can help a company get better results and better products out of its data assets, they will only ever be as good as the data they learn from. If that data is not diverse enough, or has not been cleaned and processed, the resulting model can overfit: it learns the detail and noise in the training data to the extent that its performance on new data suffers. In simpler terms, when the data is not diverse and its quality is not as high as it could be, Machine Learning models will produce extremely good results on the data they were trained on but perform poorly on new, unseen data.
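
The gap between training and real-world performance can be seen in a small experiment. The sketch below is an illustrative toy, not a real workload: it generates synthetic one-dimensional data with 20% of the labels deliberately flipped (the noise), then compares a model that memorizes every training point (1-nearest-neighbour) with a much simpler model that learns a single cut-off. All names (`make_data`, `memorizer_predict`, `fit_threshold`) and parameters are assumptions made for this example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def make_data(n, noise=0.2):
    """Points x in [0, 1); the true label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < noise:
            label = 1 - label  # simulate a mislabelled (noisy) record
        data.append((x, label))
    return data


def memorizer_predict(train, x):
    """1-nearest-neighbour: copy the label of the closest training point, noise included."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def fit_threshold(train):
    """Simple model: pick the single cut-off that makes the fewest training errors."""
    candidates = [x for x, _ in train]
    return min(candidates, key=lambda t: sum(int(x > t) != y for x, y in train))


def accuracy(predict, data):
    """Fraction of points in `data` whose label `predict` gets right."""
    return sum(predict(x) == y for x, y in data) / len(data)


train = make_data(200)
test = make_data(1000)  # new, unseen data drawn the same way

threshold = fit_threshold(train)
memo_train = accuracy(lambda x: memorizer_predict(train, x), train)
memo_test = accuracy(lambda x: memorizer_predict(train, x), test)
simple_train = accuracy(lambda x: int(x > threshold), train)
simple_test = accuracy(lambda x: int(x > threshold), test)

print(f"memorizer: train={memo_train:.2f} test={memo_test:.2f}")
print(f"threshold: train={simple_train:.2f} test={simple_test:.2f}")
```

The memorizing model scores perfectly on its own training data by construction, because each point's nearest neighbour is itself, yet it carries the flipped labels over to new data. The simpler model cannot fit the noise, so its training score is lower but it tends to hold up better on the unseen set.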

Data quality and diversity have become an essential pillar of any Data Science activity, especially in cybersecurity, where there is no room for even small mistakes. Here at Sepio, when collecting data, we make sure it is cleaned and as diverse as possible so that our Machine Learning models and algorithms perform as well as they can.

April 19th, 2022