By Liviu Arsene
Machine learning and artificial intelligence are often used interchangeably, but to be precise, machine learning is an application of artificial intelligence and focuses on algorithms that can parse large amounts of quantitative and qualitative data, learn from it, and then draw statistical inferences.
Security-centric machine learning algorithms are developed and trained to accurately and efficiently identify new and unknown threats in a timely manner, going beyond traditional static or behavioral analysis. However, these algorithms should not be considered a magic bullet that can proactively protect against any threat; they’re more of a booster shot that can help fight off new strains of malware.
The security industry uses machine learning to augment the efficacy of security layers – such as anti-malware, anti-spam, anti-fraud and anti-phishing detection – by making them proactive instead of reactive.
Here are the five key elements to consider when evaluating machine learning capabilities:
1) Data Models
Unlike traditional signatures – or hashes – that identify a known malicious file using a unique string of characters, machine learning algorithms use data models. The key benefit is that a model has a small footprint, usually less than 1 kilobyte in size, and describes a set of shared characteristics known to be associated with a set of malicious files.
All this information is represented as a mathematical equation that needs to be “resolved” by a file. Instead of having to constantly update a large list of hashes, a single model can determine whether each of a large number of unknown files is malicious or safe, based on how it “resolves” the model.
Models can be deployed both in the cloud and locally, boosting protection against unknown threats and varied types of attacks. This means that some locally deployed machine learning models can better protect specific organizations, as they’ve been trained to defend that organization’s particular infrastructure.
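To make the contrast with hashes concrete, here is a minimal sketch rather than any vendor's implementation. It assumes a simple logistic model; the feature names, weights and bias are invented for illustration, whereas a real model would learn them from training data.

```python
import hashlib
import math

# Signature approach: one list entry per known malicious file.
KNOWN_BAD_HASHES = {hashlib.sha256(b"known malicious payload").hexdigest()}

def signature_match(file_bytes: bytes) -> bool:
    """Flags a file only if its exact hash is already on the list."""
    return hashlib.sha256(file_bytes).hexdigest() in KNOWN_BAD_HASHES

# Model approach: a handful of numbers describing shared characteristics.
# Hypothetical weights, chosen by hand for this example.
WEIGHTS = {"entropy": 0.9, "is_packed": 1.5, "imports_crypto_api": 1.2}
BIAS = -7.0

def model_score(features: dict) -> float:
    """Logistic function: maps a file's features to a value between 0 and 1."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

# A modified variant changes the hash but keeps the tell-tale characteristics.
variant_bytes = b"known malicious payload, slightly modified"
variant_features = {"entropy": 7.6, "is_packed": 1.0, "imports_crypto_api": 1.0}
benign_features = {"entropy": 4.2, "is_packed": 0.0, "imports_crypto_api": 0.0}

print(signature_match(variant_bytes))         # the hash list misses the variant
print(model_score(variant_features) > 0.5)    # the model still scores it as malicious
print(model_score(benign_features) > 0.5)     # a typical clean file scores low
```

The whole model here is three weights and a bias, which is why a model can stay small while a hash list keeps growing.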
Machine learning models can also overlap in detection for increased resiliency. For example, if some models focus on specific threats while others are more generic, two or more models can solve for the same malicious file, increasing the chances of catching unknown malware.
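The overlap between specialized and generic models can be sketched as follows. The three models, their feature names and their cut-off values are all hypothetical stand-ins; each one would normally be a trained model rather than a hand-written rule.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]

def ransomware_model(f: Features) -> bool:
    # Specialized: looks only for mass file encryption.
    return f.get("files_encrypted_per_min", 0.0) > 50

def packer_model(f: Features) -> bool:
    # Specialized: looks only for packed, high-entropy binaries.
    return f.get("entropy", 0.0) > 7.2 and f.get("is_packed", 0.0) == 1

def generic_model(f: Features) -> bool:
    # Generic: a broad weighted score over several weak signals.
    score = (0.1 * f.get("entropy", 0.0)
             + 0.5 * f.get("is_packed", 0.0)
             + 0.4 * f.get("unsigned", 0.0))
    return score > 1.4

MODELS: Dict[str, Callable[[Features], bool]] = {
    "ransomware": ransomware_model,
    "packer": packer_model,
    "generic": generic_model,
}

def detections(f: Features) -> List[str]:
    """Returns the name of every model that flags the file."""
    return [name for name, model in MODELS.items() if model(f)]

unknown_file = {"entropy": 7.5, "is_packed": 1.0, "unsigned": 1.0}
clean_file = {"entropy": 4.0, "is_packed": 0.0, "unsigned": 0.0}

print(detections(unknown_file))  # flagged by both the packer and generic models
print(detections(clean_file))    # flagged by none
```

If one model were evaded or retired, the other would still catch this sample, which is the resiliency the overlap buys.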
2) Machine Learning Algorithms
Implementing the machine learning model, though, is more complicated than it looks. Various types of machine learning algorithms are normally used, as some yield better results than others, depending on their intended purpose.
For instance, perceptrons, binary decision trees, restricted Boltzmann machines, genetic algorithms, support vector machines, artificial neural networks and even custom algorithms are used either individually or together to identify specific types of malware or malware families. How well each one performs depends on the security application for which it is intended, such as anti-malware versus anti-phishing. Genetic algorithms, for example, may be more accurate and effective at spotting malware than at spotting fraudulent or command-and-control domains.
Consequently, it’s important to first understand the task, then test various options or algorithms before committing to a single type of algorithm.
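Testing options before committing can be as simple as training each candidate on the same data and comparing results on samples it has not seen. The sketch below is illustrative only: the two candidates are deliberately tiny stand-ins for the algorithms named above, and the data set and feature names are made up.

```python
from statistics import mean

# Toy labelled samples: (entropy, suspicious_api_calls) -> 1 = malicious, 0 = clean.
TRAIN = [((7.8, 9), 1), ((7.5, 7), 1), ((6.9, 8), 1), ((7.1, 1), 0),
         ((4.1, 1), 0), ((3.8, 0), 0), ((5.0, 2), 0), ((4.4, 1), 0)]
TEST = [((7.6, 8), 1), ((7.0, 6), 1), ((7.3, 0), 0), ((4.0, 1), 0)]

def train_entropy_stump(samples):
    """One-feature rule: flag files whose entropy exceeds the class midpoint."""
    bad = mean(x[0] for x, y in samples if y == 1)
    good = mean(x[0] for x, y in samples if y == 0)
    cut = (bad + good) / 2
    return lambda x: int(x[0] > cut)

def train_nearest_centroid(samples):
    """Two-feature rule: assign the class whose average sample is closest."""
    def centroid(label):
        rows = [x for x, y in samples if y == label]
        return tuple(mean(col) for col in zip(*rows))
    c_bad, c_good = centroid(1), centroid(0)
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return lambda x: int(dist2(x, c_bad) < dist2(x, c_good))

def accuracy(predict, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in samples) / len(samples)

CANDIDATES = {
    "entropy_stump": train_entropy_stump,
    "nearest_centroid": train_nearest_centroid,
}
results = {name: accuracy(train(TRAIN), TEST) for name, train in CANDIDATES.items()}
best = max(results, key=results.get)

print(results)
print(best)
```

On this toy data the one-feature rule misclassifies a clean but high-entropy file, while the two-feature candidate does not. With different data or a different task the ranking could easily flip, which is the point of testing first.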
3) Data Sets
Data sets are also key in evaluating machine learning algorithms. The initial training sets – clean and malicious – must be used to properly build an effective model, which is then applied to a production data set of relevant and fresh samples. This means that before you deploy a model in production, you need to make sure it “understands” what clean and malicious files look like with an accuracy as close to 100 percent as possible, without sacrificing performance.
For example, before having a machine learning algorithm differentiate between apples and oranges, you need to teach it everything about apples and everything about oranges, individually.
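One way to check how well a model “understands” both classes is to score it on fresh labelled samples that were kept out of training. The following sketch assumes a stand-in `predict` function and invented sample data; the metric names are common conventions, not taken from any specific product.

```python
# Hypothetical labelled sets: (sample_id, entropy, is_malicious).
TRAINING_SET = [("t1", 7.9, True), ("t2", 7.4, True),
                ("t3", 4.0, False), ("t4", 5.1, False)]
FRESH_SET = [("f1", 7.8, True), ("f2", 7.2, True), ("f3", 6.9, True), ("f4", 6.1, True),
             ("f5", 3.9, False), ("f6", 4.5, False), ("f7", 5.2, False), ("f8", 6.8, False)]

def predict(entropy: float) -> bool:
    """Stand-in for a trained model: flags high-entropy files."""
    return entropy > 6.5

def evaluate(samples):
    """Counts correct and incorrect verdicts and derives the usual rates."""
    tp = fp = tn = fn = 0
    for _, entropy, is_malicious in samples:
        flagged = predict(entropy)
        if flagged and is_malicious:
            tp += 1          # malicious file correctly flagged
        elif flagged:
            fp += 1          # clean file wrongly flagged (false positive)
        elif is_malicious:
            fn += 1          # malicious file missed
        else:
            tn += 1          # clean file correctly allowed
    return {
        "detection_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / len(samples),
    }

# Evaluation samples must not also appear in the training set.
overlap = {s[0] for s in TRAINING_SET} & {s[0] for s in FRESH_SET}

print(evaluate(TRAINING_SET))
print(evaluate(FRESH_SET))
```

In this example the stand-in model is perfect on its training set but noticeably worse on the fresh set, which is why the second number is the one to watch before deployment.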
4) Features and Feature Extraction Techniques
Feature extraction techniques and the number of extracted features per file are also vital in building effective models. Unpacking routines, pre-execution emulation, and packer reputation are only a few of the feature extraction techniques that can help extract thousands of features per file, used to build multidimensional matrices.
These matrices represent the distance between each point or feature and are then used to compute the actual model (a mathematical function) that describes the set of features determining whether a file is clean or infected.
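As a rough illustration of features and distances, the sketch below extracts three simple static features per file and builds a pairwise distance matrix. These features are assumptions chosen for brevity; they are far simpler than unpacking or emulation, and a production system would extract many more.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

def extract_features(data: bytes) -> list:
    """Turns raw bytes into a fixed-length numeric feature vector."""
    n = len(data) or 1
    printable = sum(1 for b in data if 32 <= b < 127)
    nulls = data.count(0)
    return [byte_entropy(data), printable / n, nulls / n]

def distance_matrix(vectors):
    """Euclidean distance between every pair of feature vectors."""
    return [[math.dist(a, b) for b in vectors] for a in vectors]

# Made-up samples: two high-entropy blobs and one plain-text file.
samples = {
    "packed_a": bytes(range(256)) * 4,
    "packed_b": bytes(range(256)) * 4 + b"A" * 64,
    "plain_text": b"Quarterly report: revenue grew in all regions. " * 20,
}
vectors = [extract_features(data) for data in samples.values()]
matrix = distance_matrix(vectors)

for name, row in zip(samples, matrix):
    print(name, [round(d, 3) for d in row])
```

The two high-entropy samples end up close together and far from the text file. A training algorithm uses that kind of geometry to fit the function that separates clean from infected.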
5) Tunable Machine Learning and Advanced Threats
Balancing performance and detection rate is vital when building security solutions, as false positives – the incorrect tagging of a clean file as infected – can cause downtime, affecting the overall business. Building machine learning algorithms that correctly tag unknown threats is a significant challenge.
However, some organizations face unique threats, as cybercriminals often try to exploit vulnerabilities found only within their infrastructure. Tunable machine learning allows IT administrators to set how aggressive or permissive machine learning detection needs to be, which is an enormous benefit in the fight against cyber threats. Some organizations prefer dealing with false positives to suffering a data breach. In this case, tunable machine learning allows for a more “paranoid” approach, or a more aggressive setting, to detect advanced and sophisticated threats.
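In practice, tuning often comes down to where the cut-off on a model's score is placed. The sketch below assumes a model that outputs a score between 0 and 1; the mode names, threshold values and scored files are hypothetical.

```python
# Hypothetical sensitivity modes: lower threshold = more aggressive detection.
THRESHOLDS = {"permissive": 0.9, "normal": 0.7, "aggressive": 0.5, "paranoid": 0.3}

def verdict(score: float, mode: str = "normal") -> str:
    """Blocks a file when its model score reaches the mode's threshold."""
    return "block" if score >= THRESHOLDS[mode] else "allow"

# Made-up model scores: (file name, score, actually malicious?).
SCORED = [
    ("invoice.pdf", 0.05, False),
    ("admin_script.ps1", 0.35, False),
    ("updater.exe", 0.55, False),
    ("dropper.js", 0.60, True),
    ("loader.dll", 0.80, True),
    ("trojan.exe", 0.95, True),
]

def tradeoff(mode: str) -> dict:
    """Counts wrongly blocked clean files and wrongly allowed threats."""
    false_positives = sum(
        1 for _, s, bad in SCORED if verdict(s, mode) == "block" and not bad)
    missed_threats = sum(
        1 for _, s, bad in SCORED if verdict(s, mode) == "allow" and bad)
    return {"false_positives": false_positives, "missed_threats": missed_threats}

for mode in THRESHOLDS:
    print(mode, tradeoff(mode))
```

Moving from permissive to paranoid, missed threats drop to zero while false positives rise, which is the trade-off an administrator is choosing when adjusting the setting.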
Fileless attacks that rely on scripts – PowerShell, Visual Basic, etc. – to execute malware or exfiltrate data are particularly difficult to detect with traditional security tools. Scripts are usually considered benign, as they’re often used by IT admins to automate various tasks, but threat actors can also use them to issue commands to compromise the victim’s endpoint.
Tunable machine learning algorithms allow more aggressive monitoring of not just potentially malicious files, but also of commands being executed by scripts. For instance, setting machine learning algorithms to “paranoid mode” may reveal that seemingly legitimate applications can have abnormal behavior that would otherwise fly below the radar of traditional security tools. Consequently, tunable machine learning can increase potential threat visibility and report suspicious objects, helping to prevent data breaches.
Integrating Machine Learning with Other Security Layers
Layered security is all about covering potential attack vectors that could compromise an endpoint or infrastructure. Each security layer is designed to defend against a specific type of threat, ranging from file-based or fileless malware to spam and online fraud.
However, the proliferation and increased sophistication of threats that rely on encryption, obfuscation and polymorphism have rendered traditional detection methods ineffective in dealing with the large numbers of threats that need to be detected. Tying machine learning algorithms to each security layer enables security solutions to boost not just efficiency, but also efficacy in detection.
While machine learning is great at spotting new and unknown threats, older threats may get by as some models are only trained with new malware instead of decade-old threats. This means that augmenting existing traditional security layers with machine learning makes them capable of fending off both known and unknown malware.
For example, if traditional security technologies have been optimized to detect and block known malware – in a sense, being very reactive to threats – machine learning adds the proactive component that specifically focuses on unknown threats. Since no single machine learning algorithm can defend against all types of threats, the use of highly specialized algorithms – each designed to detect a specific type or family of malware – combined with behavioral heuristics and even signature databases for known malware samples can significantly improve performance. This enables machine learning algorithms to focus only on new, unknown, and sophisticated pieces of malware that could go unnoticed by traditional security tools.
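The division of labor between layers can be sketched as a short pipeline in which cheap checks for known malware run first and machine learning handles what is left. The layer names, the sample format, the 0.7 cut-off and the behavior label below are all assumptions made for the example.

```python
import hashlib
from typing import Optional

# Signature database for known malware (a single made-up entry).
KNOWN_BAD = {hashlib.sha256(b"old worm").hexdigest()}

def signature_layer(sample: dict) -> bool:
    """Reactive layer: exact match against known malicious hashes."""
    return hashlib.sha256(sample["bytes"]).hexdigest() in KNOWN_BAD

def ml_layer(sample: dict) -> bool:
    """Proactive layer: hypothetical pre-computed model score for the file."""
    return sample.get("ml_score", 0.0) >= 0.7

def behavior_layer(sample: dict) -> bool:
    """Behavioral heuristic: a single illustrative rule."""
    return "disables_backups" in sample.get("behaviors", [])

# Layers run in order; known threats never reach the machine learning layer.
LAYERS = [
    ("signature", signature_layer),
    ("machine_learning", ml_layer),
    ("behavioral", behavior_layer),
]

def scan(sample: dict) -> Optional[str]:
    """Returns the name of the first layer that flags the sample, else None."""
    for name, layer in LAYERS:
        if layer(sample):
            return name
    return None

old_threat = {"bytes": b"old worm", "ml_score": 0.2}
new_threat = {"bytes": b"never seen before", "ml_score": 0.9}
evasive = {"bytes": b"quiet loader", "ml_score": 0.4, "behaviors": ["disables_backups"]}
clean = {"bytes": b"spreadsheet", "ml_score": 0.1}

for sample in (old_threat, new_threat, evasive, clean):
    print(scan(sample))
```

Note that the decade-old sample scores low with the model yet is still caught by its signature, while the new sample has no signature and is caught by the model.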
About Liviu Arsene
Liviu Arsene is a senior e-threat analyst for Bitdefender, with a strong background in security and technology. Reporting on global trends and developments in computer security, he writes about malware outbreaks and security incidents while coordinating with technical and research departments.