Machine learning is not a brand-new topic; it can be dated back to the early 1940s. It appears new because the field has gone through several waves of boom and recession. Many machine learning techniques and terms can also be found in statistics and mathematics.
For almost all machine learning tasks, we must transform input that a human can easily understand into some form of mathematical representation. This process is generally called feature extraction. In classic machine learning, it requires considerable expert knowledge to fine-tune the features. Deep learning models, on the other hand, simplify this process by accepting more general features as input and creating more abstract features themselves. However, they require much more intensive computation than classic methods. In the past, the domain-specific knowledge required for a target task often limited the popularity of classic machine learning methods. Hardware resources were also limited compared to today’s standards. More importantly, it was much harder to collect and share data for the target tasks.
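The contrast between expert-tuned and general features can be illustrated with a minimal sketch. It assumes two made-up C snippets and a hypothetical list of "risky" API calls (`RISKY_CALLS`); none of these names come from a specific tool or dataset. The first function mimics classic feature extraction, where an expert decides what to count. The second builds a plain bag-of-words vector over every token, the kind of more general input that a model could learn higher-level features from.

```python
import re
from collections import Counter

# Hypothetical code snippets, used only for illustration.
SNIPPETS = [
    "strcpy(buf, input);",
    "strncpy(buf, input, sizeof(buf) - 1);",
]

# Assumption: an expert-chosen list of API calls considered risky.
RISKY_CALLS = ["strcpy", "gets", "sprintf"]


def tokenize(code):
    """Split source text into identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", code)


def handcrafted_features(code):
    """Classic approach: count only the expert-chosen API calls."""
    counts = Counter(tokenize(code))
    return [counts[name] for name in RISKY_CALLS]


def build_vocabulary(snippets):
    """General approach: assign an index to every token in the corpus."""
    vocab = sorted({tok for s in snippets for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(vocab)}


def bag_of_words(code, vocab):
    """Count every known token, without deciding which ones matter."""
    vec = [0] * len(vocab)
    for tok in tokenize(code):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


if __name__ == "__main__":
    vocab = build_vocabulary(SNIPPETS)
    for snippet in SNIPPETS:
        print(handcrafted_features(snippet), bag_of_words(snippet, vocab))
```

The hand-crafted vector is short and easy to interpret, but it only sees what the expert anticipated. The bag-of-words vector grows with the vocabulary and leaves it to the model to decide which tokens are informative.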
The advancement of deep learning has created another era of machine learning. With the exponential increase in computation power, especially from GPUs, engineers can now build deep learning models that no one could have imagined in the past. In addition, the universal use of personal computers and the internet has created a tremendous surplus of information, supplying the data needed by machine learning engineers.
Deep learning techniques have been successfully applied in many fields. To list a few, there are mature industrial applications for machine translation, image recognition, and medical image diagnosis. In 2016, the Go match between AlphaGo and Lee Sedol drew a large amount of attention to the field of machine learning, and that attention is only going to grow.
Despite the huge success in other fields, the application of machine learning in the cybersecurity landscape has just begun. We can find quite a bit of research focused on misuse and anomaly detection in network traffic analysis. According to Buczak and Guven, the typical machine learning techniques used as of 2016 are artificial neural networks (ANN), propositional logic theory, Bayesian network theory, clustering, decision trees, ensemble learning algorithms, evolutionary computation, hidden Markov models, inductive learning, naive Bayes, sequence pattern mining, and support vector machines (SVM). It can be observed that most techniques used in the academic field are still classic machine learning techniques. The shortage of available labeled data is also a growing problem. Static code analysis is in a similar position: the majority of the techniques used belong to the traditional machine learning category.
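The difference between misuse detection and anomaly detection mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a method taken from the cited survey: the signature list `KNOWN_BAD_PATTERNS`, the requests-per-minute baseline, and the threshold of three standard deviations are all hypothetical. Misuse detection looks for known attack patterns, while anomaly detection flags whatever deviates strongly from normal behaviour.

```python
from statistics import mean, stdev

# Assumption: a tiny, made-up list of known attack signatures.
KNOWN_BAD_PATTERNS = ["/etc/passwd", "' OR 1=1"]


def misuse_detect(payload):
    """Misuse detection: flag payloads that match a known signature."""
    return any(pattern in payload for pattern in KNOWN_BAD_PATTERNS)


def fit_baseline(values):
    """Summarise normal traffic by its mean and sample standard deviation."""
    return mean(values), stdev(values)


def anomaly_detect(value, baseline, threshold=3.0):
    """Anomaly detection: flag values far from the learned baseline."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


if __name__ == "__main__":
    # Hypothetical requests-per-minute counts observed during normal operation.
    baseline = fit_baseline([50, 52, 48, 51, 49])
    print(misuse_detect("GET /../../etc/passwd"))  # matches a signature
    print(anomaly_detect(500, baseline))           # far above the baseline
    print(anomaly_detect(51, baseline))            # within normal variation
```

Signature matching cannot catch an attack it has never seen, whereas the baseline approach can, at the cost of false alarms whenever legitimate traffic changes.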
In the past three years, there have been more attempts to construct deep learning models to tackle these security problems. Deep learning techniques used in natural language processing are also starting to be applied to static code analysis to construct meaningful code representations. More recent research can even identify potentially vulnerable parts of the code by using an attention mechanism. Other interesting attempts include applying image processing techniques to assembly code screenshots to quickly find malicious code. In addition, there is a growing number of datasets. Take the Juliet dataset, for example: it contained 45,309 C++ test cases when it was first published in 2010 and had grown to 64,099 test cases by its latest release in 2017.
In summary, machine learning in cybersecurity is still a relatively new research field. The focus is shifting from classic machine learning techniques to deep learning techniques. Compared to three years ago, machine learning predictions are becoming more precise and accurate, as a direct result of the vast amount of publicly accessible data.
Mukkamala, S., Sung, A., & Abraham, A. (2005). Cyber security challenges: Designing efficient intrusion detection systems and antivirus tools. In V. Rao Vemuri (Ed.), Enhancing Computer Security with Smart Technology (pp. 125-163). Auerbach, 2006.
Buczak, A. L., & Guven, E. (2016). A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2), 1153-1176.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
Vapnik, V. (2013). The nature of statistical learning theory. Springer Science & Business Media.
Chandra, S., Dhoolia, P., Gowri III, M., Gupta, M., Shyamasundar, R. K., & Sinha, S. (2014). U.S. Patent No. 8,806,441. Washington, DC: U.S. Patent and Trademark Office.
Malhotra, R. (2015). A systematic review of machine learning techniques for software fault prediction. Applied Soft Computing, 27, 504-518.
White, M., Tufano, M., Vendome, C., & Poshyvanyk, D. (2016, August). Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (pp. 87-98). ACM.
Russell, R. L., Kim, L., Hamilton, L. H., Lazovich, T., Harer, J. A., Ozdemir, O., ... & McConley, M. W. (2018). Automated vulnerability detection in source code using deep representation learning. arXiv preprint arXiv:1807.04320.
Kumar, R., Xiaosong, Z., Khan, R. U., Ahad, I., & Kumar, J. (2018, March). Malicious code detection based on image processing using deep learning. In Proceedings of the 2018 International Conference on Computing and Artificial Intelligence (pp. 81-85). ACM.
NIST (2017). Juliet Documents. URL: https://samate.nist.gov/SRD/around.php#juliet_documents