PhD position: Reusable and Adaptable Machine Learning for Network Security

Keywords: machine learning, feature embedding, graph neural network, reinforcement learning

Context

Cybersecurity has become a major concern with the growth of connected devices that go well beyond traditional computers. To address this, decades of research and development have produced new techniques and tools to fight attacks on the Internet. Nonetheless, the number and magnitude of attacks keep growing. The attack surface keeps expanding with the number of connected devices, but also with the number of applications, services and software components that make today's IT ecosystem very different from its origins.

Techniques used by both attackers and defenders are evolving toward increasingly complex mechanisms [1]. This has led to the massive use of encryption to avoid data leaks, while attackers simultaneously benefit from encryption to hide their own activities. As a result, intrusion detection methods relying on artificial intelligence have been investigated both in research and in industry [2].

During the last twenty years, there has been an increasing adoption of advanced analytics techniques, especially machine learning, in all areas of networking [3]. Many proposals are being developed to achieve a higher level of automation, including data-driven networks [4], knowledge-defined networks [5] and, more recently, self-driving networks [6]. The key objective of all these techniques is to extract relevant information from observations in order to reach different goals, such as enhancing performance or end-user experience, lowering the carbon footprint or, in the context of this thesis, improving network security.

Scientific challenges

As in other domains leveraging Machine Learning (ML), each proposed ML-based solution for network operations requires selecting, configuring or extending an ML technique for a particular scenario. The major problems concern the definition of features, metrics and ML algorithms. The reusability or adaptation of existing results is limited, and context-specific data interpretation or integration in an analytics framework is required. Some metrics have been proposed for port numbers and IP addresses [7], but they are too coarse-grained and remain far from satisfactory. The same applies to the mathematical elements manipulated in the algorithms, such as kernel functions or neural networks, which have not been specifically designed for networking [8].

A major research challenge is the definition of network-based features that are meaningful and reusable in a variety of scenarios (with a focus on network security) and that can be integrated in different ML algorithms. For instance, applying ML algorithms to network data requires the definition of new metrics capable of capturing the properties of network configurations, packets, flows, etc. A key challenge is therefore to represent them in a meaningful space in which semantic operations such as distance, similarity or comparison can be applied.
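To make the notion of a semantic distance concrete, the following minimal sketch compares two IPv4 addresses by the length of their common bit prefix rather than by their numerical difference. It is only an illustration of the kind of metric discussed above, not the metric proposed in [7] nor the one the thesis will define; the function names are hypothetical.

```python
import ipaddress


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits shared by two IPv4 addresses (0 to 32)."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # bit_length() gives the position of the first differing bit from the right
    return 32 - diff.bit_length()


def ip_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two addresses are identical."""
    return common_prefix_len(a, b) / 32


def ip_distance(a: str, b: str) -> float:
    """Illustrative distance derived from the prefix-based similarity."""
    return 1.0 - ip_similarity(a, b)


if __name__ == "__main__":
    # Two hosts in the same /24 are closer than hosts in unrelated networks.
    print(ip_distance("192.168.1.10", "192.168.1.20"))
    print(ip_distance("192.168.1.10", "10.0.0.1"))
```

Such a measure reflects the hierarchical structure of the address space, which a plain numerical difference ignores; it remains coarse-grained, which is precisely the limitation the thesis aims to address.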

It is also important to evaluate the impact and contribution of the collected attributes with respect to the final goal (e.g. detecting attacks). A second challenge is thus to select the right attributes according to given criteria. An obvious major criterion is the contribution of a feature to the accuracy of the learnt model, but others must be taken into account, such as the overhead or cost of collecting and transforming the necessary data, or the privacy impact.

Objectives of the thesis

The first objective of the thesis is to define new representations of network data as features for ML algorithms. An in-depth study of usable raw data is necessary to identify their different natures (numerical, categorical, discrete, etc.). Capturing these characteristics is required to define usable features, metrics or distances over these data. From raw data to usable data, several transformations might be necessary, and different approaches will be considered. First, to be easily integrated in common ML algorithms, embedding techniques can be defined to represent various types of network elements (flows, packets, topologies, forwarding tables, etc.) as fixed-size vectors. These vectors must capture the intrinsic properties of the data they represent, for example structural properties of topologies or functional properties of forwarding tables. Second, graph neural networks have been leveraged to model the dependencies between a network topology, routing and traffic [9]. We also expect to explore this direction by using graphs to represent other types of data, such as flow or packet dependencies. In addition, temporal graph neural networks can be leveraged to capture temporal features. The PhD candidate will evaluate the relevance of the different features by using them in conjunction with different ML algorithms.
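As a simple illustration of what representing a network element as a fixed-size vector means, the sketch below maps a flow with an arbitrary number of packets to a vector of constant length by combining categorical and numerical attributes. It is a hand-crafted baseline under stated assumptions, not a learnt embedding: the protocol vocabulary, the chosen statistics and the function name are all hypothetical.

```python
import math
from statistics import mean, pstdev

# Assumed protocol vocabulary for the one-hot encoding (hypothetical choice).
PROTOCOLS = ("tcp", "udp", "icmp")


def embed_flow(protocol: str, dst_port: int, packet_sizes: list) -> list:
    """Map a flow to a 9-dimensional vector, whatever its number of packets."""
    # Categorical attribute: one-hot encoding of the transport protocol.
    one_hot = [1.0 if protocol == p else 0.0 for p in PROTOCOLS]
    # Port numbers are identifiers, not quantities: encode the IANA range
    # (well-known, registered, dynamic) instead of the raw value.
    port_class = [
        float(dst_port < 1024),
        float(1024 <= dst_port < 49152),
        float(dst_port >= 49152),
    ]
    # Numerical attributes: summary statistics of the packet sizes.
    if packet_sizes:
        stats = [math.log1p(len(packet_sizes)), mean(packet_sizes), pstdev(packet_sizes)]
    else:
        stats = [0.0, 0.0, 0.0]
    return one_hot + port_class + stats


if __name__ == "__main__":
    short_flow = embed_flow("udp", 53, [80, 120])
    long_flow = embed_flow("tcp", 443, [60, 1500, 1500, 1500, 52])
    print(len(short_flow), len(long_flow))  # both vectors have the same size
```

Hand-crafted encodings of this kind lose most structural information, which is why learnt embeddings and graph-based representations are considered in the thesis.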

The second objective of the thesis is to define a method to automatically select the right set of features from those defined in the first objective. The data to be collected accordingly also need to be inferred. Given as input some constraints regarding the targeted goal (for example a minimal accuracy and/or a maximum amount of data to be collected), the method would select the best features and the minimal set of data, in order to avoid gathering too much data while reaching a high level of accuracy. In the context of network security, the goal will be to identify and thus mitigate attacks promptly. (Deep) Reinforcement Learning will be considered as a first direction in order to continuously adapt the feature sets in an evolving environment. Generative models will also be investigated to discard and modify data, or even insert or build synthetic information [10], in order to keep the accuracy at the targeted level while lowering the privacy impact.
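To illustrate the trade-off between accuracy and collection cost, the sketch below selects features greedily under a data-collection budget. It is deliberately a static heuristic and not the reinforcement learning approach envisioned for the thesis; it assumes that each feature has a known, positive cost and an independent, additive estimated accuracy gain, which is a strong simplification. All feature names and values are hypothetical.

```python
def select_features(candidates: dict, budget: float):
    """Greedy selection by estimated gain per unit of collection cost.

    candidates maps a feature name to (estimated_gain, collection_cost),
    with collection_cost > 0. Returns the chosen names and the cost spent.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: item[1][0] / item[1][1],  # gain / cost ratio
        reverse=True,
    )
    chosen, spent = [], 0.0
    for name, (gain, cost) in ranked:
        # Keep the feature only if it still fits in the remaining budget.
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent


if __name__ == "__main__":
    # Hypothetical (gain, cost) estimates, for illustration only.
    candidates = {
        "pkt_size_stats": (0.10, 1.0),
        "payload_bytes": (0.12, 8.0),
        "port_class": (0.05, 0.25),
        "tls_fingerprint": (0.08, 2.0),
    }
    print(select_features(candidates, budget=4.0))
```

In an evolving environment, the gain estimates change over time and interact with each other, which is what motivates an adaptive method rather than a one-shot ranking of this kind.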

References

[1] I. Friedberg, F. Skopik, G. Settanni, and R. Fiedler. Combating advanced persistent threats: From network event correlation to incident detection. Computers & Security, 48:35–57, 2015.

[2] A. L. Buczak and E. Guven. A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials, 18(2):1153–1176, Second quarter 2016. doi: 10.1109/COMST.2015.2494502.

[3] R. Boutaba, M. A. Salahuddin, N. Limam, S. Ayoubi, N. Shahriar, F. Estrada-Solano, and O. M. Caicedo. A comprehensive survey on machine learning for networking: evolution, applications and research opportunities. Journal of Internet Services and Applications, 9(1):16, Jun 2018.

[4] J. Jiang, V. Sekar, I. Stoica, and H. Zhang. Unleashing the potential of data-driven networking. In COMSNETS (Revised Selected Papers and Invited Papers), volume 10340 of Lecture Notes in Computer Science, pages 110–126. Springer, 2017.

[5] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón, M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett, G. Estrada, K. Ma’ruf, F. Coras, V. Ermagan, H. Latapie, C. Cassar, J. Evans, F. Maino, J. Walrand, and A. Cabellos. Knowledge-defined networking. SIGCOMM Comput. Commun. Rev., 47(3):2–10, Sept. 2017.

[6] N. Feamster and J. Rexford. Why (and how) networks should run themselves. CoRR, abs/1710.11583, 2017.

[7] S. E. Coull, F. Monrose, and M. Bailey. On measuring the similarity of network hosts: Pitfalls, new metrics, and empirical analyses. In Network and Distributed System Security Symposium, 2011.

[8] M. Lopez-Martin, B. Carro, A. Sanchez-Esguevillas, and J. Lloret. Network traffic classifier with convolutional and recurrent neural networks for internet of things. IEEE Access, 5, 2017.

[9] K. Rusek, J. Suárez-Varela, A. Mestres, P. Barlet-Ros, and A. Cabellos-Aparicio. Unveiling the potential of Graph Neural Networks for network modeling and optimization in SDN. In Proceedings of the 2019 ACM Symposium on SDN Research (SOSR ’19), 2019.

[10] N. Patki, R. Wedge, and K. Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399–410, Oct. 2016.

Required qualifications
• Required qualification: Master's degree in computer science
• Required knowledge: networking, programming (Python, Java or others)
• Knowledge and skills in the following fields will be appreciated: machine learning, artificial intelligence, big data, Linux (command line use, shells)

Team

The PhD position is proposed by the RESIST team of the Inria Nancy Grand Est research lab. Inria is the French national public institute dedicated to research in digital science and technology. The team is one of the European research groups in network management and is particularly focused on empowering the scalability and security of networked systems through a strong coupling between monitoring, analytics and network orchestration. https://team.inria.fr/resist/

Contact

• Prof. Olivier Festor, University of Lorraine (olivier.festor@loria.fr)
• Dr. Jérôme François (co-supervisor), Inria (jerome.francois@inria.fr)

Application deadline: May 31, 2021 (midnight, Paris time)

How to apply

Upload your file and also send it by email to jerome.francois@inria.fr and olivier.festor@loria.fr. Your file should contain the following documents:
• Your CV.
• A cover/motivation letter describing your interest in this topic.
• A short (max one page) description of your Master thesis (or equivalent) or of the work in progress if not yet completed.
• Your degree certificates and transcripts for Bachelor and Master (or the last 5 years).
• Your Master thesis (or equivalent) if it is already completed, and your publications if any (you are not expected to have any). Web links to these documents are preferred, if possible.
In addition, one recommendation letter from the person who supervises or supervised your Master thesis (or research project or internship) should be sent directly by its author to jerome.francois@inria.fr and olivier.festor@loria.fr.

Applications are to be sent as soon as possible.

 
