PROJECTS AT VISL FINISHED IN 2006


String Matching using Neural Network Architecture

by Bondarchuk Sergey
Supervised by Raichelgaus Igal & Odinaev Karina

Abstract

Neural microcircuits are basic components of human brain and biological mechanisms, responsible for reflexes. Although, the speed of signal processing in modern electronic devices is faster by millions of times than in biological organism, there is no doubt, that human brain remains being the most powerful computer. That derives from advanced and very effective parallel processing techniques, applied in biological organisms, which developed during millions of years of evolution.
In science and engineering the Neural Networks models have proved their recognition and parallel processing power in many fields. It is widely used for recognition and even for prediction, for example in prediction the behavior of stock market. In this project another application of recurrent neural networks architecture - real time String Matching - was researched, and this approach has definitely several advantages, respectively to existing solutions of this problem.

The problem overview:

String matching is a comprehensive and effective applicable key technology beyond Intrusion Detection Systems (IDS). These applications are usually deployed at choke points of a network where there is heavily traffic. Many areas can benefit from a faster string matching algorithm, which can be used in IDS, firewalls, and detecting long sequences from fast streaming data. It is necessary to develop a faster string matching algorithm, which would not be a bottleneck in the network.

Solution:

  • The most effective real time String Matching algorithms are based on Final State Machine (FSM) approach or its modifications like dividing the states of FSM into several substates.
  • The architecture of recurrent Neural Network (NN) is also called the Liquid State Machine (LSM). In fact, each signal propagates through the NN by pathways, while the intensity of such pathway decays exponentially with time, but on the other hand the number of pathways is also grows exponentially with time. It is very similar to dispersion process. I define the state of NN at moment t as a vector V with length, equal to number of neurons in the NN, where V(n)=1 if neuron #n had a spike in short time interval, centered at t and V(n)=-1 else. Considering that such dynamic is stable and applying this state definition of the NN, one can assume that number of different states in such architecture is "almost infinite" because in a small NN, consisted of 256 neurons there are 2^256 different states. This fact justifies partly the name of NN as a Liquid State Machine.
  • The task is to detect selected strings (100 - 1500 bits) from real-time streaming data without a-priory knowledge of the selected string beginning. It should be noted, that detecting the selected string at its beginning would increase significantly the False-Alarm error, because a short part of a "good string" can be also similar to the beginning of an intrusion string. Based on the above, the task requires detecting a unique state of the NN, to which it led at the end of selected string (network intrusion) input.
  • Each state of a NN is a result of recent inputs and so of old-established inputs, propagating by the pathways. This influence of preceding temporal layers represents the Neural Networks' dynamic memory. That means that the NN state at the end of intrusion input containes the information about all intrusion string, but at the same time it containes unnessesary information of inputs befour the beginning of the intrusion input. This problem is a trade-off between two types of errors: False-Alarm and Mis-Hit. The simplest and intuitive solution of this dilemma is creating NN with dynamic memory, suitable to the average length of intrusions strings. Indeed, the length of the dynamic memory of NN is controllable by the following parameters: connectivity between neurons; average value of synaptic weigths.
  • The illustration below is a scheme of LSM:


    In case of String Matching application of LSM, the input u(t) is continuously (in real time) streaming binary data from network and each readout function is aimed to detect in real time the corresponding network intrusion. For example if this LSM is purposed to detect 1000 network intrusions (viruses warms and ets.), then we need 1000 readout functions.
    It is important that the NN (called microcircuit at the figure) has not being changed once it created at the beginning of the process, because training considers finding the readout functions f(X(t)). However, creating suitable NN requires the knowledge of it purpose and influence of NN parameters on its properties.
  • Finally, the algorithm, developed in this project shows how to define:
    1) The input connections u(t) ;
    2) The NN state x(t) ;
    3) The readout functions f(x(t))

    Results:

  • As examples of intrusion string (worms and viruses signatures)
    were used strings from Clam Antivirus Database and the detection algorithm was applied to them. The system was trained and tested in order to detect 1000 longest (1300-1500 bits) intrusion strings from the database and a large set of random strings. The result can be presented by two types of errors:
    - Miss-Hit per NN: 0.001 - 0.8
    - False-Alarm per NN: negligible
  • One of the commonly used methods to bring down an error in such systems is to use several different systems (NN's), which perform the same task in parallel. In this case one has to define the decision rule. I used the "democratic" decision rule (majority rule). For example if number of NN is 15 and an average error of a single NN is err, then the total error of the system is given by:



    In other words, the probability of full system error is the probability that 8 NN's has wrong decision at the same time, multiplied by all permutations, plus 9 NN's are wrong and ets...
    One can assume that the first element of the sum has the highest and most significant value, while others are negligible. Based on the above, the calculation of full system error becames very simple. This approach gives the possibility to reduce the total error to as low as required values. In other words :
    - Both types of error of the combined system are negligible.

  • The following Graphical User Interface was built during the project in order to examine the performance of the researched design. It allows performing three processes: Creating; Training; Testing. At the illustration below there is a test of the system, trained for 120 long (1300 - 1500 bits) strings and built from 15 parallel NN's.



  • As it described at the GUI itself, the definition of the input string is simply by defining a vector of integers. Non-zero elements are indexes of selected string, it trained for, and zeros mean a random string (400 bits). This vector can be written in Matlab language as it appear in the example, i.e. zeros(1,10) is a vector of 10 zeros (4000 bits random string). After pressing test button, all given strings are combined into a single string and passed through the detection system.
  • Notes:
    - Starting the simulation with a random string and inserting random strings between intrusions' strings is not necessary, but it immitates a realistic situation in the network.
    - The best way to examine the system for False-Alarm error is to test it on a long random string (in instance zeros(1,1000) = 0.4 Mb).


    Hardware model concept:

    The illustration below shows how to integrate a String Matching algorithm based Neural Networks into IDS hardware:


    The system receives a binary input from the net, 8 bits each clock cycle and branches it into two directions. The first is only delayed by the system's latency. The second passes through the detection system (Neural Networks plus the Readout). The output of the detection system is a binary decision block/enable. In addition, it is possible to give an information about what exactly intrusion has been detected. In case an intrusion has been detected, the system blocks the input and reports to the corresponding software, which performs an appropriative action. Based on several researches, the time characteristics of biological neuron and synapse can be scaled by factor of over 0.5 - 1 millions by implementing it on similar hardware model. That means, that hardware implementation of the presented String Matching machine can withstand a throughput of over 16 Gbps if it receives the input of 8 bits per clock cycle and 32 Gbps with 16 bits per clock cycle.

    Tools

  • MATLAB
    MATLAB was used as an environment for executing and analyzing experiments data.
  • CSIM
    CSIM framework (http://www.lsm.tugraz.at/) have been used for simulating neural networks in experiments.


    Related documentations

    For futher information please contact Bondarchuk Sergey: bondarchuks@gmail.com

    Acknowledgment

  • I am grateful to my supervisors Karina Odinaev and Igal Raichelgauz for their patients, and interesting and useful discussions.
  • I am grateful to the Lab engineer Johanan Erez for his help and support.
  • I am grateful to the Ollendorf Minerva Center for their support of this projectand the Vision and Image Sciences Laboratory.


    FULL DOCUMENTATION