How a DLP system works - SearchInform

A DLP system is used to protect confidential data from internal threats. Information security specialists have largely mastered the tools for protecting against external intruders, but protection against internal ones is less mature.

The use of a DLP system in the information security structure assumes that the information security specialist understands:

  • how company employees can leak confidential data;
  • what information must be protected against a breach of confidentiality.

This knowledge helps the specialist understand the principles of DLP technology and configure leak protection correctly.

The DLP system must be able to distinguish confidential information from non-confidential information. Analyzing all the data within the organization's information system would put an excessive load on IT resources and personnel. DLP works mainly in conjunction with a responsible specialist who not only "teaches" the system to work correctly by introducing new rules and removing irrelevant ones, but also monitors current, blocked, or suspicious events in the information system.


SearchInform DLP is configured through security policies, which are rules for responding to information security incidents. The system has 250 predefined policies that can be adjusted to suit the company's objectives.


The functionality of a DLP system is built around a "core" - a software algorithm that is responsible for detecting and categorizing information that needs to be protected from leaks. At the core of most DLP solutions are two technologies: linguistic analysis and analysis based on statistical methods. The core can also use less common techniques such as labeling or formal analysis methods.

Developers of leak prevention systems complement their unique software algorithm with system agents, incident management mechanisms, parsers, protocol analyzers, interceptors and other tools.

Early DLP systems relied on one method at the core: either linguistic or statistical analysis. In practice, the shortcomings of each technology are offset by the strengths of the other, and the evolution of DLP has led to systems whose "core" is universal.

The linguistic method of analysis works directly with the content of a file or document. This allows you to ignore such parameters as the file name, the presence or absence of a stamp in the document, who created the document and when. Linguistic analysis technology includes:

  • morphological analysis - search for all possible word forms of information that needs to be protected from leakage;
  • semantic analysis - search for occurrences of important (key) information in the contents of the file, the impact of occurrences on the quality characteristics of the file, assessment of the context of use.

Linguistic analysis performs well on large amounts of information. For voluminous text, a DLP system with a linguistic analysis algorithm will more accurately select the correct class, assign the text to the desired category, and launch the configured rule. For small documents, it is better to use the stop-word technique, which has proven effective in the fight against spam.
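As a rough illustration, the stop-word technique can be combined with a search over word forms. The sketch below is a minimal example, not how any particular DLP product works: the `WORD_FORMS` table, the `STOP_WORDS` set, and the function name are hypothetical, and a real system would rely on a morphological dictionary rather than a hand-written table.

```python
import re

# Hypothetical lemma table: every word form maps to its base form.
# A real system would use a morphological dictionary or a lemmatizer instead.
WORD_FORMS = {
    "contract": "contract", "contracts": "contract", "contracted": "contract",
    "salary": "salary", "salaries": "salary",
    "tender": "tender", "tenders": "tender",
}

# Hypothetical stop words (base forms) that mark a short message as suspicious.
STOP_WORDS = {"contract", "salary"}


def find_stop_words(text):
    """Return the set of stop words whose word forms occur in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Reduce every token to its base form; unknown words are kept as they are.
    lemmas = {WORD_FORMS.get(token, token) for token in tokens}
    return lemmas & STOP_WORDS


if __name__ == "__main__":
    message = "Please send me the Salaries sheet and both contracts."
    print(sorted(find_stop_words(message)))
```

A short message is flagged as soon as the returned set is non-empty, regardless of which form of the word the sender used.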

Learning in systems with a linguistic analysis algorithm is implemented at a high level. Early DLP complexes had difficulties with assigning categories and other stages of "learning". Modern systems, however, have well-established self-learning algorithms that identify the signs of categories and can independently form and change response rules. Setting up such data protection software in information systems no longer requires linguists.

The disadvantages of linguistic analysis include its dependence on a specific language: a DLP system with an "English" core cannot be used to analyze Russian-language information flows, and vice versa. Another drawback is that the probabilistic approach makes clear-cut categorization difficult and keeps response accuracy at around 95%, whereas for a company a leak of any amount of confidential information may be critical.

In contrast, statistical methods of analysis demonstrate accuracy close to 100 percent. The weakness of a statistical core lies in the analysis algorithm itself.

At the first stage, the document (text) is divided into fragments of an acceptable size (not character by character, but enough to ensure the accuracy of the response). A hash is computed for each fragment (DLP systems refer to it as a Digital Fingerprint). The hash is then compared with the hash of a reference fragment taken from a protected document. If there is a match, the system marks the document as confidential and acts in accordance with security policies.
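The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a vendor implementation: fragments are taken as sliding windows of `size` words, SHA-256 is used as the hash, and the function names are hypothetical.

```python
import hashlib


def fingerprints(text, size=5):
    """Hash every run of `size` consecutive words (assumed fragmenting scheme)."""
    words = text.lower().split()
    if len(words) < size:
        return set()
    return {
        hashlib.sha256(" ".join(words[i:i + size]).encode("utf-8")).hexdigest()
        for i in range(len(words) - size + 1)
    }


def is_confidential(document, reference, size=5):
    """True if any fragment hash of the document matches a reference hash."""
    return not fingerprints(document, size).isdisjoint(reference)


if __name__ == "__main__":
    # Reference fingerprints are taken once from the protected document.
    reference = fingerprints("the merger with Acme Corp closes on the first of March")
    print(is_confidential("FYI the merger with Acme Corp closes soon", reference))
    print(is_confidential("the weather is nice and the coffee is hot today", reference))
```

The `size` parameter is the setting to tune: very short fragments match ordinary phrases and raise false positives, while very long fragments miss partial quotations.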

The disadvantage of the statistical method is that the algorithm cannot learn on its own or form categories and types. As a result, it depends on the competence of a specialist, who may set a fragment size for hashing at which the analysis produces an excessive number of false positives. This flaw is easy to eliminate by following the developer's recommendations for configuring the system.

There is another drawback associated with the formation of hashes. In developed IT systems that generate large amounts of data, the database of fingerprints can reach such a size that checking traffic for matches with the reference will seriously slow down the operation of the entire information system.

The advantage of statistical analysis is that its performance does not depend on the language or on the presence of non-textual information in the document. A hash can be computed equally well from an English phrase, an image, or a video fragment.

Linguistic and statistical methods are not suitable for detecting data of a fixed format that may appear in any document, such as account numbers or passport numbers. To identify such typical structures in a body of information, technologies for analyzing formal structures are built into the core of the DLP system.
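One common way to analyze formal structures is a pattern match followed by a checksum test. The sketch below takes IBAN bank account numbers as an example; the choice of IBAN, the regular expression, and the function names are assumptions made for illustration, and the source does not say which formats or checks a given DLP core supports.

```python
import re

# Candidate IBANs: two letters, two check digits, then 11-30 alphanumerics,
# optionally written in space-separated groups. The pattern is deliberately
# simple; an upper-case word directly after the number would be absorbed.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def iban_checksum_ok(iban):
    """Apply the mod-97 check to a compact (space-free, upper-case) IBAN."""
    rearranged = iban[4:] + iban[:4]
    # int(ch, 36) maps 0-9 to 0-9 and A-Z to 10-35, as the check requires.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text):
    """Return checksum-valid IBANs found in the text, with spaces removed."""
    found = []
    for match in IBAN_CANDIDATE.finditer(text):
        compact = match.group().replace(" ", "")
        if iban_checksum_ok(compact):
            found.append(compact)
    return found


if __name__ == "__main__":
    text = "Transfer to GB82 WEST 1234 5698 7654 32, not to GB83 WEST 1234 5698 7654 32."
    print(find_ibans(text))
```

The checksum step matters because a pattern alone would flag any string that merely looks like an account number, which inflates false positives.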

A quality DLP solution uses all of these analysis tools in concert, so that they complement each other.

You can determine which technologies are present in the core from the description of a particular DLP product's capabilities.

Just as important as the functionality of the core are the control levels at which the DLP system operates. There are two of them:

  • the network level, when network traffic in the information system is monitored;
  • the host level, when information on workstations is monitored.

Developers of modern DLP products have abandoned implementing protection at only one of these levels, since both end devices and the network must be protected from leakage.

At the same time, the network level of control should cover as many network protocols and services as possible. This means not only "traditional" channels (mail protocols, FTP, HTTP traffic), but also newer network exchange systems (instant messengers, cloud storage). Unfortunately, it is impossible to control encrypted traffic at the network level, but this problem is solved in DLP systems at the host level.

Host-level control allows for more monitoring and analysis tasks. In fact, the information security service receives a tool for complete control over user actions on the workstation. DLP with a host architecture allows you to track what is copied to removable media, which documents are sent to print, and what is typed on the keyboard, as well as to record audio and take screenshots. At the level of the end workstation, encrypted traffic (for example, Skype) is intercepted, and both data being processed at the moment and data stored on the user's PC for a long time are open to inspection.

In addition to solving common tasks, DLP systems with control at the host level provide additional measures to ensure information security: control of software installation and changes, blocking of I/O ports, etc.

The disadvantages of the host implementation are that systems with an extensive set of functions are more difficult to administer and more demanding on the resources of the workstation itself. The management server regularly contacts the "agent" module on the end device to check that it is available and that its settings are up to date. In addition, some of the resources of the user workstation will inevitably be "eaten up" by the DLP module. Therefore, even at the stage of selecting a solution to prevent leakage, it is important to pay attention to the hardware requirements.

The principle of technology separation in DLP systems is a thing of the past. Modern software solutions to prevent leaks employ methods that compensate for each other's shortcomings. An integrated approach makes sensitive data within the information security perimeter more resilient to threats.

14.12.2020
