Concept of DLP Systems

DLP software is used to protect confidential data from internal threats. Even when information security specialists have adopted protection tools and use them against external violators, countering internal violators remains the more challenging task.

Using a DLP system in the information security structure implies that an IS specialist understands:

  • How employees might attempt to leak confidential data
  • What information should be protected from breaches of confidentiality

In-depth knowledge helps specialists better understand DLP technology and configure leakage protection properly.

DLP systems should be capable of distinguishing confidential information from non-confidential. Analyzing all the data in a company's information system can place an excessive workload on IT resources and staff. Most often, a DLP system works together with a responsible specialist who not only "teaches" the system to work correctly, introducing new rules and removing irrelevant ones, but also monitors current, blocked, and suspicious activity in the information system.
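The rule-driven workflow described above can be sketched in a few lines: a specialist registers and removes detection rules, the system applies them to events, and flagged events are queued for review. All class, method, and rule names here are hypothetical illustrations, not the API of any real DLP product.

```python
# Illustrative sketch of a rule-driven DLP workflow. The specialist
# adds/removes rules; inspect() holds suspicious events for review.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DlpEngine:
    rules: dict[str, Callable[[str], bool]] = field(default_factory=dict)
    review_queue: list[tuple[str, str]] = field(default_factory=list)

    def add_rule(self, name: str, predicate: Callable[[str], bool]) -> None:
        self.rules[name] = predicate

    def remove_rule(self, name: str) -> None:
        self.rules.pop(name, None)

    def inspect(self, event: str) -> bool:
        for name, predicate in self.rules.items():
            if predicate(event):
                # Suspicious: record which rule fired and hold for review.
                self.review_queue.append((name, event))
                return True
        return False

engine = DlpEngine()
engine.add_rule("keyword", lambda e: "confidential" in e.lower())
engine.inspect("Mail: Confidential merger terms attached")
print(engine.review_queue)
```

Real products expose far richer policy models (per-channel actions, severity levels, user groups), but the division of labor is the same: the system enforces rules, the specialist curates them and reviews what was flagged.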

The functionality of DLP is built around the kernel: a software algorithm responsible for detecting and categorizing information that must be protected from leaks. The kernel of most DLP solutions combines two technologies, linguistic and statistical analysis. It can also use less common techniques, for example, flags or formal analysis methods.

Around this core algorithm, developers of leakage control systems add system agents, incident management mechanisms, parsers, protocol analyzers, interceptors, and other tools.

Early DLP systems were based on a single method in the kernel: either linguistic or statistical analysis. In practice, the shortcomings of each technology were compensated by the strengths of the other, and DLP evolution led to systems with universal kernels.

Linguistic analysis deals directly with the content of files and documents. This allows it to ignore parameters such as the file name, the presence or absence of a security mark in the document, and metadata about who created the document and when. The technology of linguistic analytics includes:

  • Morphological analysis – searching for all possible word forms of the information that must be protected from leakage.
  • Semantic analysis – searching for occurrences of important (key) information in the file content, assessing the impact of those occurrences on the file's qualitative characteristics, and evaluating the context.
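The morphological step can be illustrated with a minimal sketch: reduce words to crude stems by stripping common English suffixes, then check whether any stem in the scanned text matches a stem of a protected keyword. The suffix list and keywords are illustrative only; real morphological engines use full dictionaries per language.

```python
# Minimal morphological matching sketch: a crude suffix-stripping
# stemmer catches word forms ("acquisitions") of a protected keyword
# ("acquisition") that plain substring search would handle poorly.
import re

SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "s", "ed")

def stem(word: str) -> str:
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

def contains_protected(text: str, keywords: list[str]) -> bool:
    protected = {stem(k) for k in keywords}
    tokens = re.findall(r"[a-zA-Z]+", text)
    return any(stem(t) in protected for t in tokens)

print(contains_protected("Negotiating the acquisitions budget",
                         ["acquisition"]))  # → True
```

A production system would also weigh where and how often the matches occur (the semantic step), rather than flagging on a single hit.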

Linguistic analysis produces high-quality results on large volumes of text. A DLP system with a linguistic analysis algorithm will correctly classify a voluminous text, assigning it to the necessary category by applying the configured rule. For small documents, it is better to use the stop-word technique, which has proved effective against spam.

Systems with linguistic analysis algorithms also learn well. Early DLP suites had difficulties with assigning categories and other learning stages, but modern systems have well-established self-learning algorithms: they identify the characteristics of categories and independently form and modify response rules. It is therefore no longer necessary to involve linguists in configuring such data protection systems.

The disadvantages of linguistic analysis include its binding to a specific language: a DLP system with an "English" kernel cannot analyze information flows in Russian, and vice versa. Another disadvantage is the difficulty of strict classification under a probabilistic approach, which ensures response accuracy of about 95%, while the leak of any amount of confidential information may be critical for the company.

Statistical methods of analysis, by contrast, demonstrate accuracy close to 100%. The disadvantage of the statistical kernel lies in the analysis algorithm itself.

At the first stage, the document (text) is divided into fragments of an acceptable size (not character by character, but large enough to ensure accurate responses). A hash is then taken from each fragment (in DLP systems this is known as a digital fingerprint) and compared with the hashes of reference fragments from protected documents. When matches are found, the system flags the document as confidential and acts in accordance with the security policies.
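The fingerprinting scheme described above can be sketched as follows: hash overlapping fragments of a reference document, then flag any scanned content whose fragment hashes overlap the reference set. The fragment size and step are arbitrary illustrative values; real products tune them to balance accuracy against database size and load.

```python
# Minimal digital-fingerprinting sketch: overlapping fragments are
# hashed so that even a partial copy of a protected document shares
# hashes with the reference set.
import hashlib

def fingerprints(text: str, size: int = 64, step: int = 16) -> set[str]:
    end = max(len(text) - size + 1, 1)
    return {hashlib.sha256(text[i:i + size].encode()).hexdigest()
            for i in range(0, end, step)}

def is_confidential(scanned: str, reference_db: set[str]) -> bool:
    return bool(fingerprints(scanned) & reference_db)

secret = ("Project Falcon budget: the acquisition closes in Q3 and the "
          "negotiated price is strictly confidential until announcement.")
db = fingerprints(secret)
print(is_confidential(secret[:100], db))          # partial copy matches
print(is_confidential("Lunch menu for Friday", db))  # unrelated text
```

Choosing the fragment size is exactly the trade-off the text describes: smaller fragments catch shorter excerpts but inflate the fingerprint database and the false-positive rate.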

The disadvantage of the statistical method is that the algorithm cannot independently learn, form categories, or typify. As a result, the system depends on the specialist's expertise, and there is a risk of choosing a fragment size at which the analysis produces an excessive number of false positives. This gap is easy to close by following the developer's setup recommendations.

There is another drawback related to hash generation. In developed IT systems that generate large volumes of data, the fingerprint database can grow so large that checking traffic against the reference files seriously slows down the operation of the whole information system.

The advantage of such solutions is that the effectiveness of statistical analysis depends neither on the language nor on the presence of non-textual information in a document: a hash is taken equally well from an English phrase, an image, or a video.

Neither linguistic nor statistical methods detect data of a specific format, for example account or passport numbers. To identify such typical structures in an array of information, the technology of formal structure analysis is introduced into the kernel of DLP systems.
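Formal structure analysis is essentially pattern matching: fixed-format identifiers can be caught with regular expressions regardless of the surrounding text. The patterns below (a 16-digit card-like number and a US-SSN-like code) are illustrative examples, not the rules of any real DLP product.

```python
# Minimal formal-analysis sketch: regular expressions detect
# fixed-format identifiers independent of language or context.
import re

PATTERNS = {
    "card_number": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
    "ssn_like":    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_structured_data(text: str) -> dict[str, list[str]]:
    hits = {name: rx.findall(text) for name, rx in PATTERNS.items()}
    return {name: found for name, found in hits.items() if found}

print(find_structured_data("Card 4111 1111 1111 1111, SSN 078-05-1120"))
```

Because such patterns are purely structural, they complement the other two kernel technologies: they need no dictionary and no reference document, only a known format.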

A quality DLP solution uses all of these analysis tools, which work together, complementing one another.

You can determine which technologies a given kernel uses from the vendor's description of the DLP suite's advantages.


DLP control levels are as important as the kernel functions. There are two:

  • Network level: control of network traffic in the information system
  • Host level: control of information on workstations


Developers of modern DLP products have moved away from integrating the protection levels separately, since both endpoints and the network need to be protected from leakage.

The network level of control should ensure the maximum possible coverage of network protocols and services. These include not only traditional channels (mail protocols, FTP, HTTP traffic) but also newer network exchange systems (instant messengers, cloud storage). Unfortunately, the network level cannot control encrypted traffic, but DLP systems solve this problem at the host level.

Control at the host level allows you to tackle more monitoring and analysis tasks and gives the IS department full control over users' actions on workstations. A DLP with host architecture can monitor information copied to removable media and documents sent to print, log what is typed on the keyboard, record audio, and take screenshots. Endpoint DLP systems capture encrypted traffic (for example, Skype) and monitor data that is being processed or has been stored for a long time on a user's PC.

In addition to addressing these common tasks, DLP systems with host-level control offer additional information security measures: control over software installation and changes, blocking of I/O ports, and so on.

The disadvantage of host implementation is that feature-rich systems are more difficult to administer and demand more workstation resources. The controlling server regularly polls the agent module on the endpoint to verify that its settings are present and up to date, and the DLP module itself consumes part of the workstation's resources. That is why it is important to check the hardware requirements when selecting a data leak prevention solution.

Technology separation in DLP systems is a thing of the past. Modern DLP solutions use techniques that compensate for one another's shortcomings. With this comprehensive approach, confidential data inside the information security perimeter becomes more resistant to threats.