Practice shows that enterprises often lack an automated anonymization tool. For this reason, they most often reset or delete data in database tables, provided that this does not cause problems or disturb the integrity of the system and the structure of a given process.
However, there is another problem. ERP environments are often duplicated many times over (for testing and development), and backups add further copies.
This raises many serious difficulties, including those related to anonymization, which are sometimes impossible to solve in practice.
Challenges related to the effectiveness, speed, precision, completeness, and repeatability of mass data anonymization
- Frequency. A multitude of environments and their updates
In the case of ERP, the problem with anonymization is not limited only to the size of the database or the multitude of connections between tables. An additional difficulty is the use of the extended data type fields (EDTs in Dynamics 365). Depending on the context of their use, they can have different values.
These are specific fields available to developers from anywhere in an ERP application. For example, an EDT defined as an Address with a certain number of characters will be dedicated to all address elements (fields) in the system.
- Completeness of the ERP system. A multitude of locations for sensitive data
Moreover, EDTs can be extended with additional values (extensions). This means that based on each defined EDT you can create an unlimited number of inherited fields with the same characteristics.
Both an EDT and its extensions can be used in any table or anywhere in the application. Therefore, to identify all sensitive data, e.g. data defined as an address, a company has to search and analyze all (hundreds or thousands of) elements containing such data. This matters all the more because, without an appropriate tool, the process would have to be carried out manually.
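To illustrate why this search is hard to do by hand, here is a minimal sketch of walking an EDT inheritance chain to collect every field that ultimately derives from a sensitive base type. The dictionaries, table names and field names are hypothetical stand-ins for application metadata; they are not a Dynamics 365 API.

```python
# Hypothetical metadata: each EDT maps to its parent EDT (None for a base type).
EDT_PARENT = {
    "Address": None,
    "DeliveryAddress": "Address",
    "InvoiceAddress": "Address",
    "WarehouseAddress": "DeliveryAddress",
    "Amount": None,
}

# Hypothetical metadata: (table, field) -> EDT used by that field.
FIELD_EDT = {
    ("Customers", "Street"): "Address",
    ("Vendors", "DeliveryStreet"): "DeliveryAddress",
    ("Warehouses", "Street"): "WarehouseAddress",
    ("Invoices", "Total"): "Amount",
}


def derives_from(edt, root, parents):
    """Return True if `edt` is `root` or inherits from it, directly or indirectly."""
    seen = set()  # guards against accidental cycles in the metadata
    while edt is not None and edt not in seen:
        if edt == root:
            return True
        seen.add(edt)
        edt = parents.get(edt)
    return False


def find_sensitive_fields(root, field_edt, parents):
    """List every (table, field) whose EDT is `root` or one of its descendants."""
    return sorted(
        key for key, edt in field_edt.items() if derives_from(edt, root, parents)
    )


hits = find_sensitive_fields("Address", FIELD_EDT, EDT_PARENT)
for table, field in hits:
    print(f"{table}.{field}")
```

In this toy example the Warehouses.Street field is found even though it uses an EDT two levels below Address, which is exactly the kind of location a manual review tends to miss.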
- Volume – database sizes
Even if we assume that we have identified the entire list of fields for anonymization, there is another challenge. It relates to the amount of data. On the one hand, you can imagine a company that has a permanent database of about 100 customers. On the other, there is a telecommunications company with a database of several million subscribers. Although the anonymization process in both entities is similar, the difference in scale is almost astronomical.
For example, it is no longer surprising that large companies, such as banks, process a total of 2 TB of data for anonymization every day for just one business process!
At this point, it is also worth citing the example of PESEL, the personal ID number used in Poland. It consists of 11 digits and is used within the Universal Electronic System for Registration of the Population. If we assume that a service provider wants to anonymize as many as 9 million such numbers (used in transactions) while also validating the field, the matter becomes non-trivial.
- The specificity of anonymized fields. Validations
It should be taken into account that such fields can also be validated in terms of quality. This is due to the simple fact that the personal ID number mentioned above has a specific and complex structure. The same applies to other data, such as an email address, which should contain the ‘@’ sign, or a bank account number, which should follow the IBAN format.
In practice, this means a series of validation mechanisms that check that each hash meets the criteria defined for the anonymization. This poses an additional challenge for anonymization at the SQL level, where it is very often impossible to implement in practice. It also involves additional activities, procedures and costs.
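As an illustration of such a validation, the sketch below checks the PESEL check digit (weights 1, 3, 7, 9 repeated over the first ten digits) and produces a replacement number that still passes the same check. The function names and the keyed-hash approach are assumptions made for this example, not a description of any particular tool. The replacement preserves only the length and the checksum; it does not encode a valid birth date, and uniqueness across millions of records is not guaranteed.

```python
import hashlib
import hmac

# Weights applied to the first ten digits when computing the PESEL check digit.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def pesel_check_digit(first_ten):
    """Compute the 11th (check) digit for a string of ten digits."""
    total = sum(int(d) * w for d, w in zip(first_ten, PESEL_WEIGHTS))
    return str((10 - total % 10) % 10)


def is_valid_pesel(value):
    """Check length, digits-only content and the check digit (not the date part)."""
    return (
        len(value) == 11
        and value.isascii()
        and value.isdigit()
        and pesel_check_digit(value[:10]) == value[10]
    )


def pseudonymize_pesel(value, key):
    """Hypothetical helper: derive a replacement that still passes the checksum.

    The same input and key always yield the same output.
    """
    digest = hmac.new(key, value.encode("ascii"), hashlib.sha256).hexdigest()
    first_ten = f"{int(digest, 16) % 10**10:010d}"
    return first_ten + pesel_check_digit(first_ten)


replacement = pseudonymize_pesel("44051401359", b"demo-key")
print(replacement, is_valid_pesel(replacement))
```

Doing the same in plain SQL is possible but awkward, which is one reason field-level validation is often skipped when anonymization is scripted by hand.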
- Repeatability
The above considerations should also take into account that data anonymization has to be repeatable and should eliminate human error. Even when best practices are followed in creating procedures at the SQL level, those procedures are not free from programming errors (a mistake in a command, an omitted step, etc.). In short: a repeatable anonymization process should not depend on people getting every step right.
Performing manual anonymization in a safe, consistent and error-free manner is practically impossible. It requires a huge amount of time, expert knowledge and expenditure. While smaller companies with fewer business processes manage to carry out such tasks in most cases, in larger companies the situation is difficult. This matters all the more because the financial penalties imposed on entities that do not comply with GDPR are high.
For this reason, despite the legal requirements, anonymization is carried out to a limited extent or contrary to the recommended practices.
Best practices
Good practices of anonymizing sensitive data mean that it is done in the following way:
- repeatable – on-demand or automatically
The process is carried out quickly, efficiently and at any time – within hours instead of days. Everything is done automatically, which reduces human errors at the SQL level to a minimum. In different environments, circumstances and times, all tasks are performed using the same method.
- controlled – precisely and unambiguously
This means that the company knows what data it anonymizes (its structure, complexity and length) and defines what parameters the hashes should have and according to which algorithm they are built.
- complete (EDT search)
An automatic EDT search ensures complete identification of where sensitive data occurs.
- consistent – same field types, the same way
Example. If we want to hash the Country field inherited from an EDT, for both the customer and the supplier, we could anonymize the two fields in different ways, on the grounds that they are two different types of fields. However, good practice for a field that refers to a specific country name is to hash it in the same way everywhere.
What does this mean? If the value Poland is replaced with the string Abcdef in one place, it will be replaced with the same string elsewhere. Thus, data consistency is ensured.
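One way to obtain this behaviour is to derive the replacement from the original value with a keyed hash, so that the same input always yields the same output regardless of which table it sits in. The sketch below is an assumption about how this could be done, not a description of a specific product; the key, the function name and the six-letter output format are made up for the example.

```python
import hashlib
import hmac
import string

# Hypothetical secret; in a real setup it would be stored outside the code.
KEY = b"demo-key-change-me"


def consistent_token(value, key, length=6):
    """Map a value to a capitalized pseudo-word; equal inputs give equal outputs."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Use the first `length` digest bytes (at most 32) to pick lowercase letters.
    letters = [string.ascii_lowercase[b % 26] for b in digest[:length]]
    return "".join(letters).capitalize()


# The same country name appears in a customer record and a supplier record.
customer_country = consistent_token("Poland", KEY)
supplier_country = consistent_token("Poland", KEY)
print(customer_country, supplier_country)
```

Because the replacement depends only on the value and the key, joins and groupings on the anonymized field keep working across tables. Short tokens can collide for different inputs, so a longer token or a collision check would be needed where uniqueness matters.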
- legible – readable, high-quality hashes
In practice, this means designing the strings of characters that replace the original values in the database so that they are relatively easy to read.
For example, in the Name field, a string beginning with a capital letter is much easier to read. Likewise, the length should approximately match the length of the original name. Quality, in turn, refers to preserving the structure of the replaced data, e.g. keeping the ‘@’ sign in email fields.
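A minimal sketch of such a shape-preserving replacement follows. It swaps letters for letters (keeping upper and lower case), digits for digits, and leaves every other character, such as ‘@’ or ‘.’, in place. The function name and the keyed-hash byte stream are assumptions for illustration only.

```python
import hashlib
import hmac
import string


def mask_preserving_shape(value, key):
    """Replace letters and digits but keep length, case pattern and punctuation."""
    # Build a deterministic byte stream at least as long as the value.
    stream = b""
    counter = 0
    while len(stream) < len(value):
        message = f"{counter}:{value}".encode("utf-8")
        stream += hmac.new(key, message, hashlib.sha256).digest()
        counter += 1

    out = []
    for ch, b in zip(value, stream):
        if ch.isdigit():
            out.append(string.digits[b % 10])
        elif ch.isalpha():
            letter = string.ascii_lowercase[b % 26]
            out.append(letter.upper() if ch.isupper() else letter)
        else:
            out.append(ch)  # '@', '.', spaces and similar are kept as-is
    return "".join(out)


masked_name = mask_preserving_shape("Kowalski", b"demo-key")
masked_email = mask_preserving_shape("Jan.Kowalski@example.com", b"demo-key")
print(masked_name, masked_email)
```

The masked name still starts with a capital letter and has the original length, and the masked email keeps its ‘@’ and dots in the same positions, so simple format validations continue to pass.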
At the same time, the following are recommended:
- anonymizing sensitive transactions or transaction totals
This is a very good practice. You might find that some tables contain summaries that suggest certain values. The sum of costs generated in cooperation with the top 10 suppliers can be strategic information in a given industry.
In other words, this approach avoids a situation where the reader of the record can guess who or what values the hashed information in a given table refers to.
- limiting the number of people involved in the anonymization setup process
The fewer people involved in the project, the better. This helps to avoid human errors and reduces the risk of unauthorized access to sensitive data.