Before using the service, please read the preliminary information containing a description of steps that enable access to the CLARIN-PL developer interface.
A service for automatic anonymization of unstructured text documents. It removes or replaces sensitive data in the text, such as personal names, place names, organization names, and identification numbers.
It is available for the Polish language.
Anonymization consists in transforming sensitive data in a way that prevents individual information from being linked to specific entities.
The service uses WiNER, a named entity recognition (NER) tool.
For more information (available in Polish), see Wytyczne ręcznej anonimizacji (guidelines for manual anonymization). The guidelines describe how the Anonymizer should work and focus on pseudonymization.
The following tags will be used by default to replace sensitive data:
[OSOBA] - first name, last name, pseudonym composed of first and last name; also applies to fictitious characters
[MIEJSCE] - address data, street, road, motorway, city, country, other place name
[ORGANIZACJA] - organization (e.g., institution, association, political alliance)
[NAZWA WŁASNA] - other proper name
[TYTUŁ] - work title (e.g., film, book)
[CYFRY] - year, phone number, TIN, KRS number, other sequence of digits
[NAZWA WODNA] - name of a body of water (e.g., river, lake)
[WWW] - web address
@[USER] - user name
[MAIL] - email address
[DATA] - full date
[NUMER IDENTYFIKACYJNY] - e.g., serial number
The list of tags was created based on the annotations for the Named Entity Recognition (NER) tool in KPWr.
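When post-processing output produced with tags, it can help to check which categories were replaced and how often. The following is a minimal sketch, not part of the service: the helper name `count_tags` and the regular expression are assumptions made for illustration, and only the tag names come from the list above.

```python
import re
from collections import Counter

# Default replacement tags as listed in this section (without brackets).
DEFAULT_TAGS = [
    "OSOBA", "MIEJSCE", "ORGANIZACJA", "NAZWA WŁASNA", "TYTUŁ", "CYFRY",
    "NAZWA WODNA", "WWW", "USER", "MAIL", "DATA", "NUMER IDENTYFIKACYJNY",
]

# Match any known tag wrapped in square brackets, e.g. "[OSOBA]".
# Longer names are tried first; each alternative is escaped literally.
_TAG_PATTERN = re.compile(
    r"\[("
    + "|".join(re.escape(t) for t in sorted(DEFAULT_TAGS, key=len, reverse=True))
    + r")\]"
)


def count_tags(anonymized_text: str) -> Counter:
    """Count how many times each default tag occurs in the given text.

    Hypothetical helper for illustration; it only inspects text that has
    already been anonymized and does not call the service.
    """
    return Counter(_TAG_PATTERN.findall(anonymized_text))


if __name__ == "__main__":
    sample = "My name is [OSOBA] [OSOBA] and I live in [MIEJSCE]."
    print(count_tags(sample))
```

Bracketed text that is not one of the listed tags is ignored, so ordinary square brackets in the source document do not inflate the counts.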
Example input: My name is Jan Kowalski and I live in Wrocław.
Anonymization modes:
delete - My name is and I live in .
tag - My name is [OSOBA] [OSOBA] and I live in [MIEJSCE].
pseudo - My name is Janek Stolarczyk and I live in Kraków.
This service may be useful in research involving sensitive data, for example in medicine or the judiciary, where the anonymity of the individuals whose data is used must be preserved.
Anonymizer can be run:
method - defines the anonymization method:
- tag - changes sensitive data into tags corresponding to the categories of data (default),
- delete - removes any words that could be sensitive data,
- pseudo - replaces sensitive data with other randomly chosen words from the appropriate category.
The service can be run in the Windows system with default values using the following LPMN query: ['any2txt',{'postagger':{'method':'ner'}},'anonymizer'].
[['any2txt',{'postagger':{'method':'ner'}},'anonymizer']] - input data in the form of a compressed directory (.zip)
['any2txt',{'postagger':{'method':'ner'}},{'anonymizer':{'method':'tag'}}] - sensitive data replaced with tags corresponding to the categories of data
['any2txt',{'postagger':{'method':'ner'}},{'anonymizer':{'method':'pseudo'}}] - sensitive data replaced with randomly chosen words from the appropriate category
Input: text, text file or ZIP directory containing text files.
Output: anonymized text, text file or ZIP directory containing anonymized text files.
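The LPMN queries listed above differ only in the anonymizer step and in the outer brackets used for ZIP input, so they can be assembled programmatically. The sketch below is a hedged illustration: `build_lpmn` is a hypothetical helper, not part of any CLARIN-PL client library. The query for the delete method and the combination of ZIP input with an explicit method are not shown in this document and are assumed here by analogy.

```python
VALID_METHODS = ("tag", "delete", "pseudo")


def build_lpmn(method=None, zip_input=False):
    """Return an LPMN query string in the notation used in this section.

    method:    None for the service default, or one of VALID_METHODS.
    zip_input: wrap the pipeline in an extra list, as in the ZIP example.
    """
    if method is not None and method not in VALID_METHODS:
        raise ValueError(f"unknown method: {method!r}")

    # The anonymizer step is a bare name when defaults are used,
    # otherwise a mapping carrying the chosen method.
    if method is None:
        anonymizer_step = "anonymizer"
    else:
        anonymizer_step = {"anonymizer": {"method": method}}

    pipeline = ["any2txt", {"postagger": {"method": "ner"}}, anonymizer_step]
    query = [pipeline] if zip_input else pipeline

    # repr() gives the single-quoted form used in the document; no value
    # contains a space, so removing spaces only compacts the separators.
    return repr(query).replace(" ", "")


if __name__ == "__main__":
    print(build_lpmn())
    print(build_lpmn(zip_input=True))
    print(build_lpmn(method="pseudo"))
```

Building the query from nested Python structures avoids hand-editing brackets and quotes, which is where such strings most often go wrong.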
In Colab: Anonymizer - Removal of sensitive data from text
(C) CLARIN-PL