Before using the service, please read the preliminary information containing a description of steps that enable access to the CLARIN-PL developer interface.
Easymatcher is a service that allows you to search for selected words or phrases in a text and mark them. It is available for all languages.
The service is based on cosine similarity between individual words, so correct spelling of the input data is very important.
Example: if the dictionary contains the word beton and the document contains brtunu, the cosine similarity may be too low to mark the latter as beton. This is not because the word is inflected (nominative beton, genitive betonu), but because the word in the document contains too many spelling errors. Likewise, the lower the quality of the data in the dictionary, the less reliably the data in the document can be marked.
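The effect described in this example can be reproduced with a small sketch. How the service turns words into vectors is not stated here, so the sketch assumes character-bigram counts as the vector representation; the functions char_bigrams and cosine_similarity are hypothetical helpers, not part of the service.

```python
from collections import Counter
from math import sqrt


def char_bigrams(word: str) -> Counter:
    """Count overlapping character bigrams of a lower-cased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of two words.

    Assumption: bigram counts stand in for whatever vectors the service uses.
    """
    va, vb = char_bigrams(a), char_bigrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# "betonu" contains all four bigrams of "beton" plus one extra ("nu").
inflected = cosine_similarity("beton", "betonu")   # about 0.89
# "brtunu" shares no bigram with "beton".
misspelled = cosine_similarity("beton", "brtunu")  # 0.0
```

Under this assumption the inflected form stays above the default threshold of 0.65, while the misspelled form falls below it.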
This service can be useful in situations where you need to quickly and easily mark up text with predefined categories or labels.
Easymatcher searches each document separately for words and phrases found in the label dictionary. Matches are scored by cosine similarity, whose minimum value is set by the user. When a word or phrase scores at or above that value, the service creates a tuple describing the match and appends it to a list that is placed at the end of the returned document in the output file. The tuples have the following form:
(start_index_of_word, end_index_of_word, key_from_the_label_dictionary)
Note: Letter case is not taken into account when marking up the text.
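The matching procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the service's implementation: similarity is computed over character-bigram counts, only single words are compared (phrases are not handled), and the end index is assumed to be exclusive. The names similarity and mark are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def similarity(a: str, b: str) -> float:
    """Stand-in similarity: cosine over character-bigram counts, case-insensitive."""
    a, b = a.lower(), b.lower()
    va = Counter(a[i:i + 2] for i in range(len(a) - 1))
    vb = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(n * vb[g] for g, n in va.items())
    norm = sqrt(sum(n * n for n in va.values()) * sum(n * n for n in vb.values()))
    return dot / norm if norm else 0.0


def mark(text: str, labels: dict[str, list[str]], sim_threshold: float = 0.65):
    """Return (start, end, label) tuples for words that match a dictionary term."""
    matches = []
    for word in re.finditer(r"\w+", text):
        for label, terms in labels.items():
            if any(similarity(word.group(), t) >= sim_threshold for t in terms):
                # Assumption: end index is exclusive, as in Python slicing.
                matches.append((word.start(), word.end(), label))
                break  # keep at most one label per word in this sketch
    return matches


labels = {"material": ["beton"]}
result = mark("Strop z BETONU i stali.", labels)
# result == [(8, 14, "material")]
```

The upper-case BETONU is still matched because both sides are lower-cased before comparison.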
Easymatcher can be run:
The service requires you to upload two files: the first one containing the labels to be found, and the second one containing the text to be searched.
lpmn_client_biz ['any2txt'] labels2.json -it file -v -ot file_id
The result of the processing is the path to the uploaded file on the server. Once uploaded, the file is stored in the user space and does not need to be uploaded again.
lpmn_client_biz [{'easymatcher': {'labels_path': '@clarin://results/lpmn_client/*'}}] zwierzeta.txt -it file -v
Using @clarin:// will cause the file to be searched for in the user space.
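The LPMN query passed on the command line is a list of single-key dictionaries, so it can be assembled programmatically. The sketch below is only an illustration: build_query is a hypothetical helper, and placing sim_threshold (minimum cosine similarity) and n_workers (number of parallel workers) in the same dictionary as labels_path is an assumption, not something this page confirms.

```python
def build_query(labels_path: str, **params) -> str:
    """Render an LPMN query string in the single-quoted form used on this page.

    Assumption: optional parameters sit next to labels_path in the same dict.
    """
    options = {"labels_path": labels_path, **params}
    return str([{"easymatcher": options}])


default_query = build_query("@clarin://terminy.json")
tuned_query = build_query("@clarin://terminy.json", sim_threshold=0.8, n_workers=2)
print(default_query)
print(tuned_query)
```

The resulting string can then be pasted as the first argument of lpmn_client_biz in place of the hand-written query.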
sim_threshold - the minimum accepted cosine similarity, default: 0.65,
n_workers - the number of worker instances marking up the text in parallel, default: 1,
documents_path - the path to the file with the documents,
labels_path - the path to the file with the labels.
The service can be run in the Windows system with default values using the following LPMN query:
[{'easymatcher':{'labels_path': '@clarin://terminy.json'}}]
This query requires:
- a file with terms named terminy.json,
- input data in the form of a compressed directory (.zip).
You need to upload two files:
The structure of the labels file:
{
"labels": {
"Example label 1": ["example 1", "example 2"],
"Example label 2": ["example 3", "example 4"],
...
}
}
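A file with this structure can be generated with Python's json module. The sketch below is a minimal example; the helper write_labels and the example labels are hypothetical, and only the top-level "labels" key and the label-to-terms mapping come from the structure shown above.

```python
import json
from pathlib import Path


def write_labels(path, mapping):
    """Write {label: [terms]} under the top-level "labels" key."""
    payload = {"labels": {label: list(terms) for label, terms in mapping.items()}}
    # ensure_ascii=False keeps Polish characters readable in the file.
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return payload


payload = write_labels(
    "terminy.json",
    {"material": ["beton", "stal"], "animal": ["pies", "kot"]},
)
```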
The structure of the documents file (one JSON object per line):
{"text": "Example text 1"}
{"text": "Example text 2"}
...
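A documents file in this line-per-record layout can be produced as sketched below. The helper write_documents, the file name documents.jsonl, and the example sentences are hypothetical; only the "text" key comes from the structure shown above.

```python
import json
from pathlib import Path


def write_documents(path, texts):
    """Write one {"text": ...} JSON object per line."""
    lines = [json.dumps({"text": t}, ensure_ascii=False) for t in texts]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


count = write_documents(
    "documents.jsonl",
    ["Pies goni kota.", "Strop z betonu i stali."],
)
```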
A JSONL file containing:
text - the text that was searched,
label - a list of matches, each giving the start index of the matched word, the end index, and the label.
Example structure of the output file:
{"text": "Example text 1", "label": [(2, 6, "Example label 1"), (15, 23, "Example label 7")]}
{"text": "Example text 2", "label": [(7, 21, "Example label 33")]}
...
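The records shown above use parenthesised tuples, which strict JSON parsers reject. A tolerant reader could look like the sketch below, which tries json.loads first and falls back to ast.literal_eval; parse_output_line is a hypothetical helper, and whether the real output uses parentheses or JSON arrays is not confirmed here.

```python
import ast
import json


def parse_output_line(line: str) -> dict:
    """Parse one output record, accepting JSON arrays or Python-style tuples."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Parentheses are not valid JSON; read the line as a Python literal instead.
        record = ast.literal_eval(line)
    record["label"] = [tuple(match) for match in record["label"]]
    return record


line = '{"text": "Example text 1", "label": [(2, 6, "Example label 1"), (15, 23, "Example label 7")]}'
record = parse_output_line(line)
for start, end, label in record["label"]:
    print(start, end, label)
```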
In Colab: Easymatcher - Searching for and marking selected words or phrases in the text
(C) CLARIN-PL