Before using the service, please read the preliminary information, which describes the steps required to access the CLARIN-PL developer interface.
Fextor is a tool that determines a vector describing a given document based on selected text features. The values of individual vector dimensions are the frequencies of occurrence of selected features in the document.
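As an illustration of what such a vector looks like, here is a minimal Python sketch that counts how often selected lemmas occur in an already lemmatized document. It is not Fextor's implementation; the function name `feature_vector`, the sample lemmas, and the vocabulary are hypothetical.

```python
from collections import Counter


def feature_vector(lemmas, vocabulary):
    """Return the frequency of each vocabulary feature in a list of lemmas."""
    counts = Counter(lemmas)
    # A Counter returns 0 for features that never occur in the document.
    return [counts[feature] for feature in vocabulary]


# Hypothetical, already lemmatized document and a hypothetical feature set.
document_lemmas = ["szkoła", "być", "duży", "szkoła", "i", "być"]
vocabulary = ["szkoła", "być", "mały"]

print(feature_vector(document_lemmas, vocabulary))  # prints [2, 2, 0]
```

Each position in the resulting list corresponds to one selected feature, so documents described with the same vocabulary can be compared dimension by dimension.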
The service can be run with an LPMN query in the LPMN Client service. It accepts the following parameters:
- features: a list of features to extract from the input data. Possible values:
  - lemma_count: counting lemmas (default value),
  - interp_sign: counting punctuation marks,
  - mm_stat: overall statistics on the number of tokens in a text, lemmas and their frequencies, and part-of-speech tags,
- tags: a list of tags that tokens must have in order to be counted; if not provided, all tags are counted,
- exclude_tags: a list of tags that tokens must not have in order to be counted,
- stoplist: a path to a file containing lemmas (each on a new line) that should not be counted,
- startlist: a path to a file containing lemmas (each on a new line) that should be counted, or to a JSON file.

To run Fextor in Windows, you can use the following LPMN queries:
- ['fextor3']: runs Fextor with default values for a JSON CCL file.
- [['postagger','fextor3']]: input data in the form of a compressed file (.zip)
- ['postagger',{'fextor3':{'features':'mm_stat'}}]: extraction of statistics
- ['postagger',{'fextor3':{'features':['lemma_count','interp_sign']}}]: extraction of lemmas and punctuation marks
- ['postagger',{'fextor3':{'startlist':'@clarin://startlista.txt'}}]: defining a startlist; only the lemmas included in the startlist will be counted
- ['postagger',{'fextor3':{'stoplist':'@clarin://stoplista.txt'}}]: defining a stoplist; only the lemmas not included in the stoplist will be counted
- ['postagger',{'fextor3':{'features':['lemma_count','interp_sign'],'tags':['subst','interp']}}]: extraction of lemmas and punctuation marks, with a startlist of NKJP tags
- ['postagger',{'fextor3':{'features':['lemma_count','interp_sign'],'stoplist':'@clarin://stoplista.txt','tags':['subst','interp']}}]: extraction of lemmas and punctuation marks, with a stoplist of lemmas and a startlist of NKJP tags
- [{'fextor3':{'features':['lemma_count','interp_sign'],'tags_excluded':['ADJ','SPACE','PUNCT','NOUN']}}]: a stoplist of UD tags

The input is a file from Postagger in JSON CCL format.
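The LPMN queries above all share the same shape, so they can be assembled programmatically instead of typed by hand. The following is a minimal sketch that only builds the query string; it does not contact the service. The helper names `lpmn_query` and `fextor_step` are hypothetical, and the sketch assumes that no step name or option value contains a quote character.

```python
import json


def fextor_step(**options):
    """Return the fextor3 step: a bare name, or a dict when options are given."""
    return {"fextor3": options} if options else "fextor3"


def lpmn_query(*steps):
    """Render pipeline steps as a compact LPMN query string with single quotes.

    Assumption: no step name or option value contains a quote character,
    so swapping double quotes for single quotes is safe.
    """
    compact = json.dumps(list(steps), separators=(",", ":"), ensure_ascii=False)
    return compact.replace('"', "'")


print(lpmn_query(fextor_step()))
# prints ['fextor3']

print(lpmn_query("postagger", fextor_step(features=["lemma_count", "interp_sign"],
                                          tags=["subst", "interp"])))
# prints ['postagger',{'fextor3':{'features':['lemma_count','interp_sign'],'tags':['subst','interp']}}]
```

Building the query from Python data structures keeps the brackets balanced, which is easy to get wrong in the longer queries.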
The files containing the stoplist and startlist should be saved in the user space on the https://services.clarin-pl.eu/storage website.
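A plain-text stoplist or startlist holds one lemma per line. The sketch below writes such a file locally so that it can then be saved in the user space; the upload itself is not shown. The function name `write_lemma_list` and the sample lemmas are hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def write_lemma_list(path, lemmas):
    """Write unique, non-empty lemmas to a file, one per line; return their count."""
    cleaned = (lemma.strip() for lemma in lemmas)
    # dict.fromkeys removes duplicates while keeping the original order.
    unique = list(dict.fromkeys(lemma for lemma in cleaned if lemma))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


# Hypothetical stoplist of common function words.
written = write_lemma_list("stoplista.txt", ["i", "w", "na", "i", " z "])
print(written)  # prints 4
```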
Example for a slice of a dictionary, shown first as terms with their lemmas and then as the corresponding JSON entries:
terms | lemmas |
---|---|
szkoła jezior | szkoła jezioro |
szkoła literacka | szkoła literacki |
szkoła strukturalna | szkoła strukturalny |
szkoła sycylijska | szkoła sycylijski |
szkoła śląska | szkoła śląski |
szkoła ukraińska | szkoła ukraiński |
"szkoła": [
{"lemma": "szkoła", "parts": []},
{"lemma": "szkoła jezioro", "parts": ["jezioro"], "term": "szkoła_jezior"},
{"lemma": "szkoła literacki", "parts": ["literacki"], "term": "szkoła_literacka"},
{"lemma": "szkoła strukturalny", "parts": ["strukturalny"], "term": "szkoła_strukturalna"},
{"lemma": "szkoła sycylijski", "parts": ["sycylijski"], "term": "szkoła_sycylijska"},
{"lemma": "szkoła śląski", "parts": ["śląski"], "term": "szkoła_śląska"},
{"lemma": "szkoła ukraiński", "parts": ["ukraiński"], "term": "szkoła_ukraińska"}]
The element {"lemma": "szkoła", "parts": []} does not have a "term" key because the word "szkoła" does not occur as a separate term.
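The mapping from the terms and lemmas table to the JSON entries can be sketched in Python. This is a reconstruction from the example only, not an official converter: it assumes that entries are grouped under the first word of the lemmatized form, that "parts" holds the remaining lemma words, that "term" is the term with spaces replaced by underscores, and that each group starts with an entry for the head word. The function name `build_term_dictionary` is hypothetical.

```python
import json


def build_term_dictionary(pairs):
    """Group (term, lemma) pairs under the first word of the lemmatized form."""
    pairs = list(pairs)
    # Single-word lemmas that occur as terms of their own get a "term" key.
    single_word_terms = {lemma: term for term, lemma in pairs if " " not in lemma}
    dictionary = {}
    for term, lemma in pairs:
        head, *parts = lemma.split()
        if head not in dictionary:
            head_entry = {"lemma": head, "parts": []}
            if head in single_word_terms:
                head_entry["term"] = single_word_terms[head].replace(" ", "_")
            dictionary[head] = [head_entry]
        if parts:
            dictionary[head].append(
                {"lemma": lemma, "parts": parts, "term": term.replace(" ", "_")}
            )
    return dictionary


pairs = [
    ("szkoła jezior", "szkoła jezioro"),
    ("szkoła literacka", "szkoła literacki"),
]
print(json.dumps(build_term_dictionary(pairs), ensure_ascii=False, indent=2))
```

Because "szkoła" is not among the terms in this input, its head entry is created without a "term" key, matching the example above.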
The output is a JSON file with the counted frequencies of the selected features.
mm_stat:
{
  "tokens": {"count": count},
  "lemmas": {"lemma_name": count, ...},
  "tags": {"tag_name": count, ...}
}
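Once the statistics have been downloaded, they can be summarized with a few lines of Python. The sketch below assumes it receives the object with the "tokens", "lemmas", and "tags" keys shown above; whether the service nests that object under an outer key is not confirmed here. The function name `summarize_mm_stat` and the sample counts are hypothetical.

```python
import json


def summarize_mm_stat(stats, top_n=3):
    """Return the token count, the most frequent lemmas, and each tag's share."""
    token_count = stats["tokens"]["count"]
    # Sort by descending count, then alphabetically to keep ties stable.
    ranked = sorted(stats["lemmas"].items(), key=lambda item: (-item[1], item[0]))
    tag_share = {}
    if token_count:
        tag_share = {tag: count / token_count for tag, count in stats["tags"].items()}
    return {
        "token_count": token_count,
        "top_lemmas": ranked[:top_n],
        "tag_share": tag_share,
    }


# Hypothetical statistics in the structure described above.
sample = json.loads(
    '{"tokens": {"count": 10},'
    ' "lemmas": {"szkoła": 3, "być": 2, "duży": 1},'
    ' "tags": {"subst": 4, "fin": 2, "adj": 1, "interp": 3}}'
)
summary = summarize_mm_stat(sample, top_n=2)
print(summary["top_lemmas"])  # prints [('szkoła', 3), ('być', 2)]
```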
In Colab: Fextor - zliczanie częstości wystąpień wybranych cech w tekście (Fextor: counting the frequencies of selected features in a text)
(C) CLARIN-PL