Before using the service, please read the preliminary information, which describes the steps required to access the CLARIN-PL developer interface.
Websty is a tool for analyzing and visualizing the similarity of documents.
The service can be run with the following parameters:
- type - the method for determining the vectors that describe documents:
  - authorship - authorship analysis
  - 100base - the 100 most frequent lemmas
  - 100orths - classic authorship attribution (the 100 most frequent words from the text)
  - 1000base - classic content similarity (the 1000 most frequent lemmas)
  - multilingual-e5-base - default value
- lang - the language of the text, crucial for TF-IDF based methods:
  - pl - default value
- chunk_size - the size (in bytes) of the fragments that the input files are divided into:
  - 20000 - default value
- metric - the method for determining distance in similarity analysis for clustering and visualization (UMAP):
  - cosine - default value
  - euclidean
  - manhattan
  - chebyshev
  - minkowski
  - canberra
  - braycurtis
  - haversine
  - mahalanobis
  - wminkowski
  - seuclidean
  - correlation
- n_neighbors - the number of neighbors in the UMAP method:
  - 15 - default value
- min_dist - the minimum distance in the UMAP method:
  - 0.1
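The parameters above can be collected and checked before a query is built. The following is a minimal sketch, not part of any CLARIN-PL library: the `WebstyParams` class and the `split_into_chunks` helper are hypothetical names introduced here for illustration. The allowed values and defaults are copied from the list above (0.1 is the value listed for `min_dist`), the range checks are assumptions, and the byte-based splitting only illustrates what `chunk_size` means; the service may cut fragments at different boundaries.

```python
from dataclasses import dataclass, asdict

# Allowed values copied from the parameter list in this document.
VECTOR_TYPES = (
    "authorship", "100base", "100orths", "1000base", "multilingual-e5-base",
)
METRICS = (
    "cosine", "euclidean", "manhattan", "chebyshev", "minkowski", "canberra",
    "braycurtis", "haversine", "mahalanobis", "wminkowski", "seuclidean",
    "correlation",
)


@dataclass
class WebstyParams:
    """Hypothetical container for Websty parameters (illustration only)."""

    type: str = "multilingual-e5-base"
    lang: str = "pl"
    chunk_size: int = 20000  # fragment size in bytes
    metric: str = "cosine"
    n_neighbors: int = 15
    min_dist: float = 0.1  # value listed in the document

    def __post_init__(self) -> None:
        # Reject values that do not appear in the documented lists.
        if self.type not in VECTOR_TYPES:
            raise ValueError(f"unknown type: {self.type!r}")
        if self.metric not in METRICS:
            raise ValueError(f"unknown metric: {self.metric!r}")
        # The range checks below are assumptions, not documented limits.
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if self.n_neighbors <= 0:
            raise ValueError("n_neighbors must be positive")
        if self.min_dist < 0:
            raise ValueError("min_dist must be non-negative")


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split raw bytes into consecutive fragments of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


if __name__ == "__main__":
    params = WebstyParams(type="100orths", metric="euclidean")
    print(asdict(params))
    print([len(c) for c in split_into_chunks(b"a" * 45000, params.chunk_size)])
```

Keeping the defaults in one place makes it easier to see which values differ from a default run.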
On Windows, the service can be run with default values using the following LPMN query: ['websty']
[['websty']]
- input data in the form of a compressed directory (.zip)
Corpus
JSON, JSONL and HTML files.
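Since the input is a compressed directory (.zip), the corpus has to be packed before it is submitted. A minimal sketch using Python's standard `zipfile` module is shown below. The function name `zip_corpus` and the example paths are hypothetical, and placing the files at the root of the archive (rather than inside a top-level folder) is an assumption, since the expected layout is not stated here.

```python
import zipfile
from pathlib import Path


def zip_corpus(corpus_dir: str, zip_path: str) -> int:
    """Pack every file under corpus_dir into zip_path.

    Returns the number of files added. Paths inside the archive are
    stored relative to corpus_dir (an assumed layout).
    """
    root = Path(corpus_dir)
    target = Path(zip_path).resolve()
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories and the archive itself if it lives inside root.
            if path.is_file() and path.resolve() != target:
                archive.write(path, arcname=path.relative_to(root).as_posix())
                count += 1
    return count


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    added = zip_corpus("my_corpus", "my_corpus.zip")
    print(f"packed {added} files")
```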
(C) CLARIN-PL