Before using the service, please read the preliminary information containing a description of steps that enable access to the CLARIN-PL developer interface.
Embedder is a service that represents texts in numerical form using deep models. This enables topic classification and visualization of semantic similarity. The input text is divided into fragments of the desired length (e.g., 300 words); each fragment is analyzed by the selected model, which generates either a sequence of multidimensional semantic similarity vectors (word embeddings) or a sequence of words.
The generated vectors can be visualized as points on a plane using methods such as UMAP. This produces a chart in which the points represent individual fragments of the text, and the distances between the points illustrate their similarity. Texts with similar topics receive vectors lying close to each other, while texts that are semantically distant receive vectors that are far apart. Visualizing the text in this way allows for quick analysis of the data and an aesthetically pleasing presentation.
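As a toy illustration of this idea (not of Embedder's actual output), the sketch below uses invented 2D coordinates, such as a UMAP projection might assign to three text fragments, and checks that fragments on the same topic lie closer together than unrelated ones.

```python
import math

# Invented 2D points for three text fragments, as a UMAP projection
# might place them (coordinates are made up for illustration only).
points = {
    "sports_1": (0.9, 1.1),   # fragment about football
    "sports_2": (1.0, 0.8),   # another fragment about football
    "finance":  (5.2, 4.7),   # fragment about stock markets
}

def euclidean(a, b):
    """Straight-line distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

d_same = euclidean(points["sports_1"], points["sports_2"])
d_diff = euclidean(points["sports_1"], points["finance"])

print(f"same topic:      {d_same:.2f}")
print(f"different topic: {d_diff:.2f}")
assert d_same < d_diff  # similar topics end up closer on the plane
```

On such a chart the two sports fragments would appear as neighbouring points, with the finance fragment clearly separated from them.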
This service was created for the needs of analytics at Wroclaw University of Technology, so Embedder has been tested on data for a large organization. It is trainable and can be adapted to different types of communication.
Word embedding is a technique for encoding the meaning of words as semantically meaningful vector representations. An embedding vector is a numerical vector obtained by transforming a given word from the text; it represents the word's occurrence in a specific context.
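A minimal sketch of how such vectors encode meaning, using tiny invented 3-dimensional vectors (real models produce hundreds of dimensions): cosine similarity between embedding vectors is high for semantically related words and low for unrelated ones.

```python
import math

# Invented 3-dimensional embedding vectors, for illustration only;
# real embedding models use vectors with hundreds of dimensions.
embeddings = {
    "king":   [0.8, 0.6, 0.1],
    "queen":  [0.7, 0.7, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: close to 1.0 means
    the vectors point in nearly the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embeddings["king"], embeddings["queen"]))   # high
print(cosine_similarity(embeddings["king"], embeddings["banana"]))  # low
```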
The current version of the service provides the following deep models:
Semantic word embeddings can be used in the information retrieval process, and Embedder can be the basis for tools for:
The service can also be used in research requiring the analysis of texts limited to one sentence, such as surveys.
Embedder can be run by using an LPMN query in the LPMN Client service:
The service can be run with default values using the following LPMN query:

['any2txt','embedder']

Other example queries:

[['any2txt','embedder']]
- input data in the form of a compressed directory (.zip)

['any2txt',{'embedder':{'n_words':300}}]
- specified maximum number of words per segment of the divided text

['any2txt',{'embedder':{'type':'fast_kgr10'}}]
- model selection and the default number of words per segment of the divided text

['any2txt',{'embedder':{'type':'sbert-distiluse-base-multilingual-cased-v1','n_words':400}}]
- model selection and a specified maximum number of words per segment of the divided text

Parameters:

type
- model selection:
  fast_kgr10
  sbert-klej-cdsc-r - default option
  sbert-distiluse-base-multilingual-cased-v1
  sbert-paraphrase-multilingual-mpnet-base-v2
  t5-clarin-keywords-plt5-small-shuffle
  t5-voicelab-vlt5-base-keywords

n_words
- specifies the maximum number of words that a text fragment can contain; the default value is 100. The text is divided into fragments at sentence endings.

For more information about each model, see here.
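The splitting behaviour described above can be sketched in plain Python: whole sentences are packed into a fragment until adding the next sentence would exceed the word limit. This is only an approximation of the service's logic, assuming a naive sentence split on terminal punctuation; the hypothetical parameter name n_words mirrors the service parameter.

```python
import re

def split_into_fragments(text, n_words=100):
    """Greedily pack whole sentences into fragments of at most n_words words.

    Approximation of the documented behaviour: the text is divided at
    sentence endings, and a single sentence longer than the limit
    becomes its own fragment.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    fragments, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > n_words:
            fragments.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        fragments.append(" ".join(current))
    return fragments

text = "One two three. Four five six seven. Eight nine."
for frag in split_into_fragments(text, n_words=7):
    print(frag)
```

With n_words=7 the example text yields two fragments: the first two sentences together (7 words), then the last sentence on its own.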
Input: a text file.

Output: a text file containing the text divided into fragments and the corresponding embedding vectors.
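The exact layout of the output file is not specified here; purely as a sketch, the parser below assumes a hypothetical tab-separated format with the fragment text in the first column and the vector components in the remaining columns. The real Embedder output layout may differ.

```python
def parse_output_line(line):
    """Parse one line of a *hypothetical* tab-separated output format:
    fragment text, then the components of its embedding vector.
    (The actual Embedder output format may be different.)"""
    fragment, *components = line.rstrip("\n").split("\t")
    return fragment, [float(x) for x in components]

line = "This is a fragment.\t0.12\t-0.40\t0.77"
fragment, vector = parse_output_line(line)
print(fragment)
print(vector)
```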
In Colab: Embedder - Determining the sequence of word vectors or the sequence of words
(C) CLARIN-PL