Before using the service, please read the preliminary information, which describes the steps required to gain access to the CLARIN-PL developer interface.
The service is used to automatically extract information about the topics discussed in texts. It uses topic modeling (LDA), which detects topics based on the co-occurrence of words within a single document. The service assigns each document to several thematic groups. The detected topics are returned as lists of word-probability pairs and visualized as word clouds. The service enables both qualitative analysis (detection of non-obvious topics) and quantitative processing of texts.
The processing consists of the following stages:
The obtained results require interpretation by the researcher.
The service uses topic modeling - statistical methods for extracting hidden topics. A topic in the context of topic modeling is a set of pairs: a word and the probability of its occurrence.
The method used is Latent Dirichlet Allocation (LDA). LDA identifies hidden topics by estimating model parameters (word and topic probabilities) from the observed texts, which the model treats as the output of a random generator with given probabilities. The method also assumes that each analyzed text may consist of multiple topics, each topic being a set of word occurrence probabilities. A given document is therefore assigned to several topics: it can be, for example, 30% about sports and 70% about politics, and thus belong to two thematic groups at the same time (fuzzy clustering).
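To make this concrete, the following is a minimal sketch using the gensim library (the service itself uses its own backends such as BigARTM, MALLET LDA and GuidedLDA, not gensim); the documents, parameter values and topic counts are illustrative only.

# Minimal LDA sketch (assumption: gensim as a stand-in for the service's backends).
from gensim import corpora
from gensim.models import LdaModel

# Toy documents, already tokenized and lemmatized (the service does this
# with its own pipeline, e.g. postagger + fextor3).
documents = [
    ["match", "goal", "team", "coach", "season", "win"],
    ["election", "minister", "parliament", "vote", "policy", "law"],
    ["team", "coach", "vote", "policy", "season", "minister"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# alpha and eta play the role of the service's alpha and beta parameters.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=100, alpha=0.1, eta=0.01, random_state=0)

# Each topic is a list of (word, probability) pairs.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))

# Each document is a mixture of topics (fuzzy clustering): the third toy
# document may come out, for example, as 0.4 of topic 0 and 0.6 of topic 1.
for bow in corpus:
    print(lda.get_document_topics(bow, minimum_probability=0.0))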
The service allows the following parameters to be set:
The service can be used in two ways:
The service is designed to work with large data sets. Examples of data sets that can be processed in the service include:
The service can be run:
no_topics - number of topics to be returned (default value: 20).
no_passes - number of algorithm iterations (default value: 100). The value will be adjusted to the range (0, 2000].
method - clustering method: artm_bigartm (default value), lda_mallet, guided_lda.
seed_topics_path - path to the file containing seed topics (default value: /samba/seed_topics.json). This parameter is used only in the guided_lda method; a sketch for generating such a file follows the query examples below. Example file content:
{"topics": [["game", "team", "win", "player", "season", "second", "victory"], ["percent", "company", "market", "price", "sell", "business", "stock", "share"], ["music", "write", "art", "book", "world", "film"], ["political", "government", "leader", "official", "state", "country", "american", "case", "law", "police", "charge", "officer", "kill", "arrest", "lawyer"]]}
topic_scaling - multidimensional scaling method used when creating topic_vis.html: pcoa (default value), mmds, tsne.
alpha - regularization coefficient for the distribution of topics in documents (default value: 0.1). More information: Smooth/Sparse Theta, tau.
beta - regularization coefficient for the distribution of words in topics (default value: 0.01). More information: Smooth/Sparse Phi, tau.
model - path to a trained model, e.g. /request/service/resultID. If the path points to a call that was not a training call (i.e. a call in which the model parameter is defined), the same model as in the query pointed to by the path will be used.
The service can be run in the Windows system with default values using the following LPMN query:
[['postagger', 'fextor3'] 'feature2', 'topic3': {'method': 'lda_mallet', 'no_topics': 50, 'no_passes': 500}] - input data in the form of a compressed file (.zip)
[[{'postagger': {'lang': 'pl'}}, {'fextor3': {'tags': ['subst'], 'stoplist': '@clarin://stoplista1'}}], {'feature2': {'filter': {'base': {'min_df': 2, 'max_df': 1, 'keep_n': 2000}}}}, {'topic3': {'method': 'lda_mallet', 'no_topics': 50, 'no_passes': 500}}] - a query with filtering of nouns (subst) and the stoplist stoplista1
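The seed topics file used by guided_lda is plain JSON with a single "topics" key, as shown in the parameter list above. The sketch below only illustrates how such a file can be produced; the output file name and the word lists are illustrative, and where the file must be placed depends on your installation (the service's default is /samba/seed_topics.json).

# Write a seed-topics file in the format expected by the guided_lda method.
import json

seed_topics = {
    "topics": [
        ["game", "team", "win", "player", "season"],
        ["percent", "company", "market", "price", "stock"],
        ["political", "government", "law", "police", "officer"],
    ]
}

with open("seed_topics.json", "w", encoding="utf-8") as out:
    json.dump(seed_topics, out, ensure_ascii=False, indent=2)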
A text corpus of appropriate length: the length of the texts should be determined individually for each task.
The acquired data contains contamination specific to its source, which can distort the processing results. To avoid this, the data should be cleaned before processing begins.
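The cleaning steps depend on where the texts come from. As a minimal illustration, assuming web-sourced documents with HTML tags, URLs and e-mail addresses as typical contamination (the patterns below are not part of the service and should be adapted to your own corpus), a pre-processing pass could look like this:

import re

def clean_text(raw):
    # Illustrative patterns only; adapt them to the contamination actually
    # present in your source (boilerplate, navigation menus, OCR noise, etc.).
    text = re.sub(r"<[^>]+>", " ", raw)        # strip HTML/XML tags
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"\S+@\S+", " ", text)       # drop e-mail addresses
    text = re.sub(r"\s+", " ", text)           # collapse repeated whitespace
    return text.strip()

print(clean_text("<p>Contact us at office@example.com or visit https://example.com</p>"))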
The service returns lists of word-probability pairs in the form of word clouds. The font size of a word is determined in equal parts (50/50) by its probability and by its position on the list.
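The exact scaling used by the service is not specified here; the sketch below only illustrates one possible reading of this 50/50 weighting, and the function name, font range and normalization are assumptions.

def font_size(probability, rank, max_probability, n_words,
              min_pt=10.0, max_pt=40.0):
    # Hypothetical 50/50 weighting: half of the weight comes from the word's
    # probability (relative to the most probable word), half from its
    # position on the list (rank 0 = first position).
    prob_part = probability / max_probability
    rank_part = 1.0 - rank / max(n_words - 1, 1)
    weight = 0.5 * prob_part + 0.5 * rank_part
    return min_pt + weight * (max_pt - min_pt)

# Example: the first, most probable word gets the largest font.
topic = [("sport", 0.12), ("team", 0.09), ("match", 0.05)]
max_p = max(p for _, p in topic)
for rank, (word, p) in enumerate(topic):
    print(word, round(font_size(p, rank, max_p, len(topic)), 1))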
(C) CLARIN-PL