Before using the service, please read the preliminary information containing a description of steps that enable access to the CLARIN-PL developer interface.
This service is used to visualize high-dimensional data by reducing dimensionality to 2D or 3D space.
The service can be run by using an LPMN query in the LPMN Client service:
method: specification of the dimensionality reduction method. Available methods:
  tsne - t-Distributed Stochastic Neighbor Embedding
  umap - Uniform Manifold Approximation and Projection
  mds - Multidimensional Scaling, default option
  nmds - Non-Metric Multidimensional Scaling
Parameters specific to each method are described in a separate section.
normalization: normalization method for vectors before dimensionality reduction. Currently, only one method is available, and it works exclusively with input data in the embedding format:
  l2
  None - default option
dim: target dimensionality. Available options:
  2 - default option
  3
seed: seed used by non-deterministic dimensionality reduction methods. Default: 1234.
metric: type of metric used to determine the distance between points. Available options:
  cosine - default option
  euclidean
export_png: visualization of output data in .png format. Available options:
  False - default option
  True
export_bokeh: visualization of output data exported to the results.html file. Available options:
  False - default option
  True
input_type: input format. Available options:
  embedder
  embedding
  distance
  similarity
  topics
  None - default option
Find more information in the Input Types section.
MDS and nMDS:
  n_iter: maximum number of iterations of the SMACOF algorithm. The method may converge to a solution in fewer steps.
  n_init: number of runs of the SMACOF algorithm with different random seeds. Default: 4.
  eps: SMACOF convergence threshold. Default: 1e-3.
t-SNE:
  perplexity: controls the trade-off between preserving local structure (low values) and global structure (high values). At high values (>1000, depending on the dataset), the result becomes similar to PCA. Recommended values are typically in the range 10-50. Default: 10.
  n_iter: number of gradient descent optimization steps.
  learning_rate
PCA:
No parameters specific to the method.
Note: the PCA method works only for input data in the embedding and topics formats. If the input data is a distance or similarity matrix, use the mds or nmds method.
UMAP:
No parameters specific to the method.
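The general parameters listed earlier are passed as options of the mds3 task. The following is a minimal sketch of assembling such a query in Python, assuming the query can be expressed as a Python-style list like the examples in this section. The names build_query and ALLOWED are hypothetical helpers, not part of the service, and method names are deliberately not validated here.

```python
# Option values copied from the parameter list in this document.
# ALLOWED and build_query are hypothetical helper names.
ALLOWED = {
    "normalization": {"l2", None},
    "dim": {2, 3},
    "metric": {"cosine", "euclidean"},
    "export_png": {True, False},
    "export_bokeh": {True, False},
    "input_type": {"embedder", "embedding", "distance",
                   "similarity", "topics", None},
}


def build_query(**options):
    """Return an LPMN-style task list for the mds3 service."""
    for name, value in options.items():
        # Only options with a documented, closed set of values are checked.
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(f"unsupported value for {name}: {value!r}")
    # Without options, the bare task name requests the default settings.
    if not options:
        return ["unzip", "mds3"]
    return ["unzip", {"mds3": options}]


query = build_query(method="umap", input_type="embedder", export_bokeh=True)
print(query)
```

Rejecting unsupported values before submission is only a convenience; the service itself remains the authority on which options it accepts.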
The service can be run in the Windows system with default values using the following LPMN query:
['unzip','mds3']
['unzip',{'mds3':{'method':'umap','input_type':'embedder','export_bokeh':True}}]
- dimensionality reduction method: UMAP, input data in the form of embeddings obtained from Embedder, visualization of output data exported to the results.html file
Input: a .zip file containing input data in one of the 5 supported formats.
The input directory should contain a file corresponding to the selected input data type (input_type):
weighted.json
distance.json
similarity.json
TOPICS_FILE
EMBEDDER_DIR
Detailed information about each format is provided below.
In the embedding format, each point is described by a fixed-size vector.
Input directory format:
input_dir
|--weighted.json
|--...
weighted.json file format:
{
    "rowlabels": [
        "document1",
        "document2",
        "document3",
        ...
    ],
    "arr": [
        [0.121, 0.421, -0.111, 0.125, ...],  # Embedding vector of document1
        [0.881, 0.511, -0.132, 0.625, ...],  # Embedding vector of document2
        [0.081, 0.211, -0.101, 0.925, ...],  # Embedding vector of document3
        ...
    ],
    ...  # Other entries are allowed but not parsed
}
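A file of this shape can be produced with the Python standard library. The sketch below is illustrative only: write_embedding_zip is a hypothetical helper, and placing weighted.json at the top level of the archive is an assumption that the source does not confirm.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def write_embedding_zip(labels, vectors, zip_path):
    """Write rowlabels/arr as weighted.json inside a new .zip archive."""
    if len(labels) != len(vectors):
        raise ValueError("one vector is required per label")
    if len({len(v) for v in vectors}) > 1:
        raise ValueError("all vectors must have the same size")
    payload = {
        "rowlabels": list(labels),
        "arr": [[float(x) for x in v] for v in vectors],
    }
    # Assumption: weighted.json sits at the top level of the archive.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("weighted.json", json.dumps(payload))
    return payload


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "input.zip"
    write_embedding_zip(["document1", "document2"],
                        [[0.1, 0.2], [0.3, 0.4]], target)
    # Read the file back to confirm the archive content.
    with zipfile.ZipFile(target) as archive:
        restored = json.loads(archive.read("weighted.json"))
print(restored["rowlabels"])
```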
A distance matrix is a square matrix in which the cell at position (m,n) defines the distance between the nth and mth points. This format assumes symmetry, meaning that M(m,n) must be equal to M(n,m).
Input directory format:
input_dir
|--distance.json
|--...
distance.json file format:
{
    "rowlabels": [
        "document1",
        "document2",
        "document3",
        ...
    ],
    "arr": [
        [0.121, 0.421, 0.111, 0.125, ...],
        [0.881, 0.511, 0.132, 0.625, ...],
        [0.081, 0.211, 0.101, 0.925, ...],
        ...
    ],  # A square n x n non-negative, symmetric matrix in which the entry at index (m, n) is the distance between the m-th and n-th documents
    ...  # Other entries are allowed but not parsed
}
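The requirements stated for the matrix (square, non-negative, symmetric) can be checked before the file is uploaded. This is a minimal sketch under the assumption that the matrix is held as a list of lists; check_distance_matrix is a hypothetical helper, and the Euclidean distances are used only to build an example.

```python
import math


def check_distance_matrix(labels, arr, tol=1e-9):
    """Raise ValueError unless arr is square, non-negative and symmetric."""
    n = len(labels)
    if len(arr) != n or any(len(row) != n for row in arr):
        raise ValueError("matrix must be n x n, with n = len(rowlabels)")
    for m in range(n):
        for k in range(n):
            if arr[m][k] < 0:
                raise ValueError(f"negative distance at ({m}, {k})")
            if not math.isclose(arr[m][k], arr[k][m], abs_tol=tol):
                raise ValueError(f"matrix is not symmetric at ({m}, {k})")
    return True


# Example: pairwise Euclidean distances between three 2D points.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
labels = ["document1", "document2", "document3"]
distances = [[math.dist(p, q) for q in points] for p in points]

check_distance_matrix(labels, distances)
payload = {"rowlabels": labels, "arr": distances}
```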
A similarity matrix is very similar to a distance matrix, but the interpretation of the values in the matrix is different. Each value in a similarity matrix describes how similar two documents are, whereas in a distance matrix, it describes how different they are.
Input directory format:
input_dir
|--similarity.json
|--...
similarity.json file format:
{
    "rowlabels": [
        "document1",
        "document2",
        "document3",
        ...
    ],
    "arr": [
        [0.121, 0.421, 0.111, 0.125, ...],
        [0.881, 0.511, 0.132, 0.625, ...],
        [0.081, 0.211, 0.101, 0.925, ...],
        ...
    ],  # A square n x n non-negative, symmetric matrix in which the entry at index (m, n) is the similarity measure of the m-th and n-th documents
    ...  # Other entries are allowed but not parsed
}
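One common way to obtain such a matrix is cosine similarity between document vectors. The sketch below is an illustration, not the method used by the service: cosine_similarity_matrix is a hypothetical helper, and the values stay non-negative only because the example vectors are non-negative.

```python
import math


def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of equally sized vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    if any(norm == 0 for norm in norms):
        raise ValueError("cosine similarity is undefined for zero vectors")
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


# Non-negative vectors (for example term counts) give values in [0, 1].
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = cosine_similarity_matrix(vectors)
payload = {
    "rowlabels": ["document1", "document2", "document3"],
    "arr": similarity,
}
```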
It is also possible to use the topic composition of documents and interpret it as an embedding in a sparse, multidimensional space. In this case, the embedding is defined as follows:
doc_m = [topic_1, topic_2, ..., topic_n]
where topic_n describes how much document m belongs to topic n. For example, assuming that three topics are defined, the following:
[0.91, 0, 0.19]
means that the document consists of 91% of topic 1, 0% of topic 2, and 19% of topic 3.
Input directory format:
input_dir
|--topics.json
|--...
topics.json file format:
{
    "docs": {
        "document1": {
            "topic1": 0.0019,
            "topic3": 0.9981
        },
        "document2": {
            "topic2": 1.0  # Note: if a topic is not specified for a given document, it is implicitly assumed to be equal to 0
        },
        "document3": {
            "topic1": 0.6457,
            "topic2": 0.1986,
            "topic3": 0.1557
        },
        ...
    },
    "topics": {
        "topic1": {
            "water": 0.2152,  # Tokens that define the topics. Topic definitions are (currently) not used by mds; the service only needs to know which topics exist
            "earth": 0.1304,
            "life": 0.0123,
            ...
        },
        "topic2": {
            "fries": 0.2152,
            "burger": 0.1304,
            "hotdog": 0.0123,
            ...
        },
        "topic3": {
            "church": 0.2152,
            "priest": 0.1304,
            "cathedral": 0.0123,
            ...
        },
        ...
    },
    ...  # Other entries are allowed but not parsed
}
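The interpretation of a topic composition as a vector, with unspecified topics treated as 0, can be sketched as follows. This assumes that "docs" is loaded as a mapping from document name to topic weights; topic_vectors is a hypothetical helper and does not reflect the service's internal code.

```python
def topic_vectors(docs, topic_names):
    """Expand sparse topic weights into dense vectors.

    Topics missing from a document are treated as 0, as described above.
    """
    return {
        name: [weights.get(topic, 0.0) for topic in topic_names]
        for name, weights in docs.items()
    }


docs = {
    "document1": {"topic1": 0.0019, "topic3": 0.9981},
    "document2": {"topic2": 1.0},
    "document3": {"topic1": 0.6457, "topic2": 0.1986, "topic3": 0.1557},
}
topic_names = ["topic1", "topic2", "topic3"]

vectors = topic_vectors(docs, topic_names)
print(vectors["document2"])
```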
The embedder format requires a subdirectory named 'catalog' containing the documents in JSON Lines format. Each line of a document should be a JSON object containing the following keys:
  "text" - sentence content
  "embedding" - embeddings assigned to the sentence
Input directory format:
input_dir
|--catalog
| --document1
| --document2
| --...
| --documentX
|--...
document file format:
{"text": "Originally, the new album ...", "embedding": [0.06480148434638977, 0.1550612896680832, -0.008114025928080082, ...]}
{"text": "Today, recording music resembles a ...", "embedding": [-0.025914771482348442, 0.27748793363571167, -0.01087457500398159, ...]}
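A document in this layout has one JSON object per line, so it can be written and read line by line. The sketch below assumes UTF-8 encoding; write_document and read_document are hypothetical helper names.

```python
import json
import tempfile
from pathlib import Path


def write_document(path, sentences):
    """Write (text, embedding) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for text, embedding in sentences:
            record = {"text": text, "embedding": list(embedding)}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_document(path):
    """Read a JSON Lines document back into a list of dicts."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    catalog = Path(tmp) / "catalog"
    catalog.mkdir()
    write_document(catalog / "document1", [
        ("First sentence.", [0.1, 0.2, 0.3]),
        ("Second sentence.", [0.4, 0.5, 0.6]),
    ])
    records = read_document(catalog / "document1")
print(len(records))
```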
Additional, optional input files are described below.
A file that assigns each point to a category. Multiple assignments per point can be defined in one file.
Input directory structure:
input_dir
|--labels.json
|--...
labels.json file structure:
{
    "rowlabels": [
        "document1",  # Document names compatible with the ones defined in the input file
        "document2",
        "document3",
        ...
    ],
    "groups": {
        "languages": [
            "polish",   # Assignment of document1 in category 'languages'
            "polish",   # Assignment of document2 in category 'languages'
            "english",  # Assignment of document3 in category 'languages'
            ...
        ],
        "sentiment": [
            "positive",  # Assignment of document1 in category 'sentiment'
            "positive",  # Assignment of document2 in category 'sentiment'
            "neutral",   # Assignment of document3 in category 'sentiment'
            ...
        ],
        ...  # More categories
    },
    ...  # Other entries are allowed but not parsed
}
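Because assignments are matched to documents by position, each category needs exactly one value per entry in "rowlabels". A minimal consistency check, assuming each category is stored as a list in the same order as "rowlabels" (summarize_labels is a hypothetical helper):

```python
from collections import Counter


def summarize_labels(payload):
    """Check category lengths and count the values in each category."""
    n_points = len(payload["rowlabels"])
    summary = {}
    for category, values in payload["groups"].items():
        if len(values) != n_points:
            raise ValueError(
                f"category {category!r} has {len(values)} values, "
                f"expected {n_points}"
            )
        summary[category] = Counter(values)
    return summary


labels = {
    "rowlabels": ["document1", "document2", "document3"],
    "groups": {
        "languages": ["polish", "polish", "english"],
        "sentiment": ["positive", "positive", "neutral"],
    },
}
summary = summarize_labels(labels)
print(summary["languages"])
```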
The topics.json file (see the topic composition format described above) is parsed even if it is not selected as the input file. It contains information about the topics of each document, which is passed on to the final JSON file.
Output directory structure:
output_dir
|--result.json
|--result.png
|--result_category1.png
|--result_category2.png
|--...
|--result_categoryN.png
The result*.png files are only created if export_png is set to True.
result.json file format
The result.json file consists of six fields:
  x: x coordinates
  y: y coordinates
  z: (optional) z coordinates. Appears only if 'dim' is set to 3.
  labels: names of each point
  categories: point-category assignments. In the example below, the books are divided by both time period (epoka) and genre (gatunek).
  type: point type. Currently, two types are defined: 'data' for normal points, and 'topic' for special points that represent the center of a topic.
{
    "x": [
        -105.99490356445312,
        9.57337760925293,
        17.01878547668457,
        -101.49317932128906,
        12.965667724609375,
        ...
    ],
    "y": [
        -97.20722198486328,
        14.849251747131348,
        -85.53638458251953,
        -93.25206756591797,
        -77.4927978515625,
        -104.14188385009766,
        ...
    ],
    "z": [
        53.90522003173828,
        -110.36262512207031,
        -86.60794830322266,
        46.40607833862305,
        -92.60964965820312,
        32.516143798828125,
        97.01101684570312,
        ...
    ],
    "labels": [
        "książka1",
        "książka2",
        "książka3",
        "książka4",
        "książka5",
        "książka6",
        "książka7",
        ...
    ],
    "categories": {
        "epoka": [
            "renesans",
            "renesans",
            "barok",
            "renesans",
            "odrodzenie",
            "odrodzenie",
            ...
        ],
        "gatunek": [
            "romans",
            "romans",
            "wojny",
            "romans",
            "komedia",
            ...
        ],
        ...
    },
    "type": [
        "data",
        "data",
        "data",
        "data",
        "data",
        "data",
        ...
    ]
}
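Since all fields are parallel lists indexed by point, the output can be regrouped by any category for further processing. The following is a sketch under the assumption that the lists have equal length; group_points is a hypothetical helper and the sample values are made up for illustration.

```python
from collections import defaultdict


def group_points(result, category):
    """Map each category value to a list of (label, coordinates) pairs."""
    axes = [result["x"], result["y"]]
    if "z" in result:  # present only when dim is set to 3
        axes.append(result["z"])
    groups = defaultdict(list)
    for index, value in enumerate(result["categories"][category]):
        coordinates = tuple(axis[index] for axis in axes)
        groups[value].append((result["labels"][index], coordinates))
    return dict(groups)


# Made-up sample in the documented layout (2D, so no "z" field).
result = {
    "x": [-1.0, 0.5, 2.0],
    "y": [3.0, -0.5, 1.5],
    "labels": ["book1", "book2", "book3"],
    "categories": {"period": ["renaissance", "baroque", "renaissance"]},
    "type": ["data", "data", "data"],
}
by_period = group_points(result, "period")
print(sorted(by_period))
```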
(C) CLARIN-PL