Before using the service, please read the preliminary information containing a description of steps that enable access to the CLARIN-PL developer interface.
This service is used to visualize high-dimensional data by reducing dimensionality to 2D or 3D space.
The service can be run by using an LPMN query in the LPMN Client service:
method: the dimensionality reduction method. Available methods:
  tsne - t-Distributed Stochastic Neighbor Embedding
  umap - Uniform Manifold Approximation and Projection
  mds - Multidimensional Scaling, default option
  nmds - Non-Metric Multidimensional Scaling
Parameters specific to each method are described in a separate section.
normalization: normalization applied to the vectors before dimensionality reduction. Available options:
  l2 - currently the only available method; it works exclusively with input data in the embedding format
  None - default option
dim: target dimensionality. Available options:
  2 - default option
  3
seed: seed used by non-deterministic dimensionality reduction methods. Default: 1234.
metric: metric used to compute the distance between points. Available options:
  cosine - default option
  euclidean
export_png: export a visualization of the output data in .png format. Available options:
  False - default option
  True
export_bokeh: export a visualization of the output data to the results.html file. Available options:
  False - default option
  True
input_type: input format. Available options:
  embedder
  embedding
  distance
  similarity
  topics
  None - default option
More information is available in the Input Types section.
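The general parameters above can be checked before a query is sent. The following is a minimal sketch, assuming the query is written as a Python-style list such as ['unzip', {'mds3': {...}}], the form used by the examples in this document. The helper name build_mds3_query and the ALLOWED table are hypothetical and not part of any CLARIN-PL client; method-specific parameters are not covered.

```python
# Allowed values copied from the parameter list above.
# PCA is described elsewhere in this document, but its identifier is not
# given in the method list, so it is left out here.
ALLOWED = {
    "method": {"tsne", "umap", "mds", "nmds"},
    "normalization": {"l2", None},
    "dim": {2, 3},
    "metric": {"cosine", "euclidean"},
    "export_png": {True, False},
    "export_bokeh": {True, False},
    "input_type": {"embedder", "embedding", "distance",
                   "similarity", "topics", None},
}


def build_mds3_query(**options):
    """Return an LPMN query (a Python list) for the mds3 service.

    Hypothetical helper: it only validates the general parameters
    and assembles the list; it does not contact the service.
    """
    for name, value in options.items():
        if name == "seed":
            if not isinstance(value, int):
                raise ValueError("seed must be an integer")
            continue
        if name not in ALLOWED:
            raise ValueError(f"unknown option: {name}")
        if value not in ALLOWED[name]:
            raise ValueError(f"invalid value for {name}: {value!r}")
    # Without options, the service name alone selects the defaults.
    if not options:
        return ["unzip", "mds3"]
    return ["unzip", {"mds3": dict(options)}]
```

Calling build_mds3_query(method="umap", input_type="embedder", export_bokeh=True) yields the same structure as the UMAP example query in this document, while an unsupported value such as metric="manhattan" raises ValueError.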
MDS and nMDS:
n_iter: maximum number of iterations of the SMACOF algorithm. The method may converge to a solution in fewer steps.
n_init: number of runs of the SMACOF algorithm with different random seeds. Default: 4.
eps: SMACOF convergence threshold. Default: 1e-3.
T-SNE:
perplexity: controls the trade-off between preserving local structure (low values) and global structure (high values). At high values (>1000, depending on the dataset), the result becomes similar to PCA. Recommended values are typically in the range 10-50. Default: 10.
n_iter: number of gradient descent optimization steps.
learning_rate
PCA:
No parameters specific to the method.
Note: the PCA method works only for input data in the embedding and topics formats. If the input data is a distance or similarity matrix, use the mds or nmds method.
UMAP:
No parameters specific to the method.
The service can be run in the Windows system with default values using the following LPMN query: ['unzip','mds3']
['unzip',{'mds3':{'method':'umap','input_type':'embedder','export_bokeh':True}}] - dimensionality reduction method: UMAP; input data: embeddings obtained from Embedder; visualization of the output data exported to the results.html file
The input is a .zip file containing input data in one of the 5 supported formats:
The input directory should contain a file corresponding to the selected input data type (input_type):
weighted.json
distance.json
similarity.json
TOPICS_FILE
EMBEDDER_DIR
Detailed information about each format is provided below.
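Packing the input directory into a .zip archive can be sketched with Python's standard zipfile module. The helper name zip_input_dir is hypothetical, and placing the files at the root of the archive (instead of inside a top-level folder) is an assumption, since this document does not state the expected archive layout.

```python
import zipfile
from pathlib import Path


def zip_input_dir(input_dir, zip_path):
    """Pack every file under input_dir into zip_path.

    Paths inside the archive are relative to input_dir (an assumption
    about the layout the service expects). zip_path should live outside
    input_dir so the archive does not try to include itself.
    """
    input_dir = Path(input_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(input_dir.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(input_dir).as_posix())
    return zip_path
```

For example, zip_input_dir("input_dir", "input.zip") would store input_dir/weighted.json as weighted.json inside input.zip.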
In this format, each point is described by a fixed-size vector.
Input directory format:
input_dir
|--weighted.json
|--...
weighted.json file format:
{
"rowlabels": [
"document1",
"document2",
"document3",
...
],
"arr": [
[0.121, 0.421, -0.111, 0.125, ...], # Embedding vector of document1
[0.881, 0.511, -0.132, 0.625, ...], # Embedding vector of document2
[0.081, 0.211, -0.101, 0.925, ...], # Embedding vector of document3
...
],
... # Other entries are allowed but not parsed
}
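A weighted.json payload with the two keys shown above can be produced as follows. This is a minimal sketch: the helper name make_weighted_payload is hypothetical, and the length checks simply reflect the statement that each point is described by a fixed-size vector.

```python
import json


def make_weighted_payload(labels, vectors):
    """Build the dictionary stored in weighted.json.

    labels  - one name per document
    vectors - one embedding per document, all of the same length
    """
    if len(labels) != len(vectors):
        raise ValueError("need exactly one vector per label")
    if len({len(vector) for vector in vectors}) > 1:
        raise ValueError("all vectors must have the same length")
    return {
        "rowlabels": list(labels),
        "arr": [[float(value) for value in vector] for vector in vectors],
    }


payload = make_weighted_payload(
    ["document1", "document2"],
    [[0.121, 0.421, -0.111], [0.881, 0.511, -0.132]],
)
# The text to be saved as input_dir/weighted.json.
weighted_json = json.dumps(payload, indent=2)
```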
A distance matrix is a square matrix in which the cell at position (m,n) defines the distance between the nth and mth points. This format assumes symmetry, meaning that M(m,n) must be equal to M(n,m).
Input directory format:
input_dir
|--distance.json
|--...
distance.json file format:
{
"rowlabels": [
"document1",
"document2",
"document3",
...
],
"arr": [
[0.121, 0.421, 0.111, 0.125, ...],
[0.881, 0.511, 0.132, 0.625, ...],
[0.081, 0.211, 0.101, 0.925, ...],
...
], # A square n x n non-negative, symmetric matrix whose entry at index (m, n) is the distance between the m-th and n-th documents
... # Other entries are allowed but not parsed
}
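When only raw vectors are available, a distance matrix satisfying the symmetry requirement can be computed first. The sketch below uses the Euclidean distance purely as an illustration; the function name is hypothetical and any other distance could be substituted.

```python
import math


def euclidean_distance_matrix(vectors):
    """Return a square, symmetric matrix of pairwise Euclidean distances."""
    size = len(vectors)
    matrix = [[0.0] * size for _ in range(size)]
    for m in range(size):
        for n in range(m + 1, size):
            distance = math.dist(vectors[m], vectors[n])
            # Fill both cells so that M(m, n) == M(n, m).
            matrix[m][n] = distance
            matrix[n][m] = distance
    return matrix


distance_payload = {
    "rowlabels": ["document1", "document2", "document3"],
    "arr": euclidean_distance_matrix([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]),
}
```

Each pair is computed once and written to both cells, so the matrix is symmetric by construction and its diagonal is zero.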
A similarity matrix is very similar to a distance matrix, but the interpretation of the values in the matrix is different. Each value in a similarity matrix describes how similar two documents are, whereas in a distance matrix, it describes how different they are.
Input directory format:
input_dir
|--similarity.json
|--...
similarity.json file format:
{
"rowlabels": [
"document1",
"document2",
"document3",
...
],
"arr": [
[0.121, 0.421, 0.111, 0.125, ...], # Similarities between document1 and each document
[0.881, 0.511, 0.132, 0.625, ...], # Similarities between document2 and each document
[0.081, 0.211, 0.101, 0.925, ...], # Similarities between document3 and each document
...
], # A square n x n non-negative, symmetric matrix whose entry at index (m, n) is the similarity between the m-th and n-th documents
... # Other entries are allowed but not parsed
}
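Both matrix formats share the same constraints: the matrix must be square, symmetric, and non-negative. A hedged sketch of a pre-upload check follows; the function name and the tolerance value are assumptions made for illustration, and the service may apply its own validation.

```python
def find_matrix_problems(payload, tolerance=1e-9):
    """Return a list of human-readable problems; an empty list means OK."""
    labels = payload["rowlabels"]
    matrix = payload["arr"]
    size = len(labels)
    if len(matrix) != size or any(len(row) != size for row in matrix):
        return ["matrix must be square, with one row and column per label"]
    problems = []
    for m in range(size):
        for n in range(size):
            if matrix[m][n] < 0:
                problems.append(f"negative value at ({m}, {n})")
            # Compare each pair once, above the diagonal.
            if n > m and abs(matrix[m][n] - matrix[n][m]) > tolerance:
                problems.append(f"asymmetry between ({m}, {n}) and ({n}, {m})")
    return problems
```

The same check applies to distance.json and similarity.json, since the two files differ only in how the values are interpreted.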
It is also possible to use the topic composition of documents and interpret it as an embedding in a sparse, multidimensional space. In this case, the embedding is defined as follows:
doc_m = [topic_1, topic_2, ..., topic_n]
where topic_n describes how strongly document m belongs to the n-th topic. For example, assuming that three topics are defined, the vector:
[0.91, 0, 0.19]
means that the document has a weight of 0.91 for topic 1, 0 for topic 2, and 0.19 for topic 3.
Input directory format:
input_dir
|--topics.json
|--...
topics.json file format:
{
"docs": {
"document1": {
"topic1": 0.0019,
"topic3": 0.9981
},
"document2": {
"topic2": 1.0 # Note: a topic that is not listed for a document is implicitly assumed to be 0
},
"document3": {
"topic1": 0.6457,
"topic2": 0.1986,
"topic3": 0.1557
},
...
},
"topics": {
"topic1": {
"water": 0.2152, # Tokens that define the topic. Topic definitions are currently not used by the MDS service; it only needs to know which topics exist
"earth": 0.1304,
"life": 0.0123,
...
},
"topic2": {
"fries": 0.2152,
"burger": 0.1304,
"hotdog": 0.0123,
...
},
"topic3": {
"church": 0.2152,
"priest": 0.1304,
"cathedral": 0.0123,
...
},
...
},
... # Other entries are allowed but not parsed
}
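The rule that an unlisted topic counts as 0 can be made explicit by expanding each document into a dense vector. This is a sketch under the assumption that "docs" and "topics" are loaded as mappings from names to objects; the function name is hypothetical and the service's internal conversion may differ.

```python
def topics_to_vectors(data):
    """Expand sparse per-document topic weights into dense vectors.

    Returns (document names, topic names, vectors); topics missing
    from a document get the weight 0.0.
    """
    topic_names = list(data["topics"])
    doc_names = list(data["docs"])
    vectors = [
        [float(data["docs"][doc].get(topic, 0.0)) for topic in topic_names]
        for doc in doc_names
    ]
    return doc_names, topic_names, vectors


example = {
    "docs": {
        "document1": {"topic1": 0.0019, "topic3": 0.9981},
        "document2": {"topic2": 1.0},
    },
    # Only the topic names matter here; token weights are ignored.
    "topics": {"topic1": {}, "topic2": {}, "topic3": {}},
}
doc_names, topic_names, vectors = topics_to_vectors(example)
```

With this example, document1 becomes [0.0019, 0.0, 0.9981] and document2 becomes [0.0, 1.0, 0.0].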
This format requires a subdirectory named 'catalog' containing the documents in JSON Lines format. Each line of a document file should be a JSON object containing the following keys:
"text" - sentence content
"embedding" - embedding assigned to the sentence
Input directory format:
input_dir
|--catalog
| --document1
| --document2
| --...
| --documentX
|--...
document file format:
{"text": "Originally, the new album ...", "embedding": [0.06480148434638977, 0.1550612896680832, -0.008114025928080082, ...]}
{"text": "Today, recording music resembles a ...", "embedding": [-0.025914771482348442, 0.27748793363571167, -0.01087457500398159, ...]}
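Document files in this layout hold one JSON object per line, so they can be written and read with the standard json module. The helper names below are hypothetical; the sketch only checks for the two keys described above.

```python
import json


def to_json_lines(sentences):
    """Serialize (text, embedding) pairs, one JSON object per line."""
    lines = [
        json.dumps({"text": text, "embedding": list(embedding)},
                   ensure_ascii=False)
        for text, embedding in sentences
    ]
    return "\n".join(lines) + "\n"


def parse_json_lines(content):
    """Parse a document file and verify that each line has both keys."""
    records = []
    for number, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if "text" not in record or "embedding" not in record:
            raise ValueError(f"line {number}: missing 'text' or 'embedding'")
        records.append(record)
    return records


document = to_json_lines([
    ("Originally, the new album", [0.0648, 0.155, -0.0081]),
    ("Today, recording music", [-0.0259, 0.2774, -0.0108]),
])
records = parse_json_lines(document)
```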
Additional, optional input files are described below.
A file that assigns each point to a category. Multiple assignments per point can be defined in one file.
Input directory structure:
input_dir
|--labels.json
|--...
labels.json file structure:
{
"rowlabels": [
"document1", # Document names compatible with the ones defined in the input file
"document2",
"document3",
...
],
"groups": {
"languages": [
"polish", # Assignment of document1 in category 'languages'
"polish", # Assignment of document2 in category 'languages'
"english", # Assignment of document3 in category 'languages'
...
],
"sentiment": [
"positive", # Assignment of document1 in category 'sentiment'
"positive", # Assignment of document2 in category 'sentiment'
"neutral", # Assignment of document3 in category 'sentiment'
...
],
... # More categories
},
... # Other entries are allowed but not parsed
}
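Because category assignments are matched to documents by position, every category needs exactly one entry per row label. The sketch below builds a labels.json payload and enforces that; it assumes each category is stored as a list aligned with "rowlabels", and the helper name is hypothetical.

```python
import json


def make_labels_payload(rowlabels, groups):
    """Build the dictionary stored in labels.json.

    groups maps a category name to a list holding one assignment
    per document, in the same order as rowlabels.
    """
    for category, assignments in groups.items():
        if len(assignments) != len(rowlabels):
            raise ValueError(
                f"category {category!r} has {len(assignments)} entries, "
                f"expected {len(rowlabels)}"
            )
    return {
        "rowlabels": list(rowlabels),
        "groups": {name: list(values) for name, values in groups.items()},
    }


labels_payload = make_labels_payload(
    ["document1", "document2", "document3"],
    {
        "languages": ["polish", "polish", "english"],
        "sentiment": ["positive", "positive", "neutral"],
    },
)
labels_json = json.dumps(labels_payload, indent=2)
```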
The topics.json file (See 4. Subject composition embedding) is parsed even if it is not selected as an input file. It contains information about the topics of each document that will be passed to the final JSON file.
Output directory structure:
output_dir
|--result.json
|--result.png
|--result_category1.png
|--result_category2.png
|--...
|--result_categoryN.png
The result*.png files are only created if export_png is set to True.
result.json file format
The result.json file consists of six fields:
x: x coordinates
y: y coordinates
z: (optional) z coordinates. Present only if 'dim' is set to 3.
labels: names of the points
categories: point-category assignments. In the example below, the books are divided by both time period ('epoka') and genre ('gatunek').
type: point type. Currently, two types are defined: 'data' for normal points, and 'topic' for special points that represent the center of a topic.
{
"x": [
-105.99490356445312,
9.57337760925293,
17.01878547668457,
-101.49317932128906,
12.965667724609375,
...
],
"y": [
-97.20722198486328,
14.849251747131348,
-85.53638458251953,
-93.25206756591797,
-77.4927978515625,
-104.14188385009766,
...
],
"z": [
53.90522003173828,
-110.36262512207031,
-86.60794830322266,
46.40607833862305,
-92.60964965820312,
32.516143798828125,
97.01101684570312,
...
],
"labels": [
"książka1",
"książka2",
"książka3",
"książka4",
"książka5",
"książka6",
"książka7",
...
],
"categories": {
"epoka": [
"renesans",
"renesans",
"barok",
"renesans",
"odrodzenie",
"odrodzenie",
...
],
"gatunek": [
"romans",
"romans",
"wojny",
"romans",
"komedia",
...
],
...
},
"type": [
"data",
"data",
"data",
"data",
"data",
"data",
...
]
}
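The output stores each field as a separate parallel list, so a consumer usually regroups the values per point. The following sketch does that for an already loaded result.json dictionary; the function name and the shape of the returned records are illustrative choices, not part of the service.

```python
def result_to_points(result):
    """Turn the parallel lists of result.json into one record per point."""
    axes = [result["x"], result["y"]]
    if "z" in result:  # present only when dim is 3
        axes.append(result["z"])
    categories = result.get("categories", {})
    points = []
    for index, label in enumerate(result["labels"]):
        points.append({
            "label": label,
            "position": tuple(axis[index] for axis in axes),
            "type": result["type"][index],
            "categories": {
                name: values[index] for name, values in categories.items()
            },
        })
    return points


sample = {
    "x": [1.0, 2.0],
    "y": [3.0, 4.0],
    "labels": ["książka1", "książka2"],
    "categories": {"epoka": ["renesans", "barok"]},
    "type": ["data", "data"],
}
points = result_to_points(sample)
```

Filtering on the "type" field then separates ordinary points ('data') from topic centers ('topic').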
(C) CLARIN-PL