t-SNE in Python for visualization of high-dimensional data

Renesh Bedre    8 minute read


What is t-Distributed Stochastic Neighbor Embedding (t-SNE)?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-parametric dimensionality reduction technique that maps high-dimensional data (n features) into a low-dimensional space (typically 2 or 3 features) while preserving the relationships among the data points of the original high-dimensional data.

t-SNE is an unsupervised machine learning algorithm used primarily for visualization. The low-dimensional data generated with t-SNE can be visualized easily using [scatter plots](scatter-plot-matplotlib.html).

t-SNE is a probabilistic model: it models the probability of points being neighbors so that similar samples are placed close together and dissimilar samples at greater distances. Hence, t-SNE is helpful for understanding the structure and distribution characteristics of high-dimensional datasets.

Unlike principal component analysis (PCA), which assumes linear relationships in the data, t-SNE works well with both linear and non-linear well-clustered datasets and produces more meaningful low-dimensional clusters.
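
To see this difference, here is a small illustrative sketch (not part of the original tutorial) that embeds scikit-learn's built-in handwritten digits dataset with both PCA and t-SNE; the t-SNE map typically shows much better-separated digit clusters:

# compare PCA and t-SNE embeddings on the scikit-learn digits dataset
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 features
digits_pca = PCA(n_components=2).fit_transform(X)
digits_tsne = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(digits_pca[:, 0], digits_pca[:, 1], c=y, s=5, cmap='tab10')
ax1.set_title('PCA (linear)')
ax2.scatter(digits_tsne[:, 0], digits_tsne[:, 1], c=y, s=5, cmap='tab10')
ax2.set_title('t-SNE (non-linear)')
plt.savefig('pca_vs_tsne.png', dpi=300)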

Why use t-SNE?

Most big datasets contain hundreds or thousands of variables. With so many variables, it is impractical to visualize the data directly (even with pairwise scatter plots), so we must use dimensionality reduction techniques to understand their structure and relationships.

As an example, in bioinformatics, single-cell RNA-seq (scRNA-seq) produces expression data for thousands of genes and millions of cells. To understand biologically meaningful cluster structures, such high-dimensional datasets must be analyzed and visualized.

Interpreting such high-dimensional non-linear data would be impractical without transforming it into a low-dimensional representation. Using dimensionality reduction techniques such as t-SNE, high-dimensional datasets can be reduced to a two-dimensional space for visualizing and understanding the biologically meaningful clusters they contain.

How to perform t-SNE in Python

In Python, t-SNE analysis can be performed with the TSNE() class from scikit-learn, and the results can be visualized with the bioinfokit package.

Here, I will use a scRNA-seq dataset to visualize the hidden biological clusters. I have downloaded a subset of a scRNA-seq dataset of Arabidopsis thaliana root cells processed by the 10x Genomics Cell Ranger pipeline.

This scRNA-seq dataset contains 4406 cells with ~75K sequence reads per cell. The dataset was pre-processed using the Seurat R package, and only the 2000 highly variable genes (variables or features) were retained for t-SNE visualization.

Now, import the pre-processed scRNA-seq data using the get_data() function from the bioinfokit package. If you have your own dataset, you should import it as a pandas dataframe.
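
For example, a minimal sketch for loading your own file as a drop-in replacement for the block below (the file name my_counts.csv and its layout are placeholder assumptions):

# hypothetical example: load your own expression matrix from a CSV file
import pandas as pd
df = pd.read_csv('my_counts.csv', index_col=0)  # assumed layout: genes as rows, cells as columns
dft = df.T  # transpose so that samples (cells) are rows and features (genes) are columns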

# import scRNA-seq data as a pandas dataframe
from bioinfokit.analys import get_data
df = get_data('ath_root').data
df = df.set_index(df.columns[0])  # set gene names as the index
dft = df.T                        # transpose: cells as rows, genes (features) as columns
dft.head(2)
# output
gene                AT1G01070  RPP1A  HTR12  AT1G01453  ADF10  PLIM2B  SBTI1.1  GL22  GPAT2  AT1G02570  BXL2  IMPA6  ...  PER72  RAB18  AT5G66440  AT5G66580  AT5G66590  AT5G66800  AT5G66815  AT5G66860  AT5G66985  IRX14H  PER73  RPL26B
AAACCTGAGACAGACC-1       0.51   1.40  -0.26      -0.28  -0.24   -0.14    -0.13 -0.07  -0.29      -0.31 -0.23   0.66  ...  -0.25   0.64       0.61      -0.55      -0.41      -0.43       2.01       3.01      -0.24   -0.18  -0.34    1.16
AAACCTGAGATCCGAG-1      -0.22   1.36  -0.26      -0.28  -0.60   -0.51    -0.13 -0.07  -0.29      -0.31  0.81  -0.31  ...  -0.25   1.25      -0.48      -0.55      -0.41      -0.43      -0.24       0.89      -0.24   -0.18  -0.49   -0.68

# check the dimension (rows, columns)
dft.shape
# output
(4406, 2000)

Since the number of variables is very large (2000), we will first use another dimensionality reduction technique, PCA, to reduce the number of variables to a reasonable number (e.g. 20 to 50) for t-SNE.

Note: PCA is a recommended method for reducing the number of input features (when there are many) to a reasonable number (e.g. 20 to 50); this speeds up the t-SNE computation and suppresses noise.

from sklearn.decomposition import PCA
import pandas as pd

# perform PCA
pca_scores = PCA().fit_transform(dft)
# create a dataframe of pca_scores
df_pc = pd.DataFrame(pca_scores)
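
Optionally, check how much of the total variance the first 50 PCs capture before committing to them (a quick sketch; PCA was already imported above):

# sanity check: cumulative variance explained by the first 50 PCs
import numpy as np
pca = PCA().fit(dft)
print(np.cumsum(pca.explained_variance_ratio_)[49])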

Now, perform t-SNE on the first 50 principal components obtained from PCA. By default, the TSNE() function uses the Barnes-Hut approximation, which is computationally less intensive.

# perform t-SNE on the PC scores
# we will use the first 50 PCs, but this can vary
from sklearn.manifold import TSNE
tsne_em = TSNE(n_components = 2, perplexity = 30.0, early_exaggeration = 12, 
               n_iter = 1000, learning_rate = 368, verbose = 1).fit_transform(df_pc.loc[:,0:49])
# output
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 4406 samples in 0.081s...
[t-SNE] Computed neighbors for 4406 samples in 1.451s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4406
[t-SNE] Computed conditional probabilities for sample 2000 / 4406
[t-SNE] Computed conditional probabilities for sample 3000 / 4406
[t-SNE] Computed conditional probabilities for sample 4000 / 4406
[t-SNE] Computed conditional probabilities for sample 4406 / 4406
[t-SNE] Mean sigma: 4.812347
[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.164688
[t-SNE] KL divergence after 1000 iterations: 0.840337
# you can run TSNE multiple times and keep the run with the lowest KL divergence

Note: t-SNE is a stochastic method and produces slightly different embeddings each time it is run. You can therefore run t-SNE several times and keep the embedding with the smallest Kullback–Leibler (KL) divergence, which can be considered the most faithful of the runs.
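
A minimal sketch of this repeat-and-select strategy (the number of runs and the seeds are arbitrary choices); the fitted TSNE object exposes the final KL divergence as kl_divergence_:

# run t-SNE several times and keep the embedding with the lowest KL divergence
best_kl, best_em = float('inf'), None
for seed in range(5):
    tsne = TSNE(n_components = 2, perplexity = 30.0, early_exaggeration = 12,
                n_iter = 1000, learning_rate = 368, random_state = seed)
    em = tsne.fit_transform(df_pc.loc[:, 0:49])
    if tsne.kl_divergence_ < best_kl:
        best_kl, best_em = tsne.kl_divergence_, em
print('lowest KL divergence:', best_kl)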

In t-SNE, several parameters need to be optimized (hyperparameter tuning) to build an effective embedding.

perplexity is the most important parameter in t-SNE; it can be interpreted as the effective number of neighbors considered for each point. The appropriate value depends on the number of observations in the dataset (standard range 10-100). For large datasets, a larger perplexity (n/100, where n is the number of observations) is helpful for preserving the global geometry.

In addition to the perplexity parameter, other parameters such as the number of iterations (n_iter), the learning rate (learning_rate; set to n/12 or 200, whichever is greater), and the early exaggeration factor (early_exaggeration) can also affect the visualization and should be optimized for larger datasets (Kobak et al., 2019).
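
As a starting point for tuning, a simple perplexity sweep can be sketched as below (values chosen only for illustration); note that KL divergence is not directly comparable across different perplexities, so each embedding should also be judged visually:

# generate embeddings for several perplexity values and inspect each one
embeddings = {}
for perp in (5, 30, 50, 100):
    embeddings[perp] = TSNE(n_components = 2, perplexity = perp,
                            random_state = 1).fit_transform(df_pc.loc[:, 0:49])
# plot each embedding and pick the perplexity that gives the clearest,
# most stable cluster structure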

Now, visualize the t-SNE clusters,

# plot t-SNE clusters
from bioinfokit.visuz import cluster
cluster.tsneplot(score=tsne_em)
# plot will be saved in same directory (tsne_2d.png) 

Generated t-SNE plot,

scRNA-seq t-SNE plot without colors
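
If you prefer plotting with matplotlib directly instead of bioinfokit, an equivalent scatter plot takes only a few lines (a minimal sketch):

# plain matplotlib version of the t-SNE scatter plot
import matplotlib.pyplot as plt
plt.scatter(tsne_em[:, 0], tsne_em[:, 1], s = 2)
plt.xlabel('t-SNE 1')
plt.ylabel('t-SNE 2')
plt.savefig('tsne_2d_mpl.png', dpi = 300)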

As t-SNE is an unsupervised learning method, we do not have target (label) information for the samples. Hence, I will identify the clusters using the DBSCAN clustering algorithm. This will help to color and visualize clusters of similar data points.

from sklearn.cluster import DBSCAN
# the eps parameter is very important here, and optimizing it is essential
# for well-defined clusters; I ran DBSCAN with several eps values
# and got good clusters with eps=3
get_clusters = DBSCAN(eps = 3, min_samples = 10).fit_predict(tsne_em)
# check unique clusters
# -1 represents noisy points that could not be assigned to any cluster
set(get_clusters)
# output
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -1}
# get t-SNE plot with colors assigned to each cluster
cluster.tsneplot(score=tsne_em, colorlist=get_clusters, 
    colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680', '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced', '#631a86', '#de541e', '#022b3a', '#000000'), 
    legendpos='upper right', legendanchor=(1.15, 1))

Generated t-SNE plot,

scRNA-seq t-SNE plot with recognized clusters and colors

In the t-SNE scatter plot, points within a cluster are highly similar to each other and distant from points in other clusters. The same pattern likely holds in the original high-dimensional dataset. In the context of scRNA-seq, these clusters represent cell types with similar transcriptional profiles.
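
To quantify this cluster structure, it can be useful to count how many cells DBSCAN assigned to each cluster; a quick sketch:

# count the number of cells per DBSCAN cluster (-1 = noise)
import numpy as np
labels, counts = np.unique(get_clusters, return_counts = True)
for lab, cnt in zip(labels, counts):
    print(f'cluster {lab}: {cnt} cells')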

Advantages of t-SNE

  • No assumption of linearity: t-SNE does not assume any particular relationship among the input features, so it can be applied to both linear and non-linear datasets.
  • Preserves local structure: t-SNE preserves the local structure of high-dimensional datasets, i.e. points that are close together in the high-dimensional space remain close together in the low-dimensional space.
  • Non-parametric method: t-SNE is a non-parametric machine learning method.

Disadvantages of t-SNE

  • t-SNE is slow: t-SNE is computationally intensive and takes a long time on large datasets. Hence, it is recommended to run PCA before t-SNE if the original dataset contains a very large number of input features. For faster run times on large datasets, consider the UMAP dimensionality reduction method (see the sketch after this list).
  • t-SNE is a stochastic method: t-SNE produces slightly different embeddings each time it is run. These differences affect the numeric values on the axes but not the clustering of the points. Therefore, t-SNE can be run several times to get the embedding with the smallest Kullback–Leibler (KL) divergence.
  • t-SNE does not preserve global geometry: while t-SNE is good at visualizing well-separated clusters, it often fails to preserve the global geometry of the data. Hence, t-SNE is not recommended for classification purposes.
  • Hyperparameter optimization: t-SNE has several parameters that must be tuned to obtain a well-fitted embedding.
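
As mentioned in the first point above, UMAP is a commonly used faster alternative; a minimal sketch assuming the umap-learn package is installed:

# UMAP embedding of the same PCA scores (requires: pip install umap-learn)
import umap
umap_em = umap.UMAP(n_components = 2, random_state = 1).fit_transform(df_pc.loc[:, 0:49])
# umap_em can then be plotted and clustered exactly like the t-SNE embedding above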


References

  • van der Maaten L, Hinton G. Visualizing data using t-SNE. Journal of Machine Learning Research. 2008;9(Nov):2579-605.
  • Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019 Nov 28;10(1):1-4.
  • Cieslak MC, Castelfranco AM, Roncalli V, Lenz PH, Hartline DK. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Marine Genomics. 2019 Nov 26:100723.
  • Rich-Griffin C, Stechemesser A, Finch J, Lucas E, Ott S, Schäfer P. Single-cell transcriptomics: a high-resolution avenue for plant functional genomics. Trends in plant science. 2020 Feb 1;25(2):186-97.
  • Devassy BM, George S. Dimensionality reduction and visualisation of hyperspectral ink data Using t-SNE. Forensic Science International. 2020 Feb 12:110194.
  • Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature methods. 2019 Mar;16(3):243-5.
  • Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology. 2018 May;36(5):411-20.
  • Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant physiology. 2019 Apr 1;179(4):1444-56.
  • Kaynak C. Methods of combining multiple classifiers and their applications to handwritten digit recognition. MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University; 1995.


This work is licensed under a Creative Commons Attribution 4.0 International License
