# t-SNE in Python [single cell RNA-seq example and hyperparameter optimization]

## What is t-SNE?

- t-SNE (t-Distributed Stochastic Neighbor Embedding) is nonlinear dimensionality reduction technique in which interrelated high dimensional data (usually hundreds or thousands of variables) is mapped into low-dimensional data (like 2 or 3 variables) while preserving the significant structure (relationship among the data points in different variables) of original high dimensional data.
- The resulting reduced 2 or 3-dimensional data represents the structure of high dimensional data and easy to visualize on the scatter plot. The proximal samples will be placed together and dissimilar samples at greater distances. t-SNE is mostly used for the visualization purposes only and not for detailed quantitative analysis.
- For example, t-SNE is more suitable for single cell RNA-seq (scRNA-seq) as it produces the expression data for various cell classes which encompasses a biologically meaningful hierarchical structure.
- Unlike PCA, t-SNE can be applied and work better with both linear and nonlinear well-clustered datasets and produces more meaningful clustering

## t-SNE vs. PCA

t-SNE | PCA |
---|---|

t-SNE is nonlinear dimensionality reduction method | PCA is linear dimensionality reduction method |

t-SNE fails to preserve the global geometry but produces well-separated clusters. Keeping large perplexity parameter can help to preserve the global geometry of data | PCA preserves the global data structure but fail to preserve the similarities within the clusters |

t-SNE is computationally expensive and can take several hours on large dataset | PCA is much faster than t-SNE for large datasets. It is recommended to run PCA before running t-SNE to reduce the number of original variables. |

t-SNE is a stochastic method and produces slightly different embeddings if run multiple times | It is not necessary to run PCA multiple times |

Several parameters (hyperparameters) such as perplexity and learning rate needs to optimized based on the datasets | Generally no parameter needs to optimized in PCA |

## Perform t-SNE in Python

- To run t-SNE in Python, we will use the
`digits`

dataset which is available in the`scikit-learn`

package. I have also used scRNA-seq data for t-SNE visualization (see below). - The
`digits`

dataset (representing an image of a digit) has 64 variables (D) and 1797 observations (N) divided into 10 different categories of digits - we will use sklearn and bioinfokit (v0.8.5 or later) packages for t-SNE and visualization
- Check bioinfokit documentation for installation and documentation

Note: If you have your own dataset, you should import it as a pandas dataframe. Learn how to import data using pandas

```
from bioinfokit.analys import get_data
df = get_data('digits').data
df.head(2)
# output
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 ... pixel_7_4 pixel_7_5 pixel_7_6 pixel_7_7 class
0 0.0 0.0 5.0 13.0 9.0 ... 10.0 0.0 0.0 0.0 0
1 0.0 0.0 0.0 12.0 13.0 ... 16.0 10.0 0.0 0.0 1
df.shape
(1797, 65)
# run t-SNE
from sklearn.manifold import TSNE
# perplexity parameter can be changed based on the input datatset
# dataset with larger number of variables requires larger perplexity
# set this value between 5 and 50 (sklearn documentation)
# verbose=1 displays run time messages
# set n_iter sufficiently high to resolve the well stabilized cluster
# get embeddings
tsne_em = TSNE(n_components=2, perplexity=30.0, n_iter=1000, verbose=1).fit_transform(df)
# output
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1797 samples in 0.040s...
[t-SNE] Computed neighbors for 1797 samples in 0.447s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1797
[t-SNE] Computed conditional probabilities for sample 1797 / 1797
[t-SNE] Mean sigma: 8.132731
[t-SNE] KL divergence after 250 iterations with early exaggeration: 61.424686
[t-SNE] KL divergence after 1000 iterations: 0.736327
# plot t-SNE clusters
from bioinfokit.visuz import cluster
cluster.tsneplot(score=tsne_em)
# plot will be saved in same directory (tsne_2d.png)
```

Generated t-SNE plot,

Add colors to the cluster,

```
# get a list of categories
color_class = df['class'].to_numpy()
cluster.tsneplot(score=tsne_em, colorlist=color_class, legendpos='upper right', legendanchor=(1.15, 1) )
```

Generated t-SNE plot,

Add customized colors to the cluster,

```
# get a list of categories
color_class = df['class'].to_numpy()
cluster.tsneplot(score=tsne_em, colorlist=color_class, colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680', '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced'),
legendpos='upper right', legendanchor=(1.15, 1) )
```

Generated t-SNE plot,

## t-SNE with single cell RNA-seq (scRNA-seq) dataset

- I have downloaded the subset of single cell gene expression dataset of
*Arabidopsis thaliana*root cells processed by 10x genomics Cell Ranger pipeline (Ryu et al., 2019). - This scRNA-seq dataset contains 4406 cells with ~75K reads per cells
- I have preprocessed this data (for expression cut-off, sequence depth normalization, log-transformation, and molecular feature selection) using Seurat R package and exported highly variable molecular features for t-SNE visualization.

Note: If you have your own dataset, you should import it as a pandas dataframe. Learn how to import data using pandas

```
# import scRNA-seq as pandas dataframe
from bioinfokit.analys import get_data
df = get_data('ath_root').data
df = df.set_index(df.columns[0])
dft = df.T
dft = dft.set_index(dft.columns[0])
dft.head(2)
# output
gene AT1G01070 RPP1A HTR12 AT1G01453 ADF10 PLIM2B SBTI1.1 GL22 GPAT2 AT1G02570 BXL2 IMPA6 ... PER72 RAB18 AT5G66440 AT5G66580 AT5G66590 AT5G66800 AT5G66815 AT5G66860 AT5G66985 IRX14H PER73 RPL26B
AAACCTGAGACAGACC-1 0.51 1.40 -0.26 -0.28 -0.24 -0.14 -0.13 -0.07 -0.29 -0.31 -0.23 0.66 ... -0.25 0.64 0.61 -0.55 -0.41 -0.43 2.01 3.01 -0.24 -0.18 -0.34 1.16
AAACCTGAGATCCGAG-1 -0.22 1.36 -0.26 -0.28 -0.60 -0.51 -0.13 -0.07 -0.29 -0.31 0.81 -0.31 ... -0.25 1.25 -0.48 -0.55 -0.41 -0.43 -0.24 0.89 -0.24 -0.18 -0.49 -0.68
dft.shape
# output
(4406, 2000)
# as we have large number variables, we will first do to PCA to keep minimum number
# of variables for t-SNE
from sklearn.decomposition import PCA
import pandas as pd
pca_scores = PCA().fit_transform(dft)
# create a dataframe of pca_scores
df_pc = pd.DataFrame(pca_scores)
# perform t-SNE on PCs scores
# we will use first 50 PCs but this can vary
from sklearn.manifold import TSNE
tsne_em = TSNE(n_components=2, perplexity=30.0, early_exaggeration=12, n_iter=1000, learning_rate=368, verbose=1).fit_transform(df_pc.loc[:,0:49])
# output
[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 4406 samples in 0.081s...
[t-SNE] Computed neighbors for 4406 samples in 1.451s...
[t-SNE] Computed conditional probabilities for sample 1000 / 4406
[t-SNE] Computed conditional probabilities for sample 2000 / 4406
[t-SNE] Computed conditional probabilities for sample 3000 / 4406
[t-SNE] Computed conditional probabilities for sample 4000 / 4406
[t-SNE] Computed conditional probabilities for sample 4406 / 4406
[t-SNE] Mean sigma: 4.812347
[t-SNE] KL divergence after 250 iterations with early exaggeration: 64.164688
[t-SNE] KL divergence after 1000 iterations: 0.840337
# here you can run TSNE multiple times to keep run with lowest KL divergence
# plot t-SNE clusters
from bioinfokit.visuz import cluster
cluster.tsneplot(score=tsne_em)
# plot will be saved in same directory (tsne_2d.png)
```

Generated t-SNE plot,

Now, I will recognize the clusters using the DBSCAN algorithm. This will help to color and visualize clusters of similar data points

```
from sklearn.cluster import DBSCAN
# here eps parameter is very important and optimizing eps is essential
# for well defined clusters. I have run DBSCAN with several eps values
# and got good clusters with eps=3
get_clusters = DBSCAN(eps=3, min_samples=10).fit_predict(tsne_em)
# check unique clusters
# -1 value represents noisy points could not assigned to any cluster
set(get_clusters)
# output
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, -1}
# get t-SNE plot with colors assigned to each cluster
cluster.tsneplot(score=tsne_em, colorlist=get_clusters,
colordot=('#713e5a', '#63a375', '#edc79b', '#d57a66', '#ca6680', '#395B50', '#92AFD7', '#b0413e', '#4381c1', '#736ced', '#631a86', '#de541e', '#022b3a', '#000000'),
legendpos='upper right', legendanchor=(1.15, 1))
```

Generated t-SNE plot,

## Interpretation

- The points within the individual clusters are highly similar to each other and in distant to points in other clusters. The same pattern likely holds in a high-dimensional original dataset. In the digits dataset, t-SNE separated clusters of each digit class. In the context of scRNA-seq, these clusters represent the cells types with similar transcriptional profiles.

## Recommendations for running t-SNE and hyperparameter optimization

- t-SNE is a stochastic method and produces slightly different embeddings if run multiple times. These different results
could affect the numeric values on the axis but do not affect the clustering of the points. Therefore, t-SNE can be
run several times to get the embeddings with the smallest Kullback–Leibler (KL) divergence.
__The run with the smallest KL could have the greatest variation__. - If the original high-dimensional dataset contains larger number variables , it is highly recommended first to reduce the variables to small numbers (e.g. 20 to 50) using another dimensionality reduction technique (e.g. PCA) for t-SNE. It will help to speed up the t-SNE computation time and suppresses the noisy data points. Additionally, you can also use other variants of t-SNE such as Fast Fourier Transform-accelerated Interpolation-based t-SNE (FIt-SNE) for larger datasets (Linderman et al., 2019).
- In t-SNE, a most important parameter called perplexity, which measures the effective number of neighbors, controls the trade-off between global high-dimensional and local low-dimensional space, and possible to produce a more defined structure of the clusters. The number of variables in the original high dimensional data determines the perplexity parameter (standard range 10-100).
- While t-SNE is good in visualizing the well-separated clusters, most of the time it fails to preserve the global geometry of the data. Kobak et al., 2019 suggested keeping large perplexity parameter (n/100; where n is the number of cells) for preserving the global geometry.
- In addition to the perplexity parameter, other parameters such as the number of iterations, learning rate (set n/12 or 200 whichever is greater), and early exaggeration factor can also affect the visualization and should be optimized for larger datasets (Kobak et al., 2019).

## References

- Maaten LV, Hinton G. Visualizing data using t-SNE. Journal of machine learning research. 2008;9(Nov):2579-605.
- Kobak D, Berens P. The art of using t-SNE for single-cell transcriptomics. Nature communications. 2019 Nov 28;10(1):1-4.
- Cieslak MC, Castelfranco AM, Roncalli V, Lenz PH, Hartline DK. t-Distributed Stochastic Neighbor Embedding (t-SNE): A tool for eco-physiological transcriptomic analysis. Marine Genomics. 2019 Nov 26:100723.
- Rich-Griffin C, Stechemesser A, Finch J, Lucas E, Ott S, Schäfer P. Single-cell transcriptomics: a high-resolution avenue for plant functional genomics. Trends in plant science. 2020 Feb 1;25(2):186-97.
- Devassy BM, George S. Dimensionality reduction and visualisation of hyperspectral ink data Using t-SNE. Forensic Science International. 2020 Feb 12:110194.
- Linderman GC, Rachh M, Hoskins JG, Steinerberger S, Kluger Y. Fast interpolation-based t-SNE for improved visualization of single-cell RNA-seq data. Nature methods. 2019 Mar;16(3):243-5.
- Butler A, Hoffman P, Smibert P, Papalexi E, Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nature biotechnology. 2018 May;36(5):411-20.
- Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant physiology. 2019 Apr 1;179(4):1444-56.
- C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their Applications to Handwritten Digit Recognition, MSc Thesis, Institute of Graduate Studies in Science and Engineering, Bogazici University.

This work is licensed under a Creative Commons Attribution 4.0 International License