# DBSCAN in Python (with example dataset)

## What is DBSCAN?

Density Based Spatial Clustering of Applications with Noise (abbreviated as DBSCAN) is a **density-based unsupervised
clustering** algorithm. In DBSCAN, clusters are formed from dense regions and separated by regions of no or low densities.

DBSCAN computes nearest neighbor graphs and creates **arbitrary-shaped clusters** in datasets (which
may contain noise or outliers) as opposed to k-means clustering,
which typically generates spherical-shaped clusters

Unlike k-means clustering, DBSCAN does not require specifying the number of clusters initially. However, DBSCAN requires
two parameters viz. the radius of neighborhoods for a given data point *p* (`eps`

or `ε`

) and the minimum number
of data points in a given `ε`

-neighborhood to form clusters (`minPts`

).

DBSCAN is also useful for clustering non-linear datasets.

## DBSCAN common terminologies

Here we will discuss the core point, border point, noise point (Outlier), and density reachable terminologies used in DBSCAN.

The randomly selected data point *p* is called a **core point** if there are more than a minimum number of points (`minPts`

)
within a `ε`

-neighborhood of *p*.

A data point is called a **border point** if it is within a `ε`

-neighborhood of *p* and it has fewer than the minimum number
of points (`minPts`

) within its `ε`

-neighborhood.

If a point that is not a core point or border point is called a **Noise point (Outlier)**.

A point x is **directly density reachable** from point *p* if a point *p* is a core point and x is in *p*’s `ε`

-neighborhood.
A point y is **density reachable** from point *p* if a point y is **directly density reachable** to core point x, which
is also **density reachable** to core point *p*.

All points within a `ε`

-neighborhood of a core point belongs to same clusters and all points in a cluster are directly density
reachable to core point.

The points which are not density reachable from any core points are called noise (outliers) points.

## Steps involved in DBSCAN clustering algorithm

- Choose any point
*p*randomly - Identify all
**density reachable**points from*p*with`ε`

and`minPts`

parameter - If
*p*is a core point, create a cluster (with`ε`

and`minPts`

) - If
*p*is a border point, visit the next point in a dataset - Continue the algorithm until all points are visited

## Perform DBSCAN clustering in Python

To perform DBSCAN clustering in Python, you will require to install `sklearn`

, `pandas`

, and `matplotlib`

Python
packages. Check for how to install Python packages

### Get dataset

For clustering using DBSCAN, I am using a single-cell gene expression dataset of *Arabidopsis thaliana* root cells
processed by a 10x genomics Cell Ranger pipeline. The dataset is preprocessed t-SNE dimensionality reduction technique.
Now, I will use the t-SNE embedding vectors to identify the clusters using DBSCAN.

```
import pandas as pd
df = pd.read_csv("https://reneshbedre.github.io/assets/posts/tsne/tsne_scores.csv")
df.head(2)
t-SNE-1 t-SNE-2
0 10.846841 -16.712580
1 24.794334 -16.775398
# check the shape of dataset
df.shape
(4406, 2)
```

This dataset has 4406 rows and two features. This is unlabelled dataset (no cluster information). I will identify the cluster information on this dataset using DBSCAN.

### Compute required parameters for DBSCAN clustering

DBSCAN requires `ε`

and `minPts`

parameters for clustering. The `minPts`

parameter is easy to set. The `minPts`

should be 4
for two-dimensional dataset. For multidimensional dataset, `minPts`

should be `2 * number of dimensions`

. For example, if
your dataset has 6 features, set `minPts = 12`

.

Another question is what optimal value should be used for the `ε`

parameter. The `ε`

parameter is difficult to set and depends
on the distance function. Sometimes, domain expertise is also required to set a `ε`

parameter. The `ε`

should be as small
as possible.

To determine the optimal `ε`

parameter, I will compute the k-nearest neighbor (kNN) distances (average distance of every data
point to its k-nearest neighbors) of an input dataset using the k-nearest neighbor method (unsupervised nearest neighbors
learning). For finding the k-nearest neighbor, I will use the `sklearn.neighbors.NearestNeighbors`

function.

`NearestNeighbors`

function requires `n_neighbors`

(number of neighbors) parameter, which can be same as the `minPts`

value.

```
import numpy as np
from sklearn.neighbors import NearestNeighbors
# n_neighbors = 5 as kneighbors function returns distance of point to itself (i.e. first column will be zeros)
nbrs = NearestNeighbors(n_neighbors=5).fit(df)
# Find the k-neighbors of a point
neigh_dist, neigh_ind = nbrs.kneighbors(df)
# sort the neighbor distances (lengths to points) in ascending order
# axis = 0 represents sort along first axis i.e. sort along row
sort_neigh_dist = np.sort(neigh_dist, axis=0)
```

Now, get the sorted k^{th} column (distances with k^{th} neighbors) and plot the kNN distance plot,

```
import matplotlib.pyplot as plt
k_dist = sort_neigh_dist[:, 4]
plt.plot(k_dist)
plt.axhline(y=2.5, linewidth=1, linestyle='dashed', color='k')
plt.ylabel("k-NN distance")
plt.xlabel("Sorted observations (4th NN)")
plt.show()
```

In the k-NN distance plot, you should look for the “knee” of the curve to find the optimal value of `ε`

. In the below
plot, the knee occurs at approximately 2.5 i.e. the points below 2.5 belong to a cluster and points above 2.5 are noise
or outliers (noise points will have higher kNN distance).

### Compute DBSCAN clustering

Now, we have calculated `ε`

and `minPts`

parameters for DBSCAN clustering. We will pass these parameters to the DBSCAN
to predict the clusters using the `sklearn.cluster.DBSCAN`

class.

```
from sklearn.cluster import DBSCAN
clusters = DBSCAN(eps=2.5, min_samples=4).fit(df)
# get cluster labels
clusters.labels_
# output
array([0, 0, 1, ..., 1, 1, 1], dtype=int64)
# check unique clusters
set(clusters.labels_)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, -1}
# -1 value represents noisy points could not assigned to any cluster
```

Choosing appropriate DBSCAN parameters can be harder sometimes. If you select

`ε`

based on your domain expertise, you should play with different values of`minPts`

to get better insights from clustering.

Get each cluster size,

```
from collections import Counter
Counter(clusters.labels_)
# output
Counter({1: 1501, 0: 834, 2: 703, 3: 285, 7: 283, 5: 245, 6: 232, 4: 145, 12: 46, -1: 34, 11: 13, 8: 11, 9: 11, 14: 10, 15: 8, 17: 8, 13: 6, 10: 5, 18: 5, 19: 5, 22: 4, 16: 4, 20: 4, 21: 4})
```

DBSCAN clustering identified 22 clusters with given parameters. The largest cluster (1) has 1501 data points and the smallest cluster (22) has 4 data points.

### Visualization of DBSCAN clustering

Visualize the cluster as a scatter plot and color the clusters using predicted class labels,

```
import seaborn as sns
import matplotlib.pyplot as plt
p = sns.scatterplot(data=df, x="t-SNE-1", y="t-SNE-2", hue=clusters.labels_, legend="full", palette="deep")
sns.move_legend(p, "upper right", bbox_to_anchor=(1.17, 1.2), title='Clusters')
plt.show()
```

Different colors represent different predicted clusters. Blue represents noisy points (-1 cluster).

### DBSCAN limitations

- DBSCAN is computationally expensive (less scalable) and more complicated clustering method as compared to simple
*k*-means clustering - DBSCAN is sensitive to input parameters, and it is hard to set accurate input parameters
- DBSCAN depends on a single value of
`ε`

for all clusters, and therefore, clusters with variable densities may not be correctly identified by DBSCAN - DBSCAN is a time-consuming algorithm for clustering

## Enhance your skills with courses on Machine Learning and Python

- Machine Learning Specialization
- Machine Learning with Python
- Machine Learning for Data Analysis
- Cluster Analysis in Data Mining
- Python for Everybody Specialization
- Python for Genomic Data Science

## Related reading

- Performing and visualizing the Principal component analysis (PCA) from PCA function and scratch in Python
- Support Vector Machine (SVM) basics and implementation in Python
- Logistic regression in Python (feature selection, model fitting, and prediction)
*k*-means clustering in Python [with example]

## References

- Chen Y, Ruys W, Biros G. KNN-DBSCAN: a DBSCAN in high dimensions. arXiv preprint arXiv:2009.04552. 2020 Sep 9.
- Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant physiology. 2019 Apr 1;179(4):1444-56.
- Dudik JM, Kurosu A, Coyle JL, Sejdić E. A comparative analysis of DBSCAN, K-means, and quadratic variation algorithms for automatic identification of swallows from swallowing accelerometry signals. Computers in biology and medicine. 2015 Apr 1;59:10-8.
- Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd 1996 Aug 2 (Vol. 96, No. 34, pp. 226-231).
- Xie X, Duan L, Qiu T, Li J. Quantum algorithm for MMNG-based DBSCAN. Scientific Reports. 2021 Jul 30;11(1):1-8.
- Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS). 2017 Jul 31;42(3):1-21.

If you have any questions, comments, corrections, or recommendations, please email me at
**reneshbe@gmail.com**

This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.