DBSCAN in Python (with example dataset)
What is DBSCAN?
Density Based Spatial Clustering of Applications with Noise (abbreviated as DBSCAN) is a density-based unsupervised clustering algorithm. In DBSCAN, clusters are formed from dense regions and separated by regions of no or low densities.
DBSCAN computes nearest neighbor graphs and creates arbitrary-shaped clusters in datasets (which may contain noise or outliers) as opposed to k-means clustering, which typically generates spherical-shaped clusters
Unlike k-means clustering, DBSCAN does not require specifying the number of clusters initially. However, DBSCAN requires
two parameters viz. the radius of neighborhoods for a given data point p (eps
or ε
) and the minimum number
of data points in a given ε
-neighborhood to form clusters (minPts
).
DBSCAN is also useful for clustering non-linear datasets.
DBSCAN common terminologies
Here we will discuss the core point, border point, noise point (Outlier), and density reachable terminologies used in DBSCAN.
The randomly selected data point p is called a core point if there are more than a minimum number of points (minPts
)
within a ε
-neighborhood of p.
A data point is called a border point if it is within a ε
-neighborhood of p and it has fewer than the minimum number
of points (minPts
) within its ε
-neighborhood.
If a point that is not a core point or border point is called a Noise point (Outlier).
A point x is directly density reachable from point p if a point p is a core point and x is in p’s ε
-neighborhood.
A point y is density reachable from point p if a point y is directly density reachable to core point x, which
is also density reachable to core point p.
All points within a ε
-neighborhood of a core point belongs to same clusters and all points in a cluster are directly density
reachable to core point.
The points which are not density reachable from any core points are called noise (outliers) points.
Steps involved in DBSCAN clustering algorithm
- Choose any point p randomly
- Identify all density reachable points from p with
ε
andminPts
parameter - If p is a core point, create a cluster (with
ε
andminPts
) - If p is a border point, visit the next point in a dataset
- Continue the algorithm until all points are visited
Perform DBSCAN clustering in Python
To perform DBSCAN clustering in Python, you will require to install sklearn
, pandas
, and matplotlib
Python
packages. Check for how to install Python packages
Get dataset
For clustering using DBSCAN, I am using a single-cell gene expression dataset of Arabidopsis thaliana root cells processed by a 10x genomics Cell Ranger pipeline. The dataset is preprocessed t-SNE dimensionality reduction technique. Now, I will use the t-SNE embedding vectors to identify the clusters using DBSCAN.
import pandas as pd
df = pd.read_csv("https://reneshbedre.github.io/assets/posts/tsne/tsne_scores.csv")
df.head(2)
t-SNE-1 t-SNE-2
0 10.846841 -16.712580
1 24.794334 -16.775398
# check the shape of dataset
df.shape
(4406, 2)
This dataset has 4406 rows and two features. This is unlabelled dataset (no cluster information). I will identify the cluster information on this dataset using DBSCAN.
Compute required parameters for DBSCAN clustering
DBSCAN requires ε
and minPts
parameters for clustering. The minPts
parameter is easy to set. The minPts
should be 4
for two-dimensional dataset. For multidimensional dataset, minPts
should be 2 * number of dimensions
. For example, if
your dataset has 6 features, set minPts = 12
.
Another question is what optimal value should be used for the ε
parameter. The ε
parameter is difficult to set and depends
on the distance function. Sometimes, domain expertise is also required to set a ε
parameter. The ε
should be as small
as possible.
To determine the optimal ε
parameter, I will compute the k-nearest neighbor (kNN) distances (average distance of every data
point to its k-nearest neighbors) of an input dataset using the k-nearest neighbor method (unsupervised nearest neighbors
learning). For finding the k-nearest neighbor, I will use the sklearn.neighbors.NearestNeighbors
function.
NearestNeighbors
function requires n_neighbors
(number of neighbors) parameter, which can be same as the minPts
value.
import numpy as np
from sklearn.neighbors import NearestNeighbors
# n_neighbors = 5 as kneighbors function returns distance of point to itself (i.e. first column will be zeros)
nbrs = NearestNeighbors(n_neighbors=5).fit(df)
# Find the k-neighbors of a point
neigh_dist, neigh_ind = nbrs.kneighbors(df)
# sort the neighbor distances (lengths to points) in ascending order
# axis = 0 represents sort along first axis i.e. sort along row
sort_neigh_dist = np.sort(neigh_dist, axis=0)
Now, get the sorted kth column (distances with kth neighbors) and plot the kNN distance plot,
import matplotlib.pyplot as plt
k_dist = sort_neigh_dist[:, 4]
plt.plot(k_dist)
plt.axhline(y=2.5, linewidth=1, linestyle='dashed', color='k')
plt.ylabel("k-NN distance")
plt.xlabel("Sorted observations (4th NN)")
plt.show()
In the k-NN distance plot, you should look for the “knee” of the curve to find the optimal value of ε
. In the below
plot, the knee occurs at approximately 2.5 i.e. the points below 2.5 belong to a cluster and points above 2.5 are noise
or outliers (noise points will have higher kNN distance).
Compute DBSCAN clustering
Now, we have calculated ε
and minPts
parameters for DBSCAN clustering. We will pass these parameters to the DBSCAN
to predict the clusters using the sklearn.cluster.DBSCAN
class.
from sklearn.cluster import DBSCAN
clusters = DBSCAN(eps=2.5, min_samples=4).fit(df)
# get cluster labels
clusters.labels_
# output
array([0, 0, 1, ..., 1, 1, 1], dtype=int64)
# check unique clusters
set(clusters.labels_)
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, -1}
# -1 value represents noisy points could not assigned to any cluster
Choosing appropriate DBSCAN parameters can be harder sometimes. If you select
ε
based on your domain expertise, you should play with different values ofminPts
to get better insights from clustering.
Get each cluster size,
from collections import Counter
Counter(clusters.labels_)
# output
Counter({1: 1501, 0: 834, 2: 703, 3: 285, 7: 283, 5: 245, 6: 232, 4: 145, 12: 46, -1: 34, 11: 13, 8: 11, 9: 11, 14: 10, 15: 8, 17: 8, 13: 6, 10: 5, 18: 5, 19: 5, 22: 4, 16: 4, 20: 4, 21: 4})
DBSCAN clustering identified 22 clusters with given parameters. The largest cluster (1) has 1501 data points and the smallest cluster (22) has 4 data points.
Visualization of DBSCAN clustering
Visualize the cluster as a scatter plot and color the clusters using predicted class labels,
import seaborn as sns
import matplotlib.pyplot as plt
p = sns.scatterplot(data=df, x="t-SNE-1", y="t-SNE-2", hue=clusters.labels_, legend="full", palette="deep")
sns.move_legend(p, "upper right", bbox_to_anchor=(1.17, 1.2), title='Clusters')
plt.show()
Different colors represent different predicted clusters. Blue represents noisy points (-1 cluster).
DBSCAN limitations
- DBSCAN is computationally expensive (less scalable) and more complicated clustering method as compared to simple k-means clustering
- DBSCAN is sensitive to input parameters, and it is hard to set accurate input parameters
- DBSCAN depends on a single value of
ε
for all clusters, and therefore, clusters with variable densities may not be correctly identified by DBSCAN - DBSCAN is a time-consuming algorithm for clustering
Enhance your skills with courses on Machine Learning and Python
- Machine Learning Specialization
- Machine Learning with Python
- Machine Learning for Data Analysis
- Cluster Analysis in Data Mining
- Python for Everybody Specialization
- Python for Genomic Data Science
Related reading
- Performing and visualizing the Principal component analysis (PCA) from PCA function and scratch in Python
- Support Vector Machine (SVM) basics and implementation in Python
- Logistic regression in Python (feature selection, model fitting, and prediction)
- k-means clustering in Python [with example]
References
- Chen Y, Ruys W, Biros G. KNN-DBSCAN: a DBSCAN in high dimensions. arXiv preprint arXiv:2009.04552. 2020 Sep 9.
- Ryu KH, Huang L, Kang HM, Schiefelbein J. Single-cell RNA sequencing resolves molecular relationships among individual plant cells. Plant physiology. 2019 Apr 1;179(4):1444-56.
- Dudik JM, Kurosu A, Coyle JL, Sejdić E. A comparative analysis of DBSCAN, K-means, and quadratic variation algorithms for automatic identification of swallows from swallowing accelerometry signals. Computers in biology and medicine. 2015 Apr 1;59:10-8.
- Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. Inkdd 1996 Aug 2 (Vol. 96, No. 34, pp. 226-231).
- Xie X, Duan L, Qiu T, Li J. Quantum algorithm for MMNG-based DBSCAN. Scientific Reports. 2021 Jul 30;11(1):1-8.
- Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS). 2017 Jul 31;42(3):1-21.
If you have any questions, comments, corrections, or recommendations, please email me at reneshbe@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License
Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.