Samtools is a suite of utilities for working with aligned sequence data in the SAM (Sequence Alignment/Map) and BAM (Binary Alignment/Map) formats, widely used in bioinformatics and genomics analysis.

The `samtools view` command with the `-f` or `-F` parameter and a flag value is typically used to filter mapped and unmapped sequence reads from SAM/BAM files.

The flag value is a numerical value that encodes various properties of each read alignment. For example, the flag value of 4 (0x4) indicates that the sequence read does not have a valid alignment to the reference genome (unmapped sequence reads).
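
As a quick sanity check, you can decode flag values with simple bitwise tests. The following Python sketch only illustrates how the flag bits work (the constant and function names are our own, not part of samtools):

```python
# SAM flag bit from the SAM specification
FLAG_UNMAPPED = 0x4  # read is unmapped

def is_unmapped(flag):
    """Return True if the alignment flag has the unmapped bit (0x4) set."""
    return bool(flag & FLAG_UNMAPPED)

print(is_unmapped(4))   # True: flag 4 is an unmapped read
print(is_unmapped(99))  # False: bit 0x4 is not set in flag 99
```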

The following examples demonstrate how to filter mapped and unmapped sequence reads from the BAM file using samtools.

You can use the following command to filter the unmapped sequence reads from a BAM file using Samtools.

```
samtools view -b -f 4 input.bam > unmapped.bam
```

Here, the `-b` parameter specifies that the output should be in BAM format, and `-f 4` keeps only reads with the unmapped flag set (so `unmapped.bam` retains only unmapped sequence reads).

The above command creates a new BAM file, `unmapped.bam`, containing only the unmapped reads from the input BAM file.

If you want to create an output file in SAM format, you can use the following command.

```
samtools view -f 4 input.bam > unmapped.sam
```

The above command creates a new SAM file, `unmapped.sam`, containing only the unmapped reads from the input BAM file.

These commands work for single-end reads. When filtering mapped and unmapped reads in paired-end data, it is also important to consider whether the reads are properly paired (flag bit 0x2; for example, `samtools view -b -f 2` keeps only properly paired reads).

You can use the following command to filter the mapped sequence reads from a BAM file using Samtools.

```
samtools view -b -F 4 input.bam > mapped.bam
```

Here, the `-b` parameter specifies that the output should be in BAM format, and `-F 4` excludes reads with the unmapped flag set (so `mapped.bam` retains only mapped sequence reads).

The above command creates a new BAM file, `mapped.bam`, containing only the mapped reads from the input BAM file.

If you want to create an output file in SAM format, you can use the following command.

```
samtools view -F 4 input.bam > mapped.sam
```

The above command creates a new SAM file, `mapped.sam`, containing only the mapped reads from the input BAM file.

These commands work for single-end reads. As noted above, when filtering paired-end data it is also important to consider whether the reads are properly paired.


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.

In a genome assembly analysis, we typically generate millions of short sequence reads for a given sample using next-generation sequencing (NGS) technology.

A large number of the short sequence reads generated by NGS overlap. A genome sequence can be constructed from the overlapping reads by *de novo* assembly or by alignment to a reference genome.

In some cases, the sequence reads do not overlap and therefore don’t contribute to contigs. A sequence read of this
type is often referred to as a **singleton**.

A singleton could represent a unique region of the genome, or it could be the result of a sequencing error.

Singletons need to be analyzed thoroughly based on the goal of the study as they may represent rare variants or novel sequences.


A histogram is useful for visualizing the frequency distribution of data as a bar graph. The height of each bar represents the frequency count of observations falling into that interval (bin).

In this article, you will learn how to create a histogram using the `numpy.histogram()` function from the Python NumPy package.

The general syntax of `numpy.histogram()` looks like this:

```
# import package
import numpy as np
# compute the histogram of the data with 10 bins
np.histogram(data, bins=10)
```

Where,

| Parameter | Description |
|---|---|
| `data` | Input data in array format |
| `bins` | Number of equal-width bins (a single int, default 10), or a sequence of bin edges |

The following examples explain how to use the `numpy.histogram()` function to generate a histogram in Python.

The following example shows how to generate a histogram using `numpy.histogram()`. By default, the function divides the data into 10 equal-width bins (intervals).

```
# import package
import numpy as np
# generate random data
data = np.random.uniform(low=1, high=100, size=100)
# create numpy histogram with 10 intervals
hist, bins = np.histogram(data)
print(hist, bins)
# output (values will vary because the data is random)
[ 7 9 7 10 8 9 15 15 10 10]
[ 3.01807371 12.68467772 22.35128173 32.01788575 41.68448976 51.35109377
61.01769778 70.6843018 80.35090581 90.01750982 99.68411384]
```

In the above output, the first array contains the frequency counts for each bin (equally spaced intervals) and the second array contains the bin edges (one more edge than there are bins).
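
Because every observation falls into exactly one bin, the frequency counts always sum to the sample size; this makes a handy sanity check (a minimal sketch, assuming NumPy is installed):

```python
import numpy as np

data = np.random.uniform(low=1, high=100, size=100)
hist, bins = np.histogram(data)  # 10 equal-width bins by default

print(hist.sum())  # 100: each observation lands in exactly one bin
print(len(bins))   # 11: there is one more bin edge than bins
```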

Now, plot the histogram,

```
# import package
import matplotlib.pyplot as plt
# draw histogram
plt.hist(data, bins=bins)
plt.xlabel("Bins (Intervals)")
plt.ylabel("Frequency counts")
plt.show()
```

The following example shows how to generate a histogram with specific bins (intervals) by passing a sequence of bin edges to the `numpy.histogram()` function.

```
# import package
import numpy as np
# generate random data
data = np.random.uniform(low=1, high=100, size=100)
# create numpy histogram with 10 specific intervals
hist, bins = np.histogram(data, bins=[1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
print(hist, bins)
# output (counts will vary because the data is random)
[ 7 9 8 18 6 13 9 8 10 12]
[ 1 10 20 30 40 50 60 70 80 90 100]
```

In the above output, the first array contains the frequency counts for each specified bin and the second array contains the bin edges as supplied. Note that these bins need not be equal width: the first bin (1 to 10) is narrower than the rest.
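
Also note that with explicit edges, any values falling outside the outermost edges are simply not counted. A small sketch (with made-up data) to illustrate:

```python
import numpy as np

# 0.5 and 150 fall outside the bin edges given below
data = [0.5, 5, 15, 95, 150]
hist, edges = np.histogram(data, bins=[1, 10, 50, 100])

print(hist)  # [1 1 1]: only 5, 15, and 95 are counted
```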

Now, plot the histogram,

```
# import package
import matplotlib.pyplot as plt
# draw histogram
plt.hist(data, bins=bins)
plt.xlabel("Bins (Intervals)")
plt.ylabel("Frequency counts")
plt.show()
```


The **antilogarithm (antilog)** is the inverse operation of the logarithm (log). The antilog is used to recover the original number from its log value.

For example, the antilog of a base-10 logarithm (log10) can be found by raising the base (10) to the power of the log value. If log10(x) = z, then the antilog of z is 10^z.

The following table illustrates how to find the antilog for various log bases:

| Base | Log | Antilog |
|---|---|---|
| 10 | log10(5) = 0.6989 | 10^0.6989 = 5 |
| 2 | log2(5) = 2.3219 | 2^2.3219 = 5 |
| e | ln(5) = 1.6094 | e^1.6094 = 5 |

In Python, you can use `10**x`, `2**x`, or `np.exp(x)` to calculate the antilog, depending on the base you want to use.

Suppose you have a log10 value as follows:

```
# import package
import numpy as np
# calculate log10 value
log_val = np.log10(5)
# see log value
log_val
0.6989700043360189
```

Now, calculate the antilog of the log10 value (0.6989) to get the original value of 5.

```
# raise base value 10 to the power of the log10 value
10**log_val
# output
5.0
```

By taking the antilog of the log10 value, we obtained the original value of 5.

Suppose you have a log2 value as follows:

```
# import package
import numpy as np
# calculate log2 value
log_val = np.log2(5)
# see log value
log_val
2.321928094887362
```

Now, calculate the antilog of the log2 value (2.3219) to get the original value of 5.

```
# raise base value 2 to the power of the log2 value
2**log_val
# output
5.0
```

By taking the antilog of the log2 value, we obtained the original value of 5.

Suppose you have a value of natural log as follows:

```
# import package
import numpy as np
# calculate natural log value
log_val = np.log(5)
# see log value
log_val
1.6094379124341003
```

Now, calculate the antilog of the natural log value (1.6094) to get the original value of 5.

You can use the `np.exp()` function to calculate the antilog of a natural log value.

```
# import package
import numpy as np
# calculate antilog
np.exp(log_val)
# output
5.0
```

By taking the antilog of the natural log value, we obtained the original value of 5.
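
The three cases above can be condensed into a single round-trip check: raising each base to the power of its own logarithm recovers the original value (a minimal sketch using NumPy):

```python
import numpy as np

x = 5
# base ** log_base(x) recovers x for every base
antilogs = [10 ** np.log10(x), 2 ** np.log2(x), np.exp(np.log(x))]
print(antilogs)  # all three are (numerically) 5.0
```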

**Related**: How to Calculate Antilog of Values in R


The **antilogarithm (antilog)** is the inverse operation of the logarithm (log). The antilog is used to recover the original number from its log value.

For example, the antilog of a base-10 logarithm (log10) can be found by raising the base (10) to the power of the log value. If log10(x) = z, then the antilog of z is 10^z.

The following table illustrates how to find the antilog for various log bases:

| Base | Log | Antilog |
|---|---|---|
| 10 | log10(8) = 0.90309 | 10^0.90309 = 8 |
| 2 | log2(8) = 3 | 2^3 = 8 |
| e | ln(8) = 2.079442 | e^2.079442 = 8 |

In R, you can use `10^x`, `2^x`, or `exp(x)` to calculate the antilog, depending on the base you want to use.

Suppose you have a log10 value as follows:

```
# calculate log10 value
log_val = log10(8)
# see log value
log_val
0.90309
```

Now, calculate the antilog of the log10 value (0.90309) to get the original value of 8.

```
# raise base value 10 to the power of the log10 value
10^log_val
# output
8
```

By taking the antilog of the log10 value, we obtained the original value of 8.

Suppose you have a log2 value as follows:

```
# calculate log2 value
log_val = log2(8)
# see log value
log_val
3
```

Now, calculate the antilog of the log2 value (3) to get the original value of 8.

```
# raise base value 2 to the power of the log2 value
2^log_val
# output
8
```

By taking the antilog of the log2 value, we obtained the original value of 8.

Suppose you have a natural log value as follows:

```
# calculate natural log value
log_val = log(8)
# see log value
log_val
2.079442
```

Now, calculate the antilog of the natural log value (2.079442) to get the original value of 8.

You can use the `exp()` function to calculate the antilog of a natural log value.

```
# calculate antilog
exp(log_val)
# output
8
```

By taking the antilog of the natural log value, we obtained the original value of 8.

**Related**: How to Calculate Antilog of Values in Python


Quartiles are values that divide a dataset into four equal parts, each containing 25% of the data. Quartiles are helpful for understanding the spread and distribution of a dataset.

In general, three quartiles (Q1, Q2, and Q3) are used. Q1 (first quartile), Q2 (second quartile), and Q3 (third quartile) are the values below which 25%, 50%, and 75% of the data fall, respectively.

In R, quartiles can be calculated using the built-in `quantile()` function.

The general syntax of `quantile()` looks like this:

```
# calculate quartiles
quantile(x)
```

Where `x` is a numeric vector of the dataset.

The following examples explain how to use the `quantile()` function in R to calculate quartiles from a vector and from a data frame.

Suppose you have the following dataset for which you would like to calculate the quartiles:

```
x = c(48, 64, 43, 62, 56, 52, 80, 63, 68, 82)
```

Calculate the quartiles using the `quantile()` function:

```
quantile(x)
# output
0% 25% 50% 75% 100%
43.0 53.0 62.5 67.0 82.0
```

From the output, you can see that Q1, Q2, and Q3 quartile values are 53, 62.5, and 67, respectively.

Suppose you have a dataset in data frame format.

```
# create a data frame
df <- data.frame(col1 = c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'),
col2 = c(48, 64, 43, 62, 56, 52, 80, 63, 68, 82))
# view first few rows
head(df, 2)
col1 col2
1 A 48
2 B 64
# calculate quartiles
quantile(df$col2)
# output
0% 25% 50% 75% 100%
43.0 53.0 62.5 67.0 82.0
```

From the output, you can see that Q1, Q2, and Q3 quartile values are 53, 62.5, and 67, respectively.

You can also visualize the quartiles using the boxplot. The boxplot helps to visualize the spread and distribution of the data.

Create a boxplot,

```
x = c(48, 64, 43, 62, 56, 52, 80, 63, 68, 82)
# boxplot
boxplot(x)
```

Using the boxplot, we can locate the quartiles.

The minimum value, or Q0 (43), is indicated by the bottom whisker; Q1 (53) by the lower edge of the box; the median, or Q2 (62.5), by the dark middle line; Q3 (67) by the upper edge of the box; and the maximum value, or Q4 (82), by the top whisker.

**Related**: Calculate quartiles in Python


A heatmap is a statistical visualization method for displaying complex datasets in matrix form, allowing you to quickly gain insights from large datasets.

Heatmaps are widely used in bioinformatics for analyzing and visualizing large gene expression datasets obtained from different samples and conditions.

This tutorial explains how to use the `Heatmap()` function from the `ComplexHeatmap` R *Bioconductor* package for visualizing complex heatmaps.

You can install the `ComplexHeatmap` R package (from *Bioconductor*) as below:

```
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ComplexHeatmap")
```

We will use a sample RNA-seq gene expression dataset for creating heatmaps using the `ComplexHeatmap` package.

```
# load dataset
df = read.csv("https://reneshbedre.github.io/assets/posts/heatmap/hm_data.csv", row.names="Gene")
# convert to matrix
df_mat = data.matrix(df)
# view first few rows of data
head(df, 5)
# output
A B C D E F
B-CHI1 4.505700 3.260360 -1.249400 8.89807 8.05955 -0.842803
CTL2-1 3.508560 1.660790 -1.856680 -2.57336 -1.37370 1.196000
B-CHI2 2.160030 3.146520 0.982809 9.02430 6.05832 -2.967420
CTL2-2 1.884750 2.295690 0.408891 -3.91404 -2.28049 1.628820
CHIV1 0.255193 -0.761204 -1.022350 3.65059 2.46525 -1.188140
```

Create and visualize a single heatmap with the default settings,

```
# load package
library(ComplexHeatmap)
# visualize heatmap
Heatmap(df_mat)
```

You can change the color of the heatmap using the `col` argument:

```
# create color scale
# install.packages("circlize")
library(circlize)
col_fun = colorRamp2(seq(min(df_mat), max(df_mat), length = 3),
c("green", "black", "red"))
# visualize heatmap
Heatmap(df_mat, col = col_fun)
```

You can change the individual cell borders of the heatmap using the `rect_gp` argument:

```
Heatmap(df_mat, rect_gp = gpar(col = "white", lwd = 2))
```

You can add row and column titles to the heatmap using the `column_title` and `row_title` arguments:

```
Heatmap(df_mat, column_title = "Conditions", row_title = "Genes",
column_title_side = "bottom")
```

The row and column clustering is plotted by default in ComplexHeatmap.

You can turn off row clustering using the `cluster_rows` argument, or hide the column dendrogram using the `show_column_dend` argument:

```
# turn off row clustering
Heatmap(df_mat, cluster_rows = FALSE)
# hide the column dendrogram (clustering is still performed)
Heatmap(df_mat, show_column_dend = FALSE)
```

You can also color the individual row clusters,

```
# install.packages("dendextend")
library(dendextend)
row_dend = as.dendrogram(hclust(dist(df_mat)))
# color row clustering
Heatmap(df_mat, cluster_rows = color_branches(row_dend, k = 5))
```

You can also split the heatmap by rows and columns to better understand the clustering of the data. The `row_km` and `column_km` arguments use k-means clustering to split the rows and columns.

```
# split row clusters
Heatmap(df_mat, name = "scale", row_km = 5)
# split column clusters
Heatmap(df_mat, name = "scale", column_km = 2)
```

Split by both rows and columns simultaneously,

```
# split row and column clusters at same time
Heatmap(df_mat, name = "scale", row_km = 5, column_km = 2)
```

You can change the legend position in the ComplexHeatmap as below,

```
draw(Heatmap(df_mat), heatmap_legend_side = "left")
```

Similarly, you can use the `bottom` and `top` positions to adjust the legend position.

**Related**: pheatmap: create annotated heatmaps in R


Survival analysis (also known as time-to-event analysis) is a statistical method for analyzing the duration of time until the event of interest occurs (e.g. death of patients).

The Kaplan-Meier survival method is a non-parametric statistical technique that estimates the survival probability of an event occurring at various points in survival time.

In the Kaplan-Meier survival curve, survival probability is plotted against survival time. The survival curve is useful for understanding the median survival time (the time at which survival probability is 50%).


The Kaplan-Meier curve is primarily used for descriptive analysis of survival data. Kaplan-Meier survival analysis is applied when the predictor variable is categorical (e.g. a binary treatment variable); it does not account for additional predictors. A regression-based Cox proportional hazards (CPH) model should be used if you want to study the impact of other, e.g. continuous, variables on survival.

This tutorial explains how to perform Kaplan–Meier survival analysis in R.

We will use the patient survival data for performing the Kaplan–Meier survival analysis.

Load the dataset,

```
# load package
# install.packages("tidyverse")
library(tidyverse)
# load data file
df <- read_csv("https://reneshbedre.github.io/assets/posts/survival/survival_data.csv")
# view first few rows
head(df, 5)
# A tibble: 5 × 5
patient survival_time_days outcome treatment age_years
<dbl> <dbl> <dbl> <chr> <dbl>
1 1 1 1 drug_2 75
2 2 1 1 drug_2 79
3 3 4 1 drug_2 85
4 4 5 1 drug_2 76
5 5 6 0 drug_2 66
```

This dataset contains 15 patients with their survival times (in days), outcome (1 = death, 0 = survived), treatment (drug_1 or drug_2), and age (in years).

In R, Kaplan–Meier survival analysis can be performed using the `Surv()` and `survfit()` functions from the `survival` package.

For Kaplan–Meier analysis, you need three key variables: survival time, status at survival time (event of interest), and the treatment group of each patient.

First, you need to create a survival object using the `Surv()` function. In a survival object, the event parameter must be binary, e.g. TRUE/FALSE (TRUE = death), 1/0 (1 = death), or 2/1 (2 = death).

```
# load package
library("survival")
surv = Surv(time = df$survival_time_days, event = df$outcome)
print(surv)
# output
[1] 1 1 4 5 6+ 8 9+ 9 12 15+ 22 25+ 37 55 72+
```

In the above output, the + sign indicates that the survival time was censored, i.e. the patient was still alive at the end of the study, dropped out of the study, or was lost to follow-up.

Note: If there are a large number of censored patients in the study, the survival curve may not be reliable. The results should be interpreted cautiously.

Now, we will compute the survival probability for both drug treatments using the `survfit()` function.

```
fit <- survfit(formula = surv ~ treatment, data = df)
summary(fit)
# output
Call: survfit(formula = surv ~ treatment, data = df)
treatment=drug_1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
8 7 1 0.857 0.132 0.6334 1
12 6 1 0.714 0.171 0.4471 1
37 3 1 0.476 0.225 0.1884 1
55 2 1 0.238 0.203 0.0449 1
treatment=drug_2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 8 2 0.750 0.153 0.503 1.000
4 6 1 0.625 0.171 0.365 1.000
5 5 1 0.500 0.177 0.250 1.000
9 3 1 0.333 0.180 0.116 0.961
22 1 1 0.000 NaN NA NA
```
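
The survival column above can be reproduced by hand with the Kaplan–Meier product-limit formula. The following is a minimal sketch (in Python, for illustration only; the helper function is our own, not part of the R workflow), using the drug_2 arm as implied by the `survfit()` output:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator.

    times  -- event/censoring times
    events -- 1 = event (death), 0 = censored
    Returns (event_times, survival_probabilities).
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    surv = 1.0
    out_times, out_surv = [], []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = removed = 0
        # group all observations tied at time t
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= (at_risk - deaths) / at_risk
            out_times.append(t)
            out_surv.append(surv)
        at_risk -= removed
    return out_times, out_surv

# drug_2 arm: times with 0 marking censored observations
times = [1, 1, 4, 5, 6, 9, 15, 22]
events = [1, 1, 1, 1, 0, 1, 0, 1]
print(kaplan_meier(times, events))
# event times [1, 4, 5, 9, 22] with survival 0.75, 0.625, 0.5, 0.333..., 0.0
```

Note how the censored observations (at days 6 and 15) reduce the number at risk without changing the survival estimate at those times, exactly as in the `survfit()` output.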

Visualize the Kaplan–Meier survival curves for both treatments (drug_1 and drug_2). We will use the `ggsurvplot()` function from the `survminer` package.

```
# load package
# install.packages("survminer")
library("survminer")
# plot Kaplan–Meier survival curve
ggsurvplot(fit = fit, pval = TRUE, surv.median.line = "hv",
xlab = "Survival time (Days)", ylab = "Survival probability")
# with confidence interval
ggsurvplot(fit = fit, pval = TRUE, surv.median.line = "hv", conf.int =TRUE,
xlab = "Survival time (Days)", ylab = "Survival probability")
```

The patient survival rate is higher for drug_1 treatment than for drug_2 treatment. Similarly, the median survival time (time at which survival probability is 50%) is higher for patients taking drug_1 treatment (37 days) than drug_2 treatment (7 days).

**Related**: Survival analysis


In statistics, we often come across the terms quartiles, quantiles, and percentiles, and they are easy to confuse.

Quartiles, quantiles, and percentiles are used to describe the distribution of data, and they are particularly useful for understanding the spread, relative position, and central tendency of data.

The differences between quartiles, quantiles, and percentiles can be explained as follows:

**Quartile**

- Quartiles divide a dataset into four equal parts, each containing 25% of the data.
- Three quartiles (Q1, Q2, and Q3) are commonly used, which divide a dataset into four equal parts.
- Q2 (the second quartile) is also known as the median.
- Q0 and Q4 represent the minimum and maximum values in the dataset.
- Q1 is the value below which 25% of the data falls.

**Quantiles**

- Quantiles divide the dataset into any number of equal parts.
- Quartiles and percentiles are parts of quantiles.
- For example, quartiles, quintiles, deciles, and percentiles split the data into 4, 5, 10, and 100 equal parts, respectively.
- Quantiles are typically expressed as decimal values and range from 0 to 1.
- The 0.25 quantile is the value below which 25% of the data falls.

**Percentiles**

- Percentiles divide the data into 100 equal parts.
- Percentiles are typically expressed as whole numbers and range from 0 to 100.
- The 25th percentile is equivalent to the 0.25 quantile and first quartile (Q1). Similarly, the 50th percentile is equivalent to the 0.5 quantile and second quartile (Q2).
- The 25th percentile is the value below which 25% of the data falls.

Suppose you have the following dataset,

```
x <- c(37, 87, 17, 32, 65, 58, 52, 84, 41, 37)
```

You can calculate the dataset's quartiles, quantiles, and percentiles using the built-in `quantile()` function.

Calculate the second quartile (Q2), i.e. the median. Q2 corresponds to the 0.5 quantile.

```
quantile(x, 0.5)
# output
50%
46.5
```

The Q2 is 46.5. It means that 50% of the values in the dataset are below 46.5.

Calculate the 95th percentile. The quantile of the 95th percentile is 0.95.

```
quantile(x, 0.95)
# output
95%
85.65
```

The 95th percentile is 85.65. It means that 95% of the values in the dataset are below 85.65.

Quartiles should be used to understand the spread and distribution of data. For example, use quartiles if your goal is to identify the median student score, the minimum and maximum scores, and how many students are in the top 25%.

Quantiles should be used to understand the spread and distribution of data beyond the quartiles. For example, use quantiles if your goal is to create five groups (quintiles) of students based on their scores; in this case, each group will contain 20% of the students.

Percentiles should be used for comparing position or ranking within a population. For example, percentiles are often used for scoring national exams: if a student scores at the 95th percentile, their score is higher than that of 95% of the students who took the exam.

Overall, quartiles and quantiles are useful for understanding the data's distribution, while percentiles are useful for understanding the relative positions of data points.

**Related**: Calculate Quartiles in R


Quantiles and percentiles are statistics terms that are often confused.

In data analysis, quantiles and percentiles are used for describing the distribution of data, as well as determining spread, relative position, and central tendency.

The key differences between quantiles and percentiles are:

**Quantiles**

- Quantiles divide the dataset into any number of equal parts.
- Quartiles and percentiles are parts of quantiles.
- For example, quartiles and percentiles split the data into 4 and 100 equal parts, respectively.
- Quantiles are typically expressed as decimal values and range from 0 to 1 (e.g., 0.25, 0.5).
- The 0.25 quantile is the value below which 25% of the data falls.

In Python, quantiles can be calculated using the `quantile()` function from the NumPy package.

The following example shows how to calculate the quantiles in Python.

```
# import package
import numpy as np
# dataset
x = [15, 10, 15, 25, 25, 30, 35, 45, 45, 50, 55, 65]
# calculate 0.5 quantile
np.quantile(x, [0.5])
# output
array([32.5])
```

The value of 0.5 quantile is 32.5. This indicates that 50% of the data falls below the value of 32.5.

**Percentiles**

- Percentiles divide the data into 100 equal parts.
- Percentiles are typically expressed as whole numbers and range from 0 to 100.
- The 25th percentile is equivalent to the 0.25 quantile. Similarly, the 75th percentile is equivalent to the 0.75 quantile.
- The 50th percentile is the value below which 50% of the data falls.

In Python, percentiles can be calculated using the `percentile()` function from the NumPy package.

The following example shows how to calculate the percentiles in Python.

```
# import package
import numpy as np
# dataset
x = [15, 10, 15, 25, 25, 30, 35, 45, 45, 50, 55, 65]
# calculate 95th percentile
np.percentile(x, [95])
# output
array([59.5])
```

The value of the 95th percentile is 59.5. It means that 95% of the values in the dataset are below 59.5.

It's important to note that `np.quantile()` and `np.percentile()` return the same value for a given quantile or percentile. The difference is that `np.quantile()` requires values between 0 and 1 as its second argument, while `np.percentile()` requires values between 0 and 100.
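
This equivalence is easy to verify with the dataset used above (a quick sketch):

```python
import numpy as np

x = [15, 10, 15, 25, 25, 30, 35, 45, 45, 50, 55, 65]

# the 0.95 quantile and the 95th percentile are the same value
q = np.quantile(x, 0.95)
p = np.percentile(x, 95)
print(q == p)  # True
```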

**Related**: Difference Between Quantile, Quartile, and Percentile
