Create a gene counts matrix from featureCounts

Renesh Bedre

  • The featureCounts program summarizes read counts for genomic features (e.g., exons) and meta-features (e.g., genes) from genome-mapped RNA-seq or genomic DNA-seq reads (SAM/BAM files).
  • featureCounts uses genomic annotations in GTF or SAF format to count genomic features and meta-features.

For differential gene expression analysis, it is convenient to have the counts for all samples in a single file (a gene count matrix). You can get this gene count matrix file directly by running featureCounts on all mapped files at once.

# meta-feature (gene) level count
featureCounts -t 'exon' -g 'gene_id' -a annotation.gtf -T 10 -o counts.txt library1.bam library2.bam library3.bam
# use -f option for feature (exon) level count

However, when you run featureCounts on each sample individually (for example, when the samples are large), the counts for each sample are written to a separate text file.
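As a rough illustration of the per-sample case, the sketch below builds one featureCounts command per BAM file, reusing the same flags as the command above. The helper names (`build_featurecounts_cmd`, `count_each_sample`) and the choice to name each output after its BAM file are assumptions made for this example, not part of featureCounts or bioinfokit.

```python
import subprocess
from pathlib import Path


def build_featurecounts_cmd(bam, annotation="annotation.gtf", threads=10):
    """Build a gene-level featureCounts command for a single BAM file.

    Hypothetical helper: the output name (<bam stem>.txt in the current
    folder) is an assumption chosen for this sketch.
    """
    out = Path(bam).with_suffix(".txt").name
    return [
        "featureCounts",
        "-t", "exon",
        "-g", "gene_id",
        "-a", annotation,
        "-T", str(threads),
        "-o", out,
        bam,
    ]


def count_each_sample(bams, run=False):
    """Return one command per sample; execute them only when run=True."""
    cmds = [build_featurecounts_cmd(bam) for bam in bams]
    if run:
        for cmd in cmds:
            # Requires featureCounts to be installed and on the PATH
            subprocess.run(cmd, check=True)
    return cmds


cmds = count_each_sample(["library1.bam", "library2.bam", "library3.bam"])
for cmd in cmds:
    print(" ".join(cmd))
```

With `run=False` the commands are only printed, which makes it easy to check them before launching a long job. Each sample then ends up with its own count file (library1.txt, library2.txt, and so on).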

To get the merged gene count matrix from all individual count files, we will use bioinfokit (v2.0.5).

# run this Python code (in a Python interpreter) from a folder where all files are present
from bioinfokit.analys import HtsAna
# make sure all individual count files are present in same folder
# by default, it assumes each count file has .txt extension 
HtsAna.merge_featureCount()

See the detailed usage of HtsAna.merge_featureCount here
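To make the merge step less of a black box, here is a minimal pandas sketch of the same idea. It is not the bioinfokit implementation. It assumes the standard gene-level featureCounts output: a leading `#` comment line, then tab-separated columns Geneid, Chr, Start, End, Strand, Length, followed by the count column. The function names are hypothetical.

```python
import pandas as pd

# Annotation columns that featureCounts writes between Geneid and the counts
ANNOTATION_COLS = ["Chr", "Start", "End", "Strand", "Length"]


def read_counts(path):
    """Read one featureCounts output file, keeping Geneid and the count column(s)."""
    # The first line of the file is a '#' comment recording the command used
    df = pd.read_csv(path, sep="\t", comment="#")
    return df.drop(columns=ANNOTATION_COLS).set_index("Geneid")


def merge_counts(paths):
    """Join per-sample count tables on Geneid into one gene-by-sample matrix."""
    tables = [read_counts(p) for p in paths]
    merged = pd.concat(tables, axis=1, join="outer")
    # A gene missing from one file is treated as having zero counts (an assumption)
    return merged.fillna(0).astype(int)
```

For example, `merge_counts(["library1.txt", "library2.txt"]).to_csv("merged_counts.csv")` would write a matrix with one row per gene and one column per sample; the file names here are placeholders.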

Once it runs successfully, you can see the output file gene_matrix_count.csv in the same folder, which has counts merged for all samples.

# gene_matrix_count.csv
Geneid,sample1.bam,sample2.bam,sample3.bam
PGSC0003DMG400015133,0,7,2
PGSC0003DMG400015132,72,95,155
PGSC0003DMG400022764,42,78,77
PGSC0003DMG400022799,2,3,5
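
Before passing the matrix to a differential expression tool, a quick sanity check can be useful. The sketch below loads the four example rows shown above with pandas and sums the counts per sample. The inline text stands in for the real file, so with an actual run you would pass "gene_matrix_count.csv" to `pd.read_csv` instead.

```python
import io

import pandas as pd

# The example rows from above; replace with the path to gene_matrix_count.csv
example_csv = io.StringIO(
    "Geneid,sample1.bam,sample2.bam,sample3.bam\n"
    "PGSC0003DMG400015133,0,7,2\n"
    "PGSC0003DMG400015132,72,95,155\n"
    "PGSC0003DMG400022764,42,78,77\n"
    "PGSC0003DMG400022799,2,3,5\n"
)

# Use the gene IDs as the row index so every remaining column is a sample
counts = pd.read_csv(example_csv, index_col="Geneid")

# Total counted reads per sample
library_sizes = counts.sum(axis=0)

print(counts.shape)
print(library_sizes)
```

A sample whose total is far lower than the others, or a matrix with fewer columns than expected, usually points to a missing or misnamed count file.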

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License