Import small and large data using pandas (CSV, Excel, Tab, JSON, SQL, and Parquet files)

Renesh Bedre    5 minute read

  • Importing datasets is a key step in data analysis and visualization tasks. The source data can be saved in different file formats such as CSV (comma-separated values), tab-separated text, Excel, SQL, or JSON files.
  • pandas is a powerful Python data analysis library that makes reading and manipulating data straightforward. It supports data import, reshaping, slicing, indexing, merging, and joining, and is also useful for handling missing data.

Import data from CSV files

  • A CSV is a text file where values are separated by a comma. It can be opened by any text editor or MS Excel.
  • Download test CSV file
# read test csv file
import pandas as pd
# if the file is not in the current directory, include the path to it
df = pd.read_csv('test_data.csv')
# see the content of the file
df
# output
    A   B   C   D
0  25  45  30  54
1  30  55  29  60
2  28  29  33  51
3  36  56  37  62
4  29  40  27  73

# skip lines starting with a certain character, e.g. #
df = pd.read_csv('test_data.csv', comment='#')
# skip first few lines e.g. skip first two rows
df = pd.read_csv('test_data.csv', skiprows=2)
# import csv file with no header
df = pd.read_csv('test_data.csv', header=None)
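read_csv can also restrict the import to specific columns (usecols) and set column dtypes up front (dtype). A minimal sketch, using an in-memory buffer with hypothetical data in place of test_data.csv:

```python
import io
import pandas as pd

# small in-memory CSV standing in for test_data.csv (hypothetical values)
csv_text = "A,B,C,D\n25,45,30,54\n30,55,29,60\n"

# read only columns A and C, and force integer dtypes while parsing
df = pd.read_csv(io.StringIO(csv_text), usecols=['A', 'C'],
                 dtype={'A': 'int64', 'C': 'int64'})
print(df.columns.tolist())  # ['A', 'C']
```

Reading only the needed columns with usecols also reduces memory usage for wide files.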

Import data from tab-separated text files

import pandas as pd
# if the file is not in the current directory, include the path to it
df = pd.read_csv('test_data.txt', sep='\t')
# see the content of the file
df
# output
    A   B   C   D
0  25  45  30  54
1  30  55  29  60
2  28  29  33  51
3  36  56  37  62
4  29  40  27  73

# skip lines starting with a certain character, e.g. #
df = pd.read_csv('test_data.txt', sep='\t', comment='#')
# skip first few lines e.g. skip first two rows
df = pd.read_csv('test_data.txt', sep='\t', skiprows=2)

Import data from MS Excel files

# requires pandas v1.2.0 or later
import pandas as pd
# if the file is not in the current directory, include the path to it
# by default, it will read data from the first sheet
df = pd.read_excel('test_data.xlsx')
# see the content of the file
df
# output
    A   B   C   D
0  25  45  30  54
1  30  55  29  60
2  28  29  33  51
3  36  56  37  62
4  29  40  27  73

# read data from the second sheet (sheet_name accepts a zero-based index or a sheet name)
df = pd.read_excel('test_data.xlsx', sheet_name=1)
# output
df
    A   B   C
0  25  45  30
1  30  55  29
2  28  29  33
3  36  56  37
4  29  40  27

# skip first few lines e.g. skip first two rows
df = pd.read_excel('test_data.xlsx', skiprows=2)
# skip a range of rows e.g. skip rows at positions 2 to 4
df = pd.read_excel('test_data.xlsx', skiprows=range(2, 5))
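If the workbook holds several sheets, passing sheet_name=None loads all of them at once into a dict keyed by sheet name. A minimal sketch with an in-memory workbook (requires the openpyxl engine; the sheet names and data here are made up):

```python
import io
import pandas as pd

# build a two-sheet workbook in an in-memory buffer (hypothetical data)
buf = io.BytesIO()
with pd.ExcelWriter(buf) as writer:
    pd.DataFrame({'A': [1, 2]}).to_excel(writer, sheet_name='Sheet1', index=False)
    pd.DataFrame({'B': [3, 4]}).to_excel(writer, sheet_name='Sheet2', index=False)
buf.seek(0)

# sheet_name=None returns a dict of DataFrames, one per sheet
sheets = pd.read_excel(buf, sheet_name=None)
print(list(sheets))  # ['Sheet1', 'Sheet2']
```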

Import data from SQL database

  • You can import a table directly from an SQL database. Here, I import a table from a PostgreSQL database.
# requires pandas v1.2.0 or later
# additionally, you need to install sqlalchemy and psycopg2 (PostgreSQL database adapter)
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://postgres:pwd@localhost/testdb')
df = pd.read_sql_table('test_table', engine.connect())

# engine configuration
# postgresql: database dialect
# postgres: user name
# pwd: replace with your own postgres password
# localhost: host
# testdb: database name

df
# output
    A   B   C   D
0  25  45  30  54
1  30  55  29  60
2  28  29  33  51
3  36  56  37  62
4  29  40  27  73
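read_sql_table requires SQLAlchemy, but pandas can also run an arbitrary SQL query over any DB-API connection with read_sql_query. A minimal sketch using Python's built-in sqlite3 module and a hypothetical table:

```python
import sqlite3
import pandas as pd

# in-memory SQLite database with a small hypothetical table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE test_table (A INTEGER, B INTEGER)')
conn.executemany('INSERT INTO test_table VALUES (?, ?)', [(25, 45), (30, 55)])

# read_sql_query accepts any DB-API 2.0 connection
df = pd.read_sql_query('SELECT * FROM test_table', conn)
print(df.shape)  # (2, 2)
```

The same read_sql_query call works with a SQLAlchemy engine for PostgreSQL, MySQL, and other databases.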

Import data from JSON file

  • JSON (JavaScript Object Notation) is a lightweight, text-based key-value format (.json extension) mainly used for data transmission between web applications and servers
  • Download test json file
# requires pandas v1.2.0 or later
import pandas as pd
df = pd.read_json('test_data.json')
# see content of file
df
# output
    A   B   C   D
0  25  45  30  54
1  30  55  29  60
2  28  29  33  51
3  36  56  37  62
4  29  40  27  73
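read_json also accepts JSON text via a file-like buffer, which is handy for API responses. A minimal sketch with a small column-oriented JSON string (hypothetical values):

```python
import io
import pandas as pd

# column-oriented JSON text standing in for test_data.json (hypothetical values)
json_text = '{"A": {"0": 25, "1": 30}, "B": {"0": 45, "1": 55}}'
df = pd.read_json(io.StringIO(json_text))
print(df['A'].tolist())  # [25, 30]
```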

Import large data from CSV or Text files using chunksize

  • If you have a big table (for example, one with millions of rows), you can use the chunksize parameter of the read_csv function
  • chunksize reads the large table in chunks (smaller subsets) of the given number of rows at a time, which reduces memory usage while parsing (in contrast to reading the entire table into memory)
import pandas as pd
# read the table with 18M rows as chunks of 1M rows at a time
df = pd.read_csv('plant_anot_db.txt', chunksize=1000000) 
# Note: df is a TextFileReader object and not a dataframe
# combine multiple chunks
com_df = pd.concat(df)
com_df
# output
                      ID                                        Description
0         XP_040872840.1  putative pentatricopeptide repeat-containing p...
1         XP_040872839.1  putative pentatricopeptide repeat-containing p...
2         XP_040872838.1  putative pentatricopeptide repeat-containing p...
3         XP_040872837.1  putative pentatricopeptide repeat-containing p...
4         XP_040872836.1  putative pentatricopeptide repeat-containing p...
...                  ...                                                ...
18495124      CAN67397.1  hypothetical protein VITISV_022672 [Vitis vini...
18495125      CAN74053.1  hypothetical protein VITISV_005346 [Vitis vini...
18495126      CAN74052.1  hypothetical protein VITISV_005345 [Vitis vini...
18495127      CAN71914.1  hypothetical protein VITISV_026485 [Vitis vini...
18495128      CAN67393.1  hypothetical protein VITISV_041597 [Vitis vini...
[18495129 rows x 2 columns]

# save CSV or TSV file to Parquet format
com_df.to_parquet('plant_anot_db.parquet')
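Note that pd.concat rebuilds the full table in memory, which cancels much of the benefit of chunking. When a per-chunk computation is enough (a sum, a filter, a per-chunk write), you can process each chunk as it is read and never hold the whole table at once. A minimal sketch with a small in-memory table standing in for a large file (hypothetical data):

```python
import io
import pandas as pd

# small in-memory table standing in for a large file (hypothetical data)
csv_text = "x\n" + "\n".join(str(i) for i in range(10))

# process each chunk as it arrives, so the full table never sits in memory
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    total += chunk['x'].sum()
print(total)  # 45
```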

Import Apache Parquet file

  • Apache Parquet is a columnar, self-describing (metadata-embedded) file format mostly used in the Hadoop ecosystem
  • In contrast to row-based formats (e.g. CSV, TSV), a Parquet file is optimized for query performance and requires less I/O (it reads only the required columns from large files)
  • Parquet also greatly reduces storage requirements and speeds up reads. For example, a 1.4 GB TSV file can be saved as a 261 MB Parquet file
import pandas as pd
# read the table with 18M rows as Parquet file
df = pd.read_parquet('plant_anot_db.parquet') 
df
# output
                      ID                                        Description
0         XP_040872840.1  putative pentatricopeptide repeat-containing p...
1         XP_040872839.1  putative pentatricopeptide repeat-containing p...
2         XP_040872838.1  putative pentatricopeptide repeat-containing p...
3         XP_040872837.1  putative pentatricopeptide repeat-containing p...
4         XP_040872836.1  putative pentatricopeptide repeat-containing p...
...                  ...                                                ...
18495124      CAN67397.1  hypothetical protein VITISV_022672 [Vitis vini...
18495125      CAN74053.1  hypothetical protein VITISV_005346 [Vitis vini...
18495126      CAN74052.1  hypothetical protein VITISV_005345 [Vitis vini...
18495127      CAN71914.1  hypothetical protein VITISV_026485 [Vitis vini...
18495128      CAN67393.1  hypothetical protein VITISV_041597 [Vitis vini...
[18495129 rows x 2 columns]

# read only ID column
df = pd.read_parquet('plant_anot_db.parquet', columns=['ID']) 
df.head(2)
               ID
0  XP_040872840.1
1  XP_040872839.1


This work is licensed under a Creative Commons Attribution 4.0 International License