How To Use Python To Query A CSV

Tutorial Tuesday – A lesson that you can use

Sometimes you might find that you have a very big comma separated value (csv) file from which you only need to look at a subset of the rows. An easy way to do this is to open the file in your spreadsheet program and apply a filter. Sometimes using a spreadsheet will be too slow, like at times when there are hundreds of thousands of rows, or your spreadsheet is older and limits the number of rows that you can open. For example if you are looking through the Canada Open data directory you may find files that are too large to be truly managable in your spreadsheet.

This tutorial describes how you can use a plain installation of Python to query a filter a large csv file and save it to a new file. Before you start you must have Python 3 installed.

In this tutorial, we will.

  1. Read the first 10 rows of a csv file.
  2. Query the csv file so that it selects the values that you want and saves it to a new file.

We will start with a sample csv file that you can download here.

Read the First 10 Rows Of A CSV File

To read the first 10 rows of a csv file we will.

  1. Import the baked-in csv module.
  2. Import a module that will allow us to step through each row of the CSV file, and stop after the first 10 rows. This is the itertools module. Short for iteration tools module.

The first two lines of code look like this:

import csv
    from itertools import islice

Next we will open the csv file. The file must be in the same directory as your Python script. You can always change the path of your file in your script.

csvfile = open('test.csv', newline='')
csv_rows = csv.reader(csvfile)

Lastly we will use the islice function to slice off the top 10 rows of the file and print them.

top_ten = islice(csv_rows, 0, 10)

for item in top_ten:
	print (item)

Knowing what the top 10 rows looks like helps us determine how we should be querying or filtering the file. We can see which column we want to look at to filter out.

The whole script for reading the top 10 rows is as follows:

import csv
from itertools import islice

csvfile = open('test.csv', newline='')
csv_rows = csv.reader(csvfile)

top_ten = islice(csv_rows, 0, 10)

for item in top_ten:
	print (item)

Querying the CSV

Next we will filter the CSV file to only save the values that we want it to. To make life easiest, this script should be a separate file from the previous script.

We will start by importing the csv module and opening the csv file in the same way that we did in the previous script:

import csv

csvfile = open('test.csv', newline='')
 csv_rows = csv.reader(csvfile)

We also want to open a file where we will save our output data. Using the ‘a+’ attribute means that it is a file that we can append rows of data to.

f=open("saved_file.csv", "a+")

Then we loop through the csv file with the following code. This code does a few things:

  1. It loops throught the whole list of the csv.
    for item in csv_rows:
  2. It will select the column that you want to filter, and the value that you want to filter by a simple if statement.
if item[1] == 'c':

Counting columns starts at 0. If you wanted the third column, you would actually put item[2]. (0, 1, 2 brings you to the third column.) You can add any value you want in the filter side.

Next it writes the item to disk. Because your row is in a list in Python, the Join function just turns the row into a text row that is joined by commas. (Turning the output file back into a human-readable csv.)

f.write(','.join(item)+'\n')

For our own sake, we also print the row that we have just saved.

print (item)

Lastly after we are done going through all of the rows, we close the open file.

f.close()

The whole script is below:

import csv
from itertools import islice

csvfile = open('test.csv', newline='')
csv_rows = csv.reader(csvfile)

top_ten = islice(csv_rows, 0, 10)

for item in top_ten:
	print (item)

These two scripts can be used to read the first 10 lines of a massive CSV file, and then filter the data to output only the rows that you want into a new file.

You can download a zipped file with all of the scripts here.

Postscript/ Opening Different CSV Files

Opening our simple file works well. But sometimes you might find that the csv file you are working with was saved in in a more complex format. In this case, you might have to open the file with an encoding tag. It might be an iso encoding, as in our example below. It could also be a utf-8 or utf-16 encoding. Below is the code that you would use if you had to explicitly define the type of csv file. Use this code if you are getting an error about the type of file you have.

csvfile = open('test1.csv', newline='', encoding='iso-8859-1')