How do I extract specific lines of data from a huge Excel sheet using Python?

Question

I need to extract specific lines of data that contain certain keywords (names) and write them to another file. The starting file is a 1.5 GB Excel file, so I can't just open it and save it in a different format. How should I handle this using Python?

You mentioned that you can open it in EditPad Lite; what does the data look like (e.g. binary data, XML, CSV, tab-delimited, etc.)? If you don't know, you could edit your question and paste a sample there.
– tgray, Commented Jul 13, 2010 at 20:34
It's readable words that are separated by // where the different columns would be, I believe. Like John // Doe // male // caucasian //
– novak, Commented Jul 13, 2010 at 21:13
Copy the first four lines of your file, paste them into your question, select the pasted lines, and then press Ctrl-K to format them in a helpful way.
– Tim Pietzcker, Commented Jul 14, 2010 at 4:50
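
If the file turns out to be plain text with // between the fields, as the sample line above suggests, it could be filtered one line at a time without Excel or xlrd, so the 1.5 GB never has to fit in memory. The following is only a sketch for Python 3 under that assumption (UTF-8 text, one record per line); filter_lines, its parameters, and any file names passed to it are hypothetical.

```python
def filter_lines(in_path, out_path, keywords, delimiter='//', encoding='utf-8'):
    """Copy every line that has a field equal to one of the keywords.

    Returns the number of lines written. Hypothetical helper, not part of any library.
    """
    wanted = set(keywords)
    written = 0
    # errors='replace' keeps the scan going if a few bytes are not valid text.
    with open(in_path, 'r', encoding=encoding, errors='replace') as src, \
            open(out_path, 'w', encoding=encoding) as dst:
        for line in src:  # the file object yields one line at a time
            fields = [field.strip() for field in line.split(delimiter)]
            if wanted.intersection(fields):
                dst.write(line)
                written += 1
    return written
```

Matching on whole fields rather than on substrings avoids picking up "Johnson" when searching for "John"; use a substring test instead if partial matches are wanted.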

John Machin · Accepted Answer · 2010-07-13 20:57:39Z

3

I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]

How big is the file in MB? ["Huge" is not a useful answer]
What software created the file?
How much memory do you have on your computer?
Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message
What operating system, what version of Python, what version of xlrd?
Do you know how many worksheets there are in the file?


2 Comments

novak Over a year ago

1. The file is 1,500,000 KB. 2. I believe Excel; I didn't create the file myself. 3. Not enough; it's freezing up often. 4. Excel says it can't open the entire file and some data will be lost, so I can open the first part of the file but not the entire length of the records. 5. It says the file with that name does not exist. 6. Python 2.6 on Windows; I'm not sure about the xlrd version. 7. Only one worksheet.

John Machin Over a year ago

Re Q3: Exactly how many MB of memory do you have? Re Q5: (1) PLEASE ensure that you have entered the full correct path to your file; PLEASE give the exact error message and traceback (use copy/paste) (2) Please tell us the contents of the first 8 bytes of the file, obtained by doing python -c "print repr(open('yourfile.xls', 'rb').read(8))"
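
The point of asking for the first 8 bytes is that they usually reveal the real container format regardless of the file extension. As a rough sketch (Python 3; sniff_format and sniff_file are hypothetical helper names, and the labels they return are made up for illustration), the two well-known signatures can be checked like this:

```python
# Leading bytes ("magic numbers") of the two Excel container formats.
OLE2_SIGNATURE = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # legacy .xls stored in an OLE2 compound file
ZIP_SIGNATURE = b'PK\x03\x04'  # .xlsx and .xlsb files are ZIP archives


def sniff_format(first_bytes):
    """Classify a file from its first 8 bytes (hypothetical helper)."""
    if first_bytes.startswith(OLE2_SIGNATURE):
        return 'ole2-xls'
    if first_bytes.startswith(ZIP_SIGNATURE):
        return 'zip-xlsx-or-xlsb'
    # Anything else may be plain text (CSV, delimited) that was merely named .xls.
    return 'unknown-possibly-text'


def sniff_file(path):
    """Read the first 8 bytes of the file at path and classify them."""
    with open(path, 'rb') as f:
        return sniff_format(f.read(8))
```

A result of 'unknown-possibly-text' would fit the description of readable words separated by //, in which case xlrd is the wrong tool and ordinary text processing applies.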

Community · Accepted Answer · 2017-05-23 11:50:44Z

1

It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.

Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.

If the workbook is just bigger in dimensions, then xlrd should be able to read the file. If the file is actually bigger than the amount of memory in your computer (which I don't think is the case here, since you can open the file with EditPad Lite), you would have to find an alternative method, because xlrd reads the entire workbook into memory.

Assuming the first case:

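import xlrd

wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'

wb = xlrd.open_workbook(wb_path)  # xlrd's entry point is open_workbook()
ws = wb.sheets()[0]  # assuming you want to work with the first sheet in the workbook

# Binary mode writes the '\r\n' line endings exactly as given, without doubling the '\r' on Windows.
with open(output_path, 'wb') as output_file:
    for i in range(ws.nrows):
        row = [cell.value for cell in ws.row(i)]

        # ... replace the following if statement with your own conditions ...
        if row[0] == u'interesting':
            # Numeric cells come back as floats, so convert every value to text before joining.
            line = u'\t'.join(u'%s' % value for value in row) + u'\r\n'
            output_file.write(line.encode('utf-8'))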

This will give you a tab-delimited output file that should open in Excel.

Edit:

Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.
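
Before involving xlrd at all, the path itself can be checked from Python. This is a small sketch for Python 3, not a definitive diagnosis; describe_path is a hypothetical helper, and the idea that the real name differs only by case or extension (for example a hidden .xlsx or .txt suffix on Windows) is just one possible cause.

```python
import os


def describe_path(path):
    """Return a short diagnostic string for a path (hypothetical helper)."""
    if os.path.isfile(path):
        return 'found: %d bytes' % os.path.getsize(path)
    folder = os.path.dirname(path) or '.'
    if not os.path.isdir(folder):
        return 'folder does not exist: %s' % folder
    # Compare on the lower-cased name without its extension.
    stem = os.path.splitext(os.path.basename(path))[0].lower()
    # List names that start the same way, e.g. bigfile.xlsx or bigfile.xls.txt.
    similar = sorted(name for name in os.listdir(folder)
                     if name.lower().startswith(stem))
    return 'not found; similar names: %s' % (', '.join(similar) or 'none')
```

Calling describe_path(wb_path) and pasting the result would also give the exact information requested in the comments on the other answer.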

edited May 23, 2017 at 11:50


answered Jul 13, 2010 at 20:54

tgray


6 Comments

John Machin Over a year ago

For an Excel file to VALIDLY have more than 256 columns or 65536 rows, it has to be created by Excel 2007 or 2010 in XLSX format or XLSB format. Excel 2003 won't open any of an XLSX or XLSB file (unless maybe the compatibility kit has been added in). Unless the OP gives some precise info, all we have at the moment is idle speculation.

tgray Over a year ago

@John Machin, True enough. Though I seem to remember that Excel 2007/2010 can save a worksheet with more than 65536 rows as an XLS file and re-open it without losing any data. Since I'm about to sign out for the day I figured I'd provide my speculation before leaving and just made educated guesses based on the comments the OP made.

novak Over a year ago

It says there is a syntax error in this line: with open(output_path, 'w') as output_file:

tgray Over a year ago

@novak, if you're using Python 2.5 you need to include another import statement at the top of the script: from __future__ import with_statement

John Machin Over a year ago

@tgray: """Excel 2007/2010 can save a worksheet with more than 65536 rows as an XLS file and re-open it without losing any data.""" -- WRONG More than 65536 rows in an XLS format is just not on; the row index is kept in a 16-bit unsigned integer.
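
To illustrate the 16-bit limit mentioned in the comment above: an unsigned 16-bit integer has 2**16 = 65536 possible values, so zero-based row indexes run from 0 to 65535. The sketch below (Python 3) only demonstrates that width limit with the struct module; pack_row_index is a hypothetical helper, and the little-endian two-byte layout is an assumption made for illustration, not a description of the actual XLS record structure.

```python
import struct

MAX_XLS_ROWS = 2 ** 16  # 65536 distinct values, i.e. row indexes 0..65535


def pack_row_index(row_index):
    """Pack a zero-based row index as an unsigned 16-bit integer (illustrative only)."""
    return struct.pack('<H', row_index)


last_valid = pack_row_index(MAX_XLS_ROWS - 1)  # index 65535 still fits in two bytes

try:
    pack_row_index(MAX_XLS_ROWS)  # index 65536 would be the 65537th row
    overflowed = False
except struct.error:
    overflowed = True  # there is no two-byte encoding for this value
```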


Ned Batchelder · Accepted Answer · 2010-07-13 20:12:18Z

0

I haven't used it, but xlrd looks like it does a good job reading Excel data.


10 Comments

novak Over a year ago

I'm having a really hard time working with xlrd; I can't get it to open my file.

Tim Pietzcker Over a year ago

Then please post in your question what you've tried so far that didn't work. Can't you even open the file in Excel itself?

novak Over a year ago

No, the file is too big to open in Excel completely. I can open it partially. I have this program:

from xlrd import open_workbook, cellname
book = open_workbook('C:\\bigfile.xls')
sheet = book.sheet_by_index(0)
print sheet.name
print sheet.nrows
print sheet.ncols
for row_index in range(sheet.nrows):
    for col_index in range(sheet.ncols):
        print cellname(row_index,col_index),'-',
        print sheet.cell(row_index,col_index).value

Nick T Over a year ago

Try glancing at the file in Notepad or the like to make sure it's an actual Excel file, not something like a CSV that was named .xls[x] which can confuse Excel.

novak Over a year ago

So the problem is that I can make a sample.xls file, refer to it as C:\\sample.xls, and it opens fine and lists the data. But when I want to use the real, huge data file C:\\bigfile.xls, it says that the file doesn't exist. It's really frustrating.


Martin · Accepted Answer · 2010-07-13 20:53:41Z

0

Your problem is that you are using Excel 2003. You need a more recent version to be able to read this file; Excel 2003 will not open files bigger than 1M rows.


1 Comment

John Machin Over a year ago

How do you know that that is the problem? Having sifted carefully through the comments etc, I can't see any mention of row count. Besides, the OP says that xlrd says that file doesn't exist i.e. no "too big" indication.
