-2

I need to extract the lines of data that contain certain keywords (names) and write them to another file. The starting file is a 1.5 GB Excel file, so I can't just open it and save it in a different format. How should I handle this using Python?

5
  • Detail in the question is also appreciated. Commented Jul 13, 2010 at 20:13
  • You mentioned that you can open it in EditPad Lite; what does the data look like (e.g. binary data, XML, CSV, tab-delimited, etc.)? If you don't know, you could edit your question and paste a sample there. Commented Jul 13, 2010 at 20:34
  • It's readable words that are separated by // where there would be different columns, I believe. Like John // Doe // male // caucasian // Commented Jul 13, 2010 at 21:13
  • As @tgray asked, POST A SAMPLE (edit your question!). Commented Jul 13, 2010 at 21:33
  • Copy the first four lines of your file, paste them into your question, select the pasted lines, and then press Ctrl-K to format them in a helpful way. Commented Jul 14, 2010 at 4:50

4 Answers

3

I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]

  1. How big is the file in MB? ["Huge" is not a useful answer]

  2. What software created the file?

  3. How much memory do you have on your computer?

  4. Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".

  5. Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using xlrd.open_workbook? Include the script that you ran, the full traceback, and the error message

  6. What operating system, what version of Python, what version of xlrd?

  7. Do you know how many worksheets there are in the file?


2 Comments

1. The file is 1,500,000 KB. 2. I believe Excel; I didn't create the file myself. 3. Not enough; it's freezing up often. 4. Excel says I can't open the entire file and some data will be lost, so I can open the first part of the file but not the entire length of the record. 5. It says the file with that name does not exist. 6. Python 2.6 on Windows, and I'm not sure about the xlrd version. 7. Only one worksheet.
Re Q3: Exactly how many MB of memory do you have? Re Q5: (1) PLEASE ensure that you have entered the full correct path to your file; PLEASE give the exact error message and traceback (use copy/paste) (2) Please tell us the contents of the first 8 bytes of the file, obtained by doing python -c "print repr(open('yourfile.xls', 'rb').read(8))"
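The first 8 bytes are enough to tell the common cases apart. A minimal sketch, assuming only the well-known container signatures: legacy XLS files are OLE2 compound documents and start with the bytes D0 CF 11 E0 A1 B1 1A E1, while XLSX and XLSB files are ZIP archives and start with PK\x03\x04. The helper name `sniff_format` and its return labels are made up for illustration.

```python
OLE2_MAGIC = b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'  # legacy .xls container
ZIP_MAGIC = b'PK\x03\x04'                          # .xlsx / .xlsb container


def sniff_format(path):
    """Guess the container format from the first 8 bytes (hypothetical helper)."""
    with open(path, 'rb') as f:
        head = f.read(8)
    if head == OLE2_MAGIC:
        return 'ole2'   # probably a real XLS file
    if head.startswith(ZIP_MAGIC):
        return 'zip'    # probably XLSX or XLSB
    return 'other'      # possibly plain text with an .xls extension
```

A result of 'other' would fit the OP's description of readable words separated by //, but it is only a hint, not proof of the format.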
1

It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.

Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.

If the workbook merely has larger dimensions, xlrd should be able to read the file. If the file is actually bigger than the amount of memory in your computer (which I don't think is the case here, since you can open the file with EditPad Lite), you will have to find an alternative method, because xlrd reads the entire workbook into memory.

Assuming the first case:

import xlrd

wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'

wb = xlrd.open_workbook(wb_path)
ws = wb.sheets()[0]  # assuming you want to work with the first sheet in the workbook

with open(output_path, 'w') as output_file:
    for i in xrange(ws.nrows):
        row = [cell.value for cell in ws.row(i)]

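        # ... replace the following if statement with your own conditions ...
        if row[0] == u'interesting':
            # Cell values can be floats as well as text, so convert each one before joining.
            line = u'\t'.join(unicode(value) for value in row)
            # In text mode on Windows, '\n' is already written as '\r\n'.
            output_file.write(line.encode('utf-8') + '\n')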

This will give you a tab-delimited output file that should open in Excel.

Edit:

Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.
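If the file turns out to be delimited text rather than a real workbook (the sample in the question's comments, John // Doe // male // caucasian //, suggests this), it can be filtered one line at a time without loading it into memory. A minimal sketch under that assumption: the function name, the handling of the // separator, the UTF-8 encoding, and the example names are all assumptions, not details confirmed by the OP.

```python
import io


def filter_lines(src_path, dst_path, names, encoding='utf-8'):
    """Copy every line that has one of `names` as a field (hypothetical helper).

    Lines are read one at a time, so memory use stays small even for a
    multi-gigabyte input file. Returns the number of lines written.
    """
    wanted = set(names)
    kept = 0
    with io.open(src_path, 'r', encoding=encoding, errors='replace') as src:
        with io.open(dst_path, 'w', encoding=encoding) as dst:
            for line in src:
                # Split on the '//' separator and strip padding around fields.
                fields = [field.strip() for field in line.split(u'//')]
                if wanted.intersection(fields):
                    dst.write(line)
                    kept += 1
    return kept
```

A call such as filter_lines(r'c:\bigfile.xls', r'c:\output.txt', [u'Doe', u'Smith']) would then keep only the matching rows. The sketch is written for Python 3; it should also run on Python 2.6 as long as the names are passed as unicode strings.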

6 Comments

For an Excel file to VALIDLY have more than 256 columns or 65536 rows, it has to be created by Excel 2007 or 2010 in XLSX format or XLSB format. Excel 2003 won't open any of an XLSX or XLSB file (unless maybe the compatibility kit has been added in). Unless the OP gives some precise info, all we have at the moment is idle speculation.
@John Machin, True enough. Though I seem to remember that Excel 2007/2010 can save a worksheet with more than 65536 rows as an XLS file and re-open it without losing any data. Since I'm about to sign out for the day I figured I'd provide my speculation before leaving and just made educated guesses based on the comments the OP made.
It says there is a syntax error in this line: with open(output_path, 'w') as output_file:
@novak, if you're using Python 2.5 you need to include another import statement: from __future__ import with_statement
@tgray: """Excel 2007/2010 can save a worksheet with more than 65536 rows as an XLS file and re-open it without losing any data.""" -- WRONG More than 65536 rows in an XLS format is just not on; the row index is kept in a 16-bit unsigned integer.
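The limit follows directly from the field width: a 16-bit unsigned integer holds the values 0 through 65535, which gives at most 65536 distinct row indexes. A quick illustration with the standard struct module on Python 3; it only demonstrates the integer range and does not parse any XLS records.

```python
import struct

MAX_ROWS_XLS = 2 ** 16  # 65536 rows, indexes 0..65535

# The last valid row index fits in two bytes
# ('<H' means little-endian unsigned 16-bit).
packed = struct.pack('<H', MAX_ROWS_XLS - 1)

# One more row does not fit in the same field.
try:
    struct.pack('<H', MAX_ROWS_XLS)
    fits = True
except struct.error:
    fits = False
```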
0

I haven't used it, but xlrd looks like it does a good job reading Excel data.

10 Comments

I'm having a really hard time working with xlrd. I can't get it to open my file.
Then please post in your question what you've tried so far that didn't work. Can't you even open the file in Excel itself?
No, the file is too big to open in Excel completely. I can open it partially. I have this program: from xlrd import open_workbook,cellname book = open_workbook('C:\\bigfile.xls') sheet = book.sheet_by_index(0) print sheet.name print sheet.nrows print sheet.ncols for row_index in range(sheet.nrows): for col_index in range(sheet.ncols): print cellname(row_index,col_index),'-', print sheet.cell(row_index,col_index).value
Try glancing at the file in Notepad or the like to make sure it's an actual Excel file, not something like a CSV that was named .xls[x] which can confuse Excel.
So the problem is that I can make a sample.xls file, refer to it as C:\\sample.xls, and it opens fine and lists the data. But when I want to use the real, huge data file C:\\bigfile.xls, it says that the file doesn't exist. It's really frustrating.
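One way to chase a "file does not exist" error is to ask Python what is actually in the folder, since Windows Explorer can hide the real extension (the file might be named bigfile.xlsx or bigfile.xls.txt, for example). A minimal sketch; the helper name and the example stem are hypothetical.

```python
import os


def find_candidates(directory, stem):
    """Return the names in `directory` that start with `stem`, ignoring case."""
    stem = stem.lower()
    return sorted(name for name in os.listdir(directory)
                  if name.lower().startswith(stem))
```

Calling find_candidates('C:\\', 'bigfile') lists every matching name exactly as the operating system sees it, and one of those names can then be passed to open_workbook.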
0

Your problem is that you are using Excel 2003. You need a more recent version to read this file: Excel 2003 cannot open worksheets with more than 65,536 rows, while Excel 2007 and later handle up to 1,048,576 rows.

1 Comment

How do you know that that is the problem? Having sifted carefully through the comments etc, I can't see any mention of row count. Besides, the OP says that xlrd says that file doesn't exist i.e. no "too big" indication.
