I need to get specific lines of data that have certain key words in them (names) and write them to another file. The starting file is a 1.5 GB Excel file. I can't just open it up and save it as a different format. How should I handle this using python?
-
3Detail in the question is also appreciated.Tim Pietzcker– Tim Pietzcker2010-07-13 20:13:46 +00:00Commented Jul 13, 2010 at 20:13
-
2You mentioned that you can open it in EditPad Lite; what does the data look like (e.g. binary data, XML, CSV, tab-delimited, etc.)? If you don't know, you could edit your question and paste a sample there.tgray– tgray2010-07-13 20:34:22 +00:00Commented Jul 13, 2010 at 20:34
-
It's readable words that are seperated by // where there would be different columns i beleive. Like John // Doe // male // caucasian //novak– novak2010-07-13 21:13:03 +00:00Commented Jul 13, 2010 at 21:13
-
1as @tgray asked, POST A SAMPLE (edit your question!).John Machin– John Machin2010-07-13 21:33:07 +00:00Commented Jul 13, 2010 at 21:33
-
Copy the first four lines of your file, paste them into your question, select the pasted lines, and then press Ctrl-K to format them in a helpful way.Tim Pietzcker– Tim Pietzcker2010-07-14 04:50:26 +00:00Commented Jul 14, 2010 at 4:50
4 Answers
I'm the author and maintainer of xlrd. Please edit your question to provide answers to the following questions. [Such stuff in SO comments is VERY hard to read]
How big is the file in MB? ["Huge" is not a useful answer]
What software created the file?
How much memory do you have on your computer?
Exactly what happens when you try to open the file using Excel? Please explain "I can open it partially".
Exactly what is the error message that you get when you try to open "C:\bigfile.xls" with your script using
xlrd.open_workbook? Include the script that you ran, the full traceback, and the error messageWhat operating system, what version of Python, what version of xlrd?
Do you know how many worksheets there are in the file?
2 Comments
python -c "print repr(open('yourfile.xls', 'rb').read(8))"It sounds to me like you have a spreadsheet that was created using Excel 2007 and you have only Excel 2003.
Excel 2007 can create worksheets with 1,048,576 rows by 16,384 columns while Excel 2003 can only work with 65,536 rows by 256 columns. Hence the reason you can't open the entire worksheet in Excel.
If the workbook is just bigger in dimension then xlrd should work for reading the file, but if the file is actually bigger than the amount of memory you have in your computer (which I don't think is the case here since you can open the file with EditPad lite) then you would have to find an alternate method because xlrd reads the entire workbook into memory.
Assuming the first case:
import xlrd
wb_path = r'c:\bigfile.xls'
output_path = r'c:\output.txt'
wb = xlrd.open(wb_path)
ws = wb.sheets()[0] # assuming you want to work with the first sheet in the workbook
with open(output_path, 'w') as output_file:
for i in xrange(ws.nrows):
row = [cell.value for cell in ws.row(i)]
# ... replace the following if statement with your own conditions ...
if row[0] == u'interesting':
output_file.write('\t'.join(row) + '\r\n')
This will give you a tab-delimited output file that should open in Excel.
Edit:
Based on your answer to John Machin's question 5, make sure there is a file called 'bigfile.xls' located in the root of your C drive. If the file isn't there, change the wb_path to the correct location of the file you want to open.
6 Comments
from __future__ import with_statementI haven't used it, but xlrd looks like it does a good job reading Excel data.
10 Comments
Your problem is that you are using Excel 2003 .. You need to use a more recent version to be able to read this file. 2003 will not open files bigger than 1M rows.