
I have a directory full of files that have date strings as part of the filenames:

file_type_1_20140722_foo.txt
file_type_two_20140723_bar.txt
filetypethree20140724qux.txt

I need to get these date strings from the filenames and save them in an array:

['20140722', '20140723', '20140724']

But they can appear at various places in the filename, so I can't just use substring notation and extract it directly. In the past, the way I've done something similar to this in Bash is like so:

date=$(echo $file | egrep -o '[[:digit:]]{8}' | head -n1)

But I can't use Bash for this because it sucks at math (I need to be able to add and subtract floating point numbers). I've tried glob.glob() and re.match(), but both return empty lists:

>>> dates = [file for file in sorted(os.listdir('.')) if re.match("[0-9]{8}", file)]
>>> print dates
[]

I know the problem is it's looking for complete file names that are eight digits long, but I have no idea how to make it look for substrings instead. Any ideas?

  • Use re.search instead of match, and put the digits inside parentheses to get a match group. Commented Jul 22, 2014 at 18:46
  • Would using the split() function be a valid option for you? Commented Jul 22, 2014 at 18:47
  • @Batman no, because the numbers are sometimes offset by underscores, and sometimes jammed up next to text. Commented Jul 22, 2014 at 18:47
  • @TomZych that doesn't give the substring, just the files that have that substring matching the pattern (all of them). Commented Jul 22, 2014 at 18:49

3 Answers

>>> import re
>>> import os
>>> [date for file in os.listdir('.') for date in re.findall(r"(\d{8})", file)]
['20140722', '20140723']

Note that if a filename has a 9-digit substring, then only the first 8 digits will be matched. If a filename contains a 16-digit substring, there will be 2 non-overlapping matches.
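To make the caveat above concrete, here is a small sketch. The sample strings and the `EXACT_8` name are made up for illustration, and the lookaround pattern is just one possible way to reject longer digit runs, not something the original answer proposed.

```python
import re

# Plain pattern: any run of 8 digits, even inside a longer run of digits.
nine_digits = re.findall(r"\d{8}", "x201407221y")            # 9-digit run
sixteen_digits = re.findall(r"\d{8}", "x2014072220140723y")  # 16-digit run
print(nine_digits)      # ['20140722']
print(sixteen_digits)   # ['20140722', '20140723']

# Hypothetical stricter pattern: exactly 8 digits, with no digit
# immediately before or after the match.
EXACT_8 = re.compile(r"(?<!\d)\d{8}(?!\d)")
strict_nine = EXACT_8.findall("x201407221y")
strict_name = EXACT_8.findall("filetypethree20140724qux.txt")
print(strict_nine)      # []
print(strict_name)      # ['20140724']
```

Whether the stricter pattern is worth it depends on whether longer digit runs can actually show up in your filenames.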


2 Comments

Just a note to newcomers to Python... make sure you import the regular expressions engine with import re. :) I couldn't upvote because I exhausted my daily vote limit. hehehe
@LenielMacaferi: Thanks for the improvement.

Your regular expression looks good, but you should be using re.search instead of re.match so that it will search for that expression anywhere in the string:

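import re
filename = "file_type_1_20140722_foo.txt"  # example name from the question
r = re.compile(r"[0-9]{8}")
m = r.search(filename)  # search scans the whole string, not just the start
if m:
    print(m.group(0))  # the matched substring, not the file name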
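To get from a single match to the list the question asks for, one option is to wrap the search in a small helper and apply it to every name. This is only a sketch: `first_date` and `DATE_RE` are names I made up, and the filenames are hard-coded from the question rather than read with os.listdir('.').

```python
import re

DATE_RE = re.compile(r"[0-9]{8}")

def first_date(filename):
    """Return the first 8-digit run in filename, or None if there is none."""
    m = DATE_RE.search(filename)
    return m.group(0) if m else None

# Filenames copied from the question; in practice these would come
# from os.listdir('.').
filenames = [
    "file_type_1_20140722_foo.txt",
    "file_type_two_20140723_bar.txt",
    "filetypethree20140724qux.txt",
]

# Skip any file that has no 8-digit run at all.
dates = [d for d in map(first_date, filenames) if d is not None]
print(dates)  # ['20140722', '20140723', '20140724']
```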

2 Comments

This gives the full file name, not the substrings
I missed the group() part, my bad

re.match matches from the beginning of the string. re.search matches the pattern anywhere. Or you can try this:

import os
import re
extract_dates = re.compile(r"[0-9]{8}").findall
# Keep the first date from every filename that contains at least one
dates = [found[0] for found in sorted(
    extract_dates(filename) for filename in os.listdir('.')) if found]
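If the difference between re.match and re.search is still unclear, this short sketch runs both on one filename from the question. The variable names are only for illustration.

```python
import re

pattern = re.compile(r"[0-9]{8}")
name = "file_type_1_20140722_foo.txt"

# match only succeeds if the pattern matches at position 0.
at_start = pattern.match(name)
print(at_start)           # None, because the name starts with "file_"

# search scans forward until it finds the pattern anywhere.
anywhere = pattern.search(name)
print(anywhere.group(0))  # 20140722

# match does succeed when the digits really are at the start.
leading = pattern.match("20140722_foo.txt")
print(leading.group(0))   # 20140722
```

That is why the list comprehension in the question came back empty: none of the filenames begin with eight digits.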
