
I have 15 .csv files with the following formats:

**File 1**
MYC
RASSF1
DAPK1
MDM2
TP53
E2F1
...

**File 2**
K06227
C00187
GLI1
PTCH1
BMP2
TP53
...

I would like to create a loop that runs through the 15 files and compares them two at a time, covering every unique pair. So File 1 and File 2 would be compared with each other, and the output would tell me how many matches were found and what they were. In the above example, the output would be:

1 match and TP53

The loops would be used to compare all the files against each other so 1,3 (File 1 against File 3), 1,4 and so on.

f1 = set(open(str(cancers[1]) + '.csv', 'r'))
f2 = set(open(str(cancers[2]) + '.csv', 'r'))
f3 = open(str(cancers[1]) + '_vs_' + str(cancers[2]) + '.txt', 'wb').writelines(f1 & f2)

The above works but I'm having a hard time creating the looping portion.

2 Answers

To avoid comparing a file with itself, and to keep the code independent of the number of cancers, I would write it like this. I assume cancers is a list.

# example list of cancers
cancers = ['BRCA', 'BLCA', 'HNSC']

def read_genes(name):
    # one gene ID per line; strip newlines and skip blank lines
    with open(name + '.csv', 'r') as f:
        return set(line.strip() for line in f if line.strip())

with open('match.csv', 'w') as fout:
    for i in range(len(cancers)):
        # start j at i + 1 so a file is never compared with itself
        # and each pair is visited only once
        for j in range(i + 1, len(cancers)):
            # cancers holds strings here, so str(cancers[i]) isn't needed
            match = list(read_genes(cancers[i]) & read_genes(cancers[j]))
            # I use ; to separate matched genes to make excel able to read
            fout.write('{}_vs_{},{} matches,{}\n'.format(
                cancers[i], cancers[j], len(match), ';'.join(match)))

Results

BRCA_vs_BLCA,1 matches,TP53
BRCA_vs_HNSC,6 matches,TP53;BMP2;GLI1;C00187;PTCH1;K06227
BLCA_vs_HNSC,1 matches,TP53
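
The nested index loops can also be written with itertools.combinations from the standard library, which yields each unordered pair exactly once; for 15 files that is 15 * 14 / 2 = 105 comparisons. A minimal sketch, assuming the gene lists are already loaded into a dict of sets (the names pairwise_matches and gene_sets are hypothetical, and the sample data is only the visible lines of File 1 and File 2 from the question):

```python
from itertools import combinations


def pairwise_matches(gene_sets):
    """Yield (name_a, name_b, shared) for every unordered pair of names."""
    for a, b in combinations(sorted(gene_sets), 2):
        # sorted() makes the order of the shared IDs reproducible
        yield a, b, sorted(gene_sets[a] & gene_sets[b])


# Illustrative in-memory data; in practice each set would be read from a .csv file.
gene_sets = {
    'file1': {'MYC', 'RASSF1', 'DAPK1', 'MDM2', 'TP53', 'E2F1'},
    'file2': {'K06227', 'C00187', 'GLI1', 'PTCH1', 'BMP2', 'TP53'},
}

for a, b, shared in pairwise_matches(gene_sets):
    print('{}_vs_{},{} matches,{}'.format(a, b, len(shared), ';'.join(shared)))
# prints: file1_vs_file2,1 matches,TP53
```

Since combinations never pairs an item with itself, no explicit j > i check is needed.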

4 Comments

This works but I'm still getting instances where the output .txt file equates to a comparison of the same file. Ex: BRCA_vs_BRCA.txt. Do you know how I could bypass this?
@Quintakov Yes. I meant to avoid the same file comparison, but I just found I didn't. Now it should work.
Do you know how I could produce the "1 match and TP53" part? Basically, I would like to create an output .csv that contains the number of matches for each pair of files and what they are. So something like file 1_vs_file 2, 2 matches, {TP53, BRCA1}
@Quintakov I edited as you requested. Please check the edited version.

To loop through all pairs up to 15, something like this can do it:

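# indices 1..15 follow the question's cancers[1], cancers[2], ...;
# for a 0-based list of 15 names, use range(0, 14) and range(i + 1, 15)
for i in range(1, 15):
    for j in range(i + 1, 16):
        f1 = set(open(str(cancers[i]) + '.csv', 'r'))
        f2 = set(open(str(cancers[j]) + '.csv', 'r'))
        # text mode ('w'): writing str lines to a 'wb' file fails on Python 3
        with open(str(cancers[i]) + '_vs_' + str(cancers[j]) + '.txt', 'w') as out:
            out.writelines(f1 & f2)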
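
One caveat when comparing raw lines with set(open(...)): every line keeps its trailing newline, so an ID on the last line of a file that lacks a final newline will not match the same ID elsewhere. A small sketch of the effect, using in-memory text instead of real files (read_ids is a hypothetical helper name, and the sample contents are made up for illustration):

```python
import io


def read_ids(handle):
    """Return the set of non-empty, whitespace-stripped lines from an open file."""
    return {line.strip() for line in handle if line.strip()}


# Two in-memory "files": the first ends without a newline, the second with one.
file_a = 'MYC\nTP53'
file_b = 'TP53\nGLI1\n'

raw_common = set(io.StringIO(file_a)) & set(io.StringIO(file_b))
clean_common = read_ids(io.StringIO(file_a)) & read_ids(io.StringIO(file_b))

print(raw_common)    # set(): 'TP53' and 'TP53\n' are different strings
print(clean_common)  # {'TP53'}
```

Stripping each line before building the set sidesteps this, at the cost of having to add the newlines back when writing the matches out.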
