So I have a list of words, around 50,000 of them, and I want to remove certain numbers and letters from them. Specifically, I want to remove any number from 0-99 that is followed by either an E or a Z, for example: 4E, 11Z, 11E, 20Z, etc.

The words that I want to remove them from look like this:

  • 6S,9,12S-trimethyl-2E,4E,8E,10E-tetradecatetraenoic acid
  • 7Z,14Z-eicosadienoic acid
  • 13,17,21,25-tetramethyl-5Z-hexacosenoic acid
  • CDP-DG(18:1(11Z)/22:6(4Z,7Z,10Z,13Z,16Z,19Z))
  • PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))

As you can see, the annotation I want to remove appears in different contexts within the words (inside brackets, after a hyphen, etc.). So far, I've done:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class EZConfig {

    public static void main(String[] args) throws IOException{

     BufferedReader br = new BufferedReader(new FileReader("C:/Users/colles-a-l-kxc127/Dropbox/PhD/Java/MetabolitesCompiled/src/commonNames"));

        try {

            StringBuilder sb = new StringBuilder();
            String line = br.readLine();

            while (line != null) {

                if(line.contains("[0-99][E|Z]")){

                    System.out.println(line + " TRUE");
                }
                else{
                    System.out.println(line);
                }

                line = br.readLine();
            }

        } finally {
            br.close();
        }
    }
}

This was just to see if I can pick up the number/E or Z annotations, but I can't seem to get it to work. Basically, I need to script something that will remove all those annotations from my list of words. Does anyone know how I can achieve this?

    As a side note -- [0-99] doesn't match any number between 0 and 99. It's a character class that matches a single digit (the range 0-9, plus a redundant 9). The syntax you're looking for is [0-9]+, which will match one or more digits in a row. Commented Dec 18, 2014 at 13:32
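To make the comment above concrete, here is a small sketch of how the two character classes from the question behave. The class name CharClassDemo and the helper wholeMatch are made up for illustration; Pattern.matches is the standard java.util.regex call. It also shows a related detail of the original pattern: inside a character class, | is a literal character, so [E|Z] accepts a pipe as well as E and Z.

```java
import java.util.regex.Pattern;

public class CharClassDemo {

    // Hypothetical helper: true only if the WHOLE input matches the regex.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [0-99] is the range 0-9 plus a redundant 9: exactly one digit.
        System.out.println(wholeMatch("[0-99]", "7"));   // true
        System.out.println(wholeMatch("[0-99]", "42"));  // false, two characters
        // [0-9]+ accepts one or more digits.
        System.out.println(wholeMatch("[0-9]+", "42"));  // true
        // Inside a character class, | is a literal pipe, not alternation.
        System.out.println(wholeMatch("[E|Z]", "|"));    // true
        System.out.println(wholeMatch("[EZ]", "|"));     // false
    }
}
```

So [EZ] is probably what was meant for the letter part, and [0-9]+ (or \d+) for the digits.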

1 Answer

You cannot pass a regular expression to String.contains; the argument is treated as a literal string, not as a pattern.
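As a quick illustration of that point, the following sketch checks the question's pattern text with String.contains. The class name ContainsDemo and the method literalContains are hypothetical names used only for this example.

```java
public class ContainsDemo {

    // Hypothetical helper: same check as in the question's if statement.
    static boolean literalContains(String line) {
        // contains() looks for these exact 13 characters, brackets included.
        return line.contains("[0-99][E|Z]");
    }

    public static void main(String[] args) {
        // A real entry from the list: the literal text never occurs in it.
        System.out.println(literalContains("7Z,14Z-eicosadienoic acid"));   // false
        // Only a line containing the pattern text itself is reported.
        System.out.println(literalContains("regex was [0-99][E|Z] here"));  // true
    }
}
```

That is why the original loop never printed TRUE for any of the sample words.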

I would use this draft solution:

// declare as constant somewhere
static final Pattern MY_PATTERN = Pattern.compile("\\d+[EZ]");

Then, instead of your if(line.contains("[0-99][E|Z]")){ statement, you can use:

if (MY_PATTERN.matcher(line).find()) {

In the long run, if you're removing those annotations from your words, you probably want to use:

line = line.replaceAll("\\d+[EZ]", "");
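For reference, here is a minimal self-contained sketch of that replacement applied to two of the sample words from the question. The class name AnnotationStripper and the method strip are assumptions made for the example; it reuses a precompiled Pattern rather than calling String.replaceAll on each line, which should behave the same but avoids recompiling the regex for 50,000 lines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class AnnotationStripper {

    // One or more digits followed by E or Z, compiled once.
    static final Pattern EZ = Pattern.compile("\\d+[EZ]");

    // Hypothetical helper: removes every match from a single line.
    static String strip(String line) {
        return EZ.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "7Z,14Z-eicosadienoic acid",
                "13,17,21,25-tetramethyl-5Z-hexacosenoic acid");
        for (String word : words) {
            System.out.println(strip(word));
        }
        // prints:
        // ,-eicosadienoic acid
        // 13,17,21,25-tetramethyl--hexacosenoic acid
    }
}
```

Note that the surrounding commas and hyphens are left behind, since the pattern only covers the digits and the letter.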

Edit

As newbiedoodle mentions (I hadn't noticed), the character class [0-99] will not match a number between 0 and 99; it matches a single digit.

If you need to limit your digits to < 100, you can use \\d{1,2} instead of the more generic \\d+.
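One caveat worth checking with your data: on its own, \\d{1,2}[EZ] can still match the last two digits of a longer number, because the match may start in the middle of the digit run. The sketch below compares it with a variant that adds a negative lookbehind, (?<!\\d), so the match cannot begin right after another digit. The lookbehind variant is my suggestion rather than part of the answer above, and the names BoundedDigits, stripLoose and stripBounded are made up for the example.

```java
import java.util.regex.Pattern;

public class BoundedDigits {

    // One or two digits, then E or Z; may start mid-number.
    static final Pattern LOOSE = Pattern.compile("\\d{1,2}[EZ]");

    // Same, but the match must not be preceded by another digit.
    static final Pattern BOUNDED = Pattern.compile("(?<!\\d)\\d{1,2}[EZ]");

    static String stripLoose(String line) {
        return LOOSE.matcher(line).replaceAll("");
    }

    static String stripBounded(String line) {
        return BOUNDED.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripLoose("100E"));    // 1     ("00E" was removed)
        System.out.println(stripBounded("100E"));  // 100E  (left untouched)
        System.out.println(stripBounded("11Z"));   // prints an empty line
    }
}
```

If your list never contains numbers of three or more digits before an E or Z, the plain \\d{1,2} form should be enough.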

Notes

To also remove optional parentheses surrounding the pattern, an optional hyphen before it and an optional comma after it, you can use the following idiom: "-?\\(?\\d+[EZ]\\)?,?".

Note that the parentheses need to be double-escaped in this context: \\( in the Java string literal produces the regex \(, which matches a literal parenthesis.
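Here is a small sketch of that idiom run against the sample words from the question, so you can see what it leaves behind. The class name AnnotationCleaner and the method clean are hypothetical; the regex is the one given above.

```java
import java.util.regex.Pattern;

public class AnnotationCleaner {

    // Optional hyphen, optional "(", digits + E/Z, optional ")", optional comma.
    static final Pattern EZ_WITH_CONTEXT =
            Pattern.compile("-?\\(?\\d+[EZ]\\)?,?");

    // Hypothetical helper: removes each annotation plus adjacent punctuation.
    static String clean(String line) {
        return EZ_WITH_CONTEXT.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("PC(20:4(5Z,8Z,11Z,14Z)/17:2(9Z,12Z))"));
        // PC(20:4/17:2)
        System.out.println(clean("13,17,21,25-tetramethyl-5Z-hexacosenoic acid"));
        // 13,17,21,25-tetramethyl-hexacosenoic acid
        System.out.println(clean("6S,9,12S-trimethyl-2E,4E,8E,10E-tetradecatetraenoic acid"));
        // 6S,9,12S-trimethyl-tetradecatetraenoic acid
        System.out.println(clean("7Z,14Z-eicosadienoic acid"));
        // -eicosadienoic acid
    }
}
```

The last case shows a limitation: when the annotations open the word, the hyphen that follows them is not consumed, so you may want a second pass (or a trailing -? in the pattern, at the risk of joining words) depending on how you want those entries to look.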


1 Comment

Ah, that's it! I can't accept your answer for another two minutes for some reason, but yeah, that's perfect. Just out of curiosity, if I also wanted to remove any brackets, hyphens or commas associated with it, for example (11E) or -11Z,9E- etc., how could I incorporate this into the pattern or the .replaceAll method?
