Using regular expressions in shell script

Question

What is the correct way to parse a string using regular expressions in a linux shell script? I wrote the following script to print my SO rep on the console using curl and sed (not solely because I'm rep-crazy - I'm trying to learn some shell scripting and regex before switching to linux).

json=$(curl -s http://stackoverflow.com/users/flair/165297.json)
echo $json | sed 's/.*"reputation":"\([0-9,]\{1,\}\)".*/\1/' | sed s/,//

But somehow I feel that sed is not the proper tool to use here. I heard that grep is all about regex and explored it a bit. But apparently it prints the whole line whenever a match is found - I am trying to extract a number from a single line of text. Here is a downsized version of the string that I'm working on (returned by curl).

{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"\u003cspan title=\"1 silver badge\"\u003e\u003cspan class=\"badge2\"\u003e●\u003c/span\u003e\u003cspan class=\"badgecount\"\u003e1\u003c/span\u003e\u003c/span\u003e"}

I guess my questions are:

What is the correct way to parse a string using regular expressions in a linux shell script?
Is sed the right thing to use here?
Could this be done using grep?
Is there any other command that's easier or more appropriate?

superM · Accepted Answer · 2014-02-26 10:11:08Z

11

The grep command will select the desired line(s) from many but it will not directly manipulate the line. For that, you use sed in a pipeline:

someCommand | grep 'Amarghosh' | sed -e 's/foo/bar/g'

Alternatively, awk (or perl if available) can be used. It's a far more powerful text processing tool than sed in my opinion.

someCommand | awk '/Amarghosh/ { do something }'

For simple text manipulations, just stick with the grep/sed combo. When you need more complicated processing, move on up to awk or perl.
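As a rough illustration of the awk route, the sketch below does the line selection and the extraction in a single awk process. The function name rep_for_user and the sample input are assumptions made for this example; match, RSTART and RLENGTH are standard awk.

```shell
# Sample input modelled on the flair JSON from the question (shortened).
json='{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"..."}'

# rep_for_user is a hypothetical helper: it reads lines on stdin and prints
# the reputation found on lines that mention the name given as $1.
rep_for_user() {
    awk -v user="$1" '
        index($0, user) && match($0, /"reputation":"[0-9,]+"/) {
            # The matched text is "reputation":"<value>"; skip the 14-character
            # prefix and drop the closing quote to keep only <value>.
            value = substr($0, RSTART + 14, RLENGTH - 15)
            gsub(/,/, "", value)   # remove the thousands separators
            print value
        }'
}

printf '%s\n' "$json" | rep_for_user Amarghosh
```

Lines that do not mention the name, or that have no reputation field, produce no output.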

My first thought is to just use:

echo '{"displayName":"Amarghosh","reputation":"2,737","badgeHtml"' |
    sed -e 's/.*tion":"//' -e 's/".*//' -e 's/,//g'

which keeps the number of sed processes to one (you can give multiple commands with -e).
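A slightly more defensive variant, sketched under the assumption that you would rather print nothing than echo the whole line back when the input has no reputation field. It still uses one sed process with several -e commands; the helper name extract_rep is made up for this example.

```shell
# Sample input modelled on the flair JSON from the question (shortened).
json='{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"..."}'

# extract_rep is a hypothetical helper that takes the JSON text as $1.
extract_rep() {
    # -n turns off automatic printing; the commands between { and } run only
    # on lines that contain a reputation field, and p prints the result.
    printf '%s\n' "$1" |
        sed -n -e '/"reputation":"[0-9,]*"/{' \
               -e 's/.*"reputation":"\([0-9,]*\)".*/\1/' \
               -e 's/,//g' \
               -e 'p' \
               -e '}'
}

extract_rep "$json"
```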

edited Feb 26, 2014 at 10:11

superM


answered Oct 28, 2009 at 10:28

paxdiablo


8 Comments

hobbs Over a year ago

I'm a Perl guy myself, but sometimes awk is faster and cleaner for extracting data. It does one thing and it does it pretty well :)

user181548 Over a year ago

@hobbs: you like parsing JSON with regular expressions, but not HTML?

Amarghosh Over a year ago

Thanks pax. @Kinopiko I think that would be because json has a solid structure, but html can be totally out of structure (missing closing braces etc).

Teddy Over a year ago

You obviously don't know sed. It both has loops and can do the selects itself. It is just as "powerful" as (albeit less convenient than) awk, and certainly better than grep.

paxdiablo Over a year ago

I know sed well enough to know awk is better for more complex tasks :-) If you're talking about sed's branch and test commands, they're a horrible kludge, nothing like awk's elegant for statement (similar to C). Any Turing-complete language is as "powerful" as any other but I'd still rather write my applications in Java than machine language.


user181548 · Accepted Answer · 2009-10-28 10:52:16Z

8

You may be interested in using Perl for such tasks. As a demonstration, here is a Perl script which prints the number you want:

#!/usr/local/bin/perl
use warnings;
use strict;
use LWP::Simple;
use JSON;

my $url = "http://stackoverflow.com/users/flair/165297.json";
my $flair = get ($url);
my $parsed = from_json ($flair);
print "$parsed->{reputation}\n";

This script requires you to install the JSON module, which you can do with just the command cpan JSON.

answered Oct 28, 2009 at 10:52

user181548


viam0Zah · Accepted Answer · 2009-10-28 10:56:41Z

5

For working with JSON in a shell script, use jsawk, which is like awk but for JSON.

json=$(curl -s http://stackoverflow.com/users/flair/165297.json)
echo $json | jsawk 'return this.reputation' # 2,747

answered Oct 28, 2009 at 10:56

viam0Zah


1 Comment

Amarghosh Over a year ago

Thanks. Though I think regex is enough for this particular case, it's good to know that there is a json parser for the shell.

mouviciel · Accepted Answer · 2009-10-28 12:03:26Z

3

My proposition:

$ echo $json | sed 's/,//g;s/^.*reputation...\([0-9]*\).*$/\1/'

I put two commands in sed argument:

s/,//g is used to remove all commas, in particular the ones that are present in the reputation value.
s/^.*reputation...\([0-9]*\).*$/\1/ locates the reputation value in the line and replaces the whole line by that value.

In this particular case, I find that sed provides the most compact command without loss of readability.

Other tools for manipulating strings (not only regex) include:

grep, awk, perl mentioned in most of other answers
tr for replacing characters
cut, paste for handling multicolumn inputs
bash itself with its rich ${...} syntax for accessing variables
tail, head for keeping last or first lines of a file
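As a small illustration of the non-regex tools in that list, here is a sketch that uses only cut and tr. It assumes the exact field layout of the sample string from the question (the reputation value is the eighth field when splitting on double quotes), so it is brittle if the JSON layout changes.

```shell
# Sample input modelled on the flair JSON from the question (shortened).
json='{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"..."}'

# cut splits the line on double quotes and keeps field 8 (the reputation
# value for this particular layout); tr then deletes the commas.
rep=$(printf '%s\n' "$json" | cut -d '"' -f 8 | tr -d ',')
echo "$rep"
```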

edited Oct 28, 2009 at 12:03

answered Oct 28, 2009 at 11:34

mouviciel


1 Comment

Amarghosh Over a year ago

Thanks, I didn't know that we can pass more than one command to sed.

Brian Agnew · Accepted Answer · 2009-10-28 10:29:04Z

2

sed is appropriate, but you'll spawn a new process for every sed you use (which may be too heavyweight in more complex scenarios). grep is not really appropriate. It's a search tool that uses regexps to find lines of interest.

Perl is one appropriate solution here, being a scripting language with powerful regexp features. It'll do almost everything you need without spawning separate processes (unlike normal Unix shell scripting) and has a huge library of additional functions.

answered Oct 28, 2009 at 10:29

Brian Agnew


qba · Accepted Answer · 2009-10-28 11:27:21Z

2

You can do it with grep. grep has an -o switch, which extracts only the matching string rather than the whole line.

$ echo $json | grep -o '"reputation":"[0-9,]\+"' | grep -o '[0-9,]\+'
2,747

answered Oct 28, 2009 at 11:27

qba


4 Comments

ghostdog74 Over a year ago

a challenge. how about doing it with just one grep command :)

Amarghosh Over a year ago

@qba thanks for the -o. @ghoshdog74 Using one grep and a sed would be cheating, right ;)

Amarghosh Over a year ago

I think lookbehind is the way to go. Something like (?<=reputation":")[0-9,]+ But I don't know if lookbehind is supported in the shell's regex; the given pattern didn't work for me. Maybe I am not escaping all special characters.

ghostdog74 Over a year ago

@Amarghosh - cheating?? Don't know what you mean. Anyway , my point is: if you can do it in one invocation of grep, why do it 2 times ...
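For what it's worth, a single grep invocation seems possible when grep is built with Perl-compatible regex support. The sketch below assumes GNU grep with the -P option available (it is not part of POSIX grep and is missing from some builds); the function name rep_one_grep is made up for this example.

```shell
# Sample input modelled on the flair JSON from the question (shortened).
json='{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"..."}'

# rep_one_grep is a hypothetical helper that takes the JSON text as $1.
rep_one_grep() {
    # -o prints only the matched text; -P enables PCRE so that the
    # lookbehind (?<=...) can anchor the match without being part of it.
    printf '%s\n' "$1" | grep -oP '(?<="reputation":")[0-9,]+'
}

rep_one_grep "$json"
```

The comma is still in the result, since grep only selects text and does not rewrite it.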

Sinan Ünür · Accepted Answer · 2009-10-31 00:53:28Z

2

1) What is the correct way to parse a string using regular expressions in a linux shell script?

Tools with regular expression capabilities include sed, grep, awk, Perl and Python, to mention a few. Even newer versions of Bash have regex capabilities. All you need to do is look up the docs on how to use them.
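To give an idea of the Bash capability, here is a minimal sketch using the =~ operator inside [[ ]] and the BASH_REMATCH array. It needs bash rather than plain sh, and the variable names are chosen for this example only.

```shell
#!/usr/bin/env bash
# Sample input modelled on the flair JSON from the question (shortened).
json='{"displayName":"Amarghosh","reputation":"2,737","badgeHtml":"..."}'

# Keeping the pattern in a variable avoids quoting problems inside [[ ]].
re='"reputation":"([0-9,]+)"'

if [[ $json =~ $re ]]; then
    rep=${BASH_REMATCH[1]}   # text captured by the first parenthesised group
    rep=${rep//,/}           # strip the thousands separators
    echo "$rep"
fi
```

No external process is started here, which may matter inside a loop.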

2) Is sed the right thing to use here?

It can be, but it is not necessary.

3) Could this be done using grep?

Yes, it can. You would construct a similar regex to the one you would use with sed or the others. Note that grep just does what it does: if you want to modify any files, it will not do it for you.

4) Is there any other command that's easier/more appropriate?

Of course. Regex can be powerful, but it's not necessarily the best tool to use every time. It also depends on what you mean by "easier/appropriate". Another method that needs minimal fuss with regex is the fields/delimiter approach: you look for patterns on which the text can be split. For example, in your case (I have downloaded the 165297.json file instead of using curl, but it's the same):

awk 'BEGIN{
 FS="reputation" # split each line on the word "reputation"
}
{
    m=split($2,a,"\",\"")    # field 2 holds the value you want plus the rest of the line;
                             # split it on "," (quote-comma-quote) and save the pieces in array "a"
    gsub(/[:\",]/,"",a[1])   # now strip the leftover colon, quotes and commas from the first piece
    print a[1]
}' 165297.json

output:

$ ./shell.sh
2747

edited Oct 31, 2009 at 0:53

Sinan Ünür


answered Oct 28, 2009 at 11:31

ghostdog74


4 Comments

Amarghosh Over a year ago

"easier/appropriate" - I am looking for the way people normally do string parsing with regex in shell scripts. This is my first shell script and I wrote this with a lot of help from online man pages. Wanted to make sure this is the normal way to do this.

ghostdog74 Over a year ago

the only tool you will ever need to do string/text parsing, is awk.

Amarghosh Over a year ago

wrt the comment in @qba's answer: I can't seem to do it with a single invocation of grep - how to do it?

ghostdog74 Over a year ago

just combine the 2nd grep's regex with the first grep's regex, which sad to say i am not going to bother myself to come up with. I will let qba give you the answer

pavium · Accepted Answer · 2009-10-28 10:34:41Z

1

sed is a perfectly valid command for your task, but it may not be the only one.

grep may be useful too, but as you say it prints the whole line. It's most useful for filtering the lines of a multi-line file, and discarding the lines you don't want.

Efficient shell scripts can use a combination of commands (not just the two you mentioned), exploiting the talents of each.

edited Oct 28, 2009 at 10:34

answered Oct 28, 2009 at 10:29

pavium


Dennis Williamson · Accepted Answer · 2009-10-28 13:54:49Z

0

Blindly:

echo $json | awk -F\" '{print $8}'

Similar (the field separator can be a regex):

awk -F'{"|":"|","|"}' '{print $5}'

Smarter (look for the key and print its value):

awk -F'{"|":"|","|"}' '{for(i=2; i<=NF; i+=2) if ($i == "reputation") print $(i+1)}'

answered Oct 28, 2009 at 13:54

Dennis Williamson


Sinan Ünür · Accepted Answer · 2009-10-31 01:10:28Z

0

You can use a proper library (as others noted):

E:\Home> perl -MLWP::Simple -MJSON -e "print from_json(get 'http://stackoverflow.com/users/flair/165297.json')->{reputation}"

or

$ perl -MLWP::Simple -MJSON -e 'print from_json(get "http://stackoverflow.com/users/flair/165297.json")->{reputation}, "\n"'

depending on OS/shell combination.

edited Oct 31, 2009 at 1:10

answered Oct 31, 2009 at 1:04

Sinan Ünür


Beejor · Accepted Answer · 2017-04-08 21:31:02Z

Simple RegEx via Shell

Disregarding the specific code in question, there may be times when you want to do a quick regex replace-all from stdin to stdout using shell, in a simple way, using a string syntax similar to JavaScript.

Below are some examples for anyone looking for a way to do this. Perl is a better bet on Mac, since the sed shipped with macOS lacks some options. If you want to get stdin as a variable you can use MY_VAR=$(cat);.

echo 'text' | perl -pe 's/search/replace/g'; # using perl
echo 'text' | sed -e 's/search/replace/g'; # using sed

And here's an example of a custom, reusable regex function. Arguments are source string (or -- for stdin), search, replace, and options.

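regex() {
    # Usage: regex SOURCE|-- SEARCH [REPLACE [OPTIONS]]
    # Use return (not exit) so that calling the function does not end the shell.
    case "$#" in
        ( '0' ) return 1 ;; ( '1' ) echo "$1"; return 0 ;;
        ( '2' ) REP=''; OPT='' ;; ( '3' ) REP="$3"; OPT='' ;;
        ( * ) REP="$3"; OPT="$4" ;;
    esac
    TXT="$1"; SRCH="$2";
    # With -- as the source, take one line from stdin when it is not a terminal.
    if [ "$1" = "--" ]; then [ ! -t 0 ] && read -r TXT; fi
    echo "$TXT" | perl -pe 's/'"$SRCH"'/'"$REP"'/'"$OPT";
}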

echo 'text' | regex -- search replace g;
