Easier sed than done

sed seems to be one of the least used, and possibly most hated, of the standard Unix utilities.  But honestly, I think it gets a bad rap.  Sure, it seems to have a steep learning curve, but vi did too when I first started.  The difference is that I use vi heavily, on a daily basis.  Not so much sed.  I must admit, my sed skills are sketchy at best.  I can only produce meaningful sed scripts with a good reference at hand, but sed is often the best tool for the job.


A friend of mine needed a way to gather lines from one file, treat them as file names, and combine the contents of those files into another file.  He hacked up a 65-line bash script (okay, 38 lines if you don't count blank lines or comments), complete with error checking.  But I couldn't imagine one really needed 38 lines for such a simple task, and I couldn't think of why sed wouldn't be able to handle this much more efficiently.  So I set myself the challenge.

Let's say I want to take all lines that exist between @START and @END.  We will start with the sample file input_file1, which looks like this on the inside:


aaa
@START
bbb
ccc
ddd
@END
eee

I started by reminding myself how to match lines between patterns.  This isn't too hard:


sed -e '/^@START$/,/^@END$/ p'

The /@START/ finds the line that contains @START, and likewise for /@END/.  The ^ anchors the pattern to the beginning of the line, and the $ to the end, so that's the only thing that can be on that line.  Also, the comma says "give me everything between (and including) those lines", and the p prints that line.  So let's try it:


$ sed -e '/^@START$/,/^@END$/ p' input_file1
aaa
@START
@START
bbb
bbb
ccc
ccc
ddd
ddd
@END
@END
eee

Uh-oh, what happened?  All the lines we wanted are duplicated!  sed prints every line it processes by default, plus what we told it to match and print.  Let's tell it not to echo everything with the -n command line switch:


$ sed -n -e '/^@START$/,/^@END$/ p' input_file1
@START
bbb
ccc
ddd
@END

Much better, but we don't want @START and @END in our list of files.  We will change our approach to that of "don't print anything up to the line with @START, and don't print anything from @END to the end of the input".  It looks something like:


$ sed -e '1,/^@START$/ d' -e '/^@END$/,$ d' input_file1
bbb
ccc
ddd

Ahh, no more pesky -n, no more @START and @END in our output.  What we did is run two expressions on input_file1: the first deletes (that is, doesn't show) line 1 through the line containing @START, and the second deletes everything from the line containing @END to the end of the file (symbolized by the $).  Life is looking good, but we're still not there yet.  We need to get the contents of bbb, ccc and ddd and put them in another file.  First, let's generate some sample files:


$ for i in {a..e}; do echo "file ${i}" > ${i}${i}${i}; done

That should do it.  Now I'll use the handy 'cat' utility and bash command substitution to dump all these files' contents:


$ cat $(sed -e '1,/^@START$/ d' -e '/^@END$/,$ d' input_file1)
file b
file c
file d

Good, it worked.  Let's just redirect that to output_file:


$ cat $(sed -e '1,/^@START$/ d' -e '/^@END$/,$ d' input_file1) > output_file
$ cat output_file
file b
file c
file d
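
One caveat with the unquoted $(...): the shell splits sed's output on whitespace, so a file name containing a space would reach cat as two separate arguments.  Here is a minimal sketch of a line-by-line alternative, assuming bash or another POSIX shell.  The scratch directory and the file name "my file" are made up for illustration:

```shell
# Work in a scratch directory so nothing in the current directory is touched.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data: one of the listed names contains a space.
printf 'file b\n' > bbb
printf 'file with space\n' > 'my file'
printf '%s\n' aaa @START bbb 'my file' @END eee > input_file1

# Read sed's output one line at a time; quoting "$name" keeps each line
# intact as a single argument to cat.
sed -e '1,/^@START$/ d' -e '/^@END$/,$ d' input_file1 |
while IFS= read -r name; do
    cat -- "$name"
done > output_file

cat output_file
```

For the simple names used in this post the plain cat $(...) form behaves the same, so this only matters once the list contains spaces or glob characters.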

Done!  Or not.  Turns out he has multiple @START and @END sections for us to process.  Time to make sed do a loop.  For this, we're going to have to go back to our "print lines between (and including) @START and @END" approach.  But first, let's get ourselves a test file with multiple @START and @END sections (we'll call it input_file2):


aaa
@START
bbb
@END
ccc
ddd
eee
@START
fff
ggg
@END
hhh
iii
jjj
@START
kkk
lll
mmm
@END
nnn

That should do it.  Now let's generate content for the files we want to process:



$ for i in {a..n}; do echo "file ${i}" > ${i}${i}${i}; done


Good, and now for our loop:



$ sed -n -e ':x;/^@START$/,/^@END$/ p;tx' input_file2
@START
bbb
@END
@START
fff
ggg
@END
@START
kkk
lll
mmm
@END

We have everything between all the @START and @END tags, now we need to eliminate the tags themselves.  This should be easy:

$ sed -n -e ':x; /^@START$/,/^@END$/ p; t x' input_file2 | sed -e '/^@\(START\|END\)$/ d'
bbb
fff
ggg
kkk
lll
mmm

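That ":x" sets up a label called 'x'.  Strictly speaking, "t x" branches to label 'x' only if an s (substitute) command has succeeded since the last input line was read, and this script contains no s command, so the branch is never taken.  The range address does the real work: once it sees @END it goes back to looking for the next @START, so sed -n '/^@START$/,/^@END$/ p' alone produces the same output.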
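
The second sed process can also be folded into the first: a { } block attached to the range can delete the marker lines and print whatever is left.  A minimal sketch, assuming a POSIX sed, with no label or branch.  It recreates input_file2 in a scratch directory rather than relying on the file above:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Recreate input_file2 from the text.
printf '%s\n' aaa @START bbb @END ccc ddd eee @START fff ggg @END \
    hhh iii jjj @START kkk lll mmm @END nnn > input_file2

# Inside each @START..@END range, delete the two marker lines and
# print everything else.
names=$(sed -n '/^@START$/,/^@END$/ {
    /^@START$/ d
    /^@END$/ d
    p
}' input_file2)

printf '%s\n' "$names"
```

This should list bbb, fff, ggg, kkk, lll and mmm, one per line, the same names the two-process pipeline produces.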

And last step, the cat-and-redirect, and a quick test:

$ cat $(sed -n -e ':x; /^@START$/,/^@END$/ p; t x' input_file2 | sed -e '/^@\(START\|END\)$/ d') > output_file
$ cat output_file
file b
file f
file g
file k
file l
file m

Success!  Hopefully you learned to love sed a little more than you did before, and maybe even came away with a few useful sed tricks you can use later.

EDIT 2012-02-03:
So it seems there's a clearer and easier solution using AWK:

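awk '
BEGIN { do_print = 0 }           # print nothing until the first @START
{
    if ($0 == "@START") {
        do_print = 1             # entering a block
    }
    else if ($0 == "@END") {
        do_print = 0             # leaving a block
    }
    else if (do_print == 1) {
        system("cat " $0)        # $0 is a file name; dump its contents
    }
}' < input_file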
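
A variation on the same idea, offered as a sketch rather than a drop-in replacement: awk can read each named file itself with getline instead of starting a shell and cat through system(), which also sidesteps quoting problems with unusual file names.  The flag name in_block and the sample files are made up for illustration:

```shell
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Hypothetical sample data in the style of input_file2.
for i in a b c d; do echo "file ${i}" > "${i}${i}${i}"; done
printf '%s\n' aaa @START bbb @END ccc @START ddd @END > input_file

result=$(awk '
$0 == "@START" { in_block = 1; next }   # entering a block; skip the marker
$0 == "@END"   { in_block = 0; next }   # leaving a block; skip the marker
in_block {
    # Copy the named file line by line, then close it.
    while ((getline line < $0) > 0) print line
    close($0)
}' input_file)

printf '%s\n' "$result"
```

Only bbb and ddd sit between markers here, so only those two files are copied.  A name that cannot be opened is silently skipped in this sketch, since getline returns -1 and the loop ends.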

Comments

  1. I like the fact that we are exploiting existing tools to do the work. The only downside to this approach is that it is not exactly obvious as to what is going on. Trying to fix a bug in it could be challenging or impossible for some.

  2. Indeed, this solution is a bit convoluted, and it's been bugging me for some time. I spent a couple hours fooling around with some of sed's commands for manipulating the hold space, to no avail, then I had an epiphany.

    Instead of printing from @START to @END, why not simply delete from @END to @START? Then there are only the edge conditions of deleting from the first line to the first @START and from the last @END to the end of file. The shorter and cleaner solution (though still not intuitive for those who aren't sed-savvy):

    sed '1,/^@START$/ d; :x /^@END$/,/^@START/ d; t x; /^@END$/,$ d'

    Maybe some investigation into other tools like AWK will be worthwhile.

  3. One small addition/correction. The script I just posted won't work if @START is the first line of the file. To correct that, you would have to rely on the GNU extension, if available, of the 0 line address:

    sed '0,/^@START$/ d; :x /^@END$/,/^@START/ d; t x; /^@END$/,$ d'
