sed behaving strangely when -n and the delete command are combined

Fri Jul 10 09:54:39 UTC 2009

On Fri, Jul 10, 2009 at 9:23 AM, Rob Landley<rob at landley.net> wrote:
> So the first page of the patch is still nothing but white noise...

Okay... Attached 5.patch is the meat of the change, and 6.patch is
white noise...

>>@@ -893,35 +894,51 @@ static void process_files(void)
>>
>>            || (!sed_cmd->beg_line && !sed_cmd->end_line
>>
>>                && !sed_cmd->beg_match && !sed_cmd->end_match)
>>            /* Or did we match the start of a numerical range? */
>>-           || (sed_cmd->beg_line > 0 && (sed_cmd->beg_line == linenum))
>>+           || (sed_cmd->beg_line > 0 && (sed_cmd->beg_line == linenum
>>+                           /* "shadowed beginning" case: "1d;1,ENDp" - p
>> still matches at line 2 +                            * even though 1d
>> skipped line 1 which is a start line for p */
>
> It's somewhat disturbing to have a comment inside a nested parenthetical.  The
> indentation gives no hints so it's a bit hard to track the structure here.

What?

                        /* Or did we match the start of a numerical range? */
                        || (sed_cmd->beg_line > 0 &&
(sed_cmd->beg_line == linenum
                                                        /* "shadowed
beginning" case: "1d;1,ENDp" - p still matches at line 2
                                                         * even though
1d skipped line 1 which is a start line for p */
                                                        ||
(sed_cmd->end_line && sed_cmd->beg_line < linenum && sed_cmd->end_line
>= linenum)
                                                        ||
(sed_cmd->end_match && sed_cmd->beg_line < linenum)
                                                )
                        )

> I note the comment could just be one line:
>
> /* "1d;1,ENDp" should start matching at line 2. */

Are you serious? I mean, do you really complain that I do not write
comments in exactly the way you like?

>> || (sed_cmd->end_line && sed_cmd->beg_line < linenum && sed_cmd->end_line
>> >= linenum)
>
> This is more complicated than it needs to be.
>
> If you're going to slavishly copy what gnu does, you might want to run a few
> more tests to see what that actually is.  Specifically:
>
>  echo -e "one\ntwo\nthree\nfour" | sed -ne '2d;2,1p'
>  three
>
> So the logic can be simplified: when your start is a number it triggers on the
> next line >= to that number.   Just change the match start to >=, and then
> disable numerical starting matches are disabled after the first hit.

I did not think about this example. Let me try... aha, now works:

# echo -ne "first\nsecond\nthird\nfourth\n" | sed -n '1d;2,1p'
second
# echo -ne "first\nsecond\nthird\nfourth\n" | ./busybox sed -n '1d;2,1p'
second

But this breaks another case:

# echo -ne "first\nsecond\nthird\nfourth\n" | sed -n '1d;1,3p'
second
third
# echo -ne "first\nsecond\nthird\nfourth\n" | ./busybox sed -n '1d;1,3p'
(nothing)

See? You waste your time writing up agressive emails
but you make mistakes too.

> Beyond
> that, you want to fall through like it was already doing to the normal match
> ending logic, which has had <= linenum since my days.
>
> So you can get closer to the gnu behavior by changing the code _less_.  Which

Nice theory, until you actually try to implement it instead of just describing
how easy it is and how idiotic someone else's code is.

> means you don't have to avoid snapshotting the line and continuing on to the
> "should we end the match now" logic.
>
> Also, why do you discard the comment about snapshotting the line?  The

-               /* Snapshot the value */
                matched = sed_cmd->in_match;

Because comment says that we are saving a value. Well.
That's kind of obvious. The comment should say why we do that
in order to be useful.

>>+       /* Is this line the end of the current match? */
>>        if (matched) {
>>-           sed_cmd->in_match = !(
>>+           int n = (
>>                /* has the ending line come, or is this a single address
>> command? */
>> -               (sed_cmd->end_line ?
>>+               sed_cmd->end_line ?
>>                    sed_cmd->end_line == -1 ?
>>                        !next_line
>>
>>                        : (sed_cmd->end_line <= linenum)
>>                    :
>>                    : !sed_cmd->end_match
>>
>>-               )
>>+               );
>
> I think this is more random cosmetic changes mixed in with the functional
> ones?  Except for not setting in_match immediately, which I think is actually
> a bad thing.
>
>>+           if (!n) {
>>                /* or does this line matches our last address regex */
>>-               || (sed_cmd->end_match && old_matched
>>+               n = (sed_cmd->end_match
>>+                    && old_matched
>>                     && (regexec(sed_cmd->end_match,
>>-                                pattern_space, 0, NULL, 0) == 0))
>>+                                pattern_space, 0, NULL, 0) == 0)
>>            );
>>+               if (n && sed_cmd->beg_line > 0) {
>>+                   /* Once matched, "n,regex" range is dead, disabling it
>> */ +                   regfree(sed_cmd->end_match);
>>+                   free(sed_cmd->end_match);
>>+                   sed_cmd->end_match = NULL;
>>+               }
>>+           }
>>+           sed_cmd->in_match = !n;
>>        }
>
> Ok, buried in the random wordwrapping changes, you're now calling regfree in
> the middle of a sed, which seems way overkill.

sed_cmd->in_match = !(complex multi line expression)

was already somewhat hard to understand. When I needed to throw in
even more into it, it got really bad. So I split it into
separate statements:

                if (matched) {
                        int n = (
                                /* has the ending line come, or is
this a single address command? */
                                sed_cmd->end_line ?
                                        sed_cmd->end_line == -1 ?
                                                !next_line
                                                : (sed_cmd->end_line <= linenum)
                                        : !sed_cmd->end_match
                                );
                        if (!n) {
                                /* or does this line matches our last
address regex */
                                n = (sed_cmd->end_match
                                     && old_matched
                                     && (regexec(sed_cmd->end_match,
                                                 pattern_space, 0,
NULL, 0) == 0)
                                );
                                if (n && sed_cmd->beg_line > 0) {
                                        /* Once matched, "n,regex"
range is dead, disabling it */
                                        regfree(sed_cmd->end_match);
                                        free(sed_cmd->end_match);
                                        sed_cmd->end_match = NULL;
                                }
                        }
                        sed_cmd->in_match = !n;
                }

I do not see what's wrong here from the style POV.

> If we're in a match, and beg_line > 0, set beg_line = -1.  That's all you need
> to do to prevent the match from triggering again after we end it.  You can
> even do that test every time, it's small and simple and really cheap, happens
> totally in L1 cache, and more or less parallelizes away on a modern processor
> with multiple execution cores.

This makes sense. Except for now it can erroneously match here:

                        /* Or did we match last line of input? */
                        || (sed_cmd->beg_line == -1 && next_line == NULL);

I guess I can try to use -2

>>        /* Okay, so did this line match? */
>>-       if (sed_cmd->invert ? !matched : matched) {
>>+       if (sed_cmd->invert ? matched : !matched)
>>+           continue; /* no */
>>+
>
> Um, not noise.  This is, in fact, a functional change totally unrelated to the
> rest of the patch of the kind that tends to cause subtle bugs, which I used to
> be very sensitive to in sed because it's so complicated and fiddly.
>
> You left this change in, but didn't change the indentation of everything that
> used to be a block, so the indentation is now misleading.

I did. You complained that the patch is messed up. So I rediffed it with
diff -bB to make it look better. Now you complain about *that*.

I did mix the fix with whitespace and comment changes.
This complaint is valid.

Please review attached 7.patch which fixes 2d;2,1p

function                                             old     new   delta
process_files                                       2173    2120     -53

--
vda
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 5.patch
Type: text/x-patch
Size: 2048 bytes
Desc: not available
URL: <http://lists.busybox.net/pipermail/busybox/attachments/20090710/d571090d/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 6.patch
Type: text/x-patch
Size: 15863 bytes
Desc: not available
URL: <http://lists.busybox.net/pipermail/busybox/attachments/20090710/d571090d/attachment-0004.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 7.patch
Type: text/x-patch
Size: 2301 bytes
Desc: not available
URL: <http://lists.busybox.net/pipermail/busybox/attachments/20090710/d571090d/attachment-0005.bin>