AW: [RFC/PATCH v2 3/5] libbb: add ends_with() function

Jody Bruchon jody at jodybruchon.com
Tue Aug 25 12:22:57 UTC 2015


I did not see the email you've quoted on the mailing list. Interesting. 
I shall respond to it now.

On 2015-08-25 03:00, dietmar.schindler at manroland-web.com wrote:
>> Von: Tito
>> Gesendet: Montag, 24. August 2015 21:29
>>
>> Hi Jody,
>> so you pass the buck to the silly caller to guess what is consistent,
>> but even in the few replies on this thread there were different views
>> about it.

Yes, I will gladly pass the buck to the silly caller. It is the function 
caller's responsibility to use the function properly and check for error 
conditions the caller actually cares about. There is no way to know what 
exactly will be done with a chunk of published code in ten years or in 
ten hours. Basic C programming rules say to quietly work with as many 
inputs as possible and fail gracefully when something goes wrong. In the 
case of asking "ends with?" you get to choose between returning "int: 
true/false" or "int: more possible outputs than simple true/false" or 
"pointer to a string or NULL on failure." Since increasingly complex 
return values tend to require increasingly complex code to handle those 
possibilities, the question posed is most simply answered as a 
true/false question: "does string X end with Y?" Choosing for Y == 
<empty> to return true or false is linguistically nonsensical but 
*logical from a byte-for-byte analysis perspective* for the reasons I 
discussed in my previous message.

>> In my opinion it would be better to cover such corner cases with
>> meaningful responses by analyzing the semantics of this new function
>> that checks if a _string_ ends with a specific _string_ or as somebody
>> suggested a specific _suffix_.

More meaningful responses (by which we are really discussing "return 
values" or--in one of the worst cases--structs passed as pointers and 
modified by the called function as the "return value") require more 
complicated code to handle those responses. Keeping responses as simple 
as possible minimizes code size AND complexity. Adding complexity is 
always bad by default unless adding complexity is ultimately the only 
"simple" solution. Passing struct pointers around in network packet 
handling code is unavoidable due to the information density required of 
the task, for example, but it makes no sense to apply such density to 
the straightforward question of "does this string end with this other 
string?"

We could always perform very complex analysis of such a simple question 
and return a series of bit flags for possibilities; for example, "yes, 
but the case of the text differs" or "no, but if you offset left by one 
character, yes" or "yes, and both substrings are in upper case." How 
complex should the answer to the question be? How much time should we 
waste trying to make our response more meaningful?

IMO for such a simple question, the calling code should be able to do 
something like this if it wants and just be done with it:

if (ends_with(a, b) == 1) do_true_work();
else do_false_work();

In this case, if b == '' and that's a problem for the calling code, the 
calling code should be going "if (*b == '\0') handle_empty_case()" and 
dealing with it prior to calling ends_with(). Likewise, it's possible 
that the author of the calling code knows that the data their code is 
working with can safely ignore what happens when empty strings are 
passed to ends_with() and attempting to "help" the author will only 
frustrate them instead and possibly force them to make more obnoxious 
code sort of like this:

x = ends_with(a, b);
switch (x) {
   case 0:
     do_false_work();
     break;
   case 1:
   case -1:
     do_true_work();
     break;
   default:
     do_unknown_retval_error_message();
}

or a much smarter programmer not interested in a different response 
might instead do one of these to bypass such "helpfulness" in responses:

if (*b != '\0') {
   if (ends_with(a, b) == 0) {
     do_false_work();
     goto did_false_work;
   }
}
do_true_work();
did_false_work:
...

if (ends_with == 0) do_false_work();
else do_true_work();
...

An exercise for the reader: what happens to the code in all these 
workarounds if we have 0, 1, and 2 as return values for 
{false,true,empty} and then "improve" the function later by changing 
various "corner cases" to return negative values, or if we add return 
value 3 for {ends_with_if_shifted_left_by_one_char} or any other such 
"improvement" of similar type? Unless all *potential* future return 
value sets are clearly documented, we could break things in interesting 
ways!

Or, we just return "true/false" in the simplest possible way and stop 
pretending we know more than the calling programmer does about their code.

>> This suffix being non-existent is a anomaly that probably the caller
>> would like to be informed about.

Then let them check for it. The caller knows their input data set. We do 
not and we can never know in advance what will be thrown at any of the 
reusable functions we write.

>>> If a useless answer for empty strings is a problem, the writer of the calling
>>> code needs to handle that themselves, and is in a far more flexible
>>> position to do so than the person writing the called function that only
>>> answers "yes" or "no."
>>
>> Why should a function give a useless answer?
>> Wouldn't it be better for the caller to give him
>> always useful answers (return different values, set errno, etc.).

If the caller asks a useless question, the caller gets a useless answer. 
It is up to the caller to make sure they ask the question correctly. 
Only the caller knows how to ensure the data set that feeds the 
ends_with() function (or any other function for that matter) presents 
sane and valid data in the function parameters. See above for why 
pretending to be smarter than the programmer calling the function can 
make a mess.

>>> I also say that we're all C programmers if we're here fiddling with the
>>> guts of BusyBox and we can all easily understand why returning "yes" for
>>> "ends with \0?" is a technically correct result since all C strings end
>>> with \0.
>>
>> This is a somewhat dogmatic argument:
>> "The people of the inner circle know it better."
>>
>> SusV3 instead for the strstr function of which ends_with() could be
>> considered a subset of functionality (just looks for s2 at the end of
>> s1) says:
>>
>> "#include <string.h>
>>
>>       char *strstr(const char *s1, const char *s2);
>>
>> DESCRIPTION
>>
>>       The strstr() function shall locate the first occurrence in the
>> string pointed to by s1 of the sequence of bytes (excluding the
>> terminating null byte) in the string pointed to by s2.
>>
>> RETURN VALUE
>>
>>       Upon successful completion, strstr() shall return a pointer to the
>> located string or a null pointer if the string is not found.
>>
>>       If s2 points to a string with zero length, the function shall
>> return s1.
>>
>> ERRORS
>>
>>       No errors are defined."
>>
>> So as you can see they explicitly request a specific action for the case
>> of a zero length key:
>>
>>       "If s2 points to a string with zero length, the function shall
>> return s1."
>>
>> Not that I like their solution but still this is a fact and makes me
>> think that we should also handle this corner case in better way
>> rather than return a "useless value".

We aren't writing strstr() and strstr() asks a different question than 
ends_with() does. The strstr() question is "Where is the first 
occurrence of string 'needle' inside string 'haystack'?" This 
necessitates returning a pointer within 'haystack' to the first instance 
of the string 'needle', and because this returns either NULL or a valid 
pointer rather than true/false, someone had to make an *arbitrary 
choice* for what the answers should be in the special case of needle == 
<empty>.

Because strstr() can't perform its desired functionality if it considers 
the terminating '\0' (which makes it effectively the same thing as 
ends_with()), strstr() only operates on the string data prior to the 
terminator. Linguistically, it follows that you can't find a 
non-existent needle in any given haystack because "nothing in 
particular" encompasses any response we might decide to give. In the 
case of an empty needle, the easiest answer is to just return the 
pointer to the start of the haystack or NULL.

The choice for strstr() was probably purely arbitrary because there's no 
linguistic or technical logic to choose which return value an empty 
substring should return; in any case, testing for *needle == '\0' before 
trying to do strstr() is the way to catch empty substrings, not 
analyzing the limited return value set that strstr() provides.

In a way, strstr() returns true/false but with pointers (NULL/valid) 
instead of integers. Returning true for ends_with() and an empty ending 
string actually makes our behavior consistent with that of strstr(), and 
consistency like that is important to maintain when possible.

Side note: I wrote an improved strstr() implementation that I submitted 
to uClibc but they never reviewed and committed it. Anyone know who I 
need to give it to? I keep hearing about a uClibc-ng but I dropped the 
uClibc mailing list when it seemed to have become nothing but patch 
submissions followed by maintainer pings.

-Jody


More information about the busybox mailing list