Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide shell version of strgen.c to allow use without building a binary. #13

Merged
merged 13 commits into from
Apr 22, 2019

Conversation

stesser
Copy link
Contributor

@stesser stesser commented Apr 19, 2019

The shell version has been verified to lead to identical object files from generated C source files.
It should be a fully compatible replacement for strgen.c.
The reason I rewrote this tool is that I have prepared an import of bc/dc into the FreeBSD base system and having to compile a build tool as a prerequisite of building bc and dc is to be avoided, if at all possible.

I have verified, that the compilation succeeds if "GEN = strgen" is replaced by "GEN = strgen.sh" in Makefile.in.

On systems that do not directly execute the shell script that way, GEN_EMU may be set to /bin/sh or any other Bourne Shell compatible interpreter.

stesser added 3 commits April 19, 2019 10:06
The output has been verified to be identical to that of the C language
version for all files in this project by comparing the resulting .o files.
@codecov
Copy link

codecov bot commented Apr 19, 2019

Codecov Report

Merging #13 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff           @@
##           master     #13   +/-   ##
======================================
  Coverage    99.9%   99.9%           
======================================
  Files          16      16           
  Lines        3091    3091           
======================================
  Hits         3088    3088           
  Misses          3       3
Impacted Files Coverage Δ
src/vm.c 99.56% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2f76173...ecd6481. Read the comment docs.

@gavinhoward
Copy link
Owner

Okay, this patch has more advanced POSIX shell scripting than I know, so I am going to make sure I understand it first, after reading about exec and redirection. In other words, I am not reviewing the code; I am reviewing my understanding of it. (I will review later, if it's necessary.)

First, I am assuming that the lines after line 53 were to help you write the script and can be removed after I merge. Please let me know if I am wrong.

Second, I assume that lines up to line 19 are just handling command-line arguments, the same way that strgen.c does.

Lines 21 and 22 are what I am not sure I understand. The exec page says that the file descriptors are open according to the redirection. What I think that means is that those lines are for making the script's stdin refer to the input file and the script's stdout refer to the output file. Is that correct?

I will assume that is the case for the rest of this comment.

Lines 24-35 seem to be handling the optional command-line arguments.

Lines 37-50 seem to be where the magic is happening. Since the script is running cat with a here doc, and its stdout is pointed to the output file, these lines should be outputting the result of the here doc to that file, correct?

Going forward on that assumption, the here doc appears to be of the same form used by strgen.c. ${condstart} and ${condend} make sure that there is a #define if one was passed in, while $nameline adds the name if one was given.

Then lines 46 and 47 is is where more magic happens. That is where the array is created.

Line 47 is still a mystery to me. Obviously, the $(...) form is making the result of the sed call be embedded in the here doc. That's not a mystery to me. The sed call itself is, so I am going to go through it slowly.

sed -e "$remtabsexpr "'1,/^$/d; s:\\n:\\\\n:g; s:":\\":g; s:^:":; s:$:\\n":'

First, it appears to be using the -e option to allow it to read from stdin, which is pointing to the input file.

The first part is easy: it is using $remtabsexpr as part of the script to remove tabs. However, when it gets to the "'1,/^$/d; part, I am lost. It seems as though you needed to use double quotes to expand $remtabsexpr, but needed the script to all be one whole, so there is no space between the double-quoted part of the script and the single-quoted part of the script, which needs to be single-quoted because of the use of double quotes inside. Is that correct?

Anyway, if it is, then the 1,/^$/d; part of the sed script seems to be using addressing. The 1 is probably an address that refers to the first character, the , separates the addresses, and the /^$/ seems to say the next address is an empty line. If all of that is true, then this part of the script is what deletes the comment at the beginning of each file.

The s:\\n:\\\\n:g; part of the script seems to be replacing \n occurrences with \\n.

The s:":\\":g; part of the script seems to be replacing occurrences of " with \".

The s:^:":; part of the script seems to be placing " at the beginning of one line, probably the one containing the actual data from the input file.

The s:$:\\n": appears to be doing the same thing, but adding a \n" at the end of the input data instead.

What this probably means is that instead of creating an char array listed as numbers, it is creating a string literal.

Am I correct about everything? Or did I get something wrong?

Copy link
Owner

@gavinhoward gavinhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hate to say this, but this change breaks the bc completely. It causes seg faults (see the build and click on any one of the four jobs in the build). The original code was correct, and it had a cast for a reason. I will attempt to explain why.

The code for bc's vectors is sort-of a copy of Dyna's vector, also written by me. The idea is a type-agnostic container, kind of like the C++ STL, but without templates. In order to be type-agnostic, it has to implement by hand, and at runtime, the sort of code that you would get by compiling C++ templates. That means that it has to manually calculate pointers to elements.

As you can probably tell from Dyna, it uses a char array underneath as well, but it returns a void*, which the user can then assign to a pointer of the correct type. The reason that it has a char* underneath but returns a void* is because the C standard doesn't require casts to or from a void*, but it also says that pointer arithmetic on void* is undefined, while on char*, it is guaranteed to be calculated by individual bytes, since char is defined as the smallest addressable type.

The bc vector does exactly the same thing (see bc_vec_item() and friends). Basically, even though it is a char* underneath, it is doing an address calculation, taking the size of the type that the container holds into account.

In bc_vm_envArgs(), the vector v is created to hold the type char*. This means that underneath, in bc_vec_item(), it is doing a size calculation equivalent to:

v.v + idx * sizeof(char*)

This all means that for v specifically, even though v.v is technically a char*, it actually contains a char*, so it is actually a char**.

As a result, by taking its address, you are sending a char*** to bc_args(), which then, assuming it has a char**, sends that to getopt_long(), which then accesses wrongly and causes the seg fault.

Perhaps a clearer line of code would be the following:

s = bc_args((int) v.len - 1, bc_vec_item(&v, 0));

That line is functionally equivalent to the original line of code.

The reason I did not use that was because I did not want to assume that the compiler would optimize it away, but since it only happens once per run (at most), it might be better to do that instead.

Please either change it back or change it to the line of code above.

@stesser
Copy link
Contributor Author

stesser commented Apr 20, 2019

Okay, this patch has more advanced POSIX shell scripting than I know, so I am
going to make sure I understand it first, after reading about |exec|
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_20
and redirection
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_07.
In other words, I am not reviewing the code; I am reviewing my understanding
of it. (I will review later, if it's necessary.)

Hi Gavin,

understood ... I could have made the script far more cryptic, BTW ;-)

First, I am assuming that the lines after line 53 were to help you write the
script and can be removed after I merge. Please let me know if I am wrong.

Yes, I had pasted in some information to help me get the parameters right.
I thought I had removed them before the commit, but apparently not.

Second, I assume that lines up to line 19 are just handling command-line
arguments, the same way that |strgen.c| does.

Yes, the script takes up to 7 parameters and does just assign variable
names for later reference.

Lines 21 and 22 are what I am not sure I understand. The |exec|
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_20
page says that the file descriptors are open according to the redirection
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_07.
What I think that means is that those lines are for making the script's
|stdin| refer to the input file and the script's |stdout| refer to the output
file. Is that correct?

Yes, exec used in that way connects the files as if the script had been
started with re-directed STDIN and STDOUT.

The script could just as well wrap use redirection on the "cat << EOF",
but I did not go for the most compact (as in LOC) style.

I will assume that is the case for the rest of this comment.

Lines 24-35 seem to be handling the optional command-line arguments.

Yes, this could also be done at the places where the variables are used
(e.g. with ${var:+...}) but I prefer the more verbose style which also
makes debugging a lot easier.

Lines 37-50 seem to be where the magic is happening. Since the script is
running |cat| with a here doc, and its |stdout| is pointed to the output file,
these lines should be outputting the result of the here doc to that file, correct?

Yes.

Going forward on that assumption, the here doc appears to be of the same form
used by |strgen.c|. |${condstart}| and |${condend}| make sure that there is a
|#define| if one was passed in, while |$nameline| adds the name if one was given.

Yes.

Then lines 46 and 47 is is where more magic happens. That is where the array
is created.

Line 47 is still a mystery to me. Obviously, the |$(...)| form is making the
result of the |sed| call be embedded in the here doc. That's not a mystery to
me. The |sed| call itself is, so I am going to go through it slowly.

|sed -e "$remtabsexpr "'1,/^$/d; s:\n:\\n:g; s:":\":g; s:^:":; s:$:\n":' |

First, it appears to be using the |-e| option to allow it to read from
|stdin|, which is pointing to the input file.

The -e is just followed by any sed commands that are passed on the command
line (instead of provided in a command file).

The first part is easy: it is using |$remtabsexpr| as part of the script to
remove tabs. However, when it gets to the |"'1,/^$/d;| part, I am lost. It
The $remtabexpr could also be changed to remove only tabs at the beginning
of a line (and later tabs could be kept or replaced by space characters), but
given the use of tabs in the input files just for indentation, I did not
bother to write the more complex rule required for that ...

seems as though you needed to use double quotes to expand |$remtabsexpr|, but
needed the script to all be one whole, so there is no space between the
double-quoted part of the script and the single-quoted part of the script,
which needs to be single-quoted because of the use of double quotes inside. Is
that correct?

Yes, using double quotes had been possible, but then the \ characters had to
be duplicated everywhere.

Anyway, if it is, then the |1,/^$/d;| part of the |sed| script seems to be
using addressing. The |1| is probably an address that refers to the first
character, the |,| separates the addresses, and the |/^$/| seems to say the
next address is an empty line. If all of that is true, then this part of the
script is what deletes the comment at the beginning of each file.

Yes. The sed commands can be preceded with a line range, where number are
taken as line numbers, /../ as RE patterns that are triggered by the line
that matches. The range is inclusive and this makes "1,/^$/" match the first
line to the first empty line and the "d" delete those lines including the
empty line that ends the range. Since "1" is an absolute line, this command
is only triggered once at that line number and ignored after the end condition
has been met for the first time.

The |s:\n:\\n:g;| part of the script seems to be replacing |\n| occurrences
with |\n|.

Yes.

The |s:":\":g;| part of the script seems to be replacing occurrences of |"|
with |"|.

Yes.

The |s:^:":;| part of the script seems to be placing |"| at the beginning of
one line, probably the one containing the actual data from the input file.

Yes.

The |s:$:\n":| appears to be doing the same thing, but adding a |\n"| at the
end of the input data instead.
Exactly.

What this probably means is that instead of creating an char array listed as
numbers, it is creating a string literal.

Yes, one long string literal that is broken in lots of double-quoted fragments, which are to be
concatenated by the C pre-processor to a single string literal in Standard C.

Am I correct about everything? Or did I get something wrong?
Yes, you got everything right.

@gavinhoward
Copy link
Owner

Thank you for your patience with me as I understood the script. To me, it is imperative that I understand everything in my repo, in case I need to make changes.

Okay, this patch has more advanced POSIX shell scripting than I know, so I am
going to make sure I understand it first, after reading about |exec|
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_20
and redirection
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_07.
In other words, I am not reviewing the code; I am reviewing my understanding
of it. (I will review later, if it's necessary.)

Hi Gavin,

understood ... I could have made the script far more cryptic, BTW ;-)

I am so glad that you didn't. :P

First, I am assuming that the lines after line 53 were to help you write the
script and can be removed after I merge. Please let me know if I am wrong.

Yes, I had pasted in some information to help me get the parameters right.
I thought I had removed them before the commit, but apparently not.

Second, I assume that lines up to line 19 are just handling command-line
arguments, the same way that |strgen.c| does.

Yes, the script takes up to 7 parameters and does just assign variable
names for later reference.

Lines 21 and 22 are what I am not sure I understand. The |exec|
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_20
page says that the file descriptors are open according to the redirection
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_07.
What I think that means is that those lines are for making the script's
|stdin| refer to the input file and the script's |stdout| refer to the output
file. Is that correct?

Yes, exec used in that way connects the files as if the script had been
started with re-directed STDIN and STDOUT.

You explained it so much more concisely.

The script could just as well wrap use redirection on the "cat << EOF",
but I did not go for the most compact (as in LOC) style.

Totally fine. I don't care about LOC, I care about readability, which you seem to have prioritized. (The sed call wasn't very readable, but that couldn't have been avoided in any way.)

I will assume that is the case for the rest of this comment.
Lines 24-35 seem to be handling the optional command-line arguments.

Yes, this could also be done at the places where the variables are used
(e.g. with ${var:+...}) but I prefer the more verbose style which also
makes debugging a lot easier.

I agree. Thank you.

Lines 37-50 seem to be where the magic is happening. Since the script is
running |cat| with a here doc, and its |stdout| is pointed to the output file,
these lines should be outputting the result of the here doc to that file, correct?

Yes.

Going forward on that assumption, the here doc appears to be of the same form
used by |strgen.c|. |${condstart}| and |${condend}| make sure that there is a
|#define| if one was passed in, while |$nameline| adds the name if one was given.

Yes.

Then lines 46 and 47 is is where more magic happens. That is where the array
is created.

Line 47 is still a mystery to me. Obviously, the |$(...)| form is making the
result of the |sed| call be embedded in the here doc. That's not a mystery to
me. The |sed| call itself is, so I am going to go through it slowly.
|sed -e "$remtabsexpr "'1,/^$/d; s:\n:\n:g; s:":":g; s:^:":; s:$:\n":' |
First, it appears to be using the |-e| option to allow it to read from
|stdin|, which is pointing to the input file.

The -e is just followed by any sed commands that are passed on the command
line (instead of provided in a command file).

The first part is easy: it is using |$remtabsexpr| as part of the script to
remove tabs. However, when it gets to the |"'1,/^$/d;| part, I am lost. It

The $remtabexpr could also be changed to remove only tabs at the beginning
of a line (and later tabs could be kept or replaced by space characters), but
given the use of tabs in the input files just for indentation, I did not
bother to write the more complex rule required for that ...

The original strgen.c doesn't do anything particularly clever either, so technically, your rule is equivalent to what it does. Even if you wanted to write a more complex rule, I would say keep this one, for consistency.

seems as though you needed to use double quotes to expand |$remtabsexpr|, but
needed the script to all be one whole, so there is no space between the
double-quoted part of the script and the single-quoted part of the script,
which needs to be single-quoted because of the use of double quotes inside. Is
that correct?

Yes, using double quotes had been possible, but then the \ characters had to
be duplicated everywhere.

Because this is the case, would it be possible to put $remtabsexpr as an argument to its own -e option? In other words, can we separate it from the others like so:

sed -e "$remtabsexpr " -e '1,/^$/d; s:\n:\\n:g; s:":\":g; s:^:":; s:$:\n":'

That way, it a little clearer to read, at least for me, and it seems the POSIX standard requires implementations to be able to take more that 1 -e option, and it requires them to concatenate them.

Anyway, if it is, then the |1,/^$/d;| part of the |sed| script seems to be
using addressing. The |1| is probably an address that refers to the first
character, the |,| separates the addresses, and the |/^$/| seems to say the
next address is an empty line. If all of that is true, then this part of the
script is what deletes the comment at the beginning of each file.

Yes. The sed commands can be preceded with a line range, where number are
taken as line numbers, /../ as RE patterns that are triggered by the line
that matches. The range is inclusive and this makes "1,/^$/" match the first
line to the first empty line and the "d" delete those lines including the
empty line that ends the range. Since "1" is an absolute line, this command
is only triggered once at that line number and ignored after the end condition
has been met for the first time.

Clever. And I am glad there was a solution to this.

The |s:\n:\n:g;| part of the script seems to be replacing |\n| occurrences
with |\n|.

Yes.

The |s:":":g;| part of the script seems to be replacing occurrences of |"|
with |"|.

Yes.

The |s:^:":;| part of the script seems to be placing |"| at the beginning of
one line, probably the one containing the actual data from the input file.

Yes.

The |s:$:\n":| appears to be doing the same thing, but adding a |\n"| at the
end of the input data instead.

Exactly.

What this probably means is that instead of creating an char array listed as
numbers, it is creating a string literal.

Yes, one long string literal that is broken in lots of double-quoted fragments, which are to be
concatenated by the C pre-processor to a single string literal in Standard C.

Am I correct about everything? Or did I get something wrong?

Yes, you got everything right.

Okay, perfect.

There is just one problem: the limit on the length of string literals in the C standard. (See §5.2.4.1.1.) This was the original reason I used strgen.c to generate a char array rather than a string literal. (The standard imposes no limits on char arrays.) Now, I was doing it as a char array mostly for the 1 C89 project that I was involved with: busybox, and C89 has a limit of 511 characters, if I remember correctly. That said, I am no longer involved with busybox and can use C99 exclusively, which has a limit of 4095 characters.

Right now, all of the items that strings are generated from produce strings less than 4095 characters, so technically, this script would work for everyone.

However, I have been thinking about adding a few more functions to gen/lib2.bc (mostly to help programmers), and that is the longest one at about 3100 (including with the header comment and tabs removed). I could see gen/lib2.bc becoming larger than 4095 characters in the future, in which case, it would technically be undefined behavior to have a string literal that large.

Now, I assume that clang does not impose such a limit (in fact, the standard explicitly encourages not imposing one), so FreeBSD would not be affected. However, other platforms might be (they like to compile this bc with pcc on OpenBSD, and I would not assume that it does not impose that limit.

Therefore, my proposal is to keep strgen.c as the default, but allow users to select strgen.sh when running configure.sh. We can decide how to select from one of two ways:

  1. If the environment variable SHELL_GEN (or GEN_SHELL, whichever) is set and not 0, strgen.sh is selected.
  2. The environment variable GEN is set to strgen.sh (it would default to strgen).

Personally, I prefer (1). What do you think?

I will go with (1) by default if I don't get a reply before you go to bed. (I am assuming you are in Germany.)

@gavinhoward gavinhoward merged commit 9887b31 into gavinhoward:master Apr 22, 2019
@gavinhoward
Copy link
Owner

gavinhoward commented Apr 22, 2019

I will be doing some work on this, but otherwise, it is good enough. Merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants