Commandname Extensions Considered Harmful
Erlkönig: Commandname Extensions Considered Harmful
Created: 2016-01-23
Some complementary material on interpreter directives.
You can also read the email in which Dennis Ritchie introduced #!.

ABSTRACT

The command name for any Unix script must be stable for any complex system based on it to be stable. However, this is being compromised through practices based on misinformation. † This paper explores how scripts are actually run, how naming affects correctness and stability, and various common misconceptions in order to clarify the reasons behind standard practice - which is: Command names should never have filename extensions.

Command name extensions have numerous issues:
Ironically, all of these are problems involving interpretation by humans.

Herein, a problem with filename extensions is described in a manner perhaps more pragmatic than, yet inspired by, the well known Go To Statement Considered Harmful by Edsger W. Dijkstra (Communications of the ACM, Vol. 11, No. 3, March 1968). Dijkstra's work addresses the issue of how the use of the go to statement largely abridges the ability to parametrically describe the progress of a process, engendering an unnecessary impediment to the code's clarity and manageability. This new document details, based on practical experience under Unix-like operating systems, how filename extensions, particularly but not limited to those files implementing commands, create a secondary set of semantic tags in the interfaces between programs which are demonstrably both superfluous and treacherous. It's not a coincidence that in both Dijkstra's plaint and this one, computers are not at all affected by either practice - it's entirely a problem for just the humans.

What is a Command

For purposes of this paper, command names are the filenames of all the executable files in the directories in the Unix $PATH environment variable.
By convention, almost all such directories end in bin
(nominally suggestive for […]).

Didactic Examples
Consider the following examples, in which files have
.sh and .py
extensions, ostensibly to indicate the type of the file as well as to
make it easy to list all files of the same type (for example, all shell scripts).
Running them based on the apparently-correct interpreter doesn't go well:
$ sh frob.sh
./frob.sh: 2: Syntax error: # […] what?
$ ./frob.sh
hello world # ✔
$ cat frob.sh
#!/usr/bin/perl -w
# Used to be a shell script,
# but we couldn't change the name...
use strict;
printf("hello world\n");
$
$ python knob.py
  File "knob.py", line 2
    func = print
               ^
SyntaxError: invalid syntax # what?
$ ./knob.py
hello world # ✔
$ cat knob.py
#!/usr/bin/python3
func = print
func("hello world")
$
$ sh qux.sh
qux.sh: 3: qux.sh: Syntax error: # […] what?
$ ./qux.sh
hello world # ✔
$ cat qux.sh
#!/bin/bash
cat <<<'hello world'
$
These scripts show some problems around trusting extensions:
Failures aren't always as obvious as immediately exiting with an error. More subtle distinctions in script language execution, or a script with sufficient error trapping to survive being run with the wrong interpreter version, could produce incorrect results and cause serious damage:
$ python divide.py 5 2
2 # what?
$ ./divide.py 5 2
2.5 # ✔
$ cat divide.py
#!/usr/bin/python3
import sys
a, b = int(sys.argv[1]), int(sys.argv[2])
print(a / b)
$
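The mismatches in the examples above can be detected mechanically by comparing what an extension suggests with what the interpreter directive actually says. The following is only a minimal sketch: the `GUESSED` table and both function names are hypothetical, invented here for illustration, and the `env` handling ignores options such as `env -S`.

```python
import posixpath

# Hypothetical table of what a reader would guess from an extension.
# This mapping is an assumption for illustration, not a standard.
GUESSED = {".sh": "sh", ".py": "python", ".pl": "perl"}


def interpreter_name(first_line):
    """Return the basename of the interpreter named in a #! line, or None."""
    if not first_line.startswith("#!"):
        return None
    words = first_line[2:].split()
    if not words:
        return None
    name = posixpath.basename(words[0])
    # With "#!/usr/bin/env prog", the interpreter is the next word.
    if name == "env" and len(words) > 1:
        name = words[1]
    return name


def extension_misleads(filename, first_line):
    """True when the extension suggests a different interpreter than the directive."""
    guess = GUESSED.get(posixpath.splitext(filename)[1])
    actual = interpreter_name(first_line)
    return guess is not None and actual is not None and guess != actual


if __name__ == "__main__":
    samples = [
        ("frob.sh", "#!/usr/bin/perl -w"),
        ("knob.py", "#!/usr/bin/python3"),
        ("qux.sh", "#!/bin/bash"),
    ]
    for name, line in samples:
        print(name, interpreter_name(line), extension_misleads(name, line))
```

All three sample files from this section are flagged, since perl, python3, and bash each differ from what the extension implies.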
(The same issue can arise through command search in $PATH finding a different version of a program than expected, especially when using virtual environments, but that's outside the scope of this document.)

Methods of Specifying Interpreters for Scripts

Several mechanisms exist to determine how a file should be executed, whether as a set of directives or as machine code. The ones relating to this discussion are:
Interpreter Directives are an Intrinsic Part of File Content

Interpreter directives can only be changed by modifying the files' contents, whereas file extensions can be changed arbitrarily using general filesystem commands like mv. File extensions also have a disturbing tendency to get lost in some contexts, since they're part of a Unix directory entry, not part of the file itself. In contrast, interpreter directives are quite stable. With scripts, interpreter directives are typically changed in the same manner as the other contents: with a text editor. Modern editors can usually recognize scripts by their interpreter directive, although historically special handling of certain types of text files was usually done based on the file extension.

Humans are the Problem

Now, so far, command name extensions might look like no more than hints to editors to use the correct editing mode, or to humans to make it easy to ls by script type. The kernel doesn't view them specially at all - they're just more bytes in the filename. But there is an insidious problem with them, in that using them breaks part of the mechanism by which the implementation details are hidden from the user, and from other programs written by users. It's the humans' attempt to apply the information in these command name extensions that causes problems.

Effects of Porting Programs Between Languages

Programs in Unix often start their lives as quickly written, inefficient, under-featured shell scripts. Later, they get converted to something faster, like PERL or Python. Finally, they are often rewritten in C, C++, or something else fully compiled. If the author violates encapsulation by exposing the underlying language in a spurious extension, the command name may change from name.sh, to name.pl, to name, breaking all existing coded calls to the program each time, as well as adding to the cognitive load of human users.
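To make the encapsulation point concrete, here is a small sketch of a caller that knows a command only by its bare name and so survives a rewrite in another language. The command name `greet` and the helper functions are hypothetical. The sketch assumes a Unix-like system with /bin/sh and a temporary directory that permits execution.

```python
import os
import subprocess
import sys
import tempfile


def install(bindir, name, source):
    """Write an executable script called `name` into `bindir`."""
    path = os.path.join(bindir, name)
    with open(path, "w") as handle:
        handle.write(source)
    os.chmod(path, 0o755)
    return path


def call_greet(bindir):
    """A caller that knows only the command name, never its language."""
    env = dict(os.environ)
    env["PATH"] = bindir + os.pathsep + env.get("PATH", "")
    result = subprocess.run(["greet"], env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout


def demo():
    """Run the same caller before and after the command is ported."""
    with tempfile.TemporaryDirectory() as bindir:
        # First life: a quick shell script.
        install(bindir, "greet", "#!/bin/sh\necho hello\n")
        before = call_greet(bindir)
        # Later: rewritten in Python, under the same name.
        install(bindir, "greet", "#!" + sys.executable + "\nprint('hello')\n")
        after = call_greet(bindir)
    return before, after


if __name__ == "__main__":
    print(demo())
```

Had the command been named greet.sh and then greet.py, `call_greet` would have needed editing at each port.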
The more effective the user base has been at script-based factoring and reuse, the more treacherous the extensions become (i.e., proficient users often build more readily on preëxisting programs, increasing the number of dependencies on the names of those programs). To combat the problem of breaking dependencies, what usually happens is that when the name.sh script ends up being rewritten in (for example) PERL, the now-misleading old name is retained to keep from breaking other programs which refer to it. The resulting mismatch causes extra maintenance hassles, principally for users trying to maintain the extensions, who naïvely type things like ls -l *.sh without realizing some of the listed files aren't shell scripts anymore. Such semantic dissonance leads easily to more serious issues, with scripts called by the wrong interpreters in error-suppressed contexts, truncated processing due to the resulting errors, and the resulting arbitrarily disastrous problems.

Command Name Extensions Are Often Wrong - and Subtly
The issue of using the wrong interpreter can be subtle, since a user
seeing a name.py program may enter
python name.py, not realizing that the program
only works with Python 2.5 when 2.4 is still the system default
(the program itself would have a directive like #!/usr/bin/python2.5).
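The difference between running ./name.py and typing python name.py can be sketched by building the argument vector that the interpreter directive implies. This is an illustration only: `parse_directive` and `kernel_argv` are hypothetical names, and the sketch assumes the Linux behavior of passing everything after the interpreter path as a single argument (other systems may split it differently).

```python
def parse_directive(first_line):
    """Split a #! line into (interpreter, optional_argument), or None."""
    if not first_line.startswith("#!"):
        return None
    # Whitespace between "#!" and the path is allowed, e.g. "#! /bin/sh".
    parts = first_line[2:].strip().split(None, 1)
    if not parts:
        return None
    interpreter = parts[0]
    # Assumption: the remainder is passed as ONE argument (Linux behavior).
    argument = parts[1].strip() if len(parts) > 1 else None
    return interpreter, argument


def kernel_argv(script, first_line, args=()):
    """Build the argv implied by the directive when `script` is executed."""
    parsed = parse_directive(first_line)
    if parsed is None:
        raise ValueError("no interpreter directive in: %r" % first_line)
    interpreter, argument = parsed
    argv = [interpreter]
    if argument is not None:
        argv.append(argument)
    return argv + [script] + list(args)


if __name__ == "__main__":
    # Executing the file directly honors the version the author chose:
    print(kernel_argv("./name.py", "#!/usr/bin/python2.5"))
    # Typing "python name.py" instead runs whatever "python" is first in
    # $PATH, ignoring the directive entirely.
```

The directive pins the interpreter the author tested with, and a guess based on the .py extension discards that information.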
Most scripts suffixed with […]

Some Command Name Extensions Matter - But Not To the Kernel

There are cases where scripts are executed as a result of special extensions, such as the model currently used by most webservers where file handling is cued by filename extensions. However, even such subsystems often have other, more sophisticated approaches allowing those same extensions to be hidden, and thus protect URIs from a variant of the script filename extension problem, namely, how to keep all links to your website from breaking when you switch from *.html files to *.cgi, *.php, or something else. Furthermore, of the extensions just listed, note that .html files aren't scripts, .php files use a webserver builtin, and .cgi scripts themselves require interpreter directives to be executed correctly, as well as the .cgi extension for Apache to permit the script to be run.

Commands should never have filename extensions.

Rely on interpreter directives instead, or on some other paradigm that prevents the implementation from being exposed, or worse yet, lied about, within the very name of the command. The best place for that information is in the file itself, though as noted, there are some issues to deal with through #!/usr/bin/env and other tactics.

Appendices

Python

So you have this file named foo.py...
There's a case where, in some bin/ directory, there are both a foo.py implementing a library, and a foo implementing the options parsing and using the library. In this situation the foo is executable and the foo.py isn't, and because the .py file isn't a command, this situation is fine (though rare). As an example, here's a library hellolib.py and a program hi just as described above (save for the names):
# hellolib.py library
import unittest

def hi(whom=None):
    return 'Hello' + ((' ' + whom) if whom
                      else ' World')

class TestLib(unittest.TestCase):
    def test_hi(self):
        self.assertEqual('Hello World', hi())
        self.assertEqual('Hello You', hi('You'))

#!/usr/bin/python
# "chmod 755 hi" so you can run ./hi
import hellolib, sys
someone = (sys.argv[1] if len(sys.argv) > 1
           else None)
print(hellolib.hi(someone))
There's no point in being able to run ./hellolib.py or python hellolib.py, because we're obviously just going to run nosetests hellolib anyway, as per standard practice. Otherwise, we'd have to add the rather ugly, though accepted, lines below:
⋮
# addendum to hellolib.py
if __name__ == '__main__':
    unittest.main()
...which is a bit nasty, since we'd have to either add execute permission on the library file as well as a #! line, or guess at which version of Python is needed to run it manually, e.g. python hellolib.py. Also, enabling execute permission makes nosetests's decision of whether it's safe to import the file (without causing side effects) much harder, so it doesn't test executable files by default, and we risk the unittest in our library being skipped.

Listing All Files of the Same Script Type (sh, py, etc)

The issue of users wanting to be able to list, for example, all Bourne shell scripts easily with ls(1) is a big motivator to some people to name them all with .sh extensions. If ls had an option to filter based on the execution method of a file, say something like ls -e '*/sh' to list only files with /sh at the end of the first part of the interpreter directive, that would help. However, whether ls should even be doing such a job would probably be hotly, justifiably contested.

Here's an example of using a new program to address this problem:

$ cd /bin
$ scripts /bin/sh | wc -l
10
$ scripts /bin/sh
./bzgrep
./bzmore
./running-in-container
./setupcon
./unicode_start
./lesspipe
./red
./bzdiff
./bzexe
./which
$ less $(scripts /bin/sh)
# …in a pager, looking at shell scripts...
$ head -1 $(scripts)
==> ./zforce <==
#!/bin/bash
==> ./bzgrep <==
#!/bin/sh
==> ./bzmore <==
#!/bin/sh
==> ./gunzip <==
#!/bin/bash
# …and so on…
$ cd /usr/bin
$ head -1 $(scripts) | grep '#!' | sort | uniq -c | sort -nr
191 #!/bin/sh
104 #!/usr/bin/perl -w
102 #!/usr/bin/perl
58 #! /usr/bin/python # about that space, now uncommon
53 #! /bin/sh
# …38 variants overall, with some of these less common:
7 #!/usr/bin/ruby1.9.1
3 #!/usr/bin/fontforge -lang=ff
2 #!/usr/bin/pypy
1 #!/usr/bin/env nickle
A sample script implementing the command (obviously with no extension in case someone wants to rewrite it in Python, Ruby, C, etc.). Note that this needs #!/bin/bash specifically, since classic /bin/sh doesn't support $(...) or local.

#!/bin/bash
# Return a list of scripts having a given string in the interpreter directive.

Syntax () {
    local regexp="$1"
    echo "Syntax: $0 [<regexp> [<file>|<dir>]...]"
    echo "        $0 {-h | --help}"
    echo '  <regexp> - used to match interpreter directives'
    echo '  <file>   - report file if <regexp> matches'
    echo '  <dir>    - report each file in <dir> for which <regexp> matches'
    echo 'If no <file> or <dir> is given, "." is used as a default.'
    echo 'Give a <regexp> of "." to use the default ('"$regexp"') with <file>/<dir>.'
    echo 'NOTE: Only executable files are considered.'
}

# Succeed if file $1 is executable and its first line matches regexp $2.
# The "--" must follow the options, or grep treats "-qs" as the pattern.
ScanFile () { [ -x "$1" ] && head -n 1 "$1" | grep -Eqs -- "$2" ; }

ScanStuff () {
    local found=false
    local regexp="$1" ; shift
    local thing dir file
    for thing in "$@" ; do
        if [ -d "$thing" ] ; then
            dir="$thing"
            # List regular files directly inside $dir without descending.
            # -maxdepth is a GNU/BSD find extension, not POSIX.
            while IFS= read -r file ; do
                ScanFile "$file" "$regexp" && echo "$file" && found=true
            done < <(find "$dir" -maxdepth 1 -type f -print)
        else
            file="$thing"
            ScanFile "$file" "$regexp" && echo "$file" && found=true
        fi
    done
    $found    # exit status: success only if something matched
}

Main () {
    local regexp='^#!'
    case "$1" in
        --) shift ;;
        -h|--help) Syntax "$regexp" ; exit 0 ;;
        -*) Syntax "$regexp" 1>&2 ; exit 1 ;;
    esac
    # A <regexp> of "." keeps the default, as the help text promises.
    [ $# -ge 1 ] && { [ "$1" = . ] || regexp="$1" ; shift ; }
    [ $# -eq 0 ] && set .
    ScanStuff "$regexp" "$@"
}

Main "$@"
#---eof
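As an illustration of how little the implementation language matters to callers, the core of the same command could be sketched in Python. This is not the author's code: the function names are hypothetical, option parsing is omitted, and Python's `re` syntax differs in places from the extended regular expressions used by grep.

```python
import os
import re


def first_line(path):
    """Return the first line of a file as text, or '' if it cannot be read."""
    try:
        with open(path, "rb") as handle:
            return handle.readline().decode("utf-8", "replace").rstrip("\n")
    except OSError:
        return ""


def scan_file(path, regexp):
    """True if path is an executable regular file whose first line matches."""
    return (os.path.isfile(path) and os.access(path, os.X_OK)
            and re.search(regexp, first_line(path)) is not None)


def scripts(regexp="^#!", things=(".",)):
    """List matching files; directories are searched one level deep only."""
    found = []
    for thing in things:
        if os.path.isdir(thing):
            with os.scandir(thing) as entries:
                candidates = sorted(entry.path for entry in entries
                                    if entry.is_file(follow_symlinks=False))
        else:
            candidates = [thing]
        found.extend(path for path in candidates if scan_file(path, regexp))
    return found


if __name__ == "__main__":
    # Roughly equivalent to running: scripts /bin/sh
    for path in scripts("/bin/sh"):
        print(path)
```

Installed under the same name, scripts, this version would be invoked exactly as before.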
Obviously we can reimplement scripts in any language we want without telling any of its other users, because it doesn't have some [expletive deleted] extension on the end, and so for everyone else it'll just keep working.

Why Are So Many Developers Recently Misusing Extensions?

This is... a theory.

In the late 1980s (based on my experience at the time), commandname extensions were essentially absent from the Unix realm. Almost all scripting was either in Bourne shell, or in the Csh that a few screwballs (including myself and others) tried to make work as a scripting language. Ksh, Tcsh, and a few others were used at some sites. Interpreter directives were required for all of them except Bourne shell scripts, since sh would attempt to execute an executable script via the kernel, but if that failed it would just assume it was an sh script (they ALL were a decade before, so it made some sense), and spawn a shell to interpret it, which worked badly when the script was actually written in any of the other things.

In the 1990s, commandname extensions showed up occasionally when DOS/Windows users started poking at Linux and dragging along the DOS extension concept with them. However, DOS hides filename extensions - you can run a DOS script even if the extension is omitted when invoking it - so in theory they were hiding metadata (and, coincidentally, creating an inroad for Trojan attacks) instead of exposing the implementation language. In contrast, Unix requires the entire name of the file to run commands - including any extensions (or a string of them) since they're just more characters - the . isn't special to the kernel, just part of the name. Essentially the DOS practice is totally wrongheaded in the Unix environment. Fortunately, during this period more experienced Unix users tended to educate the DOS arrivals soon enough to keep the practice from being all that common.
In the 2000s, and increasingly in 2010 and beyond, there was a sudden explosion in commandname extensions, not from the DOS migrants but rather from a new sub-population of programmers in languages like PHP, PERL (to some extent), Python, Ruby and others - all languages which were NOT compiled, whose libraries tend to require extensions, and whose users typically had little to no grounding in Unix fundamentals and hadn't worked in C (which produces executables without extensions most of the time). These programmers improperly overgeneralized the use of extensions from libraries to command scripts, and then wrote lots of documentation that included this aberrant practice. And now, suddenly they're everywhere, doing it wrong while thinking it's right (that's what the docs say, after all), and driving those who actually know how it works slightly insane.

So now we the insane ones are writing little webpages like this to tell the interpreted-language crowd, please, please be more sparing in your extensions. They don't belong on commands. Really. Ever. Every time you mutilate a command by putting an extension on it, some angry computing god out there kills a kitten. Please - think of the kittens.