Working in CommandLine Interface, Shell, and Bash
This is the script for the talk given at IPM Observational Astronomy Meeting [^1]. It is based on the IAC SMACK talks [^2].
[^1] https://ipm-oam.github.io/presentations/2022/Working_in_Command-Line_Interface/ [^2] https://gitlab.com/makhlaghi/smack-talks-iac
Original author:
Pedram Ashofteh-Ardakani <pedramardakani@pm.me>
Contributing author(s):
Mohammad Akhlaghi <mohammad@akhlaghi.org>
Copyright (C) 2022 Free Software Foundation, Inc.
The content and the scripts in this work are free work: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This file is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this file. If not, see http://www.gnu.org/licenses/.
info bash
The _C_ommand-_L_ine _I_nterface (CLI) works with the keyboard, where you can get exactly what you need, manipulate it the way you want, and automate the procedure using scripts.
With a _G_raphical _U_ser _I_nterface (GUI), you are bound to what “the designers” have given you. You’re not in control of what you see and can’t automate the procedure. See how hard it is to test a GUI program (I feel you, “front-end developers”). There are sophisticated tools created just for testing GUIs, and still it is not easy.
Not to mention how slow you can be when chasing the mouse around the screen. Sure, “seeing” the commands can be convenient, but only for when you’re new to the ecosystem and don’t know the program’s capabilities.
One more thing: a GUI changes from one program, or even one version, to another. With one upgrade, you might completely lose your way around. A well-designed CLI, however, has a backward-compatible nature. Which means, after an upgrade you are still on track while enjoying the new features.
Finally, in the CLI you can pass the results of your work from one program to another. This is called “piping”: an awesome and extremely useful feature that is impossible by design in a GUI.
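For a first taste of piping, here is a one-liner chaining four small programs; /etc/passwd is used only because it exists on virtually every GNU/Linux system:

```shell
# List how many distinct login shells are configured on this system:
# 'cut' keeps the 7th ':'-separated field of /etc/passwd,
# 'sort' orders the lines, 'uniq' collapses the duplicates,
# and 'wc -l' counts what is left.
cut -d: -f7 /etc/passwd | sort | uniq | wc -l
```

Each program does one small job; the pipe hands its output to the next.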
info bash
Bash is just one form of shell, hence the name Bourne-Again SHell (pun intended in the naming). There are older shells such as sh, and more recent ones such as zsh (pronounced zee-shell).
The purpose of the shell is to link different programs using the unified standard input (or input files) and standard output (or output files). Therefore, unlike common “programming” languages like C or Python, numbers for example are just strings of characters (which happen to be digits).
Of course, it also has structures similar to C or Python (like if or for); but in the shell these are higher-level constructs to make it easy to conditionally run programs or run them in a loop.
Learning about the basic capabilities such as loops and conditionals goes a long way. But you’re not limited to that. There are many GNU programs and other CLI tools that come to your aid. Without them, you might need thousands of lines of code to accomplish the simplest tasks.
Bash enables you to unleash the power of modularity. That is, you have access to many programs that are specialized to do one and only one job the best way possible. Combining these tools will liberate you from using various libraries and plug-ins for the programming language you use.
Python has its own merits. It’s great for quickly developing solutions. However, it evolves quickly and sometimes breaks backward compatibility. Your Python script might not work on an older computer, or even on your own computer a few months (or two years at most) from now!
The transition from Python 2 to Python 3 has not been easy for software that depended on it. For instance, have a look at the hardship the Large Synoptic Survey Telescope (LSST) pipeline went through [1]. It took almost a decade and a lot of extra engineering to make the update without crashing the whole thing while it was running on real data from the Subaru Telescope’s Hyper Suprime-Cam, which was using the LSST pipeline in those years (and still is!).
[1] https://arxiv.org/abs/1712.00461
In contrast, it is most likely that your bash script executes just fine on a computer as old as 1998! This backward compatibility is a key backbone of reproducibility, an important element that is missing in many scientific endeavors these days.
So be nice to yourself, and use more stable and reproducible programming languages for the long run.
Another issue with Python is that most of its modules are themselves written in Python, needing many layers of interpretation before your command reaches the CPU. On the other hand, bash mostly runs compiled programs that talk directly to the kernel.
Compiled programs are in general faster than interpreted programs, since they have already been translated to the specific machine you’re using. With this, you are able to fine-tune your programs and follow the best practices that allow for extremely useful optimizations. For example, in the Warp program of Gnuastro, we are dealing with tens of millions of pixels, and hundreds of millions of operations on them. Even 1 microsecond of delay at this huge scale could result in at least one extra minute, while 1 millisecond means 18 extra hours of runtime!
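As a rough, machine-dependent illustration of the interpreted-versus-compiled gap, you can time the same million-element sum done by one compiled tool (AWK) and by a pure shell loop; the exact figures will differ on every computer, but the ordering won’t:

```shell
# Sum 1..1000000 inside AWK: one compiled process does the whole loop.
time seq 1000000 | awk '{s += $1} END {print s}'

# The same sum as an interpreted shell loop: every iteration goes
# through the shell's parser, so expect it to be much slower.
time { s=0; for i in $(seq 1000000); do s=$((s+i)); done; echo $s; }
```

Both print 500000500000; only the elapsed times differ.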
Bash is a layer of abstraction over a very powerful set of tools. Don’t be fooled by its easy looks and syntax. As soon as you get to know your way around the manual, it can do wonders for you.
With that said, let’s dive in.
# Later I want to show the convenience of using 'alias'
unalias ls ll
https://archive.stsci.edu/hlsp/uvudf
- Where are we?
# Kernel name (Linux, Darwin, etc.)
uname --kernel-name
# Operating system (GNU/Linux, macOS, etc.)
uname --operating-system
# Current location (i.e. the present working directory)
pwd
- Who goes there?!
ls
ls --help
ls --color
ls -l
ls -ltrha --color
- But this is a lot of options to remember, and a lot to type, so let’s set an alias for ls and ll:
alias ls="ls --color"
alias ll="ls -ltrha --color"
- Change directory
cd w/bash-tutorial
- Check download URL
cat url.txt
- Create directory
mkdir dataset
ls
- Download the Hubble Space Telescope (HST) UVUDF survey catalog.
# The UVUDF survey catalog: https://archive.stsci.edu/prepds/uvudf
curl https://archive.stsci.edu/missions/hlsp/uvudf/v2.0/hlsp_uvudf_hst_wfc3-uvis_udf-epoch3_multi_v2.0_cat.fits
# We could have given it the output name in the first place by passing the '-o' option
curl -o catalog-raw.fits \
     https://archive.stsci.edu/missions/hlsp/uvudf/v2.0/hlsp_uvudf_hst_wfc3-uvis_udf-epoch3_multi_v2.0_cat.fits
- Copy the catalogue file with a better name
cd ~/w/bash-tutorial
cp dataset/hlsp_uvudf_hst_wfc3-uvis_udf-epoch3_multi_v2.0_cat.fits catalog-raw.fits
- Convert to text
# Just bear with me, we're creating a human readable file from a binary
# FITS format using Gnuastro's Table program. You'll learn about it in
# future sessions.
asttable catalog-raw.fits --txtf64format fixed -o catalog-raw.txt
- Inspect the file with less
less catalog-raw.txt
- Print the first 97 rows
head -97 catalog-raw.txt
- They all start with ‘#’, so we can get them with grep as well (no need to speculate)
# Contains the word 'Column' (case sensitive)
grep Column catalog-raw.txt
# Use the --color option to see the matches grep --color Column catalog-raw.txt
# Or make it case insensitive grep -i column catalog-raw.txt
Note that simply writing # would return an error, since the pound sign has a special meaning: “comment”. Comments are parts of a line that the shell ignores: bash throws away whatever comes right after the pound sign. To avoid that, we sandwich the ‘#’ between single quotes. This comes up whenever you’re looking for non-alphabetic characters, as they might have special meanings. Be careful and sandwich them between ‘single quotes’.
# Bad form
grep # catalog-raw.txt
# Correct form
grep -e '#' catalog-raw.txt
# [Advanced] use regex to match lines that start with the pound sign '#'
# (read more about regular expressions in the grep manual).
grep -e '^#' catalog-raw.txt
- Now write that to a new file, and write the body to another file as well
grep -e '^#' catalog-raw.txt > header.txt
grep -ve '^#' catalog-raw.txt > data.txt
- Let’s check the header again, this time with more and cat
cat header.txt
more header.txt
less header.txt
Note that if we don’t add the .txt extension, nothing bad happens! The computer doesn’t care: it knows what these files contain. Extensions are only for us humans, and they can be helpful when categorizing files. Wanna try? See:
file header.txt
file catalog-raw.fits
- The data has many occurrences of -99 and 99, which are intended as placeholders for values that are not actually available. But leaving them as numbers can ruin our statistics without failing (a logical error, the nastiest kind of error). So let’s replace them with nan, as in ‘Not a Number’:
# See that the -99 are replaced with nan
sed -e's/ -99 / nan /g' data.txt
# But we need to store this data somewhere; also, we need to replace
# 99 and the floating point -99.0000000000000 (and the positive number)
# as well! So let's combine all of these criteria inside one 'sed' call
sed -e's/ -\?99 / nan /g' -e's/ -\?99.0*0 / nan /g' \
    data.txt > catalog.txt
- Now, let’s say we need to extract the spectroscopic redshifts, denoted by SPECZ, from the raw catalog. First, we’d have to figure out the column number. But instead of scrolling through the 97 columns, let’s just grep it!
# Note that the order of the options could matter; in this case, it doesn't.
grep SPECZ header.txt
# Let's put it in a new file
grep -i 'specz ' header.txt > select.txt
# Check available filters
grep -i mag_ header.txt
# Let's get the 435 filter as well
grep -i mag_f435w header.txt
# Suppose there's a lot of them and we can't just remember them. Let's
# put it in a new file for later reference:
grep -i mag_f435w header.txt > select.txt
# BUT WAIT! It just overwrites the file! So we'd have to append with >>
rm select.txt
grep -i ' id ' header.txt > select.txt
grep -i ' specz ' header.txt >> select.txt
grep -i 'mag_f435w' header.txt >> select.txt
grep -i 'mag_f606w' header.txt >> select.txt
grep -i 'mag_f775w' header.txt >> select.txt
How can we show them at the same time? Use the pipe | character as “or” in the regular expression. Since grep’s basic regex treats it as an ordinary character, we need to escape it with a backslash \ to get the special “or” meaning:
grep -i -e'mag_f435\|mag_f606' header.txt
Feeling bad about all the new information? You can get all of the information from here:
info grep
- How about putting some columns in a separate file?
Even better, let’s do some arithmetic over them simultaneously!
awk '{print $1}' catalog.txt
# [Advanced] We actually didn't need to put the data in a separate file
# just to use AWK more easily. AWK takes regex as well. For example
# (writing to a new file, so we don't clobber the full catalog.txt):
awk '!/^#/{print $1}' catalog-raw.txt > ids.txt
less ids.txt
See how the regex seems similar in both grep and awk? This happens a lot: when you learn a concept, it usually applies to other programs as well, especially in the GNU family.
# Get the ID, SPECZ, F435W, F606W, F775W. We want the ID so we can
# identify the final results for later use
cat select.txt
awk '{print $1, $94, $10, $11, $12}' catalog.txt
# But I don't want to see all of them; just the last line would
# suffice. How can we use "tail" here? Use the pipe "|"!
awk '{print $1, $94, $10, $11, $12}' catalog.txt | tail -1
# Let's calculate F435W-F775W to estimate "color"
awk '{print $1, $94, $10, $11, $12, $10-$12}' catalog.txt | tail -1
awk '{print $1, $94, $10, $11, $12, $10-$12}' catalog.txt > magnitudes.txt
- Now select the reddest objects
# We're saying: where the 6th column is greater than 3, print it
# (the default behavior)
awk '$6>3' magnitudes.txt
# Explicitly print all columns (that's $0)
awk '$6>3 {print $0}' magnitudes.txt
# Only their ID and SPECZ
awk '$6>3 {print $1, $2}' magnitudes.txt
# Save them in a file
awk '$6>3' magnitudes.txt > reddest.txt
But it has lots of ‘nan’ values, let’s filter them out as well:
# Add conditions. Also, "nan" is a string, so sandwich it between
# double quotes. In AWK, single quotes have a special meaning: they
# mark the start and end of the commands, so let's be nice and not
# confuse it.
awk '$6>3 && $2!="nan"' magnitudes.txt
It is OK, let’s put it in another catalog:
awk '$6>3 && $2!="nan"' magnitudes.txt > reddest-with-z.txt
- Count how many objects we’ve got so far:
# Use word count
wc reddest-with-z.txt
# Also, open the help and check the options
wc --help
# Now check lines, characters, etc. for the demo
wc -l reddest-with-z.txt
# Compare with the previous catalog
wc -l magnitudes.txt
- Now let’s sort by SPECZ in ascending order
sort -nk2 reddest-with-z.txt
- How do we get the object with the max redshift?
sort -nk2 reddest-with-z.txt | tail -1
- What is its value?
sort -nk2 reddest-with-z.txt | tail -1 | awk '{print $2}'
- We only need 3 decimals:
sort -nk2 reddest-with-z.txt | tail -1 | awk '{printf "%.3f\n", $2}'
- Sneak peak at Gnuastro’s Table program:
# Bug in Table's range! I used grep since '--range=SPECZ,-98,98'
# printed the '99' values as well!
asttable catalog-raw.fits -cID,SPECZ,10,11,12,'arith $10 $12 -' --sort=SPECZ \
    | asttable --range=6,3:inf --txtf64format fixed \
    | grep -ve' -\?99.0*0 '
- Let’s say we’d want a random floating point number as the last column when we’re creating mock galaxies, etc.
How do we create random numbers?
First we’d need to learn about regular and special variables: how do we get or set them? There are rules for that:
- Names start with letters (case sensitive), never with a digit; to split words, use the underscore “_” character:
foo=1
Foo=2
echo $foo
echo $Foo
2a=5   # We get an error here!
response="YAY!"
echo $response
echo "$USER: is this fun?"
echo "audience: $response"
- Simple arithmetic only works with integers, NOT floating points!
echo $(( 5+12 ))
echo $(( $foo+$Foo ))
# Put this into another variable
bar=3
baz=17
foo=$(( $bar+$baz ))
echo $foo
echo "Variable foo is: $foo"
- How do I deal with floating point arithmetic you say? Use AWK ;-)
echo | awk '{print 1.2 * 10}'
- Random numbers
echo $RANDOM
- How do I know this? Cheating of course:
# Go to the 'Shell Variables' section and find RANDOM; note the bounds:
# the range is from '0' up to '32767'
info bash
Notice that the internal variables are in all caps. Using ALLCAPS variable names is discouraged, since you might accidentally overwrite a critical shell variable! So please just use lower-case variable names.
echo $PWD
echo $USER
echo $PATH
echo $PS1
PS1="\[\033[01;35m\]OAM$ \[\033[00m\]"
# Also, you can check all the exported variables using 'export'
export
- Random number up to 100
echo $(( $RANDOM%100 ))
- Now let’s use awk to add a column of random numbers
awk '{print $0}' reddest-with-z.txt
awk '{print $0, rand()}' reddest-with-z.txt
# If we run it again, you can see that the random numbers are actually
# the same! This is because AWK uses the same random seed, to make
# random numbers 'reproducible'. If you want to actually change the
# random numbers on every execution, you must change the random seed
awk '{print $0, rand()}' reddest-with-z.txt
awk 'BEGIN{srand('$RANDOM')}{print $0, rand()}' reddest-with-z.txt
# Now test it again
awk 'BEGIN{srand('$RANDOM')}{print $0, rand()}' reddest-with-z.txt
awk 'BEGIN{srand('$RANDOM')}{print $0, rand()}' reddest-with-z.txt
awk 'BEGIN{srand('$RANDOM')}{print $0, rand()}' reddest-with-z.txt
# It changes! Now let's format the numbers so we can read them
# easily. Let's say we are only interested in ID, SPECZ, and the random
# number
awk 'BEGIN{srand('$RANDOM')} \
     {printf "%-8d%-10.3f%-10.3f\n", $1, $2, rand()}' \
    reddest-with-z.txt
- The holy if
# Simple
if [ 5 -gt 2 ]; then echo "Duh"; else echo "Seriously?"; fi
# Now use a variable
x=$RANDOM
if [ $x -gt 16000 ]; then echo "TOPHALF :-D $x"; else echo "BOTTOMHALF :-( $x"; fi
# You could also check whether a file exists, a string is matched,
# etc. Where to get the info? The info! Open bash and search for
# 'conditional constructs'.
info bash
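For instance, here is a sketch of the two tests just mentioned: a file-existence check with '-f' and a string comparison with '=' (header.txt is the file we created earlier):

```shell
# Does the file exist?
if [ -f header.txt ]; then echo "header.txt is here"; else echo "no header yet"; fi

# Does a string match? Quote the variable so empty values don't break the test.
shell="bash"
if [ "$shell" = "bash" ]; then echo "Hello, $shell"; fi
```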
- The while loop
# Just print the ID and spectroscopic redshift
cat magnitudes.txt | while read -r id z rest_of_line; \
    do echo "Object $id redshift $z"; done
# Now put each value in its own file!
mkdir sample
ls
cat magnitudes.txt | while read -r id z rest; \
    do echo "$id $z" > sample/$id.txt; done
# Similarly you can achieve the same with AWK
rm sample/*
ls sample/
awk '{print $1, $2 > "sample/"$1".txt"}' magnitudes.txt
- The for loop: set the index and the iterable:
# My Very Educated Mom Just Served Us Nine Pizzas
for planet in Mars Venus Earth Mercury; do echo "Hi $planet"; done
# Or even list the files here
for f in $(ls); do echo "file: $f"; done
# BEWARE of white space in filenames as well! It's good practice to
# use a dash '-' instead of white space.
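To see why the warning matters, here is a demonstration with a hypothetical file name that contains a space: the $(ls) form splits it into two words, while a plain glob (with "$f" quoted) keeps the name whole:

```shell
# Create a demonstration file whose name contains a space:
touch 'spaced name.txt'

# Word splitting breaks $(ls): the name arrives as two separate items
for f in $(ls); do echo "item: $f"; done

# A glob keeps each name whole; always quote "$f" when using it
for f in *; do echo "file: $f"; done

# Clean up the demonstration file
rm 'spaced name.txt'
```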
Now let’s print a sequence, using … seq!
seq 5
seq 10
seq 5 10
seq 5 0.5 10
Again, in the for loop:
for i in $(seq 5); do echo "Galaxy $i"; done
Now check for some ids in the samples
for i in $(seq 20); do echo "Sample $i" ; cat sample/$i.txt ; done
Some samples did not exist! Let’s check for their existence first, and then print the details
for i in $(seq 20); do if [ -f sample/$i.txt ]; then echo "Sample $i" ; cat sample/$i.txt ; fi; done
Let’s say you’ve now done some analysis and you’d like to archive it or send it to a colleague. Instead of sending it in its full size, you can compress it to avoid wasting space on the disk!
# Check the initial size
ls -lh catalog-raw.txt
du -h catalog-raw.txt
# Compress and check again
gzip catalog-raw.txt
ls -lh
# De-compress
gunzip catalog-raw.txt.gz
How about all the files we just created? Let’s put them into a tarball so it becomes a single file
tar -cvf my-discovery.tar *.txt
mkdir unpack
cd unpack
tar -xf ../my-discovery.tar
As you’ve already guessed, this can be compressed as well
cd ..
file my-discovery.tar
gzip my-discovery.tar
file my-discovery.tar.gz
ls -lh *.gz
Or all in one command
# Remove the previous compressed tarball
rm my-discovery.tar.gz
# Create a new compressed tarball in one command ('-c' creates, and
# '-a' picks the compression from the file name suffix)
tar -cvaf my-discovery.tar.gz *.txt
ls -lh *.gz
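If you just want to peek inside a tarball without unpacking it, 't' lists the contents ('f' names the archive, as before):

```shell
# List the files inside the compressed tarball without extracting them
tar -tf my-discovery.tar.gz
```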
Now this is how bash figures out what the last command was:
history
Now check how many times we’ve called awk
history | grep awk
history | grep awk | wc -l
history | grep awk > hist-awk.txt
You can even search inside your history when you’re on the CLI, using Ctrl+r
Ctrl+r <part of the command>
Ctrl+r asttable
man awk
info awk
awk --help
If you’ve learned nothing, it doesn’t matter, take your time and watch the video, or even look for other tutorials.
Beware of “why should I care!? I’m not a programmer!”. If you’re writing a program, you’re doing a programmer’s work.
Two idioms:
- One who has a hammer sees everything as nails.
- Do not reinvent the wheel.
Physicists are famous for solving complex problems. They break the problem down into smaller, solvable chunks.
For instance, you get to a point where you must calculate an irregular area. The physicist’s art is done. Now you must figure out the answer with mathematics. An expert has already invented a solution. You know how to calculate the area of a simple rectangle! Divide the irregular shape into infinitesimal parts and integrate over them! Remember: you’re not a mathematician, and probably not one who can derive all the formulas from scratch anyway. But you’re using the tools.
- Clean coding