The trick to understanding awk in all its terse glory is to understand its defaults. I made a screencast explaining how awk works by deconstructing a script I’d previously written for this blog. In this post we’ll deconstruct awk’s defaults so we can understand all those one-liner scripts that Stack Overflow solutions throw your way.
The example
I have a file that contains the version info for my apps and I’d like to extract the first version number in there:
// appVersion.gradle
def baseCode = 30001
def appVersion = [
product-1 : [
name: "21.091.420",
code: baseCode
],
product-2: [
name: "20.090.300",
code: baseCode
],
//...
// I want to pluck 21.091.420 from this file
The first solution (meh)
Some quick googling revealed this Stack Overflow solution, which gets us close:
gawk -F'"' '$0=$2' appVersion.gradle
# -- output --
# 21.091.420
# 20.090.300
I only need the first number though, so a quick way to do this would be:
gawk -F'"' '$0=$2' appVersion.gradle | head -n 1
# -- output --
# 21.091.420
The problem with solution 1
- awk is powerful, and reaching out to head for that last teeny tiny mile seemed sacrilegious. I want this solution to be pure awk.
- What the heck does that incantation gawk '$0=$2' do?
The basics
Let’s try to take that script apart piece by piece:
default input field delimiter
gawk -F'"' '$0=$2' appVersion.gradle
# ↑
# input field delimiter
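If you don’t specify the input field delimiter, awk sensibly defaults to splitting on whitespace. The default separator is a single space, which awk treats specially: it splits on any run of spaces or tabs and ignores leading and trailing whitespace. Let’s try some examples: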
echo "Hello kind world" | gawk '{print $2}'
echo "Hello kind world" | gawk -F" " '{print $2}'
# -- output --
# kind
echo "Hello kind world" | gawk -F"," '{print $2}'
# -- no output --
Notice how the line is split into numbered “segments” where $1, $2, $3 hold the first three words in our example respectively. $0 represents the entire line.
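To see the default splitting more closely, here is a small sketch. It assumes plain `awk` (nothing here should be gawk-specific, so swapping in `gawk` ought to behave the same), and the input strings are made up for illustration.

```shell
# Leading blanks are ignored and runs of spaces/tabs count as one separator.
second=$(printf '   Hello \t  kind    world\n' | awk '{ print $2 }')
count=$(printf '   Hello \t  kind    world\n' | awk '{ print NF }')
echo "$second"   # kind
echo "$count"    # 3

# With an explicit single-character separator, every occurrence splits,
# so two commas in a row produce an empty field between them.
comma_fields=$(printf 'a,,b\n' | awk -F',' '{ print NF }')
echo "$comma_fields"   # 3
```

NF (the number of fields) is covered further down; here it is only used to count the segments.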
default syntax
If you watched my screencast you’ll remember that awk’s general syntax is as follows:
awk '
BEGIN { a1; a2; a3; } ← optional
<pattern> { a1; a2; a3; } ← action block (mandatory)
END { a4; a6; } ← optional
' <filename>
Most awk one-liners don’t use the BEGIN and END blocks.
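For completeness, here is a minimal sketch that uses all three kinds of block at once. The three-line input and the counter name `n` are hypothetical, chosen only to exercise each block.

```shell
result=$(printf 'name: a\ncode: 1\nname: b\n' | awk '
BEGIN   { print "start" }          # runs once, before any input is read
/name/  { n++ }                    # runs for every line matching the pattern
END     { print n " name lines" }  # runs once, after the last line
')
echo "$result"
# start
# 2 name lines
```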
So looking back at my simple one-liner:
echo "Hello kind world" \
| gawk ' { print $2 }'
# ↓
# action block ✅
🛑 ✋ but wait, what’s going on with the original one-liner 👇?
gawk -F'"' '$0=$2' appVersion.gradle
# ↑
# 🤔
# is this a <pattern>?
# is this an action block?
For this, we need to understand how the awk pattern recognition works:
# general syntax
gawk '<pattern> { a1; a2; a3; }'
echo "Hello kind world" \
| gawk '0 { print $2 }'
# ↑
# forcing result of <pattern> match as 0
# -- output --
# no output
echo "Hello kind world" \
| gawk '1 { print $2 }'
echo "Hello kind world" \
| gawk '2 { print $2 }'
echo "Hello kind world" \
| gawk '3 { print $2 }'
# ↑
# forcing result of <pattern> match as 3 / non-0
# -- output for all the above --
# kind
So the way that <pattern> condition matching works is: if the pattern evaluates to 0 (or to an empty string), the match condition is “false” and awk ignores the action block. Any non-zero number (or non-empty string) and awk treats the condition as “true” and executes the action block. OK, back to the one-liner:
gawk -F'"' '$0=$2' appVersion.gradle
# ↑
# is this a valid pattern?
# ✅ we're getting some non-0 value
# cause things are being printed
# is this an action block? 🤔
So $0=$2 is coming back with a “true” (non-zero, non-empty) result and some invisible default is being executed. Progress… but still many questions.
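To probe which values count as “true” for a pattern, here is a rough sketch using constants in the pattern position. It assumes plain `awk`; the negative number is wrapped in parentheses so the shell argument does not start with a dash and get mistaken for an option.

```shell
line='Hello kind world'
# Each command uses a constant as the pattern; only truthy ones print.
zero=$(echo "$line" | awk '0 { print $2 }')          # 0 is false
negative=$(echo "$line" | awk '(-1) { print $2 }')   # non-zero, so true
empty=$(echo "$line" | awk '"" { print $2 }')        # empty string is false
text=$(echo "$line" | awk '"x" { print $2 }')        # non-empty string is true
printf '[%s] [%s] [%s] [%s]\n' "$zero" "$negative" "$empty" "$text"
# [] [kind] [] [kind]
```

The string cases matter for the one-liner because $2 holds text like 21.091.420, which is not a plain number.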
default action
Let’s try some commands. Notice the output for each of them:
echo "Hello kind world" | gawk '0 {print $0}'
echo "Hello kind world" | gawk '0 {print}'
echo "Hello kind world" | gawk '0'
echo "Hello kind world" | gawk ''
# -- output --
# no output
echo "Hello kind world" | gawk '1 {print $0}'
echo "Hello kind world" | gawk '1 {print}'
echo "Hello kind world" | gawk '1'
# -- output --
# Hello kind world
So when the <pattern> match is false (0), nothing is printed. When it is true and there is no action block, the default is to just print the entire line ($0). In fact you don’t have to specify an action at all: awk assumes you want to print $0 by default.
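The same default applies to any pattern, not just the constants 0 and 1. As a small sketch (the two input lines are invented for illustration), a regular expression pattern with no action prints the matching lines as-is:

```shell
input='code: 1\nname: "a"\n'
# Pattern only: awk falls back to the default action, print $0.
implicit=$(printf "$input" | awk '/name/')
# The same rule with the default action spelled out.
explicit=$(printf "$input" | awk '/name/ { print $0 }')
echo "$implicit"   # name: "a"
echo "$explicit"   # name: "a"
```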
variable reassignment
You know how we glorify immutability in most programming? awk ain’t having any of that.
You can mutate the heck out of anything. You can mutate the current line before you even run an action on it. Check this piece of code out:
echo "Hello kind world" | gawk '{print $0" <-> "$1" <-> "$2" <-> "$3}'
# -- output --
# Hello kind world <-> Hello <-> kind <-> world
# ↑ ↑ ↑ ↑
# $0 $1 $2 $3
echo "Hello kind world" | gawk '1 {$0="hijack"; print $0" <-> "$1" <-> "$2" <-> "$3}'
# -- output --
# hijack <-> hijack <-> <->
# ↑ ↑ ↑ ↑
# $0 $1 🙅 🙅
You can reassign the entire line even before the action block is executed (from inside the pattern itself, for example), and awk re-splits the fields from the new $0.
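A few more mutations, sketched with plain `awk` and throwaway replacement strings, show that assigning to $0 re-splits the fields, assigning to a field rebuilds $0, and an assignment can sit in the pattern position:

```shell
# Assigning to $0 re-splits the record, so NF and the fields change with it.
resplit=$(echo "Hello kind world" | awk '{ $0 = "a b"; print NF ": " $1 }')
echo "$resplit"    # 2: a

# Assigning to a single field rebuilds $0 from the fields.
rebuilt=$(echo "Hello kind world" | awk '{ $2 = "cruel"; print }')
echo "$rebuilt"    # Hello cruel world

# An assignment used as the pattern: $0 becomes "world", which is truthy,
# so the default action prints the new $0.
in_pattern=$(echo "Hello kind world" | awk '$0 = $3')
echo "$in_pattern" # world
```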
number of fields
Here’s the last piece that should help bring this all together. Given this file again:
// appVersion.gradle
def baseCode = 30001
def appVersion = [
product-1 : [
name: "21.091.420",
code: baseCode
],
product-2: [
name: "20.090.300",
code: baseCode
],
//...
gawk '{print NF ": "$0}' appVersion.gradle
# -- output --
# 0:
# 4: def baseCode = 30001
# 0:
# 4: def appVersion = [
# 3: product-1 : [
# 2: name: "21.091.420",
# 2: code: baseCode
# 1: ],
# 0:
# 2: product-2: [
# 2: name: "20.090.300",
# 2: code: baseCode
# 1: ],
NF in awk stands for number of fields.
- awk splits each line on the input field separator and stores the number of fields it found in that variable.
- Notice above that when there’s an empty line, NF is 0 since there’s no content to split.
- On the second line there are 4 tokens (each separated by a space).
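Since NF is 0 on blank lines and 0 is “false”, NF on its own works as a pattern. A quick sketch, using a few lines modelled on the file above, drops the empty lines and prints the rest through the default action:

```shell
# NF is 0 (false) on empty lines and non-zero (true) everywhere else.
nonblank=$(printf '\ndef baseCode = 30001\n\ndef appVersion = [\n' | awk 'NF')
echo "$nonblank"
# def baseCode = 30001
# def appVersion = [
```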
The first solution (again)
All right, let’s do this one last time.
gawk -F'"' '$0=$2' appVersion.gradle
# -- output --
# 21.091.420
# 20.090.300
What’s happening here is a beautiful symphony of awk defaults stacking on top of each other.
We first overwrite the variable holding the entire line ($0) with $2. Remember that $2 holds the second token after splitting the original content of $0 on the input field separator ". This should help point out the resulting fields with the new field separator:
gawk -F'"' '{print NF ": "$0}' appVersion.gradle
# 0:
# 1: def baseCode = 30001
# 0:
# 1: def appVersion = [
# 1: product-1 : [
# 3: name: "21.091.420",
# 1: code: baseCode
# 1: ],
# 0:
# 1: product-2: [
# 3: name: "20.090.300",
# 1: code: baseCode
# 1: ],
- The first line of this file is empty, so $2 is an empty string and $0 is assigned an empty string, which awk treats as false.
- The second line of this file has only one segment (since there’s no ") and it is stored in $1. $2 is again empty, so $0 is again assigned an empty string.
- The sixth line has 3 tokens; $2 is 21.091.420 and is assigned to $0.
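To make the three tokens visible, here is a small sketch that prints each field of one quoted line in brackets, then does the same for $2 on a line without quotes. The loop variable `i` and the bracket formatting are just for illustration.

```shell
# Split a quoted line on " and show every field.
fields=$(printf '    name: "21.091.420",\n' |
  awk -F'"' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')
echo "$fields"
# $1=[    name: ]
# $2=[21.091.420]
# $3=[,]

# A line with no quotes has a single field, so $2 is empty.
no_quote=$(printf '    code: baseCode\n' | awk -F'"' '{ print "[" $2 "]" }')
echo "$no_quote"   # []
```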
The original command should make sense now:
gawk -F'"' '$0=$2' appVersion.gradle
# -- output --
# 21.091.420
# 20.090.300
- The <pattern> match condition provided to awk is the value of $0=$2, which is whatever ends up in $0.
- From the previous command output we saw that $2 is empty for every line before the sixth, so that value is an empty string.
- When awk sees an empty string (or 0) it ignores the action and there’s no print.
- The sixth line is the first time we encounter a non-empty value, so the action is executed.
- What is the default action? print $0
- What is $0?
- The second token or segment $2 (after splitting on the field separator "), which is 21.091.420
💥
This is really such a gorgeous piece of code. Clever and poetic.
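If it helps, the terse one-liner can be spelled out as two explicit rules. This is only a sketch: the temp file is a stand-in for appVersion.gradle, and the verbose form tests for a non-empty string, so it would differ from the terse one only if $2 were a number equal to zero (which the terse form treats as false).

```shell
sample=$(mktemp)
cat > "$sample" <<'EOF'

def baseCode = 30001

def appVersion = [
  product-1 : [
    name: "21.091.420",
    code: baseCode
  ],

  product-2: [
    name: "20.090.300",
    code: baseCode
  ],
EOF

# The original: assignment as the pattern, default action prints $0.
terse=$(awk -F'"' '$0=$2' "$sample")
# Spelled out: first overwrite $0, then print it when it is not empty.
verbose=$(awk -F'"' '{ $0 = $2 } $0 != "" { print $0 }' "$sample")
rm -f "$sample"

echo "$terse"
# 21.091.420
# 20.090.300
```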
The final solution
If you’re curious how I came up with my own solution, I made it a little less clever, more verbose and hopefully now simpler to understand:
gawk -F'"' 'NF==3 {print $2; exit}' appVersion.gradle
# -- output --
# 21.091.420
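As a quick check of what exit buys us, here is a sketch that runs the final solution against a trimmed, made-up sample, next to a hypothetical variant that keeps the assignment-as-pattern trick but also stops at the first hit:

```shell
sample=$(mktemp)
printf '%s\n' \
  'def baseCode = 30001' \
  '    name: "21.091.420",' \
  '    code: baseCode' \
  '    name: "20.090.300",' > "$sample"

# The final solution: print $2 on the first line with 3 fields, then stop.
first=$(awk -F'"' 'NF==3 { print $2; exit }' "$sample")
# Hypothetical variant: the assignment is the pattern, exit ends the run.
also_first=$(awk -F'"' '($0 = $2) { print; exit }' "$sample")
rm -f "$sample"

echo "$first"        # 21.091.420
echo "$also_first"   # 21.091.420
```

Without exit, both commands would also print 20.090.300 from the second quoted line.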
Go forth and awk.