Learning awk

说明

主要参考 Understanding AWK 并在其上进行一些补充拓展。适合对 awk 了解不多,想学习使用的读者阅读使用,跟着示例一起操作学起来更快更有效~

这应该是个人学习总结,程序员的自我修养:Learning Tools系列的第一篇。后面有可能也会学习汇总一下 sed|grep|

其它参考资料列表

What is awk

awk 是一个记录处理工具,1977年由 Aho、Kernighan 和 Weinberger 编写,也正是这三位的名字缩写。grep 可以按行搜索,sed 可以按行做文本替换,awk 则是可以按行做计算。 > If grep lets you search for lines, and sed lets you do replacements in lines then awk was designed to let you do calculations on lines.

How to pronounce awk

在一个远古论坛找到了一个年龄和我差不多大的帖子:How to pronounce awk?,就按照 hawk [hɔːk]的同音来读了。

How to install awk

Linux 系统自带,mac 如果没有,用 Homebrew(brew install gawk)装。本篇 blog 主要在 CentOS7中完成:

$ awk --version
GNU awk 4.0.2
Copyright (C) 1989, 1991-2012 Free Software Foundation.

已经有了 Awk 了,开始吧

TO BEGIN WITH

awk Print

默认情况下,awk 接受标准输入,输出标准输出,使用 awk 做的最简单的事情就是打印输入的内容:

$ echo "one two three" | awk '{ print }'
one two three

注: {} 作为分隔符号,包含了一个 action,实际上,awk 的语法,整体上就是一句句的 pattern { action },在后面的更多例子中慢慢体会。

打印的时候也可以选择打印哪一列,或者哪一个值域(fields is a much more precise description):

$ echo "one two three" | awk '{ print $1 }'
one 
$ echo "one two three" | awk '{ print $2 }'
two 
$ echo "one two three" | awk '{ print $3 }'
three

这里的序号是从1开始的,0表示一整行:

$ echo "one two three" | awk '{ print $0 }'
one two three

最开始说了,awk 是按行操作的,上面几个例子都是只有一行输入,如果有多行输入:

$ echo "
 one two three
 four five six" \
| awk '{ print $1, $2 }'

one two
four five

除了$1,$2,, 还有 NFNR 两个特殊变量,NF 表示 number of fieldsNR 表示 number of records。前者表示一行里面一共有多少个 fields,后者表示当前处理了多少行(从1开始),这两个变量本身的值就是一个数,如果再带上$,就会得到对应的 fields 里的值了,体会一下下面两个例子:

$ echo "one two three
 four five six
 seven eight nine " \
 | awk '{ print NR, NF }'

1 3
2 3
3 3
$ echo "one two three
 four five six
 seven eight nine " \
 | awk '{ print $NR, $NF }'

one three
five six
nine nine

基本可以理解,{ print NR }是有意义的,而{ print $NR }可能很难找到使用的场合。另外 $NF 还可以用来计算倒数第几个:

$ echo "one two three
 four five six
 seven eight nine " \
 | awk '{ print $(NF-1) }'

two
five
eight

SPEED UP

获取一些 sample 数据用来练习

$ curl https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz \
| gunzip -c >> gamereviews.tsv
  • 其它数据集可以从 link 里找找。
  • 完全下载下来压缩包26.2MB,解压出来约70MB 文本,wc -l 显示一共有145432行。可以在 gunzip -c 命令后加上| head -n 10000获取前 n 行。

数据集格式

$ head -n 2 gameviews.tsv
marketplace	customer_id	review_id   product_id  product_parent  product_title   product_category    star_rating helpful_votes   total_votes vine    verified_purchase   review_headline review_body review_date
US  21269168    RSH1OZ87OYK92   B013PURRZW  603406193   Madden NFL 16 - Xbox One Digital Code   Digital_Video_Games 2   2   3   N   N   A slight improvement from last year.    I keep buying madden every year hoping they get back to football. This years version is a little better than last years -- but that's not saying much.The game looks great. The only thing wrong with the animation, is the way the players are always tripping on each other.<br /><br />The gameplay is still slowed down by the bloated pre-play controls. What used to take two buttons is now a giant PITA to get done before an opponent snaps the ball or the play clock runs out.<br /><br />The turbo button is back, but the player movement is still slow and awkward. If you liked last years version, I'm guessing you'll like this too. I haven't had a chance to play anything other than training and a few online games, so I'm crossing my fingers and hoping the rest is better.<br /><br />The one thing I can recommend is NOT TO BUY THE MADDEN BUNDLE. The game comes as a download. So if you hate it, there's no trading it in at Gamestop.    2015-08-31

第一行是表头,代表下面每一行中每一列的含义:

DATA COLUMNS:
01  marketplace       - 2 letter country code of the marketplace where the review was written.
02  customer_id       - Random identifier that can be used to aggregate reviews written by a single author.
03  review_id         - The unique ID of the review.
04  product_id        - The unique Product ID the review pertains to. 
05  product_parent    - Random identifier that can be used to aggregate reviews for the same product.
06  product_title     - Title of the product.
07  product_category  - Broad product category that can be used to group reviews 
08  star_rating       - The 1-5 star rating of the review.
09  helpful_votes     - Number of helpful votes.
10  total_votes       - Number of total votes the review received.
11  vine              - Review was written as part of the Vine program.
12  verified_purchase - The review is on a verified purchase.
13  review_headline   - The title of the review.
14  review_body       - The review text.
15  review_date       - The date the review was written.

回顾之前的 print 第几列:可以这样试一试:

$ awk '{print $1}' gamereviews.tsv | head 
marketplace
US
US
US
US
US
US
US
US
US

$ awk '{print $2}' gamereviews.tsv | head 
customer_id
21269168
133437
45765011
113118
22151364
22151364
38426028
6057518
20715661

$ awk '{print $3}' gamereviews.tsv | head 
review_id
RSH1OZ87OYK92
R1WFOQ3N9BO65I
R3YOOS71KM5M9
R3R14UATT3OUFU
RV2W9SGDNQA2C
R3CFKLIZ0I2KOB
R1LRYU1V0T3O38
R44QKV6FE5CJ2
R2TX1KLPXXXNYS

# ......

打到第6列标题的时候,结果开始不对劲了:

$ awk '{print $6}' gamereviews.tsv | head 
product_title
Madden
Xbox
Command
Playstation
Saints
Double
Sims
Playstation
Playstation

至少上面看过,第二行,也就是第一条 review,title 应该是 Madden NFL 16 - Xbox One Digital Code,而这里只切割出来的第一个单词。

原因就是,awk 默认以 ,一个空格作为区分一行中不同 fileds 的分隔符 Field Separators

Field Separators

使用参数 -F '{fields separators}' 来覆盖默认的空格分隔符,这个数据集里的分隔符都是一个 tab,所以上面的命令可以改为:

$ awk -F '\t' '{print $6}' gamereviews.tsv | head 
product_title
Madden NFL 16 - Xbox One Digital Code
Xbox Live Gift Card
Command & Conquer The Ultimate Collection [Instant Access]
Playstation Plus Subscription
Saints Row IV - Enter The Dominatrix [Online Game Code]
Double Dragon: Neon [Online Game Code]
Sims 4
Playstation Network Card
Playstation Network Card

Noted: By default, awk does more than split the input on spaces. It splits based on one or more sequence of space or tab or newline characters. In addition, any of these three characters at the start or end of input gets trimmed and won’t be part of field contents. Newline characters come into play if the record separator results in newline within the record content.

再回顾一下之前的 NFNR,获取一下最后一行的日期(15 reviewdate)、第8个评分(08 starrating)以及倒数第三个(13 review_headline):

$ awk -F '\t' '{print NR,$NF,$8,$(NF-2) }' gamereviews.tsv | head -n 10
1 review_date star_rating review_headline
2 2015-08-31 2 A slight improvement from last year.
3 2015-08-31 5 Five Stars
4 2015-08-31 5 Hail to the great Yuri!
5 2015-08-31 5 Five Stars
6 2015-08-31 5 Five Stars
7 2015-08-31 5 Five Stars
8 2015-08-31 4 i like the new skills like herbalism in this
9 2015-08-31 5 Five Stars
10 2015-08-31 5 Easy & Fast

Pattern Match With Regular Expressions

至此都是在对每一行进行操作,awk 实际包含模式匹配,之前提过,awk 的语法实际上是一条条的 pattern { action },目前我们也只用上了{print}这个 action

关于模式匹配,一个简单的例子:

$ echo "aa 1
bb 2
cc 3" | awk -F ' ' '/bb/ {print $2}'
2

可以看到,对于 awk 扫描的三行数据,只有匹配到了 bb 的第二行,运行了 printaction。这里的/bb/就是 pattern

回到之前的数据集,比如我想匹配 Minecraft 这个游戏的 reviews,看看大家的评分,这么写:

$ awk -F '\t' '/Minecraft/ {print $6, $8}' gamereviews.tsv | head
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 5
Xbox 360 Live Points Card 5
Minecraft for PC/Mac [Online Game Code] 2
Minecraft for PC/Mac [Online Game Code] 5
Minecraft - Xbox 360 5
Minecraft for PC/Mac [Online Game Code] 1
Nom Nom Galaxy  - PS4 [Digital Code] 4
Minecraft - Xbox One Digital Code 5
Minecraft for PC/Mac [Online Game Code] 4

确实过滤出来了 Minecraft,但是同时,任何字段里面带有 Minecraft 的 reviews 都被统计进来了,包括任何提及它的别的游戏的 reviews:

$ awk -F '\t' '/Minecraft/ {print $6, $8}' gamereviews.tsv | sort | uniq
8BitMMO [Game Connect] 2
Ace of Spades: Battle Builder [Online Game Code] 5
Agricultural Simulator Historical Farming [Download] 1
Bioschock Infinite: Clash in the Clouds 5
Blini Kids: Animals [Download] 5
Blockland 4
Blockland 5
Blockstorm [Download] 4
Borderlands 2 5
Call of Duty 4: Modern Warfare [Download] 5
# ...... more

所以还需要针对某个字段进行匹配:pattern 可以写成$6 ~ /Minecraft/来模糊匹配06 product_title 里面有 Minecraft

$ awk -F '\t' ' $6 ~ /Minecraft/ {print $4,$6,$8} ' gamereviews.tsv | head 
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 4
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 5
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 2
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 5
B010BWCOWI Minecraft - Xbox 360 5
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 1
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 5
B00NMO0IA8 Minecraft - Xbox One Digital Code 5
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 4
B010KYDNDG Minecraft for PC/Mac [Online Game Code] 4

或者根据我们确认到的04 product_idB010KYDNDG 来准确匹配所有的 Minecraft for PC/Mac [Online Game Code],使用$4 == "B010KYDNDG":

$ awk -F '\t' '$4 == "B010KYDNDG" {print $6, $8} ' gamereviews.tsv | head
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 2
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 1
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 4
Minecraft for PC/Mac [Online Game Code] 5
Minecraft for PC/Mac [Online Game Code] 5

已经精确匹配了$4 == "B010KYDNDG",修改一下 print 的字段,依次打印15:review_date, 13:review_headline, 08:star_rating

$ awk -F '\t' '$4 == "B010KYDNDG" {print $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 FUN 4
2015-08-31 ... have disks for games as a backup it was nice to be able to get the code then start ... 5
2015-08-30 Would rather a disk. There always seems to be ... 2
2015-08-29 Five Stars 5
2015-08-27 Very Disappointed Dad and Birthday Boy 1
2015-08-26 Five Stars 5
2015-08-24 Four Stars 4
2015-08-24 My son says it is fun. He likes it but we were a little ... 4
2015-08-24 MINECWAFT 5
2015-08-22 MINECRAFT!!! 5

总结一下:

  • awk 的语法 pattern { action }
  • 模糊匹配使用两个/包括来:/regexp/
  • 针对字段模糊匹配: $n ~ /regexp/ ,~表示匹配上, !~表示不匹配
  • 针对字段精确匹配: $n == "value"

Case Sensitivity in Matching

忽略大小写,可以设置 IGNORECASE = 1

x = "aB"
if (x ~ /ab/) …   # this test will fail

IGNORECASE = 1
if (x ~ /ab/) …   # now it will succeed

Use printf

上面已经可以按照我们期望的条件,匹配我们需要的内容并打印出来,print 这个 action 后面接的参数,如果用,隔开,就是按照默认的输出域分隔符(Output Field Separators)来分割每个 fields,这个分隔符在 awk 内部用 OFS 表示,还有 ORS 表示输出记录分隔符(Output Record Separators)。记单词就很容易记下来:Field - 域(一个字段)Record - 记录(一行内容) > 其实之前的 -F 参数,也是修改的 FS。对应的输入记录分隔符为 RS,默认为一个换行符

上面的命令,可以通过加上修改参数 OFSORS 来让结果更美观:

$ awk '{FS="\t"; OFS=" : "; ORS= "\n\n"}  $4 == "B010KYDNDG" {print $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 : FUN : 4

2015-08-31 : ... have disks for games as a backup it was nice to be able to get the code then start ... : 5

2015-08-30 : Would rather a disk. There always seems to be ... : 2

2015-08-29 : Five Stars : 5

2015-08-27 : Very Disappointed Dad and Birthday Boy : 1

除了设置输出分隔符,也可以让输出内容按照我们期望的格式直接输出,awkprintf 和 C/Shell 的格式很像:

Control-Letters

  • %c : Print a number as a character
  • %d,%i : Print a decimal integer
  • %f : Print a number in floating-point notation
  • %s : Print a string.
  • %x : 16 进制
  • %o : 8进制
  • %u : unsigned

更多格式控制符见 Control-Letters

用引号括住所有格式,后面的参数里按顺序填入对应的变量:

$ awk -F '\t' ' $4 == "B010KYDNDG" {printf "%s :%s %s\n", $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 :FUN 4
2015-08-31 :... have disks for games as a backup it was nice to be able to get the code then start ... 5
2015-08-30 :Would rather a disk. There always seems to be ... 2
2015-08-29 :Five Stars 5
2015-08-27 :Very Disappointed Dad and Birthday Boy 1
2015-08-26 :Five Stars 5
2015-08-24 :Four Stars 4
2015-08-24 :My son says it is fun. He likes it but we were a little ... 4
2015-08-24 :MINECWAFT 5
2015-08-22 :MINECRAFT!!! 5

Modifiers

  • width 宽度修饰词:printf "%4s", "foo" => ·foo // ·表示空格
  • - (Minus),用在宽度修饰词前面,变成左对齐。(否则默认是右对齐):printf "%-4s", "foo" => foo·
  • . 精度修饰词,根据后面不同的格式控制符,含义稍有不同:printf "%.4s", "foobar" => foob
  • 精度修饰词可以叠加再其它修饰词后面使用,例如控制内容长度为25,宽度为30且左对齐,可以写作%-30.25s:
$ awk -F '\t' ' $4 == "B010KYDNDG" {printf "%s : %-30.25s %s\n", $15, $13, $8} ' gamereviews.tsv | head
2015-08-31 : FUN                            4
2015-08-31 : ... have disks for games       5
2015-08-30 : Would rather a disk. Ther      2
2015-08-29 : Five Stars                     5
2015-08-27 : Very Disappointed Dad and      1
2015-08-26 : Five Stars                     5
2015-08-24 : Four Stars                     4
2015-08-24 : My son says it is fun. He      4
2015-08-24 : MINECWAFT                      5
2015-08-22 : MINECRAFT!!!                   5

注: 原文章里提到了可以使用 build-in 方法 substr 来操作$13: sub($13,1,25)可以达到同样的效果。

其它更多修饰词及用法见 Format-Modifiers

BEGIN and END

现在以及能够搜索、匹配、按格式打印了,来试着做一点计算。

继续上面的例子,能够得到所有的评分,如果期望计算所有评分的平均值,我们需要一个临时变量来求和,awk 里,变量可以直接声明并使用:

$ awk -F '\t' ' \
$4 == "B010KYDNDG" {total = total + $8; printf "now we have : %d \n", total} \
' gamereviews.tsv | head
now we have : 4 
now we have : 9 
now we have : 11 
now we have : 16 
now we have : 17 
now we have : 22 
now we have : 26 
now we have : 30 
now we have : 35 
now we have : 40

每计算后都打印一次会输出太多无关内容,通过 BEGIN 或者 END 作为 pattern 的语句,可以在其它主要语句的运行前和运行后分别运行一次:

$ awk -F '\t' '\
BEGIN {printf "\nBEGIN:\n"} \
$4 == "B010KYDNDG" {total = total + $8; cnt = cnt + 1} \
END {printf "sum is %d, avg is %.2f\nEND\n", total, total/cnt} \
' gamereviews.tsv 

BEGIN:
sum is 131, avg is 3.97
END

SUPER POWER

awk Scripts

随着语句越写越复杂,在不够熟练且思路不连贯的情况下想一口气在一整行命令里完成功能还是挺有难度的,把这些语句放到文件里编辑,可以排版一下格式,思路&&看起来都更清晰:

`$ cat calc_avg_single.awk`

#!/bin/awk -f
BEGIN { 
    FS = "\t"
    print "BEGIN" 
}
$4 == "B010KYDNDG" { 
    title = $6
    total = total + $8;
    cnt = cnt + 1;
}
END { 
    print "Average review of", title, "is", total/cnt, "starts"
    print "END" 
}

$ ./calc_avg_single.awk gamereviews.tsv 
BEGIN
Average review of Minecraft for PC/Mac [Online Game Code] is 3.9697 starts
END
  • -f awk.fileawk 执行一个文件里的命令
  • -F '\t'需要写在 BEGIN 里了 FS = "\t"

Arrays in awk

Awk 内建了数组Array,实时使用体验和字典一样

arr["key1"] = "one"
arr["key2"] = "two"
arr["key3"] = "three"

for (i in arr){
    print i, arr[i]
}

key1 one
key2 two
key3 three

利用这个方法同时存储计算所有 title 模糊匹配 Minecraft 的平均评价:

$ cat calc_avg.awk 
#!/bin/awk -f
BEGIN { 
    FS = "\t"
}
$6 ~ /Minecraft/ { 
    title[$6] = $6
    total[$6] = total[$6] + $8;
    cnt[$6] = cnt[$6] + 1;
}
END { 
    for (i in cnt) {
        printf "-----------------------\n";
        printf "%s\n", title[i]
        printf "avg: %.2f\t Count: %s \n", total[i]/cnt[i], cnt[i]
        printf "-----------------------\n\n";
    }
}

运行效果如下:

$ ./calc_avg.awk gamereviews.tsv 
-----------------------
Minecraft - Xbox One Digital Code
avg: 4.38    Count: 52 
-----------------------

-----------------------
Minecraft for PC/Mac [Online Game Code]
avg: 3.97    Count: 33 
-----------------------

-----------------------
Minecraft - Xbox 360
avg: 5.00    Count: 1 
-----------------------

-----------------------
Minecraft PlayStation 3 Edition - PS3 [Digital Code]
avg: 4.00    Count: 1 
-----------------------

-----------------------
Xbox LIVE 1600 Microsoft Points for Minecraft: Xbox 360 Edition - Xbox 360 Digital Code
avg: 4.70    Count: 71 
-----------------------

Control Statements

流程控制语句 基本和 C 那一套完全一致。

If-Else

if (x % 2 == 0)
    print "x is even"
else
    print "x is odd"

Do-While

while (condition)
    body
do
  body
while (condition)

For

for (i in array)
    do something with array[i]
for (initialization; condition; increment)
  body

Switch

switch (expression) {
case value or regular expression:
    case-body
default:
    default-body
}

Break && Continue

Continue;

Next

The next statement forces awk to immediately stop processing the current record and go on to the next record.

More Sample && Exercise

统计特定类型文件大小

ls -l *.txt | awk '{sum+=$5} END {print sum}'

找出文件中长度大于80的行

awk 'length>80' log.txt

打印九九乘法表

seq 9 | sed 'H;g' | awk -v RS='' '{for(i=1;i<=NF;i++)printf("%dx%d=%d%s", i, NR, i*NR, i==NR?"\n":"\t")}'