Ch. 20: Text Processing
cut
The next set of tools is another great resource.
Don’t forget you can use the PDF to copy/paste larger chunks of data. In this case, it takes some regex playing to make it work, so I have put a copy of the distros.txt file in /ufrc/bsc4452/share/Class_Files/TLCL_files/.
- p. 273: Applications of Text: Mostly this is pointing out the diversity of text data that we may encounter. Again, text processing is pervasive and important to so much of what we do, so these tools are important!
- p. 274: `cat`: The `cat` program is maybe the most used program on the command line; it displays the contents of a text file on the screen. In the RegEx chapter we mentioned that there is a character at the end of each line. `cat` can show you this, as well as tab characters, with the `cat -A` option.
- p. 276: MS-DOS Text vs. Unix Text: Especially for Windows users this is an important box to read. DOS (the underlying OS of Windows) and Linux do not use the same characters to signify the end of a line. Many text editors on Windows default to DOS line breaks. If you then transfer a file with DOS line breaks to Linux, the file is often interpreted as one long line and this usually breaks things! VSCode always uses Linux line breaks. But the `cat -A` command featured here is handy.
- `dos2unix`: e.g. `dos2unix file.txt`. There is a file in the Class_Files folder with DOS line breaks:
[magitz@login3 ~]$ cat -A /blue/bsc4452/share/Class_Files/data/DOS_formatted_file.txt
This is an example DOS text file.^M$
^M$
Notice the extra new line characters when viewed with cat -A.^M$
^M$
Creating a file on Windows using DOS line breaks and uploading the file to^M$
Linux is a constant source of pain. When you look at the file, everything^M$
seems normal. But Linux commands will often read the file as one long line.^M$
This breaks many things and causes trouble…^M$
^M$
To convert a DOS file to Linux, use the dos2unix command:^M$
dos2unix file.text^M$
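To try this without the class file, here is a minimal sketch that builds its own small DOS-formatted file. The file names and contents are made up for illustration, and `tr -d '\r'` stands in for `dos2unix` in case that program is not installed; `cat -A` assumes GNU coreutils.

```shell
# Build a small two-line file with DOS (CRLF) line endings in a scratch directory.
# The file names and contents are made up for this example.
workdir=$(mktemp -d)
printf 'first line\r\nsecond line\r\n' > "$workdir/dos.txt"

# cat -A (GNU coreutils) shows each carriage return as ^M and each line end as $.
cat -A "$workdir/dos.txt"

# Strip the carriage returns with tr; dos2unix does the same conversion in place.
tr -d '\r' < "$workdir/dos.txt" > "$workdir/unix.txt"

# After conversion only the $ end-of-line marker should remain.
cat -A "$workdir/unix.txt"
```

If `dos2unix` is available, `dos2unix "$workdir/dos.txt"` should give the same result by rewriting the file directly.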
sort
- p. 277: `sort`: We have seen the `sort` command before. This section points out some of the many options that `sort` has.
- p. 280: `sort`: This section on sort introduces the idea of thinking of a data table as consisting of rows of records and columns of fields.
- p. 280: `sort`: The `distros.txt` file can be found at `/blue/bsc4452/share/Class_Files/TLCL_files/distros.txt`. Lots of great examples of playing with data in this section!
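The records-and-fields idea can be sketched with a tiny table. The three rows below are made up (they are not the real `distros.txt`), and the example assumes tab-separated fields with dates written as MM/DD/YYYY.

```shell
# Made-up sample table: name, version, release date (MM/DD/YYYY), tab-separated.
workdir=$(mktemp -d)
printf 'Alpha\t10\t06/19/2008\nBeta\t9\t11/25/2007\nAlpha\t8\t03/02/2006\n' > "$workdir/sample.txt"
tab=$(printf '\t')

# Sort records by field 1 alphabetically, then by field 2 as a number.
by_name=$(sort -t "$tab" -k 1,1 -k 2,2n "$workdir/sample.txt")
printf '%s\n\n' "$by_name"

# Sort by date, newest first: the year starts at character 7 of field 3,
# the month at character 1, and the day at character 4.
by_date=$(sort -t "$tab" -k 3.7nbr -k 3.1nbr -k 3.4nbr "$workdir/sample.txt")
printf '%s\n' "$by_date"
```

Each `-k` option names one field (and optionally a character offset inside it) plus the flags that apply to that key: `n` for numeric, `b` to ignore leading blanks, `r` to reverse.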
The author starts with a simple command such as `cut -f 4 distros.txt` and then adds another transformation, or changes an option, to get a slightly different output. This iterative process, starting simple and refining to achieve a desired result, is an excellent strategy when approaching a problem. A command line like the one on p. 290, `sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt`, is not written fully formed and fully functional from the start; it is the product of testing, modifying, and refining. Do not expect to write such commands from scratch, and do not believe that the author did either!
uniq
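The build-it-up-in-steps approach described above can be sketched as a pipeline that grows one stage at a time, ending with `sort | uniq -c` to count repeated values. The sample rows are made up for illustration and are not the real `distros.txt`.

```shell
# Made-up sample table (name, version, date), tab-separated.
workdir=$(mktemp -d)
printf 'Alpha\t10\t06/19/2008\nBeta\t9\t11/25/2007\nAlpha\t8\t03/02/2006\nGamma\t7\t10/18/2007\n' > "$workdir/sample.txt"

# Step 1: start simple and pull out just the date field.
step1=$(cut -f 3 "$workdir/sample.txt")

# Step 2: add a second cut to keep only the year (characters 7-10).
step2=$(cut -f 3 "$workdir/sample.txt" | cut -c 7-10)

# Step 3: sort the years, then count how many records fall in each one.
step3=$(cut -f 3 "$workdir/sample.txt" | cut -c 7-10 | sort | uniq -c)

printf '%s\n\n%s\n\n%s\n' "$step1" "$step2" "$step3"
```

Checking the output after each step makes it much easier to spot where a pipeline goes wrong than writing the whole thing at once. Note that `uniq` only collapses adjacent duplicate lines, which is why `sort` comes first.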
Slicing and Dicing with cut and paste
- p. 290 & 291: `paste` and `join`: `paste` can be useful, but keep in mind that the order of records in the files must be the same! `join`, however, can be used to combine files that have a shared column of data. We will look at joins more when we get to databases. Have a look at this section, but don’t worry too much about the details.
- p. 293 - 298: Comparing Text: You can skip the rest of the chapter. The `diff` command can be useful, but there’s too much to learn here…
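The difference between `paste` and `join` is easiest to see with two small files whose records are in different orders. This is only a sketch; the files and their contents are made up for illustration.

```shell
workdir=$(mktemp -d)
# Two made-up files that share an ID in the first column.
printf '1 apple\n2 banana\n3 cherry\n' > "$workdir/names.txt"
printf '3 red\n1 green\n2 yellow\n' > "$workdir/colors.txt"

# paste glues line 1 to line 1, line 2 to line 2, and so on, ignoring the IDs,
# so the records are mismatched when the files are in different orders.
pasted=$(paste "$workdir/names.txt" "$workdir/colors.txt")
printf '%s\n\n' "$pasted"

# join matches records on the shared first field; both inputs must be sorted on it.
sort -k 1,1 "$workdir/colors.txt" > "$workdir/colors.sorted.txt"
joined=$(join "$workdir/names.txt" "$workdir/colors.sorted.txt")
printf '%s\n' "$joined"
```

In the `paste` output, apple ends up next to red only because both happen to be on the first line, while `join` pairs apple with green by matching ID 1 in both files.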
Editing on the Fly: tr and sed
- p. 298: `tr` can be useful, so have a quick look at this section.
- p. 301: `sed` is super powerful; this is a quick introduction. `sed` allows you to do regular-expression find-and-replace operations from the command line. I would suggest looking at this section, but focus on the substitution options, e.g. `s/first/second/g`. The text has a ton of great examples, and maybe one of them speaks to you, but I think it may be overwhelming!
- p. 309: `aspell`: skip this section.
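As a quick illustration of the `tr` and `sed` items above, here is a minimal sketch of character translation and of the `s/first/second/g` style of substitution. The input strings are made up for the example.

```shell
# tr translates characters one for one: here, lowercase to uppercase.
upper=$(printf 'lowercase letters\n' | tr 'a-z' 'A-Z')
printf '%s\n' "$upper"

# sed substitution without the g flag replaces only the first match on each line.
first_only=$(printf 'first, first, and first\n' | sed 's/first/second/')
printf '%s\n' "$first_only"

# Adding the g flag replaces every match on the line.
all_matches=$(printf 'first, first, and first\n' | sed 's/first/second/g')
printf '%s\n' "$all_matches"

# The search pattern is a regular expression, and \1, \2, \3 refer to the
# parenthesized groups: rewrite an MM/DD/YYYY date as YYYY-MM-DD.
iso_date=$(printf '11/25/2008\n' | sed 's/\([0-9]\{2\}\)\/\([0-9]\{2\}\)\/\([0-9]\{4\}\)/\3-\1-\2/')
printf '%s\n' "$iso_date"
```

The general shape is `s/pattern/replacement/flags`; for most day-to-day use, the pattern, the replacement, and the `g` flag are the parts worth remembering.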