Given the following input (cat i.txt),
I want to replace repeated field entries in each of the first three columns with ---- and leave the other columns untouched.
DLORENZ;EDDELAK;BCL;G1;2019-04-01;175
DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165
DLORENZ;EDDELAK;BRV/COV;WH;2018-05-29;88
DLORENZ;EDDELAK;BRV/COV;WH;2018-10-02;139
...
The input is sorted on columns 1 through 6, in that order of priority.
That is, from here (cat i.txt | column -s ';' -t)
DLORENZ EDDELAK BCL G1 2019-04-01 175
DLORENZ EDDELAK BRV/COV G1 2018-01-31 165
DLORENZ EDDELAK BRV/COV G2 2018-02-28 165
DLORENZ EDDELAK BRV/COV WH 2018-05-29 88
DLORENZ EDDELAK BRV/COV WH 2018-10-02 139
DLORENZ EDDELAK BRV/COV WH 2019-01-07 140
HELMGBR GUDENDORF BCL G1 2018-04-29 600
HELMGBR GUDENDORF BCL G2 2018-05-28 580
HELMGBR GUDENDORF BCL WH 2018-11-21 600
HELMGBR GUDENDORF BOT G1 2018-07-09 600
HELMGBR GUDENDORF BOT G2 2018-08-06 600
HELMGBR GUDENDORF BOT WH 2019-02-13 600
HELMGBR GUDENDORF CHLM G1 2017-12-14 600
HELMGBR GUDENDORF CHLM G2 2018-01-11 600
HELMGBR GUDENDORF CHLM WH 2018-09-05 550
HKARSTENS KUDEN BCL G1 2019-03-11 255
HKARSTENS KUDEN BCL G2 2019-04-10 255
HSCHLADETSCH EDDELAK BCL G1 2019-03-11 213
HSCHLADETSCH EDDELAK BCL G2 2019-04-08 201
HSCHLADETSCH EDDELAK BRV/COV G1 1979-01-01 218
HSCHLADETSCH EDDELAK BRV/COV G2 1979-01-01 218
HSCHLADETSCH EDDELAK BRV/COV WH 2018-03-13 218
HSCHLADETSCH EDDELAK BRV/COV WH 2018-09-10 160
HWULFF KUDEN BCL G1 2018-02-28 244
HWULFF KUDEN BCL G2 2018-03-28 244
HWULFF KUDEN BCL WH 2018-09-20 190
HWULFF KUDEN BCL WH 2019-03-19 250
HWULFF KUDEN CHLM G1 2018-04-01 244
HWULFF KUDEN CHLM G2 2018-04-29 244
HWULFF KUDEN CHLM WH 2019-03-28 250
JMEIER EDDELAK BCL G1 2018-04-30 360
JMEIER EDDELAK BCL G2 2018-05-28 360
JPETERS KAISERWILHELMKOOG CHLM G1 2018-02-26 65
JPETERS KAISERWILHELMKOOG CHLM G2 2018-03-26 65
JPETERS KAISERWILHELMKOOG CHLM WH 2019-01-18 79
JTHODE BUCHHOLZ BCL G1 2019-03-12 253
JTHODE BUCHHOLZ BCL G2 2019-04-12 253
KMEHLERT BRUNSBUETTEL BCL G1 2018-12-13 79
KMEHLERT BRUNSBUETTEL BCL G2 2019-01-10 119
MMAGENS BARLT CHLM G1 2018-02-13 165
MMAGENS BARLT CHLM G2 2018-03-13 165
MMAGENS BARLT CHLM WH 2018-09-12 136
MMAGENS BARLT CHLM WH 2019-03-14 132
MSCHNEPEL WINDBERGEN CHLM G1 2017-10-09 205
MSCHNEPEL WINDBERGEN CHLM G2 2017-11-02 263
MSCHNEPEL WINDBERGEN CHLM WH 2018-04-10 272
MSCHNEPEL WINDBERGEN CHLM WH 2018-10-25 208
NJUNGE EDDELAK BCL G1 2018-03-07 146
NJUNGE EDDELAK BCL G2 2018-04-04 146
NJUNGE EDDELAK BCL WH 2018-08-06 100
NJUNGE EDDELAK BCL WH 2018-11-14 105
NJUNGE EDDELAK BCL WH 2019-03-12 118
SMOHR BRUNSBUETTEL CHLM G1 2018-04-30 110
SMOHR BRUNSBUETTEL CHLM G2 2018-05-28 110
SMOHR BRUNSBUETTEL CHLM WH 2018-12-18 98
...
I want to arrive at the following output (cat 1fertig.txt | column -s ';' -t):
DLORENZ EDDELAK BCL G1 2019-04-01 175
---- ---- BRV/COV G1 2018-01-31 165
---- ---- ---- G2 2018-02-28 165
---- ---- ---- WH 2018-05-29 88
---- ---- ---- WH 2018-10-02 139
---- ---- ---- WH 2019-01-07 140
HELMGBR GUDENDORF BCL G1 2018-04-29 600
---- ---- ---- G2 2018-05-28 580
---- ---- ---- WH 2018-11-21 600
---- ---- BOT G1 2018-07-09 600
---- ---- ---- G2 2018-08-06 600
---- ---- ---- WH 2019-02-13 600
---- ---- CHLM G1 2017-12-14 600
---- ---- ---- G2 2018-01-11 600
---- ---- ---- WH 2018-09-05 550
HKARSTENS KUDEN BCL G1 2019-03-11 255
---- ---- ---- G2 2019-04-10 255
HSCHLADETSCH EDDELAK BCL G1 2019-03-11 213
---- ---- ---- G2 2019-04-08 201
---- ---- BRV/COV G1 1979-01-01 218
---- ---- ---- G2 1979-01-01 218
---- ---- ---- WH 2018-03-13 218
---- ---- ---- WH 2018-09-10 160
HWULFF KUDEN BCL G1 2018-02-28 244
---- ---- ---- G2 2018-03-28 244
---- ---- ---- WH 2018-09-20 190
---- ---- ---- WH 2019-03-19 250
---- ---- CHLM G1 2018-04-01 244
---- ---- ---- G2 2018-04-29 244
---- ---- ---- WH 2019-03-28 250
JMEIER EDDELAK BCL G1 2018-04-30 360
---- ---- ---- G2 2018-05-28 360
JPETERS KAISERWILHELMKOOG CHLM G1 2018-02-26 65
---- ---- ---- G2 2018-03-26 65
---- ---- ---- WH 2019-01-18 79
JTHODE BUCHHOLZ BCL G1 2019-03-12 253
---- ---- ---- G2 2019-04-12 253
KMEHLERT BRUNSBUETTEL BCL G1 2018-12-13 79
---- ---- ---- G2 2019-01-10 119
MMAGENS BARLT CHLM G1 2018-02-13 165
---- ---- ---- G2 2018-03-13 165
---- ---- ---- WH 2018-09-12 136
---- ---- ---- WH 2019-03-14 132
MSCHNEPEL WINDBERGEN CHLM G1 2017-10-09 205
---- ---- ---- G2 2017-11-02 263
---- ---- ---- WH 2018-04-10 272
---- ---- ---- WH 2018-10-25 208
NJUNGE EDDELAK BCL G1 2018-03-07 146
---- ---- ---- G2 2018-04-04 146
---- ---- ---- WH 2018-08-06 100
---- ---- ---- WH 2018-11-14 105
---- ---- ---- WH 2019-03-12 118
SMOHR BRUNSBUETTEL CHLM G1 2018-04-30 110
---- ---- ---- G2 2018-05-28 110
---- ---- ---- WH 2018-12-18 98
The output will be further processed into a LaTeX input file.
The approach I took is reasonably straightforward:
First replace the duplicates in column 3 of the input, then replace the duplicates in column 2 of the result, then replace the duplicates in column 1 of that result.
It is efficient enough for my needs (offhand I can't think of anything substantially faster, apart from writing to disk less often), but the code is not readable at all.
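# Work on a copy of the input inside a scratch directory.
n="$(wc -l < i.txt)"
rm -rfv f
mkdir f
cat i.txt > f/i.txt
cd f
# Column 3: split the input into one file per value of column 1 ...
while IFS=';' read -r lbezg rest
do
    echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < i.txt
# ... then into one file per (column 1, column 2) pair ...
for file in 1lw_*
do
    while IFS=';' read -r lbezg sbezg rest
    do
        echo "$lbezg"';'"$sbezg"';'"$rest" >> 1lw_2so_"$lbezg"_"$sbezg"
    done < "$file"
done
# ... then into one file per (column 1, column 2, column 3) triple.
for file in 1lw_2so_*
do
    while IFS=';' read -r lbezg sbezg impfstoff rest
    do
        # Strip the slash so the value can be used in a file name.
        ii="$(echo "$impfstoff" | tr -d '/')"
        echo "$lbezg"';'"$sbezg"';'"$impfstoff"';'"$rest" >> 1lw_2so_3impfstoff_"$lbezg"_"$sbezg"_"$ii"
    done < "$file"
done
# Within each group, keep column 3 only on the first line.
for file in 1lw_2so_3impfstoff_*
do
    awk -F';' -v OFS=';' ' {if (NR>1) $3="----"; print $0}' < "$file"
done > 3fertig.txt
rm 1lw*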
# Column 2: regroup the previous result by (column 1, column 2).
while IFS=';' read -r lbezg rest
do
    echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < 3fertig.txt
for file in 1lw_*
do
    while IFS=';' read -r lbezg sbezg rest
    do
        echo "$lbezg"';'"$sbezg"';'"$rest" >> 1lw_2so_"$lbezg"_"$sbezg"
    done < "$file"
done
# Within each group, keep column 2 only on the first line.
for file in 1lw_2so_*
do
    awk -F';' -v OFS=';' ' {if (NR>1) $2="----"; print $0}' < "$file"
done > 2fertig.txt
rm 1lw*
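# Column 1: regroup the previous result by column 1.
while IFS=';' read -r lbezg rest
do
    echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < 2fertig.txt
# Within each group, keep column 1 only on the first line.
for file in 1lw_*
do
    awk -F';' -v OFS=';' ' {if (NR>1) $1="----"; print $0}' < "$file"
done > 1fertig.txt
rm 1lw*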
####### the rest is for nice error checking and not strictly necessary
time for i in $(seq 1 "$n")
do
    l1="$(sed -n "$i"p < i.txt)"
    l2="$(sed -n "$i"p < 1fertig.txt)"
    echo "$i"';'"$l1"'|'"$i"';'"$l2"
done | column -s '|' -t > differ.txt
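For comparison, the same three steps might be expressible as a single awk pass that remembers the previous line's leading fields instead of splitting the input into files. This is only a sketch: it assumes the input is already sorted as described above, and the function name dedup_leading_columns is hypothetical, not something from the script above. A column is replaced only when it and all columns to its left match the previous line, which is what grouping by file name achieves in the script.

```shell
# Hypothetical helper: reads sorted, semicolon-separated lines on stdin and
# replaces repeated leading fields (columns 1 to 3) with "----".
dedup_leading_columns() {
    awk -F';' -v OFS=';' '
    {
        # Build the prefix keys from the original field values first.
        k1 = $1
        k2 = $1 FS $2
        k3 = $1 FS $2 FS $3
        if (NR > 1) {
            # A field counts as repeated only if its whole prefix repeats.
            if (k3 == p3) $3 = "----"
            if (k2 == p2) $2 = "----"
            if (k1 == p1) $1 = "----"
        }
        # Remember the unmodified keys for the next line.
        p1 = k1; p2 = k2; p3 = k3
        print
    }'
}

# Demo on the first three input lines shown above.
printf '%s\n' \
    'DLORENZ;EDDELAK;BCL;G1;2019-04-01;175' \
    'DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165' \
    'DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165' | dedup_leading_columns
```

On the real file this would presumably be called as dedup_leading_columns < i.txt > 1fertig.txt, with no temporary files needed.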
How would you go about this?