Removing duplicate field entries from sorted csv data

Question

Given the following input (cat i.txt), I want to replace repeated field entries in each of the first three columns with a placeholder (----) and leave the other columns untouched.

DLORENZ;EDDELAK;BCL;G1;2019-04-01;175
DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165
DLORENZ;EDDELAK;BRV/COV;WH;2018-05-29;88
DLORENZ;EDDELAK;BRV/COV;WH;2018-10-02;139
...

The input is sorted first on column 1, then on column 2, then on column 3, then on column 4, then on column 5, then on column 6.

That is, from here (cat i.txt | column -s ';' -t)

DLORENZ       EDDELAK            BCL      G1  2019-04-01  175
DLORENZ       EDDELAK            BRV/COV  G1  2018-01-31  165
DLORENZ       EDDELAK            BRV/COV  G2  2018-02-28  165
DLORENZ       EDDELAK            BRV/COV  WH  2018-05-29  88
DLORENZ       EDDELAK            BRV/COV  WH  2018-10-02  139
DLORENZ       EDDELAK            BRV/COV  WH  2019-01-07  140
HELMGBR       GUDENDORF          BCL      G1  2018-04-29  600
HELMGBR       GUDENDORF          BCL      G2  2018-05-28  580
HELMGBR       GUDENDORF          BCL      WH  2018-11-21  600
HELMGBR       GUDENDORF          BOT      G1  2018-07-09  600
HELMGBR       GUDENDORF          BOT      G2  2018-08-06  600
HELMGBR       GUDENDORF          BOT      WH  2019-02-13  600
HELMGBR       GUDENDORF          CHLM     G1  2017-12-14  600
HELMGBR       GUDENDORF          CHLM     G2  2018-01-11  600
HELMGBR       GUDENDORF          CHLM     WH  2018-09-05  550
HKARSTENS     KUDEN              BCL      G1  2019-03-11  255
HKARSTENS     KUDEN              BCL      G2  2019-04-10  255
HSCHLADETSCH  EDDELAK            BCL      G1  2019-03-11  213
HSCHLADETSCH  EDDELAK            BCL      G2  2019-04-08  201
HSCHLADETSCH  EDDELAK            BRV/COV  G1  1979-01-01  218
HSCHLADETSCH  EDDELAK            BRV/COV  G2  1979-01-01  218
HSCHLADETSCH  EDDELAK            BRV/COV  WH  2018-03-13  218
HSCHLADETSCH  EDDELAK            BRV/COV  WH  2018-09-10  160
HWULFF        KUDEN              BCL      G1  2018-02-28  244
HWULFF        KUDEN              BCL      G2  2018-03-28  244
HWULFF        KUDEN              BCL      WH  2018-09-20  190
HWULFF        KUDEN              BCL      WH  2019-03-19  250
HWULFF        KUDEN              CHLM     G1  2018-04-01  244
HWULFF        KUDEN              CHLM     G2  2018-04-29  244
HWULFF        KUDEN              CHLM     WH  2019-03-28  250
JMEIER        EDDELAK            BCL      G1  2018-04-30  360
JMEIER        EDDELAK            BCL      G2  2018-05-28  360
JPETERS       KAISERWILHELMKOOG  CHLM     G1  2018-02-26  65
JPETERS       KAISERWILHELMKOOG  CHLM     G2  2018-03-26  65
JPETERS       KAISERWILHELMKOOG  CHLM     WH  2019-01-18  79
JTHODE        BUCHHOLZ           BCL      G1  2019-03-12  253
JTHODE        BUCHHOLZ           BCL      G2  2019-04-12  253
KMEHLERT      BRUNSBUETTEL       BCL      G1  2018-12-13  79
KMEHLERT      BRUNSBUETTEL       BCL      G2  2019-01-10  119
MMAGENS       BARLT              CHLM     G1  2018-02-13  165
MMAGENS       BARLT              CHLM     G2  2018-03-13  165
MMAGENS       BARLT              CHLM     WH  2018-09-12  136
MMAGENS       BARLT              CHLM     WH  2019-03-14  132
MSCHNEPEL     WINDBERGEN         CHLM     G1  2017-10-09  205
MSCHNEPEL     WINDBERGEN         CHLM     G2  2017-11-02  263
MSCHNEPEL     WINDBERGEN         CHLM     WH  2018-04-10  272
MSCHNEPEL     WINDBERGEN         CHLM     WH  2018-10-25  208
NJUNGE        EDDELAK            BCL      G1  2018-03-07  146
NJUNGE        EDDELAK            BCL      G2  2018-04-04  146
NJUNGE        EDDELAK            BCL      WH  2018-08-06  100
NJUNGE        EDDELAK            BCL      WH  2018-11-14  105
NJUNGE        EDDELAK            BCL      WH  2019-03-12  118
SMOHR         BRUNSBUETTEL       CHLM     G1  2018-04-30  110
SMOHR         BRUNSBUETTEL       CHLM     G2  2018-05-28  110
SMOHR         BRUNSBUETTEL       CHLM     WH  2018-12-18  98

... I want to arrive at the following output (cat 1fertig.txt | column -s ';' -t):

DLORENZ       EDDELAK            BCL      G1  2019-04-01  175
----          ----               BRV/COV  G1  2018-01-31  165
----          ----               ----     G2  2018-02-28  165
----          ----               ----     WH  2018-05-29  88
----          ----               ----     WH  2018-10-02  139
----          ----               ----     WH  2019-01-07  140
HELMGBR       GUDENDORF          BCL      G1  2018-04-29  600
----          ----               ----     G2  2018-05-28  580
----          ----               ----     WH  2018-11-21  600
----          ----               BOT      G1  2018-07-09  600
----          ----               ----     G2  2018-08-06  600
----          ----               ----     WH  2019-02-13  600
----          ----               CHLM     G1  2017-12-14  600
----          ----               ----     G2  2018-01-11  600
----          ----               ----     WH  2018-09-05  550
HKARSTENS     KUDEN              BCL      G1  2019-03-11  255
----          ----               ----     G2  2019-04-10  255
HSCHLADETSCH  EDDELAK            BCL      G1  2019-03-11  213
----          ----               ----     G2  2019-04-08  201
----          ----               BRV/COV  G1  1979-01-01  218
----          ----               ----     G2  1979-01-01  218
----          ----               ----     WH  2018-03-13  218
----          ----               ----     WH  2018-09-10  160
HWULFF        KUDEN              BCL      G1  2018-02-28  244
----          ----               ----     G2  2018-03-28  244
----          ----               ----     WH  2018-09-20  190
----          ----               ----     WH  2019-03-19  250
----          ----               CHLM     G1  2018-04-01  244
----          ----               ----     G2  2018-04-29  244
----          ----               ----     WH  2019-03-28  250
JMEIER        EDDELAK            BCL      G1  2018-04-30  360
----          ----               ----     G2  2018-05-28  360
JPETERS       KAISERWILHELMKOOG  CHLM     G1  2018-02-26  65
----          ----               ----     G2  2018-03-26  65
----          ----               ----     WH  2019-01-18  79
JTHODE        BUCHHOLZ           BCL      G1  2019-03-12  253
----          ----               ----     G2  2019-04-12  253
KMEHLERT      BRUNSBUETTEL       BCL      G1  2018-12-13  79
----          ----               ----     G2  2019-01-10  119
MMAGENS       BARLT              CHLM     G1  2018-02-13  165
----          ----               ----     G2  2018-03-13  165
----          ----               ----     WH  2018-09-12  136
----          ----               ----     WH  2019-03-14  132
MSCHNEPEL     WINDBERGEN         CHLM     G1  2017-10-09  205
----          ----               ----     G2  2017-11-02  263
----          ----               ----     WH  2018-04-10  272
----          ----               ----     WH  2018-10-25  208
NJUNGE        EDDELAK            BCL      G1  2018-03-07  146
----          ----               ----     G2  2018-04-04  146
----          ----               ----     WH  2018-08-06  100
----          ----               ----     WH  2018-11-14  105
----          ----               ----     WH  2019-03-12  118
SMOHR         BRUNSBUETTEL       CHLM     G1  2018-04-30  110
----          ----               ----     G2  2018-05-28  110
----          ----               ----     WH  2018-12-18  98

The output will be further processed into a LaTeX input file.

The code I wrote is reasonably straightforward:

First kill the duplicates in column 3 of input, then kill the duplicates in column 2 of the result, then kill the duplicates in column 1 of that result.

It is even efficient enough for my needs (and I can't come up with anything substantially faster offhand, except for not writing to disk that much). But it is not readable at all.
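
For comparison, here is a minimal sketch of how the same idea could run in a single pass without any temporary files. It is not the code from the question: it compares each line with the previous one, so it assumes the input is sorted (or at least grouped) on columns 1 to 3. The function name blank_repeats and the variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: blank_repeats is a hypothetical helper, not part of the original script.
# Assumes the input is sorted (or at least grouped) on columns 1-3.
blank_repeats() {
    awk -F';' -v OFS=';' '
    {
        same = 1                          # still inside the repeated prefix?
        for (i = 1; i <= 3; i++) {
            cur[i] = $i                   # keep the original value for the next line
            # ($i "") forces a string comparison even for numeric-looking fields
            if (same && NR > 1 && ($i "") == prev[i])
                $i = "----"
            else
                same = 0                  # first difference: keep this field and those after it
        }
        for (i = 1; i <= 3; i++) prev[i] = cur[i]
        print
    }'
}

sample='DLORENZ;EDDELAK;BCL;G1;2019-04-01;175
DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165'

result="$(printf '%s\n' "$sample" | blank_repeats)"
printf '%s\n' "$result"
```

The same flag stands in for the nested per-file grouping: a field is blanked only while every column to its left also matched the previous line.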

n="$(wc -l < i.txt)"

rm -rfv f

mkdir f

cat i.txt > f/i.txt

cd f

while IFS=';' read lbezg rest
do
    echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < i.txt

for file in 1lw_*
do
    while IFS=';' read lbezg sbezg rest
    do
        echo "$lbezg"';'"$sbezg"';'"$rest" >> 1lw_2so_"$lbezg"_"$sbezg"
    done < "$file"
done

for file in 1lw_2so_*
do
    while IFS=';' read lbezg sbezg impfstoff rest
    do
        ii="$(echo "$impfstoff" | tr -d '/')"
        echo "$lbezg"';'"$sbezg"';'"$impfstoff"';'"$rest" >> 1lw_2so_3impfstoff_"$lbezg"_"$sbezg"_"$ii"
    done < "$file"
done

for file in 1lw_2so_3impfstoff_*
do
    awk -F';' -v OFS=';' ' {if (NR>1) $3="----"; print $0}' < "$file"
done > 3fertig.txt

rm 1lw*

while IFS=';' read lbezg rest
do
    echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < 3fertig.txt

for file in 1lw_*
do
    while IFS=';' read lbezg sbezg rest
    do
        echo "$lbezg"';'"$sbezg"';'"$rest" >> 1lw_2so_"$lbezg"_"$sbezg"
    done < "$file"
done

for file in 1lw_2so_*
do
    awk -F';' -v OFS=';' ' {if (NR>1) $2="----"; print $0}' < "$file"
done > 2fertig.txt

rm 1lw*

while IFS=';' read lbezg rest
do
    echo "$lbezg"';'"$rest" >> 1lw_"$lbezg"
done < 2fertig.txt

for file in 1lw_*
do
    awk -F';' -v OFS=';' ' {if (NR>1) $1="----"; print $0}' < "$file"
done > 1fertig.txt

rm 1lw*

####### the rest is for nice error checking and not strictly necessary

time for i in $(seq 1 "$n")
do

    l1="$(sed -n "$i"p < i.txt)" 
    l2="$(sed -n "$i"p < 1fertig.txt)"

    echo "$i"';'"$l1"'|'"$i"';'"$l2"

done | column -s '|' -t > differ.txt
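
The check above calls sed twice per line, so both files are rescanned on every iteration. A possible alternative, sketched here under the assumption that both files have the same number of lines and that the first one is not empty, reads each file once. The helper name side_by_side and the temporary file names are hypothetical.

```shell
#!/bin/sh
# Sketch only: side_by_side is a hypothetical helper name.
# Reads the first file into memory, then prints "N;line1|N;line2"
# for every line of the second file.
side_by_side() {
    awk 'NR == FNR { first[FNR] = $0; next }
         { print FNR ";" first[FNR] "|" FNR ";" $0 }' "$1" "$2"
}

# Small self-contained demonstration with throwaway files.
tmp="$(mktemp -d)"
printf '%s\n' 'A;B;C;1' 'A;B;C;2' > "$tmp/in.txt"
printf '%s\n' 'A;B;C;1' '----;----;----;2' > "$tmp/out.txt"
result="$(side_by_side "$tmp/in.txt" "$tmp/out.txt")"
printf '%s\n' "$result"
rm -rf "$tmp"
```

With the real files this would be used as side_by_side i.txt 1fertig.txt | column -s '|' -t > differ.txt, which keeps the same output layout as the loop.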

I wonder how you would go about this?

Oh My Goodness · Accepted Answer · 2019-04-30 12:55:26Z

The shell is usually a poor choice for processing data. Let awk do it for you:

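#!/usr/bin/awk -f
BEGIN { FS = OFS = ";" }
{
  stub = ""    # prefix of columns 1..i, grown by one column per iteration
  # Blank a field if this exact prefix has already been seen on an earlier line.
  for (i = 1; i <= 3; i++) if (saw[ stub = stub FS $i ]++) $i = "----"
  print
}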

If it has to be bash:

#!/bin/bash
awk -F\; -vOFS=\; '{s=0; for(i=1;i<=3;i++) if(saw[s=s FS $i]++) $i="----"} 1' i.txt > 1fertig.txt
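
To see why the array key is built up column by column, consider a value that repeats under a different parent: KUDEN appears in column 2 for both HKARSTENS and HWULFF. Because the key for column 2 also contains column 1, the second KUDEN is not treated as a duplicate. The sketch below wraps the awk program in a function (the name dedup_prefix is hypothetical) and runs it on three lines taken from the question's input.

```shell
#!/bin/sh
# Sketch only: dedup_prefix is a hypothetical wrapper around the awk program above.
dedup_prefix() {
    awk 'BEGIN { FS = OFS = ";" }
    {
        stub = ""
        # The key for column i is the concatenation of columns 1..i.
        for (i = 1; i <= 3; i++) if (saw[stub = stub FS $i]++) $i = "----"
        print
    }'
}

sample='HKARSTENS;KUDEN;BCL;G1;2019-03-11;255
HKARSTENS;KUDEN;BCL;G2;2019-04-10;255
HWULFF;KUDEN;BCL;G1;2018-02-28;244'

result="$(printf '%s\n' "$sample" | dedup_prefix)"
printf '%s\n' "$result"
```

The second line loses its first three fields, while the HWULFF line keeps KUDEN and BCL because those prefixes have not occurred before.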

Comments

I'm afraid this does not produce the intended output, because, for example, the first line of the output now reads DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165 which is incorrect, because there is no line above this line that has "G1" in the fourth field, whereas in the original output, we have DLORENZ;EDDELAK;BRV/COV;G1;2018-01-31;165 in line 2, followed by DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165 in line 3. I'll see where I end up using your approach with arrays, though. Decorating the output of your script and re-sorting will allow me to restore the sort.
– Thure Dührsen, 2019-05-02 12:18:59 +00:00

I think you've introduced an error somewhere, or changed your input without realizing it. Before posting, I saved your example input and output. A diff between your output and mine gave zero differences. Testing again just now, the first line of output I get is DLORENZ;EDDELAK;BCL;G1;2019-04-01;175
– Oh My Goodness, 2019-05-02 22:04:18 +00:00

What awk version do you use?

[tdu:gimli] /tmp/tdu/work awk --version | head -n 2
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.
[tdu:gimli] /tmp/tdu/work head -n 1 i.txt
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165;Lorenz:Dirk;DLORENZ
[tdu:gimli] /tmp/tdu/work LC_ALL=C awk -F\; -vOFS=\; '{s=0; for(i=1;i<=3;i++) if(saw[s=s FS $i]++) $i="----"} 1' i.txt > 1fertig.txt
[tdu:gimli] /tmp/tdu/work head -n 1 1fertig.txt
DLORENZ;EDDELAK;BRV/COV;G2;2018-02-28;165;Lorenz:Dirk;DLORENZ
[tdu:gimli] /tmp/tdu/work

– Thure Dührsen, 2019-05-03 08:55:09 +00:00

and the backticks for code seem not to be suitable for several lines in a row... :(
– Thure Dührsen, 2019-05-03 08:55:49 +00:00

GNU Awk 4.2.1. Look at your own input. The first line is G2 in the 4th field, and the output is the same. There's nothing in the awk code that can change the 4th field at all, let alone from G1 to G2.
– Oh My Goodness, 2019-05-03 09:25:19 +00:00
