Data Table User defined functions

Question

I have a data.table with a column that lists the harmonized tariff codes for the goods that are being shipped. There are some input issues because sometimes a row may have repeated numbers "7601.00; 7601.00" and sometimes it might have different numbers, "7601.00; 8800.00". I have not decided what to do when I have differing entries, but the first thing I want to do is get rid of the duplicates. So I wrote a custom user defined function:

unique_hscodes <- function(hs_input){


  new <- strsplit(hs_input, split = ";")                   # Delimiter ;
  new <- lapply(new, str_replace_all, " ", "")

  if (length(unique(unlist(new))) == 1) {                  # Unique HS code
    return(unique(unlist(new)))
  }  
  else {

  new <- names(sort(table(unlist(new)),decreasing=TRUE)[1]) # Most frequent

  return(new) 

  } 

}

When I do, DT[, hs_code := unique_hscodes(hscode)] it returns me a data table with a column hs_code with the same number. But when I do DT[, hs_code := unique_hscodes(hscode), by =1:nrow(DT)], it is done properly.

Can someone please explain what is going on here?

Welcome to StackOverflow! Please read the info about how to ask a good question and how to give a reproducible example. This will make it much easier for others to help you. — Jaap
– Jaap, Commented Apr 23, 2018 at 18:45
Maybe DT[, hs_code := sapply(hscode, unique_hscodes)] works? — Jaap
– Jaap, Commented Apr 23, 2018 at 18:46

IRTFM · Accepted Answer · 2018-04-23 19:08:22Z

1

Your code returns multiple items from a single item input after the string split. When you run it with by=1:nrow(DT) only a single row is examined at once. That problem doesn't arise when only a single row is presented.

 DT <- data.table(hscode=c("7601.00; 7601.00" , "7601.00; 8800.00"))
 DT
#-----
             hscode
1: 7601.00; 7601.00
2: 7601.00; 8800.00
#--
 DT[ ,  table( unlist( strsplit(hscode, split="; "))) ]

#7601.00 8800.00 
#      3       1 
 DT[ ,  table( unlist( strsplit(hscode, split="; "))) , by=1:nrow(DT)]
#---------
  nrow V1
1:    1  2
2:    2  1
3:    2  1

I tried @Jaap's code with the simple example but it only split the column in two:

> DT[, hs_code := sapply(hscode, unique_hscodes)]
> DT
             hscode hs_code
1: 7601.00; 7601.00 7601.00
2: 7601.00; 8800.00 7601.00

answered Apr 23, 2018 at 19:08

IRTFM

264k22 gold badges381 silver badges503 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

IRTFM Over a year ago

I thought the entire question was why, since the by operation was "working".

IRTFM Over a year ago

@Frank: I guess I thought that the other code was actually "getting rid of the duplicates", but it wasn't, I see now.

Collectives™ on Stack Overflow

Data Table User defined functions

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related