1

I have a data.table with a column that lists the harmonized tariff codes for the goods that are being shipped. There are some input issues because sometimes a row may have repeated numbers "7601.00; 7601.00" and sometimes it might have different numbers, "7601.00; 8800.00". I have not decided what to do when I have differing entries, but the first thing I want to do is get rid of the duplicates. So I wrote a custom user defined function:

unique_hscodes <- function(hs_input){


  new <- strsplit(hs_input, split = ";")                   # Delimiter ;
  new <- lapply(new, str_replace_all, " ", "")

  if (length(unique(unlist(new))) == 1) {                  # Unique HS code
    return(unique(unlist(new)))
  }  
  else {

  new <- names(sort(table(unlist(new)),decreasing=TRUE)[1]) # Most frequent

  return(new) 

  } 

}

When I do, DT[, hs_code := unique_hscodes(hscode)] it returns me a data table with a column hs_code with the same number. But when I do DT[, hs_code := unique_hscodes(hscode), by =1:nrow(DT)], it is done properly.

Can someone please explain what is going on here?

2
  • 1
    Welcome to StackOverflow! Please read the info about how to ask a good question and how to give a reproducible example. This will make it much easier for others to help you. Commented Apr 23, 2018 at 18:45
  • 1
    Maybe DT[, hs_code := sapply(hscode, unique_hscodes)] works? Commented Apr 23, 2018 at 18:46

1 Answer 1

1

Your code returns multiple items from a single item input after the string split. When you run it with by=1:nrow(DT) only a single row is examined at once. That problem doesn't arise when only a single row is presented.

 DT <- data.table(hscode=c("7601.00; 7601.00" , "7601.00; 8800.00"))
 DT
#-----
             hscode
1: 7601.00; 7601.00
2: 7601.00; 8800.00
#--
 DT[ ,  table( unlist( strsplit(hscode, split="; "))) ]

#7601.00 8800.00 
#      3       1 
 DT[ ,  table( unlist( strsplit(hscode, split="; "))) , by=1:nrow(DT)]
#---------
  nrow V1
1:    1  2
2:    2  1
3:    2  1

I tried @Jaap's code with the simple example but it only split the column in two:

> DT[, hs_code := sapply(hscode, unique_hscodes)]
> DT
             hscode hs_code
1: 7601.00; 7601.00 7601.00
2: 7601.00; 8800.00 7601.00
Sign up to request clarification or add additional context in comments.

2 Comments

I thought the entire question was why, since the by operation was "working".
@Frank: I guess I thought that the other code was actually "getting rid of the duplicates", but it wasn't, I see now.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.