2

I have a table with product values as below:

  1. apple iphone

  2. iphone apple

  3. samsung phone

  4. phone samsung

I want to delete the products whose names are the exact word-for-word reverse of another product's name (I consider them duplicates), so that instead of 4 records my table has just 2:

  1. apple iphone

  2. samsung phone

I understand that there is a REVERSE function in SQL Server, but it reverses the whole string character by character, which is not what I'm looking for.

I'd greatly appreciate any suggestions/ideas.

5
  • 1
    Are there only ever two words? Commented Aug 23, 2013 at 18:01
  • There can be more than two words too. Commented Aug 23, 2013 at 18:02
  • I understand it when you call iPhones that, since Apple has only one brand... But having had to support Galaxy S/Y/II/III/IV/Grand Duos/Grand Quattro/Win/Note/Note 2/Tab/Tab 2 7.0", I think "Samsung Phone" is calling a lot of different things by the same name... Commented Aug 23, 2013 at 18:03
  • 7
    Great! Can you please show those cases too, instead of just the simplest? When you only show the simplest scenario, people tend to solve for that, and then you have to come back and say "but it didn't work for..." - ask the whole question up front, please. Commented Aug 23, 2013 at 18:04
  • Are these keyed-in strings? I think you may be looking for approximate string matching algorithms, not word reversal. Commented Aug 23, 2013 at 18:19

5 Answers

5

Assuming that your dictionary does not include any XML entities (e.g. > or <), and that it is not practical to manually create a bunch of UPDATE statements for every combination of words in your table (if it is practical, then simplify your life, stop reading this answer, and use Justin's answer), you can create a function like this:

CREATE FUNCTION dbo.SplitSafeStrings
(
   @List       NVARCHAR(MAX),
   @Delimiter  NVARCHAR(255)
)
RETURNS TABLE
WITH SCHEMABINDING
AS
   RETURN 
   ( SELECT Item = LTRIM(RTRIM(y.i.value('(./text())[1]', 'nvarchar(4000)')))
     FROM ( SELECT x = CONVERT(XML, '<i>' 
          + REPLACE(@List, @Delimiter, '</i><i>') + '</i>').query('.')
      ) AS a CROSS APPLY x.nodes('i') AS y(i));
GO

(If XML is a problem, there are other, more complex alternatives, such as CLR.)

Then you can do this:

DECLARE @x TABLE(id INT IDENTITY(1,1), s VARCHAR(64));

INSERT @x(s) VALUES
  ('apple iphone'),
  ('iphone Apple'),
  ('iphone samsung hoochie blat'),
  ('samsung hoochie blat iphone');

;WITH cte1 AS 
(
  SELECT id, Item FROM @x AS x
  CROSS APPLY dbo.SplitSafeStrings(LOWER(x.s), ' ') AS y
),
cte2(id,words) AS 
(
  SELECT DISTINCT id, STUFF((SELECT ',' + orig.Item 
    FROM cte1 AS orig
    WHERE orig.id = cte1.id
    ORDER BY orig.Item
    FOR XML PATH(''), TYPE).value('.[1]','nvarchar(max)'),1,1,'')
  FROM cte1
),
cte3 AS 
(
  SELECT id, words, rn = ROW_NUMBER() OVER (PARTITION BY words ORDER BY id)
  FROM cte2
)
SELECT id, words, rn FROM cte3
-- WHERE rn = 1 -- rows to keep
-- WHERE rn > 1 -- rows to delete
;

So you could, after the three CTEs, instead of the final SELECT above, say:

DELETE t FROM @x AS t
  INNER JOIN cte3 ON cte3.id = t.id
  WHERE cte3.rn > 1;

And what should be left in @x?

SELECT id, s FROM @x;

Results:

id  s
--  ---------------------------
1   apple iphone
3   iphone samsung hoochie blat

5

It seems to me that you are complicating this too much, a simple update statement would work:

UPDATE table SET productname = 'apple iphone' WHERE productname = 'iphone apple'

9 Comments

That assumes you know all of the possible combinations and it isn't too tedious to write all of those commands (what if there are thousands?). Also should be = 'apple iphone' - single quotes are string delimiters in T-SQL, double quotes are not. As an aside, how did you have an up-vote when your answer was exactly 3 seconds old?
@AaronBertrand I upvoted it. And it's been single quotes there from the beginning.
First, yes it assumes that. Second, you are correct, fixed. Third, I don't know
@Renan no, it has not. The original version had "apple iphone" but that won't show in the revision history because he fixed it during the grace period.
@AaronBertrand Also, a syntax error on an example should not be a reason to not upvote something IMO. In most cases, the answer should be more of a guide to help. It can be commented on to be fixed, or even fixed directly
3

I don't know how to do this in SQL, but in a language where you interface with SQL, you can do this:

You can tokenize each line so that you have an array of words, so that "iphone apple" becomes {"iphone","apple"} and then you can switch the order of the elements using a common swap statement so that it becomes {"apple","iphone"} and then you can turn it back into a string to make "apple iphone"

Although the process I describe above isn't all that hard to do, finding out which ones are duplicates of each other (knowing which ones to flip) might be a harder problem
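
One way to sidestep the "which ones to flip" problem is to not flip anything: give every name a canonical key that ignores word order, and treat rows with the same key as duplicates. The sketch below is only an illustration in Python, not something from the original posts. The names `canonical_key` and `ids_to_delete` are hypothetical, and it assumes words are separated by whitespace, that case does not matter, and that the row with the lowest id is the one to keep.

```python
def canonical_key(name):
    """Lowercase the name, split it on whitespace, and sort the words,
    so that any reordering of the same words yields the same key."""
    return tuple(sorted(name.lower().split()))


def ids_to_delete(rows):
    """Given (id, name) pairs, return the ids of rows whose word set
    was already seen on a row with a lower id."""
    seen = set()
    duplicates = []
    for row_id, name in sorted(rows):  # visit rows in id order; lowest id wins
        key = canonical_key(name)
        if key in seen:
            duplicates.append(row_id)
        else:
            seen.add(key)
    return duplicates


# Sample rows (assumed shape: id plus product name).
rows = [
    (1, "apple iphone"),
    (2, "iphone Apple"),
    (3, "online nokia lumia shop"),
    (4, "shop lumia nokia online"),
]
print(ids_to_delete(rows))  # [2, 4]
```

The returned ids could then be passed to a parameterized DELETE statement. Note that this treats any reordering of the same words as a duplicate, which is broader than an exact reversal.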

2

Based on the sample data you've provided, you could try something like this:

If the "proper" format for productname is <brand> <product_type>, you can simply delete all products whose productname is not like '<brand>%'.

If that doesn't help: are there any product naming rules?

Since the idea above cannot be applied, create a Split function:

CREATE FUNCTION [dbo].[Split]
(
    @String NVARCHAR(4000),
    @Delimiter NCHAR(1)
)
RETURNS TABLE 
AS
RETURN 
(
    WITH Split(stpos,endpos) 
    AS(
        SELECT 0 AS stpos, CHARINDEX(@Delimiter,@String) AS endpos
        UNION ALL
        SELECT endpos+1, CHARINDEX(@Delimiter,@String,endpos+1)
            FROM Split
            WHERE endpos > 0
    )
    SELECT 'Id' = ROW_NUMBER() OVER (ORDER BY (SELECT 1)),
        'Data' = SUBSTRING(@String, stpos, COALESCE(NULLIF(endpos,0), LEN(@String)+1) - stpos)
    FROM Split
)

And use it in query:

select 
    (SELECT (', ' + Data) 
     FROM Split(t.textVal, ' ')
     order by [Data]
     FOR XML PATH( '' )
    )
from 
    test t

This gives you each product name with its words sorted, which makes the duplicates easy to find. The second query is rough around the edges as I gotta go afk, but you should manage to smooth it out :) Good luck

1 Comment

There are no product naming rules as such; some other examples are: "online nokia lumia shop", "shop lumia nokia online"
2

Here's a solution for two or more words separated by spaces. Basically the idea is to use a recursive CTE to split on the space, and then FOR XML to put the words back together in sorted order. Then you can group by the new name column to get your deduplicated list:

with split as (
  select id,
    convert(varchar(max), left(name, charindex(' ', name + ' ') - 1)) word,
    stuff(name, 1, charindex(' ', name + ' '), '') name
  from products

  union all

  select id,
    convert(varchar(max), left(name, charindex(' ', name + ' ') - 1)) word,
    stuff(name, 1, charindex(' ', name + ' '), '') name
  from split where name > ''
),
hom as (
  select id,
    (select word + ' '
     from split where id=o.id
     order by word for xml path('')) name
  from split o
)

select name, min(id) id from hom group by name

SQLFiddle

2 Comments

Your SQLfiddle breaks down pretty quickly, if you add a 3rd word (which the OP has indicated). New SQLfiddle
the solution for 2 or more words will involve a table-valued function.. just a sec
