Data extraction with multiple delimiters?

Question

I have a legacy data source column that is delimitted by semicolons and commas. The first semicolon indicates the last name, the second indicates the first and middle name (or initials), and the last semicolon indicates the type of individual. The comma indicates that a new name has began. Here is a sample of this data.

+-------+---------------------------------------------------------------------------------------------------------------------+
|  ID   | SOURCE                                                                                                              |
+-------+---------------------------------------------------------------------------------------------------------------------+
| 62963 | RENZ;MICHAEL;DECEASED,WANDER;MARIA;MINOR,WANDER;HENRY RUDOLPH;MINOR,WANDER;ROSA;MINOR,WANDER;PAUL EMIL;MINOR        |
| 62964 | HERNDON;A C;ESTATE,BERRING;A F;DECEASED,BEIRING;A F;DECEASED,BEIRING;ANDREAS FREDERICK;DECEASED                     |
| 62965 | ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED                                                                   |
| 62965 | ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED                                                                   |
| 62966 | KRAUS;JOSEPHINE;MINOR,KENNEDY;GEORGE;DECEASED                                                                       |
| 62967 | CAREY;JAMES;ESTATE,DE LA GARZA;REFUGIO;DECEASED                                                                     |
| 62968 | LEWIS;FLORENCE;ESTATE,LOCKWOOD;ALBERT A;DECEASED                                                                    |
| 62969 | GLAESER;EMMA;MINOR,GLAESER;HERMAN JR;MINOR,GLAESER;HERMAN;MINOR,RODRIGUEZ;HILARIO;DECEASED,RODRIGUEZ;MARIE;DECEASED |
| 62970 | STORY;BETTIE;ESTATE,EIGENDORFF;FRANZ;DECEASED                                                                       |
| 62971 | HOWELL;MAMIE;MINOR,HOWELL;ETHEL;MINOR                                                                               |
+-------+---------------------------------------------------------------------------------------------------------------------+

I am attempting to pull the data in a manner such as this:

+-----------+------------+-------------+-------------------+----------+
|      ID   |   SEQUENCE |    LAST     |    FIRSTMIDDLE    |   TYPE   |
+-----------+------------+-------------+-------------------+----------+
|     62963 |          1 | RENZ        | MICHAEL           | DECEASED |
|     62963 |          2 | WANDER      | MARIA             | MINOR    |
|     62963 |          3 | WANDER      | HENRY RUDOLPH     | MINOR    |
|     62963 |          4 | WANDER      | ROSA              | MINOR    |
|     62963 |          5 | WANDER      | PAUL EMIL         | MINOR    |
|     62964 |          1 | HERNDON     | A C               | ESTATE   |
|     62964 |          2 | BERRING     | A F               | DECEASED |
|     62964 |          3 | BEIRING     | A F               | DECEASED |
|     62964 |          4 | BEIRING     | ANDREAS FREDERICK | DECEASED |
|     62965 |          1 | ZINCH       |                   | ESTATE   |
|     62965 |          2 | ZINTZ       |                   | ESTATE   |
|     62965 |          3 | HAYNES      | HENRY             | DECEASED |
|     62966 |          1 | KRAUS       | JOSEPHINE         | MINOR    |
|     62966 |          2 | KENNEDY     | GEORGE            | DECEASED |
|     62967 |          1 | CAREY       | JAMES             | ESTATE   |
|     62967 |          2 | DE LA GARZA | REFUGIO           | DECEASED |
|     62968 |          1 | LEWIS       | FLORENCE          | ESTATE   |
|     62968 |          2 | LOCKWOOD    | ALBERT A          | DECEASED |
|     62969 |          1 | GLAESER     | EMMA              | MINOR    |
|     62969 |          2 | GLAESER     | HERMAN JR         | MINOR    |
|     62969 |          3 | GLAESER     | HERMAN            | MINOR    |
|     62969 |          4 | RODRIGUEZ   | HILARIO           | DECEASED |
|     62969 |          5 | RODRIGUEZ   | MARIE             | DECEASED |
|     62970 |          1 | STORY       | BETTIE            | ESTATE   |
|     62970 |          2 | EIGENDORFF  | FRANZ             | DECEASED |
|     62971 |          1 | HOWELL      | MAMIE             | MINOR    |
|     62971 |          2 | HOWELL      | ETHEL             | MINOR    |
+-----------+------------+-------------+-------------------+----------+

This type of data extraction is something I am not all too familiar with. I am thinking I need to use a complex combination of SUBSTRING and CHARINDEX, but given that the number of entries that the source column can contain varies, I am not sure how best to approach this. Any guidance on where I should begin would be incredibly helpful.

Google "SQL Split Function" there are literally thousands of examples. — RBarryYoung, Nov 24 '15 at 00:19
ideally, the purpose of this code is to fix that horribly broken schema. You never want to find yourself storing delimited data in columns. — Joel Coehoorn, Nov 24 '15 at 00:21
Parsing strings is not for what SQL was designed. Actually it's the worst idea. Have you consider [SQL CLR Functions](https://msdn.microsoft.com/en-us/library/ms189876.aspx)? — thepirat000, Nov 24 '15 at 00:55
@JoelCoehoorn That is correct. I am attempting to transform source data provided to us to a format that is applicable to our new schema. — PicoDeGallo, Nov 24 '15 at 02:03
@RBarryYoung and PicoDeGallo: this question should probably be closed in favor of the same question posted on [DBA.StackExchange](http://dba.stackexchange.com/questions/121915/data-extraction-with-multiple-delimiters/121916#121916). While I might be a little biased since that is my answer, I can say that the 2 answers provided here are not as durable since XML-based splitting will break if you ever have any of the following characters in the source data: `<`, `"`, and `&`. Also, using `PARSENAME` is convenient but will break if you ever have a period in the source data. — Solomon Rutzky, Nov 24 '15 at 14:41
@thepirat000 Actually this isn't so bad. If you look at my answer [here](http://dba.stackexchange.com/questions/121915/data-extraction-with-multiple-delimiters/121916#121916) on the cross-posted question, you can see that is mainly just a `CROSS APPLY` and a splitter. And while I did use a SQLCLR splitter, it didn't need to be SQLCLR. — Solomon Rutzky, Nov 24 '15 at 14:44
@srutzky there's dozens of good answers here and there already. I used to post these frequently 5-7 years ago, but tired of the repitition. We can close as a duplicate or just remove. Google answers this question in about five secnds. Heres one i posted last year: http://stackoverflow.com/a/21710424/109122 — RBarryYoung, Nov 24 '15 at 22:55
@srutzky and i have SQLCLR based splitters too, but i don't usually post them as answers because of the dependency they introduce and the barriers that crop up in most corporations to them. For a DBA that's comfortable with it and has already dealt with the auditors, etc., it's not a problem, but for client-side or even most SQL developers it can be. — RBarryYoung, Nov 24 '15 at 23:10

score 3 · Answer 1 · answered Nov 24 '15 at 05:30

use split string concept and parsename to do this

SELECT id,
       Row_number()
         OVER (
           partition BY id
           ORDER BY (SELECT NULL ))AS sequence,
       Parsename(Replace(col3, ';', '.'), 3) as LAST,
       Parsename(Replace(col3, ';', '.'), 2) as FIRSTMIDDLE,
       Parsename(Replace(col3, ';', '.'), 1) as TYPE
FROM   (SELECT id,
               Split.a.value('.', 'VARCHAR(100)') col3
        FROM   (SELECT id,
                       Cast ('<M>' + Replace(item_id, ',', '</M><M>')
                             + '</M>' AS XML) AS Data
                FROM   #yourtable) AS A
               CROSS APPLY Data.nodes ('/M') AS Split(a))a

Just FYI, using `PARSENAME` is quick and easy, but it will break if there is ever a period (`.`) in the source data. You would probably need to first convert those to an unused character, then convert them back to a period after the `PARSENAME` function returns the value. — Solomon Rutzky, Nov 24 '15 at 14:42

score 1 · Answer 2 · answered Nov 24 '15 at 06:27

create table #temp (id int, [source] nvarchar(4000))

insert #temp (id, [source])
      select 62963, 'RENZ;MICHAEL;DECEASED,WANDER;MARIA;MINOR,WANDER;HENRY RUDOLPH;MINOR,WANDER;ROSA;MINOR,WANDER;PAUL EMIL;MINOR'
union select 62964, 'HERNDON;A C;ESTATE,BERRING;A F;DECEASED,BEIRING;A F;DECEASED,BEIRING;ANDREAS FREDERICK;DECEASED'
union select 62965, 'ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED'
union select 62965, 'ZINCH;;ESTATE,ZINTZ;;ESTATE,HAYNES;HENRY;DECEASED'
union select 62966, 'KRAUS;JOSEPHINE;MINOR,KENNEDY;GEORGE;DECEASED'
union select 62967, 'CAREY;JAMES;ESTATE,DE LA GARZA;REFUGIO;DECEASED'
union select 62968, 'LEWIS;FLORENCE;ESTATE,LOCKWOOD;ALBERT A;DECEASED'
union select 62969, 'GLAESER;EMMA;MINOR,GLAESER;HERMAN JR;MINOR,GLAESER;HERMAN;MINOR,RODRIGUEZ;HILARIO;DECEASED,RODRIGUEZ;MARIE;DECEASED'
union select 62970, 'STORY;BETTIE;ESTATE,EIGENDORFF;FRANZ;DECEASED'
union select 62971, 'HOWELL;MAMIE;MINOR,HOWELL;ETHEL;MINOR'

select id, 
    row_number() over(partition by id order by id) as [sequence],
    [1] as [last], 
    [2] as [firstmiddle], 
    [3] as [type]
from (
    select id, attributeid, attribute, 
        row_number() over(partition by attributeid order by personid) x
    from (
        select id, 
            personid,
            row_number() over(partition by personid order by personid) attributeid,
            attribute
        from (
            select id, 
                personid, 
                attribute = y.i.value('(./text())[1]', 'nvarchar(4000)')
            from 
            ( 
                select id, personid, x = convert(xml, '<i>' 
                    + replace(person, ';', '</i><i>') 
                    + '</i>').query('.')
                from (
                    select id, 
                        row_number() over (order by id) as personid, 
                        person = y.i.value('(./text())[1]', 'nvarchar(4000)')
                    from ( 
                        select id, x = convert(xml, '<i>' 
                            + replace([source], ',', '</i><i>') 
                            + '</i>').query('.')
                        from #temp
                    ) personxml 
                    cross apply x.nodes('i') AS y(i)
                ) personsplit
            ) attributexml
            cross apply x.nodes('i') AS y(i)
        ) attributesplit
    ) attributes
) as sourcetable
pivot (
    min(attribute)
    for attributeid in ([1],[2],[3])
) as pivottable

Data extraction with multiple delimiters?

2 Answers2