Parse XML With Additional String

Question

I need to parse XML that sits inside an email body, with extra text before and after it.

I've tried the HTML Agility Pack, but it does not remove the non-XML text.

How do I clean a string that contains a complete XML document mixed with other text around it?

var bodyXmlPart= @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
"<ac_application>" +
"    <primary_applicant_data>" +
"       <first_name>Ross</first_name>" +
"       <middle_name></middle_name>" +
"       <last_name>Geller</last_name>" +
"       <ssn>123456789</ssn>" +
"    </primary_applicant_data>" +
"</ac_application> thank you, \n john ";

// How do I clean up the XML part of the body before loading it?
// This will fail:
var xDoc = XDocument.Parse(bodyXmlPart);

Are you certain that any instances of < and > that aren't part of the XML will be escaped? If parts of the text that aren't XML contain those characters then it will be very difficult. Otherwise, just trim everything before the first < and everything after the last >.
– Scott Hannen, Commented Dec 21, 2017 at 1:24
@ScottHannen it's an email body coming from an unknown source, so yes, it's very likely that ">" will occur at some point. Why should it be difficult? Can't we use regex?
– james, Commented Dec 21, 2017 at 1:31
Thinking really hard for a good answer to that... I've got nothing. I rarely use regex and it didn't even occur to me.
– Scott Hannen, Commented Dec 21, 2017 at 1:42
@james: "Why should it be difficult can't we use regex?" Because (1) regex is fundamentally the wrong way to parse XML and (2) you can't extract a variable structure from within an undefined context -- since you have no ability to restrict or even define what could be before or after your "XML", and you have no way of specifying how your "XML" differs from its context, you fundamentally cannot write a parser to extract the "XML".
– kjhughes, Commented Dec 21, 2017 at 2:24
The best you can do here is to treat your string as containing bad / not well-formed "XML", which is a tough problem to solve. See duplicate link for an explanation and multiple options, including several for .NET such as XmlReader.ReadToFollowing() or Microsoft.Language.Xml.XMLParser.
– kjhughes, Commented Dec 21, 2017 at 4:01
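
The trimming idea from the first comment can be sketched as follows. This is only an illustration under a strong assumption: the surrounding text contains no < before the XML and no > after it. XmlTrimmer and TryParseTrimmed are hypothetical names; the parser itself is used to reject a span that turns out not to be well-formed.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

static class XmlTrimmer
{
    // Hypothetical helper: cuts the body down to the span between the first '<'
    // and the last '>', then lets XDocument.Parse decide whether it is XML.
    // Returns null when no such span exists or the span is not well-formed.
    public static XDocument TryParseTrimmed(string body)
    {
        int start = body.IndexOf('<');
        int end = body.LastIndexOf('>');
        if (start < 0 || end <= start)
        {
            return null;
        }

        string candidate = body.Substring(start, end - start + 1);
        try
        {
            return XDocument.Parse(candidate);
        }
        catch (XmlException)
        {
            // Stray '<' or '>' in the surrounding text ends up here.
            return null;
        }
    }
}

static class TrimDemo
{
    static void Main()
    {
        string body = "Hi please see below client <?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
                      "<ac_application><first_name>Ross</first_name></ac_application> thank you, \n john ";

        XDocument doc = XmlTrimmer.TryParseTrimmed(body);
        Console.WriteLine(doc == null ? "not well-formed" : doc.Root.Name.LocalName);
    }
}
```

If the email text itself contains a stray > after the XML, the helper returns null rather than a wrong document, which is the failure mode the comments above warn about.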

Sunil · Accepted Answer · 2017-12-21 03:00:04Z

1

If you mean that the body can contain any XML and not just ac_application, you can use the following code:

using System.Text;
using System.Text.RegularExpressions;

var bodyXmlPart = @"Hi please see below client " +
                  "<ac_application>" +
                  "    <primary_applicant_data>" +
                  "       <first_name>Ross</first_name>" +
                  "       <middle_name></middle_name>" +
                  "       <last_name>Geller</last_name>" +
                  "       <ssn>123456789</ssn>" +
                  "    </primary_applicant_data>" +
                  "</ac_application> thank you, \n john ";

 StringBuilder pattern = new StringBuilder();
 // The XML declaration is optional; add it to the pattern only when present.
 Regex regex = new Regex(@"<\?xml.*\?>", RegexOptions.Singleline);
 var match = regex.Match(bodyXmlPart);
 if (match.Success) // There is an xml declaration
 {
     pattern.Append(@"<\?xml.*");
 }
 // First opening tag without attributes, with an optional namespace prefix.
 Regex regexFirstTag = new Regex(@"\s*<(\w+:)?(\w+)>", RegexOptions.Singleline);
 var match1 = regexFirstTag.Match(bodyXmlPart);
 if (match1.Success) // xml has body and we got the first tag
 {
     pattern.Append(match1.Value.Trim().Replace(">",@"\>" + ".*"));
     string firstTag = match1.Value.Trim();
     // Singleline lets '.' match newlines, so XML spanning several lines is still found.
     Regex regexFullXmlBody = new Regex(pattern.ToString() + @"<\/" + firstTag.Trim('<','>') + @"\>", RegexOptions.Singleline);
     var matchBody = regexFullXmlBody.Match(bodyXmlPart);
     if (matchBody.Success)
     {
        // The extracted XML, from the declaration (or root tag) through the closing root tag.
        string xml = matchBody.Value;
     }
 }

This code can extract any XML and not just ac_application.

The XML declaration is optional. If one is present, the match starts at the declaration; otherwise it starts at the first opening tag. That first opening tag is treated as the root tag, and everything up to its closing tag is extracted as the XML.

edited Dec 21, 2017 at 3:00

answered Dec 21, 2017 at 2:00

Sunil


2 Comments

kjhughes Over a year ago

Problem: XML declaration is optional in XML.

Sunil Over a year ago

Ok. Updated the code to make XML declaration optional.

K Johnson · Accepted Answer · 2017-12-21 01:37:50Z

I'd probably do something like this...

using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Test {

    class Program {
        static void Main(string[] args) {
            var bodyXmlPart = @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
            "<ac_application>" +
            "    <primary_applicant_data>" +
            "       <first_name>Ross</first_name>" +
            "       <middle_name></middle_name>" +
            "       <last_name>Geller</last_name>" +
            "       <ssn>123456789</ssn>" +
            "    </primary_applicant_data>" +
            "</ac_application> thank you, \n john ";

            Regex regex = new Regex(@"(?<pre>.*)(?<xml>\<\?xml.*</ac_application\>)(?<post>.*)", RegexOptions.Singleline);
            var match = regex.Match(bodyXmlPart);
            if (match.Success) {
                Debug.WriteLine($"pre={match.Groups["pre"].Value}");
                Debug.WriteLine($"xml={match.Groups["xml"].Value}");
                Debug.WriteLine($"post={match.Groups["post"].Value}");
            }
        }
    }
}

This outputs...

pre=Hi please see below client 
xml=<?xml version="1.0" encoding="UTF-8"?><ac_application>    <primary_applicant_data>       <first_name>Ross</first_name>       <middle_name></middle_name>       <last_name>Geller</last_name>       <ssn>123456789</ssn>    </primary_applicant_data></ac_application>
post= thank you, 
 john
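
Once the xml group has been captured, it can be handed to XDocument.Parse, which is the step the question was stuck on. A minimal sketch, assuming the root element is known to be ac_application as in the regex above; XmlBodyLoader and Load are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class XmlBodyLoader
{
    // Same idea as the regex above, reduced to the "xml" group.
    // Assumes an XML declaration is present and the root is ac_application.
    private static readonly Regex XmlPart =
        new Regex(@"(?<xml><\?xml.*</ac_application>)", RegexOptions.Singleline);

    // Hypothetical helper: returns null when no XML span is found.
    // XDocument.Parse throws XmlException if the span is not well-formed.
    public static XDocument Load(string body)
    {
        Match match = XmlPart.Match(body);
        return match.Success ? XDocument.Parse(match.Groups["xml"].Value) : null;
    }
}

static class LoadDemo
{
    static void Main()
    {
        string body = "Hi please see below client <?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
                      "<ac_application>" +
                      "    <primary_applicant_data>" +
                      "       <first_name>Ross</first_name>" +
                      "       <last_name>Geller</last_name>" +
                      "    </primary_applicant_data>" +
                      "</ac_application> thank you, \n john ";

        XDocument doc = XmlBodyLoader.Load(body);
        string firstName = doc?.Root
            .Element("primary_applicant_data")?
            .Element("first_name")?.Value;
        Console.WriteLine(firstName); // prints "Ross"
    }
}
```

Because the pattern is greedy, the match runs to the last </ac_application> in the body, so trailing email text that happens to repeat the closing tag would be pulled into the span and make the parse fail.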
