I need to parse XML that is embedded in an email body, with extra text before and after it.

I've tried the HTML Agility Pack, but it does not remove the non-XML text.

So how do I clean a string which contains a complete XML document mixed with other text around it?

var bodyXmlPart= @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
"<ac_application>" +
"    <primary_applicant_data>" +
"       <first_name>Ross</first_name>" +
"       <middle_name></middle_name>" +
"       <last_name>Geller</last_name>" +
"       <ssn>123456789</ssn>" +
"    </primary_applicant_data>" +
"</ac_application> thank you, \n john ";

//How do I clean up the body xml part before loading into xml
//This will fail:
var xDoc = XDocument.Parse(bodyXmlPart);
  • Are you certain that any instances of < and > that aren't part of the XML will be escaped? If parts of the text that aren't XML contain those characters then it will be very difficult. Otherwise, just trim everything before the first < and everything after the last >. Commented Dec 21, 2017 at 1:24
  • @ScottHannen it's an email body coming from an unknown source, so yes, it's very likely that ">" will occur at some point. Why should it be difficult can't we use regex? Commented Dec 21, 2017 at 1:31
  • Thinking really hard for a good answer to that... I've got nothing. I rarely use regex and it didn't even occur to me. Commented Dec 21, 2017 at 1:42
  • @james: Why should it be difficult can't we use regex? Because (1) regex is fundamentally the wrong way to parse XML and (2) you can't extract a variable structure from within an undefined context -- since you have no ability to restrict or even define what could be before or after your "XML", and you have no way of specifying how your "XML" differs from its context, you fundamentally cannot write a parser to extract the "XML". Commented Dec 21, 2017 at 2:24
  • The best you can do here is to treat your string as containing bad / not well-formed "XML", which is a tough problem to solve. See duplicate link for an explanation and multiple options, including several for .NET such as XmlReader.ReadToFollowing() or Microsoft.Language.Xml.XMLParser. Commented Dec 21, 2017 at 4:01

2 Answers

If you mean that the body can contain any XML, and not just ac_application, you can use the following code:

var bodyXmlPart = @"Hi please see below client " +
                  "<ac_application>" +
                  "    <primary_applicant_data>" +
                  "       <first_name>Ross</first_name>" +
                  "       <middle_name></middle_name>" +
                  "       <last_name>Geller</last_name>" +
                  "       <ssn>123456789</ssn>" +
                  "    </primary_applicant_data>" +
                  "</ac_application> thank you, \n john ";

 // Requires: using System.Text; using System.Text.RegularExpressions;
 StringBuilder pattern = new StringBuilder();
 Regex regex = new Regex(@"<\?xml.*\?>", RegexOptions.Singleline);
 var match = regex.Match(bodyXmlPart);
 if (match.Success) // There is an XML declaration
 {
     pattern.Append(@"<\?xml.*");
 }
 Regex regexFirstTag = new Regex(@"\s*<(\w+:)?(\w+)>", RegexOptions.Singleline);
 var match1 = regexFirstTag.Match(bodyXmlPart);
 if (match1.Success) // The XML has a body and we found its first tag
 {
     string firstTag = match1.Value.Trim();
     pattern.Append(firstTag.Replace(">", @"\>" + ".*"));
     // Singleline lets "." match newlines, so XML spanning several lines is still matched
     Regex regexFullXmlBody = new Regex(pattern.ToString() + @"<\/" + firstTag.Trim('<','>') + @"\>", RegexOptions.Singleline);
     var matchBody = regexFullXmlBody.Match(bodyXmlPart);
     if (matchBody.Success)
     {
        string xml = matchBody.Value; // the extracted XML, ready for XDocument.Parse
     }
 }

This code can extract any XML and not just ac_application.

The code looks for an XML declaration; if one is present, the match starts there, otherwise it starts at the first tag. The first plain tag found (one without attributes) is treated as the root tag, and everything up to its closing tag is extracted as the XML.
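
A regex on its own cannot guarantee that what it extracts is well-formed, so one option is to let XDocument.Parse confirm the candidate. The following is only a sketch under assumptions: the XmlExtractor class and the TryExtract method are hypothetical names, the pattern is a variation on the idea above (a backreference to the root name, attributes allowed on the root), and it will still be fooled by stray tag-like text in the surrounding email.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Linq;

static class XmlExtractor
{
    // Hypothetical helper, not part of the answer above.
    // Looks for "<root ...> ... </root>" with an optional leading XML declaration,
    // then lets XDocument decide whether the candidate is well-formed.
    // Returns null when nothing parseable is found.
    public static XDocument TryExtract(string body)
    {
        var pattern = new Regex(
            @"(<\?xml.*?\?>\s*)?<(?<root>[\w:.-]+)(\s[^>]*)?>.*</\k<root>\s*>",
            RegexOptions.Singleline);

        Match m = pattern.Match(body);
        if (!m.Success)
        {
            return null;
        }

        try
        {
            return XDocument.Parse(m.Value);
        }
        catch (XmlException)
        {
            // The regex found something tag-shaped, but it is not valid XML.
            return null;
        }
    }
}
```

Calling XmlExtractor.TryExtract(bodyXmlPart) would then give either a parsed document or null, so the caller never has to handle a half-extracted string.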

2 Comments

Problem: XML declaration is optional in XML.
Ok. Updated the code to make XML declaration optional.

I'd probably do something like this...

using System.Diagnostics;
using System.Text.RegularExpressions;

namespace Test {

    class Program {
        static void Main(string[] args) {
            var bodyXmlPart = @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
            "<ac_application>" +
            "    <primary_applicant_data>" +
            "       <first_name>Ross</first_name>" +
            "       <middle_name></middle_name>" +
            "       <last_name>Geller</last_name>" +
            "       <ssn>123456789</ssn>" +
            "    </primary_applicant_data>" +
            "</ac_application> thank you, \n john ";

            Regex regex = new Regex(@"(?<pre>.*)(?<xml>\<\?xml.*</ac_application\>)(?<post>.*)", RegexOptions.Singleline);
            var match = regex.Match(bodyXmlPart);
            if (match.Success) {
                Debug.WriteLine($"pre={match.Groups["pre"].Value}");
                Debug.WriteLine($"xml={match.Groups["xml"].Value}");
                Debug.WriteLine($"post={match.Groups["post"].Value}");
            }
        }
    }
}

This outputs...

pre=Hi please see below client 
xml=<?xml version="1.0" encoding="UTF-8"?><ac_application>    <primary_applicant_data>       <first_name>Ross</first_name>       <middle_name></middle_name>       <last_name>Geller</last_name>       <ssn>123456789</ssn>    </primary_applicant_data></ac_application>
post= thank you, 
 john 
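
Once the xml group has been isolated, it can be passed to XDocument.Parse, which is what the question was originally trying to do. A minimal sketch, assuming the root element is always ac_application and an XML declaration is always present (the Demo class and FirstName method are hypothetical names used only for illustration):

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class Demo
{
    // Hypothetical helper: extracts the XML part of the email body
    // and returns the applicant's first name, or null if nothing matches.
    public static string FirstName(string body)
    {
        // Only the "xml" group is needed here; the pre/post groups are dropped.
        var regex = new Regex(@"(?<xml><\?xml.*</ac_application>)", RegexOptions.Singleline);
        Match match = regex.Match(body);
        if (!match.Success)
        {
            return null;
        }

        XDocument doc = XDocument.Parse(match.Groups["xml"].Value);
        return doc.Root
                  .Element("primary_applicant_data")?
                  .Element("first_name")?
                  .Value;
    }
}
```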
