Scrubbing ASCII Control Characters From Web Service Output


I recently ran into trouble with text returned by a web service I consume. The service sends back XML, which is fine, but some of the XML has ASCII control characters in the middle of it. I wanted to paste an example into this post, but because the characters are invalid I can't even paste them into this textarea.

I spent some time researching what to do in these cases and found an informative article. Here is a relevant quote from it:

These aren’t characters that have any business being in XML data; they’re illegal characters that should be

So, following the article's advice, I've written some code that takes the raw output from this service and strips out every control character, leaving spaces, tabs, CRs and LFs alone.

Here is that code:

// Needs "using System.Linq;" at the top of the file; url and result are declared elsewhere.
using (var client = new System.Net.WebClient())
{
    // Every control byte below 0x20 except tab (0x9), LF (0xA) and CR (0xD), plus DEL (0x7F).
    byte[] invalidCharacters = { 0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7, 0x8, 0xB,
                                 0xC, 0xE, 0xF, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15,
                                 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D,
                                 0x1E, 0x1F, 0x7F };

    byte[] sanitizedResponse = (from a in client.DownloadData(url)
                                where !invalidCharacters.Contains(a)
                                select a).ToArray();

    result = System.Text.Encoding.UTF8.GetString(sanitizedResponse);
}
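
As a cross-check on the hand-typed list, I could also express the filter as a range test, which makes it harder to skip a value by accident. This is only a sketch: the ControlByteFilter class and its method names are made up for illustration, and it assumes the same rule as above (drop every byte below 0x20 except tab, LF and CR, plus 0x7F).

```csharp
using System;
using System.Linq;

public static class ControlByteFilter
{
    // Hypothetical helper: true for bytes the filter should drop.
    public static bool IsUnwanted(byte b)
    {
        if (b == 0x09 || b == 0x0A || b == 0x0D)
            return false;              // tab, LF and CR are kept
        return b < 0x20 || b == 0x7F;  // remaining C0 controls and DEL
    }

    // Returns a copy of raw with the unwanted bytes removed.
    public static byte[] Strip(byte[] raw)
    {
        return raw.Where(b => !IsUnwanted(b)).ToArray();
    }
}
```

With this in place, the download line would become something like ControlByteFilter.Strip(client.DownloadData(url)).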

This got me thinking, though. If I receive double-byte characters, will I corrupt any of the data I'm getting back? Is it possible in some code pages for a double-byte character to be made up of one or two bytes that are themselves ASCII control characters? The article's claim that these characters have "no business" being in XML data sounds final, but I'd like a second opinion.
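
To make the concern concrete, here is a sketch of the alternative I'm weighing: decode the bytes first and filter characters afterwards, so the filter never sees part of a multi-byte sequence. It assumes the response really is UTF-8, and CharScrubber and Scrub are names I made up for illustration. As far as I can tell, every byte of a multi-byte UTF-8 sequence is 0x80 or higher, so the byte-level filter should not be able to hit one, but I'm not assuming the same for other encodings (in UTF-16, for instance, U+0100 is encoded with the bytes 0x01 and 0x00).

```csharp
using System;
using System.Linq;
using System.Text;

public static class CharScrubber
{
    // Hypothetical helper: true for characters the filter should drop.
    private static bool IsUnwanted(char c)
    {
        if (c == '\t' || c == '\n' || c == '\r')
            return false;                      // tab, LF and CR are kept
        return c < '\u0020' || c == '\u007F';  // remaining C0 controls and DEL
    }

    // Decodes raw as UTF-8 (an assumption about the service), then
    // removes the unwanted characters from the decoded string.
    public static string Scrub(byte[] raw)
    {
        string decoded = Encoding.UTF8.GetString(raw);
        return new string(decoded.Where(c => !IsUnwanted(c)).ToArray());
    }
}
```

For example, the bytes C3 A9 01 61 (an "é", then U+0001, then "a") come back from Scrub as "éa".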

Appreciate any feedback