24

Can someone provide a regular expression to find and replace illegal characters?

Example, removing �

I am not sure how many types of 'illegal' characters exist but I think this will be a good start.

Many thanks

Edit: I have no control over the data; we're trying to create a catch for the potentially bad data we're receiving.

4
  • 1
    I think first you should see why they're getting there. What's the encoding? Commented Oct 5, 2012 at 21:29
  • I think it may be better to include only those characters which are legal, which is probably really easy. Then again I don't know how many characters are legal to you. Commented Oct 5, 2012 at 21:30
  • We're receiving bad data and pushing the vendor to make sure the strings are encoded correctly, but in the meantime we're trying to set up a catch for it. Commented Oct 5, 2012 at 21:30
  • I'd recommend only removing the characters that the string decoder throws up, which are replaced with 0xFFFD as I suggested below. Commented Oct 5, 2012 at 21:55

3 Answers

35

Invalid characters get converted to U+FFFD (the replacement character, 0xFFFD) when the string is decoded, so you can strip anything the decoder could not read with:

myString = myString.replace(/\uFFFD/g, '')

You can find all sorts of invalid characters here
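To see where the U+FFFD comes from, here is a minimal sketch, assuming the data arrives as UTF-8 bytes and is decoded with the standard TextDecoder (available in browsers and recent Node.js). The byte values and the helper name stripReplacementChars are made up for illustration.

```javascript
// Bytes for "a", a lone 0xFF (never valid in UTF-8), then "b".
const bytes = new Uint8Array([0x61, 0xff, 0x62]);

// A non-fatal decoder substitutes U+FFFD for whatever it cannot decode.
const decoded = new TextDecoder('utf-8').decode(bytes);

// Hypothetical helper: drop every replacement character.
function stripReplacementChars(input) {
  // replace() returns a new string; the original is left untouched.
  return input.replace(/\uFFFD/g, '');
}

const cleaned = stripReplacementChars(decoded);
console.log(JSON.stringify(decoded), JSON.stringify(cleaned));
```

Note that the original bytes are already lost at this point; stripping U+FFFD only hides the damage, it cannot recover the intended character.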


4 Comments

Thank you for the info, I won't be able to check again for a few days but this didn't work on the first attempt. I think this is the way forward though so I'll just check to see whether we implemented it correctly, it's hardly a lot of code :-)
Did you reassign the string? replace isn't destructive, so you need to assign the result back to the variable.
Yes, it was a return myString.replace(/\uFFFD/g, ''). We'll re-review it during the working week; I wouldn't be surprised if something was overlooked.
In my case, I had to remove the literal text \uFFFD from my string, not the Unicode symbol itself. So I solved the problem by using myString.replace(/\\uFFFD/g, '')
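The two cases in the comments above are easy to confuse, so here is a small sketch that handles both: the real U+FFFD character and the six-character escape text that some upstream systems emit instead. The function name stripBoth and the sample strings are assumptions for illustration only.

```javascript
// One real U+FFFD character between "ab" and "c".
const withChar = 'ab\uFFFDc';

// The literal text backslash, u, F, F, F, D (six characters), not the symbol.
const withEscapeText = 'ab\\uFFFDc';

// Hypothetical helper: remove the character and the escaped text alike.
function stripBoth(input) {
  // First alternative matches the symbol; second matches the literal
  // backslash sequence. The i flag also catches lowercase "\ufffd" text.
  return input.replace(/\uFFFD|\\uFFFD/gi, '');
}

console.log(stripBoth(withChar), stripBoth(withEscapeText));
```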
21

Instead of having a blacklist, you could use a whitelist. For example, if you want to accept only letters, numbers, spaces, and a few punctuation characters, you could do:

myString = myString.replace(/[^a-z0-9 ,.?!]/ig, '')
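If the whitelist should also accept letters outside ASCII, one possible variation uses Unicode property escapes (ES2018, requires the u flag). This is only a sketch: the function names and the exact set of allowed punctuation are assumptions, not something the question specifies.

```javascript
// ASCII-only whitelist, as in the answer above.
function keepAsciiOnly(input) {
  return input.replace(/[^a-z0-9 ,.?!]/gi, '');
}

// Unicode-aware whitelist: any letter (\p{L}), combining mark (\p{M})
// or number (\p{N}), plus the same space and punctuation characters.
function keepUnicodeLetters(input) {
  return input.replace(/[^\p{L}\p{M}\p{N} ,.?!]/gu, '');
}

// "Café", a musical note, digits, and a trailing replacement character.
const sample = 'Caf\u00E9 \u266B 123!\uFFFD';

console.log(JSON.stringify(keepAsciiOnly(sample)));
console.log(JSON.stringify(keepUnicodeLetters(sample)));
```

The ASCII version drops the é along with the junk, while the Unicode-aware version keeps it and still removes ♫ and U+FFFD.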

3 Comments

An invalid character in this context is clearly malformed UTF-8, not non-ASCII.
You're reading more into the question than what's actually stated. The OP may be having a problem with encodings, but that's not what his question actually says.
I don't think I am: "Example, removing �". I think he makes it pretty clear that the invalid characters he wants to remove are invalid in the sense that a string decoder can't decode them, not in the sense that he doesn't like them. As I previously stated, limiting valid UTF-8 to ASCII is appalling advice.
4

Try this; it removes every character from U+0080 to U+FFFF, which covers unexpected characters like ♫ and ◘:

dataStr = dataStr.replace(/[\u{0080}-\u{FFFF}]/gu, "");

1 Comment

That also removes accented characters.
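A quick sketch of what that range actually strips, using a made-up sample string and a hypothetical helper name. With the u flag the regex works on whole code points, so characters above U+FFFF (most emoji, for instance) are outside the range and survive.

```javascript
// Hypothetical helper wrapping the regex from the answer above.
function stripNonAsciiBmp(input) {
  return input.replace(/[\u{0080}-\u{FFFF}]/gu, '');
}

// "naïve", a musical note (U+266B), and an emoji above U+FFFF (U+1F600).
const sample = 'na\u00EFve \u266B \u{1F600}';
const result = stripNonAsciiBmp(sample);

// The ï and the note are removed; the emoji is kept.
console.log(JSON.stringify(result));
```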
