4

Simple question from a Java novice. I want to encode a URL so that nonstandard characters are transformed to their hex value (that is, %XX), while characters one expects to see in a URL (letters, digits, forward slashes, question marks, and so on) are left intact.

For example, encoding

"hi/hello?who=moris\\boris"

should result in

"hi/hello?who=moris%5cboris"

Any ideas?

6 Answers

1

This is actually a rather tricky problem, because the different parts of a URL need to be handled (encoded) differently.

In my experience, the best way to do this is to assemble the URL from its components using the URL or URI class, letting them take care of encoding each component correctly.


In fact, now that I think about it, you have to encode the components before they get assembled. Once the parts are assembled, it is impossible to tell whether (for example) a "?" is intended to be the query separator (don't escape it) or a character in a pathname component (escape it).
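As a rough illustration of assembling from components, here is a minimal sketch using the five-argument java.net.URI constructor (scheme, authority, path, query, fragment). The class name UriAssemblySketch, the helper assemble, and the host example.com are made up for this example and are not from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriAssemblySketch {

    // Builds a URI string from already-separated components. The multi-argument
    // URI constructor quotes characters that are illegal in each component.
    static String assemble(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI", e);
        }
    }

    public static void main(String[] args) {
        // The backslash is not legal in a query, so it is quoted.
        System.out.println(assemble("http", "example.com", "/hi/hello", "who=moris\\boris"));
        // prints http://example.com/hi/hello?who=moris%5Cboris

        // Inside the path component, a space and a '?' are both quoted.
        System.out.println(assemble("http", "example.com", "/a b?c", null));
        // prints http://example.com/a%20b%3Fc
    }
}
```

Note that this constructor leaves "&" and "=" in the query alone because they are legal there, and it always quotes "%", so a parameter value that itself contains "&" probably still needs separate handling.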


4 Comments

I'm not sure it is impossible. Maybe, not well defined. After all, web browsers do just that, if they need to download a resource given in an "unencoded" absolute form.
@rOu1i - when I say impossible, I really mean "impossible with 100% guaranteed correctness". What web browsers do is apply a bunch of heuristics to the (ahem) "web address" that the user has typed, and encode the components that are obviously syntactically incorrect. That won't (and can't) work in all cases.
java.net.URL and java.net.URI do not encode query parameters. e.g. new java.net.URI("http", "www.foo.com", "/bar", "param1=value1&param2=google.com?q=foo").toString results in "foo.com/bar#param1=value1&param2=http://…"
@waterlooalex - Yup. You are assembling the URL from its components, and the URL and URI classes know which encodings to apply to the respective URL components. (But I think you would find that some characters in a query parameter would be encoded. So "do not encode" is probably incorrect.)
1

The OWASP Enterprise Security API provides a solution for this.

Please visit the following links for more details: http://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet#RULE_.235_-_URL_Escape_Before_Inserting_Untrusted_Data_into_HTML_URL_Parameter_Values

http://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/codecs/PercentCodec.java

Comments

1

You can use the code below to escape special characters in URLs. However, you need to pass only the value, not the whole URL.

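import java.nio.charset.StandardCharsets;

public static String escapeSpecialCharacters(String input) {
    StringBuilder resultStr = new StringBuilder();
    // Work on UTF-8 bytes so that a non-ASCII character becomes one %XX per byte.
    for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
        int value = b & 0xFF;
        if (isSafe((char) value)) {
            resultStr.append((char) value);
        } else {
            resultStr.append('%');
            resultStr.append(toHex(value / 16));
            resultStr.append(toHex(value % 16));
        }
    }
    return resultStr.toString();
}

// Converts a value in the range 0..15 to an uppercase hex digit.
private static char toHex(int digit) {
    return (char) (digit < 10 ? '0' + digit : 'A' + digit - 10);
}

// Letters, digits and "-_.~" are left as-is; everything else is escaped.
private static boolean isSafe(char ch) {
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
            || (ch >= '0' && ch <= '9') || "-_.~".indexOf(ch) >= 0;
}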

4 Comments

you are missing a lot of unsafe chars. it's easier to enumerate safe chars: a-z A-Z 0-9 - _ . ~
any idea how to parse only the values, given the whole url, which might contain illegal characters that need encoding?
I don't think there is, as it is impossible to decide whether an illegal character is part of the URL itself or part of a parameter value that needs to be escaped.
@irreputable perishablepress.com/stop-using-unsafe-characters-in-urls says ~ is unsafe, but instead lists other safe characters you haven't listed: «Safe characters Alphanumerics [0-9a-zA-Z], special characters $-_.+!*'(),»
0

org.apache.commons.codec.net.URLCodec will encode special characters (e.g. the \ as you indicated). However, you will likely need to break up the url as you don't want characters in the path encoded. Additionally, you will need to split up the parameter names and values since ? & and = need to remain intact to pass the parameters individually and not as one huge parameter name.
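The splitting described above could look roughly like the sketch below. It deliberately swaps in the JDK's java.net.URLEncoder for Commons Codec's URLCodec so that it runs without extra dependencies (the Charset overload needs Java 10 or later). The class QueryBuilderSketch and the methods encodeComponent and buildUrl are hypothetical names, and the path is assumed to be valid already and is left untouched.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryBuilderSketch {

    // Percent-encodes one parameter name or value. URLEncoder produces form
    // encoding, so the '+' it emits for a space is rewritten to %20.
    static String encodeComponent(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Appends encoded name=value pairs to an unencoded path. The separators
    // '?', '&' and '=' are added here, so they are never encoded themselves.
    static String buildUrl(String path, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(path);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encodeComponent(e.getKey()))
              .append('=')
              .append(encodeComponent(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("who", "moris\\boris");
        params.put("note", "a b&c=d");
        System.out.println(buildUrl("hi/hello", params));
        // prints hi/hello?who=moris%5Cboris&note=a%20b%26c%3Dd
    }
}
```

Encoding each name and value on its own is what keeps a literal "&" or "=" inside a value from being mistaken for a separator.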

Comments

0

You can try Spring's UriUtils. It seems to handle URL encoding/decoding correctly for the appropriate parts of the URL.

http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/web/util/UriUtils.html

Comments

-2

Use URLEncoder.encode(url, "UTF-8"), see the Javadoc.

5 Comments

Doesn't work; it also converts the "special characters" allowed in a URL, such as forward slashes and question marks.
Oh I see. What you need to do is only URLEncode the portion of the URL you want to be encoded. Can you encode only the arguments?
URLEncoder is for HTML form encoding, not for URLs. However, I agree that the naming is very bad (see the Javadoc). E.g. a space is converted to + whereas it should be %20 for URLs.
Like other posters have suggested, you need to break up the URL into its components and encode only those components, so that the encoding does not alter the structure of the containing URL.
URLEncoder is not going to solve this problem. The answer below provides a much more robust solution.
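To make the objections in these comments concrete, here is a small sketch of what URLEncoder does when handed the whole URL from the question. The class name UrlEncoderDemo and the helper formEncode are invented for the example, and the Charset overload used here assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class UrlEncoderDemo {

    // Thin wrapper around the JDK's HTML form encoder.
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The structural characters '/', '?' and '=' are encoded too.
        System.out.println(formEncode("hi/hello?who=moris\\boris"));
        // prints hi%2Fhello%3Fwho%3Dmoris%5Cboris

        // A space becomes '+', which is form encoding rather than %20.
        System.out.println(formEncode("two words"));
        // prints two+words
    }
}
```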
