If a character such as '這' is non-English, it takes up 2 spaces, while an English character takes up 1 space. The maximum length is 10 spaces. How do I get the first 10 spaces' worth of the string?
For the example below, how do I get the result This這 is?
I'm trying to use a for loop starting from the first character, but I don't know how to get each character in a string...

string = "This這 is是 English中文 …";

var NonEnglish = "[^\u0000-\u0080]+",
    Pattern = new RegExp(NonEnglish),
    MaxLength = 10,
    Ratio = 2;
  • Do you need to get the first 10 symbols of the string, or what? Commented Feb 27, 2014 at 5:29
  • If it's a mix of English and non-English, can't you just remove the non-English since you don't need it? Then do a split after that. Commented Feb 27, 2014 at 5:29
  • @Good.luck I need to get the first 10 symbols, but 1 non-English character counts as 2 symbols. Commented Feb 27, 2014 at 5:30
  • @fedmich ?? The words are just an example; the string could be th中文isisiisi. Commented Feb 27, 2014 at 5:32
  • @user1775888 Are we supposed to use the same regex you provided or something of our own? Commented Feb 27, 2014 at 5:38

2 Answers

If you mean you want the part of the string up to where its length reaches 10, here's the answer:

var string = "This這 is是 English中文 …";

function check(string){
  // Characters in \u0000-\u0080 have length 1; the other characters the OP wants have length 2
  var length = 0, i = 0, len = string.length;

  // you can iterate over a string just like an array
  for(;i < len; i++){

    // a character outside \u0000-\u0080 takes 2 spaces, any other character takes 1
    var width = /[^\u0000-\u0080]/.test(string[i]) ? 2 : 1;

    // stop before the character that would push the total past 10
    if(length + width > 10) break;
    length += width;
  }

  // return string from the first letter till the index where we aborted the for loop
  return string.substr(0, i);
}

alert(check(string));

EDIT 1:

  1. Replaced .match with .test. The former returns a whole array while the latter simply returns true or false.
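  2. Simplified the RegEx. Since we are checking only one character at a time, the + quantifier that was there before is not needed.
  3. Replaced string.length in the loop condition with the cached len, so the length is not looked up again on every iteration.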
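To illustrate the first point of the edit, here is a small sketch of how .match and .test differ for a single-character check. The variable name wide is only for illustration and is not part of the answer.

```javascript
// Matches one character outside the \u0000-\u0080 range
var wide = /[^\u0000-\u0080]/;

// String.prototype.match returns a match array, or null when nothing matches
console.log("這".match(wide)); // an array whose first element is "這"
console.log("a".match(wide));  // null

// RegExp.prototype.test only reports whether there was a match
console.log(wide.test("這"));  // true
console.log(wide.test("a"));   // false
```

Since the loop only needs a yes/no answer per character, .test is enough.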

3 Comments

Is it possible to use the variable i outside the scope of the for loop?
Just be careful, as this can take a long time to process because you run a regex on each character of the string.
@user1775888 I agree with fedmich. Here's a video which shows how. And see my answer edit, I added some things which might increase the speed significantly with larger strings.
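Regarding the concern about running a regex on every character: one possible alternative is to compare character codes directly with charCodeAt. The sketch below assumes the same rule as the question (code units above \u0080 count as 2, everything else as 1). The function name truncateByWidth and its parameters are hypothetical, not taken from the answer, and whether it is actually faster would need to be measured.

```javascript
// Hypothetical helper: keep the longest prefix of `text` whose total width
// does not exceed `maxWidth`. Code units above \u0080 count as `wideWidth`,
// all others count as 1.
function truncateByWidth(text, maxWidth, wideWidth) {
    var width = 0;
    var i = 0;
    for (; i < text.length; i++) {
        var w = text.charCodeAt(i) > 0x80 ? wideWidth : 1;
        // stop before the character that would exceed the limit
        if (width + w > maxWidth) break;
        width += w;
    }
    return text.slice(0, i);
}

console.log(truncateByWidth("This這 is是 English中文", 10, 2)); // "This這 is"
console.log(truncateByWidth("th中文isisiisi", 10, 2));          // "th中文isis"
```

Note that this works on UTF-16 code units, so a character outside the Basic Multilingual Plane (stored as a surrogate pair) would be counted as two wide characters and could be cut in half.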

I'd suggest something along the following lines (assuming that you're trying to break the string up into snippets that are <= 10 bytes in length):

string = "This這 is是 English中文 …";

function byteCount(text) {
    //get the number of bytes consumed by a string
    return encodeURI(text).split(/%..|./).length - 1;
}

function tokenize(text, targetLen) {
    //break a string up into snippets that are <= our target length
    var result = [];

    var pos = 0;
    var current = "";
    while (pos < text.length) {
        var next = current + text.charAt(pos);

        if (byteCount(next) > targetLen) {
            result.push(current);
            current = "";
            pos--;
        }
        else if (byteCount(next) == targetLen) {
            result.push(next);
            current = "";
        }
        else {
            current = next;
        }

        pos++;
    }
    if (current != "") {
       result.push(current);
    }

    return result;
};

console.log(tokenize(string, 10));

http://jsfiddle.net/5pc6L/
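One thing to keep in mind with the byte-based approach: encodeURI produces UTF-8 escapes, and in UTF-8 a character like 這 takes 3 bytes, not the 2 spaces described in the question. A minimal sketch comparing the two measures follows. byteCount is repeated from the answer so the snippet is self-contained, while displayWidth is a hypothetical helper that applies the question's rule.

```javascript
// Same approach as in the answer: number of UTF-8 bytes in a string
function byteCount(text) {
    return encodeURI(text).split(/%..|./).length - 1;
}

// Hypothetical helper: width under the question's rule
// (code units above \u0080 count as 2, everything else as 1)
function displayWidth(text) {
    var width = 0;
    for (var i = 0; i < text.length; i++) {
        width += text.charCodeAt(i) > 0x80 ? 2 : 1;
    }
    return width;
}

console.log(byteCount("這"), displayWidth("這"));         // 3 2
console.log(byteCount("This這"), displayWidth("This這")); // 7 6
```

If the 2-space rule is what is needed, swapping byteCount for a width function like this inside tokenize would be one way to adapt it.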
