I am reading something about searching for a (range of) string(s) in a sorted array of strings.
If you want to find all strings starting with “h”, you can run a
binary search for the strings “h” and “h\uFFFF”. This gives all the
indexes of the band for all the keys that start with “h”. Note that a
binary search can return the index where the string would be even if
it is not actually in the array.
I don’t understand anything from this paragraph.
h\uFFFF how does it help/is used in the binary search and does the last sentense also mean that even this search is faulty?
Any help to understand what is being said here please?
\uFFFF is the “character” that sorts last in the 16-bit “alphabet”, i.e. after any valid letter, character, or special symbol.
When you do binary search for a string in a sorted array, you find a place where that string could be inserted. When you have multiple identical strings, you get a location ahead of the first one. When you append “the last letter of the alphabet” after your string, the insertion point will be after the last of the identical strings, hence giving you a range of identical strings in a sorted array.
Imagine this: suppose you are not allowed to use letter
Z in your words. Now you have a sorted array of strings:
0 1 2 3 4 5 6 aab abb abc abc abd bcx bdy
If you search for
abc, binary search tells you the first place where you can insert it, which is 2. If you search for
abcZ, thoug, binary search would return 4, because
abcZ comes alphabetically right after
abc. This lets you know that the range between 2, inclusive, and 4, exclusive, is occupied by the string
abc. If both searches return the same number, you know that the string is not present in the array.
In the paragraph that you quoted,
\uFFFF plays the role of the “prohibited letter Z” from my example.
\uFFFF is the largest possible character in Java. Since the strings are sorted, searching for
h will find the start of the range while
h\uFFFF will find the end (assuming unicode strings here) since no second character can be larger than
\uFFFF. Even if it can’t match the string exactly, the search will return the index of where the target would be even if it’s not really there.
\uFFFF is the largest possible sortable unicode character in 16 bit block, if you are working with 32 bit blocks use
U+10FFFF (whatever that is in Java). I’ve personally never worked 32 bit unicode blocks in Java. See the section 16.7 of the 5.2.0 spec.
U+FFFF and U+10FFFF. These two noncharacter code points have the
attribute of being associated with the largest code unit values for
particular Unicode encoding forms. In UTF-16, U+FFFF is associated
with the largest 16-bit code unit value, FFFF . U+10FFFF is
associated with the largest legal UTF-32 32-bit code unit value,
10FFFF . This attribute renders these two noncharacter code points
useful for internal purposes as sentinels. For example, they might be
used to indicate the end of a list, to represent a value in an index
guaranteed to be higher than any valid character value, and so on
\uFFFF sequence in Java denotes the character with Unicode codepoint U+FFFF. However, the codepoint does not encode a character at all:
U+FFFF is used to represent a numeric value that is guaranteed not to be a character, for uses such as the final value at the end of an index.
As other answers have specified, searching for
h will find the beginning of the range of strings starting with
h\uFFFF will find the end (exclusive) of the range of strings starting with
h in your set of data.
The last sentence means that searching for
h\uFFFF will show you where you would insert such a string, if it does not exist in your data, which is why it gives you the exclusive end of your range.