Module ux_uca

Unicode Collation Algorithm (UCA); see Unicode Technical Standard #10.

Copyright © 2010-2011 Michael Uvarov

Authors: Michael Uvarov (arcusfelis@gmail.com).

Description

Unicode Collation Algorithm (UCA); see Unicode Technical Standard #10.

Additional information (and links)

1. Hangul Collation Requirements PS: This is the main source of information.

2. Terminator weight for Hangul

3. Theory vs. practice for Korean text collation PS: there is no practice at all; they do not use the UCA :/

4. Wiki

6. Unicode implementer's guide part 3: Conjoining jamo behavior

7. Unicode implementer's guide part 5: Collation

8. Unicode collation works now PS: I found it so late. :(

9. ICU

10. String Sorting (Natural) in Erlang Cookbook

For Hangul collation:

11. Hangul Collation Requirements

12. UTR 10

13. KSX1001 on Wiki

Levels

http://unicode.org/reports/tr10/#Multi_Level_Comparison

* L1 Base characters
* L2 Accents
* L3 Case
* L4 Punctuation

Example using levels:
   C = ux_uca_options:get_options([{strength, 3}]).
   ux_uca:sort_key(C, "Get L1-L3 weights").
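To see the effect of the strength option, the same pair of strings can be compared at different levels. The following is a minimal sketch that uses only calls documented on this page (`ux_uca_options:get_options/1` with a `{strength, N}` pair, and `compare/3`). The sample strings are arbitrary, and no results are shown because they depend on the collation data in use.

```erlang
%% Compare the same two strings at two different strengths.
%% At strength 1 only L1 (base character) weights are considered;
%% at strength 3, L2 (accents) and L3 (case) are considered as well.
S1 = "role",
S2 = "R\x{F4}le",   % "Rôle", written with an explicit code point escape
C1 = ux_uca_options:get_options([{strength, 1}]),
C3 = ux_uca_options:get_options([{strength, 3}]),
R1 = ux_uca:compare(C1, S1, S2),
R3 = ux_uca:compare(C3, S1, S2),
io:format("strength 1: ~p, strength 3: ~p~n", [R1, R3]).
```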

Common configurations

Non-ignorable

Variable collation elements are not reset to be ignorable, but get the weights explicitly listed for them in the collation element table. For example,

* SPACE would have the value [.0209.0020.0002]
* Capital A would be unchanged, with the value [.06D9.0020.0008]
* Ignorables are unchanged.

Example:
   C = ux_uca_options:get_options(non_ignorable).
   ux_uca:sort_key(C, "Non-ignorable collation sort key").

Blanked

Variable collation elements and any subsequent ignorables are reset so that their weights at levels one through three are zero. For example,

* SPACE would have the value [.0000.0000.0000]
* A combining grave accent after a space would have the value [.0000.0000.0000]
* Capital A would be unchanged, with the value [.06D9.0020.0008]
* A combining grave accent after a Capital A would be unchanged

Example:
   C = ux_uca_options:get_options(blanked).
   ux_uca:sort_key(C, "Blanked collation sort key").

Shifted

Variable collation elements are reset to zero at levels one through three. In addition, a new fourth-level weight is appended, whose value depends on the type of the element, as shown in Table 12 of UTS #10. Any subsequent primary or secondary ignorables following a variable are reset so that their weights at levels one through four are zero. For example,

* A combining grave accent after a space would have the value [.0000.0000.0000.0000].
* A combining grave accent after a Capital A would be unchanged.

Example:
   C = ux_uca_options:get_options(shifted).
   ux_uca:sort_key(C, "Shifted collation sort key").

Shift-trimmed

This option is the same as Shifted, except that all trailing FFFFs are trimmed from the sort key. This could be used to emulate POSIX behavior.

Example:
   C = ux_uca_options:get_options(shift_trimmed).
   ux_uca:sort_key(C, "Shift-trimmed collation sort key").
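The difference between the Non-ignorable, Blanked and Shifted options can be illustrated with a small self-contained sketch. It is not the code used by `ux_uca`. The module name `uca_alternate_sketch`, the function `weigh/2`, and the toy element shape `[Kind, L1, L2, L3]` are all assumptions made for illustration. The fourth-level values for Shifted (the old primary weight for a variable element, 0 for completely ignorable elements, FFFF for everything else) follow one reading of the UTS #10 table mentioned above and should be checked against the version of the standard you target.

```erlang
-module(uca_alternate_sketch).
-export([weigh/2]).

%% Toy collation element: [Kind, L1, L2, L3],
%% where Kind is `variable` or `non_variable`.
%% This shape is an assumption made for the sketch.

%% weigh(Alternate, Elems) -> list of weight lists.
weigh(non_ignorable, Elems) ->
    %% Variable elements simply keep their weights.
    [[L1, L2, L3] || [_Kind, L1, L2, L3] <- Elems];
weigh(blanked, Elems) ->
    blanked(Elems, false);
weigh(shifted, Elems) ->
    shifted(Elems, false).

%% The boolean is true while we are after a variable element and have
%% seen only ignorables (L1 =:= 0) since then.
blanked([], _AfterVar) ->
    [];
blanked([[variable, _L1, _L2, _L3] | T], _AfterVar) ->
    [[0, 0, 0] | blanked(T, true)];
blanked([[non_variable, 0, _L2, _L3] | T], true) ->
    %% Ignorable following a variable: blanked as well.
    [[0, 0, 0] | blanked(T, true)];
blanked([[non_variable, L1, L2, L3] | T], _AfterVar) ->
    [[L1, L2, L3] | blanked(T, false)].

shifted([], _AfterVar) ->
    [];
shifted([[variable, L1, _L2, _L3] | T], _AfterVar) ->
    %% The old primary weight moves to the fourth level.
    [[0, 0, 0, L1] | shifted(T, true)];
shifted([[non_variable, 0, _L2, _L3] | T], true) ->
    %% Ignorable following a variable: zero at all four levels.
    [[0, 0, 0, 0] | shifted(T, true)];
shifted([[non_variable, 0, 0, 0] | T], false) ->
    %% Completely ignorable element.
    [[0, 0, 0, 0] | shifted(T, false)];
shifted([[non_variable, L1, L2, L3] | T], _AfterVar) ->
    [[L1, L2, L3, 16#FFFF] | shifted(T, false)].
```

For the sequence SPACE, combining grave accent, Capital A, combining grave accent, `weigh(blanked, Elems)` zeroes the first two elements and leaves the last two unchanged, which matches the Blanked examples listed above. The accent weights used when trying this out are placeholders; only the SPACE and Capital A values are taken from this page.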

Data Types

result()

result() = {[uca_elem()], string()}

search_result()

search_result() = {string(), string(), string()}

uca_alternate()

uca_alternate() = shifted | shift_trimmed | non_ignorable | blanked

uca_array()

uca_array() = [uca_elem()]

uca_case_first()

uca_case_first() = lower | upper | off

uca_compare_result()

uca_compare_result() = lower | greater | equal

uca_elem()

uca_elem() = [atom() | uca_weight()]

uca_sort_key_format()

uca_sort_key_format() = binary | list | uncompressed

uca_strength()

uca_strength() = 1 | 2 | 3 | 4

uca_weight()

uca_weight() = integer()

uca_weights()

uca_weights() = [uca_weight()]

Function Index

compare/2       Compare two strings and return: lower, greater or equal.
compare/3
search/2
search/3
search/4
sort/1          Sort a list of strings.
sort/2          Sort a list of strings.
sort_array/1    Convert the Unicode string to the collation element array.
sort_array/2
sort_key/1      Convert the Unicode string to the sort key.
sort_key/2

Function Details

compare/2

compare(S1::string(), S2::string()) -> uca_compare_result()

Compare two strings and return: lower, greater or equal.
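As a usage sketch, `compare/2` can serve as the ordering function for `lists:sort/2`. This assumes that `lower` means the first argument sorts before the second, which the return type suggests but this page does not state explicitly. The sample strings are arbitrary.

```erlang
%% lists:sort/2 expects a fun that returns true when A may be placed
%% before B, so every result other than `greater` counts as true.
Strings = ["banana", "Apple", "cherry"],
Before = fun(A, B) -> ux_uca:compare(A, B) =/= greater end,
Sorted = lists:sort(Before, Strings),
io:format("~p~n", [Sorted]).
```

For plain sorting, `sort/1` is the more direct choice. A custom fun like this is mainly useful when the strings are fields inside larger terms.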

compare/3

compare(Uca_options::#uca_options{}, S1::string(), S2::string()) -> uca_compare_result()

search/2

search(Target::string(), Pattern::string()) -> search_result()

search/3

search(Target::string(), Pattern::string(), MatchStyle::atom()) -> search_result()

search(Uca_options::#uca_options{}, Target::string(), Pattern::string()) -> search_result()

search/4

search(Uca_options::#uca_options{}, Target::string(), Pattern::string(), MatchStyle::atom()) -> search_result()

sort/1

sort(Strings::[string()]) -> [string()]

Sort a list of strings.

sort/2

sort(Uca_options::#uca_options{}, Strings::[string()]) -> [string()]

Sort a list of strings.
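A short usage sketch for `sort/2`, combining it with one of the configurations described under "Common configurations". The input strings are arbitrary, and the output is not shown because it depends on the collation data in use.

```erlang
%% Sort with the "shifted" configuration, in which spaces and
%% punctuation are variable elements and only matter at level four.
C = ux_uca_options:get_options(shifted),
Sorted = ux_uca:sort(C, ["de luge", "de-luge", "deluge", "de Luge"]),
io:format("~p~n", [Sorted]).
```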

sort_array/1

sort_array(S) -> any()

Convert the Unicode string to the collation element array.

sort_array/2

sort_array(C, S) -> any()

sort_key/1

sort_key(S) -> any()

Convert the Unicode string to the sort key.

sort_key/2

sort_key(C, S) -> any()
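Sort keys are typically computed once per string and then compared many times. The sketch below pairs each string with its key and sorts on the key. It assumes that keys produced with the same options order correctly under Erlang's standard term comparison, which this page does not confirm, so treat it as an illustration rather than documented behaviour.

```erlang
%% Decorate each string with its sort key, sort on the key,
%% then drop the key again.
C = ux_uca_options:get_options(non_ignorable),
Strings = ["banana", "Apple", "cherry"],
Keyed = [{ux_uca:sort_key(C, S), S} || S <- Strings],
Sorted = [S || {_Key, S} <- lists:keysort(1, Keyed)],
io:format("~p~n", [Sorted]).
```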

