Module Ufind
Case insensitive, accent insensitive search engine
Ufind is a small ocaml library that provides a case insentitive, accent insensitive search in strings encoded in utf8. It is meant to be easy to use, either for searching simple lists, or for digging in large databases.
Thanks to Ubase, accents and more general diacritics are recognized for all Latin characters. For other alphabets/scripts, searching will remain accent sensitive.
- version
- 0.01
Examples
The simplest search function is filter_list; you can use it for searching an arbitrary list with only one line of code:
# let list = [1,"Arthur"; 2,"Benoît"; 3,"Camille"; 4,"Damián"] in
Ufind.filter_list ~get_name:snd "á" list;;
- : (int * string) list = [(4, "Dami\195\161n"); (1, "Arthur"); (3, "Camille")] Note that the results are sorted by relevance; the best match is "Damián" because it has an "á" with the correct accent.
However, for better speed and memory usage, as soon as you intend to do several searches in the same database, we recommend using a two-step approach, as in the example below.
Let us search a substring in a list of strings. Here we use the sample list of names that can be found in the test directory.
First we prepare the data:
# let items = Ufind.items_from_names sample;;
val items : string Ufind.search_item Seq.t = <fun>And then we may search:
# let result = Ufind.select_data items "ap";;
val result : string list =
["Olivia Apodaca"; "Gi\195\161p \196\144\195\180ng Ngh\225\187\139"]
# List.iter print_endline result;;
Olivia Apodaca
Giáp Đông Nghị
- : unit = () The string "Olivia Apodaca" came first, because the substring "ap" is present without any accent substitution. If we searched "áp" instead, the order of the results would have been inverted.
Search filters
The search can be greatly modified by playing with two filters: casefolding and matching_defect.
Casefolding
The casefolding parameter is the function used to perform a case-insensitive search. Two strings that have the same image under this function will be considered equal (exact match).
This function is applied only in the preprocess stage, see below.
type casefolding=|CF_D144Unicode Core Spec, page 157.
|CF_D145Unicode Core Spec, page 158.
|CF_D147Unicode Core Spec, page 158.
|CF_NONEno transformation
|CF_LATIN145D145 restricted to Latin letters
|CF_LATIN147D147 restricted to Latin letters
|CF_ASCIIASCII lowercase
|CF_CUSTOM of string → stringany compatible function -- see below
For enforcing a case-sensitive search, just use
CF_NONE. The casefolding function can also be used to normalize the utf8 string; this is done byCF_D145andCF_D147. However, recall that normalizing is a slow operation, and it should rather be done directly when storing items in your database. The expected properties of the casefolding function are:- Removing accents (with
utf8_to_ascii) followed by casefolding should give the same result as casefolding followed be removing accents. The resulting string is called the base string. It contains only ASCII characters.
- Applying casefolding on a base string should return the unmodified string.
- If two items have different base strings, they will never match in any search.
- If two items have the same base string, they should match in any search, but maybe we a low ranking (low matching quality).
Expected property for case-insensitive search:
- Equality of casefolded ASCII strings should be equivalent to equality of the strings obtained by applying
String.lowercase_ascii.
CUSTOM casefolding functions can be used for specific cases. For instance, one can use
capitalize_casefold, which combines a usual lower-case function with capitalizing the first letter. In this way we force the match to happen at the start of the string: "Ver" (or "ver") will match with "Véronique" and "VÉRONIQUE" but not with "Prévert" or "PRÉVERT". Of course, the same result can be obtained by a simplematching_defectfunction, see below.- Removing accents (with
Matching defect
Ufind uses a function that establishes the quality of "A being a substring of B". It is entirely parameterizable. It should be fast and not deal with accents, only raw strings.
Searching
The Ufind library is meant for searching through a database by filtering a string field, typically a name. We use the vocabulary "name" for denoting the field in question; but of course, it can be any string field, as long as it is UTF8 encoded.
Preparing the data
Before searching, the data has to be preprocessed, in order to transform it into a sequence of search_items. The preprocess can be lazy (will be executed only when real search queries will be made), but if memory allows it, it will be much faster to preprocess the whole data in memory, especially if you intend to perform several searches in the same database.
A search item does not have to contain all the data of the original records, it only contains the "name" field and a "data" pointer to recover the original data.
val base_of_item : 'a search_item → stringReturn the ASCII version of the item.
val data_of_item : 'a search_item → 'aReturn the data associated with the item.
val preprocess : ?folding:casefolding → ?limit:(int * int) → get_name:('a → string) → get_data:('a → 'b) → 'a Seq.t → 'b search_item Seq.tpreprocess ~get_name ~get_data seqreturns a sequence ofsearch_items from the source sequenceseq.This function is a general way of obtaining a sequence of search items from any kind of data source, as long as this source can be accessed by a sequence (ocaml type
Seq.t). The functionget_nametakes an element of the sequence and should return the name field that we are searching. The functionget_dataon an element of the sequence can return any type of data that we want to associate with the result of the search. It can be the same asget_name. But, the data can also be the id of the correspondng record, so that from the result of the search on the name field we can recover the other fields of the matching records.The
preprocessfunction is not lazy; as a consequence, the resulting sequence is very fast to search. Iflimit=(offset,count),preprocesswill force evaluation of the firstoffset+countelements of the sequenceseq, and return a sequence of at mostcounteffectively computedsearch_itemsstarting at item #offset(inclusive). Warning, if nolimitis given, the wholeseqis processed; hence ifseqis infinite, it will never terminate until memory overflows.In all
preprocess*functions, the returned sequence is not mutable, has no side-effect, and will always point to the start of the source sequence. So there is no need to reset it for each new search.
val preprocess_list : ?folding:casefolding → get_name:('a → string) → get_data:('a → 'b) → 'a list → 'b search_item Seq.tSame as
preprocessbut the database is a list instead of a sequence.
val preprocess_file : ?limit:(int * int) → get_name:(string → string) → get_data:(string → 'a) → string → 'a search_item Seq.tUse this to search in a file, where each item must be encoded in a single line.
The name field should be obtained by
get_name line, and the data field byget_data line. Returns the sequence of pre-processed search items.
val items_from_seq : ?folding:casefolding → get_name:('a → string) → get_data:('a → 'b) → 'a Seq.t → 'b search_item Seq.tSimilar to
preprocessexcept that there is no preprocessing: this immediately returns a lazy sequence of items, suitable for searching.If the source sequence is mutable (most sequences are), then for each new search, it has to be reset to its origin position, and a new call to
items_from_seqis required.If memory allows and if you intend to perform several searches, use
preprocessfor faster searching.
val items_from_names : ?folding:casefolding → string list → string search_item Seq.titems_from_names listimmediately returns a lazy sequence of search items from a list of strings.The returned sequence is not mutable, and will always point to the start of the list. So there is no need to reset it for each new search.
For faster searching, rather use
preprocess_list ~get_name:id ~get_data:id list, whereid x = x. The only interest ofitems_from_namesis when the list is really long and we don't want to duplicate it in memory.
val items_from_text : ?folding:casefolding → string → (int * string) search_item Seq.titems_from_text textimmediately constructs a lazy list of search_item corresponding to each word of the stringtext, where word delimiters are[ \t\n()]. Usual punctuation signs are removed from the end of words.The returned sequence is not mutable, and will always point to the start of the text. So there is no need to reset it for each new search.
After searching with
select_data, the resulting data is a list of pairs(pos, word)whereposis the position of the word in the original string.
val items_from_channel : ?folding:casefolding → Pervasives.in_channel → (int * string) search_item Seq.titems_from_channel channelimmediately constructs a lazy list of search_item corresponding to each word read from thechannel, where word delimiters are[ \t\n()]. Usual punctuation signs are removed from the end of words.The resulting sequence is mutable, and will point to the current search position in the channel, which is not closed by this function.
After searching with
select_data, the resulting data is a list of pairs(pos, word)whereposis the byte position of the word in channel, starting from the initial state of the channel.
Search results
val select_data : ?folding:casefolding → ?stop:('a search_item → bool) → ?matching_stop:((int * 'a search_item) → bool) → ?matching_defect:matching_defect → 'a search_item Seq.t → string → 'a listselect_data items namesearches for the stringnamewithin the sequence of search itemsitemsand returns the sorted list of data corresponding to matching items.If
stopis not provided, the search will explore the whole sequence. Otherwise, the search will stop after processing the firstiteminitemsfor whichstop item = true. The argumentmatching_stopoperates in a similar way, but is executed only on matching items, and its argument is the couple(distance, item).The
foldingparameter must be the same as the one used to create theitemssequence.
val make_stop : ?count:int → ?timeout:float → unit → ('a → bool) * (unit → bool)make_stop ~count ~timeout ()returns the pair(stop, flag), wherestopis a 'stop' function suitable for use inselect_data; andflagis a function that returnstruewhen the stop test is effectively triggered.When used with
select_data, the search will stop after processingcountelements, or when thetimeout(in seconds) is elapsed. Note that the timer starts as soon at the unit argument()is provided.The stop function has to be created again after each use.
In conjunction with
seq_split, theflagcan be used to resume a previously stopped search, as follows.let seq1, seq2 = seq_split items in let stop, flag = make_stop ~count:10 () in let result = select_data ~stop seq1 name in if flag () then begin print_endline "The search was interrupted. We resume it."; let result2 = select_data seq2 name in (* ...now the complete result is the union {result,result2}, but the global ranking is lost... *) end else print_endline "The search was complete."
val filter_list : ?folding:casefolding → ?matching_defect:matching_defect → get_name:('a → string) → string → 'a list → 'a listThis is the simplest search function; it doesn't require any preprocessing.
filter_list ~get_name name listwill filter (and sort by relevance) the givenlistby returning only those elements whose name field (extracted byget_name) matchesname.Example:
# let list = [1,"Arthur"; 2,"Benoît"; 3,"Camille"; 4,"Damián"] in filter_list ~get_name:snd "á" list;; - : (int * string) list = [(4, "Dami\195\161n"); (1, "Arthur"); (3, "Camille")]Because the returned list is just a subset of the initial list, one can easily refine a search by chaining several calls to
filter_list.
module Matching : sig ... endObtaining detailed matching results
Utilities
Sequences
Sequences, or "delayed lists" is a standard data type in ocaml, see here. They represent lists where each element is obtained by evaluating some function, and the evaluation is "lazy", i.e. is done only when absolutely needed.
(In order to keep the Ufind library compatible with ocaml 4.05, the newer Seq functions that appeared in 4.07 are not used.)
val seq_to_list_rev : 'a Seq.t → 'a listEvaluate the whole sequence and convert it to a list, in reverse order.
val seq_eval : 'a Seq.t → 'a Seq.tEvaluate the whole sequence and return a new sequence with the evaluated values.
val seq_truncate : int → int → 'a Seq.t → 'a Seq.t(Half)-immediate truncation of a sequence.
seq_truncate offset count seqreturns a sequence of lengthcount(or less in case the initial sequence is too short) containing the elements of the initialseqstarting at theoffset-eth element.This operation is not entirely lazy: elements before #
offsetwill be evaluated. But no other element.
val seq_stop : ('a → bool) → 'a Seq.t → 'a Seq.tseq_stop stop seqreturns a lazy truncation of the sequenceseq. When evaluated, the sequence will stop just after the functionstopevaluated on an element returns true. For instancestopcan be a timer, seemake_stop.
String conversions
Shortcuts to some Ubase functions.
Mysql interface
See an example here.