Module Ufind
Case insensitive, accent insensitive search engine
Ufind is a small ocaml library that provides a case insentitive, accent insensitive search in strings encoded in utf8. It is meant to be easy to use, either for searching simple lists, or for digging in large databases.
Thanks to Ubase, accents and more general diacritics are recognized for all Latin characters. For other alphabets/scripts, searching will remain accent sensitive.
- version
- 0.01
Examples
The simplest search function is filter_list
; you can use it for searching an arbitrary list with only one line of code:
# let list = [1,"Arthur"; 2,"Benoît"; 3,"Camille"; 4,"Damián"] in
Ufind.filter_list ~get_name:snd "á" list;;
- : (int * string) list = [(4, "Dami\195\161n"); (1, "Arthur"); (3, "Camille")]
Note that the results are sorted by relevance; the best match is "Damián" because it has an "á" with the correct accent.
However, for better speed and memory usage, as soon as you intend to do several searches in the same database, we recommend using a two-step approach, as in the example below.
Let us search a substring in a list of strings. Here we use the sample
list of names that can be found in the test directory.
First we prepare the data:
# let items = Ufind.items_from_names sample;;
val items : string Ufind.search_item Seq.t = <fun>
And then we may search:
# let result = Ufind.select_data items "ap";;
val result : string list =
["Olivia Apodaca"; "Gi\195\161p \196\144\195\180ng Ngh\225\187\139"]
# List.iter print_endline result;;
Olivia Apodaca
Giáp Đông Nghị
- : unit = ()
The string "Olivia Apodaca" came first, because the substring "ap" is present without any accent substitution. If we searched "áp" instead, the order of the results would have been inverted.
Search filters
The search can be greatly modified by playing with two filters: casefolding and matching_defect.
Casefolding
The casefolding
parameter is the function used to perform a case-insensitive search. Two strings that have the same image under this function will be considered equal (exact match).
This function is applied only in the preprocess stage, see below.
type casefolding
=
|
CF_D144
Unicode Core Spec, page 157.
|
CF_D145
Unicode Core Spec, page 158.
|
CF_D147
Unicode Core Spec, page 158.
|
CF_NONE
no transformation
|
CF_LATIN145
D145 restricted to Latin letters
|
CF_LATIN147
D147 restricted to Latin letters
|
CF_ASCII
ASCII lowercase
|
CF_CUSTOM of string → string
any compatible function -- see below
For enforcing a case-sensitive search, just use
CF_NONE
. The casefolding function can also be used to normalize the utf8 string; this is done byCF_D145
andCF_D147
. However, recall that normalizing is a slow operation, and it should rather be done directly when storing items in your database. The expected properties of the casefolding function are:- Removing accents (with
utf8_to_ascii
) followed by casefolding should give the same result as casefolding followed be removing accents. The resulting string is called the base string. It contains only ASCII characters.
- Applying casefolding on a base string should return the unmodified string.
- If two items have different base strings, they will never match in any search.
- If two items have the same base string, they should match in any search, but maybe we a low ranking (low matching quality).
Expected property for case-insensitive search:
- Equality of casefolded ASCII strings should be equivalent to equality of the strings obtained by applying
String.lowercase_ascii
.
CUSTOM casefolding functions can be used for specific cases. For instance, one can use
capitalize_casefold
, which combines a usual lower-case function with capitalizing the first letter. In this way we force the match to happen at the start of the string: "Ver" (or "ver") will match with "Véronique" and "VÉRONIQUE" but not with "Prévert" or "PRÉVERT". Of course, the same result can be obtained by a simplematching_defect
function, see below.- Removing accents (with
Matching defect
Ufind uses a function that establishes the quality of "A being a substring of B". It is entirely parameterizable. It should be fast and not deal with accents, only raw strings.
Searching
The Ufind library is meant for searching through a database by filtering a string field, typically a name. We use the vocabulary "name" for denoting the field in question; but of course, it can be any string field, as long as it is UTF8 encoded.
Preparing the data
Before searching, the data has to be preprocessed, in order to transform it into a sequence of search_item
s. The preprocess can be lazy (will be executed only when real search queries will be made), but if memory allows it, it will be much faster to preprocess the whole data in memory, especially if you intend to perform several searches in the same database.
A search item does not have to contain all the data of the original records, it only contains the "name" field and a "data" pointer to recover the original data.
val base_of_item : 'a search_item → string
Return the ASCII version of the item.
val data_of_item : 'a search_item → 'a
Return the data associated with the item.
val preprocess : ?folding:casefolding → ?limit:(int * int) → get_name:('a → string) → get_data:('a → 'b) → 'a Seq.t → 'b search_item Seq.t
preprocess ~get_name ~get_data seq
returns a sequence ofsearch_item
s from the source sequenceseq
.This function is a general way of obtaining a sequence of search items from any kind of data source, as long as this source can be accessed by a sequence (ocaml type
Seq.t
). The functionget_name
takes an element of the sequence and should return the name field that we are searching. The functionget_data
on an element of the sequence can return any type of data that we want to associate with the result of the search. It can be the same asget_name
. But, the data can also be the id of the correspondng record, so that from the result of the search on the name field we can recover the other fields of the matching records.The
preprocess
function is not lazy; as a consequence, the resulting sequence is very fast to search. Iflimit=(offset,count)
,preprocess
will force evaluation of the firstoffset+count
elements of the sequenceseq
, and return a sequence of at mostcount
effectively computedsearch_items
starting at item #offset
(inclusive). Warning, if nolimit
is given, the wholeseq
is processed; hence ifseq
is infinite, it will never terminate until memory overflows.In all
preprocess*
functions, the returned sequence is not mutable, has no side-effect, and will always point to the start of the source sequence. So there is no need to reset it for each new search.
val preprocess_list : ?folding:casefolding → get_name:('a → string) → get_data:('a → 'b) → 'a list → 'b search_item Seq.t
Same as
preprocess
but the database is a list instead of a sequence.
val preprocess_file : ?limit:(int * int) → get_name:(string → string) → get_data:(string → 'a) → string → 'a search_item Seq.t
Use this to search in a file, where each item must be encoded in a single line.
The name field should be obtained by
get_name line
, and the data field byget_data line
. Returns the sequence of pre-processed search items.
val items_from_seq : ?folding:casefolding → get_name:('a → string) → get_data:('a → 'b) → 'a Seq.t → 'b search_item Seq.t
Similar to
preprocess
except that there is no preprocessing: this immediately returns a lazy sequence of items, suitable for searching.If the source sequence is mutable (most sequences are), then for each new search, it has to be reset to its origin position, and a new call to
items_from_seq
is required.If memory allows and if you intend to perform several searches, use
preprocess
for faster searching.
val items_from_names : ?folding:casefolding → string list → string search_item Seq.t
items_from_names list
immediately returns a lazy sequence of search items from a list of strings.The returned sequence is not mutable, and will always point to the start of the list. So there is no need to reset it for each new search.
For faster searching, rather use
preprocess_list ~get_name:id ~get_data:id list
, whereid x = x
. The only interest ofitems_from_names
is when the list is really long and we don't want to duplicate it in memory.
val items_from_text : ?folding:casefolding → string → (int * string) search_item Seq.t
items_from_text text
immediately constructs a lazy list of search_item corresponding to each word of the stringtext
, where word delimiters are[ \t\n()]
. Usual punctuation signs are removed from the end of words.The returned sequence is not mutable, and will always point to the start of the text. So there is no need to reset it for each new search.
After searching with
select_data
, the resulting data is a list of pairs(pos, word)
wherepos
is the position of the word in the original string.
val items_from_channel : ?folding:casefolding → Pervasives.in_channel → (int * string) search_item Seq.t
items_from_channel channel
immediately constructs a lazy list of search_item corresponding to each word read from thechannel
, where word delimiters are[ \t\n()]
. Usual punctuation signs are removed from the end of words.The resulting sequence is mutable, and will point to the current search position in the channel, which is not closed by this function.
After searching with
select_data
, the resulting data is a list of pairs(pos, word)
wherepos
is the byte position of the word in channel, starting from the initial state of the channel.
Search results
val select_data : ?folding:casefolding → ?stop:('a search_item → bool) → ?matching_stop:((int * 'a search_item) → bool) → ?matching_defect:matching_defect → 'a search_item Seq.t → string → 'a list
select_data items name
searches for the stringname
within the sequence of search itemsitems
and returns the sorted list of data corresponding to matching items.If
stop
is not provided, the search will explore the whole sequence. Otherwise, the search will stop after processing the firstitem
initems
for whichstop item = true
. The argumentmatching_stop
operates in a similar way, but is executed only on matching items, and its argument is the couple(distance, item)
.The
folding
parameter must be the same as the one used to create theitems
sequence.
val make_stop : ?count:int → ?timeout:float → unit → ('a → bool) * (unit → bool)
make_stop ~count ~timeout ()
returns the pair(stop, flag)
, wherestop
is a 'stop' function suitable for use inselect_data
; andflag
is a function that returnstrue
when the stop test is effectively triggered.When used with
select_data
, the search will stop after processingcount
elements, or when thetimeout
(in seconds) is elapsed. Note that the timer starts as soon at the unit argument()
is provided.The stop function has to be created again after each use.
In conjunction with
seq_split
, theflag
can be used to resume a previously stopped search, as follows.let seq1, seq2 = seq_split items in let stop, flag = make_stop ~count:10 () in let result = select_data ~stop seq1 name in if flag () then begin print_endline "The search was interrupted. We resume it."; let result2 = select_data seq2 name in (* ...now the complete result is the union {result,result2}, but the global ranking is lost... *) end else print_endline "The search was complete."
val filter_list : ?folding:casefolding → ?matching_defect:matching_defect → get_name:('a → string) → string → 'a list → 'a list
This is the simplest search function; it doesn't require any preprocessing.
filter_list ~get_name name list
will filter (and sort by relevance) the givenlist
by returning only those elements whose name field (extracted byget_name
) matchesname
.Example:
# let list = [1,"Arthur"; 2,"Benoît"; 3,"Camille"; 4,"Damián"] in filter_list ~get_name:snd "á" list;; - : (int * string) list = [(4, "Dami\195\161n"); (1, "Arthur"); (3, "Camille")]
Because the returned list is just a subset of the initial list, one can easily refine a search by chaining several calls to
filter_list
.
module Matching : sig ... end
Obtaining detailed matching results
Utilities
Sequences
Sequences, or "delayed lists" is a standard data type in ocaml, see here. They represent lists where each element is obtained by evaluating some function, and the evaluation is "lazy", i.e. is done only when absolutely needed.
(In order to keep the Ufind library compatible with ocaml 4.05, the newer Seq
functions that appeared in 4.07 are not used.)
val seq_to_list_rev : 'a Seq.t → 'a list
Evaluate the whole sequence and convert it to a list, in reverse order.
val seq_eval : 'a Seq.t → 'a Seq.t
Evaluate the whole sequence and return a new sequence with the evaluated values.
val seq_truncate : int → int → 'a Seq.t → 'a Seq.t
(Half)-immediate truncation of a sequence.
seq_truncate offset count seq
returns a sequence of lengthcount
(or less in case the initial sequence is too short) containing the elements of the initialseq
starting at theoffset
-eth element.This operation is not entirely lazy: elements before #
offset
will be evaluated. But no other element.
val seq_stop : ('a → bool) → 'a Seq.t → 'a Seq.t
seq_stop stop seq
returns a lazy truncation of the sequenceseq
. When evaluated, the sequence will stop just after the functionstop
evaluated on an element returns true. For instancestop
can be a timer, seemake_stop
.
String conversions
Shortcuts to some Ubase functions.
Mysql interface
See an example here.