2009年3月31日 星期二

Sphinx memo(3)

9.2. Index configuration options

9.2.1 type
type default 是空的, 目前有空字串和 distributed 兩種值
Index type. Optional, default is empty (index is plain local index). Known values are empty string or 'distributed'.

Sphinx 支援兩種 index, local , 儲存在和運作在local machine 還有 distributed 不只包含local searching 還有 遠端的 querying instance
Sphinx supports two different types of indexes: local, that are stored and processed on the local machine; and distributed, that involve not only local searching but querying remote searchd instances over the network as well. Index type settings lets you choose this type. By default, indexes are local. Specifying 'distributed' for type enables distributed searching, see Section 4.7, “Distributed searching”.
Example:

type = distributed



9.2.2. source

Adds document source to local index. Multi-value, mandatory.

可以指定任何不同的source type, 認定document source 是從現在index 被indexed時做的, 至少必須有一個source
Specifies document source to get documents from when the current index is indexed. There must be at least one source. There may be multiple sources, without any restrictions on the source types: ie. you can pull part of the data from MySQL server, part from PostgreSQL, part from the filesystem using xmlpipe2 wrapper.

document IDs 必須是 globally unique, 這邊指的document 應該是table,
However, there are some restrictions on the source data. First, document IDs must be globally unique across all sources. If that condition is not met, you might get unexpected search results. Second, source schemas must be the same in order to be stored within the same index.

source ID 不會自動就有
No source ID is stored automatically. Therefore, in order to be able to tell what source the matched document came from, you will need to store some additional information yourself. Two typical approaches include:

1. mangling document ID and encoding source ID in it:
2. storing source ID simply as an attribute:
Example:
example



9.2.3. path
For reference, different index files store the following data:

* .spa stores document attributes (used in extern docinfo storage mode only);
* .spd stores matching document ID lists for each word ID;
* .sph stores index header information;
* .spi stores word lists (word IDs and pointers to .spd file);
* .spm stores MVA data;
* .spp stores hit (aka posting, aka word occurence) lists for each word ID.



9.2.4. docinfo

Document attribute values (docinfo) storage mode. Optional, default is 'extern'. Known values are 'none', 'extern' and 'inline'.

none 表示沒有attribute, inline 表示存在spd file 裡跟documnet ID lists 一起, extern 表示自己存在不同的file, spa 裡
Docinfo storage mode defines how exactly docinfo will be physically stored on disk and RAM. "none" means that there will be no docinfo at all (ie. no attributes). Normally you need not to set "none" explicitly because Sphinx will automatically select "none" when there are no attributes configured. "inline" means that the docinfo will be stored in the .spd file, along with the document ID lists. "extern" means that the docinfo will be stored separately (externally) from document ID lists, in a special .spa file.



9.2.20. ngram_len

N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1 (other lengths to be implemented).

提供基本的分詞
N-grams provide basic CJK (Chinese, Japanse, Koreasn) support for unsegmented texts. The issue with CJK searching is that there could be no clear separators between the words. Ideally, the texts would be filtered through a special program called segmenter that would insert separators in proper locations. However, segmenters are slow and error prone, and it's common to index contiguous groups of N characters, or n-grams, instead.

When this feature is enabled, streams of CJK characters are indexed as N-grams. For example, if incoming text is "ABCDEF" (where A to F represent some CJK characters) and length is 1, in will be indexed as if it was "A B C D E F". (With length equal to 2, it would produce "AB BC CD DE EF"; but only 1 is supported at the moment.) Only those characters that are listed in ngram_chars table will be split this way; other ones will not be affected.

Note that if search query is segmented, ie. there are separators between individual words, then wrapping the words in quotes and using extended mode will resut in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Sphinx and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.

Even if the search query is not segmented, Sphinx should still produce good results, thanks to phrase based ranking: it will pull closer phrase matches (which in case of N-gram CJK words can mean closer multi-character word matches) to the top.


9.2.21. ngram_chars
只有在ngram_char 裡的characters 才會被切
N-gram characters list. Optional, default is empty.

To be used in conjunction with in ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by N-gram indexing feature. The value format is identical to charset_table.



9.2.11. min_word_len

Minimum indexed word length. Optional, default is 1 (index everything).

Only those words that are not shorter than this minimum will be indexed. For instance, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.



9.2.25. html_index_attrs

A list of markup attributes to index when stripping HTML. Optional, default is empty (do not index markup attributes).

Specifies HTML markup attributes whose contents should be retained and indexed even though other HTML markup is stripped. The format is per-tag enumeration of indexable attributes, as shown in the example below.
Example:

html_index_attrs = img=alt,title; a=title;



9.2.16. min_infix_len

Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows to implement wildcard searching by 'start*', '*end', and '*middle*' wildcards (refer to enable_star option for details on wildcard syntax). When mininum infix length is set to a positive number, indexer will index all the possible keyword infixes (ie. substrings) in addition to the keywords themselves. Too short infixes (below the minimum allowed length) will not be indexed.

For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such index for "es" will match documents that contain "test" word, even if they do not contain "es" on itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.

There's no automatic way to rank perfect word matches higher in an infix index, but the same tricks as with prefix indexes can be applied.



9.2.24. html_strip

Whether to strip HTML markup from incoming full-text data. Optional, default is 0. Known values are 0 (disable stripping) and 1 (enable stripping).

Stripping does not work with xmlpipe source type (it's suggested to upgrade to xmlpipe2 anyway). It should work with properly formed HTML and XHTML, but, just as most browsers, may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).

Only the tags themselves, and also HTML comments, are stripped. To strip the contents of the tags too (eg. to strip embedded scripts), see html_remove_elements option. There are no restrictions on tag names; ie. everything that looks like a valid tag start, or end, or a comment will be stripped.



9.2.26. html_remove_elements
指定可以找到的tag
A list of HTML elements for which to strip contents along with the elements themselves. Optional, default is empty string (do not strip contents of any elements).

This feature allows to strip element contents, ie. everything that is between the opening and the closing tags. It is useful to remove embedded scripts, CSS, etc. Short tag form for empty elements (ie.
) is properly supported; ie. the text that follows such tag will not be removed.

The value is a comma-separated list of element (tag) names whose contents should be removed. Tag names are case insensitive.
Example:

html_remove_elements = style, script



9.2.13. charset_table
能吃的character
Accepted characters table, with case folding rules. Optional, default value depends on charset_type value.

charset_table is the main workhorse of Sphinx tokenizing process, ie. the process of extracting keywords from document text or query txet. It controls what characters are accepted as valid and what are not, and how the accepted characters should be transformed (eg. should the case be removed or not).

You can think of charset_table as of a big table that has a mapping for each and every of 100K+ characters in Unicode (or as of a small 256-character table if you're using SBCS). By default, every character maps to 0, which means that it does not occur within keywords and should be treated as a separator. Once mentioned in the table, character is mapped to some other character (most frequently, either to itself or to a lowercase letter), and is treated as a valid keyword part.

The expected value format is a commas-separated list of mappings. Two simplest mappings simply declare a character as valid, and map a single character to another single character, respectively. But specifying the whole table in such form would result in bloated and barely manageable specifications. So there are several syntax shortcuts that let you map ranges of characters at once. The complete list is as follows:

A->a
Single char mapping, declares source char 'A' as allowed to occur within keywords and maps it to destination char 'a' (but does not declare 'a' as allowed).
A..Z->a..z
Range mapping, declares all chars in source range as allowed and maps them to the destination range. Does not declare destination range as allowed. Also checks ranges' lengths (the lengths must be equal).
a
Stray char mapping, declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
a..z
Stray range mapping, declares all characters in range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
A..Z/2
Checkerboard range map. Maps every pair of chars to the second char. More formally, declares odd characters in range as allowed and maps them to the even ones; also declares even characters as allowed and maps them to themselves. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for a number of Unicode blocks where uppercase and lowercase letters go in such interleaved order instead of contiguous chunks.

Control characters with codes from 0 to 31 are always treated as separators. Characters with codes 32 to 127, ie. 7-bit ASCII characters, can be used in the mappings as is. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+xxx form, where 'xxx' is hexadecimal codepoint number. This form can also be used for 7-bit ASCII characters to encode special ones: eg. use U+20 to encode space, U+2E to encode dot, U+2C to encode comma.



9.2.8. stopwords

Stopword files list (space separated). Optional, default is empty.

Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.

You can specify several file names, separated by spaces. All the files will be loaded. Stopwords file format is simple plain text. The encoding must match index encoding specified in charset_type. File data will be tokenized with respect to charset_table settings, so you can use the same separators as in the indexed data. The stemmers will also be applied when parsing stopwords file.

While stopwords are not indexed, they still do affect the keyword positions. For instance, assume that "the" is a stopword, that document 1 contains the line "in office", and that document 2 contains "in the office". Searching for "in office" as for exact phrase will only return the first document, as expected, even though "the" in the second one is stopped.

沒有留言: