Monday, March 30, 2009

Sphinx memo(2)

Restrictions on the source data
Every document id, i.e. the table primary key, must be a unique integer, otherwise things blow up.
There are a few different restrictions imposed on the source data which is going to be indexed by Sphinx, of which the single most important one is:

ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).

If this requirement is not met, different bad things can happen. For instance, Sphinx can crash with an internal assertion while indexing; or produce strange results when searching due to conflicting IDs. Also, a 1000-pound gorilla might eventually come out of your display and start throwing barrels at you. You've been warned.
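A quick way to catch bad IDs before indexing is to run them through a small check. This is only a sketch of the rule quoted above, not anything Sphinx ships: the function name `check_document_ids` and the `bits` parameter are made up for illustration.

```python
def check_document_ids(ids, bits=32):
    """Return a list of (id, reason) pairs for IDs that break the rule.

    Hypothetical helper: 'bits' stands in for the 32-bit or 64-bit
    build-time setting mentioned in the documentation.
    """
    max_id = 2 ** bits - 1
    problems = []
    seen = set()
    for doc_id in ids:
        # bool is a subclass of int in Python, so reject it explicitly
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            problems.append((doc_id, "not an integer"))
            continue
        if doc_id <= 0:
            problems.append((doc_id, "zero or negative"))
        elif doc_id > max_id:
            problems.append((doc_id, "does not fit in %d bits" % bits))
        elif doc_id in seen:
            problems.append((doc_id, "duplicate"))
        seen.add(doc_id)
    return problems
```

An empty list means the IDs satisfy the unique, unsigned, non-zero requirement for the chosen width.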



4.1. Matching modes

* SPH_MATCH_ALL, matches all query words (default mode);
* SPH_MATCH_ANY, matches any of the query words;
* SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;
* SPH_MATCH_BOOLEAN, matches query as a boolean expression (see Section 4.2, “Boolean query syntax”);
EXTENDED is the beefed-up search mode; it supports a lot of extra, rather magical query operators.
* SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see Section 4.3, “Extended query syntax”). As of 0.9.9, this has been superseded by SPH_MATCH_EXTENDED2, providing additional functionality and better performance. The ident is retained for legacy application code that will continue to be compatible once Sphinx and its components, including the API, are upgraded.
* SPH_MATCH_EXTENDED2, matches query using the second version of the Extended matching mode.
* SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below. NB, any query terms will be ignored, such that filters, filter-ranges and grouping will still be applied, but no text-matching.

FULLSCAN kicks in automatically when the query string is empty and docinfo storage is set to extern.
The SPH_MATCH_FULLSCAN mode will be automatically activated in place of the specified matching mode when the following conditions are met:

1. The query string is empty (i.e. its length is zero).
2. docinfo storage is set to extern.

In full scan mode every indexed document counts as a match. Such a query can still be filtered, sorted and grouped, and it can take load off the SQL server, for example when looking up all the posts by one user in a forum.
In full scan mode, all the indexed documents will be considered as matching. Such queries will still apply filters, sorting, and group by, but will not perform any full-text searching. This can be useful to unify full-text and non-full-text searching code, or to offload the SQL server (there are cases when Sphinx scans will perform better than analogous MySQL queries). An example of using the full scan mode might be to find posts in a forum. By selecting the forum's user ID via SetFilter() but not actually providing any search text, Sphinx will match every document (i.e. every post) where SetFilter() would match - in this case providing every post from that user. By default this will be ordered by relevancy, followed by Sphinx document ID in ascending order (earliest first).
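To make the forum example concrete, here is a tiny pure-Python model of what a full scan with a filter does. It does not call Sphinx at all; the `posts` data, the `full_scan` name and the dict-of-sets filter format are all invented for this sketch.

```python
# Made-up forum posts; 'user_id' plays the role of a Sphinx attribute.
posts = [
    {"id": 3, "user_id": 7, "text": "sphinx rocks"},
    {"id": 1, "user_id": 7, "text": "hello forum"},
    {"id": 2, "user_id": 9, "text": "another user"},
]


def full_scan(documents, filters):
    """Match every document, then keep those passing all attribute filters.

    'filters' maps an attribute name to the set of allowed values,
    loosely imitating what SetFilter() expresses.
    """
    matches = [
        doc for doc in documents
        if all(doc[attr] in allowed for attr, allowed in filters.items())
    ]
    # No text matching happens, so all weights are equal and the order
    # falls through to document ID ascending.
    return sorted(matches, key=lambda doc: doc["id"])
```

Calling `full_scan(posts, {"user_id": {7}})` returns the two posts of user 7, lowest document ID first, without looking at the text at all.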



4.4. Weighting
Which search result goes first and which goes last is decided by comparing weights, and each search mode computes the weight differently.
The specific weighting function (currently) depends on the search mode.

There are these major parts which are used in the weighting functions:

1. phrase rank,
2. statistical rank.

This one compares the longest, most perfect match.
Phrase rank is based on the length of the longest common subsequence (LCS) of search words between document body and query phrase. So if there's a perfect phrase match in some document then its phrase rank would be the highest possible, and equal to query words count.
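The phrase rank idea can be tried out with a textbook LCS over words. This sketch follows the wording above literally (a classic longest common subsequence, computed with dynamic programming); Sphinx's real implementation may well differ, and the whitespace tokenizer and the name `phrase_rank` are assumptions for illustration.

```python
def phrase_rank(query, body):
    """Length of the longest common subsequence of words in query and body."""
    q = query.lower().split()
    d = body.lower().split()
    # prev[j] holds the LCS length of the query words seen so far
    # against the first j body words.
    prev = [0] * (len(d) + 1)
    for qw in q:
        cur = [0]
        for j, dw in enumerate(d, start=1):
            if qw == dw:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```

A body containing the exact query phrase gets a rank equal to the number of query words, which is the "highest possible" value the text describes.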

This one compares how often words occur.
Statistical rank is based on classic BM25 function which only takes word frequencies into account. If the word is rare in the whole database (i.e. low frequency over document collection) or mentioned a lot in specific document (i.e. high frequency over matching document), it receives more weight. Final BM25 weight is a floating point number between 0 and 1.
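For a feel of how the two frequencies interact, here is a per-term score in the style of Okapi BM25. It is not Sphinx's exact formula: I assume the length-normalization parameter b is 0 (to mirror "only takes word frequencies into account"), use a non-negative IDF variant, and do not reproduce the scaling into the 0 to 1 range. The name `bm25_term` and the default `k1=1.2` are illustrative choices.

```python
import math


def bm25_term(tf, docs_with_term, total_docs, k1=1.2):
    """BM25-style score of one term in one document (b assumed to be 0).

    tf             -- occurrences of the term in this document
    docs_with_term -- number of documents containing the term
    total_docs     -- number of documents in the collection
    """
    # Rare terms (small docs_with_term) get a larger IDF.
    idf = math.log(
        (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0
    )
    # The term-frequency part grows with tf but saturates, controlled by k1.
    return idf * tf * (k1 + 1.0) / (tf + k1)
```

A term that is rare across the collection, or frequent within the document, scores higher, which is the behavior described above.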

We can also weight fields up front, something like x2 or x3.
In all modes, per-field weighted phrase ranks are computed as a product of LCS multiplied by per-field weight specified by user. Per-field weights are integer, default to 1, and cannot be set lower than 1.

In BOOLEAN mode every weight is 1.
In SPH_MATCH_BOOLEAN mode, no weighting is performed at all, every match weight is set to 1.

ALL and PHRASE add up the weighted ranks.
In SPH_MATCH_ALL and SPH_MATCH_PHRASE modes, final weight is a sum of weighted phrase ranks.

In SPH_MATCH_ANY mode, the idea is essentially the same, but it also adds a count of matching words in each field. Before that, weighted phrase ranks are additionally multiplied by a value big enough to guarantee that higher phrase rank in any field will make the match ranked higher, even if its field weight is low.

In SPH_MATCH_EXTENDED mode, final weight is a sum of weighted phrase ranks and BM25 weight, multiplied by 1000 and rounded to integer.
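Putting the rules for BOOLEAN, ALL, PHRASE and EXTENDED together gives something like the sketch below. The function names and the string mode labels are invented, and for EXTENDED I read the sentence above as "(sum of weighted phrase ranks + BM25) * 1000, rounded", which is my interpretation rather than a confirmed formula. SPH_MATCH_ANY is left out because the "value big enough" is not specified.

```python
def weighted_phrase_sum(lcs_by_field, field_weights=None):
    """Sum of per-field LCS values multiplied by per-field weights."""
    field_weights = field_weights or {}
    total = 0
    for field, lcs in lcs_by_field.items():
        weight = field_weights.get(field, 1)  # weights default to 1
        if not isinstance(weight, int) or weight < 1:
            raise ValueError("field weight must be an integer >= 1")
        total += lcs * weight
    return total


def final_weight(mode, lcs_by_field, field_weights=None, bm25=0.0):
    """Final match weight for a few match modes (illustrative labels)."""
    if mode == "boolean":
        return 1  # no weighting at all
    phrase_sum = weighted_phrase_sum(lcs_by_field, field_weights)
    if mode in ("all", "phrase"):
        return phrase_sum
    if mode == "extended":
        # Assumed reading: add the 0..1 BM25 weight, then scale and round.
        return int(round((phrase_sum + bm25) * 1000))
    raise ValueError("mode not covered by this sketch: %r" % (mode,))
```

For example, an LCS of 2 in a title weighted 3 plus an LCS of 1 in a body weighted 1 gives 7 in ALL mode, and 7500 in EXTENDED mode with a BM25 weight of 0.5.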

ALL and ANY will be able to use BM25 later on as well, which helps a lot for 1-word queries.
This is going to be changed, so that MATCH_ALL and MATCH_ANY modes use BM25 weights as well. This would improve search results in those match spans where phrase ranks are equal; this is especially useful for 1-word queries.

Key point: phrase matching ranks better.
The key idea (in all modes, besides boolean) is that better subphrase matches are ranked higher, and perfect matches are pulled to the top. Author's experience is that this phrase proximity based ranking provides noticeably better search quality than any statistical scheme alone (such as BM25, which is commonly used in other search engines).



4.5. Sorting modes
There are the following result sorting modes available:

Best ones first.
* SPH_SORT_RELEVANCE mode, that sorts by relevance in descending order (best matches first);
By the value of some attribute.
* SPH_SORT_ATTR_DESC mode, that sorts by an attribute in descending order (bigger attribute values first);
* SPH_SORT_ATTR_ASC mode, that sorts by an attribute in ascending order (smaller attribute values first);
By time.
* SPH_SORT_TIME_SEGMENTS mode, that sorts by time segments (last hour/day/week/month) in descending order, and then by relevance in descending order;
* SPH_SORT_EXTENDED mode, that sorts by SQL-like combination of columns in ASC/DESC order;
* SPH_SORT_EXPR mode, that sorts by an arithmetic expression.

RELEVANCE ignores any other parameters; the other sort modes order by the specified key first, then weight second, then id third.
SPH_SORT_RELEVANCE ignores any additional parameters and always sorts matches by relevance rank. All other modes require an additional sorting clause, with the syntax depending on specific mode. SPH_SORT_ATTR_ASC, SPH_SORT_ATTR_DESC and SPH_SORT_TIME_SEGMENTS modes require simply an attribute name. SPH_SORT_RELEVANCE is equivalent to sorting by "@weight DESC, @id ASC" in extended sorting mode, SPH_SORT_ATTR_ASC is equivalent to "attribute ASC, @weight DESC, @id ASC", and SPH_SORT_ATTR_DESC to "attribute DESC, @weight DESC, @id ASC" respectively.
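The equivalences above are easy to mimic with Python sort keys. This is a plain-Python illustration, not the Sphinx API: matches are dicts with made-up `id`, `weight` and attribute keys, and the attribute is assumed to be numeric so it can be negated for descending order.

```python
def sort_relevance(matches):
    """Equivalent of "@weight DESC, @id ASC"."""
    return sorted(matches, key=lambda m: (-m["weight"], m["id"]))


def sort_attr(matches, attr, descending=False):
    """Equivalent of "attribute ASC|DESC, @weight DESC, @id ASC"."""
    sign = -1 if descending else 1
    return sorted(
        matches,
        key=lambda m: (sign * m[attr], -m["weight"], m["id"]),
    )
```

Ties on the attribute are broken by weight (higher first), and remaining ties by document ID (lower first).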

SPH_SORT_TIME_SEGMENTS mode

Sorting by time, a good fit for blogs: time first, weight second.
In SPH_SORT_TIME_SEGMENTS mode, attribute values are split into so-called time segments, and then sorted by time segment first, and by relevance second.

The segments are calculated according to the current timestamp at the time when the search is performed, so the results would change over time. The segments are as follows:

* last hour,
* last day,
* last week,
* last month,
* last 3 months,
* everything else.

These segments are hardcoded, but it is trivial to change them if necessary.

This mode was added to support searching through blogs, news headlines, etc. When using time segments, recent records would be ranked higher because of segment, but within the same segment, more relevant records would be ranked higher - unlike sorting by just the timestamp attribute, which would not take relevance into account at all.
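Here is a rough model of the time segment sort. The segment list comes from the text above, but the exact boundaries are my assumption (a month taken as 30 days, 3 months as 90 days), and the function names and dict-based matches are made up for the example.

```python
HOUR = 3600
DAY = 24 * HOUR
# Upper age limits in seconds for: last hour, day, week, month, 3 months.
# The 30-day and 90-day values are assumptions.
SEGMENT_LIMITS = [HOUR, DAY, 7 * DAY, 30 * DAY, 90 * DAY]


def time_segment(timestamp, now):
    """Return the segment index: 0 = last hour ... 5 = everything else."""
    age = now - timestamp
    for index, limit in enumerate(SEGMENT_LIMITS):
        if age < limit:
            return index
    return len(SEGMENT_LIMITS)


def sort_time_segments(matches, attr, now):
    """Sort by segment (most recent first), then by weight descending."""
    return sorted(
        matches,
        key=lambda m: (time_segment(m[attr], now), -m["weight"]),
    )
```

Two posts from the last hour are ordered by weight, not by which one is a few minutes newer, while a heavily weighted post from two days ago still lands after both.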

There are settings for the other modes too, but I got lazy and skipped them XD



4.8. searchd query log format
Match mode can take one of the following values:

* "all" for SPH_MATCH_ALL mode;
* "any" for SPH_MATCH_ANY mode;
* "phr" for SPH_MATCH_PHRASE mode;
* "bool" for SPH_MATCH_BOOLEAN mode;
* "ext" for SPH_MATCH_EXTENDED mode;
* "ext2" for SPH_MATCH_EXTENDED2 mode;
* "scan" if the full scan mode was used, either by being specified with SPH_MATCH_FULLSCAN, or if the query was empty (as documented under Matching Modes)

Sort mode can take one of the following values:

* "rel" for SPH_SORT_RELEVANCE mode;
* "attr-" for SPH_SORT_ATTR_DESC mode;
* "attr+" for SPH_SORT_ATTR_ASC mode;
* "tsegs" for SPH_SORT_TIME_SEGMENTS mode;
* "ext" for SPH_SORT_EXTENDED mode.
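When reading the query log by hand or in a script, the two lists above boil down to a pair of lookup tables. The dictionaries only restate the values listed here; the helper `describe_log_modes` is a made-up convenience and does not parse a full log line.

```python
MATCH_MODE_LOG_NAMES = {
    "all": "SPH_MATCH_ALL",
    "any": "SPH_MATCH_ANY",
    "phr": "SPH_MATCH_PHRASE",
    "bool": "SPH_MATCH_BOOLEAN",
    "ext": "SPH_MATCH_EXTENDED",
    "ext2": "SPH_MATCH_EXTENDED2",
    # Also logged when the query was empty and full scan kicked in.
    "scan": "SPH_MATCH_FULLSCAN",
}

SORT_MODE_LOG_NAMES = {
    "rel": "SPH_SORT_RELEVANCE",
    "attr-": "SPH_SORT_ATTR_DESC",
    "attr+": "SPH_SORT_ATTR_ASC",
    "tsegs": "SPH_SORT_TIME_SEGMENTS",
    "ext": "SPH_SORT_EXTENDED",
}


def describe_log_modes(match_token, sort_token):
    """Translate the two log tokens into their constant names."""
    return MATCH_MODE_LOG_NAMES[match_token], SORT_MODE_LOG_NAMES[sort_token]
```

Note that "ext" appears in both tables with different meanings, so the two tokens have to be looked up separately.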
