Monday, March 30, 2009

Sphinx memo(1)

Sphinx 0.9.9 reference manual

3. Indexing
3.1 Data sources
Data can come from databases, text files, HTML files, mailboxes, and so on. Sphinx turns all of these into its own structured documents; in SQL terms, each row corresponds to one document and each column to one field.

The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx's point of view, the data it indexes is a set of structured documents, each of which has the same set of fields. This is biased towards SQL, where each row corresponds to a document, and each column to a field.
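
The row-to-document mapping can be sketched in a few lines of Python. This is only an illustration of the idea, not how Sphinx's drivers are implemented; the `posts` table, its columns, and the `rows_to_documents` helper are all made up for the example, and an in-memory SQLite database stands in for a real data source.

```python
import sqlite3


def rows_to_documents(conn, query):
    """Turn each row returned by `query` into one document (field name -> value)."""
    cursor = conn.execute(query)
    field_names = [col[0] for col in cursor.description]
    return [dict(zip(field_names, row)) for row in cursor]


# Hypothetical table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title, content) VALUES (?, ?, ?)",
    [(1, "Hello", "First post"), (2, "Sphinx", "Full-text search notes")],
)

# Every document ends up with the same set of fields: id, title, content.
documents = rows_to_documents(conn, "SELECT id, title, content FROM posts ORDER BY id")
for doc in documents:
    print(doc)
```

Each dictionary plays the role of one structured document, and its keys are the fields shared by all documents.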

Different data sources need different code to fetch the data and prepare it for indexing; that code is called a data source driver.
Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing. This code is called a data source driver (or simply driver or data source for brevity).

Drivers currently exist for MySQL and PostgreSQL, which run queries and fetch the data. There is also a driver called xmlpipe, which runs a given command and reads the data from its stdout.
At the time of this writing, there are drivers for MySQL and PostgreSQL databases, which can connect to the database using its native C/C++ API, run queries and fetch the data. There's also a driver called xmlpipe, which runs a specified command and reads the data from its stdout. See Section 3.8, “xmlpipe data source”, for the format description.
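
The "run a command and read its stdout" pattern behind xmlpipe can be mimicked in Python. This is a rough sketch of the pattern only: the `<docs>` wrapper, the element names, and the `read_documents` helper are assumptions made for the example, and the real stream format is the one described in Section 3.8 of the manual.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET

# Stand-in for the "specified command": a Python one-liner that prints two
# documents to stdout. The element names here are hypothetical.
EMIT = (
    "print('<docs>"
    "<document><id>1</id><title>Hello</title></document>"
    "<document><id>2</id><title>Sphinx</title></document>"
    "</docs>')"
)


def read_documents(command):
    """Run `command`, capture its stdout, and build one dict per <document>."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    root = ET.fromstring(result.stdout)
    return [{child.tag: child.text for child in doc} for doc in root.iter("document")]


documents = read_documents([sys.executable, "-c", EMIT])
print(documents)
```

The point is that the indexer does not need to know where the data lives; any program that can print documents to stdout can act as a source.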



3.2 Attributes
Attributes are there for filtering and sorting.
Attributes are additional values associated with each document that can be used to perform additional filtering and sorting during search.

We often need to sort on certain columns or filter on values from other tables, for example restricting a blog search to certain users, so Sphinx lets us attach attributes to each document.
It is often desired to additionally process full-text search results based not only on matching document ID and its rank, but on a number of other per-document values as well. For instance, one might need to sort news search results by date and then relevance, or search through products within a specified price range, or limit blog search to posts made by selected users, or group results by month. To do that efficiently, Sphinx allows you to attach a number of additional attributes to each document, and store their values in the full-text index. It's then possible to use stored values to filter, sort, or group full-text matches.

Unlike fields, attributes are not full-text indexed. They are only stored in the index and cannot be searched.
Attributes, unlike the fields, are not full-text indexed. They are stored in the index, but it is not possible to search them as full-text, and attempting to do so results in an error.

Attributes can be used for filtering, though, to restrict returned rows, as well as sorting or result grouping; it is entirely possible to sort results purely based on attributes, and ignore the search relevance tools. Additionally, attributes are returned from the search daemon, while the indexed text is not.

A good example of using attributes is forum posts. Suppose only the title and content need to be searchable, but sometimes the search has to be limited to a certain author or category, or the results sorted by posting time. That is when attributes are needed.
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable, but that sometimes it is also required to limit search to a certain author or a sub-forum (i.e. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.

This can be done by declaring all the mentioned columns (author, date, and so on) as attributes and indexing them.
This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to set up filtering, sorting, and grouping.
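
To make the forum example concrete, here is a small pure-Python model of what filtering, sorting, and grouping on attributes mean. It does not call the Sphinx API; the sample rows and the helper names (`ts`, `filter_by`, `month_of`) are made up, and only the column names author_id, forum_id, and post_date come from the text above.

```python
from collections import Counter
from datetime import datetime, timezone


def ts(year, month, day):
    """UNIX timestamp for midnight UTC of the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Hypothetical matches for some full-text query; only the attributes are shown.
matches = [
    {"id": 1, "author_id": 123, "forum_id": 1, "post_date": ts(2009, 1, 15)},
    {"id": 2, "author_id": 456, "forum_id": 3, "post_date": ts(2009, 2, 2)},
    {"id": 3, "author_id": 123, "forum_id": 7, "post_date": ts(2009, 2, 20)},
    {"id": 4, "author_id": 123, "forum_id": 9, "post_date": ts(2009, 3, 1)},
]


def filter_by(rows, attr, allowed):
    """Keep only rows whose attribute value is one of the allowed values."""
    return [row for row in rows if row[attr] in allowed]


def month_of(row):
    """Month of the post_date attribute, formatted as YYYY-MM (UTC)."""
    return datetime.fromtimestamp(row["post_date"], tz=timezone.utc).strftime("%Y-%m")


# Limit to one author, then to a few sub-forums.
by_author = filter_by(matches, "author_id", {123})
in_forums = filter_by(by_author, "forum_id", {1, 3, 7})

# Sort the remaining posts by date, newest first, ignoring relevance.
newest_first = sorted(in_forums, key=lambda row: row["post_date"], reverse=True)

# Group all matches by month and count the matches in each group.
per_month = Counter(month_of(row) for row in matches)

print([row["id"] for row in newest_first])
print(dict(per_month))
```

None of these steps looks at the title or content text, which is why the values have to be stored as attributes next to the full-text index.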
Attribute names are case insensitive.
Attributes are named. Attribute names are case insensitive. Attributes are not full-text indexed; they are stored in the index as is. Currently supported attribute types are:

Attributes currently support the following types:
* unsigned integers (1-bit to 32-bit wide);
* UNIX timestamps;
* floating point values (32-bit, IEEE 754 single precision);
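* string ordinals (specially computed integers); as far as I can tell, each string is replaced by its position in the sorted list of all string values, so that sorting by the integer sorts by the string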
* MVA, multi-value attributes (variable-length lists of 32-bit unsigned integers).
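
The "specially computed integers" behind string ordinals can be pictured with a short sketch. It assumes that an ordinal is the string's position in the sorted list of all values and that equal strings share one ordinal; both are assumptions made for this illustration, not details confirmed by the text above, and `string_ordinals` is a made-up helper.

```python
def string_ordinals(values):
    """Replace each string by its 1-based rank among the sorted distinct values."""
    ordered = sorted(set(values))
    rank = {text: position for position, text in enumerate(ordered, start=1)}
    return [rank[value] for value in values]


authors = ["pear", "apple", "fig", "apple"]
print(string_ordinals(authors))
```

Sorting documents by such an integer gives the same order as sorting them by the original string, which is what makes a fixed-width attribute usable for string sorting.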

The complete set of per-document attribute values is sometimes referred to as docinfo. Docinfos can either be

* stored separately from the main full-text index data ("extern" storage, in .spa file), or
* attached to each occurrence of document ID in full-text index data ("inline" storage, in .spd file).

With extern storage, the .spa file holding all attribute values for all documents is kept in RAM, because that is faster. Inline storage, by contrast, needs no extra RAM, but the index grows a lot, because all the attributes are copied every time a document ID appears. Inline storage probably only makes sense with very few attributes and limited RAM; in most cases extern storage is more efficient for both indexing and searching.
When using extern storage, a copy of .spa file (with all the attribute values for all the documents) is kept in RAM by searchd at all times. This is for performance reasons; random disk I/O would be too slow. On the contrary, inline storage does not require any additional RAM at all, but that comes at the cost of greatly inflating the index size: remember that it copies all attribute values every time the document ID is mentioned, and that is exactly as many times as there are different keywords in the document. Inline may be the only viable option if you have only a few attributes and need to work with big datasets in limited RAM. However, in most cases extern storage makes both indexing and searching much more efficient.
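
A back-of-the-envelope comparison shows why inline storage inflates the index. The sketch below counts attribute values only (document IDs are left out) and assumes 4 bytes per value; the function names and the sample numbers are made up for illustration.

```python
BYTES_PER_VALUE = 4  # assumption: every attribute value takes 4 bytes


def extern_attr_bytes(num_docs, num_attrs):
    """Extern storage: one copy of the attribute values per document."""
    return num_docs * num_attrs * BYTES_PER_VALUE


def inline_attr_bytes(num_docs, num_attrs, distinct_keywords_per_doc):
    """Inline storage: one copy per distinct keyword in each document."""
    return num_docs * distinct_keywords_per_doc * num_attrs * BYTES_PER_VALUE


docs, attrs, keywords = 1_000_000, 3, 200
extern = extern_attr_bytes(docs, attrs)
inline = inline_attr_bytes(docs, attrs, keywords)
print(extern, inline, inline // extern)
```

Under these assumptions the attribute data grows by a factor equal to the number of distinct keywords per document, so inline storage only stays reasonable when there are very few attributes.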

How to compute the memory needed at search time: the result is the memory used by the running daemon, not by each query. The daemon allocates this much memory every time it starts.
Search-time memory requirements for extern storage are (1+number_of_attrs)*number_of_docs*4 bytes, i.e. 10 million docs with 2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM. This is PER DAEMON, not per query. searchd will allocate 160 MB on startup, read the data and keep it shared between queries. The children will NOT allocate any additional copies of this data.
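
The formula is easy to check with a tiny helper. The function name is made up; the formula and the worked example (10 million documents, 2 groups plus 1 timestamp) are the ones quoted above.

```python
def extern_docinfo_ram_bytes(number_of_attrs, number_of_docs):
    """RAM used by searchd for extern docinfo: (1 + attrs) * docs * 4 bytes."""
    return (1 + number_of_attrs) * number_of_docs * 4


# 2 group attributes + 1 timestamp = 3 attributes, 10 million documents.
ram = extern_docinfo_ram_bytes(number_of_attrs=3, number_of_docs=10_000_000)
print(ram / 1_000_000, "MB")
```

The extra 1 in the formula accounts for the document ID stored alongside the attribute values.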



3.3. MVA (multi-valued attributes)
A list of values can be attached to a document as an attribute. Filtering and group-by work on MVA attributes, but sorting does not. MVAs are handy for things like article tags or product categories.
MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx. MVAs make it possible to attach lists of values to every document. They are useful for article tags, product categories, etc. Filtering and group-by (but not sorting) on MVA attributes is supported.

MVA entries can only be unsigned 32-bit integers, but the list length is unlimited as long as there is enough RAM; the MVA values stored in the .spm file are precached in RAM. I did not understand the rest.
Currently, MVA list entries are limited to unsigned 32-bit integers. The list length is not limited; you can have an arbitrary number of values attached to each document as long as RAM permits (.spm file that contains the MVA values will be precached in RAM by searchd). The source data can be taken either from a separate query, or from a document field; see source type in sql_attr_multi. In the first case the query will have to return pairs of document ID and MVA values, in the second one the field will be parsed for integer values. There are absolutely no requirements as to incoming data order; the values will be automatically grouped by document ID (and internally sorted within the same ID) during indexing anyway.
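
The two MVA sources described above can be modeled roughly as follows. This is a sketch of the described behavior, not Sphinx's indexing code: the helper names are made up, and how the real indexer tokenizes a field or treats repeated values is not stated here, so the digit-run parsing and the decision to keep duplicates are assumptions.

```python
import re
from collections import defaultdict

MAX_UINT32 = 0xFFFFFFFF


def check_uint32(value):
    """MVA entries are limited to unsigned 32-bit integers."""
    if not 0 <= value <= MAX_UINT32:
        raise ValueError(f"{value} does not fit in an unsigned 32-bit integer")
    return value


def mva_from_query_rows(pairs):
    """Separate-query source: (document ID, value) pairs in any order."""
    grouped = defaultdict(list)
    for doc_id, value in pairs:
        grouped[doc_id].append(check_uint32(value))
    # Values are grouped by document ID and sorted within each ID.
    return {doc_id: sorted(values) for doc_id, values in grouped.items()}


def mva_from_field(text):
    """Field source: pull the integer values out of a text field (assumed parsing)."""
    return sorted(check_uint32(int(token)) for token in re.findall(r"\d+", text))


print(mva_from_query_rows([(2, 11), (1, 7), (2, 5), (1, 3)]))
print(mva_from_field("11, 5,7"))
```

Either way the result is the same shape: one sorted list of integers per document, regardless of the order in which the data arrived.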

When filtering, a document matches if any value of its MVA attribute satisfies the filter condition. When grouping by an MVA attribute, the result is grouped by each individual value in the MVA. For example, if a document has a tag MVA with values 5, 7, and 11, grouping by tag produces three groups whose group-by keys are 5, 7, and 11.
Note that grouping by an MVA can lead to duplicates in the result set, because each document can fall into several groups.
When filtering, a document will match the filter on MVA attribute if any of the values satisfy the filtering condition. (Therefore, documents that pass through exclude filters will not contain any of the forbidden values.) When grouping by MVA attribute, a document will contribute to as many groups as there are different MVA values associated with that document. For instance, if the collection contains exactly 1 document having a 'tag' MVA with values 5, 7, and 11, grouping on 'tag' will produce 3 groups with '@count' equal to 1 and '@groupby' key values of 5, 7, and 11 respectively. Also note that grouping by MVA might lead to duplicate documents in the result set: because each document can participate in many groups, it can be chosen as the best one in more than one group, leading to duplicate IDs. The PHP API historically uses an ordered hash on the document ID for the resulting rows, so you'll also need to use SetArrayResult() in order to employ group-by on MVA with the PHP API.
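
The filtering and grouping rules for MVAs can be reproduced with a small pure-Python model. It only mirrors the semantics described above and does not call the Sphinx API; the helper names are made up, and the single document with tag values 5, 7, and 11 is the example from the text.

```python
from collections import defaultdict


def matches_filter(mva_values, wanted, exclude=False):
    """Include filter: any value matches. Exclude filter: no value may match."""
    hit = any(value in wanted for value in mva_values)
    return not hit if exclude else hit


def group_by_mva(docs, attr):
    """Each document joins one group per distinct value of its MVA attribute."""
    groups = defaultdict(list)
    for doc in docs:
        for value in set(doc[attr]):
            groups[value].append(doc["id"])
    return dict(groups)


docs = [{"id": 1, "tag": [5, 7, 11]}]
groups = group_by_mva(docs, "tag")
counts = {key: len(ids) for key, ids in groups.items()}

print(matches_filter(docs[0]["tag"], {7}))                # include filter on tag 7
print(matches_filter(docs[0]["tag"], {7}, exclude=True))  # exclude filter on tag 7
print(counts)
```

Document 1 appears in all three groups, which is the source of the duplicate IDs mentioned above: a result keyed by document ID alone would keep only one of those rows.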
