A comparison of five commonly used MySQL slow log analysis tools
Three ways to optimize MySQL database queries
Optimizing Queries with EXPLAIN
type
The join type. The different join types are listed here, ordered from the best type to the worst:
*system
The table has only one row (= system table). This is a special case of the const join type.
*const
Used when searching directly on a PRIMARY KEY or UNIQUE index.
The table has at most one matching row, which is read at the start of the query. Because there is only one row, values from the column in this row can be regarded as constants by the rest of the optimizer. const tables are very fast because they are read only once.
const is used when you compare all parts of a PRIMARY KEY or UNIQUE index to constant values. In the following queries, tbl_name can be used as a const table:
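For instance (a sketch in the manual's spirit; tbl_name and the primary_key columns are placeholders):
SELECT * FROM tbl_name WHERE primary_key = 1;
SELECT * FROM tbl_name
  WHERE primary_key_part1 = 1 AND primary_key_part2 = 2;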
*eq_ref
Used when the join key is a PRIMARY KEY or UNIQUE index.
One row is read from this table for each combination of rows from the previous tables. Other than the system and const types, this is the best possible join type. It is used when all parts of an index are used by the join and the index is a PRIMARY KEY or UNIQUE index.
eq_ref can be used for indexed columns that are compared using the = operator. The comparison value can be a constant or an expression that uses columns from tables that are read before this table. In the following examples, MySQL can use an eq_ref join to process ref_table:
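A sketch of such queries (ref_table, other_table and the key/column names are placeholders):
SELECT * FROM ref_table, other_table
  WHERE ref_table.key_column = other_table.column;
SELECT * FROM ref_table, other_table
  WHERE ref_table.key_column_part1 = other_table.column
    AND ref_table.key_column_part2 = 1;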
*ref
Used when the join key is not a PRIMARY KEY or UNIQUE index.
All rows with matching index values are read from this table for each combination of rows from the previous tables. ref is used if the join uses only a leftmost prefix of the key or if the key is not a PRIMARY KEY or UNIQUE index (in other words, if the join cannot select a single row based on the key value). If the key that is used matches only a few rows, this is a good join type.
ref can be used for indexed columns that are compared using the = or <=> operator. In the following examples, MySQL can use a ref join to process ref_table:
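A sketch (placeholder names again):
SELECT * FROM ref_table WHERE key_column = expr;
SELECT * FROM ref_table, other_table
  WHERE ref_table.key_column = other_table.column;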
*fulltext
The join is performed using a FULLTEXT index.
*ref_or_null
This join type is like ref, but with the addition that MySQL does an extra search for rows that contain NULL values. This join type optimization is used most often in resolving subqueries. In the following examples, MySQL can use a ref_or_null join to process ref_table:
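A sketch (placeholder names):
SELECT * FROM ref_table
  WHERE key_column = expr OR key_column IS NULL;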
See Section 7.2.8, “IS NULL Optimization”.
*index_merge
This join type indicates that the Index Merge optimization is used. In this case, the key column in the output row contains a list of indexes used, and key_len contains a list of the longest key parts for the indexes used. For more information, see Section 7.2.6, “Index Merge Optimization”.
*unique_subquery
This type replaces ref for some IN subqueries of the following form:
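Roughly of this shape (placeholder names):
value IN (SELECT primary_key FROM single_table WHERE some_expr)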
unique_subquery is just an index lookup function that replaces the subquery completely for better efficiency.
*index_subquery
This join type is similar to unique_subquery. It replaces IN subqueries, but it works for non-unique indexes in subqueries of the following form:
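Roughly of this shape (key_column here is a non-unique index):
value IN (SELECT key_column FROM single_table WHERE some_expr)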
*range
Only rows that are in a given range are retrieved, using an index to select the rows. The key column in the output row indicates which index is used. The key_len contains the longest key part that was used. The ref column is NULL for this type.
range can be used when a key column is compared to a constant using any of the =, <>, >, >=, <, <=, IS NULL, <=>, BETWEEN, or IN() operators:
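For example (placeholder names):
SELECT * FROM tbl_name WHERE key_column = 10;
SELECT * FROM tbl_name WHERE key_column BETWEEN 10 AND 20;
SELECT * FROM tbl_name WHERE key_column IN (10, 20, 30);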
*index
This join type is the same as ALL, except that only the index tree is scanned. This usually is faster than ALL because the index file usually is smaller than the data file.
MySQL can use this join type when the query uses only columns that are part of a single index.
*ALL
A full table scan is done for each combination of rows from the previous tables. This is normally not good if the table is the first table not marked const, and usually very bad in all other cases. Normally, you can avoid ALL by adding indexes that allow row retrieval from the table based on constant values or column values from earlier tables.
possible_keys
The possible_keys column indicates which indexes MySQL can choose from to find the rows in this table.
If this column is NULL, there are no relevant indexes. In this case, you may be able to improve the performance of your query by examining the WHERE clause to check whether it refers to some column or columns that would be suitable for indexing.
key
The key column indicates the key (index) that MySQL actually decided to use. If MySQL decides to use one of the possible_keys indexes to look up rows, that index is listed as the key value.
It is possible that key will name an index that is not present in the possible_keys value. This can happen if none of the possible_keys indexes are suitable for looking up rows, but all the columns selected by the query are columns of some other index. That is, the named index covers the selected columns, so although it is not used to determine which rows to retrieve, an index scan is more efficient than a data row scan.
key_len
The key_len column indicates the length of the key that MySQL decided to use. The length is NULL if the key column says NULL. Note that the value of key_len enables you to determine how many parts of a multiple-part key MySQL actually uses.
ref
The ref column shows which columns or constants are compared to the index named in the key column to select rows from the table.
rows
The smaller, the better.
The rows column indicates the number of rows MySQL believes it must examine to execute the query.
Extra
This column contains additional information about how MySQL resolves the query.
Tuesday, March 31, 2009
query-reviewer, slow query
Optimizing Rails database query performance with rails_reviewer
Query indexing with rails_reviewer plugin
query-reviewer An advanced mysql query analyzer for rails
An amazing tool: it analyzes the queries on each page for you, so there is no need to hunt for slow queries by hand. GOOD
Sphinx memo(3)
9.2. Index configuration options
9.2.1 type
The default for type is empty; the known values are the empty string and 'distributed'.
Index type. Optional, default is empty (index is plain local index). Known values are empty string or 'distributed'.
Sphinx supports two kinds of indexes: local, stored and processed on the local machine, and distributed, which involves not only local searching but also querying remote searchd instances.
Sphinx supports two different types of indexes: local, that are stored and processed on the local machine; and distributed, that involve not only local searching but querying remote searchd instances over the network as well. Index type settings lets you choose this type. By default, indexes are local. Specifying 'distributed' for type enables distributed searching, see Section 4.7, “Distributed searching”.
Example:
type = distributed
9.2.2. source
Adds document source to local index. Multi-value, mandatory.
Specifies where documents come from when the current index is indexed; sources of any type can be mixed, and there must be at least one source.
Specifies document source to get documents from when the current index is indexed. There must be at least one source. There may be multiple sources, without any restrictions on the source types: ie. you can pull part of the data from MySQL server, part from PostgreSQL, part from the filesystem using xmlpipe2 wrapper.
Document IDs must be globally unique across all sources; a "document" here essentially corresponds to a table row.
However, there are some restrictions on the source data. First, document IDs must be globally unique across all sources. If that condition is not met, you might get unexpected search results. Second, source schemas must be the same in order to be stored within the same index.
The source ID is not stored automatically.
No source ID is stored automatically. Therefore, in order to be able to tell what source the matched document came from, you will need to store some additional information yourself. Two typical approaches include:
1. mangling document ID and encoding source ID in it:
2. storing source ID simply as an attribute:
Example:
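A minimal sketch of approach 2 (storing the source ID as an attribute); the source name, connection settings, table and column names are assumptions, not from the manual:
source src_forum
{
    type          = mysql
    sql_host      = localhost
    sql_user      = sphinx
    sql_pass      = secret
    sql_db        = forum
    # approach 2: hard-code this source's ID into every row and expose it as an attribute
    sql_query     = SELECT id, 1 AS source_id, title, content FROM posts
    sql_attr_uint = source_id
}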
9.2.3. path
For reference, different index files store the following data:
* .spa stores document attributes (used in extern docinfo storage mode only);
* .spd stores matching document ID lists for each word ID;
* .sph stores index header information;
* .spi stores word lists (word IDs and pointers to .spd file);
* .spm stores MVA data;
* .spp stores hit (aka posting, aka word occurrence) lists for each word ID.
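Example (a hypothetical path prefix; the files listed above are created with this prefix plus the extensions):
path = /var/data/myindex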
9.2.4. docinfo
Document attribute values (docinfo) storage mode. Optional, default is 'extern'. Known values are 'none', 'extern' and 'inline'.
'none' means there are no attributes at all; 'inline' means they are stored in the .spd file together with the document ID lists; 'extern' means they are stored separately, in the .spa file.
Docinfo storage mode defines how exactly docinfo will be physically stored on disk and RAM. "none" means that there will be no docinfo at all (ie. no attributes). Normally you need not to set "none" explicitly because Sphinx will automatically select "none" when there are no attributes configured. "inline" means that the docinfo will be stored in the .spd file, along with the document ID lists. "extern" means that the docinfo will be stored separately (externally) from document ID lists, in a special .spa file.
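For example, to switch to inline storage (whether you want this depends on the RAM vs. index-size trade-off described above):
docinfo = inline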
9.2.20. ngram_len
N-gram lengths for N-gram indexing. Optional, default is 0 (disable n-gram indexing). Known values are 0 and 1 (other lengths to be implemented).
Provides basic word segmentation (for CJK text).
N-grams provide basic CJK (Chinese, Japanese, Korean) support for unsegmented texts. The issue with CJK searching is that there could be no clear separators between the words. Ideally, the texts would be filtered through a special program called segmenter that would insert separators in proper locations. However, segmenters are slow and error prone, and it's common to index contiguous groups of N characters, or n-grams, instead.
When this feature is enabled, streams of CJK characters are indexed as N-grams. For example, if incoming text is "ABCDEF" (where A to F represent some CJK characters) and length is 1, it will be indexed as if it was "A B C D E F". (With length equal to 2, it would produce "AB BC CD DE EF"; but only 1 is supported at the moment.) Only those characters that are listed in ngram_chars table will be split this way; other ones will not be affected.
Note that if the search query is segmented, ie. there are separators between individual words, then wrapping the words in quotes and using extended mode will result in proper matches being found even if the text was not segmented. For instance, assume that the original query is BC DEF. After wrapping in quotes on the application side, it should look like "BC" "DEF" (with quotes). This query will be passed to Sphinx and internally split into 1-grams too, resulting in "B C" "D E F" query, still with quotes that are the phrase matching operator. And it will match the text even though there were no separators in the text.
Even if the search query is not segmented, Sphinx should still produce good results, thanks to phrase based ranking: it will pull closer phrase matches (which in case of N-gram CJK words can mean closer multi-character word matches) to the top.
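A minimal example, enabling 1-gram indexing (the only supported length at the moment):
ngram_len = 1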
9.2.21. ngram_chars
Only characters listed in ngram_chars will be split this way.
N-gram characters list. Optional, default is empty.
To be used in conjunction with ngram_len, this list defines characters, sequences of which are subject to N-gram extraction. Words comprised of other characters will not be affected by the N-gram indexing feature. The value format is identical to charset_table.
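Example (the exact codepoint range is an assumption; list whatever CJK ranges you need, in the same format as charset_table):
ngram_chars = U+3000..U+2FA1F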
9.2.11. min_word_len
Minimum indexed word length. Optional, default is 1 (index everything).
Only those words that are not shorter than this minimum will be indexed. For instance, if min_word_len is 4, then 'the' won't be indexed, but 'they' will be.
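Example matching the case above ('the' skipped, 'they' indexed):
min_word_len = 4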
9.2.25. html_index_attrs
A list of markup attributes to index when stripping HTML. Optional, default is empty (do not index markup attributes).
Specifies HTML markup attributes whose contents should be retained and indexed even though other HTML markup is stripped. The format is per-tag enumeration of indexable attributes, as shown in the example below.
Example:
html_index_attrs = img=alt,title; a=title;
9.2.16. min_infix_len
Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows implementing wildcard searching with 'start*', '*end', and '*middle*' wildcards (refer to the enable_star option for details on wildcard syntax). When the minimum infix length is set to a positive number, indexer will index all the possible keyword infixes (ie. substrings) in addition to the keywords themselves. Too short infixes (below the minimum allowed length) will not be indexed.
For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such an index for "es" will match documents that contain the word "test", even if they do not contain "es" by itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.
There's no automatic way to rank perfect word matches higher in an infix index, but the same tricks as with prefix indexes can be applied.
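Example matching the "test" case above; enable_star is typically set alongside it so that the wildcard syntax is recognized (a sketch, not a complete config):
min_infix_len = 2
enable_star = 1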
9.2.24. html_strip
Whether to strip HTML markup from incoming full-text data. Optional, default is 0. Known values are 0 (disable stripping) and 1 (enable stripping).
Stripping does not work with xmlpipe source type (it's suggested to upgrade to xmlpipe2 anyway). It should work with properly formed HTML and XHTML, but, just as most browsers, may produce unexpected results on malformed input (such as HTML with stray <'s or unclosed >'s).
Only the tags themselves, and also HTML comments, are stripped. To strip the contents of the tags too (eg. to strip embedded scripts), see html_remove_elements option. There are no restrictions on tag names; ie. everything that looks like a valid tag start, or end, or a comment will be stripped.
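Example (turning stripping on):
html_strip = 1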
9.2.26. html_remove_elements
Specifies tags whose contents should be stripped along with the tags themselves.
A list of HTML elements for which to strip contents along with the elements themselves. Optional, default is empty string (do not strip contents of any elements).
This feature allows stripping element contents, ie. everything that is between the opening and the closing tags. It is useful to remove embedded scripts, CSS, etc. The short tag form for empty elements (such as <br />) is properly supported; ie. the text that follows such a tag will not be removed.
The value is a comma-separated list of element (tag) names whose contents should be removed. Tag names are case insensitive.
Example:
html_remove_elements = style, script
9.2.13. charset_table
The characters that the tokenizer accepts.
Accepted characters table, with case folding rules. Optional, default value depends on charset_type value.
charset_table is the main workhorse of the Sphinx tokenizing process, ie. the process of extracting keywords from document text or query text. It controls what characters are accepted as valid and what are not, and how the accepted characters should be transformed (eg. should the case be removed or not).
You can think of charset_table as of a big table that has a mapping for each and every of 100K+ characters in Unicode (or as of a small 256-character table if you're using SBCS). By default, every character maps to 0, which means that it does not occur within keywords and should be treated as a separator. Once mentioned in the table, character is mapped to some other character (most frequently, either to itself or to a lowercase letter), and is treated as a valid keyword part.
The expected value format is a comma-separated list of mappings. The two simplest mappings simply declare a character as valid, and map a single character to another single character, respectively. But specifying the whole table in such form would result in bloated and barely manageable specifications. So there are several syntax shortcuts that let you map ranges of characters at once. The complete list is as follows:
A->a
Single char mapping, declares source char 'A' as allowed to occur within keywords and maps it to destination char 'a' (but does not declare 'a' as allowed).
A..Z->a..z
Range mapping, declares all chars in source range as allowed and maps them to the destination range. Does not declare destination range as allowed. Also checks ranges' lengths (the lengths must be equal).
a
Stray char mapping, declares a character as allowed and maps it to itself. Equivalent to a->a single char mapping.
a..z
Stray range mapping, declares all characters in range as allowed and maps them to themselves. Equivalent to a..z->a..z range mapping.
A..Z/2
Checkerboard range map. Maps every pair of chars to the second char. More formally, declares odd characters in range as allowed and maps them to the even ones; also declares even characters as allowed and maps them to themselves. For instance, A..Z/2 is equivalent to A->B, B->B, C->D, D->D, ..., Y->Z, Z->Z. This mapping shortcut is helpful for a number of Unicode blocks where uppercase and lowercase letters go in such interleaved order instead of contiguous chunks.
Control characters with codes from 0 to 31 are always treated as separators. Characters with codes 32 to 127, ie. 7-bit ASCII characters, can be used in the mappings as is. To avoid configuration file encoding issues, 8-bit ASCII characters and Unicode characters must be specified in U+xxx form, where 'xxx' is hexadecimal codepoint number. This form can also be used for 7-bit ASCII characters to encode special ones: eg. use U+20 to encode space, U+2E to encode dot, U+2C to encode comma.
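A small sketch combining the shortcut forms above: digits, underscore and lowercase Latin letters accepted as-is, uppercase Latin folded to lowercase (a real config would add whatever other ranges it needs, in U+xxx form):
charset_table = 0..9, A..Z->a..z, _, a..z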
9.2.8. stopwords
Stopword files list (space separated). Optional, default is empty.
Stopwords are the words that will not be indexed. Typically you'd put most frequent words in the stopwords list because they do not add much value to search results but consume a lot of resources to process.
You can specify several file names, separated by spaces. All the files will be loaded. Stopwords file format is simple plain text. The encoding must match index encoding specified in charset_type. File data will be tokenized with respect to charset_table settings, so you can use the same separators as in the indexed data. The stemmers will also be applied when parsing stopwords file.
While stopwords are not indexed, they still do affect the keyword positions. For instance, assume that "the" is a stopword, that document 1 contains the line "in office", and that document 2 contains "in the office". Searching for "in office" as for exact phrase will only return the first document, as expected, even though "the" in the second one is stopped.
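Example (hypothetical file paths; multiple files are space separated):
stopwords = /usr/local/sphinx/data/stopwords.txt /usr/local/sphinx/data/stopwords-extra.txt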
Monday, March 30, 2009
Sphinx memo(2)
Restrictions on the source data
Every document ID, i.e. the table primary key, must be a unique integer, or things will blow up.
There are a few different restrictions imposed on the source data which is going to be indexed by Sphinx, of which the single most important one is:
ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).
If this requirement is not met, different bad things can happen. For instance, Sphinx can crash with an internal assertion while indexing; or produce strange results when searching due to conflicting IDs. Also, a 1000-pound gorilla might eventually come out of your display and start throwing barrels at you. You've been warned.
4.1. Matching modes
* SPH_MATCH_ALL, matches all query words (default mode);
* SPH_MATCH_ANY, matches any of the query words;
* SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;
* SPH_MATCH_BOOLEAN, matches query as a boolean expression (see Section 4.2, “Boolean query syntax”);
EXTENDED is an enhanced search mode with many extra, powerful query operators.
* SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see Section 4.3, “Extended query syntax”). As of 0.9.9, this has been superseded by SPH_MATCH_EXTENDED2, providing additional functionality and better performance. The ident is retained for legacy application code that will continue to be compatible once Sphinx and its components, including the API, are upgraded.
* SPH_MATCH_EXTENDED2, matches query using the second version of the Extended matching mode.
* SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below. NB, any query terms will be ignored, such that filters, filter-ranges and grouping will still be applied, but no text-matching.
FULLSCAN is activated automatically when the query string is empty and docinfo storage is set to extern.
The SPH_MATCH_FULLSCAN mode will be automatically activated in place of the specified matching mode when the following conditions are met:
1. The query string is empty (ie. its length is zero).
2. docinfo storage is set to extern.
In full scan mode every indexed document is considered a match; such queries can still filter, sort, and group, and they can offload the SQL server, e.g. finding all posts by a given user in a forum.
In full scan mode, all the indexed documents will be considered as matching. Such queries will still apply filters, sorting, and group by, but will not perform any full-text searching. This can be useful to unify full-text and non-full-text searching code, or to offload the SQL server (there are cases when Sphinx scans will perform better than analogous MySQL queries). An example of using the full scan mode might be to find posts in a forum. By selecting the forum's user ID via SetFilter() but not actually providing any search text, Sphinx will match every document (i.e. every post) where SetFilter() would match - in this case providing every post from that user. By default this will be ordered by relevancy, followed by Sphinx document ID in ascending order (earliest first).
4.4. Weighting
Which results rank first and which rank last comes down to weight; each search mode computes it differently.
Specific weighting function (currently) depends on the search mode.
There are these major parts which are used in the weighting functions:
1. phrase rank,
2. statistical rank.
Compares the longest, most complete phrase match.
Phrase rank is based on a length of longest common subsequence (LCS) of search words between document body and query phrase. So if there's a perfect phrase match in some document then its phrase rank would be the highest possible, and equal to query words count.
Based on how often the word occurs.
Statistical rank is based on classic BM25 function which only takes word frequencies into account. If the word is rare in the whole database (ie. low frequency over document collection) or mentioned a lot in specific document (ie. high frequency over matching document), it receives more weight. Final BM25 weight is a floating point number between 0 and 1.
We can also pre-weight individual fields, e.g. x2 or x3.
In all modes, per-field weighted phrase ranks are computed as a product of the LCS multiplied by a per-field weight specified by the user. Per-field weights are integer, default to 1, and can not be set lower than 1.
In BOOLEAN mode, every weight is 1.
In SPH_MATCH_BOOLEAN mode, no weighting is performed at all, every match weight is set to 1.
ALL and PHRASE sum the weighted phrase ranks.
In SPH_MATCH_ALL and SPH_MATCH_PHRASE modes, final weight is a sum of weighted phrase ranks.
In SPH_MATCH_ANY mode, the idea is essentially the same, but it also adds a count of matching words in each field. Before that, weighted phrase ranks are additionally multiplied by a value big enough to guarantee that a higher phrase rank in any field will make the match ranked higher, even if its field weight is low.
In SPH_MATCH_EXTENDED mode, final weight is a sum of weighted phrase ranks and BM25 weight, multiplied by 1000 and rounded to integer.
ALL and ANY will eventually be able to use BM25 as well, which is especially helpful for 1-word queries.
This is going to be changed, so that MATCH_ALL and MATCH_ANY modes use BM25 weights as well. This would improve search results in those match spans where phrase ranks are equal; this is especially useful for 1-word queries.
Key point: phrase-proximity ranking works better.
The key idea (in all modes, besides boolean) is that better subphrase matches are ranked higher, and perfect matches are pulled to the top. Author's experience is that this phrase proximity based ranking provides noticeably better search quality than any statistical scheme alone (such as BM25, which is commonly used in other search engines).
4.5. Sorting modes
There are the following result sorting modes available:
Best matches first.
* SPH_SORT_RELEVANCE mode, that sorts by relevance in descending order (best matches first);
Sorted by some attribute value.
* SPH_SORT_ATTR_DESC mode, that sorts by an attribute in descending order (bigger attribute values first);
* SPH_SORT_ATTR_ASC mode, that sorts by an attribute in ascending order (smaller attribute values first);
Sorted by time.
* SPH_SORT_TIME_SEGMENTS mode, that sorts by time segments (last hour/day/week/month) in descending order, and then by relevance in descending order;
* SPH_SORT_EXTENDED mode, that sorts by SQL-like combination of columns in ASC/DESC order;
* SPH_SORT_EXPR mode, that sorts by an arithmetic expression.
RELEVANCE ignores any additional parameters; the other modes sort by the specified clause first, then by weight, then by id.
SPH_SORT_RELEVANCE ignores any additional parameters and always sorts matches by relevance rank. All other modes require an additional sorting clause, with the syntax depending on specific mode. SPH_SORT_ATTR_ASC, SPH_SORT_ATTR_DESC and SPH_SORT_TIME_SEGMENTS modes require simply an attribute name. SPH_SORT_RELEVANCE is equivalent to sorting by "@weight DESC, @id ASC" in extended sorting mode, SPH_SORT_ATTR_ASC is equivalent to "attribute ASC, @weight DESC, @id ASC", and SPH_SORT_ATTR_DESC to "attribute DESC, @weight DESC, @id ASC" respectively.
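For instance, an SPH_SORT_EXTENDED clause might look like this (post_date is a hypothetical attribute name):
@weight DESC, post_date DESC, @id ASC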
SPH_SORT_TIME_SEGMENTS mode
Sorted by time, which suits blogs well: time segment first, then weight.
In SPH_SORT_TIME_SEGMENTS mode, attribute values are split into so-called time segments, and then sorted by time segment first, and by relevance second.
The segments are calculated according to the current timestamp at the time when the search is performed, so the results would change over time. The segments are as follows:
* last hour,
* last day,
* last week,
* last month,
* last 3 months,
* everything else.
These segments are hardcoded, but it is trivial to change them if necessary.
This mode was added to support searching through blogs, news headlines, etc. When using time segments, recent records would be ranked higher because of segment, but within the same segment, more relevant records would be ranked higher - unlike sorting by just the timestamp attribute, which would not take relevance into account at all.
There are settings for other modes as well, but I'm being lazy and skipping them XD
4.8. searchd query log format
Match mode can take one of the following values:
* "all" for SPH_MATCH_ALL mode;
* "any" for SPH_MATCH_ANY mode;
* "phr" for SPH_MATCH_PHRASE mode;
* "bool" for SPH_MATCH_BOOLEAN mode;
* "ext" for SPH_MATCH_EXTENDED mode;
* "ext2" for SPH_MATCH_EXTENDED2 mode;
* "scan" if the full scan mode was used, either by being specified with SPH_MATCH_FULLSCAN, or if the query was empty (as documented under Matching Modes)
Sort mode can take one of the following values:
* "rel" for SPH_SORT_RELEVANCE mode;
* "attr-" for SPH_SORT_ATTR_DESC mode;
* "attr+" for SPH_SORT_ATTR_ASC mode;
* "tsegs" for SPH_SORT_TIME_SEGMENTS mode;
* "ext" for SPH_SORT_EXTENDED mode.
Sphinx memo(1)
Sphinx 0.9.9 reference manual
3.1 Indexing
3.1 Data source
The data can come from a database, plain-text files, HTML files, mailboxes, and so on; Sphinx turns these different kinds of data into its own structured documents. In SQL terms, each row corresponds to a document and each column to a field.
The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx point of view, the data it indexes is a set of structured documents, each of which has the same set of fields. This is biased towards SQL, where each row corresponds to a document, and each column to a field.
Different data sources require different code to fetch and index the data; this code is called a data source driver.
Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing. This code is called data source driver (or simply driver or data source for brevity)
There are currently drivers for MySQL and PostgreSQL that run queries and fetch the data, and there is also a driver called xmlpipe that runs its own command and reads data from stdout.
At the time of this writing, there are drivers for MySQL and PostgreSQL databases, which can connect to the database using its native C/C++ API, run queries and fetch the data. There's also a driver called xmlpipe, which runs a specified command and reads the data from its stdout. See Section 3.8, “xmlpipe data source” section for the format description
3.2 attributes
Attributes exist for filtering and sorting at search time.
Attributes are additional values associated with each document that can be used to perform additional filtering and sorting during search.
Because we often need to sort on certain columns, or filter on other columns, for instance restricting a blog search to certain users, Sphinx lets us attach attributes to each document.
It is often desired to additionally process full-text search results based not only on matching document ID and its rank, but on a number of other per-document values as well. For instance, one might need to sort news search results by date and then relevance, or search through products within specified price range, or limit blog search to posts made by selected users, or group results by month. To do that efficiently, Sphinx allows to attach a number of additional attributes to each document, and store their values in the full-text index. It's then possible to use stored values to filter, sort, or group full-text matches.
Unlike fields, attributes are not full-text indexed; they are stored in the index, but you cannot full-text search them.
Attributes, unlike the fields, are not full-text indexed. They are stored in the index, but it is not possible to search them as full-text, and attempting to do so results in an error.
Attributes can be used for filtering, though, to restrict returned rows, as well as sorting or result grouping; it is entirely possible to sort results purely based on attributes, and ignore the search relevance tools. Additionally, attributes are returned from the search daemon, while the indexed text is not.
A good example of using attributes is forum posts: suppose only the title and content need to be searchable, but sometimes the search has to be limited to a certain author or a certain sub-forum, or the results have to be sorted by post date. That is when attributes are needed.
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
This is achieved by declaring all the mentioned columns (author, date, and so on) as attributes and indexing them.
This can be achieved by specifying all the mentioned columns (excluding title and content, that are full-text fields) as attributes, indexing them, and then using API calls to set up filtering, sorting, and grouping. Here is an example:
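A hedged sphinx.conf sketch of that forum-posts setup (connection settings, table and column names are assumptions, not from the manual):
source forum_posts
{
    type               = mysql
    sql_host           = localhost
    sql_user           = sphinx
    sql_pass           = secret
    sql_db             = forum
    # title and content become full-text fields; everything else becomes an attribute
    sql_query          = SELECT id, title, content, author_id, forum_id, UNIX_TIMESTAMP(post_date) AS post_date FROM posts
    sql_attr_uint      = author_id
    sql_attr_uint      = forum_id
    sql_attr_timestamp = post_date
}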
Attribute names are case-insensitive.
Attributes are named. Attribute names are case insensitive. Attributes are not full-text indexed; they are stored in the index as is. Currently supported attribute types are:
The currently supported attribute types are:
* unsigned integers (1-bit to 32-bit wide);
* UNIX timestamps;
* floating point values (32-bit, IEEE 754 single precision);
* string ordinals (specially computed integers); possibly strings like "1st", "2nd", "3rd", "4th"
* MVA, multi-value attributes (variable-length lists of 32-bit unsigned integers).
The complete set of per-document attribute values is sometimes referred to as docinfo. Docinfos can either be
* stored separately from the main full-text index data ("extern" storage, in .spa file), or
* attached to each occurrence of document ID in full-text index data ("inline" storage, in .spd file).
With extern storage, the .spa file keeps all the documents' attribute values in RAM, because that is faster. Inline storage, by contrast, needs no extra RAM, but the index size grows a lot, because every time a document ID appears the attributes are copied along with it. Inline storage is probably only worth using when there are very few attributes and RAM is limited; in most cases extern storage is more efficient for both indexing and searching.
When using extern storage, a copy of .spa file (with all the attribute values for all the documents) is kept in RAM by searchd at all times. This is for performance reasons; random disk I/O would be too slow. On the contrary, inline storage does not require any additional RAM at all, but that comes at the cost of greatly inflating the index size: remember that it copies all attribute value every time when the document ID is mentioned, and that is exactly as many times as there are different keywords in the document. Inline may be the only viable option if you have only a few attributes and need to work with big datasets in limited RAM. However, in most cases extern storage makes both indexing and searching much more efficient.
How to calculate the memory needed for searching: the figure below is per daemon, not per query; the daemon allocates that much memory at startup and keeps it ready.
Search-time memory requirements for extern storage are (1+number_of_attrs)*number_of_docs*4 bytes, ie. 10 million docs with 2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM. This is PER DAEMON, not per query. searchd will allocate 160 MB on startup, read the data and keep it shared between queries. The children will NOT allocate any additional copies of this data.
3.3. MVA (multi-valued attributes)
A list of values can be attached to a document as an attribute; filtering and group-by work on MVA attributes, but sorting on them does not. MVAs are very useful for things like article tags or product categories.
MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx. MVAs make it possible to attach lists of values to every document. They are useful for article tags, product categories, etc. Filtering and group-by (but not sorting) on MVA attributes is supported.
MVAs can only consist of unsigned 32-bit integers, but the list length is unlimited as long as RAM permits; the MVA values stored in the .spm file are precached in RAM. (I don't understand the rest = =+)
Currently, MVA list entries are limited to unsigned 32-bit integers. The list length is not limited, you can have an arbitrary number of values attached to each document as long as RAM permits (.spm file that contains the MVA values will be precached in RAM by searchd). The source data can be taken either from a separate query, or from a document field; see source type in sql_attr_multi. In the first case the query will have to return pairs of document ID and MVA values, in the second one the field will be parsed for integer values. There are absolutely no requirements as to incoming data order; the values will be automatically grouped by document ID (and internally sorted within the same ID) during indexing anyway.
When filtering, a document matches if any of the values of its MVA attribute satisfies the filter condition. When grouping by an MVA attribute, the result groups are based on the individual values in the MVA: for example, if a document has a 'tag' MVA with values 5, 7, and 11, grouping by tag produces three groups whose groupby keys are 5, 7, and 11.
Note that grouping by an MVA can lead to duplicate documents in the result set, because each document can carry several tag values and thus participate in several groups.
When filtering, a document will match the filter on MVA attribute if any of the values satisfy the filtering condition. (Therefore, documents that pass through exclude filters will not contain any of the forbidden values.) When grouping by MVA attribute, a document will contribute to as many groups as there are different MVA values associated with that document. For instance, if the collection contains exactly 1 document having a 'tag' MVA with values 5, 7, and 11, grouping on 'tag' will produce 3 groups with '@count' equal to 1 and '@groupby' key values of 5, 7, and 11 respectively. Also note that grouping by MVA might lead to duplicate documents in the result set: because each document can participate in many groups, it can be chosen as the best one in more than one group, leading to duplicate IDs. PHP API historically uses ordered hash on the document ID for the resulting rows; so you'll also need to use SetArrayResult() in order to employ group-by on MVA with PHP API.
3.1 Indexing
3.1 Data source
資料的來源會來自db, 文字檔, html file, 信等等, 對sphinx 而言, 會把這樣不同的data 建立成它自己的structured document, 以SQL的說法而言就是每一列各對應一個document, 每一個column各對應一個field
The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on. From Sphinx point of view, the data it indexes is a set of structured documents, each of which has the same set of fields. This is biased towards SQL, where each row correspond to a document, and each column to a field.
資料不同建index的方式也不同, 這稱為data source driver
Depending on what source Sphinx should get the data from, different code is required to fetch the data and prepare it for indexing. This code is called data source driver (or simply driver or data source for brevity)
目前有Mysql 和 PostgreSQl 的 driver 可以用來跑 query 和拿data 了, 也有一種driver 叫做 xmlpiple 跑他自有的指令和從stdout 拿data
At the time of this writing, there are drivers for MySQL and PostgreSQL databases, which can connect to the database using its native C/C++ API, run queries and fetch the data. There's also a driver called xmlpipe, which runs a specified command and reads the data from its stdout. See Section 3.8, “xmlpipe data source” section for the format description
3.2. Attributes
Attributes exist to be used for filtering and sorting.
Attributes are additional values associated with each document that can be used to perform additional filtering and sorting during search.
Because we often need to sort on certain columns, or filter based on other tables (for example, restricting a blog search to certain users), Sphinx lets us attach extra attributes to each document.
It is often desired to additionally process full-text search results based not only on matching document ID and its rank, but on a number of other per-document values as well. For instance, one might need to sort news search results by date and then relevance, or search through products within specified price range, or limit blog search to posts made by selected users, or group results by month. To do that efficiently, Sphinx allows to attach a number of additional attributes to each document, and store their values in the full-text index. It's then possible to use stored values to filter, sort, or group full-text matches.
Unlike fields, attributes are not full-text indexed; they are only stored in the index, and you cannot search them as text.
Attributes, unlike the fields, are not full-text indexed. They are stored in the index, but it is not possible to search them as full-text, and attempting to do so results in an error.
Attributes can be used for filtering, though, to restrict returned rows, as well as sorting or result grouping; it is entirely possible to sort results purely based on attributes, and ignore the search relevance tools. Additionally, attributes are returned from the search daemon, while the indexed text is not
A good example of attributes is forum posts. Suppose only the title and content need to be searchable, but sometimes you want to restrict results to a particular author or category, or sort by posting time; that is exactly what attributes are for.
A good example for attributes would be a forum posts table. Assume that only title and content fields need to be full-text searchable - but that sometimes it is also required to limit search to a certain author or a sub-forum (ie. search only those rows that have some specific values of author_id or forum_id columns in the SQL table); or to sort matches by post_date column; or to group matching posts by month of the post_date and calculate per-group match counts.
This is achieved by declaring all the columns mentioned above (author, date, and so on) as attributes and indexing them.
This can be achieved by specifying all the mentioned columns (excluding title and content, which are full-text fields) as attributes, indexing them, and then using API calls to set up filtering, sorting, and grouping.
Attribute names are case-insensitive.
Attributes are named. Attribute names are case insensitive. Attributes are not full-text indexed; they are stored in the index as is. Currently supported attribute types are:
* unsigned integers (1-bit to 32-bit wide);
* UNIX timestamps;
* floating point values (32-bit, IEEE 754 single precision);
* string ordinals (specially computed integers); these are integers computed from the strings so that sorting by the ordinal gives the same order as sorting by the original strings
* MVA, multi-value attributes (variable-length lists of 32-bit unsigned integers).
The complete set of per-document attribute values is sometimes referred to as docinfo. Docinfos can either be
* stored separately from the main full-text index data ("extern" storage, in .spa file), or
* attached to each occurrence of document ID in full-text index data ("inline" storage, in .spd file).
With extern storage, the .spa file, which holds every document's attribute values, is kept in RAM because it is faster that way. Inline storage, on the other hand, uses no extra RAM, but the index gets much bigger because the attributes are copied every time the document ID appears. Inline storage only really makes sense when you have very few attributes and limited RAM; in most cases extern storage is more efficient for both indexing and searching.
When using extern storage, a copy of the .spa file (with all the attribute values for all the documents) is kept in RAM by searchd at all times. This is for performance reasons; random disk I/O would be too slow. On the contrary, inline storage does not require any additional RAM at all, but that comes at the cost of greatly inflating the index size: remember that it copies all attribute values every time the document ID is mentioned, and that is exactly as many times as there are different keywords in the document. Inline may be the only viable option if you have only a few attributes and need to work with big datasets in limited RAM. However, in most cases extern storage makes both indexing and searching much more efficient.
Calculating the memory needed at search time: this is per daemon, not per query; the daemon allocates this much once at startup.
Search-time memory requirements for extern storage are (1+number_of_attrs)*number_of_docs*4 bytes, ie. 10 million docs with 2 groups and 1 timestamp will take (1+2+1)*10M*4 = 160 MB of RAM. This is PER DAEMON, not per query. searchd will allocate 160 MB on startup, read the data and keep it shared between queries. The children will NOT allocate any additional copies of this data.
3.3. MVA (multi-valued attributes)
A list of values can be attached to a document as an attribute. Filtering and group-by work on MVA attributes, but sorting on them does not. MVAs are handy for things like article tags or product categories.
MVAs, or multi-valued attributes, are an important special type of per-document attributes in Sphinx. MVAs make it possible to attach lists of values to every document. They are useful for article tags, product categories, etc. Filtering and group-by (but not sorting) on MVA attributes is supported.
MVA values can only be unsigned 32-bit integers, but the list length is unlimited as long as RAM permits; the MVA values stored in the .spm file are precached in RAM. (I don't really understand the rest.)
Currently, MVA list entries are limited to unsigned 32-bit integers. The list length is not limited, you can have an arbitrary number of values attached to each document as long as RAM permits (.spm file that contains the MVA values will be precached in RAM by searchd). The source data can be taken either from a separate query, or from a document field; see source type in sql_attr_multi. In the first case the query will have to return pairs of document ID and MVA values, in the second one the field will be parsed for integer values. There are absolutely no requirements as to incoming data order; the values will be automatically grouped by document ID (and internally sorted within the same ID) during indexing anyway.
When filtering, a document matches if any value of its MVA attribute satisfies the filter condition. When grouping on an MVA attribute, the result is grouped by each individual value in the MVA: for example, a document whose 'tag' MVA holds 5, 7, 11 contributes three groups, whose groupby keys are 5, 7 and 11 respectively.
Note that grouping on an MVA can lead to duplicate documents in the result set, because a document can take part in several groups and be picked as the best match in more than one of them.
When filtering, a document will match the filter on MVA attribute if any of the values satisfy the filtering condition. (Therefore, documents that pass through exclude filters will not contain any of the forbidden values.) When grouping by MVA attribute, a document will contribute to as many groups as there are different MVA values associated with that document. For instance, if the collection contains exactly 1 document having a 'tag' MVA with values 5, 7, and 11, grouping on 'tag' will produce 3 groups with '@count' equal to 1 and '@groupby' key values of 5, 7, and 11 respectively. Also note that grouping by MVA might lead to duplicate documents in the result set: because each document can participate in many groups, it can be chosen as the best one in more than one group, leading to duplicate IDs. PHP API historically uses ordered hash on the document ID for the resulting rows; so you'll also need to use SetArrayResult() in order to employ group-by on MVA with PHP API.
2009年3月29日 星期日
ultrasphinx memo(2)
File: DEPLOYMENT_NOTES
The configuration files
Don't edit the generated .conf file by hand; edit the corresponding .base file instead.
Please note that the generated .conf file in config/ultrasphinx should not be modified by hand. This is the configuration for Sphinx itself, and includes all the generated SQL. It is never the same as the .base file.
You will want to keep your generated production.conf in your repository. You can get a production.conf by running:
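The command itself did not survive the paste; presumably it is the configure task run for the production environment, something like:
RAILS_ENV=production rake ultrasphinx:configure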
It is important that every checkout of the app has a copy of production.conf. It is not enough to only have it on the server where the search daemon is running—Rails uses it too.
Indexing and monitoring
It's easy to keep the search daemon and the indexer running in a production environment. Cronjobs are the best way:
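(The actual crontab lines were lost in the paste; a sketch of what they would look like, assuming the app lives at /var/www/myapp, a made-up path, and using the Ultrasphinx rake tasks mentioned below:)
*/10 * * * * cd /var/www/myapp && RAILS_ENV=production rake ultrasphinx:index:delta
0 4 * * *    cd /var/www/myapp && RAILS_ENV=production rake ultrasphinx:index:main
*/3 * * * *  cd /var/www/myapp && RAILS_ENV=production rake ultrasphinx:daemon:start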
The first line reindexes the delta index every 10 minutes. The second line reindexes the main index once a day at 4am. The third line will try to restart the search daemon every three minutes. If it‘s already running, nothing happens.
Note that you can use ultrasphinx:index:merge in place of ultrasphinx:index:main. It will be faster, but will not remove deleted records from the index.
Of course if you don't have any models with deltas, don't include the ultrasphinx:index:delta task.
Gotchas
If you change a model's index settings, rerun configure and restart the daemon.
If you change the field configuration or model set, you will need to rerun rake ultrasphinx:configure to update the .conf file. Make sure to completely stop and restart the search daemon when you deploy a changed .conf. It will not reload it automatically.
ultrasphinx memo(1)
Ultrasphinx File: README
Class: ActiveRecord :: Base
Class: Ultrasphinx :: Search
Delta indexing
Delta indexing means you don't have to rebuild the whole index from scratch every time.
First, in your .base file, set the indexer option delta to your maximum interval between full reindexes. A day or a week is good, depending. Add a little bit to account for the time it takes the actual index to run:
Now, configure your models for delta indexing in the is_indexed call:
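(The is_indexed example was lost in the paste; a minimal sketch, assuming an Article model with title and body columns. The :delta option name is from memory, so treat it as an assumption:)
class Article < ActiveRecord::Base
  # mark this model for the delta index as well as the main one
  is_indexed :fields => ['title', 'body'], :delta => true
end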
Now you can run rake ultrasphinx:index:delta frequently, and only records that were changed within 1 day will be reindexed. You will need to run rake ultrasphinx:index:main once a day to move the delta contents into the main index.
Association scoping
A common use case is to only search records that belong to a particular parent model. Ultrasphinx configures Sphinx to support a :filters element on any date or numeric field, so any *_id fields you have will be filterable.
For example, say a Company has_many :users and each User has_many :articles. If you want to filter Articles by Company, add company_id to the Article's is_indexed method. The best way is to grab it from the User association:
Now you can run:
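(The search call itself was lost; a sketch of what it looks like, assuming the company_id attribute from above. The id value 493 is made up:)
@search = Ultrasphinx::Search.new(:query => 'some text', :filters => {'company_id' => 493})
@search.run
@search.results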
If the associations weren't just has_many and belongs_to, you would need to use the :association_sql key to set up a custom JOIN.
Query Defaults
Note that you can set up your own query defaults in environment.rb:
Geographic distance
If you pass a :location Hash, distance from the location in meters will be available in your result records via the distance accessor:
Note that Sphinx expects lat/long to be indexed as radians. If you have degrees in your database, do the conversion in the is_indexed as so:
Then, set Ultrasphinx::Search.client_options[:location][:units] = 'degrees'.
The MySQL :double column type is recommended for storing location data. For Postgres, use float
Excerpt mode
This is the mode for highlighting the matched search keywords in the results.
You can have Sphinx excerpt and highlight the matched sections in the associated fields. Instead of calling run, call excerpt.
@search.excerpt
The returned models will be frozen and have their field contents temporarily changed to the excerpted and highlighted results.
There are some other keys you can set, such as excerpt size, HTML tags to highlight with, and number of words on either side of each excerpt chunk. Example (in environment.rb):
2009年3月27日 星期五
sphinx on rails and ultrasphinx
全文搜尋 Sphinx on Rails
fanuan's ultrasphinx at master
Searching a Ruby on Rails application with Sphinx and Ultrasphinx
Sphinx+MySQL全文檢索架構與安裝
使用libmmseg實現Ruby的中文分詞功能 <= looks powerful, but I haven't used it yet XD
Sphinx's distinguishing feature is that it reads MySQL (or PostgreSQL) directly and is completely independent of the Rails app. Unlike acts_as_ferret and acts_as_solr, which index new data through Rails callbacks, you set up a crontab to run the indexer periodically (say every half hour), which greatly improves the stability of the search daemon (and vice versa) and largely avoids index corruption (a Ferret problem?). Every review I've seen says it is stable and fast, and that once you've used it you don't want to go back to Ferret. The Ultrasphinx author also has benchmark numbers to back this up.
Download and install Sphinx.
To install Ultrasphinx, first install the chronic gem (command sketched below).
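(A minimal sketch of that install, assuming RubyGems:)
gem install chronic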
Install the Ultrasphinx plugin.
Ultrasphinx also supports will_paginate.
Once it is installed, move on to configuration.
To make the default.base file support Chinese search: the answer is in the official forum.
Next, configure the parts you want indexed (see the sketch below).
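(A minimal is_indexed sketch; the Post model and column names here are made up:)
class Post < ActiveRecord::Base
  is_indexed :fields => ['title', 'body']
end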
Then generate the configuration file that is actually used. Note that if you later change a model's is_indexed, this file has to be regenerated every time (because what is inside it is generated SQL).
rake ultrasphinx:configure
Running the indexer and the daemon is simple:
rake ultrasphinx:index
rake ultrasphinx:daemon:start
The .base file under RAILS_ROOT/config/ultrasphinx is the one meant for you to edit; you can also have development, production, and test versions, just like Rails environments.
After that, run the configure task again to generate the corresponding default.conf file, which is what Sphinx actually uses.
2009年3月26日 星期四
Fetching data across domains dynamically from JavaScript, with jQuery and Rails
JSONP
jQuery抓取跨網域外的資料(cross-domain) 運用JSONP
JSON & JSONP 簡介
JavaScript basically cannot request data across domains, but there is now a convention called JSONP.
What JSONP is: JSONP (JSON with Padding) is an unofficial protocol. It lets the server return script tags to the client, and cross-domain access is achieved through a JavaScript callback (this is only the simplest form of JSONP).
How it works:
* Load a JavaScript file with a script tag; this is allowed to cross domains.
* The server returns its data in JSON (JavaScript Object Notation) format.
* The server offers a service that takes a callback function name as a parameter and returns a call to that callback with the data as its argument (the "padding" refers to this step of wrapping the data in the callback call). When the JavaScript finishes transferring, the client executes it, which calls the callback, so you know the transfer is complete.
If the explanation alone doesn't make sense, an example helps.
Example
Use Flickr's service to query public photos tagged 羅平, 油菜花, and sunset, adding the parameters format=json and jsoncallback=doPhoto, which means: return the data as JSON and wrap it in a call to the callback function doPhoto, passing the data as the argument.
What comes back is callback_function_name(data in JSON format).
After such a script tag runs, what you end up with is simply a script that calls a function: the server builds the function-call script and puts the data we asked for into it as the argument.
The last step is to have your callback function ready to be called; it is the thing that processes the returned data.
My example with jQuery and Rails
Client
Rails server side (see the sketch below)
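(The original client and server code did not survive the paste. On the client side, jQuery can request a JSONP URL whose callback parameter is "?", and it fills the callback name in for you. A sketch of the Rails side, assuming a PhotosController and a callback parameter literally named callback, both of which are made up here:)
class PhotosController < ApplicationController
  def index
    data = { :title => 'hello', :tags => ['sunset'] }
    # wrap the JSON in a call to the callback function the client asked for
    render :text => "#{params[:callback]}(#{data.to_json})",
           :content_type => 'text/javascript'
  end
end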
Block, Proc, lambda, closure
Block,Proc,lambda的理解
Alex學Ruby[詳解 block和Proc對象]
Alex學Ruby[詳解 block和Proc對象 2]
A block is an anonymous chunk of code.
A Proc is also an anonymous chunk of code, but a Proc lets you store that block in a variable and genuinely pass it around as a value or an argument.
A lambda is another way of making an anonymous function, very similar to Proc, with only small differences.
lambda and Proc are the same except that return inside a Proc returns from the calling method, while return inside a lambda does not; it only returns from the lambda itself.
Other differences between Proc and lambda (see the sketch below):
A proc object created with lambda is invoked with call, and it checks the number of arguments passed in.
An object created with Proc.new does not check the argument count: missing arguments become nil and extras are dropped.
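(A small sketch of both differences:)
def proc_return
  p = Proc.new { return 10 }
  p.call
  20            # never reached: the Proc's return exits proc_return itself
end

def lambda_return
  l = lambda { return 10 }
  l.call
  20            # reached: the lambda's return only returns from the lambda
end

proc_return     # => 10
lambda_return   # => 20

strict = lambda   { |a, b| [a, b] }
loose  = Proc.new { |a, b| [a, b] }
loose.call(1)       # => [1, nil]  (missing argument padded with nil)
# strict.call(1)    # raises ArgumentError: wrong number of arguments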
How a block passed as a parameter can be called, and how a lambda can be passed around as a variable (see the sketch below):
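(A sketch of both:)
def run_block
  yield(1, 2) if block_given?        # call the block passed to this method
end
run_block { |a, b| a + b }           # => 3

def run_block_explicit(&blk)         # capture the block as a Proc object
  blk.call(3, 4)
end
run_block_explicit { |a, b| a * b }  # => 12

adder = lambda { |a, b| a + b }      # a lambda stored in a variable
def run_callable(fn)
  fn.call(5, 6)
end
run_callable(adder)                  # => 11
run_block_explicit(&adder)           # & turns the proc back into a block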
closure
The variables a closure refers to are the ones captured from the scope where the proc object was created; the proc keeps its own binding to them.
I can use the following, but I don't completely understand it XD
The map(&:upcase) in the example is equivalent to map { |x| x.upcase }.
How is this implemented? The Symbol class implements a to_proc method internally:
map only accepts a code block; by using & you can pass a proc instead of writing the block out explicitly. When & is applied to something that is not a proc, and what was passed in is a symbol object (:upcase), the interpreter naturally calls the symbol's to_proc method to get a proc. Inside to_proc, self refers to the symbol that to_proc was called on.
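(A sketch of the to_proc trick described above:)
class Symbol
  def to_proc
    # self is the symbol; send it as a message to each element the proc receives
    proc { |obj, *args| obj.send(self, *args) }
  end
end

%w[foo bar].map(&:upcase)   # => ["FOO", "BAR"]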
2009年3月25日 星期三
GeoKit on Rails
Geocoding with the Rails GeoKit Plugin
GeoKit official website
ym4r provides helpers that make it easy to use Google Maps.
GeoKit provides the features a map application tends to need, such as the distance between two points, finding the points inside a given range, and so on.
Installation
Step 1: install the plugin.
Step 2: add the settings to the initializer block in environment.rb.
Step 3: tell Rails to load the gem.
Finally
include GeoKit::Geocoders at the top of any controller that needs it.
Converting an address into latitude/longitude coordinates
Automatic conversion to coordinates
If your table has columns like lat and lng, then when the record object is created, the column you designated as the address is geocoded automatically and the corresponding coordinates are filled in.
Calculating the distance between points
Finding all the points within a given range
Determining the user's location from their IP
(Sketches of these calls follow below.)
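(The code samples for the headings above were lost in the paste; a sketch of the GeoKit calls, assuming an acts_as_mappable Store model with lat/lng columns. Names, addresses, and values here are made up:)
# geocode an address
res = GeoKit::Geocoders::MultiGeocoder.geocode('100 Spear St, San Francisco, CA')
puts [res.lat, res.lng].inspect if res.success

# distance between two mappable records (or a record and a point)
store.distance_to(other_store)

# all stores within 10 units (miles by default) of an origin
Store.find(:all, :origin => '100 Spear St, San Francisco, CA', :within => 10)

# rough location from an IP address
GeoKit::Geocoders::IpGeocoder.geocode('12.215.42.19')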
## USING INCLUDES
You can use includes along with your distance finders:
However, ActiveRecord drops the calculated distance column when you use include. So, if you need to use the distance column, you'll have to re-calculate it post-query in Ruby:
In this case, you may want to just use the bounding box condition alone in your SQL (there's no use calculating the distance twice):
2009年3月24日 星期二
YM4R/GM plugin for Rails
Binding JavaScript and Ruby
At initialization time, you can pass arbitrary JavaScript code to the GMap#record_init and GMap#record_global_init.
While center_zoom_init, control_init or overlay_init (and generally all the GMap methods which end in init) are among the rare convenience methods that do not output JavaScript, add_overlay does. Actually, if you look at the code of the GMap class, you won't find any add_overlay method, although in the documentation of the GMap2 class from the Google Maps API documentation, you will find something about the addOverlay JavaScript method. In fact, when you call an unknown method on a mapping object, it is converted to a javascriptified version of it, along with its arguments, and a string of JavaScript code is output.
So the @map.add_overlay… above is converted to "map.addOverlay(new GMarker(GLatLng.new(35.12878, -110.578),{title:\"Hello!\"}))", which is then passed to the record_init method of a Ruby GMap object to be later output along with the rest of the initialization code
Any arbitrary JavaScript code can be passed to the record_init method. Note that the last 2 lines of the previous code sample are strictly equivalent, and since the overlay_init version is a bit simpler, it should be preferred.
Initialization of the map
You can also optionally pass to the constructor the JavaScript name of the variable that will reference the map, which by default will be global in JavaScript.
The other absolutely necessary initialization step in the controller is the setting of center and zoom:
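(The controller snippet was lost in the paste; a sketch of a typical YM4R/GM initialization, reusing the coordinates quoted above, with a made-up zoom level:)
@map = GMap.new('map_div')
@map.control_init(:large_map => true, :map_type => true)
@map.center_zoom_init([35.12878, -110.578], 12)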
GMap.header
Outputs the inclusion of the JavaScript file from Google to make use of the Google Maps API and by default a style declaration for VML objects, necessary to display polylines under IE. This default can be overridden by passing :with_vml => false
You can also pass to this method a :host option in order to select the correct key for the location where your app is currently deployed, in case the current environment has multiple possible keys
you can override all the key settings in the configuration by passing a value to the :key key.
GMap#to_html
By default, it outputs the script tags and initializes the map in response to the onload event of the JavaScript window object
You can pass the option :full=>true to the method to setup a fullscreen map, which will also be resized when the window size changes.
So you should have something like the following:
Note that you need to set a size for the map DIV element at some point or the map will not be displayed.
GMarkers
You can pass options to the GMarker to customize the info window (:info_window or :info_window_tabs options), the tooltip (:title option) or the icon used (:icon option).
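(A sketch of creating a marker with these options and adding it to the map; the text and coordinates reuse the example quoted earlier:)
marker = GMarker.new([35.12878, -110.578],
                     :title => 'Hello!',
                     :info_window => 'Some info text')
@map.overlay_init(marker)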
Update of the map
2009年3月23日 星期一
各種 Height
element.clientHeight
offsetHeight
scrollHeight
[轉]javascript:獲取屏幕高度和寬度等信息 & 製作滾動窗體時遇到的問題
Visible area width of the page: document.body.clientWidth
Visible area height of the page: document.body.clientHeight
Visible area width of the page, including borders: document.body.offsetWidth
Visible area height of the page, including borders: document.body.offsetHeight
Full width of the page body: document.body.scrollWidth
Full height of the page body: document.body.scrollHeight
How far the page has been scrolled down: document.body.scrollTop
How far the page has been scrolled left: document.body.scrollLeft
Top position of the page body: window.screenTop
Left position of the page body: window.screenLeft
Screen resolution height: window.screen.height
Screen resolution width: window.screen.width
Available work area height of the screen: window.screen.availHeight
Available work area width of the screen: window.screen.availWidth
The window is only what you can see; the page includes even the parts you cannot see.
window.innerHeight and document.body.clientHeight both refer to the height of the visible area.
offsetHeight is the height of the actual content.
scrollHeight is the height of the whole page.
document.body.scrollTop and window.pageYOffset are the distance from the top of the viewport (the scrollbar's position) to the top of the page.
Figure 1
http://www.flickr.com/photos/26011139@N05/3377955143/sizes/o/
Figure 2
http://www.flickr.com/photos/26011139@N05/3377955145/sizes/o/
Figure 3
http://www.flickr.com/photos/26011139@N05/3377955149/sizes/o/
From these three figures we can tell:
offsetHeight, the content height, never changes, which makes sense: the content is as tall as it is.
scrollHeight grows as you make the window bigger, but its minimum value is the value at the point where a scrollbar appears.
Also note that the distance to the top of the page is simply the scroll distance plus the distance to the top of the window.
2009年3月18日 星期三
The Rails Way :CRUD, Advanced Find
1.TRW p177
User model has many gookus
>> u.gooku_ids
=> [1566, 1284, 2088, 2094, 4016, 4956, 8114]
>> u.gookus.map(&:id)
=> [1566, 1284, 2088, 2094, 4016, 4956, 8114]
So when naming things like checkboxes you can use user[gooku_ids][].
2.query cache TRWp172
Spend less time in the database, and more time outdoors!
3. Controlling Access to Attributes, TRW p178
4.Database Locking
4.1 Optimistic Locking
Implementation is simple: add a lock_version column and let Rails handle the rest.
The loser of the race condition gets an ActiveRecord::StaleObjectError exception.
The advantage is that it is trivial to implement.
The drawbacks are that it is a little slow, and the error only shows up after the update has been attempted, so the user has to do the work again. (See the sketch below.)
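(A sketch of what optimistic locking looks like in practice; the table and values are made up, but lock_version and StaleObjectError are the real Rails names:)
# migration: Rails uses this column automatically once it exists
add_column :accounts, :lock_version, :integer, :default => 0

begin
  account.update_attributes(:balance => 200)
rescue ActiveRecord::StaleObjectError
  # someone else saved this row first; reload and retry, or tell the user
  account.reload
end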
4.2 Pessimistic Locking
This needs database support: for operations that will end up updating, the lock is used to block reads as well, so you avoid reading stale data.
Pessimistic locking takes place at the database level. The SELECT statement generated by ActiveRecord will have a FOR UPDATE (or similar) clause added to it, causing all other connections to be blocked from access to the rows returned by the select statement. The lock is released once the transaction is committed. There are theoretically situations (Rails process goes boom mid-transaction?!) where the lock would not be released until the connection is terminated or times out.
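(A sketch of pessimistic locking with Rails 2.x finders; the Account model and withdraw! method are made up, and :lock => true is what adds FOR UPDATE to the SELECT:)
Account.transaction do
  account = Account.find(params[:id], :lock => true)  # SELECT ... FOR UPDATE
  account.withdraw!(100)
end  # the lock is released when the transaction commits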
4.3 Consideration
Web applications scale best with optimistic locking, which as we've discussed doesn't really use any locking at all. However, you have to add application logic to handle failure cases.
In short: just use optimistic locking.
5. The :select and :group find options (see the sketch below)
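(There was no content under this heading; a sketch of the Rails 2.x find options it refers to, reusing the gookus association from item 1:)
User.find(:all,
  :select => 'users.*, COUNT(gookus.id) AS gooku_count',
  :joins  => 'LEFT JOIN gookus ON gookus.user_id = users.id',
  :group  => 'users.id')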
網域名稱DNS管理
Besides filling in the primary and secondary DNS, there are other options to pay attention to: the SOA, NS, A, AAAA, PTR, CNAME, MX records, and so on.
SOA
Start Of Authority. This record goes at the very beginning of the zone file; each zone file can have only one SOA, and it must be the first record in the file. It describes the name server responsible for this zone, the version number, and so on, plus the parameters a slave server uses when it replicates the zone. Immediately after SOA come the authoritative host for the zone and the administrator's mailbox, here "bad.com.tw" and "root.bad.com.tw", i.e. the bad.com.tw host and root's mailbox. Note that "root.bad.com.tw" stands for "root@bad.com.tw".
e.g.
@ IN SOA bad.com.tw. root.bad.com.tw. (
1999051401 ; Serial
3600 ; Refresh
300 ; Retry
3600000 ; Expire
3600 ) ; Minimum
The options between the two parentheses are the SOA settings; they are explained in more detail below.
NS
name server: specifies the host name of the DNS server for this zone. Note that it cannot be written as an IP address.
e.g.
IN NS dns.bad.com.tw.
A
address: maps a DNS domain name to a 32-bit IPv4 address.
e.g.
server IN A 140.123.102.10
AAAA
Maps a DNS domain name to a 128-bit IPv6 address.
e.g.
twnic.net.tw. 86400 IN AAAA 3ffe::bbb:93:5
PTR
pointer: defines the domain name corresponding to an IP address, i.e. maps an IP address back to the host's FQDN.
e.g.
20 IN PTR mail.bad.com.tw.
CNAME
canonical name: lets you give one host several aliases. For example, mix.bad.com.tw could have the aliases www.bad.com.tw and ftp.bad.com.tw, so all of these names reach the same server.
e.g.
www IN CNAME mix
MX
mail exchanger: designates the host that acts as the mail server for the zone; all mail destined for that machine is relayed through the mail exchanger. The number is the priority for mail delivery, and a lower value means a higher priority.
e.g.
server IN MX 10 mail.bad.com.tw.
Decimal vs float
float與decimal的區別
Decimal versus Float Point Type
rails中的Float/Decimal數據類型
The primary use for DECIMAL is money, where floating point would fail miserably and produce lots of errors (money leak) at the end of the month. As floating point arithmetic has its own separate section on your processor (unless you have a 386 or older computer) it would be *much* faster than DECIMAL, so use it only if you are absolutely sure you need it.
The difference is that, for example, .01 can be represented exactly in decimal; but float types are binary, so .01 cannot be represented exactly. This can lead to all kinds of trouble when doing arithmetic, since the errors accumulate. It's one reason why most people write their loops with "< x + 1" rather than "= x". That gets past the problem, but if you are adding together many values the final answer may be wrong.
mysql> create table numbers (a decimal(10,2), b float);
mysql> insert into numbers values (100, 100);
mysql> select @a := (a/3), @b := (b/3), @a + @a + @a, @b + @b + @b from numbers \G
*************************** 1. row ***************************
@a := (a/3): 33.333333333
@b := (b/3): 33.333333333333
@a + @a + @a: 99.999999999000000000000000000000
@b + @b + @b: 100
The decimal did exactly what it is supposed to do in these cases: it truncated the rest, thus losing the 1/3 part. So for sums the decimal is better, but for divisions the float is better, up to some point, of course. I mean, using DECIMAL will not give you "fail-proof arithmetic" by any means.
How do you expect to split a dollar 3 ways?
It is not the math you do that determines whether you use float or decimal, it is what you are modeling that is important. Dollars are decimal, and dollar calculations must be rounded to the nearest cent, or mill.
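(The same point in Ruby, for the Rails side of this; a minimal sketch:)
require 'bigdecimal'

0.1 + 0.2 == 0.3
# => false: binary floats cannot represent 0.1 exactly, so the error shows up

BigDecimal('0.1') + BigDecimal('0.2') == BigDecimal('0.3')
# => true: decimal arithmetic keeps the exact value
# In a Rails migration, money columns map to this via
#   t.column :price, :decimal, :precision => 8, :scale => 2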
2009年3月17日 星期二
Caching on ruby on rails
Caching with Ruby on Rails
Scaling Rails - 第五章 Advanced Page Caching
Scaling Rails - 第六章 Action Caching
Scaling Rails - 第七章 Fragment Caching
Scaling Rails - 第八章 Memcached
Ruby on Rails Caching Tutorial
Sql cache
Query caches are created at the start of an action and destroyed at the end of that action and thus persist only for the duration of the action.
Tested on Rails 2.2.2: in actions, views, and helpers the query cache is already on by default; only model code does not have it, so there you still have to wrap queries in the query cache yourself (see the sketch below).
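(A sketch of doing that from model-level code; Product is a made-up model, and cache is the standard ActiveRecord class method that turns the query cache on for the block:)
Product.cache do
  Product.find(:first)
  Product.find(:first)   # identical SQL, served from the query cache
end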
2009年3月16日 星期一
ActionView
1. config.action_view.erb_trim_mode can change the default ERB trim behaviour.
2. Grouped selection lists, p581:
option_groups_from_collection_for_select
3. Date and time fields:
there are goodies like prefix, order, discard_xxx, and include_blank available.
4. Multiple models in a form, p585:
use fields_for (see the sketch below).
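(A sketch of fields_for inside a form, Rails 2.x ERB; the User/Address models and the action name are made up:)
<% form_for :user, @user, :url => { :action => 'update' } do |f| %>
  <%= f.text_field :name %>
  <% fields_for :address, @user.address do |a| %>
    <%= a.text_field :city %>
  <% end %>
  <%= submit_tag 'Save' %>
<% end %>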
2009年3月15日 星期日
ActionController and Rails(1)
1. hide_action
If you want a method in a controller that should not be reachable as an action, use hide_action.
2. Several methods are available in the controller environment.
3. render :inline
4. render :content_type
【META http-equiv="Content-Type" Content="text/html; Charset=*】意義詳解
Through :content_type you can specify the value sent to the browser in the Content-Type HTTP header.
render :text => output.string, :content_type => 'application/x-ployrecover'
5.send_data
6. redirect_to :back
which is really just
redirect_to(request.env["HTTP_REFERER"])
7.flash
flash basically lasts until the next request has been processed.
So if you want something only for the current request, use flash.now; if you want to keep it around for one more request, use flash.keep.
That is why flash is usually used before a redirect_to, and flash.now before a render (see the sketch below):
redirect_to tells the browser to go to this URL, so that counts as the next request,
while render only says what to produce for this request, so it is still the current request.
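(A sketch of the usual pattern, assuming a made-up Post model:)
def create
  @post = Post.new(params[:post])
  if @post.save
    flash[:notice] = 'Saved'              # shown on the next request, after the redirect
    redirect_to :action => 'index'
  else
    flash.now[:error] = 'Could not save'  # shown in the template rendered right now
    render :action => 'new'
  end
end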
8. Use an after_filter to compress the response, p535.
9. You can use skip_before_filter and skip_after_filter to skip certain filters.
10. session supports only, except, and if options for conditionally turning the session on or off.
11. verify can validate a request:
verify :only => :post_comment,
:session => :user_id,
:add_flash => { :note => "You must log in to comment" },
:redirect_to => :index
2009年3月14日 星期六
Routes and url memo(1)
Some parameters you can use when defining routes:
1. :defaults vs :name => value
The most basic one is :defaults; Rails supplies defaults like these:
:defaults => { :action => "index", :id => nil }
If a parameter has a default value, it is not shown in the URL.
1.1 :name => value
This only takes effect when the parameter is not given, but unlike a default it does show up in the URL.
2. :conditions vs :requirements
The :conditions have to match for the route to be considered at all; only then do the :requirements get a chance to run.
:conditions are for restrictions like { :method => :post }.
:requirements describe what the parameters must look like, i.e. which pattern they must satisfy (see the sketch below).
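(A sketch of both in a Rails 2.x route; the controller, action, and patterns are made up:)
ActionController::Routing::Routes.draw do |map|
  map.connect 'articles/:year/:month',
    :controller   => 'articles', :action => 'by_month',
    :conditions   => { :method => :get },
    :requirements => { :year => /\d{4}/, :month => /\d{1,2}/ }
end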
The rule url_for uses for defaults is to reuse as much of the current request as possible, and once you change one of the earlier parameters it stops supplying the later ones.
If you want to change an earlier parameter while keeping the later ones unchanged, use overwrite_params.
If the controller name you pass in starts with "/", an absolute path is generated from that controller name;
anything else is treated as a path relative to the current URL.
REST
Adding custom actions:
collection => like list, acts on the whole set
member => like show, acts on a single record
new => like new, for creating something new
(See the sketch below.)
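(A sketch with map.resources, Rails 2.x; the resource and action names are made up:)
map.resources :gookus,
  :collection => { :search   => :get  },
  :member     => { :activate => :put  },
  :new        => { :preview  => :post }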