DS C3 Inverted File Index
Sol 1: Term-Document Incidence Matrix
Two drawbacks:
- Cannot identify the location of each term within a document
- The matrix is sparse
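For concreteness, here is a tiny sketch (the documents and terms are made up for illustration) of what a term-document incidence matrix looks like and why it is sparse:

```python
# Illustrative only: build a Boolean term-document incidence matrix.
docs = {
    1: "gold silver truck",
    2: "shipment of gold damaged in a fire",
    3: "delivery of silver arrived in a silver truck",
}

terms = sorted({t for text in docs.values() for t in text.split()})
# matrix[term][doc_id] == 1 iff the term occurs somewhere in that document.
matrix = {t: {d: int(t in text.split()) for d, text in docs.items()} for t in terms}

ones = sum(v for row in matrix.values() for v in row.values())
print(f"{len(terms)} terms x {len(docs)} docs, "
      f"{ones} nonzero entries out of {len(terms) * len(docs)}")
# Note the two drawbacks: the matrix records only presence/absence (no positions),
# and most entries become 0 once the vocabulary grows.
```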
Sol 2: Inverted File Index
Index is a mechanism for locating a given term in a text.
Inverted file contains a list of pointers (e.g., the number of a page) to all occurrences of that term in the text.
Each document (web page) is assigned a document ID.
The content of a document is represented as a collection of its key terms: the bag-of-words assumption.
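A minimal Python sketch of the idea (the sample documents are made up): each term maps to a posting list of the IDs of the documents that contain it.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}.  Returns {term: sorted list of doc IDs containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Bag-of-words: only which terms occur matters here, not where they occur.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "gold silver truck",
    2: "shipment of gold damaged in a fire",
    3: "delivery of silver arrived in a silver truck",
}
index = build_inverted_index(docs)
print(index["gold"])    # [1, 2]
print(index["silver"])  # [1, 3]
```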
Reading a term
Word Stemming
Process a word so that only its stem (root form) is left, since the inflected form has little influence on the meaning.
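A naive suffix-stripping sketch of the idea (this is not a real stemmer such as Porter's algorithm; the suffix list is made up for illustration):

```python
def naive_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix; real systems use e.g. the Porter stemmer."""
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print([naive_stem(w) for w in ["troubled", "troubles", "troubling", "trouble"]])
# ['troubl', 'troubl', 'troubl', 'trouble'] -- all indexed under (nearly) the same stem
```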
Stop Words
Some words are so common that almost every document contains them, such as “a”, “the”, “it”. Indexing them is useless, so we can eliminate them from the original documents.
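A small sketch of filtering stop words before indexing (the stop-word list here is a made-up fragment; real lists are much longer):

```python
STOP_WORDS = {"a", "an", "the", "it", "of", "in", "and", "to", "is"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("shipment of gold damaged in a fire".split()))
# ['shipment', 'gold', 'damaged', 'fire']
```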
Accessing a term
- Sol1: Search Trees
- Sol2: Hashing
Hashing is faster for looking up a single word, but it does not support sequential scans or range searches.
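The trade-off can be illustrated with a dict (hashing) versus a sorted term list (standing in for a search tree), using made-up terms:

```python
import bisect

terms = sorted(["arrive", "damage", "deliver", "fire", "gold",
                "shipment", "silver", "truck"])
hash_index = {t: i for i, t in enumerate(terms)}   # hashing: O(1) expected exact lookup

print(hash_index["gold"])          # single-word lookup: hashing is fastest

# A range / prefix query ("all terms starting with 's'") needs sorted order,
# which a search tree (or sorted array) provides but a hash table does not.
lo = bisect.bisect_left(terms, "s")
hi = bisect.bisect_left(terms, "t")
print(terms[lo:hi])                # ['shipment', 'silver']
```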
Memory
- Problem: there may not be enough memory to hold the whole index.
Distributed indexing
- Sol1: Term-partitioned index
- Sol2: Document-partitioned index
Term-partitioned (global index):
- A query only needs to visit one machine, with no subsequent aggregation;
- Load balance is good if the queried terms are evenly distributed;
- Poor reliability: if one machine goes down, everything it holds becomes unsearchable;
- Load balance is poor if the query distribution is skewed.
Document-partitioned (local index):
- Reliable: partial results are always available;
- Good load balance;
- Easy to combine with a distributed crawler.
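A toy sketch of the document-partitioned scheme (the class and data are made up): every shard indexes only its own documents, a query is broadcast to all shards, and the partial results are merged, which is why partial answers survive a machine failure.

```python
from collections import defaultdict

class Shard:
    """One 'machine' holding a local index over its own documents."""
    def __init__(self, docs):
        self.index = defaultdict(list)
        for doc_id, text in docs.items():
            for term in set(text.lower().split()):
                self.index[term].append(doc_id)

    def search(self, term):
        return self.index.get(term, [])

shards = [
    Shard({1: "gold silver truck", 2: "shipment of gold"}),
    Shard({3: "delivery of silver arrived", 4: "fire damaged the gold shipment"}),
]

def search_all(term):
    # Broadcast to every shard and merge; if one shard is down, the rest still answer.
    return sorted(doc for shard in shards for doc in shard.search(term))

print(search_all("gold"))   # [1, 2, 4]
```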
Improvements
Dynamic indexing
- Docs come in over time:
  - postings updates for terms already in the dictionary
  - new terms added to the dictionary
- Docs get deleted
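The notes do not fix a concrete scheme; a common approach (sketched here as an assumption) keeps a small auxiliary index for newly arrived documents, marks deletions lazily, and periodically merges the auxiliary index into the main one:

```python
from collections import defaultdict

class DynamicIndex:
    def __init__(self):
        self.main = defaultdict(set)   # large, rarely rebuilt index
        self.aux = defaultdict(set)    # small index for newly arrived docs
        self.deleted = set()           # doc IDs marked as deleted

    def add_doc(self, doc_id, text):
        # New terms and new postings first go into the auxiliary index.
        for term in set(text.lower().split()):
            self.aux[term].add(doc_id)

    def delete_doc(self, doc_id):
        self.deleted.add(doc_id)       # lazy deletion, filtered out at query time

    def search(self, term):
        hits = self.main.get(term, set()) | self.aux.get(term, set())
        return sorted(hits - self.deleted)

    def merge(self):
        # Periodically fold the auxiliary index into the main index.
        for term, ids in self.aux.items():
            self.main[term] |= ids
        self.aux.clear()

idx = DynamicIndex()
idx.add_doc(1, "gold silver truck")
idx.add_doc(2, "shipment of gold")
idx.delete_doc(2)
print(idx.search("gold"))   # [1]
```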
Compression
- Use gaps (most can be encoded with far fewer than 20 bits) instead of the actual document IDs in the posting list.
  e.g., for the terms arrive, damage, deliver, fire, gold, shipment, silver, truck:
  - before compression: 2, 15, 47, …, 58879, 58890, …
  - after compression: 2, 13, 32, … …, 11, …
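A small sketch of gap (delta) encoding and decoding; the numbers reuse the posting list from the example above:

```python
from itertools import accumulate

def to_gaps(postings):
    """Store the first doc ID and then only the differences between consecutive IDs."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def from_gaps(gaps):
    return list(accumulate(gaps))

postings = [2, 15, 47, 58879, 58890]
gaps = to_gaps(postings)
print(gaps)                          # [2, 13, 32, 58832, 11]
assert from_gaps(gaps) == postings   # small gaps need far fewer bits than raw doc IDs
```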
Thresholding
- Document thresholding: only retrieve the top documents, where documents are ranked by weight.
  Drawbacks:
  - Not feasible for Boolean queries
  - Relevant documents may be missed due to the truncation
- Query thresholding (see the sketch after this list):
  - Sort the query terms by their frequency in ascending order;
  - Search according to only some percentage of the original query terms.
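A sketch of query-term thresholding, assuming “frequency” means document frequency and using a made-up frequency table and keep ratio:

```python
def threshold_query(query_terms, doc_freq, keep_ratio=0.6):
    """Keep only a percentage of the query terms, rarest (lowest frequency) first."""
    ordered = sorted(query_terms, key=lambda t: doc_freq.get(t, 0))  # ascending frequency
    keep = max(1, int(len(ordered) * keep_ratio))
    return ordered[:keep]

doc_freq = {"gold": 120, "silver": 80, "truck": 400, "shipment": 30, "fire": 15}
print(threshold_query(["gold", "silver", "truck", "shipment", "fire"], doc_freq))
# ['fire', 'shipment', 'silver'] -- the most discriminating 60% of the query terms
```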
Search Engine Measures
- Speed of indexing: number of documents indexed per hour
- Speed of searching: latency as a function of index size
- Expressiveness of the query language: ability to express complex information needs; speed on complex queries
- User happiness
- Data retrieval performance evaluation (after establishing correctness):
  - Response time
  - Index space
- Information retrieval performance evaluation: relevancy of the answer set
Relevance measurement requires 3 elements:
- A benchmark document collection
- A benchmark suite of queries
- A binary assessment of either Relevant or Irrelevant for each query-doc pair.
|               | Relevant | Irrelevant |
|---------------|----------|------------|
| Retrieved     | R_R (retrieved & relevant) | I_R (retrieved & irrelevant) |
| Not Retrieved | R_N (not retrieved & relevant) | I_N (not retrieved & irrelevant) |
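From the four cell counts above (the R_R / I_R / R_N / I_N names are a common notation, used here as an assumption), the standard relevance measures are:

```latex
% Precision: fraction of the retrieved documents that are relevant.
% Recall:    fraction of the relevant documents that were retrieved.
\[
  \text{Precision} = \frac{R_R}{R_R + I_R},
  \qquad
  \text{Recall} = \frac{R_R}{R_R + R_N}
\]
```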