DS C3 Inverted File Index

Sol 1: Term-Document Incidence Matrix

Two drawbacks:

  • Can not identify the location of each term
  • The matrix is sparse

Sol 2: Inverted File Index

Index is a mechanism for locating a given term in a text.

Inverted file contains a list of pointers (e.g the number of a page) to all occurrences of that term in the text.

  • 每个文件(网页)都对应一个文件ID

  • 文件内容被表示为一系列关键词的集合:词袋假设。

1
2
3
4
5
6
7
8
9
while ( read a document D ) {
while ( read a term T in D ) {
if ( Find( Dictionary, T ) == false )
Insert( Dictionary, T );
Get T’s posting list;
Insert a node to T’s posting list;
}
}
Write the inverted index to disk;

reading a term

Word Stemming

Process a word so that only its stem or root form is left since the form has no influence on meaning.

Stop Words

Some words are so common that almost every document contains them, such as “a”, “the”, “it”. It is useless to index them thus we can eliminate them from the original documents.

accessing a term

  • Sol1: Search Trees
  • Sol2: Hashing

Hashing is faster for one word, but weak in scanning in sequential order and range operations.

memory

not having enough memory

1
2
3
4
5
6
7
8
9
10
11
12
13
14
BlockCnt = 0; 
while ( read a document D ) {
while ( read a term T in D ) {
if ( out of memory ) {
Write BlockIndex[BlockCnt] to disk;
BlockCnt ++;
FreeMemory;
}
if ( Find( Dictionary, T ) == false )
Insert( Dictionary, T );
Get T’s posting list;
Insert a node to T’s posting list;
}
}

Distributed indexing

  • Sol1:Term-partitioned index
  • Sol2:Document-partitioned index

Term(全局索引):

  • 查询只需要访问一台机器,不需要后续的聚合;
  • loadbalance比较好,若如果查询的关键词分布比较均匀;
  • 可靠性差,如果一台机器挂了则所有内容无法被查询;
  • loadbalance不好,如果query分布不均匀。

Doc(局部索引):

  • 可靠,总有部分结果;
  • loadbalance好;
  • 容易和分布式爬虫配合。

Improvements

Dynamic indexing

  • Docs come in over time:

    • postings updates for terms already in dictionary

    • new terms added to dictionary

  • Docs get deleted

Compression

  • Using gaps(Most can be encoded with far fewer than 20 bits) instead of actual indexs in Posting List.

    e.g

    arrivedamagedeliverfiregoldshipmentsilvertruck

    • before compression: 2, 15, 47, …, 58879, 58890, …
    • after compression: 2, 13, 32, … …, 11, …

Thresholding

  • Document: only retrieve the top xx documents where the documents are ranked by weight.

    DrawBacks

    • Not feasible for Boolean queries
    • Relevant documents missing due to truncation
  • Query:

    • Sort the query terms by their frequency in ascending order;
    • Search according to only some percentage of the original query terms.

Search Engine Measures

  • speed of indexing : Number of documents/hour
  • speed of searching : Latency as a function of index size
  • Expressiveness of query language: Ability to express complex information needs ; Speed on complex queries.
  • User happiness:
    • Data Retrieval Performance Evaluation (after establishing correctness):
      • Response time
      • Index space
    • Information Retrieval Performance Evaluation: relevancy of anser set

Relevance measurement requires 3 elements:

  • A benchmark document collection
  • A benchmark suite of queries
  • A binary assessment of either Relevant or Irrelevant for each query-doc pair.
RelevantIrrelevant
RetrievedRRR_RIRI_R
Not RetrievedRNR_NINI_N

Precision P=RRRR+IR\rm Precision\ \it P = \frac{R_R}{R_R + I_R}

Recall R=RRRR+RN\rm Recall\ \it R = \frac{R_R}{R_R + R_N}


DS C3 Inverted File Index
http://example.com/2023/03/21/DS-03/
Author
Tekhne Chen
Posted on
March 21, 2023
Licensed under