DS C3 Inverted File Index

Sol 1: Term-Document Incidence Matrix

Two drawbacks:

Can not identify the location of each term
The matrix is sparse

Sol 2: Inverted File Index

Index is a mechanism for locating a given term in a text.

Inverted file contains a list of pointers (e.g the number of a page) to all occurrences of that term in the text.

每个文件（网页）都对应一个文件ID

文件内容被表示为一系列关键词的集合：词袋假设。

while ( read a document D ) {
    while ( read a term T in D ) {
        if ( Find( Dictionary, T ) == false )
            Insert( Dictionary, T );
        Get T’s posting list;
        Insert a node to T’s posting list;
    }
}
Write the inverted index to disk;

reading a term

Word Stemming

Process a word so that only its stem or root form is left since the form has no influence on meaning.

Stop Words

Some words are so common that almost every document contains them, such as “a”, “the”, “it”. It is useless to index them thus we can eliminate them from the original documents.

accessing a term

Sol1: Search Trees
Sol2: Hashing

Hashing is faster for one word, but weak in scanning in sequential order and range operations.

memory

not having enough memory

BlockCnt = 0; 
while ( read a document D ) {
  while ( read a term T in D ) {
    if ( out of memory ) {
      Write BlockIndex[BlockCnt] to disk;
      BlockCnt ++;
      FreeMemory;
    }
    if ( Find( Dictionary, T ) == false )
      Insert( Dictionary, T );
    Get T’s posting list;
    Insert a node to T’s posting list;
  }
}

Distributed indexing

Sol1:Term-partitioned index
Sol2:Document-partitioned index

Term（全局索引）：

查询只需要访问一台机器，不需要后续的聚合；

loadbalance比较好，若如果查询的关键词分布比较均匀；

可靠性差，如果一台机器挂了则所有内容无法被查询；

loadbalance不好，如果query分布不均匀。

Doc（局部索引）：

可靠，总有部分结果；

loadbalance好；

容易和分布式爬虫配合。

Improvements

Dynamic indexing

Docs come in over time：
- postings updates for terms already in dictionary
- new terms added to dictionary

Docs get deleted

Compression

Using gaps(Most can be encoded with far fewer than 20 bits) instead of actual indexs in Posting List.
e.g

arrivedamagedeliverfiregoldshipmentsilvertruck
- before compression: 2, 15, 47, …, 58879, 58890, …
- after compression: 2, 13, 32, … …, 11, …

Thresholding

Document: only retrieve the top $x$ documents where the documents are ranked by weight.
DrawBacks
- Not feasible for Boolean queries
- Relevant documents missing due to truncation
Query:
- Sort the query terms by their frequency in ascending order;
- Search according to only some percentage of the original query terms.

Search Engine Measures

speed of indexing : Number of documents/hour
speed of searching : Latency as a function of index size
Expressiveness of query language: Ability to express complex information needs ; Speed on complex queries.
User happiness:
- Data Retrieval Performance Evaluation (after establishing correctness):
  - Response time
  - Index space
- Information Retrieval Performance Evaluation: relevancy of anser set

Relevance measurement requires 3 elements:

A benchmark document collection
A benchmark suite of queries
A binary assessment of either Relevant or Irrelevant for each query-doc pair.

	Relevant	Irrelevant
Retrieved	$R_R$	$I_R$
Not Retrieved	$R_N$	$I_N$

\rm Precision\ \it P = \frac{R_R}{R_R + I_R}

\rm Recall\ \it R = \frac{R_R}{R_R + R_N}

Computer Science > Data Structures

#DataStructures

DS C3 Inverted File Index

http://example.com/2023/03/21/DS-03/

Author

Tekhne Chen

Posted on

March 21, 2023

Licensed under

MA3 小提琴协奏曲 Previous

OOP C3 Container Next