Basic Vector-processing Model
Vector-space model:
- represents both queries and documents by term sets
- computes global similarities between them
- simplest to use, most productive (in some ways)
Term vectors:
- $D_i = (a_{i1}, \ldots, a_{it})$
- $Q_j = (q_{j1}, \ldots, q_{jt})$
- $a_{ik}$ and $q_{jk}$ represent the values of term $k$ in $D_i$ and $Q_j$
- 1 if term $k$ appears in $D_i$ ($Q_j$), 0 if it is absent
- numeric values depending on term importance
- any vector can be represented as a linear combination of linearly independent term vectors:
  $D_r = \sum_{i=1}^{t} a_{ri} T_i$
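The term-vector definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: whitespace tokenization, raw term counts as a stand-in for "term importance", and the helper names (`build_vocabulary`, `binary_vector`, `weighted_vector`) are all hypothetical choices, not part of the model itself.

```python
from collections import Counter


def build_vocabulary(texts):
    """Return the distinct terms T_1..T_t of all texts, in sorted order."""
    return sorted({term for text in texts for term in text.lower().split()})


def binary_vector(text, vocabulary):
    """a_ik = 1 if term k appears in the text, 0 if it is absent."""
    present = set(text.lower().split())
    return [1 if term in present else 0 for term in vocabulary]


def weighted_vector(text, vocabulary):
    """a_ik = raw count of term k (assumed proxy for term importance)."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


documents = [
    "information retrieval ranks documents",
    "vector space retrieval model",
    "documents documents and queries",
]
query = "vector retrieval"

vocabulary = build_vocabulary(documents + [query])
doc_vectors = [weighted_vector(d, vocabulary) for d in documents]
query_vector = binary_vector(query, vocabulary)
```

Documents and queries share one vocabulary, so every vector has the same length t and component k always refers to the same term.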
Similarity between vectors x and y in vector space:
$x \cdot y = |x|\,|y| \cos\alpha$
$|x|$ ... length of x
$\alpha$ ... angle between the vectors
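Solving the relation above for the cosine of the angle gives the usual cosine measure. The sketch below assumes vectors are plain Python lists of numbers. Returning 0.0 for a zero-length vector is a convention chosen here, not something the notes specify.

```python
import math


def dot(x, y):
    """Inner product x . y of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def length(x):
    """Euclidean length |x|."""
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """cos(alpha) = x . y / (|x| |y|); 0.0 if either vector has zero length."""
    denom = length(x) * length(y)
    return dot(x, y) / denom if denom else 0.0


x = [3.0, 4.0]
y = [4.0, 3.0]
similarity_xy = cosine(x, y)  # 24 / (5 * 5)
```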
Document-query similarity:
$D_r \cdot Q_s = \sum_{i,j=1}^{t} a_{ri}\, q_{sj}\, T_i \cdot T_j$
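- If terms are uncorrelated, the term vectors are orthogonal ($T_i \cdot T_j = 0$ for $i \neq j$; taking them as unit vectors, $T_i \cdot T_i = 1$), so only the $i = j$ terms of the double sum remain:
  $\mathrm{sim}(D_r, Q_s) = \sum_{i=1}^{t} a_{ri}\, q_{si}$
  $\mathrm{sim}(D_r, D_s) = \sum_{i=1}^{t} a_{ri}\, a_{si}$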
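To make the role of the term-term products concrete, the sketch below evaluates the general double sum with an explicit matrix of values T_i . T_j, here called `gram` (a hypothetical name). It then compares this with the orthogonal case, where the term vectors are assumed to have unit length. The numbers in `correlated` are made up for illustration only.

```python
def similarity(a, q, gram=None):
    """Sum over i, j of a[i] * q[j] * gram[i][j], where gram[i][j] = T_i . T_j.

    With gram=None the term vectors are assumed orthonormal, so only the
    i == j terms contribute and the sum reduces to a plain inner product.
    """
    t = len(a)
    if gram is None:
        return sum(a[k] * q[k] for k in range(t))
    return sum(a[i] * q[j] * gram[i][j] for i in range(t) for j in range(t))


d_r = [2, 0, 1]
q_s = [1, 1, 0]

# Uncorrelated terms: T_i . T_j is 1 on the diagonal and 0 elsewhere.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Illustrative correlation between terms 1 and 2 (invented values).
correlated = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]

sim_orthogonal = similarity(d_r, q_s)
sim_identity = similarity(d_r, q_s, identity)
sim_correlated = similarity(d_r, q_s, correlated)
```

In this example the query term that the document lacks still contributes once a nonzero correlation with a matching term is supplied. The orthogonality assumption discards exactly this effect.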
Why generate similarity coefficients:
- Retrieved documents can be ordered
- Size of retrieved set can be adapted
- Queries can be improved using relevance feedback (based on most similar documents)
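The three uses listed above can be sketched as follows, assuming orthogonal terms so that similarity is a plain inner product. The notes do not say how relevance feedback is computed. The `feedback` function below is a simplified Rocchio-style update (positive part only) chosen purely for illustration, and `top_k`, `threshold`, and `beta` are hypothetical parameters.

```python
def inner_product(a, q):
    """Similarity under the orthogonal-terms assumption."""
    return sum(ai * qi for ai, qi in zip(a, q))


def rank(doc_vectors, query):
    """Return (doc_index, score) pairs ordered by decreasing similarity."""
    scores = [(i, inner_product(d, query)) for i, d in enumerate(doc_vectors)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def retrieve(doc_vectors, query, top_k=None, threshold=0.0):
    """Adapt the size of the retrieved set with a score threshold and/or top_k."""
    kept = [(i, s) for i, s in rank(doc_vectors, query) if s > threshold]
    return kept if top_k is None else kept[:top_k]


def feedback(query, doc_vectors, relevant_ids, beta=0.5):
    """Assumed update: q' = q + beta * centroid of the documents judged relevant."""
    if not relevant_ids:
        return list(query)
    n = len(relevant_ids)
    centroid = [sum(doc_vectors[i][k] for i in relevant_ids) / n
                for k in range(len(query))]
    return [qk + beta * ck for qk, ck in zip(query, centroid)]


docs = [[1, 0, 2], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
query = [1, 0, 1]

ranked = rank(docs, query)
retrieved = retrieve(docs, query)
best_only = retrieve(docs, query, top_k=1)
improved_query = feedback(query, docs, relevant_ids=[0])
```

Because every document receives a score, the cut-off can be moved freely. A Boolean match, by contrast, only splits the collection into retrieved and not retrieved.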
Disadvantages of vector-processing model:
- assumed orthogonality (independence) between terms
- lack of theoretical justification for vector-manipulation operations (e.g. the vector-similarity measure)
Advantages:
- simplicity
- ranked output
- vector modifications (adaptations)