SEO Old Timer Tips:
An Old Timers Perspective...from SEGuru
Search Engine Old Timer Tips:
Recently a friend of mine bought me a copy of A Theory of Indexing by Gerard Salton. It is a 50-page book from 1975 with lots of charts and math, but in those few pages it packs in a ton of information about many of the ideas that current search technologies are built upon.
I am probably going to have to read it again, because it was so dense with information and some of the math was a wee bit above me the first time around. Still, for anyone interested in learning about search technology it is a great book...much like Mike Grehan's.
A Theory of Indexing talks about a ton of interesting things like:
- signal to noise
- inverse document frequency
- discrimination value
- and lots of other stuff
Here is a small bit I learned from the last few pages...
If words exist in a high % of the total documents in a document collection then they are not usually going to be good at discriminating which documents are relevant for a particular query (since they appear in too many documents).
If words exist in a low % of the total documents then they are not usually going to be good at discriminating which documents are relevant for a particular query (since they appear in so few documents).
Words with a mid range document frequency are better discriminators.
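To make the high / mid / low idea a bit more concrete, here is a rough Python sketch that counts what share of documents each word appears in and sorts the words into three buckets. The toy documents, the function names and the 20% / 80% cutoffs are my own assumptions for illustration, not numbers from Salton's book.

```python
from collections import Counter

def document_frequency(docs):
    """Return {word: fraction of documents that contain the word}."""
    counts = Counter()
    for doc in docs:
        # set() so a word counts once per document, however often it repeats
        counts.update(set(doc.lower().split()))
    return {word: n / len(docs) for word, n in counts.items()}

def bucket_words(df, low=0.2, high=0.8):
    """Split words into low / mid / high buckets (cutoffs are assumed)."""
    buckets = {"low": [], "mid": [], "high": []}
    for word, frac in sorted(df.items()):
        if frac <= low:
            buckets["low"].append(word)
        elif frac >= high:
            buckets["high"].append(word)
        else:
            buckets["mid"].append(word)
    return buckets

# Made-up mini collection, only for illustration
docs = [
    "the search engine ranks the page",
    "the index stores the page terms",
    "the search index uses term weights",
    "the crawler fetches the page",
    "the ranking uses salton weights",
]

df = document_frequency(docs)
buckets = bucket_words(df)
print(buckets["high"])  # words in nearly every document
print(buckets["mid"])   # candidates for good discriminators
print(buckets["low"])   # words in very few documents
```

In this tiny collection "the" lands in the high bucket, words like "search" and "page" land in the middle, and one-off words like "crawler" land in the low bucket.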
To make better use of words that appear in a high % of the total documents you can combine the words into word pairs or triples - which will have a lower frequency and may be better at discriminating document relevancy.
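A quick way to see the word pair effect is to compare how many documents contain each single word with how many contain the two words side by side. This is only a sketch: the sample documents and the helper names (doc_fraction, has_word, has_pair) are hypothetical, and real systems handle phrases in far more sophisticated ways.

```python
def doc_fraction(docs, matches):
    """Fraction of documents whose word list satisfies the matches test."""
    hits = sum(1 for doc in docs if matches(doc.lower().split()))
    return hits / len(docs)

def has_word(word):
    """Test: the document contains the single word."""
    return lambda tokens: word in tokens

def has_pair(first, second):
    """Test: the document contains first immediately followed by second."""
    return lambda tokens: (first, second) in zip(tokens, tokens[1:])

# Made-up mini collection, only for illustration
docs = [
    "new york hotels",
    "new car prices in york",
    "cheap hotels in new york",
    "new search engine launched",
    "old york maps and new roads",
]

new_share = doc_fraction(docs, has_word("new"))
york_share = doc_fraction(docs, has_word("york"))
pair_share = doc_fraction(docs, has_pair("new", "york"))
print(new_share, york_share, pair_share)
```

Each word on its own shows up in most of the documents, but the pair "new york" shows up in well under half of them, which puts it closer to the mid range.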
To make better use of words that appear in a low % of the total documents you can cluster the words into groups via a thesaurus. The net effect is higher frequency word classes / clusters, which may be better at discriminating document relevancy.
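The thesaurus idea works the other way around, and can be sketched like this: map each rare word to a shared class name, then count documents by class instead of by word. The tiny THESAURUS dictionary, the class name "car" and the sample documents are all made up for illustration. A real thesaurus would be much larger and messier.

```python
# Hypothetical toy thesaurus: each word maps to the class it belongs to
THESAURUS = {
    "automobile": "car",
    "auto": "car",
    "sedan": "car",
}

def to_classes(tokens):
    """Replace each word with its thesaurus class, if it has one."""
    return [THESAURUS.get(token, token) for token in tokens]

def doc_fraction(docs, term, use_thesaurus=False):
    """Fraction of documents containing term, optionally after clustering."""
    hits = 0
    for doc in docs:
        tokens = doc.lower().split()
        if use_thesaurus:
            tokens = to_classes(tokens)
        if term in tokens:
            hits += 1
    return hits / len(docs)

# Made-up mini collection, only for illustration
docs = [
    "used automobile listings",
    "sedan safety ratings",
    "auto repair tips",
    "bicycle repair tips",
    "train schedules",
]

raw = {word: doc_fraction(docs, word) for word in ("automobile", "sedan", "auto")}
clustered = doc_fraction(docs, "car", use_thesaurus=True)
print(raw)        # each rare word appears in a small share of documents
print(clustered)  # the combined class appears in a larger share
```

Each of the three rare words appears in only one document, but once they are grouped the "car" class covers three of the five.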