The following material is based on the paper ‘Patterns in Unstructured Data’ by Clara Yu, John Cuadrado, Maciej Ceglowski and J. Scott Payne. In this paper, the authors present what is probably the best introduction to LSI and search engines currently available, and one that is popular amongst the SEO community. A copy of the paper can be found here:
http://javelina.cet.middlebury.edu/lsa/out/cover_page.htm
Advanced webmasters may wish to consult this paper themselves, although this is not an absolute requirement as the following material provides a simplified version of the research it contains.
Yu et al. begin by pointing out some of the problems faced by current search technology. The Internet is growing at an exponential rate, to the extent that, as the authors point out, Google has over 8 billion webpages in its index. This means that more and more users have access to a vast collection of information, and that search engines face the task of indexing and searching this reservoir of data to return results that are both relevant to the individual searcher and simple enough for the average user to understand.
The trouble is that, given the sheer size of the Internet and the current state of search engine technology, any relevant information we find will still appear among a ton of irrelevant pages. Therefore search engines today still face the task of coming up with ever better ways of finding relevant results for the individual searcher.
According to Yu et al., there are three main things that people expect from a search engine (what they call the ‘Holy Trinity’ of searching). These things can be defined as follows:
‘Recall’ refers to the ability of the search engine to recall relevant information for a search. This relates to our desire to gain all relevant information that exists on a topic when we search for it. ‘Precision’ relates to the fact that we want the results returned to be precise, i.e. to contain more relevant than irrelevant information. Finally, ‘ranking’ relates to the fact that we don’t expect the results returned to be presented to us in a random manner. We expect them to be ranked in such a way that the search engine presents what it perceives to be the most relevant results for our specific search first, and the least relevant results last.
We can use these criteria to judge the efficacy of any current search technology, e.g. Google. When we use a search engine, we expect it not to omit information (recall), to return relevant results (precision) and to arrange those results in order of relevance in its SERPs (ranking). Yu et al. envision that the ideal search engine would be able to search every document on the Internet and return up-to-date results quickly while still satisfying these criteria. However, whereas it is relatively easy for a search engine to increase its scope or speed up its searches, as this largely involves investing in additional resources, it is still difficult for a search engine to improve upon the recall, precision and ranking of searches. This is where latent semantic indexing comes in.
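To make recall and precision concrete, here is a minimal sketch in Python. It uses the standard information-retrieval definitions (precision as the share of returned documents that are relevant, recall as the share of all relevant documents that were returned), which are slightly more formal than the wording above. The function name and the document ids are hypothetical and chosen purely for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single query.

    returned: list of document ids the search engine returned
    relevant: set of document ids that are actually relevant
    """
    returned_set = set(returned)
    hits = returned_set & relevant  # relevant documents that were returned
    precision = len(hits) / len(returned_set) if returned_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
returned = ["d1", "d2", "d3", "d7"]
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
precision, recall = precision_recall(returned, relevant)
print(precision, recall)  # 0.75 0.5
```

In this example the engine is fairly precise (three of its four results are relevant) but its recall is poor, because half of the relevant documents were never returned.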
Yu et al. outline two main ways in which we can search a collection of documents such as the Internet. For the sake of simplicity we will call these methods:
1) Human
2) Mechanical
The first type of search is not likely to be exhaustive, as nobody has the time or resources to go through a whole collection of documents word by word (can you imagine reading all the pages on the Internet?). Human beings are more likely to scan pages for relevant information than to read them in full to see if they happen to contain the phrase or information they are looking for.
Although this kind of search is not exhaustive, it is based on a high-level understanding of context, in the sense that human searchers usually know that certain parts of documents - e.g. page titles, headings, indexes - usually contain relevant information regarding what a page is about. Because this kind of search is carried out with an understanding of context, it can also successfully uncover relevant information in unexpected places, e.g. in articles which are not dedicated to the subject we were originally looking for.
The second type of search is exhaustive in the sense that it works methodically and mechanically through an entire collection of documents, noting down every single mention of the topic we are looking for. Computers are particularly good at this kind of task.
Although this second search can find every single instance where a term is mentioned, it has no understanding of context. Without this understanding of context, the computer cannot return documents that are related to our search but that don’t actually contain our search terms. Conversely, the search engine returns documents that mention our specific search terms but that use those terms in the wrong context (see the problems of synonymy and polysemy outlined above).
In short, a human search understands context but remains inexhaustive or sorely incomplete, while a mechanical search is exhaustive but has no understanding of context.
An obvious solution to the difficulty of searching a collection of documents like the Internet would be to find a way to combine the two. That way we would have the best of both worlds, where an exhaustive mechanical search would also display an understanding of context, thereby allowing ‘synonymous’ or related material to be found while cutting out the irrelevant material caused by the problem of polysemy.
Yu et al. point out that past attempts to combine mechanical searches with a human element have met with only limited success. Attempts to supplement searches by providing a computer with a human-compiled list of synonyms to search have not proved successful. Surprisingly, there would also be shortcomings were we to employ a human ‘taxonomy’ or a system of classification such as the systems that have been used by libraries for generations (e.g. the Library of Congress). Under such systems, documents are classified according to different human-defined categories (e.g. a library book could belong to categories such as science, natural philosophy, natural history, and so forth).
Even though traditional archivists successfully employ such systems, they might not work so well for the Internet. How, for example, would one find the means and resources to go about placing the billions of pages on the Internet into little pigeonholes? What, moreover, would happen if many of these documents were relevant to more than one category (as most inevitably will be), or if your average Internet user didn’t know which category their search belongs to?
Latent Semantic Indexing can be said to be a solution to the above problems in that it appears to offer a ‘middle ground’ between the two methods outlined above. LSI offers an exhaustive search that still understands context. Better still, LSI is entirely mechanical and doesn’t require human intervention.
In the last unit of this course, we pointed out that a search engine attempts to find relevant results for a search query by finding pages that contain the terms used in that query. For example, a search for ‘mobile phone accessories’ will return pages that actually mention the words ‘mobile’, ‘phone’, and ‘accessories’.
This system is not ideal, as it deems all pages that don’t contain our specific search term as irrelevant, even if those pages potentially contain information that is relevant to our search.
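The limitation described above can be illustrated with a short Python sketch of a purely mechanical keyword search. This is a simplified illustration and not how any particular search engine is implemented; the function name and the three sample pages are hypothetical.

```python
def keyword_search(query, documents):
    """Return the ids of documents that contain every word in the query."""
    terms = query.lower().split()
    matches = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A page only matches if it literally mentions every query term.
        if all(term in words for term in terms):
            matches.append(doc_id)
    return matches


# Hypothetical pages for illustration.
documents = {
    "a": "cheap mobile phone accessories and chargers",
    "b": "cases and chargers for your handset",
    "c": "mobile home insurance quotes",
}
results = keyword_search("mobile phone accessories", documents)
print(results)  # ['a']
```

Page "b" is arguably relevant to the search but is never returned, because it does not contain the exact query terms. Page "c" shares the word ‘mobile’ but uses it in a different context, and would surface if the engine matched on any single term instead of all of them.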
As Yu et al. suggest, LSI still takes account of the words a document contains, but it takes the extra step of examining the document collection as a whole to see which other documents contain the same words. If it finds other documents that share many of the same words, it considers them to be ‘semantically close’. ‘Semantically distant’ documents, by contrast, are documents that don’t have a lot of words in common.
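One common way to put a number on how many words two documents have in common is cosine similarity over word counts. The Python sketch below is only an illustration of that idea, under the assumption that documents are compared by raw word counts; it is not the full LSI procedure, and the sample texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents for illustration.
doc1 = "used cars and automobiles for sale"
doc2 = "automobiles and vehicles for sale"
doc3 = "recipes for chocolate cake"

close = cosine_similarity(doc1, doc2)    # many shared words
distant = cosine_similarity(doc1, doc3)  # only 'for' is shared
print(close > distant)  # True
```

A score near 1 means two documents use largely the same vocabulary, while a score near 0 means they share almost nothing. In practice, very common words such as ‘and’ and ‘for’ are usually removed or down-weighted before comparing.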
The important thing to note here is that, by calculating the similarity values between documents, LSI can actually find words that are semantically related. For example, if the terms ‘cars’, ‘automobiles’ and ‘vehicles’ appear together in enough documents on the Internet, LSI will consider them to be semantically related. Therefore, a search engine that uses LSI in its index will return pages that mention ‘vehicles’ when you search for ‘cars’.
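The sketch below shows this effect on a toy collection. LSI is conventionally implemented with a truncated singular value decomposition (SVD) of a term-document matrix, and that is what the code does using NumPy. The six tiny documents, the choice of two latent dimensions (`k = 2`) and the use of raw word counts are all assumptions made for illustration. Real systems typically apply term weighting and keep far more dimensions.

```python
import numpy as np

terms = ["cars", "automobiles", "vehicles", "cake", "recipes"]
docs = [
    "cars automobiles",
    "cars vehicles",
    "automobiles vehicles",
    "vehicles",        # never mentions 'cars'
    "cake recipes",
    "cake",
]

# Term-document matrix: one row per term, one column per document.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# Truncated SVD: keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed value for this toy example
U_k = U[:, :k]


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# The query 'cars' as a term vector, then projected into the latent space.
query = np.array([1.0 if t == "cars" else 0.0 for t in terms])
query_k = U_k.T @ query
docs_k = U_k.T @ A  # every document projected into the latent space

raw_scores = [cosine(query, A[:, j]) for j in range(len(docs))]
lsi_scores = [cosine(query_k, docs_k[:, j]) for j in range(len(docs))]

for doc, raw, lsi in zip(docs, raw_scores, lsi_scores):
    print(f"{doc!r:25} raw={raw:.2f} lsi={lsi:.2f}")
```

With plain term matching, the document that contains only ‘vehicles’ scores zero for the query ‘cars’. In the reduced space it scores highly, because ‘cars’, ‘automobiles’ and ‘vehicles’ co-occur across the collection and collapse into the same latent dimension, while the cooking documents stay close to zero.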
In short, then, Latent Semantic Indexing enhances our searches by taking account of related terms. By looking at enough documents on the Internet, it can find which words are related to other words, or words that are synonymous with other words. A search engine that uses LSI can thereby return documents that are relevant to but outside of our specific search query.