Semantic search isn’t easy: An understanding

Posted: January 12, 2007
I’ve done a lot of work on improving and adding semantic search to engines, and let me tell you, it’s one of the most difficult concepts to get your head around as a SEA (search engine author). Trying to teach a machine to learn for itself what a word actually means is hugely complex and will drive you insane if you forget your meds.
I found that when I was searching for information on how semantic search works, there were very few papers on the subject online, and those that existed were, in a word: crap. Once you understand what’s going on, though, the development is actually the simple bit. Here are some of my research notes (unfortunately minus the solution; hopefully you’ll be inspired enough, once you grasp it, to create your own).
In order to generate semantically linked results and keyword suggestions, I need to incorporate some degree of fuzzy matching into the Optren core.
There will be two tables used for generating semantically linked words. The primary data table is already collated as the keyword density table by the search crawler, and it will allow me to compile data that can be used in the following ways:
To build a dictionary that can be scoped to a particular page or user account, or serve as a master.
To extrapolate words that frequently occur together – for use in a thesaurus.
To calculate letter-by-letter percentage similarity between words, in order to gather plurals, misspellings and variations.
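As a rough illustration of that last point, the letter-by-letter similarity check can be sketched with Python's `difflib` (the 80% threshold here is my own illustrative choice, not a figure from the engine):

```python
from difflib import SequenceMatcher

def letter_similarity(a, b):
    """Percentage similarity between two words, letter by letter."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100

def is_variant(a, b, threshold=80.0):
    # Words above the (illustrative) threshold are treated as plurals,
    # misspellings or variations of one another
    return a != b and letter_similarity(a, b) >= threshold

print(is_variant("dog", "dogs"))       # plural -> True
print(is_variant("kennel", "kennle"))  # misspelling -> True
print(is_variant("dog", "kennel"))     # unrelated -> False
```

The threshold would need tuning in practice; too low and unrelated words collapse together, too high and legitimate plurals get missed.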
Extrapolating the words that frequently appear together is going to be the most difficult part of this engine addon by far. It requires intensive processing to go through each word on a page and all the possible semantic matches that may exist. Discovering the ‘sense’ of a word is either never done at all, performed by humans, or left to some pretty lame equations that treat the distance between words in a sentence as a measure of their relationship. I don’t want to do that – we need a brand new process here if this is going to scale effectively.
This is a brief outline of the process I’ll need to run through to gather the required information:
First off, a full master dictionary is compiled from the keyword density table, omitting repeated words.
Each word in the dictionary has what’s known as a primary id, which is used to map the location of the word in the main dictionary table.
A second table is created, containing words with a high percentage similarity letter by letter. This is the alternative spellings, variations and plurals table.
A third table is created to store related words; it is in effect a thesaurus built from semantic context.
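The three tables above could be laid out something like this (a sketch using SQLite; the table and column names are my own, not from the engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1. Master dictionary: one row per unique word, keyed by a primary id
CREATE TABLE dictionary (
    word_id INTEGER PRIMARY KEY,
    word    TEXT UNIQUE NOT NULL
);
-- 2. Alternative spellings, variations and plurals (high letter similarity)
CREATE TABLE variants (
    word_id    INTEGER REFERENCES dictionary(word_id),
    variant_id INTEGER REFERENCES dictionary(word_id),
    similarity REAL      -- letter-by-letter percentage
);
-- 3. Thesaurus built from semantic context
CREATE TABLE thesaurus (
    word_id    INTEGER REFERENCES dictionary(word_id),
    related_id INTEGER REFERENCES dictionary(word_id),
    weight     INTEGER   -- cumulative co-occurrence count
);
""")
```

Both secondary tables reference words only by their primary id, so the actual strings are stored once, in the dictionary.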
The real problem here is creating that third thesaurus table. Think of the problem in human terms: if you’re given the word ‘dog’, how do you deduce that the word ‘kennel’ is related? Obviously, you search your knowledge of the sense of the word dog, derive that a dog is a domestic animal, and relate that to linked information in your brain.
How can you teach a computer the same trick, though? Obviously the missing components are that the computer has no sense of any of the words in its dictionary and no knowledge to draw links from – both are required if we’re going to map the thesaurus as a human would.
With the emerging web 3.0, it’s becoming easier for computers to learn the sense of a word. Tagging means that links and articles can be attributed keywords that allow the computer to categorise them. The knowledge base for a computer program is increasingly becoming the Internet itself; with faster connections and processors, developers can use it like a giant repository of information.
Fuzzy matching websites?
Quite often, you’ll find one website that’s immediately relevant, and a whole host (often millions) of websites that are ridiculously off topic. It’s time-consuming to trawl through results page by page hoping to get another relevant hit.
A possible solution to this would be to fuzzy match entire websites; perhaps I should now explain exactly what I mean by that:
The thesaurus I was talking about earlier makes it possible to calculate all possible variations of the words on a specific web page. I don’t believe the result needs to be grammatically correct; this is merely to allow other variations of certain key words. As with most search queries, the stop words should probably be filtered out to ‘purify’ the data we’re using.
If we now have several pages based on the original one, we generate one long string from all of them.
We match the string against the cache of every page in the index; the pages that return the highest percentage similarity are our top fuzzy-matched results.
I’m not sure if this method has been implemented anywhere before – I don’t recall seeing it, and it’s possibly too intensive. But if we have a 1,000-keyword page with an average of 4 synonyms per word, the number of permutations of that page isn’t the question (no factorial calculations needed, phew). Since we’re merely calculating a percentage similarity against the result, all we need to add to our page string is a sequential list of synonyms for each word, meaning we generate 4,000 extra terms for our fuzzy-matching string.
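The whole pipeline above can be sketched in a few lines (the stop-word list and thesaurus here are tiny illustrative stand-ins for the real engine's data):

```python
from difflib import SequenceMatcher

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # tiny illustrative set

def page_string(words, thesaurus):
    # Filter stop words to 'purify' the data, then append a sequential
    # list of synonyms for each remaining keyword
    terms = [w for w in words if w not in STOP_WORDS]
    expanded = list(terms)
    for w in terms:
        expanded.extend(thesaurus.get(w, []))
    return " ".join(expanded)

def fuzzy_match_sites(query_words, cached_pages, thesaurus):
    # cached_pages: {url: keyword list}; rank every cached page by its
    # percentage similarity to the synonym-expanded query string
    query = page_string(query_words, thesaurus)
    scored = [(SequenceMatcher(None, query, page_string(words, {})).ratio() * 100, url)
              for url, words in cached_pages.items()]
    return sorted(scored, reverse=True)
```

Note the growth is linear, as claimed: a 1,000-keyword page with 4 synonyms per word expands to roughly 5,000 terms, not a factorial blow-up.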
Competition and previous research
There is currently only one search engine, still in beta, that carries out any kind of true semantic analysis to retrieve its search results: www.hakia.com. The founder of the project is a former nuclear scientist who worked for the US government for 10 years and specialises in AI. Here’s how Hakia explains the complexity of the subject:
Creating a meaning-based search engine, in the true sense, is a huge undertaking. The human brain takes almost a decade of training to be fully cognitive in natural languages. Although computers may be considered faster processors of information, the required “training” for them still takes considerable time and resources. In an analogy, the cognitive growth of Hakia this year will be like a baby starting to speak, needless to say, we will be proud parents!
There are other competitors that claim to use semantic analysis, but in truth either don’t – or are nowhere near the level needed to claim that they do.
Other papers detail algorithms that attempt to draw semantic relationships from the distance between words in a string of text. This is open to any number of variables which simply cannot be accounted for, and in my eyes it is about as useful as guessing. Grammar dictates the positioning of words in a sentence, writing style dictates the position of sentences in a paragraph, then page structure, and so on; whether two words are related has very little to do with the distance between them.
The probability that they are related does increase if they are used on the same page, though, and that gives us a statistical base.
Web crawlers have been used to pull information on words from Wikipedia in order to find a word’s sense; this is a better idea, because it draws on a knowledge resource to improve its information. However, knowledge is power: statistically speaking, the more data that can be analysed, the more accurate our thesaurus is going to be. We may have an ineffectual measure of what a word’s sense is, but the more data we gather on the word in question, the more likely the definition will be correct.
So how exactly do you propose we do this?
The answer is mind-numbingly simple. I mentioned earlier that humans do this very easily; even small kids who have never heard a word before will eventually deduce what the word means by simple word association.
When I talked about the distance between words and its startling lack of usefulness, I was right. Let’s face it – no one puts two words with the same meaning next to each other. They might put one in the next sentence, though, if they don’t want to appear to be repeating themselves.
What needs to be done is the following (this would create one way relationships between keywords):
A web page is saved to an array of all the keywords it contains.
Starting with the first word in the array, at position 0, I then iterate through every other word in the array and assign each as a child of the first keyword in the thesaurus table, with a value of the number of times it occurs on that page.
The previous step is then repeated for every word in the array.
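Those three steps can be sketched directly (a minimal Python version; the real engine would persist this to the thesaurus table rather than an in-memory dict):

```python
from collections import Counter

def add_page_to_thesaurus(page_keywords, thesaurus):
    # thesaurus: {word: Counter of child words} -- one-way relationships.
    # Calling this for page after page accumulates the cumulative weights.
    counts = Counter(page_keywords)
    for word in counts:                    # repeated for every word in the array
        children = thesaurus.setdefault(word, Counter())
        for other, n in counts.items():    # every *other* word on the page
            if other != word:
                children[other] += n       # weighted by occurrences on this page
    return thesaurus
```

After feeding in many pages, a genuinely related pair like ‘dog’ and ‘kennel’ should accumulate a high weight, while coincidental pairings stay near the noise floor.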
At this stage, the chances of the words being actually related are extremely slim – and doing it this way continually is going to create a huge database full of irrelevant and tenuous keyword relationships. What we need here is more information (web pages) to generate cumulative data on each keyword in our thesaurus master table. Statistics suggest that with more data we will get increasingly accurate relationships between keywords.
There are innumerable ways to optimise and improve this process; this is an attempt to explain a very complex subject as succinctly as possible.
Like the child, this takes a long time; but because the program is dedicated to the task, the computer should perform it comparatively quickly. Bear in mind a child will take ten years to gain the kind of understanding I’m suggesting. This method is the closest I can figure, though, to the way a human brain begins to understand. I’d like to introduce some extra methods as well that a human can also use:
The child can be told, so our search program needs to be able to be told as well. Anything it’s told should be treated as a fact. The occurrence threshold that separates facts from opinions (as initial relationships will be) will eventually be hard-defined as a number.
The child can also ask those who know better. In effect this means using websites with high amounts of content, large vocabulary bases and variation in order to learn as quickly as possible. Sites like Wikipedia, blogs and news websites are excellent for this.
If a word is heard more than once and not understood, the child is more likely to ask what it means. So keywords with higher densities on a page are actually the fastest way to create quality relationships, and they allow us to limit the amount and size of the strings we’re parsing. Basically, it speeds things up.
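That last heuristic is easy to sketch: only build relationships from words whose density on the page clears a cut-off (the 2% figure is purely illustrative):

```python
from collections import Counter

def high_density_keywords(page_words, min_density=0.02):
    # density = occurrences / total words on the page; only words above
    # the (illustrative) cut-off get relationships built for them,
    # shrinking the strings we have to parse
    total = len(page_words)
    counts = Counter(page_words)
    return {w for w, n in counts.items() if n / total >= min_density}
```

Repeated words clear the bar while one-off filler words are skipped, which is exactly the ‘heard more than once’ behaviour described above.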