Throw a web interface on top of the database queries.
Unless noted otherwise, all applications discussed here are written in
C/C++. Special care was taken to make the code use 64-bit datatypes for
next-generation hardware readiness. Altogether there are about 15K lines
of code. While written primarily for FreeBSD, it also works on Linux,
although the Linux filesystem currently cannot handle data of the sizes
we deal with nearly as well as FreeBSD.
Get some URLs to visit
Drilling a bit deeper still, we create a large list of urls (literally a
big text file, with one url per line). This list comes from already
screened directories (such as open directory) and/or url lists
submitted to us in the form of "trusted feeds". One of the complex problems
to be solved here is removing duplicates. Nasty problems that come up
here include url redirects, dns redirects, ip addresses, server side
includes that produce the same pages, relative directory paths, and so on.
Essentially, it is possible to have many different urls that all produce the
same page. So we have a "scrubber process" which tries to "clean up"
these lists so we end up with a list of normalized urls. SearchHippo only
indexes text pages today, so things like .zip or .mp3 files (though
not pages referring to them) are also removed.
Spider and parse the URLs
Next, Fluffy, our spider, is let loose on these urls. Fluffy is an executable
program which one can run like any other program. A user provides a list
of urls that are to be automatically "pulled" (like having a web browser
visit each one in an automated fashion). Fluffy is very easy to distribute
and we usually run about 30 or 40 copies at a time on different chunks of the
urls we want to pull. Each time fluffy retrieves a particular url, it is
checked for certain HTTP response types like redirects or file-not-found
errors. If there is a redirect, we'll go fetch the new page (though we
stop redirecting to avoid infinite loops after some number of redirects).
Assuming we receive an "OK" message back from the remote HTTP server, and
the message content type is text, fluffy will parse the page and produce
a blob of meta data. This meta data includes things such as the title,
meta tags, the raw content of the page and pieces of the page the parser
considers relevant (such as image alt tags). A separate piece of this parsing
process also generates a list of the links that are on each page, whether
it is a "local" link, a frame, etc. and stores this information in a separate
file. Potentially this could be used to retrieve yet more data (yielding
the equivalent of a breadth first search), or in this case, it is an input
to the relevancy ranking mechanism.
Yes! Before a query and before anything is indexed, the relevancy ranker
sorts the actual meta data file based on the aforementioned "local/remote"
links pointing to it, data from our log files (how many users clicked on this
url), and url depth (is it a homepage or some other page buried within the
site). All this information is combined together and then used as a key for
sorting this file.
Next, additional filters are applied to the meta data as a further cleaning
step. This includes tagging records with "adult content" flags and
removing urls which have been banned from the engine for abuse and/or spamming.
Index the data
Finally we have reached the indexing step. At this point, we literally have
one huge file. The indexer takes this huge file and breaks it up into
four separate ones: a description file (containing the raw description),
an index, a token file (dictionary) and a record pointer file (for faster
access). The somewhat exciting part here is the index. Like many search
engines, an inverted index is used in concept: get a word, and produce a list
of which records have that word. So if we had a simple two record database:
|record 1||a b c|
|record 2||c f b|
Our inverted index would have:
|a||1|
|b||1 2|
|c||1 2|
|f||2|
where the first column becomes the aforementioned dictionary and the second
column holds pointers to the records containing the keys from the dictionary.
Now one can say "give me the records with 'c' in them" and you get back a list
with record "1" and "2".
It is important to note here that both these lists are intentionally sorted.
Since we can preprocess the data and make it sorted, this enables us to use
search algorithms which are much better than on arbitrary data.
Now one has the cornerstone of the 'database indexer' that can be used to
index most anything. In reality, each piece of data that is indexed by
the system is tagged such that we also know some things about it (for example,
is this word in the title of the page, is it a meta keyword, is it an image alt
tag, and so on).
The search server takes a query from the user and parses it into the individual
'words'. At the core of the search server is a page cached zig-zag algorithm
for searching the index. One query term (like the example above) is boring -
when a user enters two or more terms, we need to quickly get a bunch of lists
of pointers to records and determine where the matches are. Consider two
sorted lists:
|list 1||1 3 5 10 11|
|list 2||2 3 9 11|
We start at the beginning of both lists. We see the first list has a "1"
while the second has a "2". If they are the same, we have a match. Otherwise, we
look for the "larger" number, in this case 2, in all the lists. If we don't
find it, we return back the next larger number in the list. So, we search in
the first list, don't find it, and get back a 3. We compare the 3 from the
first list to the 2 in the second and search for 3 in the second. We find it,
so there is a match. We then look for "4" in each list (which would be the next
possible match). Instead we get a 5 and a 9. Then we get a 9 and a 10. Then
a 10 and 11. Then an 11 and 11, and so on.
Obviously this algorithm can be very expensive when there are many long
lists that intersect only infrequently. Since these lists
are on the disk, we cache pages of these to make the search faster. That is,
the first xyz numbers in a list can be retrieved and stored in memory for later
use in the short term.
Now that the basic mechanism for searching is there, we search in specific
orders for specific things and assign weights. For example, we weigh the
title of a page more heavily than the content of the page, so these items are
retrieved before arbitrary items within a page. When a query contains no
"adult words", we also need to make sure we search only for pages
which appear to be clean. There is also a merging process that takes place
here. In the current implementation, there are actually five different
indexes all of which get merged together at run time in order to produce
results. This makes it easy to expand the index without having to rebuild
everything. It also becomes easy to distribute the data over multiple
machines if necessary because this merge process is effectively "merging sorted
lists" which is a not-very-expensive operation.
A Web Interface
The search server itself is also literally an executable program. The
input is the rough equivalent of a url and the output is an XML document. Of
course, with fast-cgi we have a persistent process so we do not re-run and
terminate the program on every query, but rather have it sleep until another
one comes. Our web interface is written in PHP -- each time a query is
submitted through the form (or even our XML interface), it hits a PHP
program which then calls the search server, parses the XML document and builds
the new document to spew out to the end user (or application).
Just the numbers please
As of Oct 8, 2002:
|URLs spidered returning indexable data (i.e. excluding redirects, 404's, etc.)||~13M (of 40M known universe)
|Meta data from parse
|Description data file
|Index of meta data into description
|Size of token data file
|Size of record pointer file
|Index building machine
||1xP4-2.54Ghz, 2G RAM, 16x18G SCSI 10K RPM, 4x70G SCSI 15K RPM, 1x60G EIDE, Adaptec RAID 3200S
|Search server configuration
||2.2Ghz P4, 2G RAM, 3x18G SCSI 15K RPM Adaptec 3200S RAID, 1x60G EIDE
||2.0Ghz P4, 2G RAM, 3x18G SCSI 15K RPM Adaptec 2100 RAID, 1x60G EIDE
||2x1Ghz P3, 4G RAM, 2x60G EIDE
|Web server configuration
||2x2.4Ghz P4, 1G RAM, 1x80G EIDE (cpu needed to support HTTP compression)
|Average queries per day
|Average query response time
|Time to build index
|Maximum (acceptable) load per search server
|Current peak load (across all search servers)
|Average bandwidth utilization (currently a DS1 [1.54Mbps] from Cable and Wireless)
|Peak bandwidth utilization
|Total amount of "hobby money" spent
|What's it look like?
||See pictures of the den.
|How can I run a search engine?
||Our partner page has free integration code
Copyright (c) 2001-2008, SearchHippo.com (TM)