
How does the SearchHippo search engine really work? ("in English")

The concept of search engines is pretty straightforward:
  • Have an automated program ("robot", "spider") fetch URLs.
  • Extract relevant information from the pages returned by web servers and put it into some sort of database (parse, index)
  • Use some algorithm for executing database queries to return results relevant to the query performed by the user.
  • Throw a web interface on top of the database queries.

    Unless noted otherwise, all applications discussed here are written in C/C++. Special care was taken to make the code use 64-bit datatypes for next-generation hardware readiness. Altogether there are about 15K lines of code. While written primarily for FreeBSD, it also works on Linux, although the Linux filesystem currently does not handle data at the sizes we deal with nearly as well as FreeBSD's.

    Get some URLs to visit

    Drilling a bit deeper still, we create a large list of URLs (literally a big text file, with one URL per line). This list comes from already-screened directories (such as the Open Directory) and/or URL lists submitted to us in the form of "trusted feeds". One of the complex problems to be solved here is removing duplicates. Nasty cases include URL redirects, DNS redirects, raw IP addresses, server-side includes that produce the same pages, relative directory paths, and so on. Essentially, it is possible for many different URLs to all produce the same page. So we have a "scrubber" process which tries to clean up these lists so we end up with a list of normalized URLs. SearchHippo only indexes text pages today, so things like .zip or .mp3 files (though not pages referring to them) are also removed.

    Spider and parse the URLs

    Next, Fluffy, our spider, is let loose on these URLs. Fluffy is an executable program which one can run like any other program. A user provides a list of URLs that are to be automatically "pulled" (like having a web browser visit each one in an automated fashion). Fluffy is very easy to distribute, and we usually run about 30 or 40 copies at a time on different chunks of the URLs we want to pull. Each time Fluffy retrieves a particular URL, the response is checked for certain HTTP response types like redirects or file-not-found errors. If there is a redirect, we'll go fetch the new page (though we stop after some number of redirects to avoid infinite loops). Assuming we receive an "OK" message back from the remote HTTP server, and the message content type is text, Fluffy will parse the page and produce a blob of meta data. This meta data includes things such as the title, meta tags, the raw content of the page, and pieces of the page the parser considers relevant (such as image alt tags). A separate piece of this parsing process also generates a list of the links on each page (whether each is a "local" link, a frame, etc.) and stores this information in a separate file. Potentially this could be used to retrieve yet more data (yielding the equivalent of a breadth-first search); in this case, it is an input to the relevancy ranking mechanism.

    Relevancy Rank

    Yes! Before any query is run and before anything is indexed, the relevancy ranker sorts the actual meta data file based on the aforementioned "local/remote" links pointing to each URL, data from our log files (how many users clicked on this URL), and URL depth (is it a homepage or some other page buried within the site?). All this information is combined and then used as a key for sorting the file.

    More filters!

    Next, additional filters are run over the meta data as a further cleaning pass. This includes tagging records with "adult content" flags and removing URLs which have been banned from the engine for abuse and/or spamming.

    Index the data

    Finally we have reached the indexing step. At this point, we literally have one huge file. The indexer takes this huge file and breaks it up into four separate ones: a description file (containing the raw descriptions), an index, a token file (the dictionary), and a record pointer file (for faster access). The somewhat exciting part here is the index. Like many search engines, we use an inverted index: given a word, produce a list of which records contain that word. So if we had a simple two-record database:

    record 1: a b c
    record 2: c f b

    Our inverted index would have:

    a    1
    b    1, 2
    c    1, 2
    f    2

    where the first column becomes the aforementioned dictionary and the second column holds pointers to the records containing each key from the dictionary. Now one can say "give me the records with 'c' in them" and get back a list with records "1" and "2".

    It is important to note here that both of these lists are intentionally sorted. Since we can preprocess the data to sort it, we can use search algorithms that are much faster than anything possible on arbitrary data. With that, one has the cornerstone of a 'database indexer' that can be used to index almost anything. In reality, each piece of data indexed by the system is tagged so that we also know some things about it (for example: is this word in the title of the page, is it a meta keyword, is it an image alt tag, etc.).

    Search server

    The search server takes a query from the user and parses it into individual 'words'. At the core of the search server is a page-cached zig-zag algorithm for searching the index. One query term (like the example above) is boring; when a user enters two or more terms, we need to quickly get a bunch of lists of pointers to records and determine where the matches are. Consider:

    list 1: 1, 3, 5, 7, 8, 10, 11, ...
    list 2: 2, 3, 9, 11, ...

    We start at the beginning of both lists. The first list has a "1" while the second has a "2". If they were the same, we would have a match. Otherwise, we look for the "larger" number, in this case 2, in the other lists. If we don't find it, we get back the next larger number in that list. So we search the first list, don't find a 2, and get back a 3. We compare the 3 from the first list to the 2 in the second and search for 3 in the second. We find it, so there is a match. We then look for "4" in each list (which would be the next possible match). Instead we get a 5 and a 9. Then a 9 and a 10. Then a 10 and an 11. Then an 11 and an 11, another match, and so on.

    Obviously this algorithm can be very nasty when there are many long lists that do not intersect frequently. Since these lists live on disk, we cache pages of them to make searching faster. That is, the first xyz numbers in a list can be retrieved and kept in memory for short-term reuse.

    Now that the basic mechanism for searching is there, we search in specific orders for specific things and assign weights. For example, we weigh the title of a page more heavily than the content of the page, so title matches are retrieved before arbitrary matches within a page. A query may also be required not to match "adult words", in which case we need to make sure we return only pages which appear to be clean. There is also a merging process that takes place here: in the current implementation, there are actually five different indexes, all of which get merged together at run time to produce results. This makes it easy to expand the index without having to rebuild everything. It also becomes easy to distribute the data over multiple machines if necessary, because this merge process is effectively "merging sorted lists", which is a not-very-expensive operation.

    A Web Interface

    The search server itself is also literally an executable program. The input is the rough equivalent of a URL and the output is an XML document. Of course, with FastCGI we have a persistent process, so we do not re-run and terminate the program on every query, but rather have it sleep until another one comes. Our web interface is written in PHP: each time a query is submitted through the form (or even our XML interface), it hits a PHP program which calls the search server, parses the XML document, and builds the new document to spew out to the end user (or application).

    [Diagram: data gathering]

    [Diagram: search engine architecture]

    Just the numbers, please

    As of Oct 8, 2002:

    URLs spidered returning indexable data (i.e. excluding redirects, 404s, etc.): ~13M (of a 40M known universe)
    Meta data from parse: ~93G
    Description data file: ~3.5G
    Index of meta data into description: ~15G
    Size of token data file: ~1.7G
    Size of record pointer file: ~171M
    Index building machine: 1x P4 2.54GHz, 2G RAM, 16x 18G SCSI 10K RPM, 4x 70G SCSI 15K RPM, 1x 60G EIDE, Adaptec RAID 3200S
    Search server configurations: 2.2GHz P4, 2G RAM, 3x 18G SCSI 15K RPM, Adaptec 3200S RAID, 1x 60G EIDE
        2.0GHz P4, 2G RAM, 3x 18G SCSI 15K RPM, Adaptec 2100 RAID, 1x 60G EIDE
        2x 1GHz P3, 4G RAM, 2x 60G EIDE
    Web server configuration: 2x 2.4GHz P4, 1G RAM, 1x 80G EIDE (CPU needed to support HTTP compression)
    Average queries per day: ~2.3M
    Average query response time: 0.37s
    Time to build index: ~2 days
    Maximum (acceptable) load per search server: 65 queries/sec
    Current peak load (across all search servers): 45 queries/sec
    Average bandwidth utilization (currently a DS1 [1.54Mbps] from Cable and Wireless): 65%
    Peak bandwidth utilization: 90%
    Total amount of "hobby money" spent: ~$85,000

    What's it look like? See pictures of the den.
    How can I run a search engine? Our partner page has free integration code.
