Implementing Site Searching on GeodSoft.com
- 6/1/00
Although Index Server had been working for months before I got serious about the
GeodSoft.com web site, I could never get it configured to work with the new virtual
site. Swish-e was very easy to set up on Linux and provides fully
functional, though somewhat slow, searching.
I've previously mentioned problems with Microsoft's Index
Server. The server selection page describes the problems I had getting
Index Server to work on a virtual site, and More NT Quirks describes
Index Server's subsequent failure to keep indexing what it had been
indexing for months. Almost a month later the situation hasn't changed.
Index Server now returns only pages in the virtual site, but the
search form is set up in my default site, so all the relative URLs are
wrong: they point into the virtual site and don't exist in the default
site. The result is that no searching works on the NT server.
Since I'd already spent more time than I thought was reasonable on Index
Server, I decided to stop banging my head against the wall and see what
new things I could learn. The Apache FAQ page quickly led me to two
"Open source search engines that are often used with Apache." These were
Swish-e and
DIG. It also had a link to a page of
Web Site Search
Tools, which had dozens of links to a wide range of tools, from simple
free open source products to incredibly expensive commercial products.
The tools were categorized by development environment: Perl, Java, etc.
I spent a while looking at the descriptions of a variety of tools, focusing
mostly on Perl and Java products. The descriptions of Swish-e and DIG were
complete enough that it was not clear how any of the other tools would be
superior to either of them, so I decided to focus on the two products
specifically mentioned in the Apache FAQ. While it's quite possible that
superior products exist, even free ones, it's not clear how I could find
them without a lot of investigative work, which probably means installing
and testing each candidate. The most widely used products are usually
solid ones.
Between the two products, the feature that caused me to select Swish-e
over DIG was Swish-e's ability to index a local file system. Both products
include spidering capabilities, the ability to index multiple sites via
HTTP. While I plan to have multiple sites, these sites will have
duplicate content, so I don't want them indexed together; that would
yield duplicate search results. It also seems to me that if you don't
need to index multiple sites, file system based indexing is likely to be
much more efficient than HTTP based indexing.
I began reading and printing Swish-e documentation, and after going through
the installation and readme docs, downloaded the product as a tar.gz file.
Uncompress on the Red Hat Linux system refused to expand the file, but
zcat piped to tar unpacked it successfully.
I had to find the location of gcc and change one line in the Makefile.
Swish-e compiled with a few warning messages the first time and successfully
completed the test indexing job that was described in the install
documentation.
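For the record, the whole build amounted to something like this; the archive
name and source layout are assumed from memory:

    zcat swish-e.tar.gz | tar xvf -   # uncompress would not expand the archive
    cd swish-e*/src                   # Makefile location varies by version
    vi Makefile                       # point CC at the gcc 'which gcc' reports
    make                              # compiles, with a few warnings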
It was very quickly clear that I needed to get a better understanding
of the Swish-e config file and command line options before I would be
able to make practical use of it. I skimmed through the documentation,
printing it as I went. I also reviewed the front end tools listed
on the Swish-e site for making Swish-e available via the web.
I downloaded three different products, but the HTTP downloads of tar.gz
files that first went to my NT workstation were corrupted, probably
because NT saved them as text files. (The NT workstation is still the
only machine I have with an Internet connection. Getting a DSL line
is starting to look like a story in itself.)
After about an hour of reading documentation and changing options in
the configuration file, I was ready to try swish-e from the command
line again. The first attempt resulted in a number of config file
syntax errors which were identified by line number. Most were spider
options that the documentation clearly said I had to comment out but
that I had missed. After commenting out these problem lines, Swish-e
successfully indexed my site and the Apache documentation on the second
attempt.
Printed to standard out was the list of words that
had been excluded from the index because they were too common. I'd
set the configuration to exclude any word that appeared in more than
50% of the files and in more than 50 files. I picked the low absolute
number because my site was small; the 1000 file default would have meant
there were no stop words. Looking at the excluded words, which included
all the words in my standard page headers and footers, I saw words
that I would want to search on, even if they appeared on every page
of the site.
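In the config file this was a single directive; the paths below are assumed,
but IgnoreLimit is the real Swish-e directive that takes a percentage of
files and an absolute file count:

    # Assumed paths for this site's index.
    IndexDir  /home/httpd/html
    IndexFile /usr/local/etc/swish-e/geodsoft.idx
    # Auto stop words: drop any word appearing in more than 50% of
    # the files and in more than 50 files.
    IgnoreLimit 50 50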
I changed the config file so there would be no stop words (excluded
words). Thinking about searches I'd done in the past, I knew how
frustrated I'd been at not being able to search for a phrase because
it included stop words. I could not see any drawbacks to including
common words in the indexes except index size and possibly performance.
Obviously, if you put common words in your search criteria
you're going to get large result sets. The specific words that pushed
me towards indexing everything were "privacy" and "policy", which were
common because they are part of the standard page navigation.
After building the indexes again with no stop words, I tried some simple
searches, starting with "privacy policy". As expected, the results
included every GeodSoft web page, but the "Privacy Policy" page came
first with a relevance ranking of 1000, while other pages had much
lower rankings, going as low as 33. This and subsequent searches made
it clear that Swish-e was using the number of search words relative to
the total number of words as part of the relevance ranking. This is
what I expect and want from a search engine.
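Such a search can be run straight from the command line. The index name
below is assumed, but -f (index file) and -w (search words) are standard
Swish-e switches:

    swish-e -f geodsoft.idx -w privacy policy

Each result comes back on one line, roughly of the form rank, path, title,
and size, with 1000 as the top relevance score.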
With it clear that I had functional indexes, I turned my attention to
search front ends for the web. That was when I discovered the corrupt
tar.gz files, leaving me with only search.pl by Steve van der Burg.
I put it in my cgi-bin directory and tried it. I got a not authorized
error, which went away as soon as I used chmod to make the file
executable. Then I got a server error message which suggested looking
in the error_log. That revealed incomplete HTTP headers. I've mentioned
this before, but one thing I really do like about IIS is that it
displays script output as text if it's not valid HTTP output.
This may be a security weakness, and there should be a configuration
option to control it, but it's really handy when you're debugging CGI scripts.
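The incomplete headers problem is easy to reproduce with a minimal test
script; the blank line ending the header block is what Apache insists on:

    #!/usr/bin/perl
    # Minimal CGI test. The blank line after Content-type separates
    # headers from body; omit it and Apache logs a header error
    # instead of serving the page.
    print "Content-type: text/html\r\n\r\n";
    print "<html><body><p>CGI is working.</p></body></html>\n";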
I ran search.pl from the command line and got a program not found error,
which I correctly surmised meant that the first line pointed to the wrong
location for the Perl executable. After fixing this, I ran the script
from a browser again and got a search entry form. I tried a search that
I knew would have results and got none. It only took about five minutes
of looking at the script to find three configuration variables pointing
to the locations of the swish-e executable, the configuration file and
the index file. As soon as these were fixed, the next search gave a
results page with 20 hits and links to four more results pages with a
"next" link as well.
In less than three hours total time expended, I had functioning full
text searching using products I'd never used before on a platform that
I'd not had full text searching on. The three hours includes all the
time that I was looking at documentation for competing products, the
download and install, reading the Swish-e documentation, and getting a
working front end. Among the options I know how to control are which
directory trees to index and, within those, which file extensions to
index and which to index by file name but not contents; this last option
is for graphics files, if I want them indexed (the sketch after this
paragraph shows these directives). I know where Swish-e
keeps its default list of stop words in swish.h and how to control
automatic stop words if I decide I want stop words based on actual
indexed content. The biggest searching capability that I have not
found is searching for exact multi-word phrases. I don't know whether
Swish-e can't do this or I just haven't found it.
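The indexing controls just mentioned look something like this; the
directive names are genuine Swish-e directives, though the extension
lists here are only examples:

    IndexDir   /home/httpd/html           # which directory trees to index
    IndexOnly  .html .htm .txt .gif .jpg  # suffixes eligible for indexing
    NoContents .gif .jpg                  # index these by file name only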
I also know that from this point forward, I only need to gain a better
understanding of Swish-e's capabilities to extend what I do with it.
On the front end, there is only search.pl, which is less than 400 lines in
a standard language that I understand well. By modifying a Perl script
I'll be able to make the results pages look like the rest of my site
and control how many hits appear per page. Options on the existing form
suggest that controlling the scope of the search will be straightforward.
I'll add updates to this page as I go.
It was about two years ago that ATLA set up Index Server on its web
site. I delegated this to an assistant who took a few days to get it
to work. The version of Index Server that works with IIS 3 does not
provide control over the files that are indexed. It indexes everything
under the directory trees that you have it index. The scripting language
that comes with Index Server does allow displayed results file types
to be controlled via the forms and scripts. It took me at least another
two days after my assistant gave up to gain control of output file types
and integrate the output with our standard page appearance.
The scripting for Index Server is not particularly difficult but it's
totally proprietary. In fact it's specific to Index Server and contained
in two different file types. The scripts that control the execution
of Index Server are .idq files. The results from .idq files are output
to .htx files. The .htx files are conceptually similar to ASP and ColdFusion
files: they're HTML pages with embedded, Index Server specific tags.
So to control Index Server you need to learn two sets of syntax and
a list of proprietary variables that are passed between the .idq and .htx
files. None of this is well documented, so there is a lot of trial and
error in getting these things to work. The output from .htx files is standard
HTML, so any HTML form can be used to initiate or re-invoke a search.
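From memory, a skeletal pair looks roughly like this. The Ci* parameters
and the <%...%> tags are genuine Index Server conventions, but the
specifics here are illustrative rather than copied from a working site:

    # search.idq -- controls the query
    [Query]
    CiColumns=rank,vpath,doctitle
    CiRestriction=%CiRestriction%
    CiMaxRecordsPerPage=20
    CiScope=/
    CiTemplate=/scripts/results.htx

    <!-- results.htx -- formats each hit -->
    <%begindetail%>
      <a href="<%vpath%>"><%doctitle%></a> (<%rank%>)<br>
    <%enddetail%>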
I've seen enough to be reasonably sure that Swish-e is not as powerful
as Index Server. There's surely nothing in Swish-e like Index Server's
tight integration with the OS, which ensures that users automatically
see only results that they have rights to see and retrieve. On the
other hand, Swish-e is immeasurably
easier to set up and gain meaningful control over, at least for an
IT professional with an extensive development background. Perhaps a
non-technical user could get Index Server "to work" but no one without a solid
programming background will ever tightly integrate it with an existing
web site and give users meaningful control options specific to the
site. I can't imagine Swish-e exhibiting Index Server's totally bizarre
and unpredictable behavior.
Overall, at this point I'd give a modest lead to Swish-e over Index
Server for public web sites but recognize that some sites will need
capabilities they can find in Index Server but not Swish-e. I've
concluded that all really sophisticated web sites that need granular
security will have to build application level security to control access
to resources within individual scripts. If there is a practical way
to build centralized security functions that can be called from
Apache's authorization modules, from standalone CGI scripts and from
search scripts to control results lists, then open source systems will have
a better way of doing something that has been one of NT's strengths.
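Nothing like this exists yet; what I have in mind is a single Perl module
consulted everywhere, roughly sketched below with entirely hypothetical
names and a toy policy table:

    # SiteAuth.pm -- hypothetical central policy module.
    package SiteAuth;
    use strict;

    # Toy policy; a real site would consult its user database.
    my %restricted = ('members/' => 'member', 'admin/' => 'admin');

    # True if a user role may view a site-relative path.
    sub can_view {
        my ($role, $path) = @_;
        foreach my $prefix (keys %restricted) {
            return 0 if index($path, $prefix) == 0
                        && $role ne $restricted{$prefix};
        }
        return 1;    # everything else is public
    }
    1;

A search script would then filter its result list with the same test an
access control handler applies to a single request, e.g.
@hits = grep { SiteAuth::can_view($role, $_) } @hits;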
Update 6/2/00
As I expected, tying the CGI front end into GeodSoft's site design was
simply a matter of Perl programming. By late yesterday (6/01) I had
search.pl integrated with the site. Essentially all that was required
was to add several standard lines to determine the absolute path to
the site root directory and require my function libraries. Then I could
call the standard page top and bottom functions. To produce valid HTML
output I had to perform a couple of minor substitutions on the standard
content because CGI.pm does things in a slightly incompatible manner.
Search.pl was already set up to be able to limit the areas of the site
searched. All I had to do was replace the sample data structures with
real relative paths and meaningful descriptions (a sketch of the idea
follows this paragraph). Somewhat more difficult
was suppressing the output I didn't want. Since search.pl was designed
as a fully functional standalone script, it duplicated options that
I have in the standard search form in the left column. I wanted to
suppress all the options except "refine search" so that users would
return to the standard form to start a new search, limit the area of
the site to be searched, or change the number of hits per page.
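The replacement data mentioned above is just a small Perl structure pairing
relative paths with labels; these paths and descriptions are invented
for illustration:

    # Hypothetical search areas; search.pl's real variable names differ.
    my @areas = (
        [ '',       'Entire site'      ],
        [ 'linux/', 'Linux pages'      ],
        [ 'nt/',    'Windows NT pages' ],
    );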
This last item, changing the number of hits per page, was the trickiest part.
Search.pl had a variable that set the number of hits to 20, and while this
could easily be changed in the source code, I wanted the user to be able
to change it. I got the first page to display correctly fairly quickly,
but subsequent pages reverted to the hard coded value; later I got all
but the last page to work. I had to come up with logic
that would determine whether this was the first invocation and calculate
the page size. The calculated size had to be passed to all subsequent
invocations, which required finding every place the script generated
a URL embedded in an output form, i.e., all the "Prev 1 2 3 . . . Next"
links, and changing them. In the end this still took much less time than
it has taken me in the past to tie Index Server into a site.
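The logic boils down to trusting the form value on the first invocation
and then threading it through every self-referencing link. A condensed
sketch using CGI.pm, with hypothetical parameter names:

    #!/usr/bin/perl -w
    use strict;
    use CGI qw(escape);

    my $q = CGI->new;

    # First invocation: the form supplies pagesize; otherwise fall back
    # to the old hard coded 20. Later invocations get it from the URL.
    my $page_size = $q->param('pagesize') || 20;
    $page_size = 20 unless $page_size =~ /^\d+$/ && $page_size > 0;

    my $query = $q->param('query') || '';
    my $start = $q->param('start') || 0;

    # Every generated "Prev 1 2 3 . . . Next" link must carry the size
    # along, or later pages revert to the built-in default:
    my $next = $q->url . '?query=' . escape($query)
             . '&start=' . ($start + $page_size)
             . "&pagesize=$page_size";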
The biggest disappointments with Swish-e and search.pl are the per-search
time overhead and the lack of phrase searching. Every search imposes about
a 6 second delay, which is a huge CGI overhead. Static pages, and even
search.pl without search terms, return in significantly less than a second
on my 100Mbps LAN, but every search with any search words takes about
6 seconds to return. This is on a tiny site with just over 100 pages,
and it's pushing the limit of acceptability; I doubt the delay would
be acceptable on a much larger site. I think this is specific to search.pl
and not Swish-e, because I ran some of the sample Swish-e searches with
huge result sets over my modem. While they took longer, the increase
wasn't proportional, and most of the time was clearly download time for
the larger pages over the modem.
I still can't find anything to suggest that Swish-e provides any
phrase or proximity search capabilities. Both are very important
for real text searching of large sites or text databases. These
limitations and the time overhead will probably push me to look at
alternative tools as time permits. For now I have fully functional
site searching that's OK for my little site.
I decided to go ahead and set up searching on the OpenBSD system to
see how long it took. It only took a few minutes to FTP the files over,
edit the Swish-e config file to account for the different directory
locations on BSD versus Linux, and generate the index for the site.
Then I FTP'd search.pl, the newer Perl library files, and the search
form definition file.
I got a server error when I tried running search.pl. I tried one of the
simple CGI test scripts that I use and got the same error. I'd forgotten
that I didn't yet have CGI working on the BSD system. I changed file
permissions but that did not fix the problem. Then I looked at the
error_log which immediately identified the problem. As soon as I changed
the Apache Options configuration directive for the cgi-bin directory from
None to ExecCGI and restarted Apache, the test script worked. Search.pl
got another error but again the answer was in the error_log. GDBM_File,
which search.pl uses, was not installed with the version of Perl that
I have.
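For reference, the Apache change described above was a single directive for
the cgi-bin directory (the path is the stock OpenBSD location, assumed
here), followed by an apachectl restart:

    <Directory /var/www/cgi-bin>
        Options ExecCGI
    </Directory>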
I'm going to continue my discussion of what I encountered trying to add
GDBM_File to Perl on OpenBSD on another page.
While the problems I encountered are specific to the GDBM_File module
of Perl on OpenBSD, they are symptomatic of the type of problems that
users of open source systems encounter with some regularity.