ht://Dig is an excellent search engine to install on your web server. Try it out! See the Features and Requirements page for more information. Check the ht://Dig home page for the latest news and updates. I'm going to cover some additional installation and configuration hints.
Please report any errors or omissions to me. Suggestions are welcome too. Thank you.
If you are using Red Hat or Mandrake Linux and you are reasonably familiar with using Apache, you may be able to get by with just these Quick Start instructions. Otherwise, use the complete instructions.
Note that the RPM installer created a cron job in /etc/cron.daily that runs /usr/sbin/rundig once a day, so the search index will be updated automatically.
But you still should look over the rest of this documentation.
Before you start, you should look over the Features and Requirements page. Ht://Dig is available in source "tarball" and Red Hat style RPM distributions. The RPM distribution is much easier to install, but the tarball gives you more flexibility in specifying the locations where everything will be installed. Your choice. This document covers installing both the htdig-3.1.5.tar.gz "tarball" and the RPM file. The Where to get it page is the best place to get the most recent version of ht://Dig.
Mandrake 7.2 has ht://Dig on the install CD, so it might already be installed on your system. Red Hat 7.0 has it on the "Power Tools" CD. You can get other RPM distributions from here. (Or from here.) Download one of these:
htdig-3.1.5-0.i386.rpm (Red Hat 4.2)
htdig-3.1.5-0glibc.i386.rpm (Red Hat 5.x) *
htdig-3.1.5-0glibc21.i386.rpm (for glibc-2.1, Red Hat 6.0, 7.0**)
Put it somewhere on your Linux machine and (as root) type rpm -Uvh htdig*.rpm. Bang, it's installed. Now skip to Where everything is.
* There is a bug with vixie-cron for Red Hat 5.0 and 5.1. The ht://Dig team recommends upgrading to a newer version of vixie-cron. Look for vixie-cron-3.0.1-37.5.2.i386.rpm. This affects you because the RPM installer installs rundig as an /etc/cron.daily job. Get the updated vixie-cron from here.
** If you are using Red Hat 7.0 and don't have the Power Tools CD, then you can use htdig-3.1.5-0glibc21.i386.rpm, but it needs some additional work to get it going. You must first install compat-libstdc++-6.2-2.9.0.9.i386.rpm from the first Red Hat 7.0 install CD. The default HTML directory in previous versions of Red Hat was /home/httpd/html; it is now /var/www/html. htdig-3.1.5-0glibc21.i386.rpm installs several things in /home/httpd/html, and these need to be moved to /var/www/html.
Move search.html and the htdig directory to /var/www/html. You must also move /home/httpd/cgi-bin/htsearch to /var/www/cgi-bin/htsearch. The 'local_urls' variable in /etc/htdig/htdig.conf needs to be modified because it refers to /home/httpd/html.
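Putting that together, the relocation might look something like this (paths are taken from the notes above; double-check what the RPM actually installed with rpm -ql htdig):

rpm -Uvh compat-libstdc++-6.2-2.9.0.9.i386.rpm
mv /home/httpd/html/search.html /var/www/html/
mv /home/httpd/html/htdig /var/www/html/
mv /home/httpd/cgi-bin/htsearch /var/www/cgi-bin/
# Then edit /etc/htdig/htdig.conf and change /home/httpd/html to
# /var/www/html in the local_urls: line.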
For the tarball, you should decide where you want ht://Dig to install its programs. You must decide this before you install it, because you can't move it after you have it installed. (Except by deleting the entire installation and re-installing from scratch.) The default is to install in the /opt/www directory. The assorted ht://Dig binaries and configuration files will be located in this directory tree. You must configure your Web server to execute the ht://Dig CGI programs from here. If this is not acceptable, then change these locations during the installation procedure.
OK, now follow the ht://Dig installation instructions. (You probably should open them in a new window so that you can refer to this page.) When you get to the Configure step, you have the opportunity to edit the CONFIGURE script that defines where everything will get installed. If you want to go with the default location, then just continue on through the procedure.
The RPM installation should need no Apache configuration changes, because everything goes in the "standard" locations (assuming, of course, that your Apache installation also uses those standard locations).
For the tarball, assuming that you installed ht://Dig in the default /opt/www directory, here are the configuration changes that you should add to your Apache configuration file(s).
Directive | Purpose
---|---|
Alias /htdig/ /opt/www/htdocs/htdig/ | So that you can "point" to the assorted graphic files, e.g., <img src="/htdig/htdig.gif">. The default search.html file is also located here. It is a very good idea to keep the /htdig/ definition, because the template files that are used to display the search results all refer to htdig/ to locate files.
ScriptAlias /htdig-cgi/ /opt/www/cgi-bin/ | This is how you access the htsearch program for searching, e.g., <form method="post" action="/htdig-cgi/htsearch">
<Directory /opt/www/cgi-bin/> AllowOverride None Options ExecCGI </Directory> | So that Apache will allow access to the ht://Dig cgi-bin directory.
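Taken together, the corresponding fragment in your Apache configuration file would look something like this (paths assume the default /opt/www tarball locations):

Alias /htdig/ /opt/www/htdocs/htdig/
ScriptAlias /htdig-cgi/ /opt/www/cgi-bin/
<Directory /opt/www/cgi-bin/>
    AllowOverride None
    Options ExecCGI
</Directory>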
After editing your Apache configuration files, type /etc/rc.d/init.d/httpd restart to restart Apache.
Name | RPM locations | Tarball (Default locations) | Used for |
---|---|---|---|
${CONFIG_DIR} | /etc/htdig | /opt/www/htdig/conf | htdig.conf configuration file |
${COMMON_DIR} | /var/lib/htdig/common | /opt/www/htdig/common | Template files used for search results |
${BIN_DIR} | /usr/sbin | /opt/www/htdig/bin | rundig and other "digging" binaries |
${DATABASE_DIR} | /var/lib/htdig/db | /opt/www/htdig/db | The search index database files. |
${CGIBIN_DIR} | /home/httpd/cgi-bin | /opt/www/cgi-bin | htsearch |
${IMAGE_DIR} | /home/httpd/html/htdig | /opt/www/htdocs/htdig | htdig.gif, and other graphic files |
${SEARCH_DIR} | /home/httpd/html | /opt/www/htdocs/htdig | search.html sample search form |
Important note for RPM users: The RPM installation program attempts to configure ht://Dig so that it will work "out of the box." It installs the various files in "standard" Red Hat locations. One thing that is never standard, however, is the name of your machine. The ht://Dig RPM installer attempts to glean this information from your existing configuration files and appends new definitions at the end of the htdig.conf file, in addition to the "stock" definitions that are scattered throughout it. This includes the all-important start_url: variable. Variable definitions at the end of the file override earlier definitions. Bear this in mind as you scroll through htdig.conf.
Edit ${CONFIG_DIR}/htdig.conf. Scroll down and find the start_url: line. This line defines what ht://Dig will index for searching. The default is to index the http://www.htdig.org/ site. This is not a good site to test with, because it takes a long time to index. Change this to point to a "site" on your own machine. For speed, change the URL to use your machine's IP address, rather than the full domain name. For example, if your machine is addressed as 192.168.1.1, then set start_url: to be http://192.168.1.1/
start_url: must refer to the site the same way a browser would, because ht://Dig works like a web crawler and fetches your HTML pages just as a web browser does. So use a browser to access the site on your own machine, and put the same URL that works in the browser into start_url:.
Using the IP address to refer to the site is a shortcut for testing. This IP address will be returned in the search results, so 192.168.1.1, for example, isn't what you would use when you release the search form to the public. In that case, you either have to set start_url: to the actual domain that the site uses, or (preferably) use two configuration files (one for digging and another for searching) and use the url_part_aliases directive to translate from a local IP address to the real domain. This is more complicated than you need while you are still getting things working, so save it until you are familiar with the basic operation.
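If you do want to try the two-configuration-file approach later, here is a rough sketch (the host names are hypothetical; see the url_part_aliases reference for the exact semantics). The digging configuration stores the alias in the database, and the searching configuration expands it back to the public domain name:

# In the configuration file used for digging (e.g., htdig.conf):
start_url: http://192.168.1.1/
url_part_aliases: http://192.168.1.1/ *site

# In the configuration file used by htsearch (a hypothetical search.conf):
url_part_aliases: http://www.example.com/ *site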
For an additional speed boost, check out the local_urls: directive, which lets ht://Dig read the files through the local filesystem rather than having to go through the web server. But, again, wait until you have ht://Dig working and are reasonably familiar with how everything works before you try using this.
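As a preview, a local_urls: entry simply maps a URL prefix to a filesystem path. A hypothetical example, assuming your document root is /var/www/html:

local_urls: http://192.168.1.1/=/var/www/html/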
You should create a robots.txt file in the server's root directory to specify what you do not want ht://Dig (or any other search engine!) to index. Here is a sample robots.txt file:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
Reference for all configuration file directives
Before you can search, you must generate the search index database. Change to ${BIN_DIR} and use the rundig script to run the ht://Dig indexing programs. Type ./rundig -v. Rundig runs htdig (the "digging", or indexing, step) and then htmerge (the second step of creating the search index). The -v option tells them to be verbose, meaning that you should see each file listed as it is indexed, followed by indications of the merging activity.
This should complete in a reasonable length of time (depending on the size of your site). If you see prolonged periods of inactivity, press Ctrl-C to abort the programs and check start_url: in the ${CONFIG_DIR}/htdig.conf configuration file. If indexing is taking too long for testing, consider changing start_url: to index only a subset of your site until you are done wrestling with the configuration file.
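For example, to dig only part of the site while testing, you could point start_url: at a single directory (the path here is hypothetical). The stock htdig.conf restricts the dig to URLs matching ${start_url} through the limit_urls_to: directive, so nothing outside that directory would be indexed:

start_url: http://192.168.1.1/docs/
limit_urls_to: ${start_url}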
Note that you must update the index whenever the site is updated. If your site is large and indexing is time consuming, then you might want to do the indexing in a cron job that is run in the middle of the night.
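A typical crontab entry for a nightly re-index might look like this (a sketch; adjust the path to wherever rundig lives on your system, /usr/sbin/rundig for the RPM or ${BIN_DIR}/rundig for the tarball):

# Re-index every night at 2:30 AM and discard the output.
30 2 * * * /opt/www/htdig/bin/rundig -s > /dev/null 2>&1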
RPM users should know that the RPM installer creates an /etc/cron.daily job that will automatically run rundig once a day. This may be all that you need.
When you get the configuration file squared away, use ./rundig -s for a considerably shorter display. Alternatively, if something is giving you problems, try ./rundig -vvv for an extremely detailed and verbose display. In that case, you would probably want to redirect the output to a file, e.g., ./rundig -vvv > debug.txt, and then load debug.txt in an editor.
Right now the only way you have to generate the index is by running the rundig (or rundig2) script, which can be limiting because it rebuilds the whole index from scratch each time it is run. This has two undesirable side effects: 1) it takes time and machine resources, and 2) searching returns no results while the rundig script is running.
There are other ways to update the search index database that sidestep these issues. You should examine the command line options for the indexing programs so that you can develop an indexing procedure that best suits your site's needs.
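One approach, sketched here on the assumption that you are using the default tarball locations, is the -a ("alternate work files") option that htdig and htmerge both accept. It builds the new databases alongside the live ones, so htsearch can keep serving results from the old index during the run:

cd /opt/www/htdig/bin
./htdig -a -v       # dig into alternate (.work) database files
./htmerge -a -v     # merge the alternate database files
# When the run finishes, copy or rename the .work files in
# ${DATABASE_DIR} over the live database files.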
More information on the htdig, htmerge, htnotify, and htfuzzy programs that are used to generate the search index database.
Look at ${SEARCH_DIR}/search.html. This is your sample search form.
For the tarball installation, you probably have to change one line, because we defined the CGI directory to be htdig-cgi in the Apache configuration file. So change
<form method="post" action="/cgi-bin/htsearch">
to
<form method="post" action="/htdig-cgi/htsearch">
and save the file.
Now use a browser to access this search form. If the IP address of your server is 192.168.1.1, then enter either http://192.168.1.1/htdig/search.html (tarball) or http://192.168.1.1/search.html (RPM) as the URL in your browser. You should see the search form. Enter a word that you know is somewhere on your site. Click the search button.
(Fingers are crossed.)
You should see the search results displayed, almost instantly.
More information on the htsearch CGI program that does the actual searching.
If something isn't working right, the first thing to do is to go back, check your configuration, and try repeating the above procedures. If this doesn't help, the ht://Dig site has a lot of valuable reference material: check the configuration page, the FAQ, and the on-line reference section. Most important, make sure to visit the ht://Dig Mailing List Archive. The ht://Dig community provides excellent support. Most (if not all) common "why doesn't this work" type questions have already been asked and answered on the mailing list or in the FAQ.
Use the search box at the bottom of the main ht://Dig page to search the archives (and the rest of the ht://Dig site.)
Examine ${SEARCH_DIR}/search.html. You use this as a basis for how you want the search forms to look. The search results are defined by the template files that are located in ${COMMON_DIR}. You edit these to change how the search results are displayed.
One tricky part is that ht://Dig totally ignores the template files unless you add a template_map directive to htdig.conf, like this:
this_base: myweb
search_results_header: ${common_dir}/${this_base}/header.html
search_results_footer: ${common_dir}/${this_base}/footer.html
nothing_found_file: ${common_dir}/${this_base}/nomatch.html
syntax_error_file: ${common_dir}/${this_base}/syntax.html
template_map: Long builtin-long ${common_dir}/${this_base}/long.html \
              Short builtin-short ${common_dir}/${this_base}/short.html \
              Default default ${common_dir}/${this_base}/long.html
template_name: Default
In this case I defined a new variable, this_base:, with a value of myweb. The way I use this is to first create a myweb directory under ${COMMON_DIR} and copy all the template files into it before editing them. This leaves an untouched set of the original template files.
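In shell terms, that initial setup is just this (assuming the default tarball ${COMMON_DIR} of /opt/www/htdig/common):

mkdir /opt/www/htdig/common/myweb
cp /opt/www/htdig/common/*.html /opt/www/htdig/common/myweb/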
Once this was done, I went through and edited all the template files so that they displayed the way I wanted, e.g., ${COMMON_DIR}/myweb/header.html, ${COMMON_DIR}/myweb/footer.html, etc. This method is also valuable if you are indexing (and searching) multiple sites and are using multiple configuration files: you keep each set of template files in a different directory (defined by the value assigned to this_base).
Optional. You could also separate the database files by defining them like
database_base: ${database_dir}/${this_base}
The database files are named db.docdb, db.word.db, etc. by default. Making the above change would result in them being named myweb.docdb, myweb.word.db, etc. instead. Again, this is important if you are using multiple configuration files to manage multiple search databases on the same machine. If you are only using one search database, then you can skip defining database_base:.
Add a date_format: command to htdig.conf.
Example: date_format: %m/%d/%Y will display like 01/23/2000.
See man strftime for full reference.
ht://Dig supplies the rundig script that is sufficient to manage some ht://Dig indexing operations. But rundig doesn't support all the possible htdig, htmerge, and htfuzzy command line options. It is also difficult to use when you are specifying a different configuration file, because you have to type in the complete path to the configuration file.
I have modified rundig to address this. The modified script is named rundig2. It now supports all the command line options. It also supplies the path and file extension when you use the -c config file option.
Download whichever of these is most appropriate. Rename it to rundig2, check that the variables that define locations (DBDIR, etc.) are correct, move it to ${BIN_DIR}, and make it executable (chmod 755 rundig2).
Now you can use rundig2 instead of rundig when you are creating the database files. If rundig2 doesn't work for you, for some reason, then go back to using rundig and please let me know about it.
ht://Dig will index Adobe Acrobat PDF files quite nicely, but it needs some help: you must download and install a PDF-to-text converter and do some additional configuration. Here's how.
Download the Xpdf package from the Xpdf Download page. Linux Intel users can download the pre-compiled binaries (x86, Linux 2.0, libc6). Once you have the binaries, copy pdftotext and pdfinfo to a suitable location (${BIN_DIR} or /usr/bin, for example).
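For example, assuming the binary tarball unpacked into a directory named xpdf-0.91-linux (the version number is only illustrative), the copy step is just:

cp xpdf-0.91-linux/pdftotext /usr/bin/
cp xpdf-0.91-linux/pdfinfo /usr/bin/
chmod 755 /usr/bin/pdftotext /usr/bin/pdfinfo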
Alternatively, you can use one of these Xpdf RPM files. Download one of these:
Install the RPM (rpm -Uvh xpdf*.rpm) and pdftotext and pdfinfo will be installed in /usr/bin. (Double-check the locations with rpm -ql xpdf.)
Download conv_doc.pl from here and copy it to your ${BIN_DIR} directory. Make it executable (chmod 755 conv_doc.pl). Then load it in your editor, change the $CATPDF variable to point to where pdftotext is, and change $PDFINFO to point to where pdfinfo is.
Finally, edit ${CONFIG_DIR}/htdig.conf and add
external_parsers: application/pdf->text/html /usr/local/bin/conv_doc.pl
Replace /usr/local/bin/ with the location where you actually copied conv_doc.pl. More about the external_parsers: directive.
Important note: ht://Dig must read each PDF file in its entirety in order to index it. This is affected by the max_doc_size: directive in htdig.conf. Make sure that max_doc_size: is set to be larger than your largest PDF file.
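For example, to let ht://Dig read PDF files of up to about 2 MB (an arbitrary figure; pick a value larger than your largest PDF), add or change this line in htdig.conf:

max_doc_size: 2000000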
pdftotext is pretty nifty. It can also be interfaced to lynx; check /etc/lynx.cfg and ~/.mailcap.
Installing a Microsoft Word to text converter is similar to Indexing PDF Files. Follow the procedures there to install and configure conv_doc.pl. The only difference is that you install a Word-to-Text converter, such as catdoc. These go together, so it is almost as easy to install both the Word and PDF converters at the same time. conv_doc.pl is already partially configured to use catdoc. Add
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl
to ${CONFIG_DIR}/htdig.conf. If you were installing both the PDF and Word converters, then you'd add
external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                  application/pdf->text/html /usr/local/bin/conv_doc.pl
Again, replace /usr/local/bin/ with the location where you have actually installed the conv_doc.pl script.
It is valuable to have a record of what people are searching for so that you know what they are interested in. This can give you hints on additional content that you need to add to your site.
To log search requests, add logging: true to your configuration file. This will direct the system logging facility to log search requests.
However, you might want to change where syslog sends these messages (by default they go to /var/log/messages). To do this, edit your /etc/syslog.conf file and add this to it:
# Log ht://Dig search requests
local5.*                                        /var/log/htdig
Remember to use tabs and NOT spaces in your syslog.conf file. Otherwise it won't work.
The system will now log search requests to both /var/log/messages as well as to /var/log/htdig, so now you have to tell it not to log search requests to /var/log/messages. To do this, add ;local5.none to your /var/log/messages line. It should look something like this:
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;local5.none      /var/log/messages
For the changes to take effect, you'll need to restart your syslog daemon. To do so, just do a
killall -HUP syslogd
That will force syslogd to re-read its config file for the changes to take effect.
See man 5 syslog.conf for more information.
Syslog information courtesy of Bruce A. Buhler