Searcharoo.net ASP.NET Search Engine

Searcharoo.net: ASP.NET Search with C#


Version 6
Index JPG images, index GPS location data for mapping results, address "No" Trust problem and fix a few bugs. NEW! June '08
Version 5
Remove Binary Serialization to solve Medium Trust problem; index OpenXML document formats.
Version 4
Refactored codebase and ability to index and search Microsoft Word, Excel, PowerPoint and Acrobat PDFs. Little improvements like robots.txt and excluding regions of HTML also added.
Version 3
Adds a "save to disk" for the catalog; feature suggestions, bug fixes and incorporation of code contributed by others from previous versions.
Version 2
Extend Searcharoo to populate its search catalog by Spidering HTML pages - follow links and imagemaps to process both static and dynamically generated pages! You can also search for multiple words.
Version 1
How to build a simple, extensible search engine using ASP.NET that can crawl files and create a searchable catalog by processing the text from HTML source.

Search Links

About Search | ASP.NET related | Other products | File formats | Internationalization | SQL-Server | Microsoft: Index Server, CMS, SharePoint and Search

About Search

MIND: Under the Covers: How Search Engines Work Quote: "When you send a request to a smart search engine, it does more than just a lookup and return. Language processing can help an engine uncover what you really meant to find."
January 1997 issue of Microsoft Internet Developer
* * * *
On Search, the Series Quote: "This series of essays on the construction, deployment and use of search technology (by which I mean primarily full-text search) was written between June and December of 2003. It has fifteen instalments not including this table of contents.
This may be a weblog, but the following are not in reverse chronological order, they're in the order I wrote 'em, which I suspect is the right order to read 'em."
Other useful stuff
* * * *
What is Stemming? Quote: "...Algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form..."
The Lancaster Stemming Algorithm
   Java: The Paice/Husk Stemmer Translated from Pascal
   Porter Stemming (inc. C#)
* * * *
What are Stop Words? Top 1000 English words (stop words???)
Verity info against a long list of stop words: ""
* * * *
Asian Linguistic issues About Encoding, Stemming, Segmentation in Japanese, Chinese and Korean * * * *
SearchGuild How search engines work:

Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document.
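A minimal sketch of the postings idea quoted above, written in Python for brevity rather than C#. The function names, the sample documents and the "all words must match" ranking are illustrative assumptions, not Searcharoo's actual catalog code. Each posting here records a document ID and the word positions within that document; the frequency is simply the number of positions.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Scan each document and map every word to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            index[word].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Return ids of documents containing every query word, most frequent first."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)
    # Rank by total occurrences of the query words, then by doc id for stability.
    return sorted(hits, key=lambda d: (-sum(len(p[d]) for p in postings), d))


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "Search engines index words",
    2: "An inverted index maps words to documents",
    3: "Spiders crawl pages",
}
index = build_index(docs)
print(index["index"])
print(search(index, "index words"))
```

Keeping positions rather than a bare count costs more memory, but it leaves room for phrase matching and keyword-in-context excerpts later.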
* * * *
Why Writing Your Own Search Engine is Hard also Building Nutch: Open Source Search * * *
SearchEngineWatch http://searchenginewatch.com/sereport/article.php/2220611 http://searchenginewatch.com/searchday/article.php/3307271 * * *
SearchTools What is a Search Tool
and Why Would I Want One?
* * *
WebMonkey Usability & research * *
Lingua::Stem (Perl) Lingua::Stem takes lists of words and (as determined by the locale) stems them to their root form. This is primarily of use in search applications that need to be able to find conjugated forms of words as well as exact matches.
Also Search::InvertedIndex in Perl.
* *
Robots.txt Standard info
WikiMedia Robots.txt info
Crawl-delay: x also works for MSNBot and here
and on Yahoo
Blocking Altavista's image search with new noimageindex and noimageclick directives for the META ROBOTS tag.
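As a rough illustration of how a spider might honour the robots.txt rules and the Crawl-delay directive listed above, here is a sketch using Python's standard urllib.robotparser module (crawl_delay needs Python 3.6 or later). The robots.txt content, the paths and the "ExampleBot" name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from the site.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def load_rules(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
print(rules.can_fetch("msnbot", "/private/page.html"))  # blocked for msnbot
print(rules.crawl_delay("msnbot"))                       # seconds to wait between requests
print(rules.crawl_delay("ExampleBot"))                   # None: no delay set for other robots
```

A spider would typically sleep for the returned number of seconds between requests to the same host, and fall back to its own default when the value is None.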

trapping bad robots

RobotCop (a bit out-of-date?)

* * *
From Google: Google Information for Webmasters "Following these guidelines will help Google find, index, and rank your site, which is the best way to ensure you'll be included in Google's results" * * *
Help the Googlebot understand your web site "As a web site author, there are a few simple things you can do to help the Googlebot understand your web site as fully as possible..." * * *

SQLServer Full-text Search

SQL Server "Yukon" Full-Text Search: Internals and Enhancements High-level architecture of full-text-search, followed by code examples.
Linguistic and Unicode Considerations (Index) Useful description of how full-text-search works - covering stemming, breaking, capitalization, phrases, etc.
Implementing a Word Breaker You can implement your own Word Breaker in C++ for SQL Full-text indexing
C# String Tokenizer could be the basis of a more complex word-breaker - in particular, detecting numbers could be useful to reduce the index size (e.g. indexing 10000, 10,000 and 10.000,00 as the same root)
Implementing a Stemmer You can implement your own stemmer in C++ for SQL Full-text indexing
Word Breaker and Stemmer Sample
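The number-detection idea mentioned under the word-breaker entries above could look something like the following Python sketch. The separator heuristic is an assumption made for illustration: when both "," and "." appear, the last one is treated as the decimal mark; a single separator followed by exactly three digits is treated as a thousands separator. That means an ambiguous token such as 1.234 is indexed as 1234, which may not suit every locale.

```python
import re

# A run of digits optionally grouped with "," or "." (e.g. 10,000 or 10.000,00).
NUMBER = re.compile(r"^\d+(?:[.,]\d+)*$")
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[A-Za-z]+")


def normalize_number(token):
    """Reduce differently formatted numbers to one canonical root string."""
    if not NUMBER.match(token):
        return token
    last_sep = max(token.rfind(","), token.rfind("."))
    if last_sep == -1:
        return token  # plain digits, nothing to strip
    sep = token[last_sep]
    other = "." if sep == "," else ","
    tail = token[last_sep + 1:]
    # Heuristic (assumption): decide whether the last separator is a decimal mark.
    is_decimal = other in token or (token.count(sep) == 1 and len(tail) != 3)
    whole, fraction = (token[:last_sep], tail.rstrip("0")) if is_decimal else (token, "")
    whole = re.sub(r"[.,]", "", whole)
    return whole + ("." + fraction if fraction else "")


def tokenize(text):
    """Break text into lower-case words, with numbers normalised."""
    return [normalize_number(t).lower() for t in TOKEN.findall(text)]


print(tokenize("Paid 10000, then 10,000 and 10.000,00 again"))
```

All three spellings of the amount end up as the single catalog entry 10000, which is the index-size saving the entry describes.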

Microsoft: Index Server, CMS, SharePoint and Search

Using Index Server from .NET [idunno.org].
Index Server can index websites, but only those it can 'browse' via a UNC or local path (i.e. it needs to know both the physical AND the web address of each page, so it can't crawl dynamic pages with querystrings). It can therefore index ASP/ASPX pages that 'exist' in the filesystem (say, your Default.aspx page), but because it doesn't parse HTML for links to follow, it will only index News.aspx once, not News.aspx?id=1, News.aspx?id=2 and so on. It will NEVER find pages reached through tricks such as URL-rewriting or HTTPHandlers that munge or manipulate URLs.
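By contrast, a spider finds News.aspx?id=1 and News.aspx?id=2 because it parses each downloaded page for links. The Python sketch below shows that step only, using the standard html.parser and urllib.parse modules; the class name, the sample markup and the example.com address are assumptions for illustration, and it does not filter out schemes such as mailto: or javascript:.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href> and imagemap <area href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area"):
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, as a spider might receive it from Default.aspx.
page = """<a href="News.aspx?id=1">One</a> <a href="News.aspx?id=2">Two</a>
<a href="/about/#team">About</a> <area href="News.aspx?id=1">"""
links = extract_links(page, "http://example.com/site/Default.aspx")
for link in links:
    print(link)
```

Because the querystring is kept as part of each URL, the two News.aspx links are queued as separate pages, while the repeated imagemap link is ignored.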

Win2k Indexing Service (2001)
[MSDN] Intro to Indexing Service v3 (2003)
Using Indexing Service with Web Servers

Query Index Server with IXSSO in .NET "There are many articles on the internet about querying Index Server using MSIDXS but few concerning IXSSO in .Net"
Integrating Microsoft SharePoint Portal Search into Microsoft Content Management Server Using SharePoint to provide search capability to MS-CMS sites

Better instructions on CodeProject

Filtershop WMA, MP3, PDF+, StarOffice/OpenOffice IFilter implementations for MS Index Server
SharePoint RTF Filter Tool Microsoft supplied RTF IFilter
Integrating Content Management Server with SharePoint Portal (2004-05-17) "...it does not provide functionality in all the areas that you might need when deploying a web application. Specifically, it is missing an index and search engine, it does not provide eCommerce functionality, and it is not an asset management tool..."

OT: CMSWire - another CMS site

TechNet Chat: Content Management Server Host Guest_Scott_MS:
Q: I can't find the definite answer in the newsgroup about implementing search on a CMS site!
Host Guest_Scott_MS:
A: We have NOT rev'd search for CMS 2002 ... but there is an existing whitepaper on search. Search MSDN for SPS, CMS and search. [here, but it's old]

ASP.NET Articles

lucene.net [Open Source] "Lucene.Net is a complete up to date .NET port of Jakarta Lucene, a high-performance, full-featured text search engine written entirely in Java..."
Lucene (in Java) and a 'preview' article in August 2000
Nata1 [Open Source] C# open-source search engine
.Text Search "...Core of the .Text search feature is Lucene.NET..."
SiteSearchEngine on DeveloperFusion and CodeProject Developer Fusion Community Forums : SiteSearchEngine
SoundEx implementation in C# a 'sounds like' search match algorithm
Remove html tags and insert remaining text into variables
Parsing htmlmarkup text using MSHTML Parse HTML by walking the DOM using the 'IE control' and MSHTML docs
Directory Listing
Stripping HTML
Opening a file from ASP.NET
Remove White Space Regex
Practical parsing in Regular Expressions
XML Serialization using C#
Multi-threaded Web Applications - Case I: Search Engine Multi-threading is the ability for an application to perform more than one execution simultaneously. When used properly, it can greatly improve the responsiveness and efficiency of an application. However, multi-threading in Windows was quite difficult and error-prone. But with the support from various .NET base classes in the System.Threading namespace, it is now a relatively easy task. And since ASP.NET pages can be created with any .NET language, we can build some ASP.NET pages that feature multi-threading. This article is the first of the series of 4. I will demonstrate the use of threading in web applications by implementing a simple search engine.
Yider (ASP3.0) The Yider is a VBScript Spider that allows you to quickly add a search system to your site like the one at the top of this page. It stores data in a Microsoft Access, SQL 7 or SQL 2000 database. The Yider does not require DLLs or COM components to run and works for all languages.
Dynamic (Javascript) find-in-page This DHTML script simulates the Edit> Find In Page feature of the browser to allow your visitors to easily search for a particular text on your page. As in the "Find In Page" feature, it highlights the searched text if found, otherwise, prompts a "Not Found" message.
C# Spider - ASP.Net Version I have been spending a lot of time learning C# and .net. I wanted to share some of the things I have learned...
Apply Hit Highlighting and Keyword Context to Your Search Results VB to highlight matches and text excerpts in search results
Other search engines http://www.developerfusion.com/show/4389/
http://www.aspfree.com/c/a/ASP-Code/Creating-a-Personal-Search-Engine-by-Sixto-Luis-Santos/
http://www.codeproject.com/cs/webservices/omnisearch.asp#xx755770xx
http://www.developerfusion.com/show/4389/7/
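For readers curious about the 'sounds like' matching mentioned in the SoundEx entry above, here is a short Python sketch of the common American Soundex rules (it is not the linked C# implementation). It assumes plain ASCII names; accented or non-Latin letters are simply treated as uncoded.

```python
# Letter groups that share a Soundex digit; vowels, h, w and y are not coded.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)


def soundex(word):
    """Return the four-character Soundex code for a word, or "" if it has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code
        # h and w do not separate letters with the same code; vowels do.
        if c not in "hw":
            previous = code
    return (result + "000")[:4]


for name in ("Robert", "Rupert", "Ashcraft", "Tymczak"):
    print(name, soundex(name))
```

A catalog could store this code alongside each word so that a search for "Rupert" also surfaces "Robert", at the price of more false matches.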

Products

Atomz "Atomz provides Web content management, site search engine, and commerce search solutions for enterprises, commerce sites, and media sites."
mnoGoSearch mnoGoSearch (formerly known as UdmSearch) is a full-featured web search engine software for intranet and internet servers. mnoGoSearch for UNIX is a free software covered by the GNU General Public License and mnoGoSearch for Windows is a commercial search software version.
Links to other search products/extensions
Innerprise ES.net full-text search... uses SQL and apparently IFilter - IndexServer-dependent???
IFilter implementations
Teleport Webspiders Designed for creating local copies of web data by spidering, similar but not the same as a search engine (Teleport munges links and paths so that the pages 'work' locally, which a search engine generally isn't going to bother with)
WrenSoft Implements plug-in architecture for non-HTML file formats
DTSearch Desktop, CD-ROM and web engines available (including Linux).
KBroker "KBroker is an integrated suite of search based applications for web, intranet and extranet, including corporate use and e-Government information access."
Crawl-It
Xpdf Xpdf is an open source viewer for Portable Document Format (PDF) files.
also
PDFTron
Dynamic PDF
PJX SourceForge
Perlfect
Alkaline UNIX only
WebGlimpse UNIX only
LexTek "Lextek International supplies advanced information retrieval and natural language processing technology."
Hosted products Atomz, Mondo Search, PicoSearch, Sandy Bay
Verity Enterprise Search Conduct Business Online in Multiple Languages with Verity K2
Thunderstone Search Appliance (hardware) and Webinator: the software version (?)

File Formats

Wotsit's Format? Very complete listing of file format 'specs'
Jakarta POI - Java API To Access Microsoft Format Files "...The POI project consists of APIs for manipulating various file formats based upon Microsoft's OLE 2 Compound Document format using pure Java. In short, you can read and write MS Excel files using Java. Soon, you'll be able to read and write Word files using Java. POI is your Java Excel solution as well as your Java Word solution...."
Convert-Files.com Helps find converters, but not necessarily code to integrate with another app...
like this site - ACCI

Internationalization

NCharDet - Character Set Detection .NET port from Java jchardet (see next item)
jchardet - Character Set Detection (Java) "jchardet is a java port of the source from mozilla's automatic charset detection algorithm. The original author is Frank Tang. What is available here is the java port of that code. The original source in C++ can be found from http://lxr.mozilla.org/mozilla/source/intl/chardet/ More information can be found at http://www.mozilla.org/projects/intl/chardet.html"
LISA - Localization Industry Standards Association Not directly related to 'search' per se, but a lot of the issues surrounding segmentation and parsing of multiple languages for building a search catalog are shared by automated localization/translation tools (e.g. Translation Memory)
A European search engine Has some interesting 'statistics' about the words and sites indexed (graphs!)