|
Version 6
|
|
Index JPG images, index GPS location data for mapping results, address "No" Trust problem and fix a few bugs.
NEW! June '08
|
Version 5
|
|
Remove Binary Serialization to solve Medium Trust problem; index OpenXML document formats.
|
Version 4
|
|
Refactored codebase, plus the ability to index and search Microsoft Word,
Excel, PowerPoint and Acrobat PDF files. Small improvements such as robots.txt
support and excluding regions of HTML were also added.
|
Version 3
|
|
Adds a "save to disk" for the catalog; feature suggestions,
bug fixes and incorporation of code contributed by others
from previous versions.
|
Version 2
|
|
Extend Searcharoo to populate its search
catalog by spidering HTML pages: follow links and imagemaps
to process both static and dynamically generated pages!
You can also search for multiple words.
|
Version 1
|
|
How to build a simple, extensible search engine using ASP.NET that
can crawl files and create a searchable catalog by processing the
text from HTML source.
|
|
|
|
|
Search Links
About Search
| ASP.NET related
| Other products
| File formats
| Internationalization
| SQL-Server
| Microsoft: Index Server, CMS, SharePoint and Search
About Search
|
|
MIND: Under the Covers:
How Search Engines Work |
Quote: "When you send a request to a smart search engine, it does more than just a lookup and return. Language processing can help an engine uncover what you really meant to find."
January 1997 issue of Microsoft Internet Developer
|
* * * *
|
|
On Search, the Series |
Quote: "This series of essays on the construction, deployment and use of search technology
(by which I mean primarily full-text search) was written between June and December
of 2003. It has fifteen instalments not including this table of contents.
This may be a weblog, but the following are not in reverse chronological
order, they're in the order I wrote 'em, which I suspect is the right order to read
'em."
Other useful stuff
|
* * * *
|
|
What is Stemming? |
Quote: "...Algorithms, or stemmers, have been developed, which attempt to reduce a word to its stem or root form..."
The Lancaster Stemming Algorithm
Java: The Paice/Husk Stemmer Translated from Pascal
Porter Stemming (inc. C#)
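
To make the idea of reducing words to a stem concrete, here is a minimal sketch of suffix stripping. It is deliberately naive and is NOT the Porter or Lancaster algorithm linked above; the `SUFFIXES` rule list and the `naive_stem` name are assumptions made up for illustration.

```python
# Naive suffix stripping: an illustrative sketch only, not Porter or Lancaster.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, deliberately tiny rule list


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All four forms reduce to the same stem, so a query for one can match the others.
print([naive_stem(w) for w in ["searching", "searched", "searches", "search"]])
```

A real stemmer applies many more rules (and conditions on them); the linked implementations are the place to look for production use.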
|
* * * *
|
|
What are Stop Words? |
Top 1000 English words (stop words???)
Verity info
against a long list of stop words: ""
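
As a rough sketch of what stop-word filtering looks like at indexing time, the snippet below drops very common words before they reach the catalog. The `STOP_WORDS` set is a tiny hypothetical sample, not one of the lists linked above, and `remove_stop_words` is an illustrative name.

```python
from typing import List

# Hypothetical sample; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def remove_stop_words(text: str) -> List[str]:
    """Lower-case the text, split on whitespace and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The history of the search engine"))
```

The same filter would normally be applied to the user's query as well, so that indexed terms and query terms stay comparable.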
|
* * * *
|
|
Asian Linguistic issues |
About Encoding, Stemming, Segmentation in Japanese, Chinese and Korean
|
* * * *
|
|
SearchGuild |
How search engines work:
Creating and maintaining an inverted index is the central problem when building an
efficient keyword search engine. To index a document, you must first scan it to produce
a list of postings. Postings describe occurrences of a word in a document; they generally
include the word, a document ID, and possibly the location(s) or frequency of the
word within the document.
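
The description above can be sketched in a few lines. This is a minimal in-memory illustration, assuming documents are plain strings keyed by an integer ID; `build_index` and the sample `docs` are made-up names, and each posting here records the word, the document ID and the word's positions.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Return {word: {doc_id: [positions]}}; each entry is a set of postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Scan the document and record where each word occurs.
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)
    return index


docs = {
    1: "search engines index documents",
    2: "an index maps words to documents",
}
index = build_index(docs)

# Look up a word: which documents contain it, and where?
print(index["index"])
# Frequency of a word in a document is just the number of positions.
print(len(index["documents"][2]))
```

Keeping positions rather than only a count costs more memory, but it is what makes phrase and proximity searches possible later.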
|
* * * *
|
|
Why Writing Your Own Search Engine is Hard |
also
Building Nutch: Open Source Search
|
* * *
|
|
SearchEngineWatch |
http://searchenginewatch.com/sereport/article.php/2220611
http://searchenginewatch.com/searchday/article.php/3307271
|
* * *
|
|
SearchTools |
What is a Search Tool
and Why Would I Want One?
|
* * *
|
|
WebMonkey |
Usability & research
|
* *
|
|
Lingua::Stem (Perl) |
Lingua::Stem takes lists of words and (as determined by the locale) stems them
to their root form. This is primarily of use in search applications that need to be
able to find conjugated forms of words as well as exact matches.
Also Search::InvertedIndex in Perl.
|
* *
|
|
Robots.txt |
Standard info
WikiMedia Robots.txt info
Crawl delay: x
also works for MSNBot
and here
and on Yahoo
Blocking Altavista's image search with new
noimageindex and noimageclick directives for the META ROBOTS tag.
trapping bad robots
RobotCop (a bit out-of-date?)
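
For a sense of how a crawler might honour these rules, including the Crawl-delay extension mentioned above, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` agent name and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

agent = "ExampleBot"  # hypothetical user-agent name

# Is this agent allowed to fetch each URL?
print(parser.can_fetch(agent, "http://example.com/index.html"))
print(parser.can_fetch(agent, "http://example.com/private/page.html"))

# Seconds to wait between requests, or None if no Crawl-delay applies.
print(parser.crawl_delay(agent))
```

A polite spider would check `can_fetch` before every request and sleep for the reported delay between requests to the same host.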
|
* * *
|
|
From Google: Google Information for Webmasters |
"Following these guidelines will help Google find, index, and rank your site, which is the best way to ensure you'll be included in Google's results"
|
* * *
|
|
Help the Googlebot understand your web site |
"As a web site author, there are a few simple things you can do to help the Googlebot understand your web site as fully as possible..."
|
* * *
|
Microsoft: Index Server, CMS, SharePoint and Search
|
|
Using Index Server from .NET |
Using Index Server from .NET [idunno.org].
Index Server can index websites, but only those it can 'browse' via a UNC/local path (i.e. it needs to know
the physical AND web address of each page, so it can't crawl dynamic pages with querystrings). That means it
can successfully index ASP/ASPX pages that 'exist' in the filesystem (say, your Default.aspx page), but it
doesn't parse HTML for links to follow. It will therefore only crawl News.aspx once, not News.aspx?id=1, News.aspx?id=2,
etc., and it will NEVER find pages accessed by tricks like URL-rewriting or HTTPHandlers that munge or manipulate
URLs.
Win2k Indexing Service (2001)
[MSDN] Intro to Indexing Service v3 (2003)
Using Indexing Service with Web Servers
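
By contrast with the filesystem-based approach described in this entry, a spider finds dynamic pages by parsing HTML for links. The sketch below shows that step only, using Python's standard library; the `LinkExtractor` class, the sample HTML and the example.com base URL are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical page content linking to two querystring variants of one file.
html = '<a href="News.aspx?id=1">One</a> <a href="News.aspx?id=2">Two</a>'
extractor = LinkExtractor("http://example.com/Default.aspx")
extractor.feed(html)

# Each querystring variant is a distinct URL that the spider can queue and crawl.
print(extractor.links)
```

Because the querystring is part of each extracted URL, News.aspx?id=1 and News.aspx?id=2 are treated as separate pages even though only one file exists on disk.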
|
|
Query Index Server with IXSSO in .NET |
"There are many articles on the internet about querying Index Server using MSIDXS but few concerning IXSSO in .Net"
|
|
Integrating Microsoft SharePoint Portal Search into Microsoft Content Management Server |
Using SharePoint to provide search capability to MS-CMS sites
Better instructions on CodeProject
|
|
Filtershop
|
WMA, MP3, PDF+, StarOffice/OpenOffice IFilter implementations for MS Index Server
|
|
SharePoint RTF Filter Tool
|
Microsoft supplied RTF IFilter
|
|
Integrating Content Management Server with SharePoint Portal (2004-05-17) |
"...it does not provide functionality in all the areas that you might need when deploying
a web application. Specifically, it is missing an index and search engine, it does
not provide eCommerce functionality, and it is not an asset management tool..."
OT: CMSWire - another CMS site
|
|
TechNet Chat: Content Management Server |
Host Guest_Scott_MS:
Q: I can't find the definite answer in the newsgroup about implementing search on
a CMS site!
Host Guest_Scott_MS:
A: We have NOT rev'd search for CMS 2002 ... but there is an existing whitepaper on
search. Search MSDN for SPS, CMS and search. [here,
but it's old] |
Products
|
|
Atomz |
"Atomz provides Web content management, site search engine, and commerce search solutions
for enterprises, commerce sites, and media sites."
|
|
mnoGoSearch |
mnoGoSearch (formerly known as UdmSearch) is a full-featured web search engine software
for intranet and internet servers. mnoGoSearch for UNIX is a free software covered
by the GNU General Public License and mnoGoSearch for Windows is a commercial search
software version.
Links to other search products/extensions
|
|
Innerprise |
ES.net full-text search... uses SQL and apparently IFilter - IndexServer-dependent???
IFilter implementations
|
|
Teleport Webspiders |
Designed for creating local copies of web data by spidering; similar to, but not the
same as, a search engine (Teleport munges links and paths so that the pages 'work'
locally, which a search engine generally isn't going to bother with)
|
|
WrenSoft |
Implements plug-in architecture for non-HTML file formats
|
|
DTSearch |
Desktop, CD-ROM and web engines available (including Linux).
|
|
KBroker |
"KBroker is an integrated suite of search based applications for web, intranet and extranet, including corporate use and e-Government information access."
|
|
Crawl-It |
|
|
Xpdf |
Xpdf is an open source viewer for Portable Document Format (PDF) files.
also
PDFTron
Dynamic PDF
|
|
PJX |
SourceForge
|
|
Perlfect |
|
|
Alkaline |
UNIX only
|
|
WebGlimpse |
UNIX only
|
|
LexTek |
"Lextek International supplies advanced information retrieval and natural language
processing technology."
|
|
Hosted products |
Atomz, Mondo Search, PicoSearch, Sandy Bay
|
|
Verity Enterprise Search |
Conduct Business Online in Multiple Languages with Verity K2
|
|
Thunderstone Search Appliance (hardware) |
and Webinator: the software version (?)
|
File Formats
|
|
Wotsit's Format? |
Very complete listing of file format 'specs'
|
|
Jakarta POI - Java API To Access Microsoft
Format Files |
"...The POI project consists of APIs for manipulating various file formats based upon
Microsoft's OLE 2 Compound Document format using pure Java. In short, you can read
and write MS Excel files using Java. Soon, you'll be able to read and write Word files
using Java. POI is your Java Excel solution as well as your Java Word solution...."
|
|
Convert-Files.com |
Helps find converters, but not necessarily code to integrate with another app...
like this site - ACCI
|
|
|
|