ASP.NET C# Search Engine (highlighting, JSON, jQuery & Silverlight)

Download source code - 285 Kb or with iTextSharp assembly - 1,383 Kb and the separate Silverlight project - Kb
Background

This article follows on from the previous six Searcharoo samples:

Searcharoo 1 was a simple search engine that crawled the file system. Very rough.
Searcharoo 2 added a 'spider' to index web links and then search for multiple words.
Searcharoo 3 saved the catalog to reload as required; spidered FRAMESETs and added Stop words, Go words and Stemming.
Searcharoo 4 added non-text filetypes (eg Word, PDF and Powerpoint), better robots.txt support and a remote-indexing console app.
Searcharoo 5 runs in Medium Trust and refactored
Searcharoo 6 adds indexing of photos/images and geographic coordinates; and displaying search results on a map.

Introduction to version 7

The following additions have been made:
Storing the complete document text during indexing

Back in October '08 SMeledath asked how the description shown in the results could be taken from the page itself... I proposed an approach but did not have time to implement it - until now.

In previous versions of Searcharoo the index contains only a 'link' between each word and the URL of documents that contain it. The number of times that word appears, and where in the document it appears, is lost during the indexing process (see version 5 for discussion of the old catalog structure). This made it impossible to display a matching 'excerpt' on the results page, since the index only stored the first 350 characters (or the META description tag) of each document - mainly because that was much easier to program.

Version 7 significantly alters the 'structure' of the index to store more data: for each word-document pairing we also store the positions of that word in the source document. For example: after parsing out punctuation and whitespace each word is assigned an index, with the first word given position zero and each subsequent word adding one. We also store the complete text of the document and can therefore extract any given part of the text.

The key differences between the old and new catalog serialized file
(called
BUT there's more - there is a NEW file called
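The word-position scheme described above (first word at position zero, each subsequent word adding one) could be sketched roughly as follows. This is only an illustration, not Searcharoo's catalog code: the `PositionIndexSketch` class, its `Tokenize` and `BuildPositions` methods, and the rule that a word is any run of letters or digits are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Searcharoo): maps each word in ONE
// document to the list of positions at which it occurs.
public static class PositionIndexSketch
{
    // Assumption: a "word" is any run of letters or digits; punctuation
    // and whitespace are treated purely as separators.
    public static string[] Tokenize(string text)
    {
        MatchCollection matches = Regex.Matches(text, @"[\p{L}\p{Nd}]+");
        string[] words = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            words[i] = matches[i].Value;
        }
        return words;
    }

    // The first word gets position 0 and each subsequent word adds one.
    public static Dictionary<string, List<int>> BuildPositions(string text)
    {
        var positions = new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);
        string[] words = Tokenize(text);
        for (int i = 0; i < words.Length; i++)
        {
            List<int> list;
            if (!positions.TryGetValue(words[i], out list))
            {
                list = new List<int>();
                positions[words[i]] = list;
            }
            list.Add(i);
        }
        return positions;
    }
}
```

With this sketch, `BuildPositions("The cat, the hat.")` records "the" at positions 0 and 2, "cat" at 1 and "hat" at 3. Keeping the array returned by `Tokenize` alongside the positions is what makes it possible to pull out any part of the text later.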
Highlighting matches in results

The majority of the code ignores the
Once we've loaded the file contents from the cache (into an array), we loop through it with some funky positioning to find the first matching word in the content, grab around 100 words around it, then loop through those 100 words and highlight ALL matches.
If it sounds like a hack: it is (kinda). Google results often identify multiple parts of the document where matches appear, and display more than one (separated by an ellipsis...) - but I will leave that for a future version (or someone else to try)...
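The excerpt logic described above (find the first match, take a window of words around it, highlight every match inside the window) might look something like the sketch below. It is not the Searcharoo implementation: the `ExcerptSketch` class, the `windowSize` parameter and the use of `<b>` tags for highlighting are assumptions for illustration, and the words are assumed to be plain text that is already safe to emit as HTML.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not Searcharoo's actual code): builds a highlighted
// excerpt from a document that has already been split into words.
public static class ExcerptSketch
{
    public static string Build(string[] words, ICollection<string> searchTerms, int windowSize)
    {
        if (words.Length == 0) return string.Empty;

        // Case-insensitive lookup of the search terms.
        var terms = new HashSet<string>(searchTerms, StringComparer.OrdinalIgnoreCase);

        // Find the first matching word; fall back to the start of the document.
        int first = Array.FindIndex(words, w => terms.Contains(w));
        if (first < 0) first = 0;

        // Start the window about half a window before the first match,
        // clamped to the bounds of the array.
        int start = Math.Max(0, first - windowSize / 2);
        int end = Math.Min(words.Length, start + windowSize);

        var sb = new StringBuilder();
        if (start > 0) sb.Append("... ");
        for (int i = start; i < end; i++)
        {
            if (i > start) sb.Append(' ');
            // Highlight ALL matches inside the window, not just the first.
            if (terms.Contains(words[i]))
                sb.Append("<b>").Append(words[i]).Append("</b>");
            else
                sb.Append(words[i]);
        }
        if (end < words.Length) sb.Append(" ...");
        return sb.ToString();
    }
}
```

Calling `Build` with a `windowSize` of 100 approximates the "around 100 words" behaviour; showing several separate excerpts per document would need a loop over every match position rather than only the first.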
Enhanced PDF indexing

CodeProject user inspire90 asked about displaying the PDF 'title' in search results but I didn't really have a solution straight away. Another user brad1213 provided a working code snippet using iTextSharp.
brad1213's code was added direct to
Incorporating this behaviour into the object model required some refactoring of the
PDF indexing process so that PDF documents are treated a little differently to other
file types that require the Version 7 now has a
There was a minor problem with this new subclass however -
I can't believe I wrote that! To subclass this would basically require re-implementing
... so the
Although it's not perfect, the refactored code does allow the subclass to take advantage of

'Default' document handling

Patrick Stuart asked about a problem he was having with 'duplicate' results - turned out to be the

To fix this problem, additional code has been added to manipulate the 'already visited' list - when a URL matches one of the 'default document' patterns, we add ALL possible 'default document' combinations to the

As indexing progresses, any variation of the URL is 'already visited', thus preventing the duplication in the catalog (and the results). The updated code looks like this (notice the three different "conditions" where a different URL can be pointing to the same 'default' page):
Set the default document for your website in <!-- Default document filename: served in folder roots [v7] -->
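To make the 'already visited' idea concrete, here is a minimal sketch of how the URL variations for a default page might be generated. It is an assumption-laden illustration rather than the Searcharoo code: the `DefaultDocumentSketch` class and `Variations` method are hypothetical, the three conditions are a guess at the cases involved, query strings are ignored, and a bare host name without a trailing slash is not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not Searcharoo's actual code).
public static class DefaultDocumentSketch
{
    // Returns every URL assumed to point at the same 'default' page as url,
    // or just url itself when it does not look like a default-document URL.
    public static List<string> Variations(string url, string defaultDocument)
    {
        string folder = null; // folder URL without a trailing slash

        if (url.EndsWith("/" + defaultDocument, StringComparison.OrdinalIgnoreCase))
        {
            // Condition 1: explicit default document, e.g. /news/default.aspx
            folder = url.Substring(0, url.Length - defaultDocument.Length - 1);
        }
        else if (url.EndsWith("/", StringComparison.Ordinal))
        {
            // Condition 2: folder with a trailing slash, e.g. /news/
            folder = url.Substring(0, url.Length - 1);
        }
        else if (url.LastIndexOf('.') < url.LastIndexOf('/'))
        {
            // Condition 3 (assumption): the last path segment has no
            // extension, so treat it as a folder, e.g. /news
            folder = url;
        }

        var result = new List<string>();
        if (folder == null)
        {
            result.Add(url); // an ordinary page: no extra variations
        }
        else
        {
            result.Add(folder);
            result.Add(folder + "/");
            result.Add(folder + "/" + defaultDocument);
        }
        return result;
    }
}
```

A spider could add every string returned by `Variations` to its visited set as soon as it downloads any one of them, so the other forms are skipped when they turn up later as links.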
A future/further enhancement could be for the code to be on the lookout for ANY case where a particular page has the exact same content as another page and do some automatic de-duplication... but for now this URL comparison seems to fix the most common bug.

JSON results 'service'

I saw this article about Silverlight-enabled Live Search and decided to try and enable Searcharoo in the same way. Unlike the article, I decided to try using JSON so I could build a jQuery front-end as well.

JSON (or JavaScript Object Notation) is a mechanism to represent data (like a serialized object graph) using just the JavaScript 'object literal' notation: it looks like a simple set of key-value pairs (with nesting and 'collections' grouped in []).

Transforming the
To create this output, we can use the same
jQuery JSON 'client'

Given that JSON output (accessible via a simple URL, like /SearchJson/New%20York.js or /SearchJson.aspx?searchfor=New%20York), we can now very simply access the results using JavaScript, or the excellent jQuery library (now 'supported' by Microsoft).
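Any client has to percent-encode the search term before placing it in either URL form. A minimal C# sketch of that step is below; the `SearchUrlSketch` class and its method names are hypothetical, while the two URL patterns are the ones quoted above.

```csharp
using System;

// Hypothetical helper: builds the two JSON result URLs for a search term.
public static class SearchUrlSketch
{
    // e.g. /SearchJson.aspx?searchfor=New%20York
    public static string QueryStringUrl(string term)
    {
        return "/SearchJson.aspx?searchfor=" + Uri.EscapeDataString(term);
    }

    // e.g. /SearchJson/New%20York.js
    public static string PathUrl(string term)
    {
        return "/SearchJson/" + Uri.EscapeDataString(term) + ".js";
    }
}
```

`Uri.EscapeDataString` encodes a space as `%20`, so a search for `New York` produces the same URLs as the examples in the text.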
The HTML page below can consume the JSON (using jQuery): there is a text input and button which captures the search term and builds
a URL, the jQuery
The result below might look similar to the 'standard' ASPX page - but as you can see from the HTML above, the page is
almost entirely generated by jQuery using the JSON results. Look for the
Silverlight 2.0 JSON 'client'

The JSON 'service' can also supply results to a Silverlight 2.0 application, using the

We will be binding a class to the
The C# code is shown below. The important elements are
(note: you need to manually Add References to
And this is what the resulting Silverlight 2.0 application looks like (with a search for dollar results showing). Because
we used the Silverlight
The Silverlight 2.0 project is a separate download that can be opened with Visual Web Developer 2008 Express
(the rest of the Searcharoo code is still .NET 2.0 and can be opened in Visual Studio or Express 2005).
Look for the

Bug fixes

Possible duplicate indexing when page is redirected

brad1213 (who has contributed to Searcharoo a couple of times) helped out with an additional 'error condition' related to the

Follows links in Html that have been commented out

brad1213 also identified a solution to the problem of links inside HTML comments (ie within

htmlData = Regex.Replace(htmlData
    , @"<!--.*?[^" + Preferences.IgnoreRegionTagNoIndex + "]-->"
    , ""
    , RegexOptions.IgnoreCase | RegexOptions.Singleline);
Surrogate Pair error (PDF indexing)

Member 4130814 reported an error serializing the catalog after indexing PDFs. I was able to reproduce it and (I think) fix it with this simple statement to remove 'nulls' from the string.

this.All += sb.ToString().Replace('\0', ' ');

Not 100% sure why those nulls were creeping into the searched text though.

Conclusion

This article has been a mix of 'requested features' (keyword highlighting, duplicate removal) and 'new toys' (JSON, jQuery and Silverlight). You can learn more about jQuery, and why JSON is an alternative to XML on the web.