03 November, 2016
Google’s Custom Search (CSE) provides a cost-effective and simple means to deliver a quality site search experience. The CSE API returns results in XML and JSON, letting you query the Google engine for just your site.
With yearly costs as low as a few thousand dollars, it can be a great bridge until your site requires a more advanced search or budgets for a full search implementation are available. In this post we cover some tips to get the most out of CSE on your Sitecore site.
Start with good HTML
I’d like to think this goes without saying, but just in case: Google CSE is essentially the GoogleBot, so no bad HTML markup practices. Pretty please.
Semantic Markup
CSE will process structured data and make it available in the API for filtering, sorting and ranking/biasing. This allows your search UI to offer features like filters and to narrow the scope of a search.
CSE supports PageMap, meta tags, Page Dates and Rich Snippet Data. There are pages and pages of content in the CSE guides, but what you need to think about here is which page-level fields authors will be asked to fill in and which will be generated by the Sitecore solution from other data sources.
Let’s take a look at one format: PageMaps. PageMap data is XML that can be:
- Embedded directly on the page, or
- Included in the XML sitemap
Here is a simple example of a PageMap snippet.
<PageMap xmlns="http://www.google.com/schemas/sitemap-pagemap/1.0">
<DataObject type="document" id="hibachi">
<Attribute name="name">Dragon</Attribute>
<Attribute name="review">3.5</Attribute>
</DataObject>
</PageMap>
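Because PageMap values usually come from page fields, the snippet can be generated rather than written by hand. The following is a minimal sketch in Python, used here for illustration only (a Sitecore solution would do this in its own rendering code). The helper name `build_pagemap` and its parameters are hypothetical and not part of any Google or Sitecore API.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the PageMap snippet above.
PAGEMAP_NS = "http://www.google.com/schemas/sitemap-pagemap/1.0"


def build_pagemap(object_type, object_id, attributes):
    """Return a PageMap XML string holding a single DataObject.

    `attributes` maps attribute names to values, e.g. {"review": 3.5}.
    This helper is hypothetical and only mirrors the snippet's structure.
    """
    root = ET.Element("PageMap", {"xmlns": PAGEMAP_NS})
    data_object = ET.SubElement(
        root, "DataObject", {"type": object_type, "id": object_id}
    )
    for name, value in attributes.items():
        attribute = ET.SubElement(data_object, "Attribute", {"name": name})
        attribute.text = str(value)  # ElementTree escapes special characters
    return ET.tostring(root, encoding="unicode")


pagemap_xml = build_pagemap(
    "document", "hibachi", {"name": "Dragon", "review": 3.5}
)
print(pagemap_xml)
```

Generating the XML through a library instead of string concatenation keeps author-entered values properly escaped.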
If you then wanted to use the review attribute as part of a filter or other search operation, such as finding all pages with a rating of 5, the CSE call changes to something like:
https://www.google.com/cse?cx=[CSEID]&output=xml&q=[my query]+more:pagemap:document-review:5
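A search UI would normally assemble this URL in code so that the query and filter are encoded correctly. Here is a small Python sketch, assuming the endpoint and the `more:pagemap:` operator shown above; the function name `build_filtered_query` and the placeholder engine ID are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint as shown in the example URL above.
CSE_ENDPOINT = "https://www.google.com/cse"


def build_filtered_query(cse_id, query, object_type, attribute, value):
    """Build a CSE URL that restricts results by a PageMap attribute.

    Hypothetical helper: it appends the structured-data restriction
    (more:pagemap:TYPE-ATTRIBUTE:VALUE) to the user's query text.
    """
    restriction = f"more:pagemap:{object_type}-{attribute}:{value}"
    params = {"cx": cse_id, "output": "xml", "q": f"{query} {restriction}"}
    # urlencode turns spaces into '+' and percent-encodes the colons.
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


url = build_filtered_query("YOUR_CSE_ID", "sushi", "document", "review", 5)
print(url)
```

The colons in the filter come out percent-encoded (`%3A`), which is equivalent to the hand-written form of the URL.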
There is a ton more to read on filtering structured data, which can be found in Google's CSE documentation.
For now, just know CSE can filter, sort and bias and you will need to decide what is warranted in your search UI and what structured data you may need to power filters, sorts and ordering.
Excluding Content
On certain pages, there will be some pieces of content that should be excluded from the search consideration.
Dealing with repetitive boilerplate and navigation elements would be the first issue to address. Any content on the page that is inside an HTML tag with a nocontent class is ignored from a keyword perspective, although its links are still crawled. So if you do not want content (like navigation links) to be evaluated as content, simply wrap it up like so:
<div class="nocontent">
<!-- It's fine to combine class="nocontent" with other classes in this div -->
<!-- The area to exclude -->
</div>
There will be certain places in layouts or components that you know need this nocontent class, so add it there. To enable the nocontent feature, the CSE admin needs to change a configuration file.
This developer-level setup works well when you are dealing with standard site elements like navigation, but in a component-driven CMS like Sitecore, an author may need more flexibility. This is why we build all of our Sitecore solutions to allow the author to add classes from a set list of CSS classes to the wrapping tag of any component. The following screenshot shows the fields of a component that define the available classes. Note that this example was also configured to allow for control of screen reader and mobile visibility.
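The idea of an author-controlled wrapper can be sketched as follows. This is an illustration in Python, not our Sitecore implementation: the class names other than `nocontent`, the `ALLOWED_CLASSES` set and the `wrap_component` function are all hypothetical.

```python
# Hypothetical whitelist of classes an author may pick for a component.
# Only "nocontent" comes from CSE; the other names are made up.
ALLOWED_CLASSES = {"nocontent", "sr-only", "hidden-mobile"}


def wrap_component(inner_html, selected_classes):
    """Wrap rendered component markup in a div carrying the allowed classes.

    Classes outside the whitelist are dropped, so authors cannot inject
    arbitrary values into the class attribute.
    """
    classes = [name for name in selected_classes if name in ALLOWED_CLASSES]
    if not classes:
        return f"<div>{inner_html}</div>"
    return f'<div class="{" ".join(classes)}">{inner_html}</div>'


html = wrap_component("<nav>Home</nav>", ["nocontent", "not-in-list"])
print(html)
```

Restricting authors to a fixed list keeps the markup predictable while still letting them exclude a single component from search.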
Excluding Pages
There may be entire pages you do not want to include. I think this is best addressed by defining some fields on each page to emit the standard noindex, nofollow robots meta tag markup. Just keep in mind this impacts all legitimate search engines.
The above settings made by a content author result in the following robots meta tag on the page.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
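How page-level fields map to this tag can be sketched in a few lines. The Python below is illustrative only; the function `robots_meta` and its two boolean parameters stand in for whatever fields your page template defines.

```python
def robots_meta(index=True, follow=True):
    """Return the robots meta tag for a page, or "" when none is needed.

    `index` and `follow` are hypothetical stand-ins for two checkbox
    fields on the page; when both are True the default crawler
    behaviour applies and no tag is emitted.
    """
    if index and follow:
        return ""
    directives = [
        "INDEX" if index else "NOINDEX",
        "FOLLOW" if follow else "NOFOLLOW",
    ]
    content = ", ".join(directives)
    return f'<META NAME="ROBOTS" CONTENT="{content}">'


tag = robots_meta(index=False, follow=False)
print(tag)
```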
One final option: if you want a page to appear in public search results but not in your CSE results, you will need to configure very specific page inclusion rules in Google’s CSE console. In the following example we would add any pages under products. As you can see, by being careful with inclusions, you can keep pages out of the possible results.
Synonyms
Not an exciting part of search tuning, but an important one. Google CSE allows up to 2000 synonyms per search engine, which should be more than enough for most sites. It does not get much easier than this: go to the CSE console, select Search features, then Synonyms. Add away.
Why Not Sitecore Content Search?
Inevitably someone will ask “Why pay for Google CSE? Doesn’t Sitecore have custom search already?” The short answer: with Google CSE I can get a better quality search for less time and money. Here are some specifics:
- Sitecore content search works at the level of Sitecore items, not pages. This means that search results by default would be individual items and not pages of content.
- Building a true page-level view of content requires some heavy lifting by a programmer, and it will take some time to match all the features CSE has for excluding content.
- The content search API is powered by Lucene or Solr, which are great indexing platforms, but getting actual search capabilities like synonyms takes yet more development effort.
- If you want to check out pricing for yourself, see here.
Conclusion
There is much more that can be done with Google CSE, so for the complete story we encourage you to check out the Google CSE site.