Last November, Google released Cloud Search with third-party connectivity. While not a direct replacement for the Google Search Appliance (GSA), Google Cloud Search is Google’s next-generation search platform and is an excellent option for many existing GSA customers whose appliances are nearing (or past) their expiration. Fishbowl is aiming to make the transition for GSA customers even easier with our new XML Feed Connector for Google Cloud Search.

One of the GSA’s features was the ability to index content via custom feeds using the GSA Feeds Protocol. Custom XML feed files containing content, URLs, and metadata could be pushed directly to the GSA through a feed client, and the GSA would parse and index the content in those files. The XML Feed Connector for Google Cloud Search brings this same functionality to Google’s next-generation search platform, allowing GSA customers to continue to use their existing XML feed files with Cloud Search.

Our number one priority with the XML Feed Connector was to ensure that users would be able to use the exact same XML feed files they used with the GSA, with no modifications to the files required. These XML feed files can provide content either by pointing to a URL to be crawled, or by directly providing the text, HTML, or compressed content in the XML itself. For URLs, the GSA’s built-in web crawler would retrieve the content; however, Google Cloud Search has no built-in crawling capabilities. But fear not, as our XML Feed Connector will handle URL content retrieval before sending the content to Cloud Search for indexing. It will also extract the title and metadata from any HTML page or PDF document retrieved via the provided URL, allowing the metadata to be used for relevancy, display, and filtering purposes. For content feeds using base-64 compressed content, the connector will also handle decompression and extraction of content for indexing.

In order to queue feeds for indexing, we’ve implemented the GSA’s feed client functionality, allowing feed files to be pushed to the Connector through a web port. The same scripts and web forms you used with the GSA will work here. You can configure the HTTP listener port and restrict the Connector to only accept files from certain IP addresses.

Another difference between the GSA and Google Cloud Search is how they handle metadata. The GSA would accept and index any metadata provided for an item, but Cloud Search requires you to specify and register a structured data schema that defines the metadata fields that will be accepted. There are tighter restrictions on names of metadata fields in Cloud Search, so we implemented the ability to map metadata names between those in your feed files and those uploaded to Cloud Search. For example, let’s say your XML feed file has a metadata field titled “document_title”. Cloud Search does not allow for underscores in metadata definitions, so you could register your schema with the metadata field “documenttitle”, then using the XML Feed Connector, map the XML field “document_title” to the Cloud Search field “documenttitle”.

Here is a full rundown of the supported features in the XML Feed Connector for Google Cloud Search:

  • Full, incremental, and metadata-and-url feed types
  • Grouping of records
  • Add and delete actions
  • Mapping of metadata
  • Feed content:
    • Text content
    • HTML content
    • Base 64 binary content
    • Base 64 compressed content
    • Retrieve content via URL
    • Extract HTML title and meta tags
    • Extract PDF title and metadata
  • Basic authentication to retrieve content from URLs
  • Configurable HTTP feed port
  • Configurable feed source IP restrictions

Of course, you don’t have to have used the GSA to benefit from the XML Feed Connector. As previously mentioned, Google Cloud Search does not have a built-in web crawler, and the XML Feed Connector can be given a feed file with URLs to retrieve content from and index. Feeds are especially helpful for indexing html content that cannot be traversed using a traditional web/spidering approach such as web applications, web-based content libraries, or single-page applications. If you’d like to learn more about Google Cloud Search or the XML Feed Connector, please contact us.

Fishbowl Solutions is a Google Cloud Partner and authorized Cloud Search reseller.