Unstructured data: why does it matter for your search experience?

Uncover the reasons why conventional structured data search solutions fall short in surfacing the continuously expanding volumes of unstructured content.
Stéphane Recouvreur 03 Aug 2023

The vast majority of search solutions available on the market cater primarily to the e-commerce industry.

Why? Money.

E-commerce is a gigantic sector that pulls in billions of dollars in revenue every year, and it was the first to see the importance of offering advanced search capabilities to customers. Search software providers were eager to fill that need and focused all their attention and effort on it, to the detriment of other sectors.

The problem, of course, is that not everyone is running an e-commerce business. Other sectors with large amounts of information (government, education, etc.) have been neglected over the years.

While most enterprise search engines are busy catering to structured data sets (remember this term; we'll get back to it later!), there is a growing need for search solutions designed specifically for unstructured data, to further improve information discoverability and user experience.

What is structured data?

Structured data refers to information that comes in an organized and predefined format — think rows and columns in a spreadsheet or a relational database.

Structured data comes with a predefined data model, meaning that information must fit within set fields or categories (e.g., size, color, availability). Such data can commonly be found in data warehouses, data lakes, or simple spreadsheets. These databases help organizations categorize and store product and service information in a systematic manner, which optimizes search experiences for customers.

How does structured data search work?

Let's take the e-commerce example.

Products are organized within a structured database so that each listing contains specific fields such as product name, description, price, size, color, availability, and location.

By applying filters, customers can easily narrow down their search results to find products that meet their specific requirements, enhancing the overall shopping experience.
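To make the filtering step concrete, here is a minimal sketch in Python. The `Product` fields, the sample catalog, and the `filter_products` helper are all hypothetical and only illustrate the idea; a real e-commerce platform would run equivalent filters inside its database or search engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical schema: every listing must fit these predefined fields.
    name: str
    price: float
    size: str
    color: str
    available: bool


# Made-up sample catalog for illustration only.
CATALOG = [
    Product("Trail jacket", 120.0, "M", "red", True),
    Product("Trail jacket", 120.0, "L", "blue", False),
    Product("City sneaker", 80.0, "M", "red", True),
    Product("Rain poncho", 35.0, "L", "red", True),
]


def filter_products(catalog, max_price=None, **field_filters):
    """Return products that match every exact-value filter and the price cap."""
    results = []
    for product in catalog:
        if max_price is not None and product.price > max_price:
            continue
        if all(getattr(product, field) == value
               for field, value in field_filters.items()):
            results.append(product)
    return results


matches = filter_products(CATALOG, max_price=100.0, color="red", available=True)
print([p.name for p in matches])  # ['City sneaker', 'Rain poncho']
```

Filtering is this simple only because every product shares the same fields. A long-form article or a PDF has no `color` or `size` to filter on.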

Structured data search

Structured data also lends itself to personalized digital experiences, as it can be presented across different channels and platforms to customers, based on third-party or first-party data insights.

What are the limitations of ‘structured’ data search platforms?

There are four main challenges and limitations that come from building a search platform that prioritizes structured databases.

  • Inflexible format – A predefined format creates consistency in processing and analyzing data. That is great for e-commerce, but not so great for unstructured content. The format restricts the ability to surface information from varying data structures and can limit the ability to capture complex or unanticipated information during a search.
  • Lack of interoperability – Searching across multiple databases with different formats (e.g., microsites, apps, social media) becomes challenging when your search is built mainly for structured data. Data from multiple sources often requires ‘normalization’ and/or ‘reconciliation’, which isn’t always easy to achieve when your data structure follows a predefined data model.
  • Limited understanding of unstructured data – Search platforms built for structured databases struggle to comprehend and handle data that doesn't live in a database. Unstructured data, such as long-form articles, blog posts, or Wikipedia pages, cannot be easily transformed into structured or semi-structured formats. Because many organizations manage large amounts of unstructured data, this leads to inefficient search experiences for both customers and employees.
  • Ineffective search for unstructured documents – Searching within long, unstructured documents presents a unique challenge. It's not enough to simply display relevant results. You need to highlight the specific sections of content that match up with a query. This granularity offers a more meaningful search result and a better user experience.
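The last point, surfacing the matching section of a long document rather than just the document, can be sketched as follows. This is a deliberately simple illustration: the `best_passage` and `highlight` helpers and the sample text are hypothetical, and counting exact term matches per paragraph stands in for whatever passage scoring a production search platform actually uses.

```python
import re


def best_passage(document: str, query: str):
    """Return the paragraph with the most query-term hits, and the hit count."""
    terms = set(re.findall(r"\w+", query.lower()))
    best, best_hits = "", 0
    for paragraph in document.split("\n\n"):
        words = re.findall(r"\w+", paragraph.lower())
        hits = sum(1 for word in words if word in terms)
        if hits > best_hits:
            best, best_hits = paragraph.strip(), hits
    return best, best_hits


def highlight(passage: str, query: str) -> str:
    """Wrap each whole-word query term found in the passage in asterisks."""
    terms = sorted(set(re.findall(r"\w+", query.lower())), key=len, reverse=True)
    if not terms:
        return passage
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(r"*\1*", passage)


# Made-up sample document: paragraphs are separated by blank lines.
POLICY = """Students must enrol before the start of each semester.

Exam timetables are published four weeks before the exam period. Timetables can change, so check again closer to the date.

Library opening hours are extended during exams."""

passage, hits = best_passage(POLICY, "exam timetables")
print(hits)  # 4
print(highlight(passage, "exam timetables"))
```

Note that "exams" in the last paragraph is not matched, because this sketch only compares exact words. Handling word variants is where stemming comes in.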

Site search and unstructured data

What is unstructured data?

Unstructured data refers to information that is not organized in a pre-defined format. It includes a wide range of different data types, including text, rich media, document collections, social media posts, analytics, Internet of Things data, and more.

Unstructured data is often complex and variable in nature, while also accounting for the majority of data managed by organizations across different digital ecosystems. Importantly, it can’t be easily searched with a structured query language.

Unstructured data sources

The majority of data created today is unstructured and comes from a wide range of sources.

  • Text – this can be from multiple sources, such as text documents, social media, online articles, research papers, PDFs, transcripts, and forums.
  • Rich media – this refers to imagery, video, and audio files from different platforms and sources, such as social media, DAMs, and podcasts.
  • Analytics – web data, such as scrapings or zero-party data like that from surveys and forms.
  • Internet of Things – sensor data from any device linked to the web, such as medical/health tracking devices, agricultural and environmental tracking devices, weather tracking devices, and more.
  • Communications data – emails and online chat data (e.g., from tools like Slack).
  • Hosted/owned data – think of things like Google reviews, which may be technically owned by Google but are harvestable databases and include a range of content.

How does unstructured data search work?

Unstructured data search, like our Squiz Search capability, goes through a complex process to turn disparate data into a searchable index, which is simply not possible with structured data search solutions.

1) Index unstructured data

When using an unstructured data search platform, indexing is crucial for centralizing and surfacing relevant content from disparate sources. It integrates search with any database, directory, social media, or website via API or automated crawling.

The steps a search platform takes during indexing include:

  1. Crawling your data sources, such as a document or web page.
  2. Running the content through a text-processing tool to remove stop words, such as “and”, “or”, etc. This step also supports natural language search, which allows users to search using colloquial language and still receive accurate results.
  3. Proceeding with tokenization, which splits text into manageable ‘tokens’. A tokenizer identifies the longest groups of contiguous characters (typically runs of letters and digits) and creates a token for each group. In the end, we obtain a long list of the ‘tokens’ or words found in a document, along with how often each one occurs. This process can be run both during indexing and in real time during a query.
  4. Creating an inverted index, in which each ‘token’ or word points to the content that contains it, across multiple sources. This is how a single search query can bring up information from multiple sources. For example, a university website search for ‘timetable’ will rely on its merged, inverted index to surface all the content that branches off from this search term, such as an individual’s student portal, support for understanding timetables, and any other related content.
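The indexing steps above can be sketched in a few lines of Python. This is a simplified illustration, not how Squiz Search is implemented: the `DOCUMENTS` dictionary stands in for crawled content, the stop-word list is a tiny hypothetical sample, and the tokenizer only handles lowercase ASCII letters and digits.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately short stop-word list.
STOP_WORDS = {"a", "and", "for", "in", "is", "of", "or", "the", "to", "your"}

# Stand-in for crawled sources: document id -> extracted text.
DOCUMENTS = {
    "portal/timetable": "View your timetable in the student portal",
    "help/timetable": "How to read the timetable and find your room",
    "news/sports": "Sports day is open to staff and students",
}


def tokenize(text):
    """Split text into lowercase tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def build_inverted_index(documents):
    """Map each token to the documents containing it, with occurrence counts."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for token, count in Counter(tokenize(text)).items():
            index[token][doc_id] = count
    return index


index = build_inverted_index(DOCUMENTS)
print(sorted(index["timetable"]))  # ['help/timetable', 'portal/timetable']
```

Because the index is keyed by token rather than by document, one lookup for "timetable" returns every source that mentions it, whatever format that source originally had.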

Indexing unstructured data

2) Search unstructured data

After a search platform has indexed data from all sources, it can then search this unstructured data during a query. The process generally includes:

  1. A user enters a keyword.
  2. The search platform uses stemming to reduce the keyword to its root form, so that it also matches related word forms, e.g., searching on a university website for ‘biology’ would also surface terms like ‘biologist’, and might surface data/content related to professors or research papers.
  3. The platform looks through the index, grabs the information that matches the keyword, and then uses machine learning to ‘tune’ the relevance of the results.
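Here is a rough sketch of that query flow. Everything in it is an assumption made for illustration: the `stem` function is a naive suffix stripper rather than a real stemming algorithm, the documents are made up, and ranking by raw term counts stands in for the machine-learned relevance tuning described above.

```python
import re
from collections import Counter, defaultdict

# Naive, hypothetical suffix list; longer suffixes are checked first.
SUFFIXES = ("ists", "ist", "ical", "ies", "es", "s", "y")


def stem(word):
    """Strip the first matching suffix, keeping at least four letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Inverted index keyed by stemmed token: stem -> {doc_id: count}."""
    index = defaultdict(Counter)
    for doc_id, text in documents.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[stem(token)][doc_id] += 1
    return index


def search(index, query):
    """Stem the query terms and rank documents by total matching-term count."""
    scores = Counter()
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        for doc_id, count in index.get(stem(token), {}).items():
            scores[doc_id] += count
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up university content.
DOCUMENTS = {
    "staff/dr-lee": "Dr Lee is a marine biologist and lectures in biology",
    "papers/reef": "Biological survey of reef systems",
    "courses/history": "Modern history course outline",
}

index = build_index(DOCUMENTS)
print(search(index, "biology"))  # [('staff/dr-lee', 2), ('papers/reef', 1)]
```

The same stemming is applied at index time and at query time, which is why "biology", "biologist", and "biological" all meet at the same index entry.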

Metadata offers another way to search unstructured data, based on criteria such as document type or source platform.

Searching unstructured data

Why should you move to an unstructured data search tool?

Not all of your data and information lives in a neat, structured format.

According to Tom Foremski, in today's digital landscape, "every company is a media company." In essence, this implies that each organization will generate unstructured data, encompassing articles, videos, podcasts, and more. As a result, they face the critical challenge of effectively presenting this content to their target audience.

It doesn’t matter what industry or sector you operate in, e-commerce included: if you want to surface relevant content to your end users every time, then moving to an unstructured data search tool is a must.

Focusing solely on structured data search exposes you to unanswered queries, simply because the content that would answer them doesn't fit a predefined mold. Users are left frustrated and seek answers elsewhere.

Taking unstructured site search further with machine learning and AI

Technologies such as machine learning and AI allow automated, precise management of unstructured data, and they are advancing at an incredible rate.

With tools like natural language processing (NLP), search platforms can increasingly use machine learning algorithms to comprehend text in a manner akin to human understanding. This lets them create seamless, personalized digital experiences that start to blur the boundaries between search and other experiences.