Friday, 11 March 2016

Dynamism of Fuzzy Search in SAP HANA

This blog is about one of the feature that SAP HANA provides, FUZZY SEARCH.

Now the question arises, what is Fuzzy search?!... So, Fuzzy search is the technique of finding strings that match a pattern approximately (rather than exactly). It is a type of search that will find matches even when users misspell words or enter only partial words for the search. It is also known as approximate string matching.

According to Fuzzy Search Reference guide, Fuzzy Search is a fast and fault-tolerant search feature for SAP HANA. The term “fault-tolerant search” means that a database query returns records even if the search term (the user input) contains additional or missing characters, or other types of spelling error.

Fuzzy search can be used in various applications, like:
  • Fault-tolerant check for Misspelled words and typos
  • Fault-tolerant search in text columns
  • Fault-tolerant search in structured database content
  • Fault-tolerant check for duplicate records
The best real world example of such fault-tolerant search is when you type “The United States of Amerika” in the Google Search, it automatically displays result for “The United States of America”.

In SAP HANA, Fuzzy Search can be called by using the CONTAINS() predicate with the FUZZY() option in the WHERE clause of a SELECT statement.

The basic SYNTAX is:
     
SELECT * FROM <tablename> WHERE CONTAINS (<column_name>, <search_string>, FUZZY (x))

Where, x is an argument that defines fuzzy threshold. It ranges from 0.0 to 1.0 and defines the level of error tolerance for the search. A search with FUZZY(x) returns all values that have a fuzzy score greater than or equal to x.

Fuzzy Search can only be applied for:
  • Column Table
  • Attribute View
  • Also on SQL views (created with the CREATE VIEW statement), and on joins of multiple tables and views, in some cases
          having column types as:
  • String (VARCHAR, NVARCHAR)
  • Text (TEXT, SHORTTEXT, FULLTEXT INDEX)
  • DATE 
The CONTAINS() predicate can be used in the WHERE clause of a SELECT statement. It performs:
  • A free style search on multiple columns
  • A full-text search on one column containing large documents
  • A search on one database column containing structured data
The type of search it performs depends on its arguments.

The SCORE() Function

The fuzzy search algorithm calculates a fuzzy score for each comparison, the SCORE() function can be used to retrieve the score. This is a numeric value between 0.0 and 1.0.

The score defines the similarity between the user input and the records returned by the search. A score of 1.0 means the strings are identical. A score of 0.0 means that there is no similarity. The higher the score, the more similar a record is to the search input.

We can request the score in the SELECT statement by using the SCORE() function. You can sort the results of a query by score in descending order to get the best records first (the best record is the record that is most similar to the user input). When more than one CONTAINS() is given in the WHERE clause or multiple columns is used in a SELECT statement, the score is calculated as a weighted average of the scores of all columns.

For example, consider we have a column table with two fields (ID integer, TXT TEXT) having values like different variations of word ‘hello world’. Then the Fuzzy Search for word ‘hello’ with Score will return the following:

SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name> WHERE CONTAINS(txt, 'Hello', FUZZY(0.8))
ORDER BY score DESC;

SAP HANA Certifications and Material

Here, the words ‘hello’ and ‘Hello’ are having the score as 1, since the string matches completely. Whereas, word ‘ello’ is having the lowest score.

We can specify additional search options that change the default behavior of the fuzzy search as an additional string parameter for the FUZZY() function.

There are so many possible combinations of search options available. Lets try out combination of FUZZY() with similarCalculationMode.

Step 1. Create 1 column table as:

create column table <table_name>(  
ID integer,  
TXT varchar(20));  

Step 2. Run following commands to Insert values into the table:

insert into <table_name> values(1,'hello');  
insert into <table_name> values(3,'hell');  
insert into <table_name> values(4,'hel');  
insert into <table_name> values(5,'ello');  
insert into <table_name> values(7,'hello world');  
insert into <table_name> values(8,'hell world');  
insert into <table_name> values(14,'helloworld');  
insert into <table_name> values(15,'hellworld');  
insert into <table_name> values(16,'HelloWorld');  
insert into <table_name> values(17,'HELLO');  
insert into <table_name> values(21,'world');  
insert into <table_name> values(22,'word');  

Step 3. Perform string search with option similarCalculationMode

SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name>  
WHERE CONTAINS(txt, 'Hello', FUZZY(0.8,'similarCalculationMode=compare'))  
ORDER BY score DESC;  
    
We will get the output as:

Dynamism of Fuzzy Search in SAP HANA

Here, the FUZZY() compares all the strings in the table with the search string and gives the best matching results having SCORE() greater than 0.8

SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name>  
WHERE CONTAINS(txt, 'Hello', FUZZY(0.8,'similarCalculationMode=search'))  
ORDER BY score DESC;  
    
We will get the output as:

Dynamism of Fuzzy Search in SAP HANA

Here, the FUZZY() searches all the strings in the table with the search string and gives the best matching results having SCORE() greater than 0.8. Notice the difference between search and compare here. As the part of result it also includes the strings composed of two words.

SELECT TO_DECIMAL(SCORE(),3,2) AS score, * FROM <table_name>  
WHERE CONTAINS(txt, 'Hello', FUZZY(0.8,'similarCalculationMode=substringsearch'))  
ORDER BY score DESC;  
    
We will get the output as:

SAP HANA Certifications and Material

Here, the FUZZY() searches all the strings in the table having substring ‘hello’ as the search string and gives the best matching results having SCORE() greater than 0.8.

Similarly, we can try other Available Properties of FUZZY SEARCH mentioned in the reference guide. Also, we can try different combinations of these properties to get the best possible result as per the requirement.

Source: scn.sap.com