Leveraging Fuzzy String Matching In Competitive Intelligence

Product comparison is one of the crucial aspects of competitive intelligence (CI). There are two modes of product comparison:

  1. Comparison of own product price across multiple channels
  2. Comparison of own product price with suitable competitor product

But the greatest challenge in this journey is finding the correct product to compare against!

In this article we explore how an NLP technique, Fuzzy String Matching (FSM), can help in accomplishing the former, especially for price tracking in e-commerce. FSM is also sometimes called Approximate String Matching.

A product is sold across multiple online channels and retailers by numerous resellers. One thing that is almost impossible to standardize is the name of the product as displayed on each website. Though it is the same product, different websites and resellers have their own distinctive ways of representing it. For example, the product Tomato Chilli Chutney, by Kitchens of India, can be written as

  1. Kitchens of India Tomato Chilli Chutney, 300gms
  2. Tomato Chilli Chutney-300g, Kitchens of India
  3. Tomato Chilli Chutney
  4. Chilli Tomato Chutney by Kitchens of India, 300g

As humans, we can easily tell that these listings are one and the same product. But as humans, we also have limits. It becomes nearly impossible to classify such cases manually when the list grows to tens of thousands of items for just one website. Imagine the volume when this matching has to be done across tens of websites and hundreds of product categories!

In such scenarios, FSM comes in handy with multiple string matching algorithms. Each algorithm has its own utility and fit, which can be verified during the model building phase. Like clustering, where categorization is based on the distance between two instances (sometimes the Euclidean distance), FSM matches strings using distance-calculating algorithms. They are categorized as

  1. Edit-based distance
  2. q-gram based distance (also called n-grams)
  3. Heuristic distance

In edit-based distance algorithms, the distance is the count of changes, such as substitution, insertion, deletion, or even transposition of characters, needed to make the two strings match. E.g. to match aaa with aba, only one substitution is required, so the distance is 1. Algorithms falling in this category are: Hamming, generalized Levenshtein, Longest Common Substring, optimal string alignment, and generalized Damerau-Levenshtein.
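To make the edit-based idea concrete, here is a minimal sketch of the plain Levenshtein distance (insertions, deletions and substitutions only, each costing 1) in pure Python. The function name `levenshtein` is ours for illustration; it is not taken from any of the packages mentioned in this article.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] = distance between the prefix of a handled so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into an empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("aaa", "aba"))      # 1: a single substitution
print(levenshtein("300gms", "300g"))  # 2: delete "m" and "s"
```

Transpositions are not counted as a single edit here; that refinement is what the Damerau-Levenshtein and optimal string alignment variants add.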

The q-gram distance is built on the q-character sized packets (q-grams) that the two strings have in common. The larger the count of common q-grams, the better the match. E.g. for ‘aaa’ and ‘aba’ with q-grams of length 2, the q-grams of the first string are {aa} and of the second {ab, ba}. Since the two sets share no q-gram, the common count is zero and the strings do not match at all under this measure. Algorithms in this category are q-gram, Jaccard, and cosine.
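The q-gram idea can be sketched in a few lines of standard-library Python. The helper names `qgrams`, `common_qgrams` and `jaccard_similarity` are hypothetical and chosen only for this illustration. The Jaccard score here is a similarity (1 means identical q-gram sets); the matching distance would be 1 minus this value.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Count every q-character window in s (empty if s is shorter than q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def common_qgrams(a: str, b: str, q: int = 2) -> int:
    """Number of q-grams the two strings share, counting repeats."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def jaccard_similarity(a: str, b: str, q: int = 2) -> float:
    """Shared distinct q-grams divided by all distinct q-grams."""
    set_a, set_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not set_a and not set_b:
        return 1.0  # convention chosen for this sketch: two empty profiles match
    return len(set_a & set_b) / len(set_a | set_b)


print(sorted(qgrams("aba")))                # ['ab', 'ba']
print(common_qgrams("aaa", "aba"))          # 0: no shared 2-grams
print(common_qgrams("chutney", "chutny"))   # 4: ch, hu, ut, tn
print(jaccard_similarity("chutney", "chutny"))
```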

Heuristic distances are driven more by practical application than by a strict mathematical basis. The Jaro distance between two words is based on the number of matching characters (the same character at nearby positions in both words) and the number of transpositions among those matches. Jaro-Winkler is an advancement on Jaro that gives extra weight to words sharing a common prefix.
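As a rough illustration of the heuristic family, the sketch below implements the commonly cited definitions of the Jaro and Jaro-Winkler similarity in pure Python. The function names and the defaults `p=0.1` and `max_prefix=4` are assumptions based on the usual textbook form, not on any specific package. Both functions return a similarity between 0 and 1; the distance is 1 minus that value.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they sit within this many positions
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count mismatches
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    similarity = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return similarity + prefix * p * (1 - similarity)


print(round(jaro("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```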

R and Python offer multiple packages to implement FSM.

R – stringdist, fuzzywuzzyR

Python – fuzzywuzzy, python-Levenshtein
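To tie this back to the chutney listings, here is a small sketch that uses only Python's standard library (`re` and `difflib`) so that it runs without installing anything. The helpers `normalise` and `sorted_token_ratio` are hypothetical names for this illustration; fuzzywuzzy offers a similar scorer called `token_sort_ratio`, which reports scores on a 0-100 scale. Sorting the tokens before comparing makes word order irrelevant, which is what lets "Chilli Tomato" line up with "Tomato Chilli".

```python
import re
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case the name, keep only alphanumeric tokens, and sort them."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(sorted(tokens))


def sorted_token_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalised forms of two product names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


listings = [
    "Kitchens of India Tomato Chilli Chutney, 300gms",
    "Tomato Chilli Chutney-300g, Kitchens of India",
    "Tomato Chilli Chutney",
    "Chilli Tomato Chutney by Kitchens of India, 300g",
]

reference = listings[0]
for other in listings[1:]:
    # Higher scores mean the listing is more likely the same product
    print(f"{sorted_token_ratio(reference, other):.2f}  {other}")
```

In practice a cut-off score has to be picked for declaring two listings the same product, and that threshold is best tuned on a labelled sample of listings.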


5 Questions To Ask When You Get A Data Dump To Analyse

As analysts, our first reaction when we get data to analyse is to ask for an ETA. Then we plan to attack the data left, right and centre and squeeze the maximum juice out of it with an exhaustive list of analysis techniques.

Oh dear, we got it all wrong!

This approach might produce 50 slides to present the analysis, but it would miss the most crucial aspect of any analytics engagement: buy-in from the business folks. That is because, from the start, the target of the whole analysis was the data, and not the business problem for which you got the data in the first place.

To avoid such a situation, it is essential to look for the purpose and not just the means. Having burnt my fingers fiddling with data bare-handed, I realized that some rules and checks should be set as standard practice before seizing the hot cauldron of boiling data.

1. What is the objective of analysis

Without an objective, you would be like a swimmer in the middle of a river who is unable to figure out which bank to head for. When you get a data set to analyse, especially in a client-vendor situation, ask whether the objective is to showcase analytics capability or whether there is an unspoken or explicit business goal. Ask what is expected out of the analysis. The premise set at the start defines the course of your journey for the next couple of hours or days.

2. Understand all the data columns

Never ever accept a data dump without understanding the data columns. Seven times out of ten you will find similar-looking columns with different contexts. For example, columns like ‘Input_ASIN’ and ‘ASIN’ will leave you looking for a metadata file to get more context. It becomes even more challenging when there are 60-70 odd columns to work on. So, don’t presume. ASK!

3. Who the target audience is

A very crucial piece of information about an analysis is: who is the target audience? Knowing certain information about who would consume this analysis makes a whole lot of difference:

  • What is the designation: The higher in the hierarchy, the less tactical and more strategic the orientation would be. As put by Avinash Kaushik, senior managers tend to be less interested in data and more in the story or bigger picture.
  • Which business division: The focus of analysis changes with the division. Whether to focus on dispatch lead time, discount, promotions, product variety, profit or revenue… everyone has a different interest.
  • Know-how of analytics: To present the analysis, you need to understand how comfortable the audience is with analytics.

4. If any analysis has been done in the past on the same data set

There is nothing more frustrating than hearing ‘I know this. What else have you got?’ after you have put so much hard labor into preparing the analysis. If possible, enquire beforehand what analysis has been presented to date or what the client already knows. It can be tough to find this out at times, but what’s the harm in asking?

5. How will it be shared with the client

Beyond preparing the analysis, it is imperative to know how it will be shared with the client. Will it be an Excel file, a PowerPoint presentation or text in an email? Will someone present the analysis, or will it simply be dropped on the other side? The amount of text and context in the analysis should depend on the case.

Last word: While presenting the findings, talk the language of the business and less of maths and statistics, so that business folks can follow you rather than just listen. For example, instead of saying that the median price of products sold decreased, say that the focus of customers has shifted towards cheaper products.

