Ordr.interns Michael: Finding popular food with NLP
[Here at ordr.in we love our interns! They’re amazingly talented and far more accomplished than I was. For their first week at ordr.in we ask them to build anything they want as long as it uses our API. They present it to the company that friday. We push them right in - not a lot of time to get settled and get going! We give them whatever help and resources we can (we’re sitting about 3 feet from them, how can we help but look over a shoulder every now and again :), but this is totally their project. The creativity and technical wizardry we’ve seen is breathtaking, I hope you’ll agree.
Michael joins us from CMU and he’s returned for his second tour at Ordr.in having done his first internship with us last summer. We’re so glad he decided to come back! His project involves menus, NLP and fuzzy matching, oh my! --felix]
I wanted to build something that would simplify menus by limiting the number of items in a menu. I really liked the idea of filtering the items in an Ordr.in menu by popularity - the challenge was defining popularity. I decided to look through user recommendations/reviews and extract the items people were talking about and try to match them to our menus. The end goal of this would be to have a server that makes a standard restaurant details API request, processes the results and then returns those results in the normal ordr.in format - so anything that was built to look at an ordr.in menu could use my output, too!
I call review_matches on each review and it returns all the matches that pass a score threshold:
def extract_words(text): return ''.join(re.findall(r"[a-zA-Z0-9&\s]", text)).lower().split() def keyword_match(menu_item, keyword): item = tuple(extract_words(menu_item)) words = tuple(extract_words(menu_item)) match = max((words[i:j] for j in range(i+1, len(words)+1) for i in range(len(words)-1) if re.match(r'.*\b'+' '.join(words[i:j])+r'\b', ' '.join(item))), len) return (match, len(words)) def best_match(keyword_list, menu_item): return max((keyword_match(menu_item, keyword) for keyword in keyword_list), (lambda x : x)) def review_matches(menu, review): encoded_review = unicode(review).translate(punctuation).encode('ascii', 'ignore') keywords = etree.fromstring(alchemy.TextGetRankedKeywords(encoded_review)) keyword_list = [keyword.find("text").text for keyword in keywords.find("keywords").getchildren()] return (review, filter((lambda match : match > 0), (best_match(keyword_list, menu_item) for menu_item in menu.values())), keyword_list)
I started by splitting each review and each menu item into words and finding the longest partial subsequence matches between them. This was encouraging because it significantly reduced the number of menu items without eliminating all of them. The issue was that too many items matched words that were not actually supposed to refer to the food. For example, a review that used the word “baby” for emphasis matched a baby spinach dish.
The next step was to use Alchemy, an awesome natural language processing API. I used the same algorithm but with keywords that Alchemy extracted from the reviews instead of the entire text of the reviews. Testing this on the same menus as the first attempt showed a great increase in accuracy. It successfully reduced false positives with irrelevant words while maintaining the relevant matches. One issue is that I make a separate Alchemy API call for each review in order to take into account multiple recommendations of a single dish. This leads to long response times: tens of seconds in bad cases. For the proof of concept, though, this was totally acceptable!
Alchemy itself was pretty easy to use in my code. With their Python library, it took only a few lines of code to get the keywords from a text string:
from AlchemyAPI import AlchemyAPI, etree alchemy = AlchemyAPI() alchemy.setAPIKey(api_key) keywords = etree.fromstring(alchemy.TextGetRankedKeywords(text))
There are some improvements I’d like to make next. In order to further refine match accuracy I want to start building a list of generic words, like “grilled,” and ignore menu items that match reviews only on those words. I’ll also experiment with making the Alchemy API calls in parallel or concatenating the reviews before sending them to the API, in order to avoid the total API response time that is currently linear in the number of reviews. Ultimately, I’ll probably also check out nltk since my needs are relatively simple (if you can consider any NLP simple!).