scikit-learn 0.17 is out!

Scikit-learn 0.17 adds features and improvements that might help me:

  • stochastic average gradient solver for logistic regression is faster on big data sets
  • speed and memory enhancements in several classes
  • ensemble classifier that supports hard and soft voting as well as hyperparameter tuning of the components in grid search
  • robust feature scaler does standard scaling but excludes outliers from the standard range

The full changelog is here. I’ve been testing the changes to see how they’ll impact my work in predicting match winners in League of Legends.

Stochastic average gradient

Like lbfgs and newton-cg, sag supports warm_start so it works well in conjunction with LogisticRegressionCV to tune the regularization parameter.
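Here’s a minimal sketch of how I’m calling it. The data is a synthetic stand-in for the match features just to keep the sketch self-contained, and sag converges much better on scaled features:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the match data (61 features, like my real set)
X, y = make_classification(n_samples=20000, n_features=61, random_state=4)
X = StandardScaler().fit_transform(X)   # sag behaves much better on scaled features

# LogisticRegressionCV warm-starts along the path of C values,
# which is where sag, lbfgs, and newton-cg all benefit
clf = LogisticRegressionCV(Cs=10, cv=10, solver='sag', n_jobs=3)
clf.fit(X, y)
print(clf.C_)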

First I tried it on the 200k-match dataset with 61 features, repeating each test to make sure the numbers were consistent.

Solver | Training time | Accuracy
lbfgs | 0.5 min | 66.59%
lbfgs | 0.5 min | 66.59%
sag | 1.7 min | 66.60%
sag | 1.8 min | 66.60%
newton-cg | 2.4 min | 66.60%
newton-cg | 2.6 min | 66.60%

sag is faster than newton-cg but still about 3x slower than lbfgs. It does eke out that last 0.01% accuracy though.

sag is designed for large data sets so I also tried on the 1.8 mil x 61 dataset:

Solver | Training time | Accuracy
lbfgs | 7.2 min | 66.07%
sag | 45.8 min | 66.07%

It’s over 6x slower and achieves the same accuracy. Maybe sag’s benefit really shines on datasets with a large number of features: sklearn team testing used 500 features and 47k features.

Conclusion: Staying with lbfgs.

Other performance

The patch notes briefly mentioned speed and memory improvements in random forests and gradient boosting.

RandomForestClassifier

Tried this on the 200k x 61 dataset:

Version | Training time | Accuracy
Random Forest 0.16.1 | 14.1 min | 66.34%
Random Forest 0.17 | 13.7 min | 66.39%

The training time and accuracy fluctuations could just be differences due to randomization; random forests tend to fluctuate more than other methods from test to test. In the worst case, it doesn’t seem that much has changed. In the best case there are slight improvements.

Conclusion: Random forest is about the same, but I didn’t test memory usage.

GradientBoostingClassifier

Gradient boosting trains much more slowly than other methods so I started on the 50k x 61 dataset. I ran some tests multiple times to be certain of the results.

Version | Training time | Accuracy
Gradient Boosting 0.16.1 | 7.6 min | 66.08%
Gradient Boosting 0.16.1 with feature scaling | 8.8 min | 66.10%
Gradient Boosting 0.17 | 11.0 min | 66.17%
Gradient Boosting 0.17 | 11.6 min | 66.34%
Gradient Boosting 0.17 with feature scaling | 11.7 min | 66.14%
Gradient Boosting 0.17 with feature scaling | 11.7 min | 66.17%
Gradient Boosting 0.17 presort=False | 14.0 min | 65.94%
Gradient Boosting 0.17 max_features=auto | 11.2 min | 66.19%

Gradient boosting is clearly slower in 0.17 and generally a tad more accurate. The default presort setting is good for runtime and accuracy. Feature scaling doesn’t really help. Adjusting the max_features setting seems to help a touch (should reduce variance and improve training time).

I also tested on the 200k x 61 data:

Version | Training time | Accuracy
Gradient Boosting 0.16.1 | 43.9 min | 67.66%
Gradient Boosting 0.17 | 62.3 min | 67.75%

Again it’s slower but more accurate. I’ve opened a ticket and right now it’s under investigation. It sounds like a change in the error computation may be the culprit.

Conclusion: Gradient boosting 45% slower but a little more accurate, fix is being investigated.

VotingClassifier

In the previous post I described possible directions to get from 67.9% accuracy up to 70.0% and suggested that an ensemble of the best classifiers may be a fruitful direction but may take a bit of time to code.

Well, two things changed. First off, I found a great guide on making an ensemble in scikit-learn. I implemented a simple ensemble and improved my best results from 67.9% to 68.0% accuracy with a soft-voting ensemble of gradient boosting and neural networks. It’s not as much as I expected but it’s progress.

The second change is that scikit-learn 0.17 added VotingClassifier, implemented by Sebastian Raschka (who wrote the guide and implementation I found earlier). I ported my ensemble code to scikit-learn and it works great (though I had to change my neural network wrapper to return two columns rather than one for binary classification).
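Here’s roughly the shape of the ported code. LogisticRegression stands in for my Keras wrapper below, since the only requirement for soft voting is a fit/predict_proba pair that returns one probability column per class:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=61, random_state=4)

# the 'nn' slot is where my neural network wrapper really goes
ensemble = VotingClassifier(
    estimators=[('gb', GradientBoostingClassifier()), ('nn', LogisticRegression())],
    voting='soft')
ensemble.fit(X, y)
print(ensemble.score(X, y))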

That said, I wish it had a flag to perform calibration of the probabilities of the individual classifiers. I’m currently looking into calibrating but not finding that it helps; gradient boosting has more skewed probabilities than neural networks which leads to more weight on gradient boosting. That’s an unintentionally good decision: putting more weight on the stronger classifier.

Conclusion: VotingClassifier is easy and works like a charm.

RobustScaler

In general using the robust scaler seems like an easy solution to save time in preprocessing your data.

I tried it with logistic regression because it’s so sensitive to feature scaling. But after several tests I didn’t find any difference in either scaling+training time or accuracy.
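For reference, this is roughly how I plugged it in, with synthetic data standing in for the match features:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

X, y = make_classification(n_samples=10000, n_features=61, random_state=4)

# RobustScaler centers on the median and scales by the interquartile range,
# so outliers don't blow up the scale the way they can with standard scaling
model = make_pipeline(RobustScaler(), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))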

Thoughts

I bolded the main point of each section so I won’t summarize. But I like the direction scikit-learn is taking.


Feature scaling is important, but not how I expected

Currently I’m getting up to speed with the Keras library for neural networks. After about a day and a half of effort I have a neural network that’s tied with my best results ever for predicting the winner of League of Legends matches.

Keras doesn’t provide exactly the same methods as a scikit-learn model so I have to write some code to fit it into the cross validation and hyperparameter tuning frameworks. Then I’ll have an apples-to-apples comparison.

I learned that feature scaling is critically important for Keras models. Some models won’t even improve beyond random predictions without it! I knew that feature scaling was good for speeding up convergence but didn’t think modern optimizers would suffer total failure without it.

If you aren’t familiar with feature scaling, it’s just making sure that all your features have a similar range of values. The scaling I’m using subtracts the average value and divides by standard deviation so most values will fall in the -1 to 1 range after scaling (called standardization or a z-score). It’s also possible to scale each feature so that the minimum is 0 and maximum is 1.
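A quick illustration with scikit-learn’s two scalers on toy numbers:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# standardization (z-score): subtract the mean, divide by the standard deviation
print(StandardScaler().fit_transform(X))

# min-max scaling: squash each feature into the 0 to 1 range instead
print(MinMaxScaler().fit_transform(X))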

Feature scaling is important for machine learning models that use some form of gradient descent, like logistic regression and neural networks. In previous experiments I’d tried logistic regression with and without scaling and it had only a minor impact. However, I only ran those tests after finding a reasonable range of regularization constants to tune. Unfortunately I learned the hard way that the optimal regularization constant is radically different for scaled vs. unscaled data (1).

What I learned today

  • Optimal regularization constant (C) is dependent on feature scaling.
  • Feature scaling speeds up optimization by about 6x (including the time to scale).
  • scikit-learn has three optimizers: liblinear, lbfgs, and newton-cg
    • With feature scaling the three optimizers give almost identical results for each C value. lbfgs is consistently worse than newton-cg and liblinear but only by about 0.01%.
    • Without feature scaling, liblinear is consistently better than newton-cg or lbfgs. The best result from liblinear is about 0.02% better than the best from newton-cg and about 1.66% better than the best result of lbfgs. For my problem, 1.66% is about the gain I can get with a couple weeks of feature engineering.
    • lbfgs is drastically faster than newton-cg or liblinear
  • LogisticRegressionCV is about 2x faster than GridSearchCV for lbfgs. The ratio is more dramatic if you’re tuning over more than 10 C values because it warm-starts each successive C value from the previous solution instead of refitting from scratch (see the sketch after this list).
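A sketch of the two ways to tune C. The data is synthetic and the imports match the scikit-learn of that era, where GridSearchCV lived in sklearn.grid_search (it later moved to model_selection):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

X, y = make_classification(n_samples=50000, n_features=61, random_state=4)
Cs = np.logspace(-4, 4, 10)

# warm-starts each C value from the previous solution
lr_cv = LogisticRegressionCV(Cs=Cs, cv=10, solver='lbfgs', n_jobs=3).fit(X, y)

# refits from scratch for every C value
grid = GridSearchCV(LogisticRegression(solver='lbfgs'), {'C': Cs}, cv=10, n_jobs=3)
grid.fit(X, y)
print(lr_cv.C_, grid.best_params_)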

Tests within LogisticRegressionCV at 50k samples

  • lbfgs: 0.2 min
  • newton-cg: 1.0 min
  • liblinear: 1.4 min

Tests with GridSearchCV at 50k samples

It’s tuning the same C values just doing it a different way.

  • lbfgs: 0.6 min
  • newton-cg: 1.7 min
  • liblinear: 1.4 min

Note that GridSearchCV and LogisticRegressionCV are the same speed with liblinear because liblinear doesn’t support a “warm start” in the optimization.

Tests at 200k samples

  • lbfgs LogisticRegressionCV: 1.2 minutes [what I’m switching to]
  • newton-cg LogisticRegressionCV: 4.8 minutes
  • lbfgs GridSearchCV: 2.9 minutes
  • liblinear GridSearchCV: 7.5 minutes
  • liblinear GridSearchCV without feature scaling: 33.0 minutes [what I’ve been using for weeks :( ]

After these optimizations it’s 27.5x faster.

I didn’t even try tests with 1.8 million samples. From memory that took about 3 hours for liblinear GridSearchCV without feature scaling and tuning 5 C values instead of 10. If the same speedup holds it should train in about 6.5 minutes.

I also checked the accuracy at the optimal C value for both. The fastest run found an optimum of 66.37% accuracy. The slowest (no feature scaling) found an optimum of 66.26% accuracy. So it’s not that lbfgs is cutting corners here; we’re actually gaining accuracy due to better optimization of C value with feature scaling.

Why does this matter?

When I’m trying out a new feature I don’t know if it’s useful or not. So I test it at 50k samples. But sometimes the feature isn’t reliable in a small data set. I may find that it fluctuates the accuracy by 0.01 percentage points at 50k samples but gives a 0.2-point gain at 2 million. Big improvements are clear even at 50k though; this is mostly helping me to quickly make decisions about small changes.

It really matters because speed of iteration is life: the faster I can test ideas, the faster I make progress.

Notes

  1. The optimal C value will differ when the useful features have radically different scales. This happens because the scale of the optimal weights depends on the scale of the inputs, which in turn affects the regularization penalty, whether it’s L2 or L1.
  2. These tests tried 10 C values each. In my previous experiments I dropped it down to 4-5 values to try and speed up my experiments.
  3. For all tests I used n_jobs=3 on a quad-core cpu. They all use the exact same 10-fold cross-validation splits and the splitter is initialized with random seed 4.
  4. Comparing LogisticRegressionCV and GridSearchCV is reasonable if you’re only tuning the regularization constant C. If you’re tuning other parameters like the max number of iterations or trying L1 vs L2 then you have to use GridSearchCV.

Question processing for factoid search

Searchify was a project to enable quick factoid lookup on mobile advertisements. A full screen ad would have a search box and you could get quick answers without leaving the ad or even the app. Previously I’ve written about building synonyms for automotive in Searchify.

One problem we faced is similar to web search: Will users prefer keyword search (e.g., “price”) or full question search (e.g., “How much is it?”)?

I’ll cover this in three sections: 1) trying to get user data 2) separating questions from keyword search and 3) extracting useful info from full questions.

Will people actually enter questions?

Previously we worked on Swype; we’re intimately familiar with mobile text input. Usually people cut corners and write short messages when possible. If you’re typing then keyword search is easier. But if you’re using speech recognition it’s less clear. It’ll recognize full sentences better and there may be a preference for full questions.

If possible it’s best to look at data. But without a live system there’s no data coming in. So I made a mock image of the system and used Mechanical Turk to solicit example searches. I iterated on this and learned several tips for Mechanical Turk.

Mechanical Turk HIT for Toyota Prius Searchify ad

I tried to keep the directions short and tried not to lead users into giving questions vs keywords. I also ran a second version specifically asking for questions rather than searches.

This task was extremely informative. Notes:

  • 10% of responses (101/991) were questions with general directions “What would you search for?”
  • 98% of responses (646/658) were questions when users were specifically asked for questions. “Ask questions about a car ad”
  • Some small percentage didn’t follow the directions regardless of what I did. Things like blank fields, including “Toyota Prius” in the search, etc.

Unfortunately, this sample is biased towards people typing on a computer. We can’t make definitive conclusions except that probably we should handle both keywords and questions.

Identifying questions

Full questions sometimes work in ElasticSearch because any stopwords are filtered out. But not all of the question words are in standard stopword lists. A word like “how” might be filtered but “much” not.

Keyword searches can be sent directly to ElasticSearch without processing. But any questions need to be preprocessed and the keywords should be sent to ElasticSearch. In other words, we need to first identify questions.

At first we just looked for a question mark, but people don’t always enter one. Then we expanded to a list of unigrams/bigrams at the start of the search. I couldn’t find evaluation numbers for this, but I remember there were a few unexpected cases like “car safety rating?” or “miles per gallon?”.

We implemented a simple system so that we could spend our time on more important problems. In the context of a prototype this was the right decision. But if we expected to quickly scale to different domains and languages I would’ve set it up as logistic regression or another simple classifier.
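A rough reconstruction of that heuristic; the starter list here is made up and much shorter than the real one:

# made-up starter list; the real unigram/bigram list was longer and tuned
# from the Mechanical Turk data
QUESTION_STARTERS = {"what", "how", "why", "when", "where", "who", "is", "does",
                     "can", "how much", "how many", "what is", "what's"}

def is_question(search):
    tokens = search.lower().strip().split()
    if search.strip().endswith("?"):
        return True
    return (" ".join(tokens[:2]) in QUESTION_STARTERS
            or (tokens and tokens[0] in QUESTION_STARTERS))

print(is_question("price"))           # False: send straight to ElasticSearch
print(is_question("How much is it"))  # True: preprocess first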

Classifying questions

Given a question like “How much is it?” we need to know to look up the price. None of the words in the question clearly indicate this though.

The closest academic work we found was by Li and Roth, who provide their question classification dataset online. The nice thing is that each question has a two-part answer type, like NUMERIC/money or NUMERIC/count. The unfortunate part is that the answer types don’t cleanly align to our data.

Experiments on the Li and Roth data

Several academic publications use this dataset and achieve good accuracy. We unfortunately didn’t have enough time to compete with such strong approaches but it gave us a reasonable upper bound on performance.

We used two very simple approaches to classify questions. In both we lowercased and tokenized the question roughly following Penn Treebank tokenization. We also added a period as the first token to handle start of question.

The first approach was to extract unigrams, bigrams, and trigrams from the question and use them as binary features in logistic regression using scikit-learn. This is an extremely simple approach and would be more of a baseline in academic work.
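The whole model is only a few lines of scikit-learn. The tiny question list below is just to make the sketch self-contained; the real training data is the Li and Roth set:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy examples standing in for the Li and Roth training questions
questions = ["how much is it ?", "who makes it ?", "where is the dealer ?"]
labels = ["NUM:money", "HUM:ind", "LOC:other"]

clf = make_pipeline(
    # binary=True gives presence/absence features for unigrams through trigrams
    CountVectorizer(ngram_range=(1, 3), binary=True, lowercase=True),
    LogisticRegression())
clf.fit(questions, labels)
print(clf.predict(["how much does it cost ?"]))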

There were concerns that this would be too slow or overfit the data so we also had a model that used only the first one, two, three, and four words of the question. The goal was to classify purely based on openings like “how much”.

We split the data 80/20 and evaluated on the held out data. We also evaluated on the training data to help understand how much we’re overfitting.

Two-part classes (50 classes) | On test data | On training data
Predict majority | 17.3% | 17.7%
Predict by first word only | 33.5% |
Logistic reg on uni/bi/tri | 76.2% | 97.9%
Logistic reg on start of question uni/bi/tri | 56.5% | 62.5%

One-part classes (6 classes) | On test data | On training data
Predict majority | 22.9% | 22.9%
Predict by first word only | 50.4% |
Logistic reg on uni/bi/tri | 84.0% | 98.3%
Logistic reg on start of question uni/bi/tri | 68.5% | 73.7%

Even these simple approaches are significantly better than predict majority and predicting using just the first word. But they’re clearly overfitting the training data: there’s a big gap between accuracy on the training and testing data.

Domain adaptation to car factoid search

We also have a smaller corpus of questions for Searchify solicited from ourselves as well as Mechanical Turk workers. One quick test is to run the classifier on our questions.

Unfortunately we found a lot of questions labeled as HUMAN:individual. This is the most common question class in the Li and Roth data set so it makes sense for the classifier to default to the training set majority.

But those labels on our searches had 0% accuracy. Many others were labeled as DESCRIPTION:manner which was incorrect. Some classes were very accurate though, like NUM:count, NUM:date, NUM:money, NUM:period (period of time), NUM:volsize, and LOCATION.

We had someone annotate our question data for answer types and found a disconnect; many are yes/no questions but the Li and Roth data doesn’t have that type. So we added a few answer types.

The second problem is that our annotator didn’t use Li and Roth labels when appropriate, so some questions were assigned “number” without the second part label or “description”. To address this problem, we ran the classifier from Li and Roth training data over our dataset and if the annotation was “number”, we would find the most probable NUM:… label and relabel the data.

We also combined our data with the Li and Roth data but duplicated our questions to balance the size of the data sets. It’s a cheap trick to make sure the classifier isn’t skewed towards the Li and Roth data (which was the larger data set).

Even still, many errors remained. HUMAN:individual was always an incorrect classification. There were also problems in wording, such as “What’s the curb weight?”. This was annotated as a number but the classifier gave ENTITY:term (without domain adaptation) or description (with domain adaptation). The problem is that “What’s the” in the Li and Roth data is usually asking for definitions or terms.

I didn’t get the chance to do a nice evaluation because I was mostly dealing with bug reports like “how much does” should give NUMERIC/money and similar things. Things got hectic and I ended up shifting to extracting keywords.

Keyword extraction

Ideally we need to extract keywords from the question and send them to ElasticSearch. For example, given “What type of engine does it have?” we want to search for “engine” or maybe “type of engine”.

I did a proof-of-concept test to see if we could solve it with shallow methods:
1) Extract tail (the end of the question)
2) Extract middle
3) Inverse document frequency over the questions

(If I had to solve this again I might start with sequence labeling because it can represent more complex ways of extracting keywords.)
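Here’s a sketch of the IDF idea over a toy question set; the real corpus was larger and, as noted below, I can’t recover the exact threshold I used:

import math
from collections import Counter

# toy stand-in for the solicited question searches
questions = ["what type of engine does it have",
             "how much horsepower does it have",
             "what is the gas mileage"]

doc_freq = Counter()
for q in questions:
    doc_freq.update(set(q.lower().split()))

def idf(word):
    return math.log(len(questions) / (1.0 + doc_freq[word]))

def keywords(question, threshold=0.0):
    # keep only the rare (high-IDF) words; common question words score low
    return [w for w in question.lower().split() if idf(w) > threshold]

print(keywords("what type of engine does it have"))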

What I found was that the relevant keywords could be extracted successfully by each method:
Extract tail: 34/52
Extract middle: 43/52
IDF*: 50/52

* I can’t figure out what IDF threshold I used so this may be optimistic.

Overall one thing I felt is that IDF had the potential to mostly solve the problem if we compute the IDF score over the set of example questions (we’d need to hold out data for evaluation though).

A few examples with the keywords we want underlined:

How much horsepower does it have?
What’s the miles per gallon?
How many miles does it get per tank?
How big is the battery?

Note that there’s some interaction in system design between question classification and keyword extraction. In the case of “How big is the battery?” we may not want the keyword “big” if the question type is NUMERIC/size.

Using the non-question data to help the question data

When I started to write code to productize keyword extraction, I realized that we wanted to convert the question data to be more like the non-question data.

So I tried another approach: what if I only extract the terms from a question that commonly appear in keyword searches? This worked surprisingly well!

There was one problem though: Toyota and Prius were often labeled as important keywords because they sometimes appeared in junky keyword searches. It helped to remove those.

Some examples of good labeling with this method:

what is the <terms>engine</terms> ?
what 's the <terms>gas mileage</terms> of a prius really ?
what is the typical yearly <terms>maintenance cost</terms> ?
what are the most <terms>common repairs</terms> for this vehicle ?
what are the details of the <terms>warranty</terms> ?

Even in the good cases sometimes the nuance of the question is lost. When asking what the gas mileage really is, they’re saying that they doubt the official numbers. Similarly when asking for details of the warranty, they’re asking for more than just the basic info.

Some examples of bad labeling:

how many people does this <terms>car</terms> seat ( is it a 2 door or 4 door ?)
how many cylinders does the <terms>engine</terms> have ?
how much horse <terms>power</terms> does it have ?

Sometimes the question isn’t something you can easily ask in a keyword search (like the first one). In the cylinders case, our keyword data is just too sparse. In the third example, usually people spelled horsepower without a space so it wasn’t identified.

More data would help this approach. If I could do this again I would’ve solicited more keyword searches from Mechanical Turk. If there were lots more time, formal evaluation would allow experiments in other approaches. I bet you could get far by indexing the keyword searches in LSI with gensim and using the questions to search the keyword database. Then send the best keyword search to the real ElasticSearch.

Conclusions

This work was fun but we didn’t put in enough effort to build a good question-handling system. Things we should’ve done to achieve good results:

  • Design a superset of Li and Roth classes to cover our domain
  • Annotate a larger data set for question types
  • Sequence labeling for keyword extraction (but with so little data, the features may have needed to be more general than words)
  • We didn’t look deeply enough into how to integrate this into the full system. For instance, any NUM question should boost up numeric results. NUM:money would just add keywords like “price” and “MSRP”. This should have been all automated.

Synonyms for factoid search, Part 3

In the previous two posts I described 1) our problem and initial simple approaches and 2) WordNet-based solutions. Now I’m finally writing up our best solutions: gathering a domain-specific corpus and learning word associations.

Quick reminder: The goal is to look up facts related to a car advertisement. But there’s a huge problem in that the data only has one or two words per fact and people often use different words when searching. So we need a good source of synonyms or related words for the automotive domain.

Note: This post was much longer than I expected :(

Problems with previous approaches

The sources often lack the synonyms that we really need. I can’t fix that with better ranking; the data just isn’t there. The other problem is that both Wikipedia redirects and WordNet really need good word sense disambiguation and that’s hard.

Scraping a domain-specific corpus

Goals:
1) Automated
2) Scale to other domains

The second goal is tricky: I wanted something that would work when we’re dealing with ads for phones, drugs, homes, credit cards, etc.

There’s a subreddit for almost anything. A quick search for auto subreddits shows /r/cars, /r/Autos, /r/MechanicAdvice, /r/Cartalk, and tons of manufacturer-specific subreddits.

So I wrote a reddit scraper using a Python module named scrapy. The setup was a bit involved (a minimal spider sketch follows the steps):
1) Get scrapy installed. On Windows use Anaconda to make this easier
2) Run scrapy command to create a default project
3) Make classes for the items you want to scrape
4) Write a spider with starting URLs, rules to extract URLs that you want to follow next, rules to filter URLs to follow, and rules to extract the actual items from the pages
5) Add a pipeline to clean up the items (strip any leftover html, clean up whitespace, etc)
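The spider skeleton ended up looking roughly like this. The class names, start URL, link regexes, and CSS classes are illustrative rather than exactly what I used (and reddit’s markup has changed since):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ThreadItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    body = scrapy.Field()

class AutoSubredditSpider(CrawlSpider):
    name = "auto_subreddits"
    start_urls = ["https://www.reddit.com/r/cars/"]
    custom_settings = {"DOWNLOAD_DELAY": 1.0}   # be polite to their servers
    rules = (
        # only follow "next page" links and links into threads
        Rule(LinkExtractor(allow=r"\?count=\d+&after=")),
        Rule(LinkExtractor(allow=r"/comments/"), callback="parse_thread"),
    )

    def parse_thread(self, response):
        item = ThreadItem()
        titles = response.xpath("//title/text()").extract()
        item["title"] = titles[0] if titles else ""
        item["url"] = response.url
        # grab the rendered comment bodies by CSS class
        item["body"] = response.css("div.md").xpath(".//text()").extract()
        return item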

Generally you can follow the tutorial. Tips:

  • Use scrapy XPath selectors to pick the parts of the HTML you want to process. When possible prefer to extract content based on CSS classes and ids. Figure out the XPath you need by right-clicking in Chrome and doing “Inspect element”
  • Be strict about only following “Next” links and links to detail pages. I had bugs where it’d follow links to “Previous” which had a different URL so it fooled scrapy into visiting the page again (normally the framework will track sites it’s visited and only visit once). I also had bugs where it’d follow links to different sort orders and just lead to cycling the same content.
  • Set a DOWNLOAD_DELAY to 1 sec or more. This ensures that you aren’t hitting their servers much.
  • Decide whether you want a depth-first search or breadth-first. They have memory tradeoffs but depth-first tends to go off into weird deep sections of the internet. See this.
  • When running your scraper from the command line, you can store the scraped items as Json which is great for downstream processing.
  • I used the HTML stripper from NLTK (nltk.clean_html). It’s higher quality than writing your own.
  • If I could do it again I’d probably spend a day and see if I can set up two levels of Kimono scrapers to accomplish something similar.

For this project I really just need a collection of documents. I wasn’t sure at first whether I wanted each comment to be a document or each thread so I made sure I could experiment with both. So I have a file format like:

{"title": "Thread 1 title", "url": "http://blah", "body": ["Post 1", "Post 2", ...], "links": ["http://...", "http://...", ]}
{"title": "Thread 2 title", "url": "http://blah", "body": ["Post 1", "Post 2", ...], "links": ["http://...", "http://...", ]}
...

With a rate limit of about one page per second I let it run for about a day and got ~100mb worth of text.

Checking data quality

This is likely a topic for a dedicated post but don’t assume that you have clean data. Some simple checks I did for this project:

  • Read through the actual Json for 20-30 minutes. I always end up finding duplicates, cut off content, bare URL lists, and other weirdness. It’s counterintuitive in a world where we focus on automation but it’s worth it.
  • (NLP-related) Tokenize the text and make a unigram distribution. Inspect it any time it’ll change.
    • Are common words at the top like the, a, an, I, you, and? (Don’t read into the exact order too carefully because it’s domain-dependent)
    • Is capitalization reliable or unreliable? (doesn’t matter if you lowercase)
    • How is your tokenizer handling words like “it’s”, “your’s”, “don’t”, “I’m”?
    • In the automotive domain, I check that all the terms I need are in there, such as Toyota, Prius, car, cars, truck, trucks, engine, hybrid, mileage, milage, safety, air bag, etc.
    • If there are any weird words in the top 100, search the actual Json. They may come from duplicate documents that result from a scraping bug.

Using domain-specific corpus for filtering

One of the early things I tried was to filter the Wikipedia redirects and WordNet synsets by how frequent the terms were in the reddit automotive corpus. It’s simple and effective but ends up filtering the lists a bit too much.

Filtering WordNet

color: colour
security: protection
option: choice, alternative
grey: gray
warranty: guarantee, warrant, warrantee

They’re pretty good but I couldn’t use examples from the previous post. Almost all of those lists were empty after filtering!

Filtering Wikipedia redirects

security: the security, securities, security systems, marketable, securing
option: options, configurable
grey: gray, dark gray, grayish, greyish
warranty: warranties, lifetime warranty, warrenty, car warranty

Similar to WordNet we’re getting most of the bad synonyms out. It’s clearly more precise with filtering but the lists are much smaller. And it doesn’t solve the problem of missing the words we need.

Synonyms from latent semantic analysis

A common solution for finding synonyms came from information retrieval research in the 90’s. The rough approach is to build a giant matrix of word counts per document, apply singular value decomposition, then drop the least important dimensions.

Using the resulting decomposition in information retrieval is called latent semantic indexing. Usually when it’s applied to synonyms it’s called latent semantic analysis. And more broadly it’s a machine learning technique called principal components analysis.

The result is that you get three matrices: a matrix mapping terms to semantic dimensions, the weights for those dimensions, and a matrix mapping documents to semantic dimensions. For synonyms you can use the term-to-semantic-dimension matrix to compute semantic similarity. (1)

Previously I tended to stay away from LSI/LSA because they use compiled tools with little documentation. But we found a great Python module called gensim, which provides a nice wrapper around LSI/LSA as well as related approaches like LDA and word2vec.

There are a couple really nice things about using LSA:

  1. No word sense disambiguation (2)
  2. Continuous similarity scores between terms, can set a threshold to tune precision vs recall

We treated each Reddit thread as a document because the comments were too short to learn useful cooccurrences.

To look up synonyms I make one “document” for each word in the vocabulary. Then I used gensim’s LsiModel and MatrixSimilarity to “search” for the most related terms for any input.
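A minimal sketch of that lookup with gensim, assuming threads is the list of tokenized Reddit threads (one list of words per thread, loaded elsewhere):

from gensim import corpora, models, similarities

# threads: list of token lists, one per Reddit thread (assumed loaded elsewhere)
dictionary = corpora.Dictionary(threads)
corpus = [dictionary.doc2bow(thread) for thread in threads]
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=200)

# one single-word "document" per vocabulary term, projected into LSI space
vocab = list(dictionary.token2id)
vocab_corpus = [dictionary.doc2bow([word]) for word in vocab]
index = similarities.MatrixSimilarity(lsi[vocab_corpus])

def related_terms(word, topn=10):
    sims = index[lsi[dictionary.doc2bow([word])]]
    return sorted(zip(vocab, sims), key=lambda pair: -pair[1])[:topn]

print(related_terms("tires"))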

Examples

These examples come from my old emails, so they aren’t the exact same terms as the previous tests, sorry! I’ll just pick a few.

black: matte, stain, vinyl, plasti, plastidip, paint, ...
color: colors, blue, colour, repaint, wrap, ordered, vinyl, vin, painted, ...
windows: tint, tinted, va, dark, pink, darker, glass, windshield, ...
security: bedroom, house, steal, thief, stolen, theft, ...
wheels: rims, bbs, diameter, offset, spokes, locking, ...
doors: door, sedan, coupes, coupe, pillar, hatchback, sedans, hatchbacks
tires: tread, tire, rears, michelin, rubber, sidewall, compound, rotated, ...
prius: environment, hybrids, insight, priuses, viewed, hybrid, unplugged, ...

They’re excellent with a few caveats:

  1. When I don’t recognize a word I can’t tell whether it’s an acronym or just a term I don’t know
  2. Sometimes it’s unclear why words are related, such as “viewed” for prius or “pillar” for doors.

Side track: Using web crawl

Reddit is great but somewhat limited. So I tried taking all the outbound links from the reddit auto crawl and doing a general web crawl to build up a larger corpus. Unfortunately a ton of the links were used car sites or detailed manuals. I went through the process of training LSA and generating synonyms but they had lots of general terms mixed in randomly. So I didn’t continue the investigation.

Side track: LSI vs LDA

Latent dirichlet allocation (LDA) seems to be more common than LSI/LSA these days. We only very briefly tried it and the results weren’t much different with the exception that you could get a clean summary of each topic in the model. With LSI/LSA the topics are less defined by particular dominant words and more defined by patterns of cooccurrence so the groupings aren’t human readable.

Using LSA synonyms in ElasticSearch

Even though we have similarity scores between pairs of terms, ElasticSearch is a boolean synonym-or-not type of system. So we had to pick a threshold for similarity (around 0.35 seemed good). And our generated synonyms aren’t symmetric but ElasticSearch synonyms are.

Side track: Issues in the overall system

Unlike web search, sometimes in factoid search it’s best to return no results. For example we had an issue where a query for “sunroof” would return the info for “technology package”. This happened because there was no information at all for sunroof (not even a “sunroof”: false entry in the data) and because “sunroof” and “package” were often mentioned together in the corpus. We didn’t have a good solution for this except to set the synonym threshold to be more restrictive.

This would’ve been a good area for further work had Searchify continued.

Alternate approach: Pseudo-relevance feedback

Relevance feedback is another technique from information retrieval. Imagine a thumbs up/down button on each search result. If you did that, then the system could redo the search to prefer good documents and penalize bad documents. Some of the background is in the Rocchio algorithm.

An interesting tweak is pseudo-relevance feedback. The system assumes that documents with good scores are mostly relevant and documents with bad scores are mostly irrelevant. Then it finds words that occur more in the good documents than the bad, adds them to the query, and redoes the search.

It can also generate lists of related terms. We’d search our set of documents for a term, say “color” and assume the top 20 documents are relevant. And assume all other documents are irrelevant to “color”. Then we compute the probability of each word in those two sets and take the difference in probabilities.
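A rough sketch of that scoring, where documents is the tokenized thread collection and search stands in for whatever ranking function you use (both are placeholders here):

from collections import Counter

def word_probs(docs):
    counts = Counter(word for doc in docs for word in doc)
    total = float(sum(counts.values()))
    return {word: count / total for word, count in counts.items()}

def related_terms(term, documents, topn=10):
    ranked = search(term, documents)              # placeholder ranking function
    relevant, irrelevant = ranked[:20], ranked[20:]
    p_rel, p_irr = word_probs(relevant), word_probs(irrelevant)
    # score = how much more probable the word is in the "relevant" documents
    scores = {w: p_rel[w] - p_irr.get(w, 0.0) for w in p_rel}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]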

What we get are scores like “paint” is 0.005 more probable in documents related to “color”. Some old debug output:

color color: 0.0094 he: 0.0055 paint: 0.0050 blue: 0.0041 his: 0.0029 bmw: 0.0028 interior: 0.0021 guy: 0.0017 laguna: 0.0015 he's: 0.0015

price price: 0.0076 sell: 0.0046 dealer: 0.0032 me: 0.0032 dealers: 0.0027 selling: 0.0025 they: 0.0024 them: 0.0023 then: 0.0023 dealership: 0.0022

cost tax: 0.0062 cost: 0.0044 we: 0.0037 parts: 0.0035 pay: 0.0033 taxes: 0.0022 shop: 0.0020 us: 0.0019 price: 0.0019 by: 0.0018

engine engine: 0.0326 starter: 0.0045 check: 0.0044 timing: 0.0038 battery: 0.0036 belt: 0.0030 when: 0.0027 start: 0.0027 may: 0.0024 light: 0.0020

transmission transmission: 0.0285 fluid: 0.0203 shop: 0.0028 filter: 0.0027 change: 0.0026 drain: 0.0024 fill: 0.0022 flush: 0.0021 tell: 0.0021 do: 0.0020

cargo truck: 0.0069 people: 0.0042 trucks: 0.0036 drive: 0.0028 haul: 0.0026 because: 0.0023 when: 0.0022 had: 0.0018 we: 0.0016 need: 0.0015

space park: 0.0094 next: 0.0066 parking: 0.0045 door: 0.0043 parked: 0.0030 do: 0.0027 traffic: 0.0025 lane: 0.0025 splitting: 0.0023 see: 0.0022

efficiency torque: 0.0059 speed: 0.0039 engine: 0.0022 highway: 0.0021 rpm: 0.0020 horsepower: 0.0016 we: 0.0016 low: 0.0015 drag: 0.0015 high: 0.0013

Some of those common terms are really inappropriate so I weighted by inverse document frequency to fix that:

color blue: 0.475 seca: 0.368 laguna: 0.352 green: 0.166 reminds: 0.160 countach: 0.156 black: 0.134 racing: 0.129 metallic: 0.129 paint: 0.124

price kbb: 0.493 trade: 0.305 credit: 0.262 offers: 0.233 told: 0.196 dealers: 0.189 negotiate: 0.185 promise: 0.174 sell: 0.172 me: 0.168
 (kbb = Kelley Blue Book)

cost parts: 0.797 shop: 0.634 labor: 0.561 he: 0.257 bearings: 0.251 vanos: 0.248 shops: 0.248 labour: 0.241 charge: 0.234 struts: 0.222

engine starter: 0.462 battery: 0.275 firing: 0.188 bay: 0.170 mclaren: 0.169 timing: 0.156 crank: 0.155 coil: 0.145 spark: 0.135 bolts: 0.135

transmission fluid: 0.820 flush: 0.140 filter: 0.119 drain: 0.118 pan: 0.112 atf: 0.103 shop: 0.101 hydraulic: 0.100 fill: 0.096 transmissions: 0.094

cargo truck: 1.000 trucks: 0.809 haul: 0.592 hauling: 0.302 people: 0.280 tow: 0.277 fit: 0.274 bed: 0.245 utility: 0.233 towing: 0.191

space park: 1.000 splitting: 0.815 lane: 0.638 cones: 0.611 traffic: 0.490 parking: 0.482 parked: 0.287 next: 0.253 ding: 0.245 cart: 0.228

efficiency torque: 1.000 rpm: 0.871 drag: 0.697 mpg: 0.689 highway: 0.682 gearing: 0.645 engine: 0.590 epa: 0.492 force: 0.419 pushrod: 0.390

Side track: Using multiword queries

When looking up synonyms for “sun roof” we could use a bigram-aware IR engine for pseudo-relevance feedback. This would prefer documents that have “sun roof” rather than “sun” and “roof” in different parts of the document.

This improved synonyms for phrases but unfortunately there wasn’t a way to use it in ElasticSearch synonyms. We could’ve used it in the overall system by running an ElasticSearch query against the web crawl data first and then expanding the original query before sending to the fact database. However, this would add an additional search to the latency of the system. (May have been quick enough but we didn’t get to look into it.)

How I’d improve more

We stopped working on Searchify a while back but I thought it’d be nice to lay out what I saw as the path forwards in generating synonyms for ElasticSearch.

  • Make a gold-standard evaluation of the synonym component. Get human annotated data from tasks on Mechanical Turk. Something like “Which of these words is most related to X when talking about cars?”
  • Compare results from LSA, LDA, word2vec, pseudo-relevance feedback. Tune any parameters (such as the number of topics)
  • Train a combination approach. This tends to be effective in machine learning and may be a way to combine general-purpose synonyms in WordNet/etc with domain-specific synonyms from LSA and pseudo-relevance feedback.
  • Once the system was live we could have mined the data for query reformulations: When you search for “price” then get no results and rephrase to “MSRP” that could be used as part of the synonym engine.

What’s next?

I’ll do one more post about Searchify and the text classification component for queries like “How much is it?” or “How big is it?”

Notes

(1) Calling it semantic similarity is a bit misleading. You’re computing something more like “How often do these terms appear in documents with the same words?”

(2) For us this is helpful but it can be bad if you need to differentiate between different senses of a word. It’s a common problem with co-occurrence methods.

Finding a learning curve for Over 9000

Edit 5/22: I made two large mistakes in the evaluation so I’ve redone all the graphs. The findings are mostly the same. See notes at bottom for more info. (2)

Background

For Over 9000 I’m estimating the number of torrent downloads per show/episode at 7 days from release. If I have enough data I can compute that by interpolating points. But usually I need to extrapolate from the first few days of downloads.

I’ve focused on fitting log curves with scipy and I’m starting to get set up for more rigorous machine learning.

Learning curves

Learning curves are a useful diagnostic to assess your model (see Andrew Ng Coursera lecture link). If you’re plotting training and validation error then you can see whether you’re overfitting or underfitting. If you’re overfitting, it can help to reduce the parameter space or add more data. If you’re underfitting then you might need more features or a more complex algorithm.

Learning curves for extrapolating number of downloads

Unfortunately it isn’t easy for my problem. Each episode is handled as an independent machine learning problem with usually 2-6 datapoints for training. But what I can do is assess prediction quality using the first 2 days of data, first 3 days, etc. I’m comparing multiple extrapolation functions so I can plot the error of each:

Learning curve with basic 3 functions

Note: Title is incorrect. This should say prediction ERROR not accuracy.


The three lines are each different curves fit to the data, where x is the number of days since the episode was released and we’re predicting the number of downloads at that point in time.

  • log function with 3 parameters (log 3p): b * log(x + a + 1) ^ c
  • log function with 1 parameter (log 1p): b * log(x + 1) ^ 0.5
  • asymptote function with 2 parameters (asymptote 2p): b * x / (x + a^2)
    • a is squared cause I needed to force it to be positive

The y-axis shows the average error relative to the prediction target: abs(y – y’) / y. It’s the percent error divided by 100.

What does this graph show? Especially when we have fewer days of data to extrapolate from, functions with fewer parameters are much better at predicting.

But when we have 7-8 days of data, the points we evaluate against fall inside the training window and the trend is the reverse: the fit is better with more parameters (overfitting). This is what a normal learning curve shows but it’s represented differently here.
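For concreteness, here’s roughly what fitting those three curves looks like with scipy. The download numbers are made up for a single episode, and the try/except mirrors how a failed fit turns into the 100% error case:

import numpy as np
from scipy.optimize import curve_fit

def log3p(x, a, b, c):
    return b * np.log(x + a + 1) ** c

def log1p_curve(x, b):
    return b * np.log(x + 1) ** 0.5

def asymptote2p(x, a, b):
    return b * x / (x + a ** 2)

# days since release and downloads for one episode (made-up numbers)
x_data = np.array([0.0, 1.0, 2.0, 3.0])
y_data = np.array([0.0, 4200.0, 5900.0, 6800.0])

for func in (log1p_curve, asymptote2p, log3p):
    try:
        params, _ = curve_fit(func, x_data, y_data)
        print(func.__name__, func(7.0, *params))   # extrapolated downloads at day 7
    except RuntimeError:
        print(func.__name__, "failed to fit")      # the 100% error case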

Learning curve with basic 3 functions using backoff

Note: Title is incorrect. This should say prediction ERROR not accuracy.

In the past I had evaluated on at least 8 days of data to see how well each function was capable of fitting. But I mistakenly selected log3p, which is good at overfitting but poor at generalizing from very few days of training data.

Also note that log3p has 100% error at 1 day of training data. That’s because scipy fails to fit a curve at all and the model defaults to predicting zero. Asymptote2p also sometimes fails to fit the data.

In the context of Over 9000 when the curve fails to predict you’ll see “DLs at 7d: ???”.

Backoff

One easy way to solve this is to fall back to log1p when another curve fails to fit. Also log1p does a better job with less data anyway so when there are fewer points than parameters I also back off to log1p. That leads to these curves (y-axis scale adjusted):

Learning curve of basic 3 equations with backoff

Note: Title is incorrect. This should say prediction ERROR not accuracy.

Compared to the previous run many fewer tests are predicting zero. The functions with backoff do a much better job now. The reason they aren’t identical to log1p is that the threshold is based on number of datapoints and there are sometimes 2-3 points per day.

Log3p+backoff isn’t quite as good as I’d like in part because I set the cross-over point too low (4 points isn’t enough). I’ll investigate this more.

We’ve made great progress here: extrapolation from 3 days of data has gone from 25% error on log3p (used in production) to 14% with asymp2p+backoff.

But much of the progress is made in handling cases that previously failed to predict. To look into it, I generated histograms of the errors using pandas.

Here’s a histogram of errors for log3p without backoff:

model_scores_3_log 3p

Over 30 episodes have no prediction at all so the prediction defaults to zero (100% error). Otherwise most predictions are under 20% error.

Now compare the asymptote 2p with backoff:

model_scores_3_asymptote 2p backoff

The asymptote function with backoff almost never predicts zero and the predictions it gives tend to have lower error. So the overall progress is both the result of fewer default predictions as well as lower error for non-default predictions.

Model combination

Machine learning tells us that ensemble methods are very strong: you combine multiple models with different kinds of errors and the combined model will be better. One of the things I noticed in looking over the predictions is that the inverse function tends to predict too low while the log functions tend to predict too high. (1)

I’ve wanted to take the predictions of each model and use them as features for linear regression. As a first step I’ll just average the predictions of log3p+backoff and inv2p+backoff.

Learning curve of 3 equations with backoff and an average predictor

Note: Title is incorrect. This should say prediction ERROR not accuracy.

Even a simple average gets us from 14% error at 3 days to 11%. Here’s the histogram of errors at 3 days:

Histogram of errors for average of the two functions

There are fewer episodes with high error.

Weighting the training data

There’s one trick I found in the scipy documentation: You can provide uncertainty weights on the data points. So you can force least squares regression to fit some points closer than others.

When we have a two-hump curve like some of the weird cases in the previous post it’s important to prefer fitting the later data.

This is what I used for weights (Python): [1 / (x + 0.5) for x in x_data]

The 0.5 is there to prevent divide-by-zero. The values are supposed to represent uncertainty, so lower means more weight in fitting.
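A small sketch of passing those weights to scipy, reusing the made-up numbers from the earlier sketch:

import numpy as np
from scipy.optimize import curve_fit

def asymptote2p(x, a, b):
    return b * x / (x + a ** 2)

x_data = np.array([0.0, 1.0, 2.0, 3.0])
y_data = np.array([0.0, 4200.0, 5900.0, 6800.0])

# lower sigma = less uncertainty = more weight, so later days dominate the fit
sigma = [1 / (x + 0.5) for x in x_data]
params, _ = curve_fit(asymptote2p, x_data, y_data, sigma=sigma)
print(asymptote2p(7.0, *params))   # extrapolated downloads at day 7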

Note: Title is incorrect. This should say prediction ERROR not accuracy.

Note: Title is incorrect. This should say prediction ERROR not accuracy.



It’s hard to compare to the previous graph unless you put them side by side (I like to open both, max the windows, then command-tilde really fast). Generally we’re removing another 1-2% error, more with more data points.

Now at 3 days of training data we’re at 10% error.

Conclusions

It’s great to be able to apply the regular rules of machine learning to this problem! It really meant I could approach it more methodically and begin to fix the overfitting.

I feel strongly that further improvement is possible, perhaps 5% error from 3 days of data.

The simplest improvement would be to poll the source data hourly rather than daily.

Custom regularization would be nice: say, penalizing parameters that deviate too much from typical values. Partially sharing data between different episodes of the same series would also probably help. In any case, tons more progress is possible but I’ll take a detour to push these models to production first.

Notes

(1) It’s strange but some of the data looks to clearly have an asymptote so x / (x + a) is appropriate for that data. Other data clearly doesn’t have an asymptote so a log curve fits better. Most data is somewhere between though.

(2) I made two mistakes originally. First, I was filtering the data to the first X days of data and sometimes that meant I only had one point: the fake point (0, 0), which isn’t enough to learn from. It’s also unrealistic: we never attempt to extrapolate a series until we have at least one real data point. So dropping those cases gave a more realistic picture.

The second mistake was in the averaging function. I used numpy.mean, which doesn’t throw an error on empty input; it returns NaN. Then I was putting the result into a pandas Series, so when I computed the mean it was filtering out the NaN. That’s why I originally had zero predictions with 100% error.

Curve fitting and machine learning for Over 9000

One of my current projects is Over 9000, a visualization that shows which anime series are currently popular. I get the data by scraping a popular anime torrent site every day and come up with a single number that represents the popularity of a show.

The number I use is the average number of downloads per episode, interpolated or extrapolated to 7 days since release.

Current approach

Each episode of each show is processed independently; I’m treating this purely as a curve-fitting problem. Typically I get one datapoint per day per episode. If the episode’s been out for more than 7 days then I interpolate. Otherwise I use scipy to fit a, b, c in this equation:

downloads = b * log(days_since_release + a) ^ c
For more info on the background see previous post.

Problems

Occasionally during development I’d get 3-4 different datapoints at hour intervals. When they were very close to the release date the equation would fit a nearly vertical curve and estimate downloads in the trillions or more. But after it got another day of data the estimate would stabilize to something more reasonable. As a hack I put in a rule that the estimated downloads at 7 days couldn’t be more than double the current downloads.

The evaluation was problematic also. I’d estimate downloads at 7 days but I didn’t actually have evaluation data, so I’d interpolate between the closest two points.

Machine learning problem

I’ll continue to predict each episode independently. When an episode has been available for 4 days typically it means I have 3 data points. And I also add a fake point of 0 downloads at the release date.

This is somewhat different than a regular regression problem in a few ways:

  1. Much less data. We’ll normally have 2-6 training examples (number of downloads on a given day)
  2. Extrapolation not interpolation. In a normal regression problem, the prediction has a value similar to those seen in training. But in this problem, we’re predicting a larger value than any seen in training.
  3. Multiple parameter sets. Each episode has one parameter set; we’re really learning hundreds of models.

#1 means that we want a model with pretty high bias. We want to force the system to only consider hypotheses in a limited range. Fitting a curve is a great approach: the above curve has only 3 parameters (similar to linear regression with 2-3 features).

When I was struggling with crazy predictions I should’ve stopped and thought about it: the model was fitting the training data well but giving terrible performance on the testing data. It’s overfitting!

There are many solutions to overfitting but in one way or another they’re reducing the hypothesis space: 3 variables is too many when we have so few training samples.

Aside: Remember linear algebra?

In the distant past I had to solve systems of equations. When there was only one variable you just rearranged the parts to solve for it. When you had two variables you needed two equations, and even then they couldn’t be multiples of one another. Three variables needed three equations. Obligatory Wikipedia link.

That said, typically it isn’t possible to find parameters that perfectly fit this data. Real data is noisy. So using a solver isn’t the right approach but I should’ve remembered the basic principle that you need at least N equations to solve for N parameters.

Sometimes scipy.curve_fit throws errors when there are fewer training examples than parameters. Sometimes not. In cases when it doesn’t, it extrapolates very poorly.

Mistakes were made

I picked the function above because it was able to closely fit a complete download curve. This was a horrible mistake.

In a regular machine learning problem you’d plot a learning curve to show the quality of your system on training data and on testing data. And you’d pick the best model on testing or development data. The best model for your training data is probably overfitting and won’t generalize well.

Simple solution

When trying to learn a model, if the episode has 4+ datapoints I use the 3-parameter equation. If not, I use a 1-parameter equation that works pretty well:

downloads = a * log(days_since_release + 1) ^ 0.5

That only requires one non-zero data point to fit and doesn’t get too crazy. This solution is pretty similar to backoff in language modeling: fall back to a model that generalizes better when you lack data.
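A sketch of the backoff logic with scipy, using made-up numbers for one episode; the real code has more bookkeeping:

import numpy as np
from scipy.optimize import curve_fit

def log3p(x, a, b, c):
    return b * np.log(x + a) ** c        # the 3-parameter curve

def log1p_simple(x, a):
    return a * np.log(x + 1) ** 0.5      # the 1-parameter fallback

def fit_downloads(x_data, y_data):
    if len(x_data) >= 4:                 # enough points for 3 parameters
        try:
            params, _ = curve_fit(log3p, x_data, y_data)
            return lambda x: log3p(x, *params)
        except RuntimeError:
            pass                         # fit failed, fall back to the simple curve
    params, _ = curve_fit(log1p_simple, x_data, y_data)
    return lambda x: log1p_simple(x, *params)

predict = fit_downloads(np.array([0.0, 1.0, 2.0]), np.array([0.0, 3500.0, 4800.0]))
print(predict(7.0))                      # extrapolated downloads at day 7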

Fixing evaluation

Another issue is what we’re testing against. It used to interpolate the downloads at day 7 and use that to evaluate. But then I’m chaining errors from the interpolation into my evaluation.

The real problem with that is that I’m drawing a straight line between the closest point before 7 and closest point after 7. But the true data isn’t a straight line so I could be penalizing a model for a correct prediction.

Now I search for any data points between 6.5 and 7.5 days since release and use those as testing data.

Looking at weird data

Sometimes we fit the data poorly because we pick poor parameters for our model.  Other times it’s because the model can’t represent the data.

In experimenting with various models I found that a small number of episodes had download curves that couldn’t be fit well by any of my models. In the development data it was 8/398 episodes, about 2%. It’s interesting to look at them:

Aldnoah.Zero, Episode 21

Aldnoah Zero, Episode 21. It’s tough to see but the first 3 days follow a log curve (just a much smaller one).

Gundam G no Reconguista, Episode 24

Gundam G no Reconguista, Episode 24. Another interesting one and again it looks like a smaller log curve started off at first then a bigger one is added.

Sidonia_no_Kishi_Daikyuu_Wakusei_Seneki_2

Knights of Sidonia Season 2, Episode 2. This clearly looks like two curves added.

Looking over them I’d say that probably a less popular fansub group is occasionally doing an earlier release than the more popular one. Or it’s plausible that the show had a really good episode and tons of new fans started watching a few days later. It’s also plausible that there was a processing bug and I was missing data for the first few days. But I’d expect that to be only missing data for a single day.

Here’s a normal curve for reference:

Naruto Shippuuden, Episode 408

Naruto Shippuuden, Episode 408. What a smooth curve!

Wrapping it up

Sometimes a machine learning problem is staring you right in the face but you just don’t recognize it. Also it’s interesting to run into a problem that’s difficult with ordinary classification or regression.

Next steps

  • Improve evaluation framework
  • Make something similar to learning curves
  • Finalize the new equations
  • Explore linear regression using predictions from each equation as features

Synonyms for factoid search, Part 2

The previous post described the problem and attempted to find synonyms for Elastic Search using a thesaurus, Wikipedia redirect groups, and Bing’s related searches.

Problem recap

We were adding a search box to ads to look up quick facts, initially focused on car ads. The big issue is that queries such as “ac” or “a/c” need to match a Json object containing only “air conditioner”. Searches like “cost” need to match “MSRP” or at least “price”.

Toyota Prius Ad with Search Box

Synonyms from WordNet

WordNet is a great resource for word information in English. It’s particularly good for nouns and indicates synonymy as well as hierarchy. For example, you can find that tire is a type of hoop or ring. And you can find synonyms, in this case just tyre. There’s a web version you can test out without needing to install/setup.

But it’s much more complex than this: relationships such as synonyms or is-a (hypernym/hyponym) only have meaning between specific meanings, called word senses. The string tire has five senses in WordNet 3.1: 1 noun sense and 4 verb senses. The kind of tire we care about for cars is the noun sense. If we restrict just to nouns we can say that tire and tyre are synonymous in Elastic Search and that’ll improve our search for British users.

What about a term like mileage? Looking up in WordNet online there are three senses:

  1. mileage, milage (distance measured in miles)
  2. mileage, fuel consumption rate, gasoline mileage, gas mileage (the ratio of the number of miles traveled to the number of gallons of gasoline burned)
  3. mileage (a travel allowance at a given rate per mile traveled)

The senses are usually in order of most common to least common. Sense #2 is the one we want but it’s not so bad to take all the synonyms of all senses {mileage, milage, fuel consumption rate, gasoline mileage, gas mileage}.

Detour: Handling phrases

WordNet has many phrases as well as single words. For instance, there’s an entry for air conditioning. If we’re generating synonyms for an air conditioning factoid it’s better to use the phrase. However, remote starter doesn’t have an entry.

I used a very simplistic approach that works for our data: if the full phrase isn’t in WordNet, strip one word from the left and try again. If that’s not found, strip another word and so on. Keep the leftmost words intact and generate alternatives just for the rightmost words. This works because we’re working with noun phrases in English and there weren’t many prepositional phrases.
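A sketch of that fallback with NLTK’s WordNet interface (assumes the WordNet corpus is downloaded):

from nltk.corpus import wordnet as wn

def phrase_synonyms(phrase):
    # strip words from the left until the remainder is in WordNet, then return
    # (kept prefix words, synonyms for the rest)
    words = phrase.lower().split()
    for start in range(len(words)):
        tail = "_".join(words[start:])   # WordNet joins phrases with underscores
        synsets = wn.synsets(tail, pos=wn.NOUN)
        if synsets:
            synonyms = {lemma.name().replace("_", " ")
                        for synset in synsets for lemma in synset.lemmas()}
            return words[:start], sorted(synonyms)
    return words, []

print(phrase_synonyms("remote starter"))    # falls back to synonyms of "starter"
print(phrase_synonyms("air conditioning"))  # the full phrase is in WordNet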

Taking synonyms of all synsets

The simplest approach is to avoid word sense disambiguation and just join all synonyms. Like before I didn’t have evaluation data so mostly I eyeballed synonyms and considered bug reports.

Input: air conditioning
Output: air conditioner

Input: heat
Output: heating system, ignite, oestrus, fire up, heating plant, inflame, stir up, heating, heat up, high temperature, passion, wake, heat energy, hotness, estrus, warmth, rut, hot up

Input: mileage
Output: mileage, fuel consumption rate, milage, gasoline mileage, gas mileage

Input: safety
Output: prophylactic, safety device, safe, base hit, rubber, guard, condom, refuge

Input: cost
Output: be, price, monetary value, toll

Overall the output is much better than previous approaches. Although air conditioner is useful, the stemming in Elastic Search would already handle it. Likewise for heat/heating.

And then there are frankly bizarre results if you consider the automotive domain, like estrus for heat. It’s the sense of heat “applies to nonhuman mammals: a state or period of heightened sexual arousal and activity”. Likewise in the automotive domain you probably don’t need condom to be related to safety. But rubber related to safety? That might be good for rubber on the bumpers of the car… maybe.

Word sense disambiguation

The simplest approaches to word sense disambiguation are the Lesk-like algorithms: look for keyword overlap between the definition (called a gloss) and words in your domain. In this case I can group all keywords from our fact set for disambiguation. And for scoring I use cosine similarity.
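A sketch of that scoring, with a toy stand-in for our fact keywords:

import math
from collections import Counter
from nltk.corpus import wordnet as wn

# toy stand-in for the keywords pulled from the car fact set
DOMAIN_WORDS = Counter("price msrp engine mileage fuel gas miles gallon "
                       "safety airbag color paint warranty".split())

def cosine(counts_a, counts_b):
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_senses(word):
    scored = []
    for synset in wn.synsets(word, pos=wn.NOUN):
        gloss_words = Counter(synset.definition().lower().split())
        scored.append((cosine(gloss_words, DOMAIN_WORDS), synset))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

for score, synset in rank_senses("mileage"):
    print(round(score, 3), [lemma.name() for lemma in synset.lemmas()])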

Input: air conditioning
Output: air conditioner

Input: heat
Output: high temperature, hotness

Input: mileage
Output: fuel consumption rate, gasoline mileage, gas mileage

Input: safety
Output: (no synonyms)

Input: cost
Output: price, monetary value

The crazy stuff is gone but we’ve lost “safety device” for “safety”, “milage” for “mileage”, and “warmth” for “heat”. Let’s look at disambiguation scores for “safety”.

  1. safety (the state of being certain that adverse effects will not be caused by some agent under defined conditions)
    score=0.106600
  2. condom, rubber, safety, safe, prophylactic (contraceptive device consisting of a sheath of thin rubber or latex that is worn over the penis during intercourse)
    score=0.065795
  3. safety, refuge (a safe place)
    score=0.056980
  4. base hit, safety ((baseball) the successful act of striking a baseball in such a way that the batter reaches base safely)
    score=0.032141
  5. safety (a score in American football; a player is tackled behind his own goal line)
    score=0.000000
  6. guard, safety, safety device (a device designed to prevent injury or accidents)
    score=0.000000

The sense we want is #6 but it has a score of zero meaning no overlap at all. “prevent”, “injury”, and “accidents” aren’t anywhere in our list of car facts. Perhaps a corpus of automotive text would disambiguate to that sense.

Looking through old emails I found a great example of horrible WSD failure. We had tons of facts like “solar roof package” and it disambiguated “package” to “software package” leading to this unfortunate set:

solar roof package: solar roof software, solar roof software program, solar roof computer software, solar roof software system, solar roof software package, solar roof package

Detour: Wolfram Alpha

Later in the project someone pointed me to the Wolfram Alpha API because it provides a synonym service. The nice part is that you just specify the word. Another nice part is that you can test it via web search without setting up code for the API.

But I found that it’s just giving back WordNet synonyms (as of 9/2014). There’s no support that I saw for word sense disambiguation either. My best guess at their algorithm is that it iterates through the senses in order and tries to get 10 synonyms. But if it finds 9 and the next sense will add 3 it’ll return 12 synonyms. I could replicate their synonyms closely from WordNet with a few slight differences:

For “interior” Wolfram Alpha filtered “Department of the Interior”. For “heat” it filtered “estrus” and “oestrus”. For “safety” it filtered the condom sense. It’s plausible that the differences are due to WordNet version or that they have some additional filtering rules.

Another interesting bit is that their search is case-sensitive. “or” shows the real word and “OR” shows Oregon. “no” shows the regular word and “NO” shows nitric oxide.

Long story short I wouldn’t recommend building a project on this service unless it’s revamped. Just use the appropriate WordNet API for your language and it’ll be much faster and more customizable.

Subjective issues aka bug reports

We had tons of issues because we didn’t just need synonyms. I specifically remember feedback that “speed” didn’t give 0-60 acceleration numbers. WordNet can’t help too much because those aren’t synonyms, just closely related terms.

Another type of issue was common automotive acronyms like mpg, msrp, and a/c. “Cargo space” was another difficult one.

Colors were also difficult – searching for a specific color should show the list of available colors.

Occasionally we dealt with searches that didn’t exactly have a keyword, such as “how much is it” but we addressed that with a question-answer type classifier.

Thoughts

The issues with WordNet can be loosely grouped:

  • The info is in WordNet but my WSD is failing
  • The automotive word sense isn’t in WordNet (such as hybrid)
  • Needed related words, not just strict synonyms
  • Real language is more colloquial and involves abbreviations, acronyms, and informal variations

Collecting an automotive text corpus seemed to be the right direction. If we had thousands of users already I would have just mined the query rewrites from actual user searches, much like Google. But we didn’t have those users yet, and good search quality was needed to achieve traction with users, customers, or investors.

If we were searching documents rather than factoids I’d say we should ditch Elastic Search and adopt latent semantic indexing of some form. But the combination of factoid search and Elastic Search put us in a tight spot.

What did I need in a corpus?

  • Domain-specific, preferably a method I could use on other domains if we started expanding from car ads.
  • The stuff real people say, not just dictionary terms.
  • Up to date with current automotive trends.

Next post I’ll describe the approach: crawling automotive subreddits and using gensim to get LSA synonyms vs homemade pseudo-relevance feedback for synonyms.