Prior Shift Using the Ratio Estimator

Abstract

Several machine learning applications use classifiers to quantify the prevalence of positive class labels in a target dataset, a task named quantification. For instance, a naive way of determining what proportion of people like a given product when no labeled reviews are available is to (i) train a classifier on Google Shopping reviews to predict whether a user likes a product given their review, and then (ii) apply this classifier to Facebook/Google+ posts about that product. It is well known that such a two-step approach, named Classify and Count, fails because of dataset shift, and several improvements have recently been proposed under an assumption named prior shift. Unfortunately, these methods explore the relationship between the covariates and the response only via classifiers. Moreover, the literature lacks the theoretical foundation needed to improve these techniques. We propose a new family of estimators, named the Ratio Estimator, which can explore the relationship between the covariates and the response using any function g: X → R, not only classifiers. We show that, for some choices of g, our estimator matches standard estimators used in the literature. We also explore alternative ways of constructing functions g that lead to estimators with good performance, and compare them using real datasets. Finally, we provide a theoretical analysis of the method.
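
To make the contrast between Classify and Count and a ratio-type estimator concrete, the following is a minimal sketch rather than the paper's own implementation. It assumes the form that follows directly from prior shift: if the distribution of X given Y is the same in the labeled and target data, the target mean of g(X) is a mixture of the two labeled class means, so the prevalence can be solved for as (mean of g on target minus mean of g on labeled negatives) divided by (mean of g on labeled positives minus mean of g on labeled negatives). The function names, the Gaussian simulation, the identity choice of g, and the clipping to [0, 1] are all illustrative assumptions that do not come from the abstract.

```python
import random
from statistics import fmean


def ratio_estimator(g, x_labeled, y_labeled, x_target):
    """Sketch of a ratio-type prevalence estimate (assumed form, see lead-in)."""
    g_pos = [g(x) for x, y in zip(x_labeled, y_labeled) if y == 1]
    g_neg = [g(x) for x, y in zip(x_labeled, y_labeled) if y == 0]
    mu1, mu0 = fmean(g_pos), fmean(g_neg)
    if mu1 == mu0:
        raise ValueError("g does not separate the two classes on average")
    theta = (fmean([g(x) for x in x_target]) - mu0) / (mu1 - mu0)
    # Clipping to [0, 1] is a convenience added here, not part of the source.
    return min(1.0, max(0.0, theta))


def simulate(n, prevalence, rng):
    """Hypothetical data: X | Y=1 ~ N(1, 1) and X | Y=0 ~ N(-1, 1)."""
    xs, ys = [], []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        xs.append(rng.gauss(1.0 if y == 1 else -1.0, 1.0))
        ys.append(y)
    return xs, ys


rng = random.Random(0)
x_lab, y_lab = simulate(20000, 0.5, rng)  # labeled sample, 50% positives
x_tgt, _ = simulate(20000, 0.2, rng)      # target sample, 20% positives, labels unused

# Classify and Count: threshold at 0 and report the fraction predicted positive.
cc_estimate = fmean([1.0 if x > 0 else 0.0 for x in x_tgt])

# Ratio-type estimate with g taken as the identity, which is not a classifier.
ratio_estimate = ratio_estimator(lambda x: x, x_lab, y_lab, x_tgt)

print(f"Classify and Count: {cc_estimate:.3f}")
print(f"Ratio-type estimate: {ratio_estimate:.3f}")
```

In this simulated setting the thresholded count is pulled toward the labeled prevalence because the classifier's errors do not cancel once the class proportions change, while the ratio-type estimate corrects for this using only the class-conditional means of g. Passing a 0/1 classifier as g turns the same expression into a correction of Classify and Count by the classifier's true and false positive rates.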

Publication
In Proceedings of Bayesian Inference and Maximum Entropy Methods in Science and Engineering