How Much Can a Retailer Sell? Sales Forecasting on Tmall
Total Page:16
File Type:pdf, Size:1020Kb
How Much Can A Retailer Sell? Sales Forecasting on Tmall Chaochao Chen?, Ziqi Liu?, Jun Zhou, Xiaolong Li, Yuan Qi, Yujing Jiao, and Xingyu Zhong Ant Financial Services Group Hangzhou, China, 310099 fchaochao.ccc, ziqiliu, jun.zhoujun, xl.li, yuan.qi, yujing.jyj, [email protected] Abstract. Time-series forecasting is an important task in both aca- demic and industry, which can be applied to solve many real forecasting problems like stock, water-supply, and sales predictions. In this paper, we study the case of retailers' sales forecasting on Tmall|the world's leading online B2C platform. By analyzing the data, we have two main observations, i.e., sales seasonality after we group different groups of re- tails and a Tweedie distribution after we transform the sales (target to forecast). Based on our observations, we design two mechanisms for sales forecasting, i.e., seasonality extraction and distribution transformation. First, we adopt Fourier decomposition to automatically extract the sea- sonalities for different categories of retailers, which can further be used as additional features for any established regression algorithms. Second, we propose to optimize the Tweedie loss of sales after logarithmic trans- formations. We apply these two mechanisms to classic regression models, i.e., neural network and Gradient Boosting Decision Tree, and the ex- perimental results on Tmall dataset show that both mechanisms can significantly improve the forecasting results. Keywords: sales forecasting · Tweedie distribution · distribution trans- form · seasonality extraction. 1 Introduction Time-series forecasting is an important task in both academic [4] and industry arXiv:2002.11940v1 [cs.LG] 27 Feb 2020 [16], which can be applied to solve many real forecasting problems including stock, water-supply, and sales predictions. In this paper, we study the fore- casts of retailers' future sales at Tmall.com1, one of the world's leading online business-to-customer (B2C) platform operated by Alibaba Group. The problem is essentially important because accurate estimation of future sales for each re- tailer can help evaluate and assess the potential values of small businesses, and help discover potentials for further investment. ? Equal contribution 1 https://en.wikipedia.org/wiki/Tmall 2 Chaochao Chen et al. However, accurately estimating each retailer's future sales on Tmall could be way challenging in several reasons. First of all, naturally different goods or prod- ucts are sold with strong seasonal properties. For example, most of fans are sold in summer, while most of heaters are sold in winter, i.e. we can observe strong seasonal properties on different groups of retailers. Secondly, the distribution of sales among all retailers demonstrates an over-tail power law distribution , i.e., the retailers' sales spread a lot. Naively ignore such issues could make the performance much worse. In this paper, we analyze and summarize the characteristics of the sales data at Tmall.com, and propose two mechanisms to improve forecasting performance. On one hand, we propose to extract seasonalities from groups of retailers. Specif- ically, we characterize the seasonal evolutions of sales by first clustering retailers into groups and then study the seasonal series as decompositions from a series of Fourier basis functions. The results of our approach can be utilized as fea- tures, and simply added to the feature space of any established machine learn- ing toolkits, e.g., linear regression, neural network, and tree-based model. On the other hand, we place distribution transformations on the original retailers' sales. Specifically, we observe that the distribution over retailers' sales follows a Tweedie distribution after we transform retailers' sales by logarithmic. Thus, we propose to optimize Tweedie loss for regression on the logarithmic transformed sales data instead of other losses on the original ones. Empirically, we show that the proposed two mechanisms can significantly improve the performance of pre- dicting retailers' future sales by applying them into both neural networks and tree based ensemble models. We summarize our main contributions as follows: { By analyzing the Tmall data, we obtain two observations, i.e., sales seasonal- ity after we group different categories of retailers and a Tweedie distribution after we transform the original sales. { Based on our observations, we design two general mechanisms, i.e., seasonal- ity extraction and distribution transform, for sales forecasting. Both mech- anisms can be applied to most existing regression models. { We apply the proposed two mechanisms into two popular existing regression models, i.e., neural network and Gradient Boosting Decision Tree (GBDT), and the experimental results on Tmall dataset demonstrate that both mech- anisms can significantly improve the forecasting results. 2 Data Analysis and Problem Definition In this section, we first describe the sales data and features on Tmall. We then analyze the seasonality and distribution of sales data. Finally, we give the sales forecasting problem a formal definition. 2.1 Sales Data and Feature Description Tmall.com is nowadays one of the largest business-to-customer (B2C) E-commerce platform. It has more than 180,000 retailers. Among of those retailers, there could How Much Can A Retailer Sell? Sales Forecasting on Tmall 3 6 5.5 5 4.5 Sales 4 3.5 3 2.5 Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Month Fig. 1. GMV on Tmall. The horizontal axis denotes the months between Jan. 2015 to Dec. 2016, and the vertical axis denotes the GMV on Tmall where we omit the scale. be giant retailers like Apple.com, Prada, and together with small businesses. The platform is selling hundreds of thousands products in diverse categories, e.g., `furniture', `snack', and `entertainment'. Besides category information, the other features of retailers on Tmall can be mainly divided into three types: (1) The basic features that are able to reflect the marketing and selling capability of each retailer. For example, the amount of advertisement investment, the number of buyers, the rating/review given by the buyers, and so on. (2) The high-level features that are generated from historical sales and basic feature data. Suppose a retailer i generates a series of sales data, e.g., yi;t−2; yi;t−1; yi;t; yi;t+1. We are currently at time t and want to forecast yi;t+1. Then yi;t can be taken as a feature which indicates the sale amount of previous period, yi;t −yi;t−1 is a feature that indicates the increasing speed of the sales, and (yi;t −yi;t−1)−(yi;t−1 −yi;t−2) is a feature that denotes the accelerated speed of the sales. Similarly, we can generate other high-level features, e.g., the number of buyers, using the basic features available. (3) The seasonality features that are generated by using other machine learning techniques, which aim to capture the seasonal property of different retailers. We will present how to generate these features in Section 3.1. 2.2 Seasonality Analysis The retailers' sales tend to have different seasonality due to the seasonal items they sell. Take vegetables for example, tomatoes and cucumbers are usually sold more in summer, while celery cabbage is likely sold more in winter. Although the GMV demonstrate seasonal properties, as is shown in Figure 1, the analysis of the seasonality related to the gross merchandise volume is relative meaningless for the prediction on each retailer. In contrast, the seasonality analysis on each single retailer makes the analysis cannot generalize well in the future. Instead, we further investigate the seasonalities in different groups of retailers. By analyzing the sales data on Tmall, we observe different seasonal patterns on different categories of retailers. Figure 2 shows the sales of four different categories of retailers, i.e., `Women's Wearing', `Men's Wearing', `Snack', and `Meat', where we use two year's sales data from January 2015 to December 2016. We can observe that, `Women's Wearing' and `Men's Wearing' show quite similar seasonal patterns, i.e., they both reach peak in summer (July or August) 4 Chaochao Chen et al. 6 4 5.5 3.5 5 3 4.5 4 2.5 3.5 2 3 2.5 1.5 Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec (a) category=`Women's Wearing' (b) category=`Men's Wearing' 4.5 4.5 4 4 3.5 3.5 3 3 2.5 2.5 2 1.5 2 1 1.5 Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec Jan Feb Mar Apr May June July Aug Sep Oct Nov Dec (c) category=`Snack' (d) category=`Meat' Fig. 2. Sales seasonality of different categories on Tmall. The horizontal axis denotes the months between Jan. 2015 to Dec. 2016, and the vertical axis denotes the total sale among of retailers in each category where we omit the scale. and decline to nadir in winter (January). On the contrary, `Snack' and `Meat' show different seasonal patterns. In summary, the seasonalities under different categories could differ quite a lot. Thus if we can somehow partition the retailers into appropriate groups, the shared seasonality among retailers in one group could be statistically useful for characterizing each retailer in the group. Given a group of retailers, how can we characterize the seasonality for the group remains to be solved.