StratLytics

FINTECH AND BANKS: CON...

Part I - The Fintech Challenge

This is the first part of a two-part blog series. It describes the dawn of a new generation of tech companies devoted primarily to the financial services space. These companies leverage advanced data analytics as a strategic and powerful tool to make sense of a world drowning in data and to build more efficient, customer-centric platforms, and they have caught the incumbent traditional banks completely off guard.

Introduction to FinTech

We are in the midst of a disruptive revolution brought about by Financial Technology companies (“FinTech”) that have taken the financial services space by storm, radically changing the way business is run and perceived. Credit for the recent upsurge in this new breed of start-ups, technology-based companies that play in the financial sector space, goes to the explosion of digital data and to multiple technological advances (cloud, mobile technologies, apps, intelligent/smart networks) that have created opportunities for innovators and disruptors to move in. With the exponential rise of digital data (both customer profile and transaction data), annual investment in the FinTech space has burgeoned [1] 10x since 2010 and grew 58% year on year in 2015 to $19bn, with venture capital providing the major chunk of funding (69%). North America [2], primarily the US, continues to be the centre of major investments and advancements, setting the trends for the rest of the world in FinTech, and accounts for 60%+ of global investments. As of 2015 [3], the FinTech sector was valued at $98bn and included 46 unicorns (companies valued at $1bn+), primarily in the lending and payments sub-space (70%+). The FinTech matrix can be further differentiated into 5 sub-matrices depending on sector (bank, insurer), business process (payments, financing, asset management, insurance etc.), customer segment (retail, corporate, private, life insurance etc.), interaction form (C2C, B2C, B2B) or market position (Bank/insurer, Non-bank/insurer – bank/insurer – cooperation etc.).

Complex Battlefield

The financial services landscape has witnessed a continuous sea change. It started with a first-mover advantage for start-ups with the mindset of technology companies, which began to vie for space ceded by traditional big banks (“Banks”) that were technologically handicapped. Historically, Banks controlled end-to-end processing, but since the financial crisis, increasing regulatory pressure has led Banks to hive off their mundane, less profitable processing operations. This gave rise to best-of-breed service providers that used technology as a weapon, building platforms that ran these less remunerative operations more efficiently and created newer pools of profit. An increasing number of FinTech start-ups are now attacking the core functions of payments, lending, investing and advisory, forcing Banks to cede core-banking space. For example, OnDeck Capital provides faster loans to SMEs, Square offers card services to micro merchants, and eToro offers professional trading strategies to retail investors at a discounted price. McKinsey [4] projected in its 2015 report that origination, sales and distribution businesses, representing ~60% of global banking profits (or $620bn) with an ROE of 22%, are at risk from FinTechs. Across the 5 major retail banking businesses (consumer finance, mortgage, SME lending, payments, asset management), McKinsey believes 20 to 60% of profits will be at risk by 2025. Furthermore, a PwC [5] survey shows that 28% of Banking and Payments business, and 22% of Asset Management and Insurance business, is at risk of being lost by Banks to FinTech by 2020.

Banks, being bureaucratic and conservative by nature, have been slow to react. They also suffer from legacy IT systems that were in no position to compete with this new class of dynamic market participants. However, start-ups are still fragmented and don’t have easy access to the data that fuels their analytics-based platforms, which the behemoth Banks guard zealously. Gradually, to overcome the size problem, M&A activity is gathering steam as FinTechs realign amongst themselves or with Banks/Corporates to acquire a competitive edge (examples: PayPal & Xoom, AliBaba & Paytm, BBVA & Simple, Samsung & LoopPay). Banks are also beginning to push back through FinTech investments of their own, internal R&D investments, or joint ventures/partnerships with FinTechs.

The battlefield has become even more complex as iconic brands like GAFAA (Google, Apple, Facebook, Amazon, AliBaba) have made forays into the space, realising that they can set high benchmarks in digitally enabled customer experience and satisfy specific needs. Moreover, they have the best of both worlds: they have the DNA of a tech company, and over the years they have acquired huge amounts of quality consumer data that FinTechs lack. Amazon makes loans to small businesses through Amazon Lending, Google Wallet allows customers to make online purchases via email, Apple has integrated a payments service into its new authentication devices, and Facebook has launched its free Friend-to-Friend payments service.

Analytics leveraged by FinTech

Richer Credit Scoring

Analytics helps credit scoring models go beyond quantitative data such as repayment history and enables the assessment of qualitative concepts like behaviour, willingness and ability. Richer data can be obtained through media analytics, which uses social media to create a personality profile of a person and gives more depth to an analytics-based credit scoring model. This gives FinTechs a huge advantage over Banks, as 73% of the world’s population is not scored by credit bureaus because of a lack of quantitative data. Traditional credit scoring creates a catch-22 situation in which you need a loan history to get your first loan. Reliance on qualitative factors, on the other hand, gives FinTechs a much wider spread of people to conduct credit analysis on and a greater probability of sustaining low default rates. Predictive analytics can also be used to predict a client’s upward financial mobility, adding another aspect to credit scoring. Example: Kreditech scores applicants with little or no credit history, as long as they are registered on social networks, and places its focus on underlying personality.

Unstructured Data is the New Oil

As the data explosion continues, a unique challenge is to make sense of the majority of data that is in an unstructured format and therefore cannot be easily classified or quantified. Text and media analytics give FinTechs the ability to gain better insights, such as sentiment and customer pain points, from unstructured data forms such as social media comments, customer complaints, call centre notes and support forum postings. A University of Texas study [6] says Fortune 1000 companies can gain $2bn a year by increasing the usability of data (increasingly a precious asset like oil) for internal staff.

Wider Data Access

Analytics helps FinTechs access multiple sources of data, from social media to IoT devices to smartphones. Open Banking APIs help FinTechs get access to client data, which Banks consider a valuable asset, guard closely within their confines and are reluctant to share with third parties. Data accessed via Banking APIs is also of prime quality, as it is verified by a trusted party, i.e. the Bank. Such open APIs help FinTechs feature products on behalf of the Bank that can be offered to the Bank’s customers, while the FinTech earns revenue via licensing and also bypasses credit agencies for credit scoring. For example, the Kontomatic banking API enables linking up to the KYC and transactional data of customers.

Customer acquisition

The cost of acquisition has dropped as customers have moved in droves to digital channels, which are a cheaper and less time-consuming option for clients than going to physical locations. Consequently, offerings can be much more targeted and customised to each sub-segment, with appropriate pricing. Analytics also makes more contextual and personalised engagement between FinTech and customer possible on digital media, be it advertising, loyalty programs or discount offerings. More touch points with the consumer also create more cross-sell and up-sell opportunities, identified via analytics, which results in a higher conversion rate.

Asset/Wealth management

Analytics facilitates the introduction of robo-advisors that replicate human-like decisions through advanced algorithms that generate real-time investment advice. The key to wealth management is being an attentive listener, and analytics by its very nature can be used as a strategic tool to uncover deep customer insights that human senses miss. Automated investment advice via analytics helps reduce the asymmetry of information between small and large-scale investors, enabling greater affordability and wider access for investors (e.g. retail) and creating newer avenues for profit. Dragon Wealth offers financial advisors easy-to-use apps to acquire more investors, whereas Sherelt, a social trading platform, helps non-professionals identify expert traders to follow or copy.

Risk Management

Predictive analytics utilises social media interactions, transaction patterns, device identification, biometrics and behaviour analytics as major driving factors to provide leading indicators of potential losses and fraud. Analytics enables the integration of structured and unstructured data, which can help leverage traditional risk management tools for risk-adjusted pricing, risk capital management, portfolio risk management etc. Bankguard provides 2 Factor Authentication (2FA) against fraud attacks.

Example of Analytics used as a powerful tool

Let’s look at an example to appreciate how analytics can profoundly change people’s lives in myriad ways and increase the touch points between financial institutions and consumers in their daily lives, creating multiple marketing opportunities. Raj and his wife have bought a home with a mortgage from BigBank. Raj, an avid cook, is enthusiastic about doing up his kitchen, so he buys a set of chef’s knives, a microwave oven etc. using his BigBank card. With each event or transaction, such as the mortgage or a purchase order, BigBank’s data analytics system instantly analyses Raj’s financial data, spending pattern, savings balance, loans taken (e.g. home mortgage), available credit etc. It also analyses his social media activity and identifies his love for cooking, his frequent visits to restaurants, his blogs about his dining experiences etc. Based on the available data, analytics can unlock five business potentials for BigBank:

Potential 1: Using predictive analysis BigBank can anticipate home related purchases and can offer to extend credit on Raj’s smart phone which he accepts at the tap of his finger.

Potential 2: Analytics notices Raj’s back-to-back purchases and suggests a free service to store purchase receipts in a digital vault. Raj uploads the microwave oven receipt, and BigBank uses content analytics via OCR (optical character recognition technology) to recognise the appliance and offers to extend its warranty.

Potential 3: Behavioural and media analytics is applied to past behavioural patterns to identify that Raj has opted for the SME-rewards program (discounts from the bank’s SME customers). BigBank sends Raj personalised offers with promotions for various restaurants that are its clients.

Potential 4: BigBank offers a free service called SmartInvest that uses market analytics to tease out wealth management insights for Raj, helping him make more informed investment decisions based on his monthly income versus expenses and his savings record, after eliciting additional information such as his financial goals and risk profile.

Potential 5: Aggregate analytics helps BigBank compare Raj with his peers based on similar spending habits, geographical location, income and age bracket. Consequently BigBank can make Raj “people like you” offers based on a number of criteria like lifestyle, personal preferences, geo-demographic factors. These offerings then have a higher chance of take-up rate as opposed to cold calls.

SPORTING THE ANALYTICS

In simple terms, Sports Analytics [1] is all about being right in making a decision, both on and off the field, mostly the latter. At times, such decisions may coincide with gut feeling or past human experience, and why not? The science is all about predicting based on statistics from the past. The idea has always been there; with the advent of faster and more efficient processing engines, the possibilities have become today’s realities. Sports Analytics involves managing data, developing analytics models and presenting the results on information systems for end users, who, more often than not, are decision makers [2]. At least in the US, every major professional sports team has either an analytics department or an analytics expert on staff [3].

Sports Analytics is also deployed to predict games, just like any other high-profile event. A company called Gracenote Sports is hitting all the right notes in this area. Having acquired the sports data firms [4] Infostrada Sports and SportsDirect, it provides gold-standard data, analytics and reporting tools to NOCs (National Olympic Committees), enabling them to take critical decisions such as selecting athletes to develop and choosing sport competitions to pursue. In its analysis [5], Gracenote predicted that in the 2016 Rio games, Asia would clinch one-third and Europe would account for 47% of all the medals.

Sports fans everywhere are getting a sneak peek into information about their favorite team before a game begins. Fans are given predictions on matches: how the weaknesses and strengths of an opposing team will play out against their own team’s strengths and weaknesses, how weather conditions can affect the course of the game etc. [6]. FiveThirtyEight.com, which started as a polling aggregation website in 2008, evolved into one such result predictor, making predictions before a game has been played based on numbers as opposed to gut instincts.

As IoT (Internet of Things) gains popularity in everyday data-dependent decisions, Sports Analytics is not far behind in leveraging it. IoT started largely with the FoB (fitness of beings), through the Health app from Apple and mobile apps such as Nike+ Running and Runtastic [7]. Increasingly, wearables and streaming cameras, which are connected to the internet and used to measure every move and make meaning out of it, are pushing IoT to the front seat of the Sports Analytics movement. The accuracy is a far cry from gut instinct: a scout uses such infrastructure to corroborate his decisions, a coach uses it to understand players and place them in vital positions in the game, and a player uses it to improve his playing style and hence his performance.

The History

Companies like Sportradar, which keep play-by-play data and deliver it to companies in media, technology and sports, are pivotal in gathering game statistics and acting as the source engine for Sports Analytics. Sportradar, building on its success in Europe, has recently secured exclusive partnerships with NASCAR, the NHL and the NFL. The GSIS (Game Statistics Information System) used by the NFL (National Football League) speaks for itself. It is a Windows-based tool for capturing play-by-play game data [8]. The chips the NFL places on players’ shoulder pads provide real-time information on player stats such as x-y-z coordinates, top speed and acceleration, which opens up a new array of possibilities in broadcasting, sponsorship and fan engagement [9].

Atlanta United FC, a Major League Soccer (MLS) expansion franchise, is in the news these days for using Sports Analytics to build an expansion roster or, in simple terms, using technical scouting to build the team [10]. Drawing on the past experience of its management and coaching staff and on numerous visits by its front office to other MLS teams, the franchise is building up a style of play that this combined experience deems fit: for example, how the right back should play, how the No. 8 should play [11] and so on. This is then fed into technical scouting, which tries to narrow down the targets that fit into the franchise’s budget. Around the same time, the General Manager shares the brainstorming process with the technical side, and the scouts start to look into the list of possible players to see who fits. Building a team from scratch, as Atlanta United FC is doing, is a mountainous task, but it is made manageable by the Soccer Analytics deployed by the management and used by the scouting team. Obviously, this useful infrastructure would not be possible without the seamless connectivity between information islands that is available these days.

When it comes to Sports Analytics there is one celebrity case: that of Moneyball. This is the story of Billy Beane (then General Manager of the Oakland A’s baseball team) and Paul DePodesta (an analyst working for Billy Beane) who, in a bid to stand up to bigger and wealthier baseball teams (like New York and Boston), were inspired by and later adopted a school of baseball statistical analysis known as sabermetrics [12] (a reference to the Society for American Baseball Research). DePodesta, in his recent speeches, has warned that a sole pursuit of causal relationships when making data-driven decisions in sports can lead to bad conclusions. According to him, emotional responses to data and player performance need to be separated out wisely, i.e. biases of any type must be stripped away, in order to succeed at using Sports Analytics [13].

Beane deployed two key procedures to select and retain his baseball players: he developed regression models to predict the performance of players, and he ensured the proper use of these models [14]. The models used players’ past performance and current price as predictors; they were not to be revised based on opinions, but only when performance or price changed. Essentially, decisions to draft, play, retain and trade players were made based on regression-based prediction models rather than on experience from years of being in baseball. Such was the effect of these models that, between 2000 and 2006, Beane’s A’s were registering 95 wins a year, just 2 wins fewer than the mighty New York Yankees.

Similar prediction models are used today in baseball, basketball, soccer, hockey and football [15].

I still remember the 2016 NCAA men’s basketball final [16], in which the Villanova Wildcats defeated the North Carolina Tar Heels in the last second to win the National Championship game. A play in the last 5 seconds saw Villanova go past their opponents with the final shot by Kris Jenkins. To many sports commentators, it looked as if the Villanova management was already on its feet and celebrating as Jenkins, with his knees parallel to the ground after receiving the pass from a teammate, stood prepared to score the winner. This has sent sports analysts into discussions that Villanova had been reading their opponent’s defense all along and knew exactly which person to put in the right place at the right moment to score the winner: Kris Jenkins, their leading 3-point shooter, who was hitting 39% of them coming into the game. Though North Carolina’s defenders got mixed up in blocking Jenkins, many sports commentators believe that the winning shot was a move inspired by the statistics going into the game. The winning shot is present at this link for your reference.

SUPPLY CHAIN ANALYTICS

The concept of a Supply Chain comes from the network of exchange relationships that must exist and operate for a product or service to be created and supplied to its final customer. The steps involved in these two types of Supply Chain are presented in the following figures:

Product

Service

In our day-to-day life in India, we have all seen how the markets for vegetables, fruits, flowers etc. work. As consumers, we often go straight to roadside vendors for these items, expecting to get the best prices and the freshest produce. What we generally don’t look into is the path those items take from the producers to the consumers. For example, ‘VeggieKart’, a hyperlocal player operating in the city of Bhubaneswar (India), is an initiative that benefits both consumers and farmers by using its e-commerce platform to sell vegetables and fruits. It thrives on providing fresh, quality farm produce through its value-adding supply chain while giving a win-win return to the farmers and the consumers. The role of this initiative is not only to create a symbiotic relationship between consumers and farmers but also to remove the middlemen from the value chain. The involvement of wholesalers and retailers lengthens the delivery time of the products to a great extent, and to explain this I would like to use the example of ‘The Bouqs’.
‘The Bouqs’ Company is a cut-to-order, farm-to-table, eco-friendly flower retailer that delivers straight from farms. The company knows how to save money at each phase of the supply chain, from farms to flower recipients. Typically, flowers go from farm to wholesale, wholesale to retail and retail to consumer, wasting about 17 days of the lifespan of a commodity that only lasts 21 days. The Bouqs is aiming to change the flower industry for the better with farm-to-door flowers that are cut only once a customer places an order and arrive on day 4 rather than on day 17. Supply chain analytics allows The Bouqs to constantly analyze the farms’ historical sales data and use it as an input into their systems. This determines what is available for sale and links it to their estimate of the volume of any given type of flower that will sell in any given month, to ensure that their farm network can satisfy that demand.

Supply Chain Analytics is the streamlining of a business’ supply-side activities to maximize customer value and to gain a competitive advantage in the marketplace. It represents the effort by suppliers to develop and implement supply chains that are as efficient and economical as possible. The potential benefits of Supply Chain Analytics are the following:
1.   Using historical data to feed predictive models that support more informed decisions.
2.   Identifying hidden inefficiencies to capture greater cost savings.
3.   Using Risk Modeling to conduct “pre-mortems” around significant investments and decisions.
4.   Using Consumer and Pricing Analytics predictions to provide the whole profitability picture.

Let’s think of Supply Chain Analytics in 3 forms: (a) Descriptive Analytics, which says “Where am I today”, (b) Predictive Analytics, which says “With my current trajectory, where will I be headed tomorrow”, and (c) Prescriptive Analytics, which says “Where should I be”. [3]
To gain insight into product data, a company can combine historical data with present-day factors that affect the production and sale of the product, such as weather, demographic, economic and social media data, to better analyze and predict sales. From a Supply Chain Analytics point of view, predictive models built on the available data with machine learning algorithms can estimate the time it takes to supply a product from the producer to the customer, including the middlemen. Consider Predictive Analysis [4] on the factory floor, the first step of the supply chain ladder: any delay at this step has an obvious impact on supply chain performance. To tackle this, different types of sensors are fitted to critical, capital-intensive production machinery to detect breakdowns before they occur, and the sensor data is analyzed to build predictive models for different failure conditions. [5] Thus, Predictive Analytics helps forecast demand based on historical data and external factors related to the shipment of products, giving the company visibility into its assets and operations.
Consider a logistics company like UPS (United Parcel Service), which is fast paced and well known for its unparalleled shipment and delivery system. UPS combines supply chain analytics with sensor data from its delivery mechanisms, such as trucks provided with a wireless infrastructure known as the Delivery Information Acquisition Device (DIAD). This falls under descriptive analytics: it provides real-time data from the moment the driver loads the shipment, at different time intervals, recording every scheduled and unscheduled stop on the way to the delivery address. Using that data to analyze and predict the delivery time of the shipment efficiently is referred to as Prescriptive Analytics. The sheer presence of big data raises the question of ‘how’ to analyze the work they do rather than ‘what’ to do with the data; that’s where Analytics comes in.
From an Analytics perspective, UPS built a model that would predict where every package was at every moment of the day, where it needed to go and why. With that model, changing where a package is headed tomorrow became as simple as flipping a bit. This made UPS more efficient because drivers no longer had to start the day with an empty DIAD; it already had all the information they needed. The DIAD, rather than being an acquisition device, became an assistant that guided them during the journey, saving a lot of fuel and time along the way. This is where the ORION (On Road Integrated Optimization and Navigation) system comes in. ORION helped UPS cut 85 million miles driven in a year, which meant faster delivery, 8.5 million fewer gallons of fuel consumed and, on the environmental side, 85,000 metric tons of carbon dioxide not emitted. [6]
With the ORION system, UPS has what they call “all services on board”: they assign one driver, one vehicle, one service area and one facility, but run 2 different services per vehicle, namely premium service and deferred service, which helps them manage deliveries on a priority basis. The DIAD uses geospatial technology to work out the best route for the driver according to the delivery schedule. Predictive models of arrival times, built with machine learning on as much historical data as is available as well as third-party data sources such as traffic and weather, predict the estimated time of arrival, and the DIAD itself notifies the customer. Putting that in perspective, it’s amazing to see how data and analytics are being used to make the life of the driver, and in essence Supply Chain Management, more organized.
With better connections between the manufacturing unit, logistics, warehousing and supply chain processes, manufacturers have an opportunity to cut costs and to meet their business challenges. The use of IoT (Internet of Things) sensor data, together with statistical forecasting algorithms applied to real-time data, improves producers’ ability to maintain stability in their delivery systems. This is ‘Supply Chain Analytics’.
Another great example is SAP IBP (Integrated Business Planning), which provides functionality for sales and operations, demand, inventory, supply and response planning. Together with the SAP Supply Chain Control Tower, it can help supply chain planners accurately and collaboratively develop sales, inventory and operations plans. IBP’s Supply Chain Control Tower, which is available only in the cloud, combines data from several systems, including SAP ERP, non-SAP systems and third-party systems, and uses the data in conjunction with signals from IoT devices in manufacturing, logistics and related information networks. It acts as a supply chain collaboration hub where all stakeholders can simulate, visualize, analyze and predict the information and possible outcomes they need to know to resolve issues and remediate risks, as well as to improve business performance. With Supply Chain Control Tower, cause-and-effect and what-if simulations help planners gain insight into how a disruption to manufacturing, logistics, transportation or supply chain operations will affect the business. These supply chain analytics enable planners to ensure that corrective and preventive measures are put in place to eliminate or minimize the factors leading to supply chain nightmares.

ANOVA - TUKEY KRAMER P...

In our previous blog we discussed how to conduct a One-way ANOVA test to find out whether the population means across groups are all equal or whether at least one of them differs.

From the previously discussed example, we know that not all population means are equal, so the next step is to make multiple comparisons to determine which groups are different or, in this case, which laptop brands differ in the mean number of hours the battery runs after a full battery charge cycle. One method for making multiple comparisons is the Tukey-Kramer procedure. Note that a multiple comparison method is conducted only after a hypothesis test of whether the population means of the groups are all the same has shown that they are not, that is, after we reject the hypothesis that all the groups have the same population mean.

The formula to find the critical range for the Tukey-Kramer procedure is as follows:

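Critical range = $Q_U \sqrt{\frac{MSW}{2}\left(\frac{1}{n_j} + \frac{1}{n_{j'}}\right)}$, where

$Q_U$ is the upper-tail critical value from the studentised range distribution having ‘c’ degrees of freedom in the numerator and ‘n-c’ degrees of freedom in the denominator, MSW is the mean square within groups from the One-way ANOVA, and $n_j$ and $n_{j'}$ are the sample sizes of the two groups being compared.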

The sample sizes in the One-way ANOVA test conducted above are the same (5 observations in each of the 5 groups), so we only need to calculate one critical range. If the sample sizes had differed, we would have had to calculate a critical range for each pairwise comparison of the sample means: for example, one critical range for the laptop brands Dell and Apple, another for Dell and HP, and so on.
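The calculation can be sketched in Python. This is a minimal sketch that assumes the standard Tukey-Kramer form, in which the critical range is the studentised range value multiplied by the square root of (MSW / 2) × (1/n_j + 1/n_j'), where MSW is the mean square within groups from the ANOVA table. The function name `critical_range` and the input values are hypothetical and are not the figures from the laptop example; the Q value would normally be read from a studentised range table.

```python
import math


def critical_range(q_upper, msw, n_j, n_k):
    """Tukey-Kramer critical range for one pair of groups.

    q_upper: upper-tail critical value of the studentised range distribution
    msw:     mean square within groups (from the ANOVA table)
    n_j,n_k: sample sizes of the two groups being compared
    """
    return q_upper * math.sqrt((msw / 2.0) * (1.0 / n_j + 1.0 / n_k))


# Hypothetical inputs, chosen only to illustrate the arithmetic.
q = 4.0
msw = 5.0

# Equal sample sizes: one critical range serves every pair.
equal_sizes = critical_range(q, msw, 5, 5)

# Unequal sample sizes: each pair gets its own critical range.
unequal_sizes = critical_range(q, msw, 5, 8)

print(round(equal_sizes, 3), round(unequal_sizes, 3))
```

With equal group sizes the expression inside the square root reduces to MSW / n, which is why a single critical range is enough in that case.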

So, the Critical range in this case is 3.21.

Finally, we compare each of the ‘c(c-1)/2’ pairs of means against its corresponding critical range. A specific pair is declared significantly different if the absolute difference in its sample means is greater than the critical range.

In our case, c(c-1)/2 = 5(5-1)/2 = 10 pairwise comparisons need to be made. So, to apply the Tukey-Kramer procedure we first need to calculate the absolute mean differences for the 10 pairwise comparisons.

Because the pairwise differences between the group means of Apple and HP, Apple and Sony, and Sony and Asus are greater than the critical range of 3.21, we conclude that there’s a significant difference between the means (mean hours the battery runs after a full battery charge cycle) of Apple and HP, Apple and Sony, and Sony and Asus.
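The comparison step can be sketched in a few lines of Python. The group labels and means below are hypothetical placeholders, not the laptop data from the example; only the critical range of 3.21 is taken from the text. The helper name `significant_pairs` is also an assumption made for illustration.

```python
from itertools import combinations


def significant_pairs(means, crit):
    """Return (group_a, group_b, abs_difference) for every pair of groups
    whose absolute difference in sample means exceeds the critical range."""
    flagged = []
    for a, b in combinations(means, 2):
        diff = abs(means[a] - means[b])
        if diff > crit:
            flagged.append((a, b, diff))
    return flagged


# Hypothetical sample means for c = 5 groups (not the article's data).
sample_means = {"A": 10.0, "B": 12.0, "C": 8.0, "D": 14.0, "E": 11.0}

# c(c-1)/2 pairwise comparisons: 5 * 4 / 2 = 10.
all_pairs = list(combinations(sample_means, 2))
print(len(all_pairs))

for a, b, diff in significant_pairs(sample_means, crit=3.21):
    print(f"{a} vs {b}: |difference| = {diff:.2f} > 3.21")
```

With equal sample sizes a single critical range is passed in; with unequal sizes, the comparison would use a separate critical range for each pair.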

Thus, we learnt how to conduct a One-way ANOVA test. If the test involves multiple comparisons of different group means, Tukey Kramer procedure can be applied to understand the significant difference between the means.

ANOVA - TUKEY KRAMER P...

Analysis of Variance, commonly known as ANOVA, is used to compare three or more population means. It has many useful business applications, such as determining:

- If the average amount of time spent per month on Facebook differs between various age groups (three or more age groups)

- If the average number of sales calls per day differs between sales representatives (three or more sales representatives)

Every time you conduct a t-test, there is a chance that you will make a Type I error (rejecting a null hypothesis that is actually true). This chance is usually set at 5%. By running two t-tests on the same data, you increase your chance of "making a mistake" to roughly 10%. The exact error rate for multiple t-tests is not simply 5% multiplied by the number of tests; however, if you are only making a few comparisons, the results are very similar. As such, three t-tests would give roughly 15%. For instance, for three groups A, B and C, we carry out three t-tests (A versus B, A versus C, and B versus C) at a significance level of 5% (alpha = 0.05). The overall probability of making no Type I error is 0.95 * 0.95 * 0.95 = 0.857, so the probability of at least one Type I error is 0.143 (= 1 - 0.857). These are unacceptable error rates. An ANOVA controls for these errors so that the Type I error remains at 5%, and you can be more confident that any statistically significant result you find is not just a consequence of running lots of tests.
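To see how quickly the error rate grows, the arithmetic above can be sketched in Python. The function name `familywise_error` is ours, and the calculation assumes the tests are independent, as the multiplication 0.95 * 0.95 * 0.95 implicitly does.

```python
def familywise_error(alpha, num_tests):
    """Probability of at least one Type I error across independent tests."""
    return 1 - (1 - alpha) ** num_tests

alpha = 0.05
for num_tests in (1, 2, 3, 6, 10):
    exact = familywise_error(alpha, num_tests)
    approx = alpha * num_tests  # the "multiply 5% by the number of tests" shortcut
    print(f"{num_tests:>2} tests: exact = {exact:.3f}, shortcut = {approx:.3f}")
```

For a few tests the shortcut and the exact value stay close, and the gap widens as the number of tests grows.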

Example 1:

A firm that studies customer satisfaction conducts a survey to measure how satisfied customers are with several smartphones.

In the above table, ‘average score’ is the dependent variable and ‘type of phone’ is the independent variable. This independent variable has 4 groups.

Here, we can employ an ANOVA procedure to test whether we have enough evidence from this sample to conclude that the average satisfaction scores from these populations of phone users differ from one another. Using the F-test, ANOVA determines whether the variation in satisfaction scores is due to the type of phone (between-group variation) or simply due to randomness (within-group variation).

The F-test is named after its test statistic, F, which was named in honor of Sir Ronald Fisher. The F-statistic is simply a ratio of two variances. To use the F-test to determine whether group means are equal, it is just a matter of including the correct variances in the ratio. In one-way ANOVA, the F-statistic is this ratio:

F = variation between sample means / variation within the samples

F-test compares the amount of systematic variance (variance between groups) in the data to the amount of unsystematic (error) variance (variance within groups – the variance that cannot be explained by the independent variable).

In this example, we use a One-way ANOVA since there is one dependent variable and one independent variable with more than two treatment levels. There can be situations where two factors (independent variables) are tested, along with the interaction between the two factors; this is referred to as a Two-way ANOVA. As one might expect, this concept can be extended beyond two factors to ‘N’ factors.

It should also be noted that ANOVA rests on certain assumptions:

- Data should be from a normally distributed population.

- The variance in each experimental condition is fairly similar, also referred to as homogeneity of variance.

- The observations should be independent and random.

- The dependent variable should be measured on at least an interval scale.

Hypotheses of One-way ANOVA:

Null Hypothesis (H0): µ1 = µ2 = µ3 = ….. = µc

(all population means are equal for ‘c’ different groups, i.e. there is no variation in means among the groups)

Alternate Hypothesis (H1): Not all the population means are equal

(at least one population mean is different to the others)

To perform an F-test for differences in more than two means, we should calculate the following:

Sum of Squares Between Groups (SSB)
Sum of Squares Within Groups (SSW)
Total Sum of Squares (SST): SST = SSB + SSW
Mean Square Between (MSB): MSB = SSB / (c - 1)
Mean Square Within (MSW): MSW = SSW / (n - c)
F statistic: F = MSB / MSW

Steps to calculate the above are:

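SST = $\sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2$, where $\bar{\bar{X}}$ = Grand mean
- Xij = ith value in group j
- nj = number of values in group j
- n = total number of values in all groups combined
- c = number of groups
SSB = $\sum_{j=1}^{c} n_j (\bar{X}_j - \bar{\bar{X}})^2$, where $\bar{X}_j$ = sample mean of group j
SSW = $\sum_{j=1}^{c} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$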

To give an illustration of how to perform calculations in a One-way ANOVA test:

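Example: the data below represent the number of hours of battery life (after one full charge cycle) for laptops from 5 different brands (Dell, Apple, HP, Sony and ASUS) owned by 25 people. The 25 owners were randomly divided into 5 groups, and each group was assigned a different brand. Assume the confidence level to be 95%.

The observations, as they appear in the sum-of-squares calculations below, are:

Brand	1	2	3	4	5	Mean
Dell	5	4	8	6	3	5.2
Apple	9	7	8	6	9	7.8
HP	3	5	2	3	7	4
Sony	2	3	4	1	4	2.8
ASUS	7	6	9	4	7	6.6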

In this example, ‘number of hours of battery life’ is the dependent variable and ‘brand of laptop’ is the independent variable. This independent variable has 5 groups. Also, α = 0.05.

SSB = 5(5.2-5.28)^2 + 5(7.8-5.28)^2 + 5(4-5.28)^2 + 5(2.8-5.28)^2 + 5(6.6-5.28)^2 = 79.44

SSW = (5-5.2)^2 + (4-5.2)^2 + (8-5.2)^2 + (6-5.2)^2 + (3-5.2)^2 +
(9-7.8)^2 + (7-7.8)^2 + (8-7.8)^2 + (6-7.8)^2 + (9-7.8)^2 +
(3-4)^2 + (5-4)^2 + (2-4)^2 + (3-4)^2 + (7-4)^2 +
(2-2.8)^2 + (3-2.8)^2 + (4-2.8)^2 + (1-2.8)^2 + (4-2.8)^2 +
(7-6.6)^2 + (6-6.6)^2 + (9-6.6)^2 + (4-6.6)^2 + (7-6.6)^2 = 57.6

SST = (5-5.28)^2 + (4-5.28)^2 + (8-5.28)^2 + (6-5.28)^2 + (3-5.28)^2 +
(9-5.28)^2 + (7-5.28)^2 + (8-5.28)^2 + (6-5.28)^2 + (9-5.28)^2 +
(3-5.28)^2 + (5-5.28)^2 + (2-5.28)^2 + (3-5.28)^2 + (7-5.28)^2 +
(2-5.28)^2 + (3-5.28)^2 + (4-5.28)^2 + (1-5.28)^2 + (4-5.28)^2 +
(7-5.28)^2 + (6-5.28)^2 + (9-5.28)^2 + (4-5.28)^2 + (7-5.28)^2 = 137.04
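The same sums of squares can be checked with a short Python sketch. The observations are read off the terms of the SSW calculation above; the variable names are ours and purely illustrative.

```python
# Battery life in hours, grouped by brand (values taken from the SSW terms above).
groups = {
    "Dell": [5, 4, 8, 6, 3],
    "Apple": [9, 7, 8, 6, 9],
    "HP": [3, 5, 2, 3, 7],
    "Sony": [2, 3, 4, 1, 4],
    "Asus": [7, 6, 9, 4, 7],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)
group_means = {name: sum(values) / len(values) for name, values in groups.items()}

# SSB: squared distance of each group mean from the grand mean, weighted by group size.
ssb = sum(len(values) * (group_means[name] - grand_mean) ** 2
          for name, values in groups.items())

# SSW: squared distance of each observation from its own group mean.
ssw = sum((x - group_means[name]) ** 2
          for name, values in groups.items() for x in values)

# SST: squared distance of each observation from the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

print(f"Grand mean = {grand_mean:.2f}")
print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, SST = {sst:.2f}")
```

The check SST = SSB + SSW (137.04 = 79.44 + 57.6) is a handy way to catch arithmetic slips in a hand calculation.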

Now, we conduct a hypothesis test to determine whether or not the mean number of hours the battery runs after a full battery charge cycle is the same for all 5 brands.

H0 : µ1 = µ2 = µ3 = µ4 = µ5
H1 : not all population means are equal

Decision rule: Reject null hypothesis (H0) if Fcalc > Fcri

Test statistic: Fcalc = MSB/MSW = (79.44/4) / (57.6/20) = 19.86 / 2.88 = 6.90

Critical value: Fcri = Fα,c-1,n-c = F0.05,4,20 = 2.87

The calculated F value of 6.90 is greater than the critical F value of 2.87. We therefore reject the null hypothesis and conclude that the mean number of hours the battery runs after a full battery charge cycle is not the same across all 5 brands of laptops.
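The decision rule can be sketched in Python as well. The sums of squares and the critical value of 2.87 are the figures quoted above; the variable names are ours.

```python
ssb, ssw = 79.44, 57.6   # sums of squares from the worked example
c, n = 5, 25             # number of groups, total number of observations

msb = ssb / (c - 1)      # mean square between
msw = ssw / (n - c)      # mean square within
f_calc = msb / msw

f_crit = 2.87            # F(0.05; 4, 20), as quoted in the text
reject_h0 = f_calc > f_crit

print(f"MSB = {msb:.2f}, MSW = {msw:.2f}, F = {f_calc:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```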

This concludes our walkthrough of how to conduct a One-way ANOVA test.

In our next blog, we will discuss how to conduct a One-way ANOVA test to find whether the population means across groups are equal or not (or at least one of them is not equal). Read our next blog.

TIME SERIES ANALYSIS &...

Observations of any variable recorded over time in sequential order are considered a time series. The measurements may be taken every hour, day, week, month, or year, or at any other regular interval. The time interval over which data are collected is called the periodicity. There are two common approaches to forecasting:

1) Qualitative forecasting method: used when historical data are unavailable or not relevant to the future. Forecasts are generated subjectively by the forecaster. For example, a manager may use qualitative forecasts when attempting to project sales for a brand-new product. Although qualitative forecasting is attractive in certain scenarios, it is often criticised as being prone to optimism and overconfidence.

2) Quantitative forecasting method: used when historical data on the variables of interest are available. Methods are based on an analysis of historical data concerning the time series of the specific variable of interest. Forecasts are generated through mathematical modelling. Quantitative forecasting methods are subdivided into two types:

a) Time series forecasting methods: forecast future values based on the past and present values of the variable being forecasted. These are also known as non-causal forecasting methods. They are purely time series models that do not present any explanation of the mechanism generating the variable of interest and simply provide a method for projecting historical data.

b) Causal forecasting methods: attempt to find causal variables that account for changes in the variable to be forecasted. They forecast future values by examining cause-and-effect relationships. Causal forecasting methods are based on a regression framework, where the variable of interest is related to a single or multiple independent variables. Here, forecasts are driven by the known values of the independent variables.

Basic assumptions of time-series forecasting are:

- Factors that have influenced activities in the past and present will continue to do so in more or less the same way in the future.
- As the forecast horizon shortens, forecast accuracy increases.
- Forecasting in the aggregate is more accurate than forecasting individual items.
- Forecasts are seldom accurate (therefore it is wise to offer a forecast range).

In this blog, we are going to focus on time series forecasting when there is no trend in the data. The main aim is to identify and isolate the influencing factors in order to make predictions. To achieve this objective, we need to explore the fluctuations using mathematical models, the most basic of which is the classical ‘multiplicative’ model.

Figure 1: A time series plot of the monthly sales for two companies over two years, where the vertical axis measures the variable of interest and the horizontal axis corresponds to the time periods.

Figure 2 shows the components of a time series. The pattern or behaviour of the data in a time series involves several components:

Trend - the long-term increase or decrease in a variable being measured over time (such as the growth of national income). Forecasters often describe an increasing trend by an upwards sloping straight line and a decreasing trend by a downward sloping straight line.
Cyclical - a wave like pattern within the time series that repeats itself throughout the time series and has a recurrence period of more than one year (such as prosperity, recession, depression and recovery).
Seasonal - a wave like pattern that is repeated throughout a time series and has a recurrence period of at most one year (such as sales of ice-cream or garden supplies)
Irregular - changes in time-series data that are unpredictable and cannot be associated with the other components (such as floods, strikes).

The classical multiplicative time series model states that any value in a time series is the product of the trend, cyclical, seasonal and irregular components, as the multiplicative model assumes that the effects of these four components are interdependent.

Classical multiplicative time series model for annual data: Yi = Ti * Ci * Ii

where, Ti is the value for the trend component in the year ‘i’,

Ci is the value of the cyclical component in the year ‘i’,

Ii is the value of the irregular component in the year ‘i’.

Classical multiplicative time series model includes the seasonal component where there is quarterly or monthly data available: Yi = Ti * Ci * Ii * Si

Where, Si is the value of the seasonal component in time period ‘i’.
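As a small illustration of how the components combine, here is a Python sketch. All component values are hypothetical and chosen only to show the arithmetic; they do not come from any data in this blog.

```python
# HYPOTHETICAL component values for a single quarter (illustrative only).
trend = 100.0      # T: underlying level, in the units of the series
cyclical = 1.05    # C: 5% above trend because of the business cycle
seasonal = 0.90    # S: 10% below trend in this quarter
irregular = 1.02   # I: a 2% unpredictable disturbance

# Classical multiplicative model: Y = T * C * I * S
observed = trend * cyclical * irregular * seasonal

# Dividing by the trend isolates the combined effect of the other components.
combined_index = observed / trend

print(f"Y = {observed:.2f}")
print(f"Combined C * I * S index = {combined_index:.4f}")
```

Because the non-trend components act as indices around 1, each one scales the trend up or down rather than adding a fixed amount to it.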

Since in this blog we are primarily focusing on non-trend models (where a plot of the data shows neither an upward nor a downward trend over time), we use smoothing techniques, such as the method of moving averages or the method of exponential smoothing, to smooth the series and provide an overall long-term impression.

Time-series smoothing methods: If, for instance, we use annual data, a smoothing technique can be used to smooth a time series by removing unwanted cyclical and irregular variations.

Let’s take an example of Gasoline sales (in 1000s of Gallons) over a period of time:

Year	1	2	3	4	5	6	7	8	9	10	11	12
Sales (Yi)	17	21	19	23	18	16	20	18	22	20	15	22

We drew a scatter diagram using the above-mentioned data. In figure 4, our visual impression of the long-term trend in the series is obscured by the amount of variation from year to year. It becomes difficult to judge whether any long-term upward or downward trend exists in the series. To get a better overall impression of the pattern of movement in the data over time, we smooth the data.

One way to do this is the Moving Averages method, in which the mean of the time series is taken over several consecutive periods. The term ‘moving’ is used because the average is continually recomputed as new data become available: it progresses by dropping the earliest value and adding the latest value. To calculate moving averages, we need to choose the number of periods to include in each average. Moving averages are represented by MA(L), where L denotes the number of periods chosen. A Weighted Moving Average (WMA) can also be prepared; it helps to smooth the curve for better trend identification and places greater importance on recent data.

Using the above example, we prepare a table to show the Weighted Moving Averages:

In the above figures (5 & 6), we can observe that the 5-year moving averages smooth the series more than the 3-year moving averages because the period is longer. So, as L increases, the variations are smoothed better, but fewer moving averages can be calculated, because more of them are missing at the beginning and end of the series.
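The 3-year and 5-year moving averages for the gasoline sales data can be reproduced with a short Python sketch. We assume here that each average is centred on the middle year of its window, which matches the remark that values are missing at both the beginning and the end of the series; the function name is ours, and the sketch handles odd values of L only.

```python
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]  # gasoline sales, years 1 to 12

def centered_moving_average(values, length):
    """MA(L) placed at the middle period of each window; None where the window is incomplete."""
    if length % 2 == 0:
        raise ValueError("this sketch handles odd window lengths only")
    half = length // 2
    result = [None] * len(values)
    for i in range(half, len(values) - half):
        window = values[i - half:i + half + 1]
        result[i] = sum(window) / length
    return result

ma3 = centered_moving_average(sales, 3)
ma5 = centered_moving_average(sales, 5)

print("Year  Sales  MA(3)  MA(5)")
for year, (y, m3, m5) in enumerate(zip(sales, ma3, ma5), start=1):
    print(year, y, m3, m5)
```

MA(3) leaves 10 of the 12 years with a value, while MA(5) leaves only 8, which is the trade-off between smoothness and lost periods described above.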

A Moving Average has two main disadvantages:

1) It involves the loss of the first and last sets of time periods. This could be a significant loss of information if there are few observations in the time series.

2) Dropping the earliest observation in the current set causes the moving average to forget most of the previous time series values.

A technique that addresses both of these problems is called Exponential Smoothing. It is a forecasting technique in which a weighting system is used to determine the importance of previous time periods in the forecast: data from previous time periods are weighted with exponentially decreasing importance. The aim is to estimate the current level and use it as a forecast of future value.

To calculate an exponentially smoothed value in time period ‘i’, we use the following equations:

E1 = Y1
Ei = W * Yi + (1 - W) * Ei-1

where,

Ei is the value of the exponentially smoothed series being calculated in the time period ‘i’

Ei-1 is the value of the exponentially smoothed series already calculated in the time period ‘i-1’

Yi is the observed value of the time series in period ‘i’

W is the subjectively assigned weight or smoothing coefficient (where 0 < W < 1)

Let us use the same example of Gasoline sales (in 1000s of Gallons) over a period of time:

(Assume W = 0.5)
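The exponentially smoothed series for this example can be computed with a few lines of Python. This is a minimal sketch of the two equations above; the function name is ours.

```python
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]  # gasoline sales, years 1 to 12

def exponential_smoothing(values, w):
    """Apply E1 = Y1 and Ei = W*Yi + (1 - W)*E(i-1) to a list of observations."""
    if not 0 < w < 1:
        raise ValueError("W must lie strictly between 0 and 1")
    smoothed = [values[0]]  # E1 = Y1
    for y in values[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed

smoothed = exponential_smoothing(sales, w=0.5)

print("Year  Sales  Smoothed")
for year, (y, e) in enumerate(zip(sales, smoothed), start=1):
    print(f"{year:>4}  {y:>5}  {e:8.2f}")
```

Unlike a moving average, every year receives a smoothed value, and each new value still carries a diminishing share of all earlier observations.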

From figure 7, we can observe how exponential smoothing produces a series with less variation. Next, we have to choose the smoothing coefficient. When we use a small W (such as W = 0.05), there is heavy smoothing, as more emphasis is placed on the previous smoothed value (Ei-1), and therefore slow adaptation to recent data. A moderate W (such as W = 0.2) gives moderate smoothing and moderate adaptation to recent data. A high W (such as W = 0.8) gives little smoothing and quick adaptation to recent data.

Therefore, the selection has to be somewhat subjective. If our goal is only to smooth a series by eliminating unwanted cyclical and irregular variations, we should select a small value for W (which makes the series less responsive to recent changes). If our goal is forecasting, we should choose a large value for W (in this case, more weight is put on the actual value than on the previous smoothed value, as a large W assigns more weight to the more recent values).
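The effect of W can be seen by smoothing the gasoline series with the three coefficients mentioned above. The sketch below measures "smoothness" by the range (maximum minus minimum) of the smoothed series, which is our own illustrative choice of measure, and takes the last smoothed value as the forecast for year 13.

```python
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]  # gasoline sales, years 1 to 12

def exponential_smoothing(values, w):
    """Apply E1 = Y1 and Ei = W*Yi + (1 - W)*E(i-1) to a list of observations."""
    smoothed = [values[0]]
    for y in values[1:]:
        smoothed.append(w * y + (1 - w) * smoothed[-1])
    return smoothed

spreads = {}
forecasts = {}
for w in (0.05, 0.2, 0.8):
    series = exponential_smoothing(sales, w)
    spreads[w] = max(series) - min(series)  # smaller range = heavier smoothing
    forecasts[w] = series[-1]               # current level, used as the year-13 forecast
    print(f"W = {w}: range = {spreads[w]:.2f}, forecast for year 13 = {forecasts[w]:.2f}")
```

The range grows as W increases, so the small W gives the smoothest series while the large W tracks the most recent observations most closely.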