Mining for Data Gold Using Microsoft Excel
When some folks think of “data mining” they imagine a process involving massive databases and very expensive special-purpose software running on costly mainframes or even supercomputers. That’s one part of the story, from the point of view of a company like an airline with literally hundreds of millions–if not billions–of reservation and passenger records and their field values stored over time that may reveal some profitable patterns if sifted properly. In this article I’ll share with you some Excel data analysis basics that you can apply today to your own data, on your own PC. I guarantee you’ll learn something new and beneficial about your business. The features I cover include list management, AutoFilter, Advanced Filter, statistics tray, subtotals, PivotTables, and more.
In a recent consulting assignment I performed data mining on a scale of 10,000 records in a flat-file database spanning a decade, and my toolkit was comprised of one powerful and relatively ubiquitous tool: Microsoft Excel.
The project’s goal was to examine the data for patterns that might yield more information when studied in greater detail.
In Excel, the term “list” and “database” are synonymous. Excel’s help file defines a list like so, “A series of worksheet rows that contain related data, such as an invoice database or a set of client names and phone numbers. The first row of the list has labels for the columns.” (Note: To fit this article within the designated space, I’ll refer to some external resources, one of which is an Excel book Jim and I co-wrote. If you want to take a test-drive with these features you need a robust database; I provide the complete steps for doing this on a supplemental Web page; links appear at the article’s end.) To read what Microsoft has to say about creating a list, study the help topic “Guidelines for creating a list on a worksheet”. Assuming you’ve already got a list set up, click any cell inside the list then turn on the AutoFilter feature: Data, Filter, AutoFilter.
Paraphrasing from our book, AutoFilter converts each column’s header into a searchable drop-down control. The control contains several pre-determined choices: All, Top 10, Custom, Blanks and NonBlanks (the latter two appear only if there’s at least one blank cell in the column), followed by a sorted list of all the unique values in the column. This is an amazingly powerful feature, yet so easy to use; I highly recommend you play around with it. It only takes a few mouse clicks to discover something new about your data; once you starting using AutoFilter you’ll rarely turn it off. I only have room in this article to describe the overall capabilities of AutoFilter; as you dig deeper into this feature you’ll discover the ability to perform simple queries from a “build your own query” dialog box or complex queries involving multiple fields that you build using formulas (this is Excel’s Advanced Filter feature).
Sometimes you just need to do a quick, fundamental statistical analysis on a range of cells, like a count or a sum or an average. This is literally a one-click operation using Excel’s cool but little-known statistics tray. Select a range of cells (at least two cells), say, the values 2 and 3 in A1:A2. Now right-click anywhere on Excel’s status bar (the horizontal bar at the bottom of the parent window that displays “Ready” at its left edge), and notice the pop-up menu: None, Average, Count, Count Nums, Max, Min, Sum. Select Average and notice that the status bar now displays the text “Average=2.5” (about two-thirds of the way across the status bar). Play around with changing the statistic of interest. From now on, whenever you select *any* range of cells, the statistic is updated dynamically.
With one mouse click, Excel’s subtotals feature can automatically generate single-level and multi-level subtotals. You can choose which fields to subtotal at each level change, you can have subtotals appear above or below the associated data, you can page break between groups, and you can choose between a variety of subtotal functions: Sum, Count, Average, Max, Min, Product, and more. The most important thing about subtotals is to first sort your data by the field(s) you are interested in; each field of interest is a “level.” In the Orders table example (see supplemental page), to see Freight subtotals by customer you first sort the database by customer (CustomerID), then you select Data, Subtotals. Excel uses its powerful outlining feature to display these automatic subtotals, and you can put this outline view to good use. Continuing with the freight example, say you want to see only the subtotals and not the underlying detail data, click the outlining level 2 button and only the customer- level subtotals are visible. Click the outlining level 1 button and you’ll see just the database-wide grand total. To see the details again click the outlining level 3 button. Removing subtotals is easy: Data, Subtotals, Remove All.
PivotTables are the cream of the crop in Excel’s data analysis feature set. In our book we describe them this way, “Think of a PivotTable as an interactive summary of your data. You can quickly change the functions you use for this summarization (from counts to sums to averages, and so on), change the fields across which the data is tabulated, flip through groupings and sub- groupings (pages), chart these groupings, refresh the source data, drill down or up to see more or less detail, change formatting, and more.” If you’ve been following along with the Orders table example, what if you wanted to see a table of freight charges by destination country by employee? (Counting from the point where subtotals are removed, you need a mere 11 mouse clicks to create this table.) Here are the steps: remove all subtotals, click anywhere inside the list, select Data, PivotTable and PivotChart Report, click Next, click Next again, click Layout, drag the ShipCountry field into the Row area, drag the Employee field into the Column area, drag the Freight field into the Data area, click OK, and click Finish. And remember, a PivotTable can be updated to reflect changes in underlying data by clicking the Refresh button, or you can set it to automatically refresh each time the workbook is opened. Many more PivotTable settings are available: click the PivotTable button on the PivotTable toolbar, then select Table Options. PivotTables can give you valuable insights into your data. I highly recommend that you spend some time with them.