Tableau – Know your Data

Data is the first piece of information available in the hands of Tableau Developer. As soon as you get the data you want to connect it to your Tableau App and start with your report. However, we all know that it’s not the ideal world that we live in. The data that is provided is not always in the perfect shape and needs some grooming. Cleaning up the data and taking care of all the inconsistencies takes a substantial amount of time.

marketing man people woman
Smartly Know your Data in Tableau

As a BI Developer, we sometimes face the dilemma, whether to clean the data ourselves or pass it back to the source from where you got it. Answer to this question lies in the structure of your project team, whether your project has a dedicated team to help you with data piece or you are the whole and sole of the project. It also depends the level of access, you as a BI developer have on the source data. Before starting with actual report designing in Tableau, it is advisable to spend time understanding the data and cleaning it up wherever required. Below are some pointers to help you get started quickly

  1. Connecting to data source: Tableau has grown very fast over past years, allowing you to connect to almost all the data sources that you can think of, from On-Premises to Cloud data bases. However, there can still be some new data sources which might not be in the list of the allowed connections. In such scenarios, you need to figure out other ways to connect, like exporting the data to other databases or flat files which can be connected to Tableau. 
  2. Working with Cube Data sources: Few Tableau features are not supported with cube data sources, which can urge you to look for other options. Example, one of the most limiting feature of cube data sources is that measures in cube data sources are pre-aggregated. Tableau doesn’t support aggregations like SUM, AVG, CNT etc in Cube data sources like SAP NetWeaver Business Warehouse, Oracle Essbase, Teradata OLAP, etc. In such scenarios, where you have cube database as your source data, dragging the measure to the column shelf or row shelf, will take the default aggregation set at database level, which cannot be changed in Tableau. It means that, if a measure’s aggregation is set as SUM at database level, you will not be able to get AVG of that measure. However, there are workarounds using Table Calculations, which may or may not be convenient for every user. If these limiting features are seriously affecting your deliverables, you may want to explore other options like exporting the data from cube data source to relational data source or flat files & connecting Tableau to the exported data. 
  3. Live Vs Extract: Live connections get the most recent data for your analysis. When using Live connection, you can be assured that whatever you are seeing on your screen is the most up to date information. However, getting the most updated information can come at the cost of speed. As Tableau is sending and requesting data across the servers every time you use a visualization, speed at which you get the result may be reduced. If you want faster results, you can opt for Extract connections. With extract connections, Tableau saves a copy of your data set in its temporary location. Once the data is saved in Tableau’s temporary location, it can be accessed faster as Tableau will not have to send and request data across the servers for every query. However, it may or may not be the most up to date information. Choosing between Live and Extract connection requires you look at various aspects. If your data changes frequently and you want the most up to date information, then you should go for Live connection, but if the data gets updated less frequently, then Extract connection are the most optimal choice. Other than frequency of changes and speed of query result, there may be other factors that will affect your decision of Live or Extract connection. You may want to consider cost factor while using Live connections, especially with Cloud database or Cloud Servers. If your dashboard is accessed by multiple users, with every action, there may be a query fired to your database which will add to your data transfer & other overhead costs. Apart from this, if the database is accessed by other groups in your organization, then hitting the database continuously may result in slow speed for other groups as well. To conclude, one’s default choice should always be Extract connection, until and unless we have a very convincing reason to go for Live Connection. 
  4. Data Validation: Once you get the data connected, first thing you should do is have a look at meta data and a sample of your data. When you import your data, you should check for data types of all the important fields for your analysis. Check for all your measures to be in number format. One of the most common issue is the data type of ID fields. Sometimes, ID fields only consists of numerical values, which result in those getting aggregated whenever you add them to the visualization. To avoid this, change the data type of ID fields to String. 

If you want to use maps, make sure your location fields are configured as Geographic data type. Having a look at sample data will give you a rough idea of what the data looks like and whether it meets your expectations in the first pass. 

  1. Data Transformation: Once you have validated the data, you would want to add or remove some of the fields. Some measures that you want to use in your analysis, may not be readily available in your source, and thus there is a need create calculated fields. Sometimes, you also want to add more dimensions to your report. There are multiple approaches that can be followed to add the additional fields.
  1. Add calculated fields in Tableau: New calculated fields can be added in Tableau. It is advisable to add the calculated fields in Tableau, if the calculations are simple. If the data is huge and the logic of calculated field is too complex, it may affect the performance of your report. If performance of the report is not as expected, then other approaches listed below can be explored. However, calculations like Profit Ratio should always be done at Report Level, as these are to be calculated after the aggregation is done. 
  2. Create Custom SQL Query to add calculated fields: Instead of adding calculated fields in Tableau, fields can also be added in Custom SQL query. If you are using Extract connection along with this method, the fields will be calculated only while creating extract. This will result in improvement in performance as the results will not have to be calculated every time the user connects to Tableau file or loads the visualization.
  3. Add measures at database level: New fields can also be added directly at database level. If you have required permissions, you can add fields directly in the table or create a view with additional fields. Adding measures in database will further improve the performance of your Tableau report as these will not have to be calculated by Tableau. If you don’t have the required access to make changes at the database level, you might want to choose one of the above approaches to get the required fields in your data. As a Tableau Developer, you will find adding fields in Tableau Report directly more convenient than adding fields in database. If your data is not very huge and adding fields will not have major impact on performance, then you can opt for directly creating calculated fields in Tableau Report. 
  4. Hide/Unhide Unused Fields: If you are working on large datasets, you would want to make every effort to improve the performance of your report. One way is to optimize your data set by removing all the un-necessary columns. If your Tableau report is connected to Database that has more than 50 columns, but your Tableau report needs only 5 of those, then it makes sense to import only the required columns in your report. Tableau has in-built feature to allow you to hide/unhide unused columns. When this option is selected, Tableau will hide all the columns which are not used anywhere in the report. These hidden columns are excluded while creating extract, thus reducing the size of source data. This feature is best used only after the report designing is complete, because only after the report designing is complete, one can be sure of the fields that are not required to be part of your source data set.

Even after looking at the above-mentioned aspects of data, there will be still times, when in the middle of your report designing, you will have to make changes to the data model, clean up some data, and add few more fields. The process is iterative and just like few versions of report, there will be few versions of your data set. However, investing some time to understand & cleanup your data correctly before starting with your report, will surely help in getting things going in the right direction.