Modeling Wuhan Coronavirus (2019-nCoV) Global Cases Dashboard with Power BI — Part 3
This is a series for Modeling Wuhan Coronavirus (2019-nCoV) Global Cases Dashboard with Power BI . For the rest of the series, please refer to links below:
- Modeling Wuhan Coronavirus (2019-nCoV) Global Cases Dashboard with Power BI — Part 1
- Modeling Wuhan Coronavirus (2019-nCoV) Global Cases Dashboard with Power BI — Part 2
In this last post, I will address on the Advanced use of Power Query to consolidate all the existing sheets and aggregate the data to form this visual based on the Google sheets provided by JHU CSSE.
Understanding the existing data set
Based on the downloadable Google Sheets provided by JHU CSSE since January 22 2020, the columns and metrics collected as evolved along the way.
The sheet provided by JHU CSSE changes over time since the start. Below are the key changes:
- Change in the number of columns (from 6 to 7 and then to 6 again).
- Change of column name from Jan 26.
- Change in the name convention for Mainland China and China.
Assuming that the number of columns and the columns names are fixed since day 1, you can easily merge all the sheets to get a single consolidated view of all the statistics.
Based on the line chart which I have done, I’m only looking at the total number of confirmed cases between Mainland China and the rest of the world occurring over time. Hence, the columns which I need are Country or Country/Region, and Confirmed.
Combining data from multiple worksheets in the same workbook
Step 1 — Get a table with all the required worksheets listed
Create a new query in the same Power BI file using the same web URL source.
Once you connect with the data source, right click the root and click Edit.
In the table, the sheets already sorted accordingly to how the folks at JHU CSSE add to the Google sheet, with the latest set of data as the first sheet, so on and forth. As mentioned in Part 1, there is a possibility that there are multiple updates in a single day. Therefore, we need to find a way to filter and get the updated data set for each unique day.
Select the column Item, click Extract under Add Column tab and select Text Before Delimiter.
Under Delimiter, enter underscore. We want to extract the prefix of the sheet name (i.e. Jan31_11pm → Jan31). A new column will be added with the extracted name and renamed the column as Date.
Select the column Date, click Remove Rows under Home tab and select Remove Duplicates. It will iterate from the first row of the table as Row N, compare between Row N and Row N+1, and it will remove N + 1 if it’s the same. After removing N+1, it will then compare N with N+2. If there is no match, Row N+2 will take on as N and repeat the cycle again.
This will ensure that the latest data set of each day will be retained and remove the redundant ones.
Step 2 — Create a template query
Duplicate the query above and call the new query Template.
In the Template query, select one of the worksheets to use to build the query whose logic will be applied to the rest of the sheets, and filter the table above so it only contains the row for that worksheet. For this example, I will be using the worksheet named Jan31.
Remove all the other columns in the table except the Data column.
Click on the Table link inside the cell and the content of the worksheet will be expanded.
Under the Applied Step, there will be a Changed Type step in the query. This step sets the data types for each of the columns. You will need to remove it.
From this step onward, we need to avoid any transformations that will generate M code which refers to any columns on the original worksheet that may not be present in other worksheets. You can refer to this link to understand the effects of M code and cell referencing.
In this table, we want to keep the following columns — Country/Region and Confirmed.
Country/Region column name and column value for Mainland China
Remove the “Promoted Headers” step.
Select Column 2 and replace values “Country” to Country/Region”, expand Advanced option and check “Match entire cell contents”. This is to handle the different column naming for Country/Region used in the earlier days.
Select Column 2 again and replace values “China” to “Mainland China”, expand Advanced option and check “Match entire cell contents”. This is to handle the different column value to indicate the country — Mainland China.
Once completed, promote the first row of the table as Column Headers.
Select the identified columns and aggregate the statistics according to Country/Region
Select Choose Columns under Home tab, select the required columns in the pop-up window and click OK.
Select Group By under Transform tab, enter the following information shown above, and click OK. You will have the following aggregated statistics of confirmed cases for each Country/Region.
Step 3 — Create a function
You will now need to create a new parameter by clicking the Manage Parameters/New Parameter button, call the parameter Worksheet, set the data type to text and have it return the name of the worksheet you chose in the previous step as shown below.
Go back to the Template query, find the step called Filtered Rows towards the beginning, and click the gear icon next to the step to edit it.
Edit the step so it will use the value returned by the parameter to filter by instead of the hard-coded value you have entered. Click on the button highlighted below, select Parameter and select the Worksheet parameter in the drop-down box.
Go the the Queries pane on the left hand side and right click on the Template query and select Create Function.
The Create Function window will open, provide a name for the new function as shown below and click OK.
Step 4 — Invoke the function and combine the data
Go back to the duplicate copy of the original query at the beginning of Step 2.
Go to the Add Column tab on the ribbon and click the Invoke Custom Function button and invoke the GetAggregateData function, passing in the contents of the Date column to the function’s parameter.
After invoking the custom function, a new column GetData will contain a nested table of the aggregated confirmed cases by Country/Region for each date.
Click the Expand/Aggregate button on the GetData Column and expand the nested tables. Go ahead to clean up the tables and you should get the following outcome.
As for the rest, I would assume that you should be able to manipulate the data for the visuals. With some tweaks and new measures, you should be able to arrive at this outcome for your line chart.
Power BI is able to cover the “last-mile” of your data journey — from Connecting to your data sources, performing simple ETL, publish and managing your data visualization. Using the data provided by JHU CSSE for Coronavirus (2019-nCoV), we are able to replicate and build a simple online dashboard using Power BI and explore ways of maintaining an updated view of this global epidemics.
Thank you for reading and I hope that you have managed to pick up something through these short posts. I have made my .pbix file available (link) and feel free to use it for your own.
Stay safe and wash your hands 😃
The opinions and views expressed here are those of my own and do not necessarily state or reflect those of Microsoft Singapore or Microsoft Corporation.