Learning Outcomes
At the completion of this session, participants will:
- be able to describe data literacy
- understand and identify areas of responsibility in working with data
- demonstrate abilities in structuring and cleaning data
- utilize data dictionaries
- demonstrate abilities in preparing data for visualization
Data Literacy
The ability to consume data for knowledge, produce it coherently, and think critically about it.
Data Responsibility
Be aware of how data was collected. Some examples:
- Crime Statistics
- Reporting Statistics
80% voting for Trump!*
*10 people polled
... at Trump campaign office
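The sample-size problem in the poll above can be put in numbers. The sketch below uses the standard normal-approximation formula for the margin of error of a proportion; the function name `margin_of_error` is made up for this illustration, and the approximation is itself rough at a sample size of 10. For 10 people it gives about 25 percentage points either way.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Uses the normal approximation, which is unreliable for very small n.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# The slide's poll: 80% of 10 respondents.
moe_small = margin_of_error(0.8, 10)

# Same result from a hypothetical sample of 1,000 respondents.
moe_large = margin_of_error(0.8, 1000)

print(f"n=10:   80% +/- {moe_small:.1%}")
print(f"n=1000: 80% +/- {moe_large:.1%}")
```

A margin of error only describes random sampling noise. It does nothing about where the sample came from: polling at a campaign office is a biased sample at any size.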
Correlation versus Causation
When ice cream sales rise, so do homicides.
Coincidence?
Or will your next cone murder you?
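One way to see what is going on is to look for a third variable that drives both. The sketch below uses made-up monthly figures (every number is hypothetical, chosen only for illustration) in which temperature pushes up both ice cream sales and homicides. The helper `pearson` is a plain implementation of the Pearson correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical illustration data, not real statistics.
temps = [5, 10, 15, 20, 25, 30]        # average temperature
ice_cream = [12, 18, 33, 41, 48, 62]   # ice cream sales
homicides = [3, 4, 4, 6, 7, 7]         # homicide count

r_ice_hom = pearson(ice_cream, homicides)
r_temp_ice = pearson(temps, ice_cream)
r_temp_hom = pearson(temps, homicides)

print(f"ice cream vs homicides:   {r_ice_hom:.2f}")
print(f"temperature vs ice cream: {r_temp_ice:.2f}")
print(f"temperature vs homicides: {r_temp_hom:.2f}")
```

All three correlations are strong in this toy data. A strong correlation between ice cream and homicides is exactly what a shared cause would produce, so on its own it is not evidence that one causes the other.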
Don’t confuse skepticism with cynicism.
Don’t read what you want into the data.
Before you begin
Determine what question(s) you are trying to answer.
Finding the Data
Find or request all the variables you may (or may not) need.
Data Structure & Formatting
File format (for example, CSV instead of XLS)
File library naming (don't use file names like "datasetspres_FINAL.doc")
First row is headers (in naming, don't use spaces, hyphens, or other special characters)
Transform and rearrange columns and rows as necessary
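The header rule above can be applied automatically. This is a minimal sketch: the function name `clean_header` and the sample headers are invented for illustration, and replacing unwanted characters with underscores is one common convention, not the only option.

```python
import re

def clean_header(name):
    """Turn a raw column header into a lowercase, underscore-separated name."""
    name = name.strip().lower()
    # Replace each run of spaces, hyphens, or other special characters
    # with a single underscore.
    name = re.sub(r"[^a-z0-9]+", "_", name)
    # Drop underscores left over at either end.
    return name.strip("_")

# Hypothetical headers as they might appear in a spreadsheet export.
raw_headers = ["First Name", "Date-of-Birth", "Weight (lbs)"]
headers = [clean_header(h) for h in raw_headers]
print(headers)
```

Note that this version also replaces accented or other non-ASCII letters, so check the results if your headers are not in plain English.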
Data Cleaning
- Field uniformity - male, MALE, man, femal, Female, F
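A lookup table is one simple way to get field uniformity. In the sketch below, the mapping `CANONICAL`, the output codes "M" and "F", and the function name `normalize_sex` are all assumptions made for illustration; the input values are the ones listed above.

```python
# Hypothetical mapping from observed spellings to one canonical code each.
CANONICAL = {
    "male": "M", "man": "M", "m": "M",
    "female": "F", "femal": "F", "f": "F",
}

def normalize_sex(value):
    """Return the canonical code, or None if the value is not recognized."""
    key = value.strip().lower()
    return CANONICAL.get(key)  # None flags a value that needs manual review

raw = ["male", "MALE", "man", "femal", "Female", "F"]
cleaned = [normalize_sex(v) for v in raw]
print(cleaned)
```

Returning None for unknown values, rather than guessing, keeps unexpected entries visible so a person can decide how to handle them.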
Names are the worst:
Are Joseph Smith, JT Smith, Joe Smith, Joseph T Smith and Smith, Joseph all the same person?
First and last names or address info: together or in separate columns?
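Keeping name parts in separate columns makes this kind of question easier to investigate. The sketch below splits a name into parts and builds a loose matching key from the last name and first initial. Both functions (`split_name`, `match_key`) are hypothetical helpers, and they assume non-empty names in either "First Last" or "Last, First" order.

```python
def split_name(full_name):
    """Split a name into (first, middle, last); handles 'Last, First' order."""
    if "," in full_name:
        last, rest = [part.strip() for part in full_name.split(",", 1)]
        parts = rest.split() + [last]
    else:
        parts = full_name.split()
    first, last = parts[0], parts[-1]
    middle = " ".join(parts[1:-1])
    return first, middle, last

def match_key(full_name):
    """Loose key (last name, first initial) for grouping possible duplicates."""
    first, _, last = split_name(full_name)
    return (last.lower(), first[0].lower())

names = ["Joseph Smith", "JT Smith", "Joe Smith", "Joseph T Smith", "Smith, Joseph"]
keys = {match_key(n) for n in names}
print(keys)
```

All five variants share one key, which marks them as candidates for review. It does not prove they are the same person: a John Smith would land in the same group, so a human (or additional fields such as address or birth date) still has to decide.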
Data Dictionary
A file (PDF, spreadsheet, readme, etc.) that tells
- how the data is formatted (delimited text, dBase, etc.)
- the order of the variables
- the name of each variable
- the datatype of each variable (text string, integer, decimal, etc.)
- and an explanation of codes (1=Male, 0=Female).
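A data dictionary can also be written in a form a program can use. The sketch below stores the order, name, datatype, and codes of each variable, then uses them to read one row of raw text. The variable names (`id`, `sex`, `weight_kg`) and the sample row are invented for illustration; only the 1=Male, 0=Female coding comes from the list above.

```python
# A hypothetical data dictionary: list order is the column order in the file.
DATA_DICTIONARY = [
    {"name": "id", "type": int, "codes": None},
    {"name": "sex", "type": int, "codes": {1: "Male", 0: "Female"}},
    {"name": "weight_kg", "type": float, "codes": None},
]

def decode_row(raw_values):
    """Convert one row of text fields using the dictionary's order, types, and codes."""
    record = {}
    for spec, raw in zip(DATA_DICTIONARY, raw_values):
        value = spec["type"](raw)          # apply the declared datatype
        if spec["codes"] is not None:
            value = spec["codes"][value]   # replace the code with its meaning
        record[spec["name"]] = value
    return record

row = decode_row(["17", "1", "3.4"])
print(row)
```

Without the dictionary, the raw row "17, 1, 3.4" is just three numbers; with it, each value has a name, a type, and a meaning.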
Start with the data.
End with a story.
Tips
Last year, Americans spent X billion dollars on vitamins.
Provide context:
- Proportion
- Internal comparison
- External comparison
- Change over time
- Combination of methods
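Several of these context methods are simple arithmetic. The sketch below shows proportion, change over time, and a per-person figure. Every number in it is a placeholder chosen to make the arithmetic easy to follow, not a real statistic, and the function names are made up for this illustration.

```python
def proportion(part, whole):
    """Share of the whole that the part represents (0 to 1)."""
    return part / whole

def percent_change(old, new):
    """Change from old to new, as a percentage of old."""
    return (new - old) / old * 100

def per_capita(total, population):
    """Total divided evenly across a population."""
    return total / population

# Placeholder values only (billions of dollars, billions of people).
vitamin_spending = 20.0
vitamin_spending_prior_year = 16.0
all_health_spending = 400.0
population = 0.32

share = proportion(vitamin_spending, all_health_spending)
change = percent_change(vitamin_spending_prior_year, vitamin_spending)
per_person = per_capita(vitamin_spending, population)

print(f"Share of health spending: {share:.0%}")
print(f"Change from prior year:   {change:+.0f}%")
print(f"Spending per person:      ${per_person:.2f}")
```

A bare total in billions is hard to judge. The same figure expressed as a share, a year-over-year change, or dollars per person gives the audience something to compare it against.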
Tips
Other ways to provide context:
- Geographical, historical, or other breakdowns of data
- Additional data needed to ensure comparisons are fair
- Any other data to provide interesting analysis to compare or relate spending to
Investigate the correlation between factors (e.g., anesthesia, family size) and baby outcome (e.g., time of birth, weight, or gender). Use the data provided in the files on the LibGuide http://libguides.uta.edu/datavizlg
References
Gray, J., Bounegru, L., & Chambers, L. (Eds.). (2016). The data journalism handbook. (1st ed.) Retrieved from www.datajournalismhandbook.org/1.0/en/
Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81-99. doi:10.1023/A:1021564703268
McDearmon, M. (2016). Cleaning data for analysis and visualization. Retrieved from http://mikemcdearmon.com/portfolio/techposts/cleaning-data-for-analysis-and-visualization