Datasets 101

Peace Ossom Williamson
Sailee Pawar

Data Visualization Learning Group

presentation: http://pow123.github.io/datasets101

Learning Outcomes

At the completion of this session, participants will

be able to describe data literacy

understand and identify areas of responsibility in working with data

demonstrate abilities in structuring and cleaning data

utilize data dictionaries

demonstrate abilities in preparing data for visualization

Data Literacy

The ability to consume for knowledge, produce coherently, and think critically about data

Data Responsibility

Be aware of how data was collected. Some examples:

Crime Statistics

Reporting Statistics

80% voting for Trump!*

*10 people polled

... at Trump campaign office
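To see why a poll of 10 people says so little, here is a minimal sketch of the usual margin-of-error calculation for a sample proportion. The function name and the comparison sample size of 1,000 are assumptions made for illustration. The normal approximation it uses is rough at n=10, and no sample size corrects for where the poll was taken.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion
    (normal approximation; only a rough guide for very small n)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# 80% support reported from 10 respondents, as on the slide
small = margin_of_error(0.80, 10)
# Same percentage from a hypothetical sample of 1,000 respondents
large = margin_of_error(0.80, 1000)

print(f"n=10:   80% +/- {small:.0%}")
print(f"n=1000: 80% +/- {large:.1%}")
```

A margin this wide means the "80%" headline is compatible with a much closer race, even before asking who was polled.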

Correlation versus Causation

When ice cream sales rise, so do homicides.

Coincidence?
Or will your next cone murder you?
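One way to probe a correlation like this is to ask whether a third variable drives both. The sketch below uses synthetic numbers, constructed so that temperature alone drives both series; none of it is real crime or sales data, and all names are hypothetical. It compares the raw correlation with the partial correlation after controlling for temperature.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Synthetic data: both series depend on temperature plus unrelated noise
temperature = [10, 12, 14, 16, 18, 20, 22, 24]
noise_a = [1, -1, -1, 1, 1, -1, -1, 1]
noise_b = [1, 1, -1, -1, -1, -1, 1, 1]
ice_cream = [5 * t + 3 * e for t, e in zip(temperature, noise_a)]
homicides = [0.5 * t + e for t, e in zip(temperature, noise_b)]

raw = pearson(ice_cream, homicides)
controlled = partial_corr(ice_cream, homicides, temperature)
print(f"raw correlation:            {raw:.2f}")
print(f"controlling for temperature: {controlled:.2f}")
```

In this constructed case the strong raw correlation disappears once temperature is accounted for. Real data will rarely be this clean.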

Don’t confuse skepticism with cynicism.

Don’t read what you want into the data.

Before you begin

Determine what question(s) you are trying to answer.

Finding the Data

Find or request all the variables you may (or may not) need.

Taxonomy of Dirty Data

Data Structure & Formatting

Open file formats

(For example, CSV instead of XLS)

File library naming

(Don't use file names like "datasetspres_FINAL.doc")

First row is headers

(In naming, don't use spaces, hyphens, or other special characters)

Transform and rearrange columns and rows as necessary
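The header-naming advice above can be sketched as a small helper. The slide only says to avoid spaces, hyphens, and special characters; lowercasing and joining words with underscores is an assumed convention, and the sample headers are made up.

```python
import re

def clean_header(name: str) -> str:
    """Replace each run of non-alphanumeric characters with one
    underscore, trim stray underscores, and lowercase the result."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return cleaned.strip("_").lower()

# Hypothetical headers as they might arrive in a spreadsheet export
raw_headers = ["First Name", "birth-weight (g)", "Time of Birth"]
headers = [clean_header(h) for h in raw_headers]
print(headers)
```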

Data Cleaning

Field uniformity - male, MALE, man, femal, Female, F

Names are the worst -
Are Joseph Smith, JT Smith, Joe Smith, Joseph T Smith and Smith, Joseph all the same person?
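For the field-uniformity example above (male, MALE, man, femal, Female, F), a minimal sketch is a lookup table that maps every known variant to one canonical value. The mapping and the "unknown" fallback are assumptions; in practice the table is built by inspecting the distinct values actually present in the column.

```python
# Known variants mapped to a canonical label (extend after inspecting the data)
CANONICAL = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "femal": "female", "woman": "female",
}

def normalize_gender(value: str) -> str:
    """Return the canonical label, or 'unknown' so that unexpected
    values get flagged for review instead of silently guessed."""
    return CANONICAL.get(value.strip().lower(), "unknown")

raw = ["male", "MALE", "man", "femal", "Female", "F"]
normalized = [normalize_gender(v) for v in raw]
print(normalized)
```

Names are harder: no lookup table can decide whether Joe Smith and Joseph T Smith are the same person, so those matches usually need a second identifier or manual review.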

Delimitation

First and last names or address info
together or separate columns?
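If names are kept in separate columns, a combined field has to be split. The sketch below handles only two assumed layouts, "Last, First" and "First [Middle] Last"; the function name is hypothetical, and suffixes, multi-word surnames, and other real-world cases would need more rules.

```python
def split_name(full: str) -> tuple[str, str]:
    """Split a full name into (first, last).
    Handles 'Last, First' and 'First [Middle] Last' only."""
    full = " ".join(full.split())  # collapse repeated whitespace
    if "," in full:
        last, first = [part.strip() for part in full.split(",", 1)]
    else:
        parts = full.split(" ")
        first, last = " ".join(parts[:-1]), parts[-1]
    return first, last

for name in ["Joseph Smith", "Joseph T Smith", "Smith, Joseph"]:
    print(split_name(name))
```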

Data Dictionary

A file (PDF, spreadsheet, readme, etc.) that tells

how the data is formatted (delimited text, dBase, etc.)

the order of the variables

the name of each variable

the datatype of each variable (text string, integer, decimal, etc.)

and an explanation of codes (1=Male, 0=Female).
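A data dictionary can also be written in machine-readable form and applied when loading a file. The following is a sketch under stated assumptions: the column names, sample rows, and the dictionary layout are all hypothetical; only the 1=Male, 0=Female coding comes from the slide.

```python
import csv
import io

# Hypothetical data dictionary: variable order, names, datatypes, and codes
DATA_DICTIONARY = {
    "id": {"type": int},
    "sex": {"type": int, "codes": {1: "Male", 0: "Female"}},
    "weight_kg": {"type": float},
}

# Made-up sample file contents, first row is headers
raw_csv = "id,sex,weight_kg\n1,1,3.4\n2,0,3.1\n"

def apply_dictionary(row: dict) -> dict:
    """Cast each field to its documented datatype, then decode any codes."""
    out = {}
    for name, spec in DATA_DICTIONARY.items():
        value = spec["type"](row[name])
        out[name] = spec.get("codes", {}).get(value, value)
    return out

rows = [apply_dictionary(r) for r in csv.DictReader(io.StringIO(raw_csv))]
print(rows)
```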

Visualization

Start with the data.
End with a story.

Tips

Last year, Americans spent X billion dollars on vitamins.
Provide context:

Proportion

Internal comparison

External comparison

Change over time

Combination of methods
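Two of the context methods above, proportion and change over time, come down to simple arithmetic. The sketch below uses placeholder figures that are purely hypothetical and are not real spending data; the variable and function names are also assumptions.

```python
def proportion(part: float, whole: float) -> float:
    """Share of the whole represented by the part."""
    return part / whole

def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

# Placeholder values in billions of dollars (hypothetical, for illustration only)
vitamin_spend = 30.0
vitamin_spend_prior_year = 25.0
total_health_spend = 600.0

share = proportion(vitamin_spend, total_health_spend)
change = percent_change(vitamin_spend_prior_year, vitamin_spend)
print(f"Share of total: {share:.1%}")
print(f"Change from prior year: {change:+.1f}%")
```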

Tips

Other ways to provide context:

Geographical, historical, or other breakdowns of data

Additional data needed to ensure comparisons are fair

Any other data that spending can be meaningfully compared or related to

Activity

Investigate the correlation between factors (e.g., anesthesia, family size) and baby outcome (e.g., time of birth, weight, or gender). Use the data provided in the files on the LibGuide http://libguides.uta.edu/datavizlg

References

Gray, J., Bounegru, L., & Chambers, L. (Eds.). (2016). The data journalism handbook (1st ed.). Retrieved from www.datajournalismhandbook.org/1.0/en/
Kim, W., Choi, B.-J., Hong, E.-K., Kim, S.-K., & Lee, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery, 7(1), 81-99. doi:10.1023/A:1021564703268
McDearmon, M. (2016). Cleaning data for analysis and visualization. Retrieved from http://mikemcdearmon.com/portfolio/techposts/cleaning-data-for-analysis-and-visualization
