
mGalarnyk / project1.md


Exploratory Data Analysis Project 1

This assignment uses data from the UC Irvine Machine Learning Repository, a popular repository for machine learning datasets. In particular, we will be using the “Individual household electric power consumption Data Set” which I have made available on the course web site:

Dataset: Electric power consumption [20Mb]

Description: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.


akundu3 commented Dec 16, 2018

Hi! I am just starting to work on this dataset, doing some exploratory analysis in a Python notebook. This will be great for reference. Thank you for posting!


Assignment #1 (demo). Exploratory data analysis with Pandas


mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitsky. Translated and edited by Sergey Isaev, Artem Trunov, Anastasia Manokhina, and Yuanyuan Pao. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.

The same assignment is also available as a Kaggle Kernel, together with a solution.

In this task you should use Pandas to answer a few questions about the Adult dataset. (You don't have to download the data - it's already in the repository.) Choose the answers in the web form.

Unique values of features (for more information please see the link above):

age: continuous;

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked;

fnlwgt: continuous;

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool;

education-num: continuous;

marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse;

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces;

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried;

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black;

sex: Female, Male;

capital-gain: continuous;

capital-loss: continuous;

hours-per-week: continuous;

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands;

salary: >50K, <=50K.

1. How many men and women (sex feature) are represented in this dataset?

2. What is the average age (age feature) of women?

3. What is the percentage of German citizens (native-country feature)?

4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

6. Is it true that people who earn more than 50K have at least a high school education? (education feature: Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters, or Doctorate)

7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.

8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those whose marital-status starts with Married (Married-civ-spouse, Married-spouse-absent, or Married-AF-spouse); the rest are considered bachelors.

9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work that many hours, and what is the percentage of those who earn a lot (>50K) among them?

10. Compute the average working time (hours-per-week) for those who earn a little and those who earn a lot (salary) in each country (native-country). What are these values for Japan?
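A minimal pandas sketch for a few of these questions, assuming the dataset has been loaded from the repository's CSV (the file name adult.data.csv is an assumption; adjust it to your copy):

    import pandas as pd

    # Load the Adult dataset (file name is an assumption; adjust to your copy).
    df = pd.read_csv("adult.data.csv")

    # 1. How many men and women are represented?
    print(df["sex"].value_counts())

    # 2. Average age of women.
    print(df.loc[df["sex"] == "Female", "age"].mean())

    # 3. Percentage of German citizens.
    print((df["native-country"] == "Germany").mean() * 100)

    # 7. Age statistics for each race and gender; then the maximum age of
    # Amer-Indian-Eskimo men.
    stats = df.groupby(["race", "sex"])["age"].describe()
    print(stats.loc[("Amer-Indian-Eskimo", "Male"), "max"])

    # 10. Mean hours-per-week by country and salary bracket; then Japan.
    print(df.groupby(["native-country", "salary"])["hours-per-week"].mean().loc["Japan"])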

Exploratory Data Analysis. Assignment 1

by Anastasiia Alieksieienko

This assignment uses data from the UC Irvine Machine Learning Repository, a popular repository for machine learning datasets. In particular, we will be using the "Individual household electric power consumption Data Set" which is available on the course web site:

Dataset: Electric power consumption [20Mb]

Description: Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

  • Date: date in format dd/mm/yyyy
  • Time: time in format hh:mm:ss
  • Global_active_power: household global minute-averaged active power (in kilowatts)
  • Global_reactive_power: household global minute-averaged reactive power (in kilowatts)
  • Voltage: minute-averaged voltage (in volts)
  • Global_intensity: household global minute-averaged current intensity (in amperes)
  • Sub_metering_1: energy sub-metering No. 1 (in watt-hours of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
  • Sub_metering_2: energy sub-metering No. 2 (in watt-hours of active energy). It corresponds to the laundry room, containing a washing machine, a tumble-drier, a refrigerator and a light.
  • Sub_metering_3: energy sub-metering No. 3 (in watt-hours of active energy). It corresponds to an electric water-heater and an air-conditioner.

Loading and preprocessing the data

Download file and unzip the archive to the current working directory

Read first 5 rows to get headers

Read 2900 rows that contain information on 2007-02-01 and 2007-02-02

Convert the Date and Time variables to Date/Time format

Subset the loaded data for 2007-02-01 and 2007-02-02
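The original course solution is written in R; purely to illustrate the same preprocessing steps, here is a pandas sketch (the file name household_power_consumption.txt matches the UCI archive's naming, but treat it as an assumption):

    import pandas as pd

    # Read the raw file; the UCI archive uses ';' separators and '?' for missing values.
    df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?",
                     dtype={"Date": str, "Time": str}, low_memory=False)

    # Convert the Date and Time columns into a single datetime column.
    df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"],
                                    format="%d/%m/%Y %H:%M:%S")

    # Subset the two days of interest: 2007-02-01 and 2007-02-02.
    mask = (df["DateTime"] >= "2007-02-01") & (df["DateTime"] < "2007-02-03")
    df = df.loc[mask]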

Creating plots

The overall goal here is to examine how household energy usage varies over a 2-day period in February 2007. Each plot is created a second time in order to save it to a PNG file without it being corrupted by the dev.copy() function.

Plot 1. Histogram of global active power consumption

Plot 2. Global active power consumption over time

Plot 3. Energy sub-meterings over time

Plot 4. Combination of four plots: global active power over time, energy sub-meterings, voltage over time, and global reactive power over time
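Again, the course expects base R graphics (hence the dev.copy() note above). As an illustrative analogue only, here is plot 1 in matplotlib, reusing the df subset from the sketch above:

    import matplotlib.pyplot as plt

    # Plot 1: histogram of global active power over the two-day window.
    plt.hist(df["Global_active_power"].dropna(), color="red")
    plt.xlabel("Global Active Power (kilowatts)")
    plt.ylabel("Frequency")
    plt.title("Global Active Power")
    plt.savefig("plot1.png")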

6.894: Interactive Data Visualization

Assignment 2: Exploratory Data Analysis

In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.

Step 1: Data Selection

First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we've pre-selected a number of datasets for you to choose from.

However, if you would like to investigate a different topic and dataset, you are free to do so. If working with a self-selected dataset, please check with the course staff to ensure it is appropriate for the course. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure you have sufficient time to conduct exploratory analysis after preparing the data.

After selecting a topic and dataset – but prior to analysis – you should write down an initial set of at least three questions you'd like to investigate.

Step 2: Exploratory Visual Analysis

Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform "sanity checks" for patterns you expect to see!

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (by adding additional variables, changing sorting or axis scales, filtering or subsetting data, etc.) to develop better perspectives, explore unexpected observations, or sanity-check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.

Final Deliverable

Your final submission should take the form of a Google Docs report - similar to a slide show or comic book - that consists of 10 or more captioned visualizations detailing your most important insights. Your "insights" can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing data about motion pictures. We've annotated and graded this example to help you calibrate the breadth and depth of exploration we're looking for.

Each visualization image should be a screenshot exported from a visualization tool, accompanied by a title and a descriptive caption (1-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail in each caption that anyone could read through your report and understand what you've learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on the exported image. To easily export images from Tableau, use the Worksheet > Export > Image... menu item.

The end of your report should include a brief summary of main lessons learned.

Recommended Data Sources

To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets:

World Bank Indicators, 1960–2017. The World Bank has tracked global human development via indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. The linked repository contains indicators that have been formatted to facilitate use with Tableau and other data visualization tools. However, you're also welcome to browse and use the original data by indicator or by country. Click on an indicator category or country to download the CSV file.

Chicago Crimes, 2001–present (click Export to download a CSV file). This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department's CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

Daily Weather in the U.S., 2017. This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column.

Social mobility in the U.S. Raj Chetty's group at Harvard studies the factors that contribute to (or hinder) upward mobility in the United States (i.e., will our children earn more than we will). Their work has been extensively featured in The New York Times. This page lists data from all of their papers, broken down by geographic level or by topic. We recommend downloading data in the CSV/Excel format, and encourage you to consider joining multiple datasets from the same paper (under the same heading on the page) for a sufficiently rich exploratory process.

The Yelp Open Dataset provides information about businesses, user reviews, and more from Yelp's database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don't need to look at all of the data to answer interesting questions. In order to download the data you will need to enter your email and agree to Yelp's Dataset License.

Additional Data Sources

If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!

  • data.boston.gov - City of Boston Open Data
  • MassData - State of Massachusetts Open Data
  • data.gov - U.S. Government Open Datasets
  • U.S. Census Bureau - Census Datasets
  • IPUMS.org - Integrated Census & Survey Data from around the World
  • Federal Elections Commission - Campaign Finance & Expenditures
  • Federal Aviation Administration - FAA Data & Research
  • fivethirtyeight.com - Data and Code behind the Stories and Interactives
  • Buzzfeed News
  • Socrata Open Data
  • 17 places to find datasets for data science projects

Visualization Tools

You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau . Tableau provides a graphical interface focused on the task of visual data exploration. You will (with rare exceptions) be able to complete an initial data exploration more quickly and comprehensively than with a programming-based tool.

  • Tableau - Desktop visual analysis software. Available for both Windows and macOS; register for a free student license.
  • Data Transforms in Vega-Lite . A tutorial on the various built-in data transformation operators available in Vega-Lite.
  • Data Voyager , a research prototype from the UW Interactive Data Lab, combines a Tableau-style interface with visualization recommendations. Use at your own risk!
  • R , using the ggplot2 library or with R's built-in plotting functions.
  • Jupyter Notebooks (Python) , using libraries such as Altair or Matplotlib .

Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!

Graphical Tools

  • Tableau Prep - Tableau provides basic facilities for data import, transformation & blending. Tableau Prep is a more sophisticated data preparation tool.
  • Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
  • OpenRefine - A free, open source tool for working with messy data.

Programming Tools

  • JavaScript data utilities and/or the Datalib JS library.
  • Pandas - Data table and manipulation utilities for Python.
  • dplyr - A library for data manipulation in R.
  • Or, the programming language and tools of your choice...

The assignment score is out of a maximum of 10 points. Submissions that squarely meet the requirements will receive a score of 8. We will determine scores by judging the breadth and depth of your analysis, whether visualizations meet the expressiveness and effectiveness principles, and how well-written and synthesized your insights are.

We will use the following rubric to grade your assignment. Note that rubric cells may not map exactly to specific point scores.

Submission Details

This is an individual assignment. You may not work in groups.

Your completed exploratory analysis report is due by noon on Wednesday 2/19. Submit a link to your Google Doc report using this submission form. Please double-check your link to ensure it is viewable by others (e.g., try it in an incognito window).

Resubmissions. Resubmissions will be regraded by teaching staff, and you may earn back up to 50% of the points lost in the original submission. To resubmit this assignment, please use this form and follow the same submission process described above. Include a short one-paragraph description summarizing the changes from the initial submission; resubmissions without this summary will not be regraded. Resubmissions are due by 11:59pm on Saturday, 3/14. Slack days may not be applied to extend the resubmission deadline. The teaching staff will only begin to regrade assignments once the Final Project phase begins, so please be patient.



Statistics LibreTexts

Unit 1: Exploratory Data Analysis



CO-1: Describe the roles biostatistics serves in the discipline of public health.

CO-6: Apply basic concepts of probability, random variation, and commonly used statistical probability distributions.

Exploratory Data Analysis Introduction (2 videos, 7:04 total)

The Big Picture

Learning Objectives

LO 1.3: Identify and differentiate between the components of the Big Picture of Statistics

Recall “The Big Picture,” the four-step process that encompasses statistics (as it is presented in this course):

1. Producing Data — Choosing a sample from the population of interest and collecting data.

2. Exploratory Data Analysis (EDA) {Descriptive Statistics} — Summarizing the data we’ve collected.

3. and 4. Probability and Inference — Drawing conclusions about the entire population based on the data collected from the sample.

Even though in practice it is the second step in the process, we are going to look at Exploratory Data Analysis (EDA) first. (If you have forgotten why, review the course structure information at the end of the page on The Big Picture and in the video covering The Big Picture.)

Exploratory Data Analysis

LO 1.5: Explain the uses and important features of exploratory data analysis.

As you can tell from the examples of datasets we have seen, raw data are not very informative. Exploratory Data Analysis (EDA) is how we make sense of the data by converting them from their raw form to a more informative one.

In particular, EDA consists of:

  • organizing and summarizing the raw data,
  • discovering important features and patterns in the data and any striking deviations from those patterns, and then
  • interpreting our findings in the context of the problem.

EDA can be useful for:

  • describing the distribution of a single variable (center, spread, shape, outliers)
  • checking data (for errors or other problems)
  • checking assumptions required for more complex statistical analyses
  • investigating relationships between variables

Exploratory data analysis (EDA) methods are often called Descriptive Statistics because they simply describe, or provide estimates based on, the data at hand.

In Unit 4 we will cover methods of Inferential Statistics which use the results of a sample to make inferences about the population under study.

Comparisons can be visualized and values of interest estimated using EDA, but descriptive statistics alone provide no information about the certainty of our conclusions.

Important Features of Exploratory Data Analysis

There are two important features to the structure of the EDA unit in this course:

  • The material in this unit covers two broad topics:

Examining Distributions — exploring data one variable at a time.

Examining Relationships — exploring data two variables at a time.

  • In Exploratory Data Analysis, our exploration of data will always consist of the following two elements:

visual displays, supplemented by

numerical measures.

Try to remember these structural themes, as they will help you orient yourself along the path of this unit.

Examining Distributions

LO 6.1: Explain the meaning of the term distribution in statistics.

We will begin the EDA part of the course by exploring (or looking at) one variable at a time.

As we have seen, the data for each variable consist of a long list of values (whether numerical or not), and are not very informative in that form.

In order to convert these raw data into useful information, we need to summarize and then examine the distribution of the variable.

By distribution of a variable, we mean:

  • what values the variable takes, and
  • how often the variable takes those values.

We will first learn how to summarize and examine the distribution of a single categorical variable, and then do the same for a single quantitative variable.
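As a tiny illustration (a pandas sketch with made-up blood-type values), a variable's distribution is simply its values together with how often each occurs:

    import pandas as pd

    # A hypothetical categorical variable: blood types of six subjects.
    s = pd.Series(["A", "B", "A", "O", "A", "B"])

    print(s.value_counts())                # which values occur, and how often
    print(s.value_counts(normalize=True))  # the same distribution, as proportions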

  • One Categorical Variable
  • One Quantitative Variable: Introduction
  • Role-Type Classification
  • Summary (Unit 1)

1 EDA

1.1 What is Exploratory Data Analysis (EDA)?

Traditional approaches to data analysis tend to be linear and unidirectional: the analysis often starts with the acquisition or collection of a dataset, then ends with the computation of some inferential or confirmatory procedure.

Unfortunately, such practice can lead to faulty conclusions. The following four datasets generate identical regression analysis results, yet they are all completely different!

The four plots represent Francis Anscombe’s famous quartet which he used to demonstrate the importance of visualizing the data before proceeding with traditional statistical analysis. Of the four plots, only the first is a sensible candidate for the regression analysis; the second dataset highlights a nonlinear relationship between X and Y; the third and fourth plots demonstrate the disproportionate influence of a single outlier on the regression procedure.

The aforementioned example demonstrates that a sound data analysis workflow must involve data visualization and exploration techniques. Exploratory data analysis seeks to extract salient features about the data (that may have otherwise gone unnoticed) and to help formulate hypotheses. Only then should appropriate statistical tests be applied to the data to confirm a hypothesis.

However, not all EDA workflows result in a statistical test: We may not be seeking a hypothesis or, if a hypothesis is sought we may not have the statistical tools necessary to test the hypothesis. It’s important to realize that many statistical procedures found in commercial software make restrictive assumptions about the data and the type of hypothesis being tested; data sets seldom meet those stringent requirements.

–John Tukey

John Tukey is credited with having coined the term Exploratory Data Analysis and with having written the first comprehensive book on the subject (Tukey, 1977) [1]. The book is still very much relevant today, and several of the techniques highlighted in the book will be covered in this course.

1.2 The role of graphics in EDA

The preceding example highlights the importance of graphing data. A core component of this course is learning how to construct effective data visualization tools for the purpose of revealing patterns in the data. The graphical tools must allow the data to express themselves without imposing a story .

–William S. Cleveland

William Cleveland has written extensively about data visualization and has focused on principles founded in the field of cognitive neuroscience to improve data graphic designs. His book, Visualizing Data, is a leading authority on statistical graphics and, despite its age, is as relevant today as it was two decades ago. It focuses on graphical techniques (some newer than others) designed to explore the data. This differs from graphics generated for public dissemination, which benefit from another form of data visualization called information visualization (or infovis for short). Infovis will not be covered in this course (though there is some overlap between the two approaches). For a good discussion of the differences between statistical graphics and infovis, see the 2013 article Infovis and Statistical Graphics: Different Goals, Different Looks [2].

Cleveland has also contributed a very important tool to EDA: the LOESS curve, which will be used extensively in this course. It is one of many fitting options used in smoothing (or detrending) the data; others include parametric models such as the family of linear polynomials, and Tukey's suite of smoothers, notably the running median and 3RS3R.

1.3 We need a good data analysis environment

Effective EDA requires a flexible data analysis environment that does not constrain one to a limited set of data manipulation procedures or visualization tools. After all, would any good writer limit herself to a set of a hundred pre-built sentences? Of course not; we would be reading the same novels over and over again! So why would we limit ourselves to a small set of pre-packaged data analysis procedures? EDA requires an arsenal of data analysis building blocks, much like a good writer needs an arsenal of words. Such an environment must provide flexible data manipulation capabilities, a flexible data visualization environment, and access to a wide range of statistical procedures. A scripting language like R offers all of this.

The data analysis environment should also be freely available, and its code open to the public. Free access to the software allows anyone with the right set of skills to share in the data analysis, regardless of any budgetary constraints. The open source nature of the software ensures that any aspect of the code used for a particular task can be examined when additional insight into the implementation of an analytical or numerical method is needed. Deciphering code may not be a skill available to all researchers, but if the need to understand how a procedure is implemented is important enough, an individual with the appropriate programming skills can usually be found, even if for a small fee. Open source software also ensures that the underlying code used to create the executable application can be ported to different platforms or operating systems (even though this too may require some effort and modest programming skills).

1.3.1 The workhorse: R

R is an open source data analysis and visualization programming environment whose roots go back to the S programming language developed at Bell Laboratories in the 1970s by John Chambers. It will be used almost exclusively in this course.

1.3.2 The friendly interface: RStudio

RStudio is an integrated development environment (IDE) for R. An IDE provides the user with an interface to a programming environment (like R), including features such as a source code editor with colored syntax. RStudio is not needed to use R (which ships with its own, more spartan interface), but it makes using R far easier. RStudio is open source software but, unlike R, it is maintained by a private entity that also distributes a commercial version of RStudio for businesses or individuals needing customer support.

1.3.3 Data manipulation

Before one can begin plotting data, one must have a data table in a form ready to be plotted. In cases where the data table consists of just two variables (columns), little data manipulation may be needed; but in cases where data tables consist of tens or scores of variables, data manipulation, subsetting and/or reshaping may be required. Tackling such a task can be challenging in a point-and-click spreadsheet environment and can introduce clerical error. R offers an array of data table manipulation tools and packages such as tidyr and dplyr. Furthermore, R's scripting environment enables one to read through each step of a manipulation procedure in a clear and unambiguous way. Imagine the difficulty of properly documenting all the point-and-click steps followed in a spreadsheet environment.

For example, a data table of grain production for North America may consist of six variables and 1501 rows.

There are many ways in which we may want to summarize the data table. We could, for example, want to compute the total Barley yield for Canada by year for the years 2005 through 2007. In R, this takes just a few lines of code: filter the rows for Canada and Barley, group by year, and summarise the yield (dplyr's filter(), group_by(), and summarise() verbs).

On the other hand, creating the same output in a spreadsheet environment would take a bit more effort, and its workflow would be less transparent.
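The R code block itself is not reproduced above; to illustrate the same idea of a short, transparent, scripted workflow, here is a pandas sketch in which the table and its column names (Country, Crop, Year, Yield) are made-up stand-ins:

    import pandas as pd

    # A tiny stand-in for the FAO grain table (all values are made up).
    grain = pd.DataFrame({
        "Country": ["Canada", "Canada", "Canada", "United States"],
        "Crop":    ["Barley", "Barley", "Oats",   "Barley"],
        "Year":    [2005,     2006,     2005,     2005],
        "Yield":   [2690,     2800,     2400,     3100],
    })

    # Total Barley yield for Canada, by year, for 2005 through 2007.
    total = (grain[(grain["Country"] == "Canada") &
                   (grain["Crop"] == "Barley") &
                   (grain["Year"].between(2005, 2007))]
             .groupby("Year")["Yield"].sum())
    print(total)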

1.3.4 Reproducible analysis

Data table manipulation is inevitable in any data analysis workflow and, as discussed in the last section, can be prone to clerical errors if performed in a point-and-click environment. Furthermore, reproducing a workflow in a spreadsheet environment can be difficult unless each click and each copy-and-paste operation is meticulously documented. And even if the documentation is adequate, there is no way of knowing whether the analyst followed those exact procedures (unless his mouse and keyboard moves were recorded). However, with a scripting environment, each step of a workflow is clearly and unambiguously laid out, as demonstrated with the FAO grain data above. This leads to another basic tenet of the scientific method: reproducibility of the workflow.

Reproducible research lends credence to scientific work. The need for reproducibility is not limited to data collection or methodology but includes the actual analytical workflow that generated the results including data table output and statistical tests.

Data analysis can be complex. Each data manipulation step that requires human interaction is prone to clerical error. But error can also manifest itself in faulty implementation of an analytical procedure—both technical and theoretical. Unfortunately, workflows are seldom available in technical reports or peer-reviewed publications where the intended audience is only left with the end product of the analysis.

–Keith A. Baggerly & Kevin R. Coombes [3]

Unfortunately, examples of irreproducible research are all too common. One such example was reported by the New York Times in an article titled How Bright Promise in Cancer Testing Fell Apart. In 2006, researchers at Duke published a paper in Nature Medicine on a breakthrough approach to fighting cancer. The authors' research suggested that genomic tests of a cancer cell's DNA could be used to target the most effective chemotherapy treatment. This was heralded as a major breakthrough in the fight against cancer. Unfortunately, the analysis presented by the authors was flawed. Two statisticians, Dr. Baggerly and Dr. Coombes, sought to replicate the work but discovered instead that the published work was riddled with problems, including mis-labeling of genes and confounding experimental designs. The original authors of the research did not make the analytical workflow available to the public, thus forcing the statisticians to scavenge for the original data and techniques. It wasn't until 5 years later, in 2011, that Nature decided to retract the paper because they were "unable to reproduce certain crucial experiments".

Many journals now require or strongly encourage authors to "make materials, data and associated protocols promptly available to readers without undue qualifications" (Nature, 2014). Sharing data files is not too difficult, but sharing the analytical workflow used to generate conclusions can prove to be difficult if the data were run through many different pieces of software and point-and-click procedures. An ideal analytical workflow should be scripted in a human-readable way from beginning (the moment the data file(s) is/are read) to the generation of the data tables or data figures used in the report or publication. This has two benefits: elimination of clerical errors (associated with poorly implemented point-and-click procedures) and the exposition of the analytical procedures adopted in the workflow.

1.4 Creating dynamic documents using R Markdown

Another source of error in the write-up of a report or publication is the linking of tables, figures and statistical summaries to the write-up. Typically, one saves statistical plots as image files and then loads the images into the document. However, the figures may have gone through many iterations, resulting in many different versions of the image file in a working folder. Add to this many other figures, data table files and statistical results from various pieces of software, and one quickly realizes the potential for embedding the wrong image files in the document or embedding the wrong statistical summaries in the text. Furthermore, the researcher is then required to properly archive and document the provenance of each figure, data table or statistical summary, resulting in a complex structure of files and directories in the project folder, thus increasing the odds of an irreproducible analysis.

Confining all of the analysis to a scripting environment such as R can help, but this still does not alleviate the possibility of loading the wrong figure into the document, or forgetting to update a statistical summary in the text when the original data file was revised. A solution to this potential pitfall is to embed the actual analysis and graphic generation process into the document; such environments are called dynamic documents. In this course, we will use the R Markdown authoring tool, which embeds R code into the document. An example of an R Markdown document is this course website, which was entirely generated in R Markdown! You can view the R Markdown files on this author's GitHub repository.

[1] Tukey, John W. Exploratory Data Analysis. Addison-Wesley, 1977.

[2] Gelman, A. and Unwin, A. Infovis and Statistical Graphics: Different Goals, Different Looks. Journal of Computational and Graphical Statistics, vol. 22, no. 1, 2013.

[3] Baggerly, Keith A. and Coombes, Kevin R. Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology. The Annals of Applied Statistics, vol. 3, no. 4, pp. 1309-1334, 2009.


Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.


The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning .

Specific statistical functions and techniques you can perform with EDA tools include:

  • Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you're looking at.
  • Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
  • K-means clustering, a clustering method in unsupervised learning where data points are assigned to K groups (the number of clusters) based on their distance from each group's centroid. The data points closest to a particular centroid are clustered under the same category. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression; see the sketch after this list.
  • Predictive models, such as linear regression, which use statistics and data to predict outcomes.
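A minimal K-means sketch using scikit-learn on synthetic data (all numbers are made up for illustration):

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic 2-D data: three loose blobs around three centers.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=center, scale=0.5, size=(50, 2))
                   for center in [(0, 0), (5, 5), (0, 5)]])

    # Assign each point to one of K = 3 groups by distance to the group centroid.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)  # one centroid per cluster
    print(km.labels_[:10])      # cluster assignments of the first ten points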

There are four primary types of EDA:

  • Univariate non-graphical. This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it's a single variable, it doesn't deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.
  • Univariate graphical. Non-graphical methods don't provide a full picture of the data, so graphical methods are also required. Common types of univariate graphics include:
  • Stem-and-leaf plots, which show all data values and the shape of the distribution.
  • Histograms, a bar plot in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
  • Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
  • Multivariate non-graphical: Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
  • Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.

Other common types of multivariate graphics include:

  • Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
  • Multivariate chart, which is a graphical representation of the relationships between factors and a response.
  • Run chart, which is a line graph of data plotted over time.
  • Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
  • Heat map, which is a graphical representation of data where values are depicted by color.
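A compact sketch of several of these plot types in matplotlib, using synthetic data (all values are made up):

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data for illustration.
    rng = np.random.default_rng(1)
    x = rng.normal(size=200)
    y = 2 * x + rng.normal(size=200)

    fig, axes = plt.subplots(2, 2, figsize=(8, 6))
    axes[0, 0].hist(x, bins=20)              # histogram: distribution of one variable
    axes[0, 0].set_title("Histogram")
    axes[0, 1].boxplot(x)                    # box plot: five-number summary
    axes[0, 1].set_title("Box plot")
    axes[1, 0].scatter(x, y, s=10)           # scatter plot: relationship between x and y
    axes[1, 0].set_title("Scatter plot")
    axes[1, 1].imshow(rng.random((10, 10)))  # heat map: values depicted by color
    axes[1, 1].set_title("Heat map")
    plt.tight_layout()
    plt.show()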

Some of the most common data science tools used to create an EDA include:

  • Python: An interpreted, object-oriented programming language with dynamic semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components. Python and EDA can be used together to identify missing values in a data set, which is important for deciding how to handle missing values for machine learning (see the sketch after this list).
  • R: An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science for developing statistical observations and data analysis.
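For example, the missing-value check mentioned in the Python bullet above is a one-liner in pandas; the tiny DataFrame below is synthetic:

    import numpy as np
    import pandas as pd

    # A small synthetic DataFrame with deliberately missing entries.
    df = pd.DataFrame({"age":  [25, np.nan, 40, 31],
                       "city": ["Boston", "DC", None, "Austin"]})

    print(df.isna().sum())         # count of missing values per column
    print(df.isna().mean() * 100)  # percentage of missing values per column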

For a deep dive into the differences between these approaches, check out "Python vs. R: What's the Difference?"



Assignment 1: Exploring Data

Due: 21 January, 11:00 pm

Weight: This assignment is worth 4% of your final grade.

Purpose: The purpose of this assignment is to develop some basic strategies for exploring data sets to gain a greater understanding of the variable types and their relationships.

Skills & Knowledge: After completing these exercises, you should be able to:

  • Apply a strategy for systematically exploring data.
  • Know the distinctions between variables, values, and observations.
  • Know the distinctions between nominal, ordinal, interval, and ratio data.
  • Select appropriate measures of centrality and variability for different data types.
  • Select appropriate data visualizations for different relationships among data variables.

Assessment: This assignment is graded for completion. Credit will be allocated in proportion to the percentage of the assignment completed by the due date. No more than 2 late days can be used on any one assignment.

Register: If you haven't already, register for:

  • DataCamp: you must use your @gwu.edu email for this to work (not the @email.gwu.edu address). An invite link can be found in the announcement on Blackboard or in Slack.
  • RStudio Cloud

Read: Open up a notebook (physical, digital…whatever you take notes in best), and take notes while you go through the readings for this week.

Exercise: Take notes while you complete the following exercises:

  • Complete Lesson 3: Numerical Summaries from the DataCamp course "Exploratory Data Analysis in R" (you don't need to complete the other lessons in the course). When finished, you should be able to see it completed in your DataCamp dashboard.
  • Complete the following RStudio Primer lesson: Exploratory Data Analysis

Report: When you have completed all of the above exercises, go to the "Assignment Submission" page on Blackboard and write three things from your notes that you learned while going through these readings and exercises.



    Exploratory Data Analysis. Assignment 1; by Anastasiia; Last updated almost 6 years ago; Hide Comments (-) Share Hide Toolbars