WeRateDogs Data Wrangling

Wrangling Report

Gather

Three sources of data were gathered to complete this project:

  1. twitter enhanced archive file - csv downloaded from udacity course site, which includes various information about tweets of WeRateDogs account
  2. tweet image predictions file - tsv requested from udacity site, which includes predictions of objects on images included in WeRateDogs tweets
  3. tweet details file - json file downloaded from twitter using API, which includes information missing from the enhanced archive file, namely retweets and likes counts

Data was gathered using different methods:

  1. csv was read from a file and stored in df_archive
  2. tsv was downloaded using the requests library, written to a file and stored in df_preds
  3. json was gathered using twitter API, stored as txt file, loaded using json library and finally saved as df_likes

Assess

We assessed each of the three files separately to look at their structure and to find quality and tidiness issues, of which we found quite a few.
We used both visual (namely looking at the data using sample and head methods) and programmatic methods (for example with describe, info, value_counts, nunique, mean methods) to assess the data.

The summary of found issues is below. Most of the identified issues come from df_archive because that is the main file with the most information in which a lot of inherently imperfect programmatic work was done to extract data from the text column into many new columns.

Quality issues

Completeness:

  • -

Validity:

  • df_archive: we should exclude REtweets
  • df_archive: we should exclude tweets without images
  • df_preds: we only want ratings of dogs, i.e. where px_dog is True

Accuracy:

  • df_archive: ratings of more than 10/10 are acceptable, but there are some really extreme or decimal values of rating_numerator, which are wrongly extracted from the text
  • df_archive: ratings can have various bases, but some rating_denominator values are inaccurate based on the text
  • df_archive: some names and stages have incorrect values

Consistency:

  • df_archive: incorrect data types - timestamp should be datetime, rating columns should be floats to allow for decimals
  • df_preds: some dog names start with an uppercase, some with a lowercase letter

Tidiness issues

  • df_archive: columns doggo, floofer, pupper, puppo are actually values of a single column dog_stage
  • df_archive: rating would make more sense in its own column combining the numerator and denominator columns
  • df_preds: the columns relevant for analysis can be merged with the main df_archive file, they do not have to be in a separate table
  • df_preds: some columns will not be interesting for the analysis and can be deleted
  • df_likes: information should be included in the main archive table, there is no reason to have it in a separate data frame

Clean

We cleaned the following 12 issues in the cleaning process:

  • tidiness: merging all files together, creating a dog stage column, creating a rating column
  • quality: deleting retweets, deleting tweets without an image, deleting tweets with an image not of a dog, cleaning dog stage column, cleaning rating columns, changing datatypes, deleting columns, cleaning name column, making all predicted breed names lowercase

Each issue was handled separately using the define-code-test workflow.
The cleaning efforts finished with storing the final merged data set as twitter_archive_master.csv.