WeRateDogs Data Wrangling

Wrangling Report

Gather

Assess

Clean

Gather

Three sources of data were gathered to complete this project:

Twitter enhanced archive file: a CSV file downloaded from the Udacity course site, containing various information about the tweets of the WeRateDogs account
Tweet image predictions file: a TSV file requested from the Udacity site, containing predictions of the objects shown in images attached to WeRateDogs tweets
Tweet details file: a JSON file downloaded through the Twitter API, containing information missing from the enhanced archive file, namely retweet and like counts

Each source was gathered using a different method:

The CSV was read from a local file and stored in df_archive
The TSV was downloaded using the requests library, written to a file, and loaded into df_preds
The JSON was gathered using the Twitter API, stored as a txt file, loaded using the json library, and finally saved as df_likes
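The loading steps for the TSV and JSON sources can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the txt file holds one tweet JSON object per line and that the counts live in fields named `id`, `retweet_count`, and `favorite_count`. The helper `load_tweet_counts`, the column names, and the in-memory sample data are all hypothetical stand-ins.

```python
import io
import json

import pandas as pd


def load_tweet_counts(lines):
    """Build a data frame of per-tweet counts from line-delimited JSON.

    Assumes one JSON object per line with `id`, `retweet_count`,
    and `favorite_count` keys (an assumption, not confirmed by the report).
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    return pd.DataFrame(
        records, columns=["tweet_id", "retweet_count", "favorite_count"]
    )


# In-memory stand-in for the saved txt file.
sample_json_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
df_likes = load_tweet_counts(sample_json_file)

# A TSV file is read like a CSV with a tab separator
# (in-memory stand-in for the downloaded predictions file).
sample_tsv_file = io.StringIO("tweet_id\tp1\tp1_dog\n1\tpug\tTrue\n")
df_preds = pd.read_csv(sample_tsv_file, sep="\t")
```

With real files, the same calls would take an open file handle or a path instead of the `io.StringIO` objects.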

Assess

We assessed each of the three files separately to understand its structure and to find quality and tidiness issues, of which there were quite a few.
We assessed the data both visually (looking at rows with the sample and head methods) and programmatically (for example with the describe, info, value_counts, nunique, and mean methods).
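As a rough illustration of these two kinds of assessment, the sketch below runs the named methods on a small made-up data frame. The rows are invented for the example and are not taken from the archive.

```python
import pandas as pd

# Made-up rows that mimic the kind of problems the assessment looks for.
df = pd.DataFrame(
    {
        "name": ["Charlie", "a", "None", "Lucy", "a"],
        "rating_numerator": [12, 1776, 5, 13, 10],
        "rating_denominator": [10, 10, 10, 10, 0],
    }
)

# Visual assessment: look at a few rows directly.
print(df.head())
print(df.sample(3, random_state=0))

# Programmatic assessment: structure, summary statistics, value frequencies.
df.info()
print(df.describe())
name_counts = df["name"].value_counts()
n_unique_names = df["name"].nunique()
mean_numerator = df["rating_numerator"].mean()
```

Repeated or implausible values in `value_counts` and extreme maxima in `describe` are the kind of signals that point to extraction problems.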

The issues found are summarized below. Most of them come from df_archive, because it is the main file with the most information, and many of its columns were extracted programmatically from the text column, an inherently imperfect process.

Quality issues

Completeness:

No completeness issues were identified.

Validity:

df_archive: we should exclude retweets
df_archive: we should exclude tweets without images
df_preds: we only want ratings of dogs, i.e. where px_dog is True
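One way these three validity filters could look in pandas is sketched below. It rests on assumptions the report does not state: that retweets are marked by a non-null `retweeted_status_id`, that a tweet has an image exactly when it appears in df_preds, and that `p1_dog` is one concrete instance of the px_dog columns. The sample rows are made up.

```python
import pandas as pd

# Made-up sample data; column names other than tweet_id are assumptions.
df_archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 4],
        "retweeted_status_id": [None, 99.0, None, None],
    }
)
df_preds = pd.DataFrame(
    {
        "tweet_id": [1, 2, 4],
        "p1": ["pug", "pug", "paper_towel"],
        "p1_dog": [True, True, False],
    }
)

# 1. Keep original tweets only (assumed: retweets have a retweeted_status_id).
originals = df_archive[df_archive["retweeted_status_id"].isna()]

# 2. Keep tweets with an image (assumed: those with a row in df_preds).
with_image = originals[originals["tweet_id"].isin(df_preds["tweet_id"])]

# 3. Keep tweets whose top prediction is a dog.
dog_ids = df_preds.loc[df_preds["p1_dog"], "tweet_id"]
valid = with_image[with_image["tweet_id"].isin(dog_ids)]
```

In this sample only tweet 1 survives: tweet 2 is a retweet, tweet 3 has no image, and tweet 4 is not predicted to be a dog.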

Accuracy:

df_archive: ratings above 10/10 are acceptable, but some extreme or decimal rating_numerator values were extracted incorrectly from the text
df_archive: ratings can have various denominators, but some rating_denominator values do not match the text
df_archive: some values in the name and dog stage columns are incorrect
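The decimal problem can be illustrated with a small regular-expression sketch. The pattern, the helper `extract_rating`, and the sample text are hypothetical; the report does not say how the original extraction was done, so this only shows why an integer-only pattern would mangle a decimal rating.

```python
import re

# Integer-only pattern: picks up only the digits right before the slash.
NAIVE_RE = re.compile(r"(\d+)/(\d+)")

# Pattern that also accepts an optional decimal part in the numerator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) as floats, or None if no rating is found."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


sample_text = "This is Bella. 13.5/10 would pet"  # made-up tweet text
naive_result = NAIVE_RE.search(sample_text).groups()
fixed_result = extract_rating(sample_text)
```

Here the integer-only pattern yields a numerator of 5 instead of 13.5, which is the kind of wrongly extracted value described above. Texts with several fractions (dates, for example) would still need manual review.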

Consistency:

df_archive: incorrect data types: timestamp should be datetime, and the rating columns should be floats to allow for decimals
df_preds: some predicted breed names start with an uppercase letter, others with a lowercase letter
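A minimal sketch of how both consistency issues might be fixed in pandas follows. The sample values are made up, and `p1` is assumed to be the name of a prediction column in df_preds.

```python
import pandas as pd

# Made-up sample rows.
df_archive = pd.DataFrame(
    {
        "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
        "rating_numerator": [13, 12],
        "rating_denominator": [10, 10],
    }
)
df_preds = pd.DataFrame({"p1": ["Chihuahua", "golden_retriever"]})

# Parse the timestamp strings into timezone-aware datetimes.
df_archive["timestamp"] = pd.to_datetime(df_archive["timestamp"], utc=True)

# Store ratings as floats so decimal ratings can be represented.
df_archive = df_archive.astype(
    {"rating_numerator": float, "rating_denominator": float}
)

# Normalize predicted names to lowercase.
df_preds["p1"] = df_preds["p1"].str.lower()
```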

Tidiness issues

df_archive: columns doggo, floofer, pupper, puppo are actually values of a single column dog_stage
df_archive: rating would make more sense in its own column combining the numerator and denominator columns
df_preds: the columns relevant for analysis can be merged into the main df_archive table; they do not need a separate table
df_preds: some columns will not be interesting for the analysis and can be deleted
df_likes: this information belongs in the main archive table; there is no reason to keep it in a separate data frame
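The tidiness fixes listed above could be sketched roughly as below. Several details are assumptions made for the illustration: a missing stage is stored as the string "None", the combined rating is taken to be the numerator divided by the denominator, and all three tables share a `tweet_id` key. The helper `combine_stages` and the sample rows are hypothetical.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Made-up sample tables.
df_archive = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3],
        "rating_numerator": [13.0, 12.0, 84.0],
        "rating_denominator": [10.0, 10.0, 70.0],
        "doggo": ["None", "doggo", "None"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "pupper", "None"],
        "puppo": ["puppo", "None", "None"],
    }
)
df_preds = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "chihuahua"]})
df_likes = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweet_count": [10, 3, 8], "favorite_count": [50, 7, 20]}
)


def combine_stages(row):
    """Join all stages present in a row; return None when there are none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


# Four stage columns become one dog_stage column.
df_archive["dog_stage"] = df_archive[stage_cols].apply(combine_stages, axis=1)
df_archive = df_archive.drop(columns=stage_cols)

# One rating column (assumed here to be the ratio).
df_archive["rating"] = (
    df_archive["rating_numerator"] / df_archive["rating_denominator"]
)

# One table: inner joins keep only tweets present in all three sources.
df_master = df_archive.merge(df_preds, on="tweet_id", how="inner").merge(
    df_likes, on="tweet_id", how="inner"
)
```

Joining multiple stages with a comma keeps tweets that mention more than one stage visible, so they can be reviewed rather than silently overwritten.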

Clean

We addressed the following 12 issues during cleaning:

tidiness: merging all files together, creating a dog stage column, creating a rating column
quality: deleting retweets, deleting tweets without an image, deleting tweets with an image not of a dog, cleaning dog stage column, cleaning rating columns, changing datatypes, deleting columns, cleaning name column, making all predicted breed names lowercase

Each issue was handled separately using the define-code-test workflow.
Cleaning finished by storing the final merged data set as twitter_archive_master.csv.
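To show the shape of one define-code-test cycle, here is a minimal sketch applied to the final storage step. The data frame contents are made up, and a temporary directory is used so the example does not overwrite a real file; only the file name comes from the report.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the merged data frame.
df_master = pd.DataFrame(
    {
        "tweet_id": [1, 2],
        "rating": [1.3, 1.2],
        "p1": ["pug", "chihuahua"],
    }
)

# Define: store the merged data frame as twitter_archive_master.csv,
# without writing the index as an extra column.

# Code
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "twitter_archive_master.csv")
df_master.to_csv(out_path, index=False)

# Test: read the file back and confirm nothing was lost or added.
df_check = pd.read_csv(out_path)
assert df_check.shape == df_master.shape
assert list(df_check.columns) == list(df_master.columns)
```

Each of the 12 issues would follow the same pattern: a one-line definition of the fix, the code that applies it, and a check that confirms the result.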