Three sources of data were gathered to complete this project, each using a different method:
df_archive
df_preds
df_likes
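The three gathering paths might be sketched as follows. This is a hedged illustration, not the project's actual code: the file paths, the URL parameter, and the JSON field names (`id`, `favorite_count`, `retweet_count`) are assumptions.

```python
import io
import json

import pandas as pd

# 1. A local CSV shipped with the project is read directly.
def load_archive(path):
    return pd.read_csv(path)

# 2. A hosted flat file is downloaded once, then read as TSV.
def load_predictions(url):
    import requests  # network dependency, only needed for this path
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.BytesIO(response.content), sep="\t")

# 3. Data collected from an API is often stored as one JSON object per line;
#    keep only the fields of interest while reading.
def load_likes(path):
    records = []
    with open(path) as f:
        for line in f:
            tweet = json.loads(line)
            records.append({"tweet_id": tweet["id"],
                            "favorite_count": tweet["favorite_count"],
                            "retweet_count": tweet["retweet_count"]})
    return pd.DataFrame(records)
```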
We assessed each of the three files separately to examine its structure and to identify quality and tidiness issues, of which we found quite a few. We used both visual assessment (inspecting the data with the sample and head methods) and programmatic assessment (for example with the describe, info, value_counts, nunique, and mean methods). The identified issues are summarized below. Most of them come from df_archive: it is the main file with the most information, and a lot of inherently imperfect programmatic work was done there to extract data from the text column into many new columns.
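The two assessment styles could look like the sketch below on a toy frame. The column names and planted values are purely illustrative, not the project's actual schema.

```python
import pandas as pd

# Toy frame with a few planted problems to surface during assessment.
df = pd.DataFrame({
    "tweet_id": [101, 102, 103, 103],    # note the duplicated id
    "name": ["Luna", "None", "a", "a"],  # "None"/"a" look like extraction artifacts
    "rating": [13, 12, 1776, 1776],
})

# Visual assessment: eyeball a handful of rows.
print(df.head())
print(df.sample(2, random_state=0))

# Programmatic assessment: summaries that surface issues at scale.
df.info()                                # dtypes and non-null counts
print(df.describe())                     # 1776 stands out immediately
print(df["name"].value_counts())         # lowercase "a" is not a real name
print(df["tweet_id"].nunique(), "unique ids in", len(df), "rows")
```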
Completeness:
Validity:
Accuracy:
Consistency:
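Each of the four dimensions maps naturally onto a programmatic check. A minimal sketch on a toy frame, with one planted problem per dimension (column names and rules are placeholders, not the project's):

```python
import pandas as pd

df = pd.DataFrame({
    "tweet_id": ["1", "2", "2"],
    "timestamp": ["2017-01-01", "2017-01-02", None],
    "rating": [13, -1, 12],
})

# Completeness: count missing values per column.
missing = df.isnull().sum()

# Validity: flag values outside the allowed domain (ratings are non-negative here).
invalid = df[df["rating"] < 0]

# Accuracy: duplicated ids suggest the same record was captured twice.
dupes = df[df.duplicated("tweet_id")]

# Consistency: ids stored as strings in one frame but as integers in another
# would block a clean merge, so check dtypes explicitly.
print(df.dtypes)
```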
We addressed the following 12 issues during cleaning:
Each issue was handled separately using the define-code-test workflow.
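For one hypothetical issue, say a timestamp stored as a string, the define-code-test loop might look like this:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2017-08-01 16:23:56", "2017-07-31 00:18:03"]})

# Define: timestamp is stored as a string and should be a datetime.

# Code: apply the fix.
df["timestamp"] = pd.to_datetime(df["timestamp"])

# Test: verify the fix programmatically before moving to the next issue.
assert pd.api.types.is_datetime64_any_dtype(df["timestamp"])
```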
The cleaning effort concluded with storing the final merged data set as twitter_archive_master.csv.
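Producing the master file could be sketched as below. The shared join key tweet_id is an assumption based on the data's Twitter origin, and the tiny frames stand in for the cleaned versions of the three files.

```python
import pandas as pd

# Stand-ins for the three cleaned frames; columns are illustrative.
df_archive = pd.DataFrame({"tweet_id": [1, 2], "text": ["good dog", "12/10"]})
df_preds = pd.DataFrame({"tweet_id": [1, 2], "breed": ["pug", "corgi"]})
df_likes = pd.DataFrame({"tweet_id": [1, 2], "favorite_count": [100, 250]})

# Inner-join the three frames on the shared key...
master = (df_archive
          .merge(df_preds, on="tweet_id", how="inner")
          .merge(df_likes, on="tweet_id", how="inner"))

# ...and persist the result without the pandas index.
master.to_csv("twitter_archive_master.csv", index=False)
```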