Tuesday 25 July 2017

Data Science & Machine Learning - 5 SF Salaries Kaggle

Hi friends,

Welcome to another post under Data Science & Machine Learning. In the previous post, we discussed how to read and write data from and to various sources such as CSV files, Excel files, etc. using Pandas DataFrames.

This post, however, is different from the earlier ones: we will not learn anything new, but will review the concepts covered so far using the SF Salaries Dataset available on the Kaggle website. Download the dataset from this Kaggle link. You will need to log in there to download it. Once downloaded, copy the CSV file into your Jupyter Notebook folder.

Note: All the commands discussed below are run in the Jupyter Notebook environment. See this post on Jupyter Notebook for details.

First, import the downloaded Salaries dataset using the read_csv function provided by the Pandas library:
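A minimal sketch of the import step. The file name Salaries.csv and the variable name sal are assumptions, and the two rows below are invented placeholders (not real records) so that the snippet runs on its own:

```python
import io
import pandas as pd

# Invented placeholder rows, NOT real records, so the snippet is self-contained.
# In the notebook you would pass the path of the downloaded file instead,
# e.g. pd.read_csv('Salaries.csv')  (the file name is an assumption).
sample_csv = io.StringIO(
    "Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,"
    "TotalPay,TotalPayBenefits,Year,Notes,Agency,Status\n"
    "1,PERSON A,POLICE CHIEF,100.0,10.0,5.0,,115.0,115.0,2011,,San Francisco,\n"
    "2,PERSON B,NURSE,80.0,20.0,0.0,,100.0,100.0,2011,,San Francisco,\n"
)

# read_csv accepts a file path or any file-like object and returns a DataFrame.
sal = pd.read_csv(sample_csv)
```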


Let's first see a few entries of the SF Salaries Dataset using the head method:
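As a rough illustration, head returns the first five rows by default, or the first n rows if a number is passed. The small DataFrame below is an invented stand-in for the real dataset, and the name sal is an assumption:

```python
import pandas as pd

# Invented placeholder rows standing in for the real Salaries data.
sal = pd.DataFrame({
    "EmployeeName": ["PERSON A", "PERSON B", "PERSON C",
                     "PERSON D", "PERSON E", "PERSON F"],
    "JobTitle": ["POLICE CHIEF", "NURSE", "NURSE",
                 "FIRE CHIEF", "CLERK", "CLERK"],
    "TotalPay": [110.0, 100.0, 90.0, 125.0, 60.0, 65.0],
})

first_rows = sal.head()    # first 5 rows by default
first_two = sal.head(2)    # first 2 rows only
```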


We can see that the dataset has the following columns:
  1. Id
  2. EmployeeName
  3. JobTitle
  4. BasePay
  5. OvertimePay
  6. OtherPay
  7. Benefits
  8. TotalPay
  9. TotalPayBenefits
  10. Year
  11. Notes
  12. Agency
  13. Status
We can find the total number of entries in the SF dataset using the info method:
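A small sketch of this step, again on invented placeholder rows (the name sal is an assumption). info prints a summary that includes the number of entries along with each column's non-null count and dtype; len gives the row count directly as a cross-check:

```python
import pandas as pd

# Invented placeholder rows standing in for the real Salaries data.
sal = pd.DataFrame({
    "EmployeeName": ["PERSON A", "PERSON B", "PERSON C"],
    "BasePay": [100.0, 80.0, None],   # None becomes NaN (a missing value)
    "Year": [2011, 2011, 2012],
})

# Prints the index range (number of entries), plus the non-null count
# and dtype of every column.
sal.info()

# The row count as a plain number.
n_entries = len(sal)
```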


Now, let's answer some relevant questions using the concepts we have learned so far:
  1. Unique Job Titles in the dataset:


  2. Top 10 most common Job Titles:


  3. Average BasePay of the dataset:


  4. Maximum amount of OvertimePay of the dataset:


  5. JobTitle of ALBERT PARDINI:


  6. TotalPayBenefits of ALBERT PARDINI:


  7. Individual with the highest TotalPayBenefits in the dataset:


  8. We can get the above result using the advanced argmax method as well:


  9. Average TotalPay year-wise:


  10. Number of individuals with Chief in their Job Title: This involves a lambda expression and might appear tricky at first sight, but I suggest breaking it down into sub-steps for a clear understanding.
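Questions 1 to 9 above can be sketched as follows. This is one possible way to answer them, not necessarily the exact commands used originally. The rows are invented placeholders (the real dataset would be loaded with read_csv), PERSON A stands in for the employee being looked up, and the variable names are assumptions. Note that in current pandas versions argmax returns an integer position (used with iloc), while idxmax returns the index label (used with loc):

```python
import pandas as pd

# Invented placeholder rows standing in for the real Salaries data.
sal = pd.DataFrame({
    "EmployeeName": ["PERSON A", "PERSON B", "PERSON C", "PERSON D"],
    "JobTitle": ["POLICE CHIEF", "NURSE", "NURSE", "FIRE CHIEF"],
    "BasePay": [100.0, 80.0, 90.0, 120.0],
    "OvertimePay": [10.0, 20.0, 0.0, 5.0],
    "TotalPay": [110.0, 100.0, 90.0, 125.0],
    "TotalPayBenefits": [130.0, 115.0, 100.0, 150.0],
    "Year": [2011, 2011, 2012, 2012],
})

# 1. Number of unique job titles.
n_unique_titles = sal["JobTitle"].nunique()

# 2. Ten most common job titles (value_counts sorts by frequency, descending).
top_titles = sal["JobTitle"].value_counts().head(10)

# 3. Average BasePay.
mean_base_pay = sal["BasePay"].mean()

# 4. Maximum OvertimePay.
max_overtime = sal["OvertimePay"].max()

# 5 and 6. Look up one employee by name with a boolean mask.
name = "PERSON A"  # placeholder for the name being looked up
job_title = sal[sal["EmployeeName"] == name]["JobTitle"]
total_pay_benefits = sal[sal["EmployeeName"] == name]["TotalPayBenefits"]

# 7. Row(s) whose TotalPayBenefits equals the column maximum.
top_earner = sal[sal["TotalPayBenefits"] == sal["TotalPayBenefits"].max()]

# 8. Same row via argmax (integer position of the maximum).
top_earner_alt = sal.iloc[sal["TotalPayBenefits"].argmax()]

# 9. Average TotalPay per year.
mean_total_pay_by_year = sal.groupby("Year")["TotalPay"].mean()
```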

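The Chief count from question 10 can be broken into sub-steps roughly like this. The job titles are invented placeholders, the helper name has_chief is hypothetical, and matching the word case-insensitively is an assumption made for this sketch:

```python
import pandas as pd

# Invented placeholder job titles standing in for the real column.
sal = pd.DataFrame({
    "JobTitle": ["POLICE CHIEF", "Chief of Staff", "NURSE",
                 "Deputy chief", "CLERK"],
})

def has_chief(title):
    """Return True if 'chief' appears in the title, ignoring case."""
    return "chief" in title.lower()

# Step 1: apply the check to every title, giving a Series of True/False.
is_chief = sal["JobTitle"].apply(has_chief)

# Step 2: True counts as 1, so the sum is the number of matching rows.
n_chief = is_chief.sum()

# The same two steps written in one line with a lambda expression.
n_chief_lambda = sal["JobTitle"].apply(lambda t: "chief" in t.lower()).sum()
```

Writing the named function first and only then collapsing it into a lambda makes it easier to check each step in the notebook.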

It is always advisable to explore various datasets from Kaggle or other websites, since Data Science is not just about reading theory but about applying those concepts to datasets and gaining insights from them. From the next post onward, we'll start learning about another very important aspect of Data Science, i.e. Data Visualization.