Saturday, 22 July 2017

Data Science & Machine Learning - 4.3.1 Pandas DataFrames

Hi friends,

Welcome to the post on Pandas DataFrames under Data Science & Machine Learning. In the previous post, we discussed the Series data structure supported by Pandas. In this and the next couple of posts, we'll learn about Pandas DataFrames.

Note: All the commands discussed below are run in the Jupyter Notebook environment. See this post on Jupyter Notebook to know about it in detail.

Pandas DataFrames

A Pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. It is like the Microsoft Excel of Python and provides tons of features for efficient data analysis.

Before working with Pandas DataFrames, remember to import the NumPy and Pandas libraries.

We will create our first DataFrame with random values, so we also import the randn function from NumPy's random module to generate random numbers.
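Here is a minimal sketch of those imports. The aliases np and pd are only the usual convention (an assumption here, not a requirement of either library):

```python
import numpy as np
import pandas as pd
from numpy.random import randn

# randn(rows, cols) returns an array of samples drawn from the standard normal distribution
sample = randn(4, 5)
print(sample.shape)  # (4, 5)
```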


Now, let's begin using Pandas DataFrames for better understanding. The three important parameters that the DataFrame constructor takes are:
  1. data - The data values of the DataFrame
  2. index - The row labels for the DataFrame
  3. columns - The column labels for the DataFrame

So, let's first declare the index and the columns as two lists and then fill the DataFrame with random values. In the example below, I have created a DataFrame of dimensions 4x5 (4 rows and 5 columns).
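A sketch of such a 4x5 DataFrame could look like the following. The row labels, column labels, variable names and the seed are all assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)  # arbitrary seed, only so that reruns give the same numbers

row_labels = ['A', 'B', 'C', 'D']        # hypothetical row labels (index)
col_labels = ['V', 'W', 'X', 'Y', 'Z']   # hypothetical column labels

# 4 rows and 5 columns of random values
df = pd.DataFrame(data=randn(4, 5), index=row_labels, columns=col_labels)
print(df)
```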


We can find the size of the DataFrame using the shape attribute of the DataFrame:
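For instance, with a DataFrame built as a 4x5 grid (the labels here are assumed for illustration), shape gives back the number of rows and columns as a tuple:

```python
import pandas as pd
from numpy.random import randn

# hypothetical labels; any 4 row labels and 5 column labels would do
df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

# shape is an attribute, not a method, so there are no parentheses
print(df.shape)  # (4, 5)
```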



Note: Each of the columns in the DataFrame is a Series in itself. We can verify that by printing the type of one of the columns:
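A quick check along these lines, assuming a DataFrame with a column named W (a hypothetical label):

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

# selecting a single column with [] gives back a Series
column = df['W']
print(type(column))
```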


You can also use attribute notation (the . operator) to select a column:


However, the dot (.) notation is usually avoided for selecting columns, since it is also used for calling methods on the DataFrame object and a column name can clash with a method name.
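The sketch below shows both sides of this, using assumed column labels. The column named count is added only to illustrate a clash with the existing DataFrame method count():

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

# dot notation and bracket notation return the same column
same = df.W.equals(df['W'])
print(same)  # True

# a column whose name matches a DataFrame method is only reachable with brackets
df['count'] = 1
print(callable(df.count))  # True: df.count is still the method, not the column
print(df['count'])
```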

We can also access multiple columns of the DataFrame by passing the list of columns inside the [] brackets:
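For example, with the assumed column labels W and Z, passing a list gives back a smaller DataFrame rather than a Series:

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

# note the double brackets: the inner pair is the Python list of column names
two_columns = df[['W', 'Z']]
print(two_columns)
```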


We can even add a column to the DataFrame using the following step:
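One way to do this is to assign to a column name that does not exist yet. In this sketch the column is called new and is computed as the sum of two assumed columns, W and Y:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

# assigning to a new label creates the column
df['new'] = df['W'] + df['Y']
print(df)
```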


We can also drop a column from the DataFrame:


The additional parameter axis specifies whether we want to delete a row or a column. The value 0 for axis represents the rows and the value 1 represents the columns. As we want to delete the column named new, we set axis to 1.

However, this does not drop the column permanently from the DataFrame. You can verify this by printing the DataFrame again, which still shows the dropped column:
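A small sketch of this behaviour, assuming a DataFrame that has a column named new (the other labels are hypothetical):

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])
df['new'] = df['W'] + df['Y']

# drop() returns a new DataFrame without the column; axis=1 means "columns"
without_new = df.drop('new', axis=1)
print(without_new.columns.tolist())

# the original DataFrame still has the column
print(df.columns.tolist())
```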



Pandas does this to avoid any accidental deletion of data: drop() returns a new DataFrame and leaves the original untouched. Thus, to delete the column from the original DataFrame, we need to set the inplace parameter of the drop method to True:
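Continuing with the same assumed labels, the in-place variant could be sketched like this:

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])
df['new'] = df['W'] + df['Y']

# with inplace=True the DataFrame itself is modified and drop() returns None
result = df.drop('new', axis=1, inplace=True)
print(result)               # None
print(df.columns.tolist())  # 'new' is gone
```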



Notice above that the new column has been permanently deleted from the DataFrame.

We now see how to access the rows of a DataFrame. Selecting a row works a little differently from selecting a column. Pandas provides the loc[] indexer to access rows by their labels and the iloc[] indexer to access rows by their integer position. Note that both use square brackets, not parentheses. Here is the usage of each of them:
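A minimal sketch of both, assuming the row labels A to D. The first row is fetched once by its label and once by its position:

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

row_by_label = df.loc['A']      # label-based lookup
row_by_position = df.iloc[0]    # position-based lookup (0 is the first row)

# each row comes back as a Series indexed by the column labels
print(row_by_label)
print(row_by_label.equals(row_by_position))  # True
```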


We can also select part of the DataFrame using the NumPy notation of accessing the 2-D arrays.
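As a sketch of that row, column style of selection (labels assumed for illustration), loc takes labels while iloc takes positions:

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(4, 5),
                  index=['A', 'B', 'C', 'D'],
                  columns=['V', 'W', 'X', 'Y', 'Z'])

# a single cell: row 'B', column 'Y'
value = df.loc['B', 'Y']

# a sub-DataFrame: rows A and B, columns W and Y
subset = df.loc[['A', 'B'], ['W', 'Y']]

# the same idea with integer positions: first two rows, second and third columns
block = df.iloc[0:2, 1:3]
print(subset)
print(block)
```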


We end this post on the basics of Pandas DataFrames here. From the next post, we dive into the more detailed things we can do with Pandas DataFrames.

Data Science & Machine Learning - 4.2 Pandas Series

Hi friends,

In the previous post under Data Science & Machine Learning, we learnt how to install the Pandas library for our Python environment. From this post onward, we will start learning about the Pandas library, which is one of the most important libraries for Data Science. In this post, we will learn about the Series data structure supported by the Pandas library.

Note: All the commands discussed are run in the Jupyter Notebook environment. See this post on Jupyter Notebook to know about it in detail.

Pandas Series

So, let's begin to learn about Pandas Series. The first thing we need to do before we can start using any Python library is to import the library. So, let's import the Pandas library using the following command:
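The import would typically look like this. The alias pd is the usual convention (an assumption here), and printing the version is just an optional sanity check:

```python
import pandas as pd

# optional: confirm which version of Pandas was picked up
print(pd.__version__)
```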


You will see an error if Pandas was not installed successfully; otherwise, the import completes without any message.

As a Pandas Series is built on top of NumPy Arrays, we need to import the NumPy library into our script as well.

You might find a Pandas Series similar in usage to a Python dictionary, just as a NumPy Array is similar to a Python list. However, Pandas Series and NumPy Arrays are designed specifically for data science tasks and support many additional features which lists and dictionaries lack.

If you type pd.Series (in your Jupyter Notebook cell) and press Shift + Tab, you can see the wide variety of parameters that the Series constructor takes, but the two most important ones are data and index. These are like the values and the keys in Python dictionaries.


So, let's declare our first Series using Pandas:


Notice that I have first created two Python lists that act as the keys and the data for the Series. Also, notice that the datatype (dtype) of the data part of the Series is reported as int64.
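A sketch of such a declaration follows. The name myLabels comes from this post, while myData, mySeries and the actual values are assumptions made for illustration:

```python
import pandas as pd

myLabels = ['x', 'y', 'z']   # acts as the keys (index)
myData = [10, 20, 30]        # acts as the values (data); hypothetical numbers

mySeries = pd.Series(data=myData, index=myLabels)
print(mySeries)  # the last line of the output reports the integer dtype
```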

We can pass the NumPy arrays as well in place of the data part and it will generate the same result:
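For instance, wrapping the same (assumed) values in a NumPy array first gives a Series with the same labels and values:

```python
import numpy as np
import pandas as pd

myLabels = ['x', 'y', 'z']
myArray = np.array([10, 20, 30])   # hypothetical values, now as a NumPy array

seriesFromArray = pd.Series(data=myArray, index=myLabels)
print(seriesFromArray)
```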


We can even pass myLabels (a list of strings) as the data part and it works just as well, which shows the flexibility of the Series data structure: it can take a variety of types as data points.
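A small sketch of that, with assumed label values. Since no index is passed, Pandas falls back to the default integer index 0, 1, 2:

```python
import pandas as pd

myLabels = ['x', 'y', 'z']

# the strings are now the data; the index defaults to 0, 1, 2
labelSeries = pd.Series(data=myLabels)
print(labelSeries)
```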


Accessing elements in Series

Accessing an element in a Pandas Series is just like working with dictionaries. So, if we want to access the data corresponding to the index 'y' in the first series we declared, we do the following:
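Assuming a Series whose index contains the label 'y' (the values here are hypothetical), the lookup reads just like a dictionary access:

```python
import pandas as pd

mySeries = pd.Series(data=[10, 20, 30], index=['x', 'y', 'z'])

# look up the value stored under the label 'y'
value = mySeries['y']
print(value)  # 20
```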


Now that we have learnt how to declare and use a Pandas Series, we will start working with Pandas DataFrames, another very important data structure supported by Pandas, from the next post.

Friday, 21 July 2017

283. Move Zeroes

Given an array nums, write a function to move all 0's to the end of it while maintaining the relative order of the non-zero elements.
For example, given nums = [0, 1, 0, 3, 12], after calling your function, nums should be [1, 3, 12, 0, 0].
Note:
  1. You must do this in-place without making a copy of the array.
  2. Minimize the total number of operations.

Approach: Bring forward the non-zero values using a for loop and then fill the zeros at the end. 

#include <vector>
using std::vector;

class Solution {
public:
    void moveZeroes(vector<int>& nums) {
        int j = 0, n = nums.size();
        //move the non-zero elements to the front
        for(int i = 0; i < n; i++){
            if(nums[i] != 0){
                nums[j++] = nums[i];
            }
        }
        //fill the remaining right indices with zeros
        for(int i = j; i < n; i++){
            nums[i] = 0;
        }
    }
};

Here is the link to the ideone solution: http://ideone.com/O0MRQa

Data Science & Machine Learning - 4.1 Pandas & Its Installation

Hello friends,

Welcome to yet another post on Data Science & Machine Learning. In the previous few posts, we learnt about the NumPy library for Data Science, which is mostly used for its NumPy Arrays. From this post onward, we will learn about Pandas, one of the most important Data Science libraries in Python.

About Pandas

Pandas is an open source library that provides the following:
  1. Fast data analysis
  2. High performance
  3. Built-in data visualization features
  4. Support for a variety of data sources such as Excel files, CSV files, SQL databases and so on.

Pandas Installation:

Just like the installation of NumPy, it is recommended to use the Anaconda distribution of Python in order to install Pandas as well. You can see the installation of Anaconda distribution of Python here. Once you have that installed, you can install Pandas by running the following command in the command prompt:

conda install pandas

You can still install Pandas without the Anaconda distribution of Python (though this route is not recommended) using the following command:

pip install pandas

We will learn about the following topics under the Pandas library:
  • Series
  • DataFrames
  • Filling missing data
  • Various data input & output techniques
  • Operations on series and data frames
Now that we have installed Pandas successfully on our systems, we will start using the Pandas library starting with Series from the next post.

Data Science & Machine Learning - 3.5 NumPy Array Methods

Hi friends,

Welcome to another NumPy tutorial under Data Science & Machine Learning. In the previous post, we discussed several ways to index NumPy Arrays and Conditional Selection in NumPy Arrays. In this post, we will learn several operations that we can perform on NumPy Arrays. We will also see various methods supported by the NumPy library to deal with NumPy Arrays.

Note: All the commands discussed below are run in the Jupyter Notebook environment. See this post on Jupyter Notebook to know about it in detail.


NumPy Array Methods

So, let's first declare a NumPy Array using the NumPy's arange() function:
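A sketch of such a declaration is shown here. The range 0 to 10 and the name arr are assumptions; starting at 0 makes the division and logarithm examples in this post meaningful:

```python
import numpy as np

# arange(start, stop) excludes the stop value, so this gives the integers 0 through 10
arr = np.arange(0, 11)
print(arr)
```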


  1. We can add/subtract a scalar to the NumPy Array:


  2. We can multiply/divide the NumPy Array by a scalar:
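The two points above can be sketched as follows, assuming an array holding the integers 0 to 10. The scalar is applied to every element:

```python
import numpy as np

arr = np.arange(0, 11)   # assumed example array

added = arr + 5          # adds 5 to every element
subtracted = arr - 5     # subtracts 5 from every element
doubled = arr * 2        # multiplies every element by 2
halved = arr / 2         # divides every element by 2 (the result is floating point)

print(added)
print(halved)
```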


We can perform the same operations among the NumPy Arrays as well.
  1. Addition/Subtraction between two NumPy Arrays:


  2. Multiplication/Division between two NumPy arrays:



Notice the warning generated by the division-by-zero operation. However, the division still completes and the output is inf (for 1/0).
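The array-with-array operations can be sketched like this, again assuming an array of the integers 0 to 10. Because the first element is 0, the two divisions print RuntimeWarnings but still return a result:

```python
import numpy as np

arr = np.arange(0, 11)   # assumed example array; its first element is 0

total = arr + arr        # element-wise addition
difference = arr - arr   # element-wise subtraction (all zeros)
product = arr * arr      # element-wise multiplication

quotient = arr / arr     # 0/0 in the first position gives nan, with a warning
reciprocal = 1 / arr     # 1/0 in the first position gives inf, with a warning

print(quotient)
print(reciprocal)
```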

We can also perform exponentiation operation using the ** operator in Python. Here is the operation for raising each element of the NumPy Array to the power of 3:
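For example, with the assumed array of integers 0 to 10:

```python
import numpy as np

arr = np.arange(0, 11)   # assumed example array

# ** is applied element by element
cubes = arr ** 3
print(cubes)
```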


Now, we see some useful methods supported by NumPy library for NumPy arrays:
  1. sqrt() - This method finds the square root of each element of a NumPy array


  2. max() - This returns the maximum element in a NumPy Array
  3. min() - Returns the minimum element of a NumPy Array


  4. Trigonometric functions - NumPy also supports trigonometric methods like sin(), cos() and tan()


  5. mean() - Returns the mean of a NumPy Array
  6. log() - Computes the natural logarithm of each element of a NumPy Array


Notice that there is a warning, since the logarithm of zero is undefined; NumPy returns -inf for that element.
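The methods listed above can be tried out with a sketch like this one, assuming an array of the integers 0 to 10. The log() call prints a RuntimeWarning because of the 0 in the first position:

```python
import numpy as np

arr = np.arange(0, 11)   # assumed example array

roots = np.sqrt(arr)     # square root of each element
largest = arr.max()      # maximum element
smallest = arr.min()     # minimum element
sines = np.sin(arr)      # sine of each element (input is in radians)
average = arr.mean()     # mean of all elements
logs = np.log(arr)       # natural logarithm; log(0) gives -inf with a warning

print(largest, smallest, average)  # 10 0 5.0
print(logs)
```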
You can visit this link to know the list of all the functions supported over NumPy arrays by the NumPy library.

From the next post, we'll start with Pandas, another very important library for Data Science under Data Science & Machine Learning.