What is Pandas?
Pandas is an open-source Python library that provides easy-to-use, high-performance data structures and data analysis tools for data science applications.
Pandas is used in a wide range of academic and commercial fields, including economics, finance, analytics, and statistics.
Pandas builds on existing Python libraries such as NumPy and Matplotlib to give us a single, convenient place to do nearly all of our data analysis and visualization work.
Pandas is released under the BSD license.
Key features of this library are as follows:
- The DataFrame object is fast and efficient for data manipulation, with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format.
- Intelligent data alignment and integrated handling of missing data: computations align automatically on labels, which makes it easy to turn messy data into an orderly form.
- Flexible reshaping and pivoting of data sets.
- Intelligent label-based slicing, subsetting, and fancy indexing of large data sets.
- Size mutability: columns can be inserted into and deleted from data structures.
- A powerful group-by engine for aggregating or transforming data, which supports split-apply-combine operations on data sets.
- High-performance merging and joining of data sets.
- Hierarchical axis indexing, which provides a way of working with high-dimensional data in a lower-dimensional data structure.
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. It even lets us create domain-specific time offsets and join time series without losing data.
- Highly optimized performance, with critical code paths written in Cython and C.
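Two of the features listed above, label-based alignment of missing data and the group-by engine, can be illustrated with a minimal sketch. The Series values, the column names team and points, and the variable names are made up for illustration and are not taken from any real data set.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Addition aligns on labels; "x" has no partner in b, so the result is NaN there.
total = a + b
print(total)

# Split-apply-combine: split rows by "team", sum "points" in each group,
# and combine the per-group sums into a new Series.
df = pd.DataFrame({
    "team": ["red", "blue", "red", "blue"],   # hypothetical column
    "points": [3, 5, 4, 1],                   # hypothetical column
})
points_per_team = df.groupby("team")["points"].sum()
print(points_per_team)
```

Note that the alignment happens by label rather than by position, which is why the missing label produces NaN instead of an error.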
Creating and Running your First Pandas application
To create this application, we first need a working Python installation.
To install Python, follow the steps below:
- Visit https://www.python.org/downloads/ and download one of the available versions.
I recommend the Python 3.6.x series: 3.6.4 is the latest version and will most likely remain stable, while 2.7.x is maintained only to support legacy code.
- While installing Python, you need to select this option, as it is very important and will help us as we progress.
After this, click Install Now and let the installer finish.
- Now launch IDLE (Python 3.6 32-bit), which will be available on your PC if everything installed properly.
Now open a Command Prompt window and type python -m pip install pandas.
This installs the pandas library along with its dependencies, including NumPy.
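One quick way to confirm that the installation worked is to import both libraries and print their versions. This is only a sanity check; the exact version numbers printed will depend on what pip installed on your machine.

```python
import pandas as pd
import numpy as np

# If either import fails with ModuleNotFoundError, the install did not succeed.
pandas_version = pd.__version__
numpy_version = np.__version__

print("pandas version:", pandas_version)
print("NumPy version:", numpy_version)
```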
Before we write a new program, let's discuss a key concept: the Series.
A Series is a one-dimensional, labelled, homogeneous array whose size is immutable.
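The "labelled" and "homogeneous" parts of that definition can be checked with a small sketch. The values below are arbitrary; the point is that a Series holds a single dtype and carries an index alongside its values.

```python
import pandas as pd

# Mixing integers and a float: pandas stores every value as one common dtype.
s = pd.Series([1, 2.5, 3])

print(s.dtype)         # float64
print(list(s.index))   # [0, 1, 2]  (default labels)
print(s.ndim)          # 1  (one-dimensional)
```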
In our program, we must first import pandas and NumPy at the top, much like declarations in other languages:
import pandas as pd
import numpy as np
Here pd and np are aliases, a shorthand that keeps the code concise and readable.
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
Here the Series pairs each value with an index label:
print(s)
To generate the output, select Run Module from the Run menu or just press F5.
0    a
1    b
2    c
3    d
dtype: object
The output above lists each value in the Series alongside its index.
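To see that pairing directly, here is a small sketch that walks over the same Series with items(), which yields (index, value) pairs. The variable name pairs is just for illustration.

```python
import pandas as pd
import numpy as np

data = np.array(["a", "b", "c", "d"])
s = pd.Series(data)

# items() yields one (index label, value) tuple per element.
pairs = list(s.items())

for label, value in pairs:
    print(label, value)
```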
The code below shows several ways to create a Series and then query it:
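import pandas as pd
import numpy as np
# Create a Series from a NumPy array; the default index is 0, 1, 2, ...
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)
# Create a Series from a NumPy array with a custom index
data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data, index=[100, 101, 102, 103])
print(s)
# Create a Series from a dict; the keys become the index labels
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data)
print(s)
# Pass an index along with a dict; values are looked up by label,
# and 'd', which is not a key in the dict, becomes NaN
data = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
# Create a Series from a scalar; the value is repeated for every index label
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
# Retrieve the first element (its label is 'a')
print(s['a'])
# Retrieve the first two elements
print(s[:2])
# Retrieve the last three elements
print(s[-3:])
# Retrieve a single element by its label
print(s['a'])
# Retrieve several elements by passing a list of labels
print(s[['a', 'c', 'd']])
# 'h' is not in the index, so this line raises a KeyError
print(s['h'])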
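Looking up a label that is not in the index, such as 'h' in the last line above, raises a KeyError. If you would rather not have the program stop, one possible approach is sketched below: test for the label first, use get() with a default, or catch the exception. The variable names has_h, missing, and fallback are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# The "in" operator checks the index labels, not the values.
has_h = "h" in s
print(has_h)  # False

# get() returns a default instead of raising when the label is missing.
missing = s.get("h")       # None
fallback = s.get("h", 0)   # 0
print(missing, fallback)

# Alternatively, catch the KeyError explicitly.
try:
    s["h"]
except KeyError as err:
    print("label not found:", err)
```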