Pandas is a high-level data manipulation library built on top of the NumPy package, so much of NumPy's structure is used or replicated in Pandas. It provides flexibility in working with large datasets and helpful methods for carrying out various data operations during analysis. Pandas lets us load, prepare, manipulate, model, and analyze data, regardless of where the data originates. It provides tools for loading data from various file formats into in-memory data objects, reshaping and pivoting data sets, aggregating data with group by, and slicing, indexing, and subsetting large data sets. Data sets can also be joined and merged with high performance. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in Scikit-learn. Pandas has become a popular tool for effective data manipulation and analysis, and Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.
Jupyter Notebooks provide a good environment for using pandas for data exploration and modeling. A notebook lets us execute a piece of code in a particular cell rather than running the entire file, which saves a lot of time when working with large datasets and complex transformations. Notebooks also provide an easy way to visualize pandas DataFrames and plots.
Series
The two primary components of pandas are the Series and the DataFrame.
Series is a one-dimensional labeled array which can contain any type of data, including mixed types. The axis labels are collectively called the index. The labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
A Series can be created from a Python list as shown below. The default numerical index can be overridden by passing a custom index array.
import pandas as pd

series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
type(series1)
Series elements can be accessed using the indexing operator [] with an index number, and multiple elements can be accessed using the slice operation. Elements can also be accessed using the custom index labels passed during initialization of the Series. The .loc and .iloc indexers can be used instead of the indexing operator to make selections. The .loc indexer retrieves data by label and can also select subsets of data, while the .iloc indexer is very similar but only uses integer locations to make its selections.
# get all elements from the start up to the third
print(series1[:3])
# use index labels to fetch values
print(series2['b'])
# access elements of the series using .loc[]
print(series1.loc[2:4])
# access elements by position using .iloc[]
print(series1.iloc[2:5])
Series also provides functions for binary operations such as addition, subtraction, and multiplication, as shown below.
series_a = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
series_b = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
# adding two series using the add function
series_a.add(series_b, fill_value=0)
# subtracting one series from another using the sub function
series_a.sub(series_b, fill_value=0)
A Series can be converted to a list or to other types with the help of conversion functions such as .astype() and .tolist().
# importing pandas module
import pandas as pd

# reading a csv file
data = pd.read_csv("nba.csv")
# dropping rows with null values to avoid conversion errors
data.dropna(inplace=True)
# storing dtypes before converting
before = data.dtypes
# converting dtypes using astype
data["Salary"] = data["Salary"].astype(int)
data["Number"] = data["Number"].astype(str)
salary_list = data["Salary"].tolist()
# storing dtypes after converting
after = data.dtypes
The Series .to_frame() method is used to convert a Series object into a DataFrame.
import pandas as pd

series = pd.Series([100, 200, 300, 400, 500])
# converting the series into a dataframe
dataframe = series.to_frame()
The Pandas Series unique() function extracts the unique values from a column. The unique() method does not take any parameters and returns a NumPy array of the unique values in that particular column. It can only be applied to one-dimensional data, so it works on a Series but not on a DataFrame. The pd.unique() function counts NaN and None as unique values as well.
import numpy as np
import pandas as pd

dataset = {'Name': ['Jack', 'Rock', 'Tom', 'Jack', 'Mike'],
           'serial_no': ['01', '02', '03', '04', '05']}
df = pd.DataFrame(dataset)
group = df["Name"].unique()
# the unique function also accepts an array of tuples
print(pd.unique([('x', 'y'), ('y', 'x'), ('x', 'z'), ('y', 'x')]))
# NaN and None are part of the unique values in the result
print(pd.unique([('x', 'y'), ('y', 'x'), ('x', 'z'), np.nan, None, np.nan]))
DataFrame
DataFrame is a two-dimensional labeled data structure with columns of potentially different data types. A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series. DataFrames are used to store and manipulate tabular data in rows of observations and columns of variables.
The simplest way to create a DataFrame is from a Python dictionary, where each (key, value) item corresponds to a column in the resulting DataFrame. The default index of the DataFrame is a sequence of numbers starting from 0, but we can also supply our own index when we initialize the DataFrame, as below.
data_dictionary = {
    'apples': [3, 2, 0, 1],
    'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data_dictionary)
purchases = pd.DataFrame(data_dictionary, index=['John', 'Tony', 'Michael', 'Chris'])
We will focus on DataFrames in the subsequent sections.
Reading data from files
Pandas can load data from various file formats such as CSV, Excel, and JSON, and it can also read data from a remote URL. While reading CSV data into a DataFrame, the index_col parameter indicates which column becomes the index of the DataFrame. If we pass index_col=0, the first column of the file is used as the index.
import pandas as pd

df = pd.read_csv('sales.csv')  # returns a DataFrame object
df = pd.read_csv('sales.csv', index_col=0)  # makes the zeroth column the index
# load files in tab-separated format
df_tab = pd.read_csv('sales_tab_format.txt', delimiter='\t')
# load files in excel format
df_xlsx = pd.read_excel('sales.xlsx')
# read json data, which can also be fetched from a URL
df_json = pd.read_json("https://datasource.com/master/data/records.json")
Although Pandas does not have a built-in function to read multiple files within a directory into a single DataFrame, this can be achieved using basic Python file operations as shown in the example below.
import os

import pandas as pd

files = [file for file in os.listdir('/path/directory')]
all_data = pd.DataFrame()
for file in files:
    df = pd.read_csv('/path/directory/' + file)
    all_data = pd.concat([all_data, df])
Reading data from Database
Pandas can also work with a SQL database by first establishing a connection using an appropriate Python library and then passing a SQL query to fetch data into a DataFrame. Pandas loads data from a SQL database using the read_sql() function, which is a wrapper around read_sql_table and read_sql_query. Below is an example of accessing data from a MySQL database.
import pandas as pd
import pymysql
from sqlalchemy import create_engine

mysql_connection_str = 'mysql+pymysql://mysql_user:mysql_password@mysql_host/mysql_db'
mysql_connection = create_engine(mysql_connection_str)
df = pd.read_sql('SELECT * FROM table_name', con=mysql_connection)
Below is another example of loading data from an Oracle database using the cx_Oracle driver.
import cx_Oracle
import pandas as pd


class Connection(cx_Oracle.Connection):
    def cursor(self):
        cursor = super(Connection, self).cursor()
        cursor.arraysize = 5000
        return cursor


ora_connection = Connection("oracle_user", "oracle_password", "ora12c")
df = pd.read_sql_query("select * from table_name", ora_connection)
Indexing
Indexes in Pandas are similar to an address which enables access to any data point across the DataFrame or Series. Like a Series, a DataFrame has an index that labels every row (initially with its row number), and a second set of labels holds the column names. An Index in Pandas is an immutable array which enables access to a row or column by its label. The index of a Series or DataFrame can be accessed using the index attribute or created with the Index() constructor, and its title can be changed as well.
df = pd.DataFrame({"Letters": ["a", "b", "c"], "Numbers": [1, 2, 3]}) # returns the DataFrame index as an attribute df_index = df.index # update index column name df_index.name = "Index Title" series = [11, 21, 21, 19, 11] # returns index of list object seriObj = pd.Index(series)
The set_index() function sets a list, Series, or existing column as the index of the DataFrame. It sets the DataFrame index (row labels) using one or more existing columns or arrays of the correct length. It takes a keys parameter with the list of column names to set as the index, a drop parameter which, if True, removes the columns used for the index, and an append parameter which, if True, appends the columns to the existing index.
# set the rank column as the index of the current DataFrame
data.set_index('rank', inplace=True)
# create a MultiIndex using columns 'year' and 'month'
df.set_index(['year', 'month'])
# create a MultiIndex using an Index and a column
df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
# create a MultiIndex using two Series
s = pd.Series([1, 2, 3, 4])
df.set_index([s, s**2])
# set the Timestamp column as the index
df.set_index('Timestamp', inplace=True, drop=True)
# keep the passed columns and append them to the already existing index
df.set_index(["Month", "Year"], inplace=True, append=True, drop=False)
The index of a DataFrame can be reset to a range of integers from 0 to the length of the data using the reset_index() method. It reassigns the index (the row labels) of a DataFrame or Series to sequence numbers starting from 0, and is mostly used to remove the current index or to turn a string row label back into a column. The reset_index() method takes a level parameter, which can be a column name or number, or a list of them, to remove only those levels from the index. The drop parameter adds the replaced index as a column to the data when it is False; when it is set to True, the original index is simply discarded. The col_level parameter selects the column level to insert the labels into, while col_fill determines the naming of the other levels. By default, reset_index() does not change the original object and returns a new one, but if the inplace argument is set to True, the original object is changed.
dataset = {'Name': ['Rohit', 'Mohit', 'Sohit', 'Arun', 'Shubh'],
           'Roll no': ['01', '03', '04', '05', '09'],
           'Marks in maths': ['93', '63', '74', '94', '83'],
           'Marks in science': ['88', '55', '66', '94', '35'],
           'Marks in english': ['93', '74', '84', '92', '87']}
df = pd.DataFrame(dataset)
# setting the index on the Name column
df.set_index(["Name"], inplace=True, append=True, drop=True)
# resetting index level 1, back to the original form
df.reset_index(level=1, inplace=True, col_level=1)
# setting a MultiIndex on the Name and Roll no columns
df.set_index(["Name", "Roll no"], inplace=True, append=True, drop=True)
# convert the first index level to a column in the DataFrame
df.reset_index(level=0, inplace=True)
# reset index level 2, removing the index on Roll no and keeping only the Name index
df.reset_index(level=2, inplace=True, col_level=1)
# reset to a new index, saving the old index as a new column named 'index'
new_df = new_df.reset_index()
# reset to a new index without saving the existing index as a column
new_df = new_df.reset_index(drop=True)
# perform the operation in place rather than creating a new data frame
new_df.reset_index(drop=True, inplace=True)
Printing data within DataFrame
The head() and tail() functions and slicing allow us to get the first or last n rows of a DataFrame. The head() function outputs the first five rows of the DataFrame by default; passing a number argument outputs that number of rows instead. Similarly, the tail() function also accepts a number argument.
print(df.head(3))   # get the top 3 rows in the data frame
print(df.tail(3))   # get the bottom 3 rows in the data frame
print(df[50:55])    # get rows by specifying the row numbers
Indexing (subset selection) in pandas means selecting all of the rows and some of the columns, some of the rows and all of the columns, or some of each. The indexing operator [] is used to select rows or columns in the DataFrame. Pandas allows columns to be read using their column headers (column names), and the indexing operator [] also makes it possible to take a subset of the data in a number of ways when a DataFrame is too large to work with, as below.
data = {'Name': ['John', 'Jimmy', 'Jackie', 'Joseph'],
        'Age': [27, 24, 22, 32],
        'Address': ['Dublin', 'London', 'Vancover', 'Tokyo'],
        'Qualification': ['MS', 'MD', 'MBA', 'Phd']}
df = pd.DataFrame(data)
# read the headers of the data columns
print(df.columns)
# read a single column
print(df['Name'])
# read multiple columns by name at once using the double bracket syntax
print(df[['Name', 'Age', 'Qualification']])
# read the top 5 rows of the 'Name' column
print(df['Name'][0:5])
# select all rows whose age is 24 and qualification is MD
df[(df.Age == 24) & (df.Qualification == 'MD')]
Pandas provides multiple options to display the data in a more customized format, such as decimal precision, the maximum number of rows displayed, and row wrapping. An example follows below.
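As a minimal sketch, a few of the standard display options can be adjusted with pd.set_option(); the values chosen here are arbitrary.

import pandas as pd

# show at most 100 rows and 20 columns when printing a DataFrame
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 20)
# display floats with 2 decimal places
pd.set_option('display.precision', 2)
# do not wrap wide frames onto multiple lines
pd.set_option('display.expand_frame_repr', False)
# reset an option back to its default
pd.reset_option('display.max_rows')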
Indexer to select rows and columns
Apart from the indexing operator [], Pandas provides the loc and iloc indexers to perform just about any data selection operation. loc is label-based: rows and columns are fetched by their (row and column) labels. It can select subsets of rows or columns, and even rows and columns simultaneously. The loc indexer is used for selecting rows by label/index or by a boolean condition, using the syntax df.loc[<row selection>, <column selection>]. Selection with loc is based on the index values of the DataFrame's rows, and an index is set on a DataFrame using the df.set_index() method. Below are examples of selecting single or multiple rows and columns from a DataFrame. When selecting columns with loc, columns are referred to by name using lists of strings or ':' slices, and we can also select ranges of index labels, which returns all rows between the specified index entries.
data = {'first_name': ['Johnny', 'Jimmy', 'Joe', 'Joseph'],
        'last_name': ['Depp', 'Harrison', 'Liberman', 'Nash'],
        'company_name': ['Apple', 'Google', 'Tesla', 'Microsoft'],
        'address': ['14 Taylor Street', '15 Binney Street', '8 Moor Place', '5396 Forth Street'],
        'city': ['London', 'Berlin', 'Los Angeles', 'Dubai'],
        'phone': ['0343-6454565', '023-4345345', '012-54645646', '09-23423424']}
df = pd.DataFrame(data)
df.set_index("last_name", inplace=True)
# selecting a single row
df.loc['Depp']
# selecting multiple rows
df.loc[['Depp', 'Harrison']]
# select rows and columns using the names of the columns
df.loc[['Harrison', 'Liberman'], ['first_name', 'address', 'phone']]
# select rows with index values 'Liberman' and 'Nash', with all columns between 'city' and 'phone'
df.loc[['Liberman', 'Nash'], 'city':'phone']
# select a range of rows, with just the 'first_name', 'address' and 'city' columns
df.loc['Harrison':'Nash', ['first_name', 'address', 'city']]
iloc is an integer-based indexer that allows rows and columns to be specified by their integer positions. The iloc indexer is used for integer-location based indexing, or selection by position, with the syntax df.iloc[<row selection>, <column selection>]. The iloc indexer returns a Pandas Series when a single row is selected, and a Pandas DataFrame when multiple rows are selected or when a full column is selected.
# single selection of rows
df.iloc[0]    # first row of the data frame
df.iloc[1]    # second row of the data frame
df.iloc[-1]   # last row of the data frame
# single selection of columns
df.iloc[:, 0]   # first column of the data frame
df.iloc[:, 1]   # second column of the data frame
df.iloc[:, -1]  # last column of the data frame
# reading a specific location (row, column) by passing row and column positions
df.iloc[2, 1]
# multiple row and column selections using iloc
df.iloc[0:5]                       # first five rows of the dataframe
df.iloc[:, 0:2]                    # first two columns of the data frame, with all rows
df.iloc[[0, 3, 6, 24], [0, 5, 6]]  # 1st, 4th, 7th, 25th rows and 1st, 6th, 7th columns
df.iloc[0:5, 5:8]                  # first 5 rows and the columns at positions 5, 6 and 7
# finding specific rows based on a value rather than an integer position (uses loc)
df.loc[df['Type 1'] == "Fire"]     # find all rows where the 'Type 1' value is 'Fire'
Logical Boolean indexing
Conditional selection with boolean arrays using df.loc[<selection>] fetches the values which match the specified condition. We pass an array or Series of True/False values to the loc indexer to select the rows where the Series has True values. For example, the expression df['first_name'] == 'Jimmy' produces a Pandas Series with a True/False value for every row in the DataFrame, with True for the rows where the first_name is 'Jimmy'. The second argument to loc takes the columns to fetch, which can be a single string, a list of columns, or a ':' slice. Passing multiple column names as the second argument of loc[] selects multiple columns. Selecting a single column returns a Series, while selecting a list of columns returns a DataFrame.
# select rows with first name Jimmy
df.loc[df['first_name'] == 'Jimmy']
# select rows with first name Jimmy and the columns 'company_name', 'city' and 'phone'
df.loc[df['first_name'] == 'Jimmy', ['company_name', 'city', 'phone']]
# select rows with first name Jimmy and all columns between 'address' and 'phone'
df.loc[df['first_name'] == 'Jimmy', 'address':'phone']
# select rows with first_name equal to some values, all columns
df.loc[df['first_name'].isin(['Joe', 'Johnny', 'Joseph'])]
# select rows with first name Johnny AND 'Street' addresses
df.loc[df['address'].str.endswith("Street") & (df['first_name'] == 'Johnny')]
# select rows with an id column between 100 and 200, and just return the 'address' and 'city' columns
df.loc[(df['id'] > 100) & (df['id'] <= 200), ['address', 'city']]
# a lambda function that yields True/False values can also be used,
# e.g. select rows where the company name has 4 words in it
df.loc[df['company_name'].apply(lambda x: len(x.split(' ')) == 4)]
# the selection can be computed outside of the main .loc for clarity
idx = df['company_name'].apply(lambda x: len(x.split(' ')) == 4)
# select only the True values in 'idx' and only the 3 columns specified
df.loc[idx, ['city', 'first_name', 'company_name']]
A few more examples of filtering data using the loc method are shown below.
df.loc[df['Type 1'] == 'Grass']
# inside a pandas boolean expression '&' is used instead of 'and'
df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison')]
df.loc[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Poison')]
# create a new data frame from the filtering results
new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poison') & (df['HP'] > 70)]
# filter to get all the names containing the text 'Mega'
df.loc[df['Name'].str.contains('Mega')]
# remove all the names containing the text 'Mega', a negation example in loc
df.loc[~df['Name'].str.contains('Mega')]
# the str.contains() function can also accept regular expressions
import re
df.loc[df['Type 1'].str.contains('Fire|Grass', regex=True)]
# ignore case for str.contains()
df.loc[df['Type 1'].str.contains('fire|grass', flags=re.I, regex=True)]
# fetch all the names in the 'Name' column starting with 'Pi'
df.loc[df['Name'].str.contains('^pi[a-z]*', flags=re.I, regex=True)]
Setting values in DataFrames using loc
We can update the DataFrame in the same statement as the select and filter using the loc indexer. This pattern allows values in columns to be updated depending on different conditions. The setting operation does not make a copy of the data frame; it edits the original data.
# change the first name of all rows with an ID greater than 2000 to "Robert"
data.loc[data['id'] > 2000, "first_name"] = "Robert"
# update all the rows where the value of 'Type 1' is 'Fire' to the new value 'Flamer'
df.loc[df['Type 1'] == 'Fire', 'Type 1'] = 'Flamer'
# set the 'Legendary' column to True wherever the value of 'Type 1' is 'Fire'
df.loc[df['Type 1'] == 'Fire', 'Legendary'] = True
# change multiple column values based on a condition on another column
df.loc[df['Total'] > 500, ['Generation', 'Legendary']] = 'Test Value'
df.loc[df['Total'] > 500, ['Generation', 'Legendary']] = ['Test1 Value', 'Test2 Value']
Iteration
A DataFrame contains rows and columns and can be iterated similarly to a dictionary. Rows are iterated using iterrows() and itertuples(), while iteritems() iterates over columns. Columns of a DataFrame can also be iterated by first creating a list of the DataFrame's columns and then iterating through that list to pull out each column. The example below iterates through each row in the DataFrame to read the row data.
## read each row
for index, row in df.iterrows():
    print(index, row)

# read each row, printing only the specified column
for index, row in df.iterrows():
    print(index, row['first_name'])

# creating a list of dataframe columns
columns = list(df)
# iterating columns and printing the values of the 3rd row
for i in columns:
    print(df[i][2])

# iterate over specified column names in the DataFrame
for column in df[['first_name', 'address']]:
    print('Column Name : ', column)

# iterate over columns using an index
for index in range(df.shape[1]):
    print('Column Number : ', index)
    # select the column by position using iloc[]
    columnsObj = df.iloc[:, index]
    print('Column Contents : ', columnsObj.values)
The iteritems() function returns an iterator that can be used to iterate over all the columns of the DataFrame. For each column it yields a tuple containing the column name and the column content as a Series. (In newer pandas versions, items() replaces the deprecated iteritems().)
for (columnName, columnData) in df.iteritems():
    # prints the column name
    print('Column Name : ', columnName)
    # prints all the values in the column
    print('Column Values : ', columnData.values)
DataFrame Properties
The info() function provides a concise summary of the dataset, such as the number of rows and columns, the number of non-null values, the type of data in each column, and memory usage. The non-null counts and the column data types help detect data issues before performing data operations. The describe() method provides statistical details about the data within the DataFrame. The shape property returns a tuple representing the dimensionality of the DataFrame, and the values property returns all the rows of the DataFrame as an array.
# summary information about the dataframe
df.info()
# basic statistics for each numeric column in the data set, e.g. count, mean, min and std deviation
df.describe()
# the include='all' option adds all the columns, including non-numeric ones, with NaN for missing statistics
df.describe(include='all')
df.shape      # returns (16598, 11), i.e. 16598 rows and 11 columns
df.shape[0]   # returns 16598, the row count
df.values     # returns a 2-dimensional array, where each inner array represents a row in the data set
Cleaning Data
The isnull() and notnull() functions are used to check for missing values in Pandas. Both functions help in checking whether a value is NaN or not, and they can also be used on a Pandas Series to find the null values in it.
# using the isnull() function
df.isnull()
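As a small additional sketch (assuming a DataFrame df with a column such as 'Salary'), the same functions can be combined to count or inspect missing values:

# notnull() is the inverse of isnull(): True where a value is present
df.notnull()
# count the missing values in each column
df.isnull().sum()
# check for missing values in a single column (a Series)
df['Salary'].isnull()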
Pandas provides various built-in functions such as fillna(), replace(), and interpolate() which help clean the data by filling in missing values. The fillna() function fills the NA/NaN values in a DataFrame with a specified value. The replace() method can likewise replace NaN values (referenced via numpy.nan) with zeros or any other value; it replaces the specified value wherever it occurs and so differs from loc or iloc, which require a location to update. The interpolate() function is also used to fill NA values, but it uses various interpolation techniques to derive the missing values rather than hard-coding a value.
import numpy as np
import pandas as pd

data_dict = {'price': ['100', 'KDL100', 400, 'ADL100'],
             'discount': ['50', '50%', '30%', '20']}
df = pd.DataFrame(data_dict)
# convert strings to numeric values using the pd.to_numeric() function
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['price'] = df['price'].fillna(0)
# replacing the NaN values with zeros by using numpy's nan property
df['price'] = df['price'].replace(np.nan, 0)
# alternatively, use the apply() function to convert the entire DataFrame
df = df.apply(pd.to_numeric, errors='coerce')
# fill NaN values with zeros in the entire dataframe
df = df.fillna(0)
# replace the NaN values with zeros in the entire DataFrame
df = df.replace(np.nan, 0)
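A minimal sketch of interpolate(), which derives the missing values from the neighbouring ones instead of using a fixed constant (the Series below is made up for illustration):

import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, np.nan, 40])
# linear interpolation fills the two gaps with 20.0 and 30.0
s_filled = s.interpolate(method='linear')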
In order to drop missing values from a DataFrame, the dropna() function is used, which removes rows or columns with Null/None/NA values. It returns a new DataFrame without changing the original. The dropna() function takes the arguments axis, how, thresh, subset, and inplace. An axis value of 0 drops rows containing null values, while an axis value of 1 drops columns with missing values. The how argument determines whether to drop the row/column when any of the values are null (the default) or only when all the values are missing. The thresh argument specifies the minimum number of non-null values required to keep a row or column. The subset argument restricts the check for null values to specific columns.
import numpy as np
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', np.nan, 'Emilia'),
          ('Westworld', pd.NaT, 'Evan Rachel'),
          ('La Casa De Papel', 4, None)]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
# drop rows with any missing value, which is the default behavior
removedNone = dfObj.dropna()
# when axis=1, drop all columns with any missing value
removedNoneColumns = dfObj.dropna(axis=1)
# drop a row only if all of its values are None, NaN, or NaT
removedNoneColumns = dfObj.dropna(how='all')
# when thresh=2, keep only the rows that have at least 2 non-null values
removedNoneColumns = dfObj.dropna(thresh=2)
# drop rows with null values only in the subset of specified columns
removeDefinedColumns = dfObj.dropna(subset=['Name', 'Actor'])
Find Duplicates
The Pandas DataFrame.duplicated() function is used to find duplicate rows based on all columns or on specific columns. It returns a boolean Series with a True value for each duplicated row. The duplicated() function takes two parameters, subset and keep. If a single column or a list of columns is passed as subset, duplicates are determined from those columns only. The keep parameter controls which occurrence is marked as a duplicate: 'first' does not mark the first occurrence, 'last' does not mark the last occurrence, and False marks all occurrences as duplicates.
series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Millie'),
          ('La Casa De Papel', 4, 'Sergio')]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
# finds all duplicate rows
duplicateDFRow = dfObj[dfObj.duplicated()]
# finds duplicate rows, ignoring the last duplicate occurrence
duplicateDFRow = dfObj[dfObj.duplicated(keep='last')]
# find duplicates in the Name column only
duplicateDFRow = dfObj[dfObj.duplicated(['Name'])]
# find duplicates in both the Name and Seasons columns
duplicateDFRow = dfObj[dfObj.duplicated(['Name', 'Seasons'])]
Pandas also has a drop_duplicates() method which helps in removing duplicates from the data frame. It too takes a subset parameter with the column names from which duplicates are identified. Similar to the duplicated() function, it takes the keep parameter to determine which duplicate values to keep and inplace to modify the existing DataFrame.
# remove duplicate values for Order Id
df = df[['Order Id', 'Grouped']].drop_duplicates()
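A few more drop_duplicates() variations, shown here as a sketch reusing the dfObj DataFrame built for the duplicated() examples above:

# drop duplicate rows considering all columns, keeping the first occurrence (default)
dedupedDF = dfObj.drop_duplicates()
# drop duplicates based only on the Name column, keeping the last occurrence
dedupedDF = dfObj.drop_duplicates(subset=['Name'], keep='last')
# drop every row that has a duplicate, keeping none of them
dedupedDF = dfObj.drop_duplicates(keep=False)
# modify the existing DataFrame instead of returning a new one
dfObj.drop_duplicates(inplace=True)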
Value Count
The value_counts() function returns a Series containing the counts of unique values in sorted order. The resulting object is in descending order, with the most frequently occurring element first, and NA values are excluded by default. The value_counts() function applies only to a Series, which can be a standalone Series object or a column selected from a DataFrame. The Series.index.tolist() or Series.to_frame() functions can be used to extract values from the resulting Series for further use.
# read a csv file, skipping the first 4 lines of the file
df = pd.read_csv('data.csv', skiprows=4)
# counts of the values in the City column, sorted in descending order by default
df.City.value_counts()
# counts of the values in the Sport column, without sorting
df.Sport.value_counts(sort=False)
# relative frequencies, obtained by dividing all counts by their sum
df.Sport.value_counts(normalize=True)
# counts including the NaN values for the Sport column
df.Sport.value_counts(dropna=False)
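The resulting Series can then be converted for further use, as mentioned above; a small sketch using the same City column:

city_counts = df.City.value_counts()
# unique city names ordered by frequency
city_names = city_counts.index.tolist()
# the counts themselves as a plain python list
counts = city_counts.tolist()
# convert the counts Series into a DataFrame
city_counts_df = city_counts.to_frame()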
Sorting data in Dataframe
There are two methods of sorting in Pandas, namely by label and by actual value.
The sort_index() method sorts the DataFrame by its labels, taking the axis argument and the order of sorting. By default axis=0, which sorts by the row labels; passing axis=1 sorts by the column labels. Sorting is in ascending order by default, which can be changed by passing a boolean value to the ascending parameter.
import numpy as np
import pandas as pd

unsorted_df = pd.DataFrame(np.random.randn(10, 2),
                           index=[1, 4, 6, 2, 3, 5, 9, 8, 0, 7],
                           columns=['col2', 'col1'])
# sort by row labels in ascending order, the default
sorted_df = unsorted_df.sort_index()
# sort by column labels
sorted_df = unsorted_df.sort_index(axis=1)
# sort in descending order
sorted_df = unsorted_df.sort_index(ascending=False)
The sort_values() method is used for sorting by column values. It accepts a single column name or a list of column names of the DataFrame by which the values are to be sorted. The sort_values() method also allows the sorting algorithm to be chosen from mergesort (stable), heapsort, and quicksort, as shown after the examples below.
# sort by a single column's values
df.sort_values('Name')
df.sort_values('Name', ascending=False)
# sort by multiple columns
df.sort_values(['Type 1', 'HP'])
df.sort_values(['Type 1', 'HP'], ascending=False)
# multiple column sorting, with a different order for each column
df.sort_values(['Type 1', 'HP'], ascending=[1, 0])
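A short sketch of choosing the algorithm with the kind parameter; the na_position option shown last is an addition beyond the text above, but is a standard sort_values argument.

# stable sort using mergesort
df.sort_values('Name', kind='mergesort')
# other available algorithms
df.sort_values('Name', kind='quicksort')
df.sort_values('Name', kind='heapsort')
# place missing values first instead of last
df.sort_values('Name', na_position='first')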
If the row numbers are used as the index and the order of rows changes because of sorting or deleting rows, it is recommended to renumber the index afterwards, as sketched below.
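For example, a minimal sketch reusing the df from the earlier sorting examples:

sorted_df = df.sort_values('HP', ascending=False)
# the original row numbers are now out of order; renumber them from 0
sorted_df = sorted_df.reset_index(drop=True)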
Assign New Columns
New columns can be assigned to a DataFrame using the assign() method. The assign() method returns a new object with all the original columns in addition to the new ones; existing columns that are re-assigned are overwritten. The length of a newly assigned column must match the number of rows in the DataFrame.
data = {'price': [520, 500]}
df1 = pd.DataFrame(data=data)
# assign a new column called revised_price
df2 = df1.assign(revised_price=lambda x: x.price + x.price * 0.05)
# assign a new column using the values of other columns
df2 = df1.assign(revised_price=df1['price'] + df1['price'] * 0.05)
# assigning multiple columns at once
df2 = df1.assign(revised_price=df1['price'] + df1['price'] * 0.05,
                 changed_price=df1['price'] + df1['price'] * 0.10)
# add a new column with the values specified in a list
df1['items'] = ['Apple Watch', 'Air Pod']
# add a new column with the same default value in every row
df1['category'] = 'Electronics'
Transpose Rows and Columns
The transpose() method or the T attribute in Pandas swaps the rows and columns of a DataFrame. Neither changes the original object; both return a new object with the rows and columns swapped, i.e. the index and columns are transposed, writing rows as columns and vice versa.
dt = {'Stranger Things': ['Mike', 'Eleven'],
      'Money Heist': ['Professor', 'Tokyo']}
df1 = pd.DataFrame(data=dt)
# transpose the dataframe using the T attribute
transposed_df1 = df1.T
# transpose the dataframe using the transpose() method
transposed_df2 = df1.transpose()
Updating the data in Dataframe
Below are some examples of updating the DataFrame by adding column values directly or by using the sum() function with the iloc indexer.
# add a new Total column by summing values from other columns in each row
df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp.Atk'] + df['Sp.Def']
# another method, summing columns 4 through 8 with iloc
df['Total'] = df.iloc[:, 4:9].sum(axis=1)
# move the new 'Total' column (currently the last column) so it becomes the 5th column:
# take the first 4 columns, then the last column, then the remaining columns
cols = list(df.columns.values)
df = df[cols[0:4] + [cols[-1]] + cols[4:11]]
# selecting a list of columns also changes their order (and drops the columns not listed)
df = df[['Total', 'HP', 'Attack']]
Pandas provides methods to add new rows, add new columns, delete rows and delete columns as shown below.
# add a new Ratings column with corresponding values to the DataFrame
df['Ratings'] = ['A', 'D', 'C', 'B', 'C']
# add a new row at the end using the append() function
data = {'Name': 'Mario Brothers', 'Platform': 'Nintendo', 'HP': 22,
        'Attack': 45, 'Defense': 67, 'Total': 132}
new_row = pd.DataFrame(data, index=[4])
df = df.append(new_row)  # pd.concat([df, new_row]) in newer pandas versions
# delete an entire column from the data frame
df = df.drop(columns=['Total'])
# delete a specified column from the DataFrame
del df['Total']
# delete a row from the DataFrame using its index label
df = df.drop(2)
The rename() method can be used to rename some or all columns via a dictionary. A list comprehension can also be used to rename columns, as below.
df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue in millions'
}, inplace=True)
# use a list comprehension to rename all columns to lower case
df.columns = [col.lower() for col in df]
Aggregate using Groupby
The groupby() function takes the name of the column in the data frame to group on, and an aggregation is then performed on each group. The groupby() function can also take a list of column names, a dict or Pandas Series, a NumPy array or Pandas Index, or an array-like iterable of these, as sketched after the examples below.
# average of the numeric columns for each 'Type 1' group
df.groupby(['Type 1']).mean()
df.groupby(['Type 1']).sum()
df.groupby(['Type 1']).mean().sort_values('Defense', ascending=False)
df.groupby(['Type 1']).count()
# to accurately count the items, add a helper column of ones
df['count'] = 1
df.groupby(['Type 1', 'Type 2']).count()['count']
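As a sketch of the other key types that groupby() accepts (the column names follow the earlier examples; the grouping labels here are illustrative):

import numpy as np

# group by a Series (here derived from a column) rather than a column name
df.groupby(df['Type 1'])['HP'].mean()
# group by a computed boolean key, e.g. whether HP is above 100
df.groupby(df['HP'] > 100)['Attack'].mean()
# group by a NumPy array of labels, which must match the number of rows
labels = np.where(df['Attack'] > df['Defense'], 'offensive', 'defensive')
df.groupby(labels)['Total'].sum()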
Date Time Conversion
The Pandas to_datetime() method converts string date-times into Python datetime objects, which comes in handy while working on datasets involving time. The to_datetime() function has the following parameters.
- arg: It is the object which is to be converted to a datetime, which can be int, float, string, datetime, list, tuple, 1D array, Series.
- errors: It can take 3 values: 'ignore', 'raise', or 'coerce'. By default, the value is 'raise'. If set to 'ignore', invalid parsing returns the input; if 'raise', invalid parsing raises an exception; if 'coerce', invalid parsing is set to NaT.
- dayfirst: It accepts a Boolean value; it places day first if the value of this parameter is true.
- yearfirst: It accepts a Boolean value; it places the year first if the value of this parameter is true.
- utc: It is also a Boolean value; it returns the time in UTC if the value is set to true.
- format: It is a string input that tells the position of the day, month, and year.
- exact: This is also a Boolean value; if true, an exact format match is required, otherwise the format is allowed to match anywhere in the target string.
- unit: It is a string ('D', 's', 'ms', 'us', 'ns') denoting the unit of the argument when the argument is an integer or float number. By default, it is 'ns'.
- infer_datetime_format: This is a boolean value. When it is true and no format of the data is given, then it attempts to infer the format of the datetime strings.
- origin: It is used to define a reference date.
- cache: It uses a cache of unique, converted dates to apply the datetime conversions, when cache is True.
The return value depends on the type of input: a DatetimeIndex for list input, a Series of datetime64 dtype for Series input, and a Timestamp for scalar input. When a date does not meet the timestamp limitations, passing errors='ignore' will return the original input instead of raising an exception.
import pandas as pd

data = pd.DataFrame({'year': [2015, 2016, 2017, 2018, 2019, 2020],
                     'month': [2, 3, 4, 5, 6, 7],
                     'day': [4, 5, 6, 7, 8, 9]})
x = pd.to_datetime(data)
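A few more illustrative conversions covering the errors, format, dayfirst, and unit parameters (the input values here are made up):

# invalid entries become NaT instead of raising an exception
pd.to_datetime(['2020-07-09', 'not a date'], errors='coerce')
# explicit format string for day-month-year data
pd.to_datetime('09-07-2020', format='%d-%m-%Y')
# treat the first number as the day when the format is ambiguous
pd.to_datetime('09-07-2020', dayfirst=True)
# interpret an integer as seconds since the unix epoch
pd.to_datetime(1594252800, unit='s')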
Reading large data in Chunks
In some cases the file is too large to fit in the memory of the local machine. For such cases pandas provides the ability to read the data in chunks. The chunksize parameter specifies the number of rows to be read into a DataFrame at any one time so that it fits into local memory.
# returns an iterator that yields DataFrames of 5 rows at a time
df_chunks = pd.read_csv('data.csv', chunksize=5)
for df in pd.read_csv('data.csv', chunksize=5):
    print("CHUNK DATA FRAME")
    print(df)
We can also read the large data in chunks and group by values of certain columns in order to get the corresponding aggregation results.
# create a new empty data frame with the same columns as the original data frame
new_df = pd.DataFrame(columns=df.columns)
# use groupby to get counts from each chunk and concatenate the result onto the new data frame
for df in pd.read_csv('data.csv', chunksize=5):
    results = df.groupby(['Type 1']).count()
    new_df = pd.concat([new_df, results])
Saving data into files
Similar to the ways data is read, pandas provides intuitive commands to save it. For JSON, Excel, and CSV files, the corresponding functions take the desired filename with the appropriate file extension, as below.
df.to_csv('modified.csv')
# to ignore/remove the index column from the output csv file
df.to_csv('modified.csv', index=False)
df.to_excel('modified.xlsx', index=False)
df.to_csv('modified.txt', index=False, sep='\t')
df.to_json('E:/datasets/data.json')
Inserting Dataframe into Database
The to_sql() function inserts a DataFrame into a SQL database and is the replacement for the deprecated write_frame(). The to_sql() function takes the table name, the engine, if_exists, and chunksize as parameters. The chunksize parameter writes records in batches of the given size at a time; by default, all rows are written at once.
The if_exists argument tells pandas what to do if the table already exists. With if_exists='fail' (the default), to_sql() raises an error when the table exists. With if_exists='replace', to_sql() drops the existing table, recreates it, and then inserts the data. With if_exists='append', to_sql() inserts the data into the existing table. If the table does not exist, to_sql() creates a new one.
from sqlalchemy import create_engine

# create a sqlalchemy engine
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="root", pw="password", db="db_name"))
df.to_sql('table_name', con=engine, if_exists='append', chunksize=1000)
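A short sketch of the other if_exists modes using the same engine:

# drop and recreate the table before inserting the data
df.to_sql('table_name', con=engine, if_exists='replace', index=False)
# raise an error if the table already exists (the default behavior)
df.to_sql('table_name', con=engine, if_exists='fail', index=False)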
Plotting data
Pandas has a built-in plot() function on the DataFrame class which is useful for getting a quick visualization of the data in a DataFrame. The plot function uses matplotlib for the visualizations. It has several key parameters:
kind — accepts 'bar', 'barh', 'pie', 'scatter', 'kde' etc., which can be found in the docs.
color — accepts an array of hex codes corresponding sequentially to each data series / column.
linestyle — accepts 'solid', 'dotted', 'dashed' (applies to line graphs only).
xlim, ylim — specify a tuple (lower limit, upper limit) for which the plot will be drawn.
legend — a boolean value to display or hide the legend.
labels — a list corresponding to the number of columns in the dataframe; a descriptive name can be provided here for the legend.
title — the string title of the plot.
df.plot(x='age', y='fare', kind='scatter')
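A sketch using a few more of the parameters listed above; the age and fare columns are assumed to come from the same dataset as the scatter example.

import matplotlib.pyplot as plt

df.plot(x='age', y='fare', kind='line', color=['#1f77b4'],
        linestyle='dashed', xlim=(0, 80), legend=True,
        title='Fare by age')
plt.show()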
Pandas also has a DataFrame hist() method, which is a wrapper for the matplotlib pyplot API. The hist() method is called on each Series in the DataFrame, resulting in one histogram per column.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'length': [2.5, 3.6, 4.6, 4.8, 5.0],
                   'width': [2.7, 3.7, 6.4, 0.22, 4.7]})
# histogram on the length column
df.hist(bins=3, column="length")
# alternatively, use the plot function to plot a histogram for the length column
df[['length']].plot(kind='hist', bins=3, rwidth=0.8)
plt.show()
Pivot Table
A pivot table is a table of statistics that summarizes the data of a larger table by "pivoting" that data. The pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. A pivot table requires a Pandas DataFrame and at least one index parameter. The index is the feature by which the data is grouped, and it appears as the index of the resulting table; more than one feature can be used, and multiple index values can be passed as a list to create further groupings. The pivot table function also takes a values parameter, the list of columns of the DataFrame to be aggregated; if values is left blank, all numerical columns are aggregated. The index, columns, and values parameters take column names from the original table. The pivot function creates a new table whose row and column indices are the unique values of the respective parameters, and the cell values of the new table are taken from the columns given as the values parameter.
import pandas as pd

# pivot table function signature
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
               fill_value=None, margins=False, dropna=True,
               margins_name='All', observed=False)
# basic pivot table with one index
pd.pivot_table(df, index=["Name"])
# pivot table with multiple index values
pd.pivot_table(df, index=["Name", "Rep", "Manager"])
pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"])
The aggregation function is the aggfunc parameter, which can take a list of functions such as the NumPy mean function, the len function to get a total count, or the NumPy sum function to get the sum of all values. The specified aggregation functions are applied to the columns given in values.
import numpy as np

pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"], aggfunc=np.sum)
pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"], aggfunc=[np.mean, len])
The fill_value parameter is used to specify the value to be used to replace missing values i.e. NaN values within the data.
pd.pivot_table(df, index=["Manager", "Rep"], values=["Price"],
               columns=["Product"], aggfunc=[np.sum], fill_value=0)
The margins=True parameter adds a row and a column for totals. The dropna=True parameter excludes columns whose entries are all NaN. The margins_name parameter provides the name for the total rows/columns.
pd.pivot_table(df, index=["Manager", "Rep", "Product"],
               values=["Price", "Quantity"],
               aggfunc=[np.sum, np.mean], fill_value=0, margins=True)
Adding columns to a pivot table adds another dimension to the table. The columns parameter, which is optional, specifies a key (column) to aggregate by and displays its values horizontally across the top of the resulting table, as below.
p_table = pd.pivot_table(df, index='Type', columns='Region',
                         values='Units', aggfunc='sum')
The pivot table result is a regular DataFrame, so standard functions such as query() can be used to filter the data.
p_table.query('Manager == ["Debra Henley"]')
p_table.query('Status == ["pending", "won"]')
Conclusion
In conclusion, Pandas has many uses and makes working with data easier by supporting cleaning, transforming, and analyzing it. It helps to calculate statistics and answer questions about the data, such as the average, median, max, or min of each column, to correlate data in one column with another, and to determine the distribution of data in a column. Data can be cleaned efficiently by removing missing values and filtering rows or columns by some criteria. Pandas also helps to visualize the data, with help from Matplotlib, by plotting bar charts, line charts, histograms, and much more. Finally, Pandas can connect with most SQL databases and file systems to either fetch or store data.