
Extra Help

 

Introduction to Pandas

Pandas is one of the most popular Python libraries for data analysis. It is an open source data analysis and manipulation tool, built on Python, that is fast, powerful, flexible, and easy to use.

It is well suited to many kinds of data analysis tasks and is frequently used in data science and machine learning.

Pandas can read and write many file formats, including Excel, CSV, JSON, and HDF5.

Additionally, it offers a variety of useful data structures and functions that make manipulating and analysing data easy and effective.


 

Pandas is built on NumPy and handles large in-memory datasets efficiently.

Its many features and user-friendly interface make data analysis and manipulation convenient and effective.

Among the most significant characteristics of pandas are:

1. Simple to use

2. Quick and effective

3. Complex data structures

4. Practical features

 

Creating pandas objects

We import the following packages.


>>> import numpy as np

>>> import pandas as pd

 

Creating a pandas Series


>>> s = pd.Series([12, 33, 53, 32, np.nan, 61, 28])
>>> s

0 12.0
1 33.0
2 53.0
3 32.0
4 NaN
5 61.0
6 28.0
dtype: float64

 
pd.date_range('20230204', periods=6)


>>> dates = pd.date_range('20230204', periods=6)
>>> dates

DatetimeIndex(['2023-02-04', '2023-02-05', '2023-02-06', '2023-02-07','2023-02-08', '2023-02-09'],
dtype='datetime64[ns]', freq='D')

 

np.random.randn(6, 4)


>>> np.random.randn(6, 4)

array([[ 0.42906588, 0.53918029, -1.81655533, -0.09472001],
       [ 0.68191124, 0.30750562, -0.72235013, 1.32586436],
       [ 1.74221117, 1.43519712, -0.01555522, 1.12539953],
       [ 0.08361743, 0.95102022, -0.3443251 , -0.35833583],
       [-1.43848456, 0.2094893 , 2.26662177, 1.28381844],
       [ 2.35987675, 0.89756953, -0.21003322, 0.49334903]])

 

pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))


>>> df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
>>> df

                   A         B         C         D
2023-02-04 -1.534267  0.325238 -0.470263  1.062306
2023-02-05  0.540712  0.903037  0.838934  0.048263
2023-02-06  0.950849  0.603792  0.037146  1.423318
2023-02-07  0.523925 -1.841955 -2.530231 -0.542932
2023-02-08 -1.670486  0.205526 -0.577388  1.033460
2023-02-09  1.452816  1.921894  0.537812 -0.846805

 

df2.dtypes


>>> df2.dtypes

A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
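
The df2 frame inspected above is never constructed in the text. The following is a minimal sketch of one way to build a frame with these column types. The values are assumptions, chosen to match the outputs shown in this section, and the exact names printed for the datetime and string dtypes may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# All values below are illustrative assumptions, not taken from the original code.
df2 = pd.DataFrame(
    {
        "A": 5.0,                                                   # scalar, repeated on every row (float64)
        "B": pd.Timestamp("20230204"),                              # a single date, repeated on every row
        "C": pd.Series(4, index=list(range(5)), dtype="float32"),   # sets the length of the frame to 5
        "D": np.array([6] * 5, dtype="int32"),                      # NumPy array of 32-bit integers
        "E": pd.Categorical(["White", "Blue", "Red", "Yellow", "Violet"]),
        "F": "Pop",                                                 # a string, repeated on every row
    }
)

print(df2.dtypes)
```

Scalars such as 5.0 and "Pop" are broadcast to the length set by the array-like columns, which is why every row shares the same A, B and F values.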

 

 

 

Codeblock E.1. Displaying the number of columns of the dataset.

 

Viewing data

To view data in Python pandas, you can use the following methods:

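df.head(): views the first five rows of data by default; pass a number to change how many

df.tail(): views the last five rows of data by default; pass a number to change how many

df.sample(): views a random sample of data (a single row by default)

df.info(): prints information about the dataframe, including the data type and non-null count of each column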


>>> df2.head()

    A          B   C D     E    F
0 5.0 2023-02-04 4.0 6 White  Pop
1 5.0 2023-02-04 4.0 6 Blue   Pop
2 5.0 2023-02-04 4.0 6 Red    Pop
3 5.0 2023-02-04 4.0 6 Yellow Pop
4 5.0 2023-02-04 4.0 6 Violet Pop
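
As a minimal sketch of the four viewing methods listed above, the snippet below rebuilds a random frame like the df created earlier and calls each of them. The row counts passed to tail() and sample(), and the random_state value, are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20230204", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

first_rows = df.head()                        # first 5 rows (the default)
last_rows = df.tail(3)                        # last 3 rows
random_rows = df.sample(n=2, random_state=0)  # 2 random rows; random_state makes the pick repeatable

print(first_rows)
print(last_rows)
print(random_rows)

df.info()                                     # prints index type, column dtypes and non-null counts
```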

 

 

Codeblock E.2. Displaying various facets of the dataset.

 

Selection of data

Selecting data in pandas is typically done with the .loc indexer, which selects rows and columns by label rather than by position. It is used with square brackets and accepts up to two indexers:

1. The label or labels of the rows you want to select
2. The label or labels of the columns you want to select

For example, if the rows carry the default integer labels 0, 1, 2 and so on, you would select the first row with the following code:

df.loc[0]

This returns all of the columns in that row as a Series.

If you wanted to select just the "Name" and "Age" columns from that row, you would use the following code:

df.loc[0, ["Name", "Age"]]

This returns a Series with the "Name" and "Age" values from the row labelled 0.

The .loc indexer accesses a group of rows and columns using labels or a boolean array. A label can be a single value such as a string or an integer, a list of labels, or a slice of labels. A boolean array must have the same length as the axis it filters.
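
The selections described above can be sketched as follows. The people frame, its "City" column and all of its values are hypothetical, introduced only so that the "Name" and "Age" examples have something to run against.

```python
import pandas as pd

# Hypothetical data with the default integer row labels 0, 1, 2.
people = pd.DataFrame(
    {
        "Name": ["Ana", "Ben", "Cy"],
        "Age": [31, 25, 47],
        "City": ["Oslo", "Pune", "Lima"],
    }
)

row0 = people.loc[0]                               # Series holding every column of the row labelled 0
name_age = people.loc[0, ["Name", "Age"]]          # Series holding only Name and Age for that row
over_30 = people.loc[people["Age"] > 30, "Name"]   # boolean mask picks rows, "Name" picks the column

print(row0)
print(name_age)
print(over_30)
```

The boolean mask people["Age"] > 30 has one entry per row, which is what lets .loc use it as a row filter.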

 

 

Codeblock E.3. Select specific parts of the dataset.

 

Missing data

In pandas, missing data are typically represented by the value np.nan. By default, missing values are excluded from calculations such as sum() and mean().
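
A short sketch of this behaviour, reusing the Series from the start of this section: the aggregations skip the NaN entry unless skipna=False is passed.

```python
import numpy as np
import pandas as pd

s = pd.Series([12, 33, 53, 32, np.nan, 61, 28])

total = s.sum()                      # NaN is skipped: 219.0
mean = s.mean()                      # averaged over the 6 non-missing values: 36.5
strict_total = s.sum(skipna=False)   # NaN, because the missing value is no longer skipped

mask = s.isna()                      # True where a value is missing
cleaned = s.dropna()                 # copy of s without the missing entry

print(total, mean, strict_total)
print(cleaned)
```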

 

Codeblock E.4. Dealing with missing data.

 

Data Stats

 

 

Codeblock E.5. Histogramming dataset.

 

Data Merging

 

 

Codeblock E.6. Merging dataset.

 

Data Reshaping

 

 

Codeblock E.7. Reshaping dataset.

 

Time series

 

 

Codeblock E.8. Time series.

 

Categorical data

 

 

Codeblock E.9. Categorical data manipulation.

 

Plotting

 

 

Codeblock E.10. Plotting.

 

---- Summary ----

What does pd.Timestamp("20130102") do ?

It creates a timestamp object representing the date January 2, 2013.

 

What does pd.Series(1, index=list(range(4))) do ?

It creates a Series of four elements, each equal to 1, with the index labels 0 to 3.

 

What does np.array([3] * 4, dtype="int32") do ?

It creates a NumPy array of four 32-bit integers, each of which is equal to 3.

 

What does pd.Categorical(["test", "train", "test", "train"]) do ?

This creates a categorical object that can be used to hold data that can be divided into a limited number of categories. This can be useful for holding data that can be mapped to a finite set of values, such as colors, days of the week, or months of the year.
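
As a small illustration of the answer above, the sketch below shows what the categorical object stores: the distinct categories, plus one integer code per value pointing into them.

```python
import pandas as pd

cat = pd.Categorical(["test", "train", "test", "train"])

print(cat.categories)   # the distinct values: 'test' and 'train'
print(cat.codes)        # one integer per element, indexing into the categories

# Wrapping it in a Series gives access to the usual pandas methods.
counts = pd.Series(cat).value_counts()
print(counts)
```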

What do DataFrame.head() and DataFrame.tail() do ?

DataFrame.head(n) returns the first n rows of a DataFrame. If n is omitted, it returns the first five rows.

DataFrame.tail(n) returns the last n rows of a DataFrame. If n is omitted, it returns the last five rows.

 

What do DataFrame.index and DataFrame.columns do ?

They are attributes rather than methods: DataFrame.index holds the row labels of the DataFrame, and DataFrame.columns holds the column labels.

 

What does df.to_numpy() do ?

The to_numpy() method is used to convert a DataFrame into a NumPy array.
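
A minimal sketch of the conversion, using made-up values. It also shows a consequence worth knowing: a NumPy array has a single dtype, so a frame that mixes numbers and strings comes back as an object array.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})
arr = numeric.to_numpy()             # 2x2 float64 array; row and column labels are dropped

mixed = pd.DataFrame({"A": [1.0, 2.0], "F": ["x", "y"]})
mixed_arr = mixed.to_numpy()         # object array, since floats and strings share no numeric dtype

print(arr, arr.dtype)
print(mixed_arr, mixed_arr.dtype)
```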

 

What does df.describe() do ?

The df.describe() function is used to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
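
For illustration, the sketch below runs describe() on a one-column frame holding the same numbers as the Series created earlier. The column name "score" is a hypothetical label.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"score": [12, 33, 53, 32, np.nan, 61, 28]})

summary = df.describe()   # count, mean, std, min, quartiles and max for each numeric column
print(summary)

# The NaN entry is excluded, so only 6 values are counted.
print(summary.loc["count", "score"])
```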

 

What does transposing data in Python do ?

Transposing data in Python means swapping the rows and columns of a data set. In pandas, this is done with df.T. This is often useful when working with data sets that are in a non-standard format, or when you need to rearrange the data in a specific way for analysis.
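
A minimal sketch with made-up values: after transposing, the former column labels become the row labels and vice versa.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

transposed = df.T   # 3 rows x 2 columns becomes 2 rows x 3 columns

print(df)
print(transposed)
```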

 

What does df.sort_index() do ?

The df.sort_index() function sorts the dataframe in ascending or descending order based on the index labels.

 

What does df.sort_values() do ?

The sort_values() method sorts a DataFrame by one or more columns. The default sorting is ascending.
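
The two sorting methods described above can be compared side by side. The frame below is a made-up example in which sorting by label and sorting by value give different row orders.

```python
import pandas as pd

df = pd.DataFrame({"B": [2, 3, 1]}, index=["c", "a", "b"])

by_index = df.sort_index()                        # rows ordered by label: a, b, c
by_index_desc = df.sort_index(ascending=False)    # rows ordered by label: c, b, a
by_value = df.sort_values(by="B")                 # rows ordered by column B: 1, 2, 3

print(by_index)
print(by_index_desc)
print(by_value)
```

Both methods return a new frame and leave df unchanged.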

 

What do DataFrame.iloc and DataFrame.at do ?

iloc and at are indexers, used with square brackets rather than called like methods. iloc selects rows and columns by their integer positions, while at accesses a single value identified by its row label and column label.
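
A minimal sketch of both accessors, using square brackets and a frame filled with the integers 0 to 23 so the selected values are easy to check. The frame layout mirrors the date-indexed df built earlier, but the contents are an assumption made for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20230204", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

third_row = df.iloc[2]          # row at integer position 2, returned as a Series
block = df.iloc[0:2, 1:3]       # first two rows, columns at positions 1 and 2 (B and C)
value = df.at[dates[0], "A"]    # single value looked up by row label and column label

print(third_row)
print(block)
print(value)
```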

 

What does df1.fillna(value=5) do in pandas ?

It returns a copy of df1 in which every NaN value is replaced with 5. df1 itself is left unchanged.
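
A minimal sketch of this call on a small made-up frame. fillna returns a new frame, so the result has to be assigned to a name to be kept.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

filled = df1.fillna(value=5)   # new frame with each NaN replaced by 5

print(df1)      # still contains the NaN values
print(filled)
```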

 

What does df.groupby("A")[["C", "D"]].sum() do ?

It groups the dataframe by column "A" and then sums the columns "C" and "D".

 

What does df.groupby(["A", "B"]).sum() do ?

It groups the dataframe by columns A and B and sums the values in each group.
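
Both groupby calls above can be sketched on a small made-up frame. The column names follow the questions; the values are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
        "D": [10, 20, 30, 40],
    }
)

by_a = df.groupby("A")[["C", "D"]].sum()   # one row per value of A, summing C and D
by_ab = df.groupby(["A", "B"]).sum()       # one row per (A, B) pair, summing the remaining columns

print(by_a)
print(by_ab)
```

Grouping by two columns produces a result with a two-level index, so a single group is addressed with a tuple such as ("bar", "one").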

 

 

 



Copyright © 2022-2023. Anoop Johny. All Rights Reserved.