Python is one of the most popular and powerful tools for data analysis. With Python we can process, analyze, and interpret data to support better decision making and business operations.
This article discusses some important Python libraries that help analyze and visualize data effectively and efficiently.
- NumPy
- Pandas
- Matplotlib
- SciPy
- Seaborn
- scikit-learn
1. NumPy
NumPy is a numerical Python library that supports n-dimensional arrays and matrices. It is used for mathematical and scientific calculations.
Creating NumPy arrays:
# Import the NumPy library
import numpy as np
# Create a one-dimensional array
arr = np.array([1, 2, 3])
# Create a multidimensional (2 x 3) array
arr = np.array([[1, 2, 3],
                [4, 5, 6]])
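To illustrate what an n-dimensional array offers for calculation, here is a minimal sketch that inspects an array and applies element-wise arithmetic and aggregations. The variable names are chosen for illustration only.

```python
import numpy as np

# A 2 x 3 array with the same values as the multidimensional example above
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.shape)  # (2, 3): two rows, three columns
print(arr.ndim)   # 2: number of dimensions
print(arr.dtype)  # an integer type; the exact width depends on the platform

# Arithmetic is applied element by element, with no explicit loop
doubled = arr * 2

# Aggregations can run over the whole array or along one axis
total = arr.sum()               # 21
column_sums = arr.sum(axis=0)   # array([5, 7, 9])
row_means = arr.mean(axis=1)    # array([2., 5.])
```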
NumPy is a numeric module used for mathematical calculations. Pandas is used to read and write data files, and its DataFrames make data manipulation easy. Matplotlib displays data in graphical form, which is why it is called a plotting library.
2. Pandas
Pandas is a Python package mainly used in data analysis and machine learning projects. It is built on top of NumPy. Pandas is used to load data from different sources and to reshape and pivot data sets. We can transform and summarize a dataset as required. Pandas also supports merging and joining data.
Some commonly used functions and methods in Pandas:
Data Loading and Writing:
pd.read_csv()
: Read a CSV file and create a DataFrame.
pd.read_excel()
: Read an Excel file and create a DataFrame.
df.to_csv()
: Write a DataFrame to a CSV file.
df.to_excel()
: Write a DataFrame to an Excel file.
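As a rough sketch of the loading and writing functions, the example below round-trips a small table through CSV text. The sample data and column names are made up, and an in-memory `io.StringIO` buffer stands in for a real file so the snippet runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for the contents of a CSV file
csv_text = "name,city,sales\nAsha,Pune,120\nBen,Delhi,95\nChen,Pune,143\n"

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# to_csv returns the CSV as a string when no path is given;
# index=False leaves the row index out of the output
out_text = df.to_csv(index=False)

# Reading the written text back gives an equal DataFrame
df_again = pd.read_csv(io.StringIO(out_text))
print(df_again.equals(df))
```

With a real file, the same calls take a path instead, for example `pd.read_csv("data.csv")` and `df.to_csv("data.csv", index=False)`, where `data.csv` is a placeholder name.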
Data Exploration and Manipulation:
df.head()
: Display the top few rows of the DataFrame.
df.tail()
: Display the bottom few rows of the DataFrame.
df.shape
: Get the dimensions of the DataFrame (rows, columns).
df.info()
: Display a summary of the DataFrame, including data types and memory usage.
df.describe()
: Generate descriptive statistics of the DataFrame.
df.columns
: Access the column names of the DataFrame.
df.dtypes
: Get the data types of each column.
df[column].unique()
: Get the unique values in a column (unique() is a Series method, so select a column first).
df.nunique()
: Get the number of unique values in each column.
df.isnull()
: Check for missing values in the DataFrame.
df.dropna()
: Drop rows with missing values.
df.fillna()
: Fill missing values in the DataFrame.
df.groupby()
: Group the DataFrame by one or more columns.
df.sort_values()
: Sort the DataFrame by one or more columns.
df.merge()
: Merge two DataFrames based on a common column.
df.pivot_table()
: Create a pivot table from the DataFrame.
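The following sketch strings several of these exploration and manipulation methods together on a small, made-up sales table. The data, column names, and the choice to fill the missing value with the column mean are assumptions for illustration, not a recommended cleaning policy.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing sales value
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi", "Mumbai"],
    "product": ["A", "A", "B", "B", "A"],
    "sales": [120.0, 95.0, np.nan, 80.0, 60.0],
})

print(df.shape)             # (5, 3)
print(df["city"].unique())  # unique() is called on a single column
missing_per_column = df.isnull().sum()

# Fill the missing sales figure with the column mean instead of dropping the row
filled = df.fillna({"sales": df["sales"].mean()})

# Total sales per city, sorted from largest to smallest
per_city = filled.groupby("city")["sales"].sum().sort_values(ascending=False)

# Merge with a second, hypothetical table on the shared "city" column
regions = pd.DataFrame({"city": ["Pune", "Delhi", "Mumbai"],
                        "region": ["West", "North", "West"]})
merged = filled.merge(regions, on="city")

# Pivot: one row per city, one column per product
pivot = filled.pivot_table(index="city", columns="product",
                           values="sales", aggfunc="sum")
```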
Data Selection and Filtering:
df[column]
: Access a single column as a Series.
df[[column1, column2]]
: Access multiple columns as a DataFrame.
df.loc[row_label]
: Access a row by label.
df.iloc[row_index]
: Access a row by integer position.
df.loc[condition]
: Filter rows based on a condition.
df.query(condition)
: Filter rows using a query string.
df[(condition1) & (condition2)]
: Combine multiple conditions.
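A short sketch of these selection and filtering forms on a made-up table follows. The row labels `r1` to `r3` and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Mumbai"], "sales": [120, 95, 60]},
    index=["r1", "r2", "r3"],  # hypothetical row labels
)

sales = df["sales"]             # single column -> Series
subset = df[["city", "sales"]]  # list of columns -> DataFrame

row_by_label = df.loc["r2"]     # the row whose label is "r2"
row_by_position = df.iloc[0]    # the first row, by integer position

high = df.loc[df["sales"] > 90]  # filter with a boolean condition
high_q = df.query("sales > 90")  # the same filter as a query string

# Each condition needs its own parentheses when combined with & or |
pune_high = df[(df["sales"] > 90) & (df["city"] == "Pune")]
```

The parentheses matter because `&` binds more tightly than comparison operators in Python, so leaving them out changes how the expression is evaluated.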
Data Aggregation and Transformation:
df.groupby().agg()
: Perform aggregation operations on grouped data.
df.apply()
: Apply a function along an axis of the DataFrame.
df.transform()
: Perform transformation operations on DataFrame columns.
df.rename()
: Rename columns or index labels.
df.drop()
: Drop columns or rows from the DataFrame.
df.insert()
: Insert a new column into the DataFrame.
df.replace()
: Replace values in the DataFrame.
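The aggregation and transformation methods above might be combined as in the sketch below. The data and the new column names (`share`, `label`, `revenue`, `id`) are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
    "sales": [120, 95, 140, 80],
})

# Several aggregations per group at once
summary = df.groupby("city")["sales"].agg(["sum", "mean", "max"])

# transform returns a result aligned with the original rows;
# here it gives each row's share of its city's total
df["share"] = df["sales"] / df.groupby("city")["sales"].transform("sum")

# apply with axis=1 calls the function once per row
df["label"] = df.apply(lambda row: f"{row['city']}-{row['sales']}", axis=1)

# rename, replace, insert, and drop
df = df.rename(columns={"sales": "revenue"})
df["city"] = df["city"].replace({"Delhi": "New Delhi"})
df.insert(0, "id", list(range(1, len(df) + 1)))  # insert modifies df in place
df = df.drop(columns=["label"])
```

Note that `insert()` changes the DataFrame in place, while `rename()`, `replace()`, and `drop()` return new objects by default, which is why their results are assigned back.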
These are a few examples of the functions and methods available in Pandas. The library provides a wide range of capabilities for data manipulation, cleaning, analysis, and visualization. For more details and a comprehensive list of functions, you can refer to the official Pandas documentation: https://pandas.pydata.org/docs/
3. Matplotlib
Matplotlib is one of the most popular libraries for creating visualizations in Python. It provides a wide range of functionality for creating various types of plots, including line plots, bar plots, scatter plots, histograms, and more. It is built on NumPy arrays.
Line plots: To create a line plot, we can use the plot() function. Pass the values x and y as arguments to create a line connecting the data points.
# Import Matplotlib's pyplot interface
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.show()
Bar plots: Use the bar() or barh() function to create vertical or horizontal bar plots, respectively.
import matplotlib.pyplot as plt
x = ['A', 'B', 'C', 'D']
y = [10, 5, 7, 3]
plt.bar(x, y)
plt.show()
# create horizontal bar chart
plt.barh(x, y)
plt.show()
Scatter plots: Scatter plots are created using the scatter() function. They are useful for visualizing the relationship between two continuous variables.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.scatter(x, y)
plt.show()
Histograms: Use the hist() function to create histograms, which display the distribution of a single variable.
import matplotlib.pyplot as plt
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7]
plt.hist(data)
plt.show()
Matplotlib can be used in combination with other libraries like NumPy and Pandas to create more complex visualizations and perform data analysis tasks.
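As one possible illustration of combining the three libraries, the sketch below plots a column of a DataFrame as a bar chart and overlays the mean computed with NumPy. The monthly figures, labels, and styling are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [120, 95, 140, 80],
})

# Create a figure and draw one bar per row of the DataFrame
fig, ax = plt.subplots()
ax.bar(df["month"], df["sales"], label="Monthly sales")

# NumPy computes the mean, drawn here as a horizontal reference line
mean_sales = np.mean(df["sales"])
ax.axhline(mean_sales, color="red", linestyle="--", label="Mean")

ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Sales by month")
ax.legend()
```

Calling `plt.show()` afterwards displays the figure, as in the earlier examples.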