Thursday, April 15, 2021

[ Python FAQ ] How to display the value of each bar in a bar chart using Matplotlib?

 Source From Here

Preface
In this article, we are going to see how to display the value of each bar in a bar chart using Matplotlib. There are two different ways to display the values of each bar in a bar chart in matplotlib:
* Using the matplotlib.axes.Axes.text() function.
* Using the matplotlib.pyplot.text() function.


HowTo

Example 1: Using matplotlib.axes.Axes.text() function:
This function adds text at a given location in the chart. It is typically used with the pattern “for index, value in enumerate(iterable)”, where iterable is the list of bar values, so that each index, value pair can be used to place a label next to its bar:
import numpy as np
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [160, 167, 17, 130, 120, 40, 105, 70]
fig, ax = plt.subplots()
width = 0.75
ind = np.arange(len(y))

ax.barh(ind, y, width, color="green")

for i, v in enumerate(y):
    ax.text(v + 3, i + .25, str(v),
            color='blue', fontweight='bold')
plt.show()


Example 2: Use matplotlib.pyplot.text() function:
Call matplotlib.pyplot.barh(x, height) with x as a list of bar names and height as a list of bar values to create a bar chart. Use the syntax “for index, value in enumerate(iterable)” with iterable as the list of bar values to access each index, value pair in iterable. At each iteration, call matplotlib.pyplot.text(x, y, s) with x as value, y as index, and s as str(value) to label each bar with its size.
import matplotlib.pyplot as plt

x = ["A", "B", "C", "D"]
y = [1, 2, 3, 4]
plt.barh(x, y)

for index, value in enumerate(y):
    plt.text(value, index,
             str(value))

plt.show()



Sunday, April 11, 2021

[ Python FAQ ] How to sample a random number from a probability distribution in Python

 Source From Here

Question
A number selected randomly using a probability distribution will return a number according to the relative weights. For example, a number selected from [1, 2] with relative weights [.9, .1] will have a 90% chance of being 1 and a 10% chance of being 2.

HowTo
USE random.choices() TO SELECT A RANDOM NUMBER

Call random.choices(population, weights) to sample a random number from population based on the probability distribution weights. The weights are relative, meaning the percentage of each number being picked depends on what the weights sum to. e.g.:
import random
from collections import defaultdict

a_list = [1, 2]
distribution = [.9, .1]  # weights add up to 1
fdict = defaultdict(int)

loop_num = 1000
for i in range(loop_num):
    random_number = random.choices(a_list, distribution)[0]
    fdict[random_number] += 1

for n, f in fdict.items():
    print(f"number {n} appears with probability as {f*100/loop_num}%")
Output:
number 1 appears with probability as 91.4%
number 2 appears with probability as 8.6%


Sunday, April 4, 2021

[ Python FAQ ] Choose element(s) from List with different probability in Python

 Source From Here

Question
Have you ever wondered how to select random elements from a list with different probabilities in Python? In this article, we will discuss how to do exactly that. Let’s first consider the example below.
import random

sam_Lst = [10, 20, 3, 4, 100]
ran = random.choice(sam_Lst)
print(ran)
In the above example, the probability of getting any element from the list is equal. But suppose we want the probability of choosing each element to be different. This is known as weighted random choice in Python. There are two ways to make weighted random choices in Python:
* Relative weights
* Cumulative weights

HowTo
The function which will help us in this situation is random.choices(). This function makes weighted random choices in Python with replacement:
random.choices(population, weights=None, *, cum_weights=None, k=1)
Here, the ‘weights’ parameter plays an important role.

Case 1: Using Relative weights
The weight assigned to an element is known as relative weight.

Example 1:
import random

# Creating a number list
num_lst = [1, 22, 43, 19, 13, 29]

print(random.choices(num_lst, weights=(14, 25, 30, 45, 55, 10), k=6))
Output:
[19, 19, 13, 22, 13, 13]

In the above example, we assign weights to every element of the list. The weight of the element ‘13’ is the highest, i.e. 55, so the probability of its occurrence is the greatest. As we can see in the output, element 13 occurs 3 times, 19 occurs 2 times, and so on. So the probability of choosing an element from the list now differs from element to element.

Example 2:
import random

# Creating a name list
name_lst = ['October', 'November', 'December', 'January', 'March', 'June']

print(random.choices(name_lst, weights=(40, 25, 30, 5, 15, 80), k=3))
Output:
['June', 'October', 'June']

In the above example, the weight of element ‘June’ is the highest, so its probability of selection is the greatest. And here, k=3 means we draw three elements (with replacement) from the list.

Case 2: Using Cumulative weights
The cumulative weight of an element is determined by adding the weight of its previous element and its own weight.
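As a quick sketch of that definition (with arbitrary illustrative numbers), itertools.accumulate() turns a list of relative weights into cumulative weights, and random.choices() treats the two forms identically when given the same random seed:

```python
import random
from itertools import accumulate

rel_weights = [7, 6, 2, 5, 5]               # illustrative relative weights
cum_weights = list(accumulate(rel_weights)) # running totals
print(cum_weights)                          # [7, 13, 15, 20, 25]

# With a fixed seed, passing relative weights or their cumulative
# form to random.choices() yields the same draws.
population = [1, 22, 9, 3, 19]
random.seed(42)
a = random.choices(population, weights=rel_weights, k=5)
random.seed(42)
b = random.choices(population, cum_weights=cum_weights, k=5)
print(a == b)                               # True
```

Internally, random.choices() converts relative weights into cumulative ones anyway, so supplying cum_weights directly simply skips that step.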

Example 1:
import random

# Creating a number list
num_lst = [1, 22, 93, 19, 13, 25]

print(random.choices(num_lst, cum_weights=(7, 13, 15, 20, 25, 20), k=6))
Output:
[1, 22, 93, 22, 19, 1]

In the above example, each element’s effective weight is the difference between its cumulative weight and the one before it, so the elements with the largest jumps in cum_weights are the most likely to be selected. (Note that cum_weights is expected to be a non-decreasing sequence.)

[ ML Article Collection ] Medium - Feature Engineering Examples: Binning Categorical Features

 Preface

(article source / notebook) How to use NumPy or Pandas to quickly bin categorical features

Working with categorical data for machine learning (ML) purposes can sometimes present tricky issues. Ultimately these features need to be numerically encoded in some way so that an ML algorithm can actually work with them.

You’ll also want to consider additional methods for getting your categorical features ready for modeling. For example, your model performance may benefit from binning categorical features. This essentially means lumping multiple categories together into a single category. By applying domain knowledge, you may be able to engineer new categories and features that better represent the structure of your data.

In this post, we’ll briefly cover why binning categorical features can be beneficial. Then we’ll walk through three different methods for binning categorical features with specific examples using NumPy and Pandas.

Why Bin Categories?
With categorical features, you may encounter problems with rare labels, categories/groups that are extremely uncommon within your dataset. This issue is often related to features having high cardinality — in other words, many different categories.

Having too many categories, and especially rare categories, leads to a noisy dataset. It can be difficult for an ML algorithm to cut through this noise and learn from the more meaningful signals in the data.

High cardinality can also exacerbate the curse of dimensionality if you choose to one-hot encode your categorical features. If the original variable has 50 different categories, you’re basically adding 49 columns to your dataset.
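A minimal sketch of this effect, using a hypothetical column with 50 distinct values:

```python
import pandas as pd

# Hypothetical feature with 50 distinct categories
df = pd.DataFrame({'state': [f'S{i:02d}' for i in range(50)]})

# One-hot encoding replaces the single column with one column per category
dummies = pd.get_dummies(df['state'])
print(dummies.shape[1])   # 50
```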

Having too many categories can also lead to issues when training and testing your model. It’s completely possible that a category will show up in the test set, but not in the training set. Your model would have no idea how to handle that category because it has never “seen” it before.
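One common workaround, sketched here with hypothetical 'color' data, is to one-hot encode the test set and then reindex it to the training columns, so a category never seen in training is dropped rather than silently creating a column the model doesn't know:

```python
import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['blue', 'green']})   # 'green' never seen in training

train_dummies = pd.get_dummies(train['color'])
# Align test columns to the training columns; unseen categories vanish,
# missing categories are filled with 0
test_dummies = pd.get_dummies(test['color']).reindex(
    columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))   # ['blue', 'red']
```

Binning rare categories ahead of time, as this article does, reduces how often this situation comes up in the first place.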

One way to address these problems is by engineering new features that have fewer categories. This can be accomplished through binning (grouping) multiple categories into a single category.

In the following examples, we’ll be exploring and engineering features from a dataset with information about voter demographics and participation. I’ve selected 3 categorical variables to work with:
* party_cd: a registered voter’s political party affiliation
* voting_method: how a registered voter cast their ballot in the election
* birth_state: the U.S. state or territory where a registered voter was born

For our fake dataset, let's import necessary packages and create a function to generate fake dataset:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
from collections import OrderedDict

def gen_fake_df(k=20000):
    datas = []
    # https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States_by_population
    # https://www.ssa.gov/international/coc-docs/states.html
    us_states_weights = OrderedDict({
        'AL': 4921532,
        'AK': 731158,
        'AS': 19437,
        'AZ': 7421401,
        'AR': 3030522,
        'CA': 39368,
        'CO': 5807719,
        'CT': 3557006,
        'DE': 986809,
        'DC': 712816,
        'FL': 21733312,
        'GA': 10710017,
        'GU': 68485,
        'HI': 1407006,
        'ID': 1826913,
        'IL': 12587530,
        'IN': 6754953,
        'IA': 3163561,
        'KS': 2913805,
        'KY': 4477251,
        'LA': 4645318,
        'ME': 1350141,
        'MD': 6055802,
        'MA': 6893574,
        'MI': 9966555,
        'MN': 5657342,
        'MS': 2966786,
        'MO': 6151548,
        'MT': 1080577,
        'NE': 1937552,
        'NV': 3138259,
        'NH': 1366275,
        'NJ': 8882371,
        'NM': 2106319,
        'NY': 10600823*5,
        'NC': 10600823*9,
        'Missing': 10600823*7,
        'ND': 765309,
        'NP': 51433,
        'OH': 11693217,
        'OC': 10600823*2,
        'OK': 3980783,
        'OR': 4241507,
        'PA': 12783254,
        'PR': 189068,
        'SC': 5218040,
        'TX': 29360759,
        'UT': 3249879,
        'VA': 8590563,
        'WA': 7693612,
    })
    vote_method_weights = OrderedDict({
        'ABSENTEE ONESTOP': 8300,
        'NO VOTE': 3900,
        'IN PERSON': 2000,
        'ABSENTEE BY MAIL': 1900,
        'ABSENTEE CURBSIDE': 500,
        'PROVISIONAL': 60,
        'TRANSFER': 40,
        'CURBSIDE': 30,
    })
    for v, p, b in zip(
        random.choices(list(vote_method_weights.keys()), weights=tuple(vote_method_weights.values()), k=k),
        random.choices(['REP', 'UNA', 'DEM', 'LIB', 'CST', 'GRE'], weights=(6900, 5700, 4000, 100, 100, 90), k=k),
        random.choices(list(us_states_weights.keys()), weights=tuple(us_states_weights.values()), k=k),
    ):
        datas.append((v, p, b))
    df = pd.DataFrame(datas, columns=["voting_method", "party_cd", "birth_state"])
    return df

fake_df = gen_fake_df()
fake_df.head()


Using np.where() to Bin Categories
First, let’s check out why I chose party_cd. The image below shows how many individual voters belong to each political party:
plt.rcParams['figure.figsize'] = [6, 4]
fake_df['party_cd'].value_counts().plot(kind='bar', rot=0)


There are so few registered Libertarians, Constitutionalists, and members of the Green Party that we can barely see them on the graph. These would be good examples of rare labels. For the purposes of this post, we’ll define rare labels as those that make up less than 5% of observations. This is a common threshold for defining rare labels, but ultimately it’s at your discretion.
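Under that 5% rule, rare labels can be flagged directly with value_counts(normalize=True); the counts below are hypothetical numbers shaped like the chart above:

```python
import pandas as pd

# Hypothetical party counts echoing the proportions in the chart
s = pd.Series(['REP'] * 6900 + ['UNA'] * 5700 + ['DEM'] * 4000 +
              ['LIB'] * 100 + ['CST'] * 100 + ['GRE'] * 90)

freq = s.value_counts(normalize=True)   # fraction of observations per label
rare = freq[freq < 0.05].index.tolist() # labels below the 5% threshold
print(sorted(rare))                     # ['CST', 'GRE', 'LIB']
```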

Let’s look at a breakdown of the actual numbers:
def show_column_dist(df, column_name, top_n=-1):
    # https://stackoverflow.com/questions/50558458/pandas-get-frequency-of-item-occurrences-in-a-column-as-percentage
    a, row_count = df[column_name].value_counts(), df.shape[0]
    dist_df = pd.DataFrame(a.tolist(), columns=[column_name], index=a.index)
    dist_df['%'] = dist_df.apply(lambda e: 100*e[column_name]/row_count, axis=1)
    if top_n > 0:
        return dist_df.head(n=top_n)
    else:
        return dist_df

show_column_dist(fake_df, 'party_cd')


Those three categories each make up far less than 5% of the population. Even if we lumped them all together into a single category, that new category would still represent less than 1% of voters.

REP and DEM represent the two major political parties, whereas UNA represents voters that registered as unaffiliated with a political party. So here, it could make sense to lump our three rare labels into that unaffiliated group so that we have three categories: one for each of the two major parties, and a third representing individuals that chose not to align with either major party.

This can be accomplished very easily with np.where() which takes 3 arguments:
* a condition
* what to return if the condition is met
* what to return if the condition is not met

The following code creates a new feature, party_grp, from the original party_cd variable using np.where():
fake_df['party_grp'] = np.where(
    fake_df['party_cd'].isin(['REP', 'DEM']),
    fake_df['party_cd'].str.title(),
    'Other'
)

show_column_dist(fake_df, 'party_grp')


Mapping Categories into New Groups with map()
Next up, let’s take a look at the distribution of voting_method:
plt.rcParams['figure.figsize'] = [6, 4]
fake_df['voting_method'].value_counts().plot(kind='bar', rot=45)


Not the prettiest of graphs, but we get the picture. We have 8 different categories of voting method. I would hazard a guess that half of them meet our definition of rare labels.
show_column_dist(fake_df, 'voting_method')


Yup! Four of our categories are rare labels. Now we could just group them all into an “Other” category and call it a day, but this may not be the most appropriate method.

Based on research I did into how these methods are coded, I know that Absentee means someone voted early. So we could group any Absentee method into an Early category, group In-Person and Curbside into an Election Day category, leave No Vote as its own category, and group Provisional and Transfer into an Other category.

The following code accomplishes this by first defining a dictionary using the original voting_method categories as keys. The value for each key is the new category we actually want.
vote_method_map = {'ABSENTEE ONESTOP': 'Early',
                   'IN PERSON': 'Election Day',
                   'ABSENTEE BY MAIL': 'Early',
                   'ABSENTEE CURBSIDE': 'Early',
                   'TRANSFER': 'Other',
                   'PROVISIONAL': 'Other',
                   'CURBSIDE': 'Election Day',
                   'NO VOTE': 'No Vote'}

fake_df['vote_method_cat'] = fake_df['voting_method'].map(vote_method_map)
That last line creates a new column, vote_method_cat, based on the original values in the voting_method column. It does so by applying Pandas’ map() method to the original column, and feeding in our vote_method_map to translate from key to corresponding value.
show_column_dist(fake_df, 'vote_method_cat')
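One caveat worth knowing about this approach: Pandas’ map() returns NaN for any category missing from the dictionary. A small sketch with a hypothetical unmapped label:

```python
import pandas as pd

mapping = {'ABSENTEE ONESTOP': 'Early', 'IN PERSON': 'Election Day'}
s = pd.Series(['ABSENTEE ONESTOP', 'IN PERSON', 'WRITE-IN'])  # 'WRITE-IN' is hypothetical

mapped = s.map(mapping)
print(mapped.isna().sum())        # 1 -- unmapped keys become NaN

# Guard against silent NaNs by supplying a fallback category:
mapped = mapped.fillna('Other')
print(mapped.tolist())            # ['Early', 'Election Day', 'Other']
```

So it pays to double-check that every original category appears as a key in the mapping dictionary.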



Now we’ve gotten rid of all but one of our rare labels. Ultimately I chose to drop those 106 Other votes. Voting method was actually the target variable I was trying to predict and what I was really interested in was how people chose to vote. Provisional and transfer ballots are more reflective of the process and regulations surrounding voting, but my question was specifically about a voter’s active choice.

So not only can you think about engineering predictive features to better represent the underlying structure of your data, you can consider how best to represent your target variable relative to your specific question.

Applying a Custom Function with apply()
Finally, we’re going to work on binning birth_state. This variable has 57 categories: one for each state, one for missing information, one for each U.S. territory, and a final category for individuals born outside the United States.

So the graph looks comically terrible:
plt.rcParams['figure.figsize'] = [12, 5]
fake_df['birth_state'].value_counts().plot(kind='bar', rot=90)


If you ever see a graph like this while exploring categorical features, that’s a good indication you should consider binning that variable if you intend to use it as a feature in your model.

Below is the breakdown of the 15 most common categories of birth_state:
show_column_dist(fake_df, 'birth_state', top_n=15)


North Carolina NC is the most common state, which makes sense since this data is for voters in a specific county in NC. Then we see lots of missing values Missing. New Yorkers NY and people born outside the U.S. OC also make up a decent portion of the population. The remaining 53 categories are rare labels based on our definition and will introduce a lot of noise into our modeling efforts.

Let’s group states by U.S. Census region (Northeast, South, Midwest, West). We’ll also group people born in U.S. territories or outside the country into an “Other” group, and leave “Missing” as its own category.

We’ll do this by defining our own custom function to translate from state to region, then apply that function to our original variable to get our new feature. Here’s one way you could write a function to check each state and return the desired region/category:
## Define function for grouping birth state/country into categories
def get_birth_reg(state):

    # check if U.S. territory or out of country
    if state in ['AS', 'GU', 'MP', 'PR', 'VI', 'OC']:
        return 'Other'

    # the rest of the categories are based on U.S. Census Bureau regions
    elif state in ['CT', 'ME', 'MA', 'NH', 'RI', 'VT',
                   'NJ', 'NY', 'PA']:
        return 'Northeast'

    elif state in ['DE', 'FL', 'GA', 'MD', 'NC', 'SC', 'VA',
                   'DC', 'WV', 'AL', 'KY', 'MS', 'TN', 'AR',
                   'LA', 'OK', 'TX']:
        return 'South'

    elif state in ['IL', 'IN', 'MI', 'OH', 'WI',
                   'IA', 'KS', 'MN', 'MO', 'NE', 'ND', 'SD']:
        return 'Midwest'

    elif state in ['AZ', 'CO', 'ID', 'MT', 'NV', 'NM', 'UT',
                   'WY', 'AK', 'CA', 'HI', 'OR', 'WA']:
        return 'West'

    else:
        return 'Missing'
And now to use Pandas’ apply() method to create our new feature:
fake_df['birth_reg'] = fake_df['birth_state'].apply(get_birth_reg)
show_column_dist(fake_df, 'birth_reg')


Much better! We’ve gone from 57 total categories with 53 rare labels to only 6 categories.
