Monday, September 14, 2020

[ Python Article Collection ] Converting between libsvm/libffm and DataFrame formats

 Source From Here

Converting between libsvm and DataFrame formats

Converting libsvm to a DataFrame
    # Convert a libsvm file to a DataFrame
    import pandas as pd
    from sklearn.datasets import load_svmlight_file

    X_train, y_train = load_svmlight_file("libsvm_data.txt")
    mat = X_train.toarray()  # convert the sparse matrix to a dense ndarray

    df1 = pd.DataFrame(mat)
    df1.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

    df2 = pd.DataFrame(y_train)
    df2.columns = ['target']

    df = pd.concat([df2, df1], axis=1)  # target is the first column
    df.to_csv("df_data.txt", index=False)
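
To make the file layout concrete, here is a minimal pure-Python sketch of how one libsvm line (`label index:value ...`) expands into a dense row. The helper `parse_libsvm_line` and its `n_features` and `zero_based` parameters are hypothetical names used only for illustration; this is not how `load_svmlight_file` is implemented.

```python
def parse_libsvm_line(line, n_features, zero_based=False):
    """Parse one libsvm line into (label, dense_row).

    Hypothetical helper for illustration; assumes indices start at 1
    unless zero_based is True.
    """
    items = line.split()
    label = float(items[0])
    row = [0.0] * n_features          # features not listed stay at zero
    offset = 0 if zero_based else 1
    for item in items[1:]:
        index, value = item.split(":")
        row[int(index) - offset] = float(value)
    return label, row


label, row = parse_libsvm_line("1 1:5.1 3:1.4", n_features=4)
print(label, row)  # 1.0 [5.1, 0.0, 1.4, 0.0]
```

Features that do not appear on a line are implicitly zero, which is why the sparse matrix has to be densified before it is placed in a DataFrame.
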
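If the feature indices in a libsvm file are out of order, reading it directly with load_svmlight_file raises an error. Use the function below to rewrite each row so that its indices are in ascending order: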
    # Rewrite a libsvm file so that each row's feature indices are in ascending order
    def libsvm_index_order(input_file, out_file):
        with open(input_file, 'r') as f_in, open(out_file, 'w') as f_out:
            for line in f_in:
                items = line.strip().split()
                if not items:  # skip blank lines
                    continue
                features = {}
                for item in items[1:]:
                    key, value = item.split(":")
                    features[int(key)] = value
                # Sort numerically by feature index
                features_sort = sorted(features.items(), key=lambda k: k[0])
                row_order = items[0]  # the label comes first
                for index, value in features_sort:
                    row_order = row_order + " " + "{}:{}".format(index, value)

                f_out.write(row_order + "\n")

    input_file = "./ml-tag.train.libfm"
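
The effect of the reordering is easiest to see on a single line. The sketch below uses a hypothetical helper, `sort_libsvm_line`, that works on a string instead of a file. The point it illustrates is that indices must be compared as integers, otherwise `10` would sort before `2`.

```python
def sort_libsvm_line(line):
    """Return a libsvm line with its index:value pairs sorted by index.

    Hypothetical helper for illustration only.
    """
    items = line.split()
    label = items[0]
    pairs = [item.split(":") for item in items[1:]]
    pairs.sort(key=lambda kv: int(kv[0]))   # numeric, not lexicographic
    return " ".join([label] + [":".join(kv) for kv in pairs])


print(sort_libsvm_line("1 3:0.5 1:2 10:7 2:1.5"))  # 1 1:2 2:1.5 3:0.5 10:7
```

Unlike the dictionary-based function above, this sketch keeps duplicate indices instead of letting the last occurrence overwrite earlier ones.
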
Converting a DataFrame to libsvm
    # Convert a DataFrame to libsvm
    import pandas as pd
    from sklearn.datasets import dump_svmlight_file

    df = pd.read_csv("data.txt")  # the first field is target
    y = df.target  # y holds the label values
    dummy = pd.get_dummies(df.iloc[:, 1:])
    mat = dummy.to_numpy(dtype=float)  # as_matrix() was removed in pandas 1.0
    # zero_based defaults to True, which numbers the output features from 0
    dump_svmlight_file(mat, y, 'svm_output.libsvm', zero_based=False)
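
The difference made by `zero_based` can be shown with a small pure-Python sketch. The helper `row_to_libsvm` is a hypothetical illustration of the line layout, assuming zero entries are omitted as is conventional for this sparse format. It does not reproduce the exact number formatting that `dump_svmlight_file` writes.

```python
def row_to_libsvm(label, values, zero_based=False):
    """Format one dense row as a libsvm line, skipping zero entries.

    Hypothetical helper for illustration only.
    """
    start = 0 if zero_based else 1
    parts = [str(label)]
    for index, value in enumerate(values, start=start):
        if value != 0:
            parts.append("{}:{}".format(index, value))
    return " ".join(parts)


print(row_to_libsvm(1, [5.1, 0, 1.4, 0]))                   # 1 1:5.1 3:1.4
print(row_to_libsvm(1, [5.1, 0, 1.4, 0], zero_based=True))  # 1 0:5.1 2:1.4
```

Whichever convention is used when writing the file has to be matched when reading it back, otherwise every feature shifts by one column.
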
Converting a DataFrame to libffm format
    import numpy as np
    import pandas as pd
    from sklearn.datasets import make_classification

    class FFMFormatPandas:
        def __init__(self):
            self.field_index_ = None    # column name -> field id
            self.feature_index_ = None  # feature name -> feature id
            self.y = None               # name of the label column

        def fit(self, df, y=None):
            self.y = y
            # One field per column, excluding the label column
            df_ffm = df[df.columns.difference([self.y])]
            if self.field_index_ is None:
                self.field_index_ = {col: i for i, col in enumerate(df_ffm)}

            if self.feature_index_ is None:
                self.feature_index_ = dict()
                last_idx = 0
            else:
                # Continue numbering after the largest id already assigned
                last_idx = max(self.feature_index_.values()) + 1

            for col in df.columns:
                vals = df[col].unique()
                for val in vals:
                    if pd.isnull(val):
                        continue
                    name = '{}_{}'.format(col, val)
                    if name not in self.feature_index_:
                        self.feature_index_[name] = last_idx
                        last_idx += 1
                # The column itself also gets an id, used for integer columns
                self.feature_index_[col] = last_idx
                last_idx += 1
            return self

        def fit_transform(self, df, y=None):
            self.fit(df, y)
            return self.transform(df)

        def transform_row_(self, row, t):
            ffm = []
            # The label comes first; use 0 when no label column was given
            if self.y is not None:
                ffm.append(str(row[self.y]))
            else:
                ffm.append(str(0))

            for col, val in row.loc[row.index != self.y].to_dict().items():
                col_type = t[col]
                name = '{}_{}'.format(col, val)
                if col_type.kind == 'O':
                    # Object (categorical) column: field:feature:1
                    ffm.append('{}:{}:1'.format(self.field_index_[col], self.feature_index_[name]))
                elif col_type.kind == 'i':
                    # Integer column: field:feature:value
                    ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], val))
                # Columns of any other dtype (e.g. float) are skipped
            return ' '.join(ffm)

        def transform(self, df):
            t = df.dtypes.to_dict()
            return pd.Series({idx: self.transform_row_(row, t) for idx, row in df.iterrows()})

    ########################### Let's build some data and test ############################

    train, y = make_classification(n_samples=100, n_features=5, n_informative=2, n_redundant=2, n_classes=2, random_state=42)

    train = pd.DataFrame(train, columns=['int1', 'int2', 'int3', 's1', 's2'])
    train['int1'] = train['int1'].map(int)
    train['int2'] = train['int2'].map(int)
    train['int3'] = train['int3'].map(int)
    train['s1'] = round(np.log(abs(train['s1'] + 1))).map(str)
    train['s2'] = round(np.log(abs(train['s2'] + 1))).map(str)
    train['clicked'] = y

    ffm_train = FFMFormatPandas()
    ffm_train_data = ffm_train.fit_transform(train, y='clicked')
    print('Base data')
    print(train[0:10])
    print('FFM data')
    print(ffm_train_data[0:10])
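
Each string produced by `transform` has the shape `label field:feature:value ...`. As a quick way to sanity-check such output, the following sketch parses one line back into its parts. The helper `parse_ffm_line` and the sample line are hypothetical and are not part of the class above.

```python
def parse_ffm_line(line):
    """Split a libffm line into (label, [(field, feature, value), ...]).

    Hypothetical helper for illustration only.
    """
    items = line.split()
    label = items[0]
    triples = []
    for item in items[1:]:
        field, feature, value = item.split(":")
        triples.append((int(field), int(feature), float(value)))
    return label, triples


label, triples = parse_ffm_line("1 0:3:1 1:7:-2 3:12:1")
print(label)    # 1
print(triples)  # [(0, 3, 1.0), (1, 7, -2.0), (3, 12, 1.0)]
```

A token that does not contain exactly two colons raises a `ValueError` here, which makes malformed rows easy to spot before they are passed to a libffm trainer.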

This message was edited 4 times. Last update was at 15/09/2020 10:12:01
