Credits: this code and documentation were adapted from Paul Butler's sklearn-pandas
DataFrames
It is possible to use a dataframe as a training set, but it needs to be converted to an array first. DataFrameMapper is used to specify how this conversion proceeds. For example, PCA might be applied to some numerical dataframe columns, and one-hot encoding to a categorical column.
Transformation Mapping
Consider this dataset:
```julia
using ScikitLearn
using DataFrames
@sk_import preprocessing: (LabelBinarizer, StandardScaler)

data = DataFrame(pet=["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
                 children=[4., 6, 3, 3, 2, 3, 5, 4],
                 salary=[90, 24, 44, 27, 32, 59, 36, 27])
```
|   | pet  | children | salary |
|---|------|----------|--------|
| 1 | cat  | 4.0      | 90     |
| 2 | dog  | 6.0      | 24     |
| 3 | dog  | 3.0      | 44     |
| 4 | fish | 3.0      | 27     |
| 5 | cat  | 2.0      | 32     |
| 6 | dog  | 3.0      | 59     |
| 7 | cat  | 5.0      | 36     |
| 8 | fish | 4.0      | 27     |
Map the Columns to Transformations
The mapper takes a list of pairs. The first element of each pair is a column name from the DataFrame, or a list containing one or more column names (we will see an example with multiple columns later). The second element is an object that performs the transformation to be applied to that column:
Note: ScikitLearn.DataFrameMapper won't be available until DataFrames is imported.
```julia
mapper = DataFrameMapper([(:pet, LabelBinarizer()),
                          ([:children], StandardScaler())]);
```
The difference between specifying the column selector as :column (a single symbol) and as [:column] (a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one-dimensional array will be passed, while in the second case it will be a two-dimensional array with one column, i.e. a column vector.
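To see the difference concretely, here is a minimal sketch (not a cell from the original; hcat is used only to imitate the 2-D shape the mapper produces, it is not part of the mapper API):

```julia
size(data[:children])        # (8,): the 1-D shape a :children selector passes
size(hcat(data[:children]))  # (8, 1): the 2-D shape a [:children] selector passes
```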
Test the Transformation
We can use the fit_transform! shortcut to both fit the model and see what the transformed data looks like. In this and the other examples, output is rounded to two digits with round to account for floating-point differences across hardware:
```julia
round(fit_transform!(mapper, copy(data)), 2)
```
```
8x4 Array{Float64,2}:
 1.0  0.0  0.0   0.21
 0.0  1.0  0.0   1.88
 0.0  1.0  0.0  -0.63
 0.0  0.0  1.0  -0.63
 1.0  0.0  0.0  -1.46
 0.0  1.0  0.0  -0.63
 1.0  0.0  0.0   1.04
 0.0  0.0  1.0   0.21
```
Note that the first three columns are the output of the LabelBinarizer (corresponding to cat, dog, and fish respectively) and the fourth column is the standardized value for the number of children. In general, the columns are ordered according to the order given when the DataFrameMapper is constructed.
Now that the transformation is trained, we confirm that it works on new data:
```julia
sample = DataFrame(pet=["cat"], children=[5.])
round(transform(mapper, sample), 2)
```
```
1x4 Array{Float64,2}:
 1.0  0.0  0.0  1.04
```
Transform Multiple Columns
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
```julia
@sk_import decomposition: PCA

mapper2 = DataFrameMapper([([:children, :salary], PCA(1))]);
```
Now running fit_transform! will run PCA on the children and salary columns and return the first principal component:
```julia
round(fit_transform!(mapper2, copy(data)), 1)
```
```
8x1 Array{Float64,2}:
  47.6
 -18.4
   1.6
 -15.4
 -10.4
  16.6
  -6.4
 -15.4
```
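The argument to PCA is the number of components to keep, so a hypothetical variant of mapper2 (not from the original) requesting two components returns an 8×2 matrix:

```julia
mapper2c = DataFrameMapper([([:children, :salary], PCA(2))])
size(fit_transform!(mapper2c, copy(data)))  # (8, 2): two principal components
```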
Multiple transformers for the same column
Multiple transformers can be applied to the same column by specifying them in a list; a chained example follows the output below:
```julia
@sk_import preprocessing: Imputer

# NaN marks the missing entry; Imputer fills it with the column mean.
mapper3 = DataFrameMapper([([:age], [Imputer()])])
fit_transform!(mapper3, DataFrame(age=[1.0, NaN, 3.0]))
```
```
3x1 Array{Float64,2}:
 1.0
 2.0
 3.0
```
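Chaining several transformers works the same way. A sketch (not a cell from the original notebook) that imputes the missing value and then standardizes the column:

```julia
mapper4 = DataFrameMapper([([:age], [Imputer(), StandardScaler()])])
# Imputation gives [1, 2, 3]; standardization then yields -1.22, 0.0, 1.22.
round(fit_transform!(mapper4, DataFrame(age=[1.0, NaN, 3.0])), 2)
```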
Columns that don’t need any transformation
Only columns that are listed in the DataFrameMapper are kept. To keep a column without applying any transformation to it, use nothing as the transformer:
```julia
mapper3 = DataFrameMapper([(:pet, LabelBinarizer()),
                           (:children, nothing)])
fit_transform!(mapper3, copy(data))
```
```
8x4 Array{Float64,2}:
 1.0  0.0  0.0  4.0
 0.0  1.0  0.0  6.0
 0.0  1.0  0.0  3.0
 0.0  0.0  1.0  3.0
 1.0  0.0  0.0  2.0
 0.0  1.0  0.0  3.0
 1.0  0.0  0.0  5.0
 0.0  0.0  1.0  4.0
```
Cross-validation
Now that we can combine features from a DataFrame, we may want to use cross-validation to see whether our model works.
```julia
@sk_import linear_model: LinearRegression
using ScikitLearn.Pipelines: Pipeline
using ScikitLearn.CrossValidation: cross_val_score

pipe = Pipeline([("featurize", mapper),
                 ("lm", LinearRegression())])
round(cross_val_score(pipe, copy(data), data[:salary]), 2)
```
3-element Array{Float64,1}:
  -1.09
  -5.3
 -15.38
```

The default scorer for regression is R², so negative scores mean the pipeline predicts worse than a constant mean on these held-out folds, which is not surprising with only eight rows.
Iris Dataset
```julia
using RDatasets: dataset

iris = dataset("datasets", "iris")
```
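Fitting a logistic regression on the four feature columns follows the ScikitLearn.jl quickstart. A minimal sketch, assuming the standard iris feature and label columns from RDatasets; the X, y, and model bindings are reused by the cells below:

```julia
@sk_import linear_model: LogisticRegression

X = convert(Array, iris[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
y = convert(Array, iris[:Species])

model = fit!(LogisticRegression(), X, y)
println("accuracy: $(sum(predict(model, X) .== y) / length(y))")
```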
```
accuracy: 0.96
```
```julia
using ScikitLearn.CrossValidation: cross_val_score

cross_val_score(LogisticRegression(), X, y; cv=5)
```
```
5-element Array{Float64,1}:
 1.0
 0.966667
 0.933333
 0.9
 1.0
```
```julia
using ScikitLearn.GridSearch: GridSearchCV

gridsearch = GridSearchCV(LogisticRegression(), Dict(:C => 0.1:0.1:2.0))
fit!(gridsearch, X, y)
println("Best parameters: $(gridsearch.best_params_)")
```
```
Best parameters: Dict{Symbol,Any}(Pair{Symbol,Any}(:C, 1.1))
```
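Besides best_params_, the fitted search exposes the attributes documented in the help entry further below; a short sketch:

```julia
gridsearch.best_score_                  # mean cross-validated score of the best setting
predict(gridsearch.best_estimator_, X)  # by default the best model is refit on all the data
```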
```julia
using PyPlot

plot([cv_res.parameters[:C] for cv_res in gridsearch.grid_scores_],
     [mean(cv_res.cv_validation_scores) for cv_res in gridsearch.grid_scores_])
```
```
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x1a3b5abc50>
```
```julia
?GridSearchCV
```
search: GridSearchCV
Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a "fit" method and a "predict" method like any classifier, except that the parameters of the classifier used to predict are optimized by cross-validation.
Parameters
estimator : object type that implements the "fit" and "predict" methods An object of that type is instantiated for each grid point.
param_grid : dict or list of dictionaries Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
scoring : string, callable or None, optional, default: None A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).
fit_params : dict, optional Parameters to pass to the fit method.
n_jobs : int, default 1 Number of jobs to run in parallel.
pre_dispatch : int, or string, optional Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A string, giving an expression as a function of n_jobs, as in '2*n_jobs'
iid : boolean, default=True If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
cv : integer or cross-validation generator, default=3 If an integer is passed, it is the number of folds. Specific cross-validation objects can be passed, see sklearn.cross_validation module for the list of possible objects
refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.
verbose : integer Controls the verbosity: the higher, the more messages.
error_score : ‘raise’ (default) or numeric Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
Examples
```
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> svr = svm.SVC()
>>> clf = grid_search.GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
GridSearchCV(cv=None, error_score=...,
       estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
                     degree=..., gamma=..., kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=..., verbose=False),
       fit_params={}, iid=..., n_jobs=1, param_grid=...,
       pre_dispatch=..., refit=..., scoring=..., verbose=...)
```
Attributes
grid_scores_ : list of named tuples Contains scores for all parameter combinations in param_grid. Each entry corresponds to one parameter setting. Each named tuple has the attributes:

* ``parameters``, a dict of parameter settings
* ``mean_validation_score``, the mean score over the cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold
best_estimator_ : estimator Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
best_score_ : float Score of best_estimator_ on the left out data.
best_params_ : dict Parameter setting that gave the best results on the hold out data.
scorer_ : function Scorer function used on the held out data to choose the best parameters for the model.
Notes
The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead.
If n_jobs was set to a value higher than one, the data is copied for each point in the grid (and not n_jobs times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set pre_dispatch. Then, the memory is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2 * n_jobs.
See Also
:class:`ParameterGrid`: generates all the combinations of a hyperparameter grid.
:func:`sklearn.cross_validation.train_test_split`: utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.
The estimators imported via @sk_import are PyCall wrappers around Python objects, so PyCall's introspection tools apply directly:

```julia
using PyCall
@pyimport numpy
```
```julia
pytypeof(model[:predict](X))
```
```
PyObject <type 'numpy.ndarray'>
```
```julia
pyisinstance(model[:predict](X), numpy.ndarray)
```
```
true
```
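When a result stays a raw PyObject, as with the string predictions at the end of this section, one way to turn it into a Julia value is PyCall's PyAny conversion (a sketch):

```julia
labels = convert(PyAny, model[:predict](X))  # e.g. an Array of species names
```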
Python sklearn modules can also be imported directly:

```julia
@pyimport sklearn.decomposition as decomp
@pyimport sklearn.linear_model as lin  # `lin` binding used by the next cell
```
```julia
log_mod = lin.LinearRegression()[:fit](randn(50, 3), rand(0:1, 50))
log_mod[:predict](randn(50, 3))  # numeric numpy results convert to a Julia Array
```
```
50-element Array{Float64,1}:
0.162175
0.722845
0.362414
0.549103
0.763084
0.270936
0.485301
0.41325
0.393773
0.663121
0.595053
0.498473
0.42885
⋮
0.639132
0.702459
0.60041
0.229709
0.296202
0.34932
-0.0609968
0.680865
0.0816659
0.433398
0.463019
0.542642
```
```julia
m = decomp.PCA()[:fit](randn(20, 10))
m[:transform](randn(30, 10))  # project 30 new points onto the fitted components
```
```
30×10 Array{Float64,2}:
0.0492857 -0.740589 0.0282217 … 0.607899 0.582307 1.04095
-0.766687 -0.00563516 -0.972977 -0.111523 -1.61246 1.21119
-0.304345 -0.195561 0.523944 1.17019 -0.159978 -2.34993
-1.5165 0.725348 -0.589536 -0.077829 0.384159 -2.53808
0.10421 -0.592984 -0.579348 0.581532 0.0428926 -0.231679
-1.92898 1.20708 -1.7819 … -1.17005 -0.994895 -1.03124
-0.436318 0.954802 -0.300581 -0.849914 2.33628 0.810238
-1.10481 0.913266 0.872128 -0.0772121 -0.562754 -0.689544
-0.456214 0.16222 0.0980988 -0.773041 -0.764416 0.683189
-0.273398 1.21917 0.627378 -1.00523 0.514595 -2.58339
0.570575 0.247493 -1.07453 … -0.366462 0.36321 -0.639821
-0.233897 -1.67526 -1.66674 1.33278 0.122427 1.04181
-0.624457 0.741128 -1.67955 -0.311017 1.28079 -0.685025
⋮ ⋱
-0.388898 -2.29054 0.498946 -2.33793 1.6837 -0.488352
0.403224 0.317897 -0.290501 -1.13959 0.55129 1.11115
0.857659 -0.918163 -0.0997422 … -1.3764 0.581696 -0.0762093
0.304933 0.549334 1.4302 0.335545 -0.934283 0.110987
-0.255592 -1.78458 -0.831169 0.572008 1.22434 0.335873
0.31237 -1.03264 -1.22471 -0.258204 -0.299442 -0.995692
0.405265 -0.508662 0.657679 -0.404234 1.092 -0.439828
0.390309 -1.77777 0.132518 … 2.10857 -1.26122 0.442358
-0.687856 1.27136 1.76452 -1.73486 -0.400697 0.084941
-0.963835 -0.826341 -1.03699 -0.11351 2.36443 -1.27699
-1.94291 -1.02019 0.786294 -2.08115 -0.954 -1.44567
-0.926151   1.12885   -0.56298   -1.75913    0.183775  -0.987595
```
```julia
model[:predict](X)
```
```
PyObject array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'virginica', 'versicolor', 'versicolor', 'versicolor',
'virginica', 'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica',
'virginica', 'virginica', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'versicolor', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica'],
dtype='|S10')
```