Credits: this code and documentation were adapted from Paul Butler's sklearn-pandas
DataFrames
It is possible to use a dataframe as a training set, but it needs to be converted to an array first. DataFrameMapper is used to specify how this conversion proceeds. For example, PCA might be applied to some numerical dataframe columns, and one-hot encoding to a categorical column.
Transformation Mapping
Consider this dataset:
```julia
using ScikitLearn
using DataFrames
@sk_import preprocessing: (LabelBinarizer, StandardScaler)

data = DataFrame(pet=["cat", "dog", "dog", "fish", "cat", "dog", "cat", "fish"],
                 children=[4., 6, 3, 3, 2, 3, 5, 4],
                 salary=[90, 24, 44, 27, 32, 59, 36, 27])
```
|   | pet  | children | salary |
|---|------|----------|--------|
| 1 | cat  | 4.0      | 90     |
| 2 | dog  | 6.0      | 24     |
| 3 | dog  | 3.0      | 44     |
| 4 | fish | 3.0      | 27     |
| 5 | cat  | 2.0      | 32     |
| 6 | dog  | 3.0      | 59     |
| 7 | cat  | 5.0      | 36     |
| 8 | fish | 4.0      | 27     |
Map the Columns to Transformations
The mapper takes a list of pairs. The first element of each pair is a column name from the DataFrame, or a list containing one or more column names (we will see an example with multiple columns later). The second element is an object that performs the transformation to be applied to that column:
Note: ScikitLearn.DataFrameMapper won't be available until DataFrames is imported.
```julia
mapper = DataFrameMapper([(:pet, LabelBinarizer()),
                          ([:children], StandardScaler())]);
```
The difference between specifying the column selector as :column (a single symbol) and as [:column] (a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one-dimensional array will be passed, while in the second case it will be a two-dimensional array with one column, i.e. a column vector.
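To see the difference concretely, here is a minimal sketch (not a cell from the original; hcat is used only to imitate the 2-D shape the mapper produces, it is not part of the mapper API):

```julia
size(data[:children])        # (8,): the 1-D shape a :children selector passes
size(hcat(data[:children]))  # (8, 1): the 2-D shape a [:children] selector passes
```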
Test the Transformation
We can use the fit_transform! shortcut to both fit the model and see what the transformed data looks like. In this and the other examples, output is rounded to two digits with round to account for floating-point differences across hardware:
```julia
round(fit_transform!(mapper, copy(data)), 2)
```
```
8x4 Array{Float64,2}:
 1.0  0.0  0.0   0.21
 0.0  1.0  0.0   1.88
 0.0  1.0  0.0  -0.63
 0.0  0.0  1.0  -0.63
 1.0  0.0  0.0  -1.46
 0.0  1.0  0.0  -0.63
 1.0  0.0  0.0   1.04
 0.0  0.0  1.0   0.21
```
Note that the first three columns are the output of the LabelBinarizer (corresponding to cat, dog, and fish respectively) and the fourth column is the standardized value for the number of children. In general, the columns are ordered according to the order given when the DataFrameMapper is constructed.
Now that the transformation is trained, we confirm that it works on new data:
```julia
sample = DataFrame(pet=["cat"], children=[5.])
round(transform(mapper, sample), 2)
```
```
1x4 Array{Float64,2}:
 1.0  0.0  0.0  1.04
```
Transform Multiple Columns
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
```julia
@sk_import decomposition: PCA

mapper2 = DataFrameMapper([([:children, :salary], PCA(1))]);
```
Now running fit_transform! will run PCA on the children and salary columns and return the first principal component:
```julia
round(fit_transform!(mapper2, copy(data)), 1)
```
```
8x1 Array{Float64,2}:
  47.6
 -18.4
   1.6
 -15.4
 -10.4
  16.6
  -6.4
 -15.4
```
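The argument to PCA is the number of components to keep, so a hypothetical variant of mapper2 (not from the original) requesting two components returns an 8×2 matrix:

```julia
mapper2c = DataFrameMapper([([:children, :salary], PCA(2))])
size(fit_transform!(mapper2c, copy(data)))  # (8, 2): two principal components
```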
Multiple transformers for the same column
Multiple transformers can be applied to the same column by specifying them in a list; a chained example follows the output below:
```julia
@sk_import preprocessing: Imputer

# NaN marks the missing entry; Imputer fills it with the column mean.
mapper3 = DataFrameMapper([([:age], [Imputer()])])
fit_transform!(mapper3, DataFrame(age=[1.0, NaN, 3.0]))
```
```
3x1 Array{Float64,2}:
 1.0
 2.0
 3.0
```
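Chaining several transformers works the same way. A sketch (not a cell from the original notebook) that imputes the missing value and then standardizes the column:

```julia
mapper4 = DataFrameMapper([([:age], [Imputer(), StandardScaler()])])
# Imputation gives [1, 2, 3]; standardization then yields -1.22, 0.0, 1.22.
round(fit_transform!(mapper4, DataFrame(age=[1.0, NaN, 3.0])), 2)
```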
Columns that don’t need any transformation
Only columns that are listed in the DataFrameMapper are kept. To keep a column without applying any transformation to it, use nothing as the transformer:
```julia
mapper3 = DataFrameMapper([(:pet, LabelBinarizer()),
                           (:children, nothing)])
fit_transform!(mapper3, copy(data))
```
```
8x4 Array{Float64,2}:
 1.0  0.0  0.0  4.0
 0.0  1.0  0.0  6.0
 0.0  1.0  0.0  3.0
 0.0  0.0  1.0  3.0
 1.0  0.0  0.0  2.0
 0.0  1.0  0.0  3.0
 1.0  0.0  0.0  5.0
 0.0  0.0  1.0  4.0
```
Cross-validation
Now that we can combine features from a DataFrame, we may want to use cross-validation to see whether our model works.
```julia
@sk_import linear_model: LinearRegression
using ScikitLearn.Pipelines: Pipeline
using ScikitLearn.CrossValidation: cross_val_score

pipe = Pipeline([("featurize", mapper),
                 ("lm", LinearRegression())])
round(cross_val_score(pipe, copy(data), data[:salary]), 2)
```
3-element Array{Float64,1}:
  -1.09
  -5.3
 -15.38
```

The default scorer for regression is R², so negative scores mean the pipeline predicts worse than a constant mean on these held-out folds, which is not surprising with only eight rows.
Iris Dataset
```julia
using RDatasets: dataset

iris = dataset("datasets", "iris")
```
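Fitting a logistic regression on the four feature columns follows the ScikitLearn.jl quickstart. A minimal sketch, assuming the standard iris feature and label columns from RDatasets; the X, y, and model bindings are reused by the cells below:

```julia
@sk_import linear_model: LogisticRegression

X = convert(Array, iris[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
y = convert(Array, iris[:Species])

model = fit!(LogisticRegression(), X, y)
println("accuracy: $(sum(predict(model, X) .== y) / length(y))")
```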
```
accuracy: 0.96
```
```julia
using ScikitLearn.CrossValidation: cross_val_score

cross_val_score(LogisticRegression(), X, y; cv=5)
```
```
5-element Array{Float64,1}:
 1.0
 0.966667
 0.933333
 0.9
 1.0
```
```julia
using ScikitLearn.GridSearch: GridSearchCV

gridsearch = GridSearchCV(LogisticRegression(), Dict(:C => 0.1:0.1:2.0))
fit!(gridsearch, X, y)
println("Best parameters: $(gridsearch.best_params_)")
```
```
Best parameters: Dict{Symbol,Any}(Pair{Symbol,Any}(:C, 1.1))
```
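Besides best_params_, the fitted search exposes the attributes documented in the help entry further below; a short sketch:

```julia
gridsearch.best_score_                  # mean cross-validated score of the best setting
predict(gridsearch.best_estimator_, X)  # by default the best model is refit on all the data
```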
```julia
using PyPlot

plot([cv_res.parameters[:C] for cv_res in gridsearch.grid_scores_],
     [mean(cv_res.cv_validation_scores) for cv_res in gridsearch.grid_scores_])
```
```
1-element Array{PyCall.PyObject,1}:
 PyObject <matplotlib.lines.Line2D object at 0x1a3b5abc50>
```
```julia
?GridSearchCV
```
search: GridSearchCV
Exhaustive search over specified parameter values for an estimator.
Important members are fit, predict.
GridSearchCV implements a "fit" method and a "predict" method like any classifier, except that the parameters of the classifier used to predict are optimized by cross-validation.
Parameters
estimator : object type that implements the "fit" and "predict" methods An object of that type is instantiated for each grid point.
param_grid : dict or list of dictionaries Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
scoring : string, callable or None, optional, default: None A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).
fit_params : dict, optional Parameters to pass to the fit method.
n_jobs : int, default 1 Number of jobs to run in parallel.
pre_dispatch : int, or string, optional Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs
- An int, giving the exact number of total jobs that are spawned
- A string, giving an expression as a function of n_jobs, as in '2*n_jobs'
iid : boolean, default=True If True, the data is assumed to be identically distributed across the folds, and the loss minimized is the total loss per sample, and not the mean loss across the folds.
cv : integer or cross-validation generator, default=3 If an integer is passed, it is the number of folds. Specific cross-validation objects can be passed, see sklearn.cross_validation module for the list of possible objects
refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.
verbose : integer Controls the verbosity: the higher, the more messages.
error_score : ‘raise’ (default) or numeric Value to assign to the score if an error occurs in estimator fitting. If set to ‘raise’, the error is raised. If a numeric value is given, FitFailedWarning is raised. This parameter does not affect the refit step, which will always raise the error.
Examples
```
>>> from sklearn import svm, grid_search, datasets
>>> iris = datasets.load_iris()
>>> parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
>>> svr = svm.SVC()
>>> clf = grid_search.GridSearchCV(svr, parameters)
>>> clf.fit(iris.data, iris.target)
...                             # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS
GridSearchCV(cv=None, error_score=...,
       estimator=SVC(C=1.0, cache_size=..., class_weight=..., coef0=...,
                     degree=..., gamma=..., kernel='rbf', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=..., verbose=False),
       fit_params={}, iid=..., n_jobs=1, param_grid=...,
       pre_dispatch=..., refit=..., scoring=..., verbose=...)
```
Attributes
grid_scores_ : list of named tuples Contains scores for all parameter combinations in param_grid. Each entry corresponds to one parameter setting. Each named tuple has the attributes:

* ``parameters``, a dict of parameter settings
* ``mean_validation_score``, the mean score over the cross-validation folds
* ``cv_validation_scores``, the list of scores for each fold
best_estimator_ : estimator Estimator that was chosen by the search, i.e. estimator which gave highest score (or smallest loss if specified) on the left out data. Not available if refit=False.
best_score_ : float Score of best_estimator_ on the left out data.
best_params_ : dict Parameter setting that gave the best results on the hold out data.
scorer_ : function Scorer function used on the held out data to choose the best parameters for the model.
Notes
The parameters selected are those that maximize the score of the left out data, unless an explicit score is passed in which case it is used instead.
If n_jobs was set to a value higher than one, the data is copied for each point in the grid (and not n_jobs times). This is done for efficiency reasons if individual jobs take very little time, but may raise errors if the dataset is large and not enough memory is available. A workaround in this case is to set pre_dispatch. Then, the memory is copied only pre_dispatch many times. A reasonable value for pre_dispatch is 2 * n_jobs.
See Also
:class:`ParameterGrid`: generates all the combinations of a hyperparameter grid.
:func:`sklearn.cross_validation.train_test_split`: utility function to split the data into a development set usable for fitting a GridSearchCV instance and an evaluation set for its final evaluation.
:func:`sklearn.metrics.make_scorer`: Make a scorer from a performance metric or loss function.
The estimators imported via @sk_import are PyCall wrappers around Python objects, so PyCall's introspection tools apply directly:

```julia
using PyCall
@pyimport numpy
```
```julia
pytypeof(model[:predict](X))
```
```
PyObject <type 'numpy.ndarray'>
```
```julia
pyisinstance(model[:predict](X), numpy.ndarray)
```
```
true
```
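When a result stays a raw PyObject, as with the string predictions at the end of this section, one way to turn it into a Julia value is PyCall's PyAny conversion (a sketch):

```julia
labels = convert(PyAny, model[:predict](X))  # e.g. an Array of species names
```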
Python sklearn modules can also be imported directly:

```julia
@pyimport sklearn.decomposition as decomp
@pyimport sklearn.linear_model as lin  # `lin` binding used by the next cell
```
```julia
log_mod = lin.LinearRegression()[:fit](randn(50, 3), rand(0:1, 50))
log_mod[:predict](randn(50, 3))  # numeric numpy results convert to a Julia Array
```
```
50-element Array{Float64,1}:
0.162175
0.722845
0.362414
0.549103
0.763084
0.270936
0.485301
0.41325
0.393773
0.663121
0.595053
0.498473
0.42885
⋮
0.639132
0.702459
0.60041
0.229709
0.296202
0.34932
-0.0609968
0.680865
0.0816659
0.433398
0.463019
0.542642
```
```julia
m = decomp.PCA()[:fit](randn(20, 10))
m[:transform](randn(30, 10))  # project 30 new points onto the fitted components
```
```
30×10 Array{Float64,2}:
0.0492857 -0.740589 0.0282217 … 0.607899 0.582307 1.04095
-0.766687 -0.00563516 -0.972977 -0.111523 -1.61246 1.21119
-0.304345 -0.195561 0.523944 1.17019 -0.159978 -2.34993
-1.5165 0.725348 -0.589536 -0.077829 0.384159 -2.53808
0.10421 -0.592984 -0.579348 0.581532 0.0428926 -0.231679
-1.92898 1.20708 -1.7819 … -1.17005 -0.994895 -1.03124
-0.436318 0.954802 -0.300581 -0.849914 2.33628 0.810238
-1.10481 0.913266 0.872128 -0.0772121 -0.562754 -0.689544
-0.456214 0.16222 0.0980988 -0.773041 -0.764416 0.683189
-0.273398 1.21917 0.627378 -1.00523 0.514595 -2.58339
0.570575 0.247493 -1.07453 … -0.366462 0.36321 -0.639821
-0.233897 -1.67526 -1.66674 1.33278 0.122427 1.04181
-0.624457 0.741128 -1.67955 -0.311017 1.28079 -0.685025
⋮ ⋱
-0.388898 -2.29054 0.498946 -2.33793 1.6837 -0.488352
0.403224 0.317897 -0.290501 -1.13959 0.55129 1.11115
0.857659 -0.918163 -0.0997422 … -1.3764 0.581696 -0.0762093
0.304933 0.549334 1.4302 0.335545 -0.934283 0.110987
-0.255592 -1.78458 -0.831169 0.572008 1.22434 0.335873
0.31237 -1.03264 -1.22471 -0.258204 -0.299442 -0.995692
0.405265 -0.508662 0.657679 -0.404234 1.092 -0.439828
0.390309 -1.77777 0.132518 … 2.10857 -1.26122 0.442358
-0.687856 1.27136 1.76452 -1.73486 -0.400697 0.084941
-0.963835 -0.826341 -1.03699 -0.11351 2.36443 -1.27699
-1.94291 -1.02019 0.786294 -2.08115 -0.954 -1.44567
-0.926151   1.12885   -0.56298   -1.75913    0.183775  -0.987595
```
```julia
model[:predict](X)
```
```
PyObject array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'virginica', 'versicolor', 'versicolor', 'versicolor',
'virginica', 'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica',
'virginica', 'virginica', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'versicolor',
'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'versicolor', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
'virginica', 'virginica', 'virginica'],
dtype='|S10')
```