Dublin R El Dorado Competition

Dublin R El Dorado Competition

Posted on Posted in Business, Data Analysis Resources

Quick Summary

5,000 data points with pseudo-geological information (including proven gold reserves).

50 (or more) new sites up for auction with limited data.

Each team starts with $50,000,000.00 budget to bid with.

Blind, sealed-bid auctions for rights to mine the parcel of land (min bid $100,000.00)

Auctions happen in order by parcel_id

Extraction costs are non-trivial.

Winning team has most cash at the end.

The link for the competition

In [1]:
import numpy as np
import pandas as pd
In [2]:
#Load data
costs_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\costs_data.csv")
elevation_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\elevation_data.csv")
sample_data = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\sample_data.csv")
In [3]:
#merge sample_data and elevation datasets on parcel_id
train = pd.merge(sample_data,elevation_data, on='parcel_id', how = 'inner',suffixes=('_left', '_right'))
#drop duplicate
train.rename(columns={'Easting_left': 'Easting','Northing_left': 'Northing'}, inplace=True)
train = train.drop(['Easting_right','Northing_right'],axis=1)
In [4]:
#Join the the test and train dataframes
def get_combined_data():
    # reading train data
    train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\train_data.csv")

    
    # reading test data
    test = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\auction_parcels.csv")

    # extracting and then removing the targets from the training data 
    targets = train.gold_available
    train.drop('gold_available',1,inplace=True)
    

    # merging train data and test data for future feature engineering
    combined = train.append(test)
    combined.reset_index(inplace=True)
    combined.drop('index',inplace=True,axis=1)
   
    return combined
combined = get_combined_data()

Same as total in sample_data and auctin_parcels

In [7]:
#drop unnamed column
combined.drop('Unnamed: 0',inplace=True,axis=1)
In [10]:
combined.tail(2)
Out[10]:
Arkonor Bistot Crokite Easting Gneiss Hedbergite Isogen Megacyte Mercoxit Mexallon Nocxium Northing Plagioclase Pyerite Spudumain Tritanium Veldspar elevation parcel_id
5048 103.044603 105.965291 104.451764 106.5 NaN NaN NaN NaN 108.523457 NaN NaN 146.5 NaN NaN NaN NaN NaN 185.783466 22007
5049 100.591409 119.167135 114.396146 63.5 NaN NaN NaN NaN 117.289582 NaN NaN 148.5 NaN NaN NaN NaN NaN -128.664883 22264
In [9]:
combined.shape
Out[9]:
(5050, 19)

Fill in the NaN values with mean. (mean and median is the same)

In [11]:
combined["Gneiss"].fillna(combined["Gneiss"].mean(), inplace=True)
combined["Hedbergite"].fillna(combined["Hedbergite"].mean(), inplace=True)
combined["Isogen"].fillna(combined["Isogen"].mean(), inplace=True)
combined["Mexallon"].fillna(combined["Mexallon"].mean(), inplace=True)
combined["Nocxium"].fillna(combined["Nocxium"].mean(), inplace=True)
combined["Plagioclase"].fillna(combined["Plagioclase"].mean(), inplace=True)
combined["Pyerite"].fillna(combined["Pyerite"].mean(), inplace=True)
combined["Spudumain"].fillna(combined["Spudumain"].mean(), inplace=True)
combined["Tritanium"].fillna(combined["Tritanium"].mean(), inplace=True)
combined["Veldspar"].fillna(combined["Veldspar"].mean(), inplace=True)
combined["Megacyte"].fillna(combined["Megacyte"].mean(), inplace=True)
In [12]:
print('Reading Training data')
print('\nSize of Training  data: ' + str(combined.shape))
print('Columns:' + str(combined.columns.values))

print('dtypes')
print('\n')
print(combined.dtypes)
print('\n')
print('Info: ')
print('\n')
print(combined.info)
print('Shape: ')
print('\n')
print(combined.shape)
print('\n')
print('numerical columns statistcs')
print('\n')
print(combined.describe())
import re
# Review input features (train set) - Part 2A
missing_values = []
nonumeric_values = []

print ("========================\n")

for column in combined:
    # Find all the unique feature values
    uniq = combined[column].unique()
    print ("'{}' has {} unique values" .format(column,uniq.size))
    
    
    # Find features with missing values
    if (True in pd.isnull(uniq)):
        s = "{} has {} missing" .format(column, pd.isnull(combined[column]).sum())
        missing_values.append(s)
    
    # Find features with non-numeric values
    for i in range (1, np.prod(uniq.shape)):
        if (re.match('nan', str(uniq[i]))):
            break
        if not (re.search('(^\d+\.?\d*$)|(^\d*\.?\d+$)', str(uniq[i]))):
            nonumeric_values.append(column)
            break
  
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
print ("Features with missing values:\n{}\n\n" .format(missing_values))
print ("Features with non-numeric values:\n{}" .format(nonumeric_values))
print ("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n")
Reading Training data

Size of Training  data: (5050, 19)
Columns:['Arkonor' 'Bistot' 'Crokite' 'Easting' 'Gneiss' 'Hedbergite' 'Isogen'
 'Megacyte' 'Mercoxit' 'Mexallon' 'Nocxium' 'Northing' 'Plagioclase'
 'Pyerite' 'Spudumain' 'Tritanium' 'Veldspar' 'elevation' 'parcel_id']
dtypes


Arkonor        float64
Bistot         float64
Crokite        float64
Easting        float64
Gneiss         float64
Hedbergite     float64
Isogen         float64
Megacyte       float64
Mercoxit       float64
Mexallon       float64
Nocxium        float64
Northing       float64
Plagioclase    float64
Pyerite        float64
Spudumain      float64
Tritanium      float64
Veldspar       float64
elevation      float64
parcel_id        int64
dtype: object


Info: 


<bound method DataFrame.info of          Arkonor      Bistot     Crokite  Easting      Gneiss  Hedbergite  \
0     105.515728   94.116871  108.024488      3.5  114.181143  130.352913   
1     116.793372  130.856900  113.132975      8.5  119.666173  128.958155   
...................
...................
...................
5048   185.783466      22007  
5049  -128.664883      22264  

[5050 rows x 19 columns]>
Shape: 


(5050, 19)


numerical columns statistcs


           Arkonor       Bistot      Crokite      Easting       Gneiss  \
count  5050.000000  5050.000000  5050.000000  5050.000000  5050.000000   
mean    106.709128   112.246741   108.443258    74.849901   113.603848   
std       7.392234    11.814510     5.667297    43.485051    11.728807   
min      76.066022    62.569692    87.749107     0.500000    73.248565   
25%     101.730300   104.244920   104.365096    37.500000   105.437298   
50%     105.824887   110.654264   107.636471    74.500000   111.882421   
75%     110.826709   118.851113   111.608339   112.500000   120.233122   
max     146.902636   178.824065   136.771507   149.500000   172.571132   

        Hedbergite       Isogen     Megacyte     Mercoxit     Mexallon  \
count  5050.000000  5050.000000  5050.000000  5050.000000  5050.000000   
mean    139.148678   114.327617   111.709613   111.324899   108.480667   
std      18.674376    17.131647    18.261035     7.664135     6.898296   
min      66.462890    22.157819    29.903756    86.084987    84.861319   
25%     126.282651   103.298648   100.280583   105.724847   103.597919   
50%     138.011340   112.623030   109.973120   110.406653   107.598094   
75%     150.591629   123.366388   121.685964   115.890090   112.391147   
max     265.540884   201.763854   199.787150   149.420274   140.378133   

           Nocxium     Northing  Plagioclase      Pyerite    Spudumain  \
count  5050.000000  5050.000000  5050.000000  5050.000000  5050.000000   
mean    112.205520    75.130495   140.363009   114.690044   146.831750   
std       9.716745    42.836894    26.335917    22.276036    21.565886   
min      75.707857     0.500000    51.715582    19.856209    51.141385   
25%     105.354748    38.500000   121.565490   100.826389   132.081777   
50%     111.018515    74.500000   137.285460   112.509560   145.839033   
75%     117.861481   111.500000   155.777135   127.173027   160.392205   
max     167.262641   149.500000   295.440483   233.739627   234.105254   

         Tritanium     Veldspar    elevation     parcel_id  
count  5050.000000  5050.000000  5050.000000   5050.000000  
mean    107.407900   105.865743   409.016291  11269.924158  
std       9.626466     5.232573   504.030248   6425.763158  
min      66.911702    91.864452 -1359.020531      4.000000  
25%     101.303085   102.227744    62.020927   5754.000000  
50%     106.238235   105.095715   411.312916  11215.000000  
75%     112.514500   108.842587   760.448410  16759.250000  
max     156.583507   137.119403  2078.685616  22498.000000  
========================

'Arkonor' has 5050 unique values
'Bistot' has 5050 unique values
'Crokite' has 5050 unique values
'Easting' has 150 unique values
'Gneiss' has 5001 unique values
'Hedbergite' has 5001 unique values
'Isogen' has 5001 unique values
'Megacyte' has 5001 unique values
'Mercoxit' has 5050 unique values
'Mexallon' has 5001 unique values
'Nocxium' has 5001 unique values
'Northing' has 150 unique values
'Plagioclase' has 5001 unique values
'Pyerite' has 5001 unique values
'Spudumain' has 5001 unique values
'Tritanium' has 5001 unique values
'Veldspar' has 5001 unique values
'elevation' has 5050 unique values
'parcel_id' has 5050 unique values

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Features with missing values:
[]


Features with non-numeric values:
['elevation']

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Feature Extraction

How to distinguish between classification and regression?

  • if you have y as floats, but only a finite number of different values can be obtained, and all of them are obtained in training set, then this is classification – just convert your values to strings or integers and you are good to go. Or if the values in target column are discrete.
  • if you have y as floats, and this are actuall real values, and you can have plenty of values, even not seen in the training set and you expect your model to somehow “interpolate” this is regression and you are supposed to use DecisionTreeRegressor instead.Or if the values in target column are continuous.
In [75]:
#Recover the train set and the test set from the combined dataset
def recover_train_test_target():
    global combined
    
    train0 = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\sample_data.csv")
    #classifier can only have integer but no floats..
    targets = train0.gold_available
    train = combined.ix[0:4999]
    test = combined.ix[5000:]
    
    return train,test,targets
In [76]:
train,test,targets = recover_train_test_target()
In [93]:
#Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.
from sklearn.tree import DecisionTreeRegressor
from sklearn.feature_selection import SelectFromModel
clf = DecisionTreeRegressor(max_depth=200,max_features='auto',random_state=2)
clf = clf.fit(train, targets)
In [94]:
#Let's have a look at the importance of each feature.
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
In [98]:
features.sort_values (by = 'importance',ascending=False)
Out[98]:
feature importance
7 Megacyte 0.320334
12 Plagioclase 0.242324
15 Tritanium 0.094173
1 Bistot 0.073765
10 Nocxium 0.073336
5 Hedbergite 0.031163
4 Gneiss 0.025366
0 Arkonor 0.019466
2 Crokite 0.018449
13 Pyerite 0.016798
6 Isogen 0.014705
8 Mercoxit 0.013774
16 Veldspar 0.012940
14 Spudumain 0.010196
17 elevation 0.009505
9 Mexallon 0.008693
3 Easting 0.007104
11 Northing 0.004448
18 parcel_id 0.003461
In [96]:
#Let's now transform our train set and test set in a more compact datasets.
model = SelectFromModel(clf, prefit=True)
train_new = model.transform(train)
train_new.shape
Out[96]:
(5000, 5)
In [97]:
test_new = model.transform(test)
test_new.shape
Out[97]:
(50, 5)
In [82]:
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(n_estimators = 200,)
forest.fit(train_new, targets)
Out[82]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=200, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
In [88]:
submission_file = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\elDorado\output.csv")

The selling price for gold is 1500 per ounce as per the competition, Variable cost is calculated with (Elevation_Level / Threshhold) gold_available. Profit is Total Selling price – Total Cost. I decided to calculate the ratio for profit from total profit and for bidding amount I multiplied total amount (50000000) profit ratio. So the business can maximize ROI for their money accordingly.

In [89]:
submission_file
Out[89]:
parcel_id gold_available Total Selling Price elevation Fixed Cost Variable Cost Total Cost Profit Profit Ratio bid_amount
0 515 11282.37 16923556.29 23.19 3000000 4720543.61 7720543.61 9203012.68 0.024964 1248208.72
1 914 11481.55 17222318.01 784.99 8000000 3181537.51 11181537.51 6040780.51 0.016386 819313.76
2 1538 11354.35 17031522.28 1258.73 8000000 3146290.39 11146290.39 5885231.89 0.015964 798216.63
3 1790 11349.07 17023599.53 474.18 3000000 4748450.89 7748450.89 9275148.64 0.025160 1257992.55
4 2416 11725.36 17588046.58 1584.86 8000000 3249097.26 11249097.26 6338949.33 0.017195 859754.53
5 3332 7495.10 11242648.24 209.17 3000000 3204155.25 6204155.25 5038492.99 0.013667 683373.05
6 3697 10478.70 15718050.71 152.18 3000000 4384288.08 7384288.08 8333762.63 0.022606 1130311.94
7 4006 15647.48 23471223.65 1128.92 8000000 4101875.11 12101875.11 11369348.54 0.030841 1542029.82
8 4125 10392.41 15588607.82 -770.37 4000000 4539404.69 8539404.69 7049203.13 0.019122 956086.57
9 4577 10414.18 15621273.62 -198.78 4000000 4548913.82 8548913.82 7072359.80 0.019185 959227.32
10 5712 7796.76 11695138.64 169.53 3000000 3333114.90 6333114.90 5362023.74 0.014545 727253.67
11 6558 16057.82 24086728.75 885.70 8000000 41451255.10 49451255.10 0.00 0.000000 0.00
12 6875 13018.41 19527611.91 798.47 8000000 3491103.62 11491103.62 8036508.29 0.021800 1089995.21
13 7146 10875.55 16313323.47 -157.97 4000000 4750440.24 8750440.24 7562883.23 0.020515 1025757.23
14 7837 13152.77 19729155.97 684.70 8000000 6327578.43 14327578.43 5401577.54 0.014652 732618.37
15 8081 10413.35 15620018.11 -501.93 4000000 4548551.28 8548551.28 7071466.83 0.019182 959106.20
16 8712 11252.48 16878724.89 1007.17 8000000 3118062.21 11118062.21 5760662.68 0.015626 781321.25
17 9543 11372.08 17058113.56 41.61 3000000 4758078.27 7758078.27 9300035.29 0.025227 1261367.94
18 10226 12476.05 18714070.90 1092.87 8000000 3345660.74 11345660.74 7368410.16 0.019988 999380.76
19 10495 11637.79 17456686.28 432.31 3000000 4869251.34 7869251.34 9587434.95 0.026007 1300348.08
20 10905 12033.79 18050682.70 583.50 8000000 4995025.67 12995025.67 5055657.03 0.013714 685701.02
21 11489 11593.39 17390091.18 -390.65 4000000 5063992.75 9063992.75 8326098.43 0.022585 1129272.45
22 11614 7658.14 11487202.98 266.58 3000000 3273854.85 6273854.85 5213348.13 0.014142 707088.73
23 12221 11520.68 17281026.94 634.89 8000000 4820252.51 12820252.51 4460774.43 0.012100 605016.83
24 12810 10410.31 15615459.76 -296.70 4000000 4547223.41 8547223.41 7068236.35 0.019173 958668.05
25 13071 11519.26 17278897.48 499.42 3000000 4819658.38 7819658.38 9459239.09 0.025659 1282960.82
26 13249 11193.29 16789932.92 -492.42 4000000 4889499.89 8889499.89 7900433.03 0.021431 1071539.26
27 13683 11438.85 17158277.89 195.77 3000000 4786014.84 7786014.84 9372263.05 0.025423 1271164.22
28 13905 8501.99 12752992.12 -334.67 4000000 3782322.80 7782322.80 4970669.32 0.013483 674174.10
29 13917 10433.56 15650335.63 -181.76 4000000 4557379.01 8557379.01 7092956.62 0.019240 962020.87
30 14154 12173.61 18260414.44 933.84 8000000 3264556.42 11264556.42 6995858.03 0.018977 948851.35
31 14655 10643.02 15964525.89 407.39 3000000 4453039.57 7453039.57 8511486.32 0.023088 1154416.69
32 15178 11539.16 17308746.74 286.29 3000000 4827984.54 7827984.54 9480762.20 0.025718 1285880.01
33 15815 13321.23 19981843.97 1596.11 8000000 3572309.85 11572309.85 8409534.12 0.022812 1140588.87
34 16067 14903.51 22355271.52 1304.02 8000000 3906848.69 11906848.69 10448422.83 0.028342 1417124.25
35 16475 11239.91 16859861.93 460.39 3000000 4702778.34 7702778.34 9157083.58 0.024840 1241979.33
36 16838 11641.73 17462587.63 328.22 3000000 4870899.83 7870899.83 9591687.79 0.026018 1300924.90
37 16925 11694.14 17541208.10 1051.53 8000000 3240446.19 11240446.19 6300761.91 0.017092 854575.15
38 17186 7658.14 11487202.98 -178.90 4000000 3483177.34 7483177.34 4004025.64 0.010861 543067.78
39 18992 10415.55 15623322.67 709.52 8000000 2886148.91 10886148.91 4737173.76 0.012850 642504.99
40 19114 7400.12 11100184.77 496.51 3000000 3163551.30 6163551.30 4936633.47 0.013391 669557.80
41 19448 10858.16 16287235.09 314.58 3000000 4543054.14 7543054.14 8744180.94 0.023720 1185977.16
42 19984 12426.43 18639647.94 528.30 8000000 5158003.99 13158003.99 5481643.95 0.014870 743477.82
43 20194 11432.95 17149430.52 413.19 3000000 4783546.28 7783546.28 9365884.24 0.025406 1270299.06
44 20402 16276.82 24415234.47 248.42 3000000 6675530.80 9675530.80 14739703.67 0.039983 1999152.59
45 20433 12009.33 18013988.76 784.49 8000000 3220502.00 11220502.00 6793486.77 0.018428 921403.64
46 21042 11251.45 16877178.04 466.96 3000000 4707619.23 7707619.23 9169558.81 0.024873 1243671.35
47 21278 10461.92 15692885.04 656.77 8000000 4377267.33 12377267.33 3315617.71 0.008994 449698.71
48 22007 11272.51 16908772.02 185.78 3000000 4716418.18 7716418.18 9192353.83 0.024935 1246763.05
49 22264 12427.10 18640651.40 -128.66 4000000 5344688.59 9344688.59 9295962.80 0.025216 1260815.58
In [99]:
#select the columns 
submission_file = submission_file[['parcel_id','bid_amount']]
In [100]:
submission_file
Out[100]:
parcel_id bid_amount
0 515 1248208.72
1 914 819313.76
2 1538 798216.63
3 1790 1257992.55
4 2416 859754.53
5 3332 683373.05
6 3697 1130311.94
7 4006 1542029.82
8 4125 956086.57
9 4577 959227.32
10 5712 727253.67
11 6558 0.00
12 6875 1089995.21
13 7146 1025757.23
14 7837 732618.37
15 8081 959106.20
16 8712 781321.25
17 9543 1261367.94
18 10226 999380.76
19 10495 1300348.08
20 10905 685701.02
21 11489 1129272.45
22 11614 707088.73
23 12221 605016.83
24 12810 958668.05
25 13071 1282960.82
26 13249 1071539.26
27 13683 1271164.22
28 13905 674174.10
29 13917 962020.87
30 14154 948851.35
31 14655 1154416.69
32 15178 1285880.01
33 15815 1140588.87
34 16067 1417124.25
35 16475 1241979.33
36 16838 1300924.90
37 16925 854575.15
38 17186 543067.78
39 18992 642504.99
40 19114 669557.80
41 19448 1185977.16
42 19984 743477.82
43 20194 1270299.06
44 20402 1999152.59
45 20433 921403.64
46 21042 1243671.35
47 21278 449698.71
48 22007 1246763.05
49 22264 1260815.58
In [92]:
#export the file for submission
submission_file.to_csv('submission.csv')
For future, I need to know more about Vickrey Auction method.

Finally, any and all suggestions for improvements are more than welcome.

Leave a Reply

Your email address will not be published. Required fields are marked *