Driver Telematics Analysis is a Kaggle challenge; see the challenge page for details. Besides solving a machine learning problem, we wanted to learn how to use git and scikit-learn.
Submissions can be generated by running scripts from the `scripts` directory, using the repository root as the working directory. Features implement a common interface and live in the `features` package. Utilities such as plotting and I/O are part of the `utils` package. Working notes are stored as IPython notebooks in the `notebooks` directory.
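The common feature interface is not spelled out in this README; the following is a hypothetical sketch of what it might look like, using `TripLengthFeature` (which does appear in the feature list below) as an example. The `compute` method name and the trip representation are assumptions.

```python
import numpy as np


class Feature:
    """Hypothetical sketch of the common feature interface.

    Each feature maps a trip -- assumed here to be an (n, 2) array of
    x/y positions sampled once per second -- to a single scalar.
    """

    def compute(self, trip):
        raise NotImplementedError


class TripLengthFeature(Feature):
    """Total distance travelled, summed over consecutive points."""

    def compute(self, trip):
        steps = np.diff(trip, axis=0)  # per-second displacement vectors
        return float(np.hypot(steps[:, 0], steps[:, 1]).sum())
```

With such an interface, a trip's feature vector is just `[f.compute(trip) for f in features]`, which keeps model code independent of any particular feature.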
The repository is now closed; the project is finished.
My participation is over. Together with scigor we finished in place 613/1528,
right at the lower end of the top 40%. As reported by other participants, roughly 77% accuracy is about the best one could achieve without trip matching and sophisticated model ensembling.
Not the best result ever, but the competition taught us a lot. We learned IPython notebooks, mastered git branching through many troublesome merge conflicts, developed an object-oriented framework for evaluating different models, got acquainted with scikit-learn and matplotlib, and employed parallelization, NumPy persistence, zipping, and CSV I/O, all thanks to one challenge.
# Features

In the end, we based our classification model on the following features:
- AccelerationFeature(10, 31, True, np.median)
- AccelerationFeature(30, 51, True, np.median)
- AccelerationFeature(50, 71, True, np.median)
- AccelerationFeature(5, 130, True, np.median)
- AccelerationFeature(10, 31, True, np.mean)
- AccelerationFeature(30, 51, True, np.mean)
- AccelerationFeature(50, 71, True, np.mean)
- AccelerationFeature(5, 130, True, np.mean)
- AccelerationFeature(10, 31, False, np.median)
- AccelerationFeature(30, 51, False, np.median)
- AccelerationFeature(50, 71, False, np.median)
- AccelerationFeature(5, 130, False, np.median)
- AccelerationFeature(10, 31, False, np.mean)
- AccelerationFeature(30, 51, False, np.mean)
- AccelerationFeature(50, 71, False, np.mean)
- AccelerationFeature(5, 130, False, np.mean)
- AngleFeature(0, np.mean)
- AngleFeature(1, np.mean)
- SpeedPercentileFeature(5)
- SpeedPercentileFeature(95)
- AccelerationPercentileFeature(5)
- AccelerationPercentileFeature(95)
- TripLengthFeature()
- AccelerationFeature(10, 31, True, np.mean, False)
- AccelerationFeature(30, 51, True, np.mean, False)
- AccelerationFeature(50, 71, True, np.mean, False)
- AccelerationPercentileFeature(1)
- AccelerationPercentileFeature(10)
- AccelerationPercentileFeature(25)
- AccelerationPercentileFeature(50)
- AccelerationPercentileFeature(75)
- AccelerationPercentileFeature(90)
- AccelerationPercentileFeature(99)
- AnglePercentileFeature(1)
- AnglePercentileFeature(5)
- AnglePercentileFeature(10)
- AnglePercentileFeature(25)
- AnglePercentileFeature(50)
- AnglePercentileFeature(75)
- AnglePercentileFeature(90)
- AnglePercentileFeature(95)
- AnglePercentileFeature(99)
- SpeedPercentileFeature(1)
- SpeedPercentileFeature(10)
- SpeedPercentileFeature(25)
- SpeedPercentileFeature(50)
- SpeedPercentileFeature(75)
- SpeedPercentileFeature(90)
- SpeedPercentileFeature(99)
For feature code, see the `features` module. The compiled features are not checked into the git repository, but can easily be compiled locally using this script.
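To give a flavour of the percentile features above, here is a hedged sketch of what `SpeedPercentileFeature` might compute; the constructor argument and trip format (x/y positions at 1 Hz, so consecutive differences approximate speed) are assumptions, not the repository's actual code.

```python
import numpy as np


class SpeedPercentileFeature:
    """Hypothetical sketch: the q-th percentile of per-second speed.

    Assumes a trip is an (n, 2) array of positions sampled at 1 Hz,
    so the norm of consecutive differences approximates speed.
    """

    def __init__(self, q):
        self.q = q

    def compute(self, trip):
        speeds = np.linalg.norm(np.diff(trip, axis=0), axis=1)
        return float(np.percentile(speeds, self.q))
```

The acceleration and angle percentile features would follow the same pattern, differencing once more for acceleration or taking angles between consecutive displacement vectors.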
We evaluated a number of different approaches to classification and settled on scikit-learn's gradient boosting, evaluated with cross-validation. For our models, see the `scripts` directory.
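The gradient-boosting-plus-cross-validation setup can be sketched as follows. The feature matrix here is synthetic stand-in data (the real one would come from the features above), and the hyperparameters are illustrative defaults, not the values used in the competition; note also that modern scikit-learn exposes `cross_val_score` from `sklearn.model_selection`.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix: one row per trip,
# one column per feature; y = 1 marks trips assumed to belong to the
# target driver, y = 0 marks trips sampled from other drivers.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
               rng.normal(1.0, 1.0, (100, 5))])
y = np.array([1] * 100 + [0] * 100)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validated AUC: an offline proxy for the leaderboard score.
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

Scoring with `roc_auc` matches the competition's AUC metric, which is why cross-validation gives a usable offline estimate before submitting.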
# Todo

- compute best RDP epsilon value -> dismissed, RDP is far too expensive
- create script that reduces trips using RDP and stores them as *.npy -> completed
- analyze article by Olariu -> completed
- use sklearn's cross-validation -> completed
- understand how to measure score offline (maybe use cross-validation's built-in scoring) -> completed
- compute more features -> completed:
  - more percentiles
  - angle features
  - use speed w/o interpolation