Last Updated on September 27, 2016
It is difficult to stay motivated when self-studying machine learning.
The standard test datasets can be quite abstruse and disconnected from you and from your everyday life. Boring even. A trick that you might like to use is to find and work on a dataset that matters to you.
In this post, we will look at some ideas for datasets that you could use to motivate and even accelerate your journey into applied machine learning.
Problems with Impact
We have looked before at the need to work on problems that have an impact. The problems that have the biggest impact are the problems in which the outcome affects you directly.
These may be problems related to your personal life, hobbies or even your work. They are problems that may or may not be addressed right now. The size and scope of the problem do not matter as long as you are invested in the outcome in some way. The results matter to you.
This is a powerful method for two reasons:
- It gives you permission to treat the problem objectively and apply your rational problem-solving skills to it, which may lead to some interesting results.
- Caring about the outcome is more likely to motivate you to learn new and different methods, to go deep into the definition of the problem and to write up your findings. Because you care about the outcome, you will treat the project more seriously.
You can’t pick any old problem. There are some additional considerations:
- Data: Machine learning algorithms model problems with data, and the quality of the modeling is typically proportional to the quality of the data. You need to have access to data for the problem, or be able to collect it.
- Public: Can the data and/or the results be made public? This may matter to you if you want to use the project as a part of your machine learning portfolio, which I strongly encourage you to do.
- Question: Start with a question to be sure that there is a problem to be solved. The question will clarify the data you need to collect and the impact the answer will have on you.
In the next sections, we will look at three areas of your life in which you might find problems that you could investigate with machine learning.
Machine Learning at Home
Are there problems and sources of data in your personal life that you can model using machine learning methods?
Five examples that come to my mind are:
- Personal Finance: You can model some aspect of your personal finance. This could be something like weekly expenditure prediction or large purchase prediction. It could also be something related to your investment portfolio if that is your thing.
- Transport: You can model some aspect of your personal transport. This may be which train or bus you take on your commute on a given day, the commute time or some detail like work arrival time prediction or fuel consumption.
- Food: You can model something about the food that you consume. This could be the quantity, calories, snack prediction or a model of what you need to purchase in a given week.
- Media: You could model your media consumption, such as TV, movies, books, music or websites. An obvious approach would be to model it as a recommendation problem, but also consider models of consumption volume, such as how much you consume, when you consume it, and other related patterns you could predict.
- Fitness: You could model some aspect of personal fitness. This could be weight, BMI, a body measurement, or an aspect of endurance like the number of sit-ups or time to complete your routine. How about modeling whether you will go to the gym or not on a given day (what would the inputs be?).
Remember, you have to have access to the data, which very likely means you have to spend some time measuring and collecting the data.
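To make the personal finance idea concrete, here is a minimal sketch of a first baseline for weekly expenditure prediction. The spending figures are made up for illustration, and the moving average is just one simple starting point that I have assumed; any model you try later should beat it.

```python
from statistics import mean

# Hypothetical weekly spending totals (oldest first); replace with your own log.
weekly_spend = [212.0, 198.5, 240.0, 225.0, 205.5, 260.0, 231.0, 219.5]


def moving_average_forecast(history, window=3):
    """Predict the next value as the mean of the last `window` values."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    return mean(history[-window:])


def walk_forward_mae(series, window=3):
    """Mean absolute error of one-step-ahead forecasts across the series."""
    errors = []
    for t in range(window, len(series)):
        # Only use data that was available before week t.
        prediction = moving_average_forecast(series[:t], window)
        errors.append(abs(series[t] - prediction))
    return mean(errors)


next_week = moving_average_forecast(weekly_spend)
baseline_error = walk_forward_mae(weekly_spend)
print(f"Forecast for next week: {next_week:.2f}")
print(f"Baseline mean absolute error: {baseline_error:.2f}")
```

The walk-forward evaluation only ever uses past weeks to predict the next one, which mirrors how you would actually use the model.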
Machine Learning with a Hobby
Do you have a hobby other than machine learning? Consider what data you could collect and model related to your hobby.
Five examples of hobbies you might have or want to model include:
- Sports: You can model the performance of a team or a league. You may be into fantasy sports teams and be interested in modeling the performance of individual players. There is also a gambling side to sports outcomes that might spark your interest (be careful). Maybe you have a child or family member who plays a sport on weekends, which might provide a problem and source of data a little more connected to you.
- Games: You can model an aspect of a game you play. This may be a board game, card game or computer game. You could model and predict win/loss outcomes, specific outcome scores or specific moves within the game.
- Arts/Crafts: Maybe you’re an amateur artist or craftsperson and post photos of your creations to a public social photo album. You could model and predict whether a given photo you post is liked by or interesting to third parties (in the form of views or comments). A similar approach could be used in person with control groups (family members?) and for various other art forms that may require a subjective assessment of interest or quality (painting, music, papier-mâché, etc.).
- Language: You could model some aspect of a language you or a friend or family member is learning. If flash cards are being used, you could get into the interesting problem of modeling whether a given card’s contents will be remembered. You could also model other aspects of language learning such as the rate of new words acquired and the frequency of errors. Collecting data may be an interesting challenge.
- Photography: Maybe you’re a bird watcher, nature lover, or have some other reason to photograph nature in all of its variety. You could model the problem of classifying photos of leaves/birds/animals into their groups. You could also model the problem of whether a given photo includes an object of interest, like your pet dog or your own face.
Gravitate towards hobbies that have datasets readily available for you to draw upon and model.
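As an illustration of the flash card idea, the sketch below estimates how likely a card is to be remembered based on how long it has been since the last review. The review log, the bucket boundaries and the use of add-one smoothing are all assumptions of mine for the sake of the example; your own data will suggest better choices.

```python
from collections import defaultdict

# Hypothetical review log: (days since the card was last reviewed, remembered?)
review_log = [
    (1, True), (1, True), (2, True), (2, False),
    (5, True), (6, False), (7, False), (7, True),
    (14, False), (15, False), (20, False), (21, True),
]


def gap_bucket(days):
    """Group review gaps into coarse buckets so each bucket has some data."""
    if days <= 2:
        return "0-2 days"
    if days <= 7:
        return "3-7 days"
    return "8+ days"


def recall_rates(log):
    """Estimate P(remembered) per bucket with add-one (Laplace) smoothing."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [remembered, total]
    for days, remembered in log:
        bucket = gap_bucket(days)
        counts[bucket][0] += int(remembered)
        counts[bucket][1] += 1
    return {b: (hits + 1) / (total + 2) for b, (hits, total) in counts.items()}


rates = recall_rates(review_log)
for bucket, rate in rates.items():
    print(f"{bucket}: estimated recall {rate:.2f}")
```

A table of rates like this is a useful sanity check before trying a richer model with more inputs, such as the number of previous correct answers for a card.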
Machine Learning at Work
Do you have access to data at work or the things you work on? This could be your blog or something else online, or it could be data on or related to something your work creates or releases.
- Visitors: Can you model something about the visits to your website (this could be your own blog or web property)? Perhaps a demographic feature of a visitor such as platform, browser, etc., or perhaps the source of visitors or volume of page views in a period based on content posted.
- Customers: Like visitors, are there properties of customers that can be modeled? This might be purchase volumes, shopping cart contents, purchase times or similar demographic information. I like this area because it can flush out a lot of new knowledge (supported with data) about a business that was taken for granted.
- Conversion: Are there qualities of conversion that can be modeled? This may be aspects of conversion such as time or customer demographics. It may be the prediction of conversion chains such as trial, paid, up-sell.
- Churn: For service industries, churn is very important and is likely already being modeled. Is there some form of churn that is not being modeled? Churn from trials perhaps. Churn from email lists or from RSS subscriptions?
- Proprietary data: Is there some unique or interesting data that your organization creates or has access to? What questions can you ask of the data that might be worth modeling? For example, meteorological data, manufacturing data, mining data, etc.
Be mindful of privacy concerns and data ownership. You may require permission before accessing the data and have to keep the results confidential or internal to your organization.
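For the visitors example, a first question might be whether page views in a week rise with the number of posts published. The sketch below fits a straight line by ordinary least squares using only the standard library. The numbers are invented for illustration, and a straight line is my own assumed starting point rather than a recommended final model.

```python
from statistics import mean

# Hypothetical weekly data: posts published and page views in the same week.
posts_per_week = [1, 2, 3, 4, 5]
page_views = [120, 150, 210, 230, 290]


def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


slope, intercept = fit_line(posts_per_week, page_views)
predicted_views = intercept + slope * 6  # estimate for a week with 6 posts
print(f"Each extra post is associated with about {slope:.0f} more page views")
print(f"Estimated page views for 6 posts: {predicted_views:.0f}")
```

Keep in mind that a fit like this shows an association in the data you collected, not proof that posting more causes more traffic.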
I hope you have found this useful and perhaps thought of a problem that you could investigate that will give you that push to dive deeper into applied machine learning.
If so, leave a comment. I’d love to hear what you came up with.
About Jason Brownlee
Jason Brownlee, PhD is a machine learning specialist who teaches developers how to get results with modern machine learning methods via hands-on tutorials.