By AI Trends Staff
Once you have decided to explore a career in data science and need a project to get yourself going, you have to decide what dataset to use.
Fortunately, a guide to the best datasets for machine learning has been published on edureka!, written by Disha Gupta, a computer science and technology writer based in India. She notes that without training datasets, machine-learning algorithms would not have a way to learn text mining or text classification. Five to 10 years ago, it was difficult to find datasets for machine learning and data science projects. Today the challenge is not finding data but finding the relevant data.
For Natural Language Processing projects, which need text data, she recommended:
Enron Dataset – Email data from the senior management of Enron that is organized into folders.
Amazon Reviews – Contains approximately 35 million reviews from Amazon spanning 18 years. Data includes user information, product information, ratings, and review text.
Newsgroup Classification – Collection of almost 20,000 newsgroup documents, partitioned evenly across 20 newsgroups. It is great for practicing topic modeling and text classification.
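To give a feel for what practicing text classification on newsgroup posts involves, here is a minimal sketch. It is not taken from the edureka! guide. It assumes a tiny made-up corpus in place of the real 20,000 documents, borrows two newsgroup-style labels for illustration, and uses a simple word-count naive Bayes classifier with add-one smoothing, which is only one of many possible approaches.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus standing in for real newsgroup posts (an assumption).
TRAIN = [
    ("the team won the hockey game last night", "rec.sport.hockey"),
    ("great goal and a strong defense in the game", "rec.sport.hockey"),
    ("the new graphics card renders images fast", "comp.graphics"),
    ("rendering 3d images needs a fast graphics driver", "comp.graphics"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label, documents per label, and collect the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in doc_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAIN)
print(predict("a fast card for rendering images", model))
```

With a real dataset you would also hold out part of the documents to measure accuracy, since four training sentences say nothing about how well a classifier generalizes.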
For Finance projects:
Quandl: A great source of economic and financial data that is useful to build models to predict stock prices or economic indicators.
IMF Data: The International Monetary Fund (IMF) publishes data on international finances, foreign exchange reserves, debt rates, commodity prices, and investments.
And for Sentiment Analysis projects:
IMDB Reviews – Dataset for binary sentiment classification. It features 25,000 movie reviews.
Sentiment140 – Uses 1.6 million tweets with emoticons pre-removed.
Two Questions for Your Data Science Project
Once you have selected a dataset, you might need some more suggestions for getting your project off the ground. First, ask yourself two questions, suggests a recent article in Data Science Weekly: How would you make some money with it? And how would you save some money with it?
The answers will help you focus on what is important and useful when looking at your data. You will often find that before you get to the modeling or serious math, you may have to work through problems with the data, such as missing, erroneous or biased data. “You will find frequently in the real world that data is incredibly messy and nothing like the squeaky clean data sets found online in contests on Kaggle or elsewhere,” the author states.
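A first pass over messy data can be as simple as counting what is missing or implausible before any modeling starts. The sketch below is an illustration only: the records, the field names, and the valid rating range of 1 to 5 are all assumptions made up for the example, not something described in the article.

```python
from collections import Counter

# Hypothetical review records; field names and values are made up.
ROWS = [
    {"user": "u1", "rating": "5", "review": "Works well"},
    {"user": "u2", "rating": "", "review": "Stopped working"},
    {"user": "", "rating": "9", "review": "Fine"},
    {"user": "u4", "rating": "four", "review": ""},
]


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def audit(rows, rating_field="rating", low=1, high=5):
    """Return (missing counts per field, number of invalid ratings)."""
    missing = Counter()
    invalid = 0
    for row in rows:
        for field, value in row.items():
            if is_blank(value):
                missing[field] += 1
        raw = row.get(rating_field)
        if is_blank(raw):
            continue  # already counted as missing, not as invalid
        try:
            rating = float(raw)
        except (TypeError, ValueError):
            invalid += 1  # not a number at all, e.g. "four"
            continue
        if not low <= rating <= high:
            invalid += 1  # numeric but outside the assumed valid range
    return dict(missing), invalid


missing, invalid = audit(ROWS)
print("missing values per field:", missing)
print("invalid ratings:", invalid)
```

Counts like these do not reveal biased data, which usually requires comparing the sample against what you know about the population it is supposed to represent.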
Maybe at this stage you feel you need more education on AI. BestColleges may be able to help. The company partners with HigherEducation.com to provide students with direct connections to schools and programs that suit their education goals. The site provides college planning, access to financial aid and career resources.
Tune Up Your AI Education
Success in the AI field usually requires an undergraduate degree in computer science or a related discipline such as mathematics. More senior positions may require a master’s or PhD. Motivation is important. “Curiosity, confidence and perseverance are good traits for any student looking to break into an emerging field and AI is no exception,” states Dan Ayoub, Education Manager for Microsoft. “Unlike careers where a path has been laid over decades, AI is still in its infancy, which means you may have to form your own path and get creative.”
The article sketches out sample core subjects in an AI curriculum in math and statistics, computer science and “core AI,” such as machine learning, neural networks and natural language processing. Once you cover some fundamentals, you can begin to explore subjects that interest you personally. Clusters include machine learning, robotics, and human-AI interaction.
Whether you are a college student or already in the workforce, it’s important to proactively define your own AI curriculum, Ayoub suggested.
Example skills that can help you check off the right boxes when you respond to an AI job posting include:
- Programming Languages: Python, Java, C/C++, SQL, R, Scala, Perl
- Machine Learning Frameworks: TensorFlow, Theano, Caffe, PyTorch, Keras, MXNet
- Cloud Platforms: AWS, Azure, GCP
- Workflow Management Systems: Airflow, Luigi, Pinball
- Big Data Tools: Spark, HBase, Kafka, HDFS, Hive, Hadoop, MapReduce, Pig
- Natural Language Processing Tools: spaCy, NLTK
Jobs of the future will require a willingness to stay curious. It takes a little time and some patience.
An IBM AI researcher argues that more people with data science and software engineering skills need to take up AI, as demand for workers skilled in machine learning is doubling every few months. “If we leave it as some mythical realm, this field of AI, that’s only accessible to the select PhDs that work on this, it doesn’t really contribute to its adoption,” said Dario Gil, research director at IBM, in an article in VentureBeat.