My Thoughts about CSE 6250 Big Data Analytics in Healthcare (taken in Spring 2019)


This is my first class in the Georgia Tech OMSCS program. I would have preferred to take a different class first (Machine Learning), so that I had a better understanding of machine learning algorithms before trying to apply them, but as a newcomer you’re last in the priority list. So the other classes were completely full by the time I could make my selection.

Prerequisites

It would help if you are familiar with Python and at least some machine learning algorithms.

The second homework involves some math that requires you to use the chain rule for derivatives. However, the majority of the tasks are more practical.
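
As a quick refresher, here is the chain rule with one small worked example. This is only an illustration I picked (the sigmoid function, with a hypothetical weight w and input x), not the actual homework problem:

```latex
% Chain rule for a composition f(g(x)):
\frac{d}{dx} f\bigl(g(x)\bigr) = f'\bigl(g(x)\bigr)\, g'(x)

% Illustrative example (assumed, not from the assignment):
% \sigma(z) = \frac{1}{1 + e^{-z}} with z = w x, so \sigma'(z) = \sigma(z)\,(1 - \sigma(z))
\frac{\partial}{\partial w}\, \sigma(w x) = \sigma(w x)\,\bigl(1 - \sigma(w x)\bigr)\, x
```

If you can follow that kind of derivation, the math part should be manageable.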

Effort

Very intense. You’ll have to use multiple languages and tools to complete your homework. For this year (Spring 2019) this includes Python, a variation of SQL, and Scala, plus Hadoop, Pig, Spark… This class is taking more of my time than I wanted to spend on it, given that I have a family and a full-time job.

Grading

The automated code grader has bugs.

My first homework had some points taken off for tasks we were not even supposed to do. I contacted the teaching assistant (TA) and had full credit restored.

The second homework had some points taken off because the graders split the script into parts and ran each part separately, while my code expected the script to run as a whole. The assignment did not mention anything about this. I again had full credit restored after talking to the teaching assistants and demonstrating that the issue was caused by this unstated requirement. The teaching assistant was very responsive.

For my third homework (Spark + Scala), I initially received 0 points, because I had been trying out some plugins and had modified the Scala project file. I then forgot to undo the change, so the automated grader could not run my homework. This time the first TA never responded (I waited for about 4 days and followed up once), but the second TA replied right away. He manually reran my code, and I only lost a few points due to the bad project file.

The last, fifth homework (PyTorch + deep learning) requires a lot of time. You can take part in a Kaggle competition with other classmates as a part of this homework. I totally sucked at this one. I think I had some bugs in the data preprocessing stage, even though I passed the included unit tests.

A note about the homework submission process – if you miss a file or make a typo, you won’t know about it until your homework is officially graded. There is no immediate feedback on submission.

Tools

Docker – there are several ways you can run your homework assignments. If you don’t want to set up your home environment for each task, you can use the provided Docker image, which is what I did (there is also an option to use an Azure virtual machine, but I did not use it).

TeX editor – I used TeXstudio on Mac. For homework assignments that require a written answer, you can use regular Word and save to PDF. But some of them require you to type formulas. And although I found using the TeX format extremely frustrating, at least the original homework assignment is provided in both tex and pdf formats, so you can start with the provided tex file and fill in the answers.

Overleaf – something I discovered at the end of the class. This is an online LaTeX editor that allows you to collaborate with other students. As long as you sign up with your Georgia Tech email, it’s free.

Professor Involvement

Nonexistent. Your only chance to see a professor is through Udacity. The professor did not answer a single question on Piazza; it was 100% TAs.

Overall Impression

There is no need to jam so many technologies into a single class. Sometimes I felt like I was just going through different sections of the homework, filling in the missing parts (they usually provide a method signature and you’re supposed to write the code), without actually understanding the bigger picture. Not a bad class, especially if you can dedicate enough time to it, but I would not recommend it as your first class.


5 Replies to “My Thoughts about CSE 6250 Big Data Analytics in Healthcare (taken in Spring 2019)”

  1. Hi, I am thinking of taking this course in Fall 2019 (I have taken ML etc. before). I am trying to go over the lab set-up pages at http://www.sunlab.org/teaching/cse6250/spring2019/ and am having trouble setting up Hadoop. It seems that the instructions jump straight from the Docker set-up to Hadoop without explaining how to do the first step there. Any hints, please?

    1. Can’t remember all of the details now, but try to use http://sunlab.org/teaching/cse6250/spring2019/env/env-local-docker.html#_0-system-environment
      Once you have docker set up, you attach to your docker instance:
      docker ps -a – list all instances
      docker start {instanceId} – start a docker instance
      docker attach {instanceId} – attach to a docker instance

      Then start off the services:
      /scripts/start-services.sh

      And, fingers crossed, this should do it.

  2. Nice post! Thanks!
    I will take this course as my first one in the OMSCS program too. I heard it is the hard(est) one! I will keep my eyes on your posts.
