Using Azurite with Active Storage

Install Azurite in your preferred way: npm install azurite

Install Microsoft Azure Storage Explorer

Create some directory to run azurite from: `~/azurite`

Add a storage.yml configuration for azurite (using the default dev account and key):

azurite_emulator:
  service: AzureStorage
  storage_account_name: 'devstoreaccount1'
  storage_access_key: 'Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=='
  container: 'container-name'
  storage_blob_host: 'http://127.0.0.1:10000/devstoreaccount1'

Update development.rb to use azurite_emulator:

config.active_storage.service = :azurite_emulator

Start azurite, pointing it at the directory you created: `azurite --location ~/azurite --debug ~/azurite/debug.log`

Start Azure Storage Explorer, connect to the local emulator, and create a container-name blob container – the same container name you specified in the storage.yml file.

Start uploading to Azurite.
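To sanity-check the setup from the Rails console, you can attach a file directly. This is a minimal sketch assuming a User model with has_one_attached :avatar – substitute whatever attachment your app already defines:

user = User.first
user.avatar.attach(
  io: File.open("/path/to/some/file.png"),
  filename: "file.png",
  content_type: "image/png"
)
user.avatar.attached? # => true once the blob is stored in Azurite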

Note for Rails 5.2

Some changes have not been backported as of this post, and you have to monkey-patch the Active Storage service as described here – http://www.garytaylor.blog/index.php/2019/01/30/rails-active-storage-and-azure-beyond-config/ – this allows us to work with azurite locally.

If you want to use the newer azure-storage-blob instead of the deprecated azure-storage and you're on Rails 5.2, you have to do a bit more monkey-patching – otherwise, you'll start getting "No such file to load -- azure/storage.rb":

Add two empty files: lib/azure/storage/core/auth/shared_access_signature.rb and lib/azure/storage.rb
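From the Rails application root, a quick way to create them:

mkdir -p lib/azure/storage/core/auth
touch lib/azure/storage/core/auth/shared_access_signature.rb lib/azure/storage.rb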

Add this to config/initializers/active_storage_6_patch.rb (this is the current master version of the ActiveStorage module):

require "azure/storage/blob"
require 'active_storage/service/azure_storage_service'
module ActiveStorage
  # Wraps the Microsoft Azure Storage Blob Service as an Active Storage service.
  # See ActiveStorage::Service for the generic API documentation that applies to all services.
  class Service::AzureStorageService < Service
    attr_reader :client, :container, :signer

    def initialize(storage_account_name:, storage_access_key:, container:, public: false, **options)
      @client = Azure::Storage::Blob::BlobService.create(storage_account_name: storage_account_name, storage_access_key: storage_access_key, **options)
      @signer = Azure::Storage::Common::Core::Auth::SharedAccessSignature.new(storage_account_name, storage_access_key)
      @container = container
      @public = public
    end

    def upload(key, io, checksum: nil, filename: nil, content_type: nil, disposition: nil, **)
      instrument :upload, key: key, checksum: checksum do
        handle_errors do
          content_disposition = content_disposition_with(filename: filename, type: disposition) if disposition && filename

          client.create_block_blob(container, key, IO.try_convert(io) || io, content_md5: checksum, content_type: content_type, content_disposition: content_disposition)
        end
      end
    end

    def download(key, &block)
      if block_given?
        instrument :streaming_download, key: key do
          stream(key, &block)
        end
      else
        instrument :download, key: key do
          handle_errors do
            _, io = client.get_blob(container, key)
            io.force_encoding(Encoding::BINARY)
          end
        end
      end
    end

    def download_chunk(key, range)
      instrument :download_chunk, key: key, range: range do
        handle_errors do
          _, io = client.get_blob(container, key, start_range: range.begin, end_range: range.exclude_end? ? range.end - 1 : range.end)
          io.force_encoding(Encoding::BINARY)
        end
      end
    end

    def delete(key)
      instrument :delete, key: key do
        client.delete_blob(container, key)
      rescue Azure::Core::Http::HTTPError => e
        raise unless e.type == "BlobNotFound"
        # Ignore files already deleted
      end
    end

    def delete_prefixed(prefix)
      instrument :delete_prefixed, prefix: prefix do
        marker = nil

        loop do
          results = client.list_blobs(container, prefix: prefix, marker: marker)

          results.each do |blob|
            client.delete_blob(container, blob.name)
          end

          break unless marker = results.continuation_token.presence
        end
      end
    end

    def exist?(key)
      instrument :exist, key: key do |payload|
        answer = blob_for(key).present?
        payload[:exist] = answer
        answer
      end
    end

    def url_for_direct_upload(key, expires_in:, content_type:, content_length:, checksum:)
      instrument :url, key: key do |payload|
        generated_url = signer.signed_uri(
          uri_for(key), false,
          service: "b",
          permissions: "rw",
          expiry: format_expiry(expires_in)
        ).to_s

        payload[:url] = generated_url

        generated_url
      end
    end

    def headers_for_direct_upload(key, content_type:, checksum:, filename: nil, disposition: nil, **)
      content_disposition = content_disposition_with(type: disposition, filename: filename) if filename

      { "Content-Type" => content_type, "Content-MD5" => checksum, "x-ms-blob-content-disposition" => content_disposition, "x-ms-blob-type" => "BlockBlob" }
    end

    private
      def private_url(key, expires_in:, filename:, disposition:, content_type:, **)
        signer.signed_uri(
          uri_for(key), false,
          service: "b",
          permissions: "r",
          expiry: format_expiry(expires_in),
          content_disposition: content_disposition_with(type: disposition, filename: filename),
          content_type: content_type
        ).to_s
      end

      def public_url(key, **)
        uri_for(key).to_s
      end


      def uri_for(key)
        client.generate_uri("#{container}/#{key}")
      end

      def blob_for(key)
        client.get_blob_properties(container, key)
      rescue Azure::Core::Http::HTTPError
        false
      end

      def format_expiry(expires_in)
        expires_in ? Time.now.utc.advance(seconds: expires_in).iso8601 : nil
      end

      # Reads the object for the given key in chunks, yielding each to the block.
      def stream(key)
        blob = blob_for(key)

        chunk_size = 5.megabytes
        offset = 0

        raise ActiveStorage::FileNotFoundError unless blob.present?

        while offset < blob.properties[:content_length]
          _, chunk = client.get_blob(container, key, start_range: offset, end_range: offset + chunk_size - 1)
          yield chunk.force_encoding(Encoding::BINARY)
          offset += chunk_size
        end
      end

      def handle_errors
        yield
      rescue Azure::Core::Http::HTTPError => e
        case e.type
        when "BlobNotFound"
          raise ActiveStorage::FileNotFoundError
        when "Md5Mismatch"
          raise ActiveStorage::IntegrityError
        else
          raise
        end
      end
  end
end

Ad Blocking with ddwrt

This was done on an Asus RT-AC68U running DD-WRT v3.0-r41686 std (12/10/19).

Let's start with the script. It downloads two ad lists, combines them into one, and then restarts dnsmasq to pick up the changes. The reason for using curl instead of wget for the second list is that wget refused to work with https on my dd-wrt build 🤷.

wget -qO /tmp/mvps http://winhelp2002.mvps.org/hosts.txt
curl -k https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts|grep "^0.0.0.0" >> /tmp/mvps
killall -HUP dnsmasq
stopservice dnsmasq && startservice dnsmasq

Go to Administration -> Commands, paste it in, and click "Run Commands".

Then paste the same commands again, but this time click "Save Startup". Why? I wanted to use the cron scheduler to run the script on a regular basis, but it just refused to work. So instead I schedule a weekly reboot, which triggers this command and updates the ad lists 🤷

All that's left is to enable DNSMasq and Local DNS in the Services tab. Then add this to the Additional Dnsmasq Options:

addn-hosts=/tmp/mvps

Then go to Administration -> Keep Alive and schedule a weekly or monthly reboot. If cron is broken in your build, this may not work either (to check, you can ssh to your router and look at the timestamp on the /tmp/mvps file). In that case, you may just have to rerun the script manually from time to time to get the latest ad lists.
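For example (assuming your router is at 192.168.1.1), this shows when the list was last refreshed and how many entries it has:

ssh root@192.168.1.1 'ls -l /tmp/mvps && wc -l /tmp/mvps'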

Update 2020: I finally switched to pfSense, and pfBlockerNG provides a much better experience. Pi-hole is another great option – also much better than tinkering with DD-WRT.

Starting with PySpark – configuration

PySpark is a pain to configure.

For this guide I am using macOS Mojave, Spark 2.4.0, and Python 3.

Start by downloading Spark from https://spark.apache.org/downloads.html. Extract it wherever you like – your home directory works.

Install the Java JDK. Important – some later versions don't seem to be compatible with Spark 2.4.0. Version 8 seems to work: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Install pyspark: pip install pyspark

Configure your .zshrc or .bash_profile, depending on which shell you use:

export SPARK_PATH=~/spark-2.4.0-bin-hadoop2.7
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'

export SPARK_HOME=~/spark-2.4.0-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

# only needed for older Spark releases that bundle py4j 0.8.x; Spark 2.4.0 ships py4j 0.10.7 (already on PYTHONPATH above)
# export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"

export JAVA_HOME=$(/usr/libexec/java_home)

Remember to reload your shell configuration.
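For example, depending on your shell:

source ~/.zshrc    # or: source ~/.bash_profile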

Now, when you run pyspark in your console, it will open a Jupyter notebook.

You can validate that a Spark context is available by entering this in your new notebook:

from pyspark import SparkContext
sc = SparkContext.getOrCreate()
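If that works, a tiny computation confirms the local workers are wired up (the expected result is just the sum of 0…999):

sc.parallelize(range(1000)).sum()  # 499500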

References: https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735

My Thoughts about CSE 6250 Big Data Analytics in Healthcare (taken in Spring 2019)

This is my first class in the Georgia Tech OMSCS program. I would have preferred to take a different class first (Machine Learning) so I would have a better understanding of machine learning algorithms before trying to apply them, but as a newcomer you are last on the priority list, so the other classes were completely full by the time I could make my selection.

Prerequisites

It would help if you are familiar with Python and at least some machine learning algorithms.

The second homework involves some math that requires you to use the chain rule for derivatives. However, the majority of the tasks are more practical.

Effort

Very intense. You'll have to use multiple languages and tools to complete your homework. For this year (Spring 2019) that includes Python, a variation of SQL, and Scala; Hadoop, Pig, Spark… This class is taking more of my time than I wanted to spend on it with a family and a full-time job.

Grading

The automated code grader has bugs.

My first homework had some points taken off for tasks we were not even supposed to do. I contacted the teaching assistant (TA) and had full credit restored.

The second homework had some points taken off because the graders split the script into parts and ran each part separately, while my code expected the whole script to run at once. The assignment did not mention anything about this. I again had full credit restored after talking to the teaching assistants and demonstrating that the issue was with this unstated requirement. The teaching assistant was very responsive.

For my third homework (Spark + Scala), I initially received 0 points because I had been trying out some plugins and modified the Scala project file, then forgot to revert it, so my homework could not be run by the automated grader. This time the first TA never responded (I waited about 4 days and followed up once), but the second TA replied right away. He manually reran my code, and I only lost a few points due to the bad project file.

The last (fifth) homework (PyTorch + deep learning) requires a lot of time. As part of it, you can take part in a Kaggle competition with your classmates. I totally sucked at this one. I think I had some bugs in the data preprocessing stage, even though I passed the included unit tests.

A note about the homework submission process – if you miss a file or make a typo, you won't know about it until your homework is officially graded. There is no immediate feedback on submission.

Tools

Docker – there are several ways to run your homework assignments if you don't want to set up your local environment for each task. I used the provided Docker image (there is also an option to use an Azure virtual machine, but I did not use it).

TeX editor – I used TeXstudio on Mac. You can use regular Word and save to PDF for homework assignments that require a written answer, but some of them require you to typeset formulas. Although I found the TeX format extremely frustrating, at least the original homework assignments are provided in both .tex and .pdf formats, so you can start with the provided .tex file and fill in the answers.

Overleaf – something I discovered at the end of the class. This is an online LaTeX editor that allows you to collaborate with other students. As long as you sign up with your Georgia Tech email, it's free.

Professor Involvement

Nonexistent. Your only chance to see the professor is through the Udacity lectures. The professor did not answer a single question on Piazza; it was 100% TAs.

Overall Impression

There is no need to jam so many technologies into a single class. Sometimes I felt like I was just going through different sections of the homework filling in the missing parts (they usually provide a method signature and you're supposed to write the code) without actually understanding the bigger picture. Not a bad class, especially if you can dedicate enough time to it, but I would not recommend it as your first class.

Machine Learning Algorithms Problem Types

Types of problems we can solve with machine learning:

  • Regression – helps establish a relationship between one or more sets of data

    • Algorithms
      • Simple linear regression
      • Multiple Linear Regression
      • Polynomial Regression
      • Support Vector Regression (SVR)
      • Decision Tree
      • Random Forest Regression
    • Sample problem: calculate the time I get to work based on the route I take and the day of the week
  • Classification – helps us answer a yes/no type of question based on one or more sets of data

    • Algorithms
      • K Nearest Neighbors (KNN)
      • Kernel SVM
      • Logistic Regression
      • Naïve Bayes
      • Decision Tree
      • Random Forest Classification
    • Sample problem: will I be late or on time based on the route I take and the day of the week (a small code sketch of this appears after the list)
  • Clustering – helps us discover clusters of data

    • Algorithms
      • Hierarchical Clustering
      • K Means
    • Sample problem: segment customers into groups based on their income and spending
  • Association – helps determine an association among multiple events

    • Algorithms
      • Apriori
      • Eclat
    • Sample problem: if I like movie A, what other movies am I likely to enjoy
  • Reinforcement – helps balance exploration and exploitation

    • Algorithms
      • Thompson Sampling
      • UCB
    • Sample problem: we want to determine the most effective treatment. Instead of conducting a long-term randomized trial, use UCB or Thompson Sampling to converge on the best treatment in a shorter time
  • Natural Language Processing

    • Algorithms
      • Any classification algorithm, but the most popular are Naïve Bayes and Random Forest
    • Sample problem: determine if an Amazon review is positive or negative
  • Deep Learning – can help determine hard-to-establish non-linear relationships between multiple input parameters and some expected outcome

    • Algorithms
      • Artificial Neural Networks (ANN)
      • Convolutional Neural Networks (CNN) – especially helpful when processing images
    • Sample problem: based on credit score, age, balance, salary, tenure… determine whether a customer is likely to continue using your service or leave
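To make the classification sample problem concrete, here is a minimal scikit-learn sketch; the commute data and feature encoding are made up purely for illustration:

# Hypothetical data: route taken (0 or 1) and day of week (0-6) -> late (1) or on time (0)
from sklearn.ensemble import RandomForestClassifier

X = [[0, 0], [0, 1], [1, 1], [1, 4], [0, 4], [1, 2], [0, 3], [1, 0]]  # [route, day_of_week]
y = [0, 0, 1, 1, 0, 1, 0, 1]                                          # 1 = late, 0 = on time

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X, y)

print(model.predict([[1, 3]]))  # will I be late taking route 1 on a Thursday?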

Checking/Cleaning Disk Space on Linux

Check the disk space (you may need to install ncdu first):

sudo ncdu /
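If ncdu isn't installed yet, on Debian/Ubuntu it's available in the standard repositories:

sudo apt-get install ncdu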

Clean up unused stuff:

sudo apt-get clean
sudo apt-get autoclean
sudo apt-get autoremove

clean: clears out the local repository of retrieved package files. It removes everything but the lock file from /var/cache/apt/archives/ and /var/cache/apt/archives/partial/. When APT is used as a dselect(1) method, clean is run automatically. Those who do not use dselect will likely want to run apt-get clean from time to time to free up disk space.

autoclean: Like clean, autoclean clears out the local repository of retrieved package files. The difference is that it only removes package files that can no longer be downloaded, and are largely useless. This allows a cache to be maintained over a long period without it growing out of control. The configuration option APT::Clean-Installed will prevent installed packages from being erased if it is set to off.

autoremove: removes packages that were automatically installed to satisfy dependencies for other packages and are no longer needed.

See a related question on askubuntu: https://askubuntu.com/questions/3167/what-is-difference-between-the-options-autoclean-autoremove-and-clean

Country Blocking with Rails and Cloudflare

Enable IP Geolocation in your Cloudflare panel – it should be in the Network tab.

The country code will come in the CF-IPCountry header, which Rails exposes as request.env['HTTP_CF_IPCOUNTRY'].

Now we can add a before_action filter to block or redirect users from specific countries (in the example below we redirect all EU countries… because who has the time to figure out GDPR):

class ApplicationController < ActionController::Base
  before_action :block_gdpr_countries

  # ISO 3166-1 alpha-2 codes, as returned by Cloudflare
  # (note: Greece is GR and the United Kingdom is GB, not the EU's "EL"/"UK" codes)
  GDPR_COUNTRIES = [
    'BE', 'GR', 'LT', 'PT',
    'BG', 'ES', 'LU', 'RO',
    'CZ', 'FR', 'HU', 'SI',
    'DK', 'HR', 'MT', 'SK',
    'DE', 'IT', 'NL', 'FI',
    'EE', 'CY', 'AT', 'SE',
    'IE', 'LV', 'PL', 'GB'
  ]

  def block_gdpr_countries
    return unless GDPR_COUNTRIES.include?(request.env['HTTP_CF_IPCOUNTRY'])
    redirect_to gdpr_path
  end
end

Remember to skip this action in the target controller (in our case, the gdpr controller) if you use a redirect; otherwise you'll get a redirect loop:

skip_before_action :block_gdpr_countries
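For completeness, here is a minimal sketch of what the target route and controller might look like – the gdpr route, GdprController, and the view name are assumptions, so adjust them to your app:

# config/routes.rb
get 'gdpr', to: 'gdpr#index', as: :gdpr

# app/controllers/gdpr_controller.rb
class GdprController < ApplicationController
  skip_before_action :block_gdpr_countries

  def index
    # renders app/views/gdpr/index.html.erb with a "not available in your region" notice
  end
end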

Python with VSCode (using Anaconda)

Let’s install Anaconda from https://www.anaconda.com/download/

Update your .bashrc or .zshrc with:

export PATH="$HOME/anaconda3/bin:$PATH"
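After reloading your shell, you can confirm the Anaconda interpreter is the one being picked up:

which python     # should point to ~/anaconda3/bin/python
python --version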

To change the VS Code Python version:

Cmd+Shift+P -> "Python: Select Interpreter" -> pick Python 3

Install the auto-formatting and linting packages:

conda install pylint
conda install autopep8
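Depending on your version of the VS Code Python extension, you can point linting and formatting at these packages through settings.json; a sketch (the exact setting names may vary between extension versions):

{
  "python.linting.enabled": true,
  "python.linting.pylintEnabled": true,
  "python.formatting.provider": "autopep8",
  "editor.formatOnSave": true
}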

Create a Python debug configuration.
The stopOnEntry option is buggy with Python as of this writing and makes it impossible to set breakpoints, so we set it to false for now.
Setting pythonPath to the Anaconda interpreter is only necessary if your default VS Code interpreter is different from Anaconda's (and assuming you want to use Anaconda's interpreter). Otherwise, you can leave it pointing to "${config:python.pythonPath}".

{
  "name": "Python conda",
  "type": "python",
  "request": "launch",
  "stopOnEntry": false,
  "pythonPath": "~/anaconda3/bin/python",
  "program": "${file}",
  "cwd": "${workspaceFolder}",
  "env": {},
  "envFile": "${workspaceFolder}/.env",
  "debugOptions": ["RedirectOutput"],
  "args": ["-i"]
}