rake ml:learn - AWS Machine Learning through rake

Automating the creation of your machine learning (ML) model can allow your services to evolve over time, automatically. This article assumes you have used the AWS ML Web Interface or have some understanding of AWS ML.

Ideally, we would have a model that uses Stochastic Gradient Descent and learns with every query. But when you want a low-maintenance solution such as AWS ML, that isn't an option, since AWS ML models are immutable by design.

You can, however, retrain a new model and then flip the switch so queries go to the new endpoint.

In our setup, achieving this requires a classic Extract, Transform, Load (ETL) pipeline. This article focuses on the last step, Load, and on the hardest part of the AWS ML load: creating data sources.

Before we talk about load, though, let's set up a simple extract and transform so we all have context and are on the same page.


Extract

  • Specific to your business logic
  • Can be skipped in this simple example and fed directly into the transform
  • More complicated examples might require getting data dumps from multiple databases or other teams

module Machine
  class Extractor
    def self.perform
      CSV.open(Rails.root.join("tmp", "extracted_data.csv"), "w") do |csv|
        ImportantData.find_each do |data|
          csv << data.to_csv
        end
      end
    end
  end
end


Transform

  • Critical step involves converting your raw data into features to be ingested by AWS Machine Learning models
  • What those features are depends on your data and your machine learning model, whether it be simple linear regression or a deep neural network
  • Figuring out features is outside the scope of this article but read Machine Learning Long Ids for some more insight, or better yet, take Andrew Ng’s Machine Learning Course

module Machine
  class Transformer
    def self.perform(filename = Rails.root.join("tmp", "extracted_data.csv"))
      Dir.mkdir("tmp") unless File.exist?("tmp")
      AwsMlTransformer.new(filename, "tmp/intermediary.csv").write
      FeatureExpanderWriter.new("tmp/intermediary.csv", "tmp/features.csv").write
    end
  end
end


Load

This is the good stuff. Check out the steps listed in the method below; we'll walk through each one.

module Machine
  class Loader
    def self.perform(filename = Rails.root.join("tmp", "features.csv"))
      instance = new
      instance.upload_data_to_s3(filename)
      instance.create_data_sources
      instance.create_model
      instance.create_evaluation
      instance.create_realtime_endpoint
    end
  end
end


All of these make use of the AWS SDK v2 for Ruby.
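None of the Loader's plumbing (the `ml_client`, `bucket_name`, `timestamp`, and `data_location` helpers, or the accessors the methods below assign to) is shown here. A minimal sketch, assuming a hard-coded bucket and a time-based ID suffix, neither of which is from the original:

```ruby
# Sketch of the Loader's plumbing; the bucket name and the timestamp-based
# ID suffix are assumptions, not the article's actual values.
module Machine
  class Loader
    attr_accessor :data_file, :model_data_source_id, :evaluation_data_source_id,
                  :ml_model_id, :evaluation_id

    # Replace with your bucket
    def bucket_name
      "dev.machinelearningservice.dimroc.com"
    end

    # Suffix used to keep each run's AWS ML resource IDs unique
    def timestamp
      @timestamp ||= Time.now.utc.strftime("%Y%m%d%H%M%S")
    end

    # S3 URI of the file uploaded by upload_data_to_s3
    def data_location
      "s3://#{bucket_name}/uploads/#{File.basename(data_file)}"
    end

    # Assumes the aws-sdk (v2) gem is loaded by the app
    def ml_client
      @ml_client ||= Aws::MachineLearning::Client.new
    end
  end
end
```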

Upload Data to S3

  • AWS ML needs access to your S3 bucket
    • This can be done with a bash script as explained on Amazon’s website
    • Example below uses awscli, which can be installed with brew install awscli
    • Replace dev.machinelearningservice.dimroc.com with your bucket
  • Using the AWS Ruby SDK v2, upload the features.csv file
  def upload_data_to_s3(filename)
    puts "Uploading #{filename} to S3..."
    self.data_file = filename

    client = Aws::S3::Client.new
    File.open(filename, "rb") do |f|
      client.put_object(
        bucket: bucket_name,
        key: "uploads/#{File.basename(filename)}",
        server_side_encryption: "AES256",
        body: f)
    end
  end
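On bucket access: Amazon's docs describe the bash/awscli route, but the same policy can be applied with the Ruby SDK. The statement below is modeled on AWS ML's documented S3 permissions; treat the exact actions as an assumption to verify against the docs:

```ruby
require "json"

# Bucket policy letting the AWS ML service read uploaded training data.
# Swap in your bucket; the shape follows AWS ML's S3 permissions docs.
BUCKET = "dev.machinelearningservice.dimroc.com"

ml_read_policy = {
  "Version" => "2012-10-17",
  "Statement" => [
    { "Effect" => "Allow",
      "Principal" => { "Service" => "machinelearning.amazonaws.com" },
      "Action" => "s3:GetObject",
      "Resource" => "arn:aws:s3:::#{BUCKET}/*" },
    { "Effect" => "Allow",
      "Principal" => { "Service" => "machinelearning.amazonaws.com" },
      "Action" => "s3:ListBucket",
      "Resource" => "arn:aws:s3:::#{BUCKET}" }
  ]
}

# Apply it (requires credentials with s3:PutBucketPolicy):
# Aws::S3::Client.new.put_bucket_policy(bucket: BUCKET, policy: ml_read_policy.to_json)
```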

Create Data Sources

This is the hardest step in the load stage. If all you read is this section, you’ll be much better for it.

  def create_data_sources
    puts "Creating data sources from S3..."

    # Create data sources for both model and evaluation
    self.model_data_source_id = "learn-mds-#{timestamp}"
    self.evaluation_data_source_id = "learn-eds-#{timestamp}"

    # Training
    ml_client.create_data_source_from_s3(
      data_source_id: model_data_source_id,
      data_source_name: "Model Source: LearnSample 0-70 #{timestamp}",
      compute_statistics: true, # Required to create ML Model
      data_spec: {
        data_location_s3: data_location,
        data_schema: data_schema.to_json, # More on this below
        data_rearrangement: data_rearrangement.to_json
      })

    wait_for_ml(:data_source_available) # Block until complete

    # Evaluation
    ml_client.create_data_source_from_s3(
      data_source_id: evaluation_data_source_id,
      data_source_name: "Evaluation Source: LearnSample 70-100 #{timestamp}",
      compute_statistics: true,
      data_spec: {
        data_location_s3: data_location,
        data_schema: data_schema.to_json,
        data_rearrangement: data_rearrangement(true).to_json
      })
  end


The code above performs a few key steps:

  • Compute statistics
    • Necessary to train the ML Model, as mentioned in the documentation
  • Data Rearrangement
    • Used when you want to split a data source in two using complement
      • complement tells AWS to split a data source in two: one portion for training and one for evaluation. Documentation.
      •   def data_rearrangement(complement = false)
            {
              splitting: {
                percentBegin: 0,
                percentEnd: 70,
                strategy: "random",
                complement: complement
              }
            }
          end
    • More documentation can be found here
  • Data Schema
    • Describes each CSV column with AWS Machine Learning metadata. The example below is for a Multiclass Classification Model
    •   {
          "version" : "1.0",
          "rowId" : null,
          "rowWeight" : null,
          "targetAttributeName" : "policy",
          "dataFormat" : "CSV",
          "dataFileContainsHeader" : true,
          "attributes" : [ {
            "attributeName" : "policy",
            "attributeType" : "CATEGORICAL"
          }, {
            "attributeName" : "length",
            "attributeType" : "NUMERIC"
          }, {
            "attributeName" : "base_10",
            "attributeType" : "NUMERIC"
          }, {
            "attributeName" : "digit_1",
            "attributeType" : "NUMERIC"
          }, {
            "attributeName" : "digit_2",
            "attributeType" : "NUMERIC"
          } ],
          "excludedAttributeNames" : [ ]
        }
    • More information here
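Hand-editing that JSON gets tedious as feature columns accumulate. A small helper (mine, not from the original article) that builds the same schema from a column-to-type map:

```ruby
require "json"

# Builds an AWS ML data schema hash from CSV column types.
# The column names and the "policy" target mirror the example schema above.
def data_schema(target:, columns:)
  {
    "version" => "1.0",
    "rowId" => nil,
    "rowWeight" => nil,
    "targetAttributeName" => target,
    "dataFormat" => "CSV",
    "dataFileContainsHeader" => true,
    "attributes" => columns.map do |name, type|
      { "attributeName" => name, "attributeType" => type }
    end,
    "excludedAttributeNames" => []
  }
end

schema = data_schema(
  target: "policy",
  columns: {
    "policy"  => "CATEGORICAL",
    "length"  => "NUMERIC",
    "base_10" => "NUMERIC",
    "digit_1" => "NUMERIC",
    "digit_2" => "NUMERIC"
  })
# schema.to_json is what data_spec[:data_schema] expects
```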

That was a lot. Each section warrants a decent write-up, so for now, I recommend reading the high-level information about data rearrangement and data schema.

Rest assured though, it’s far simpler from here on out. The rest is really just API calls using the IDs you just received.
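One loose end: wait_for_ml is called above but never defined. A minimal polling sketch under the assumption that it watches the resource's status field (PENDING, INPROGRESS, COMPLETED, FAILED); the SDK also ships waiters such as :data_source_available that do this for you:

```ruby
# Generic poll-until-terminal-state helper (hypothetical; not from the
# original). Yields until the block reports COMPLETED or FAILED.
def wait_until_completed(interval: 30, timeout: 3600)
  deadline = Time.now + timeout
  loop do
    status = yield
    return status if status == "COMPLETED"
    raise "AWS ML resource entered FAILED state" if status == "FAILED"
    raise "Timed out waiting for AWS ML" if Time.now > deadline
    sleep interval
  end
end

# e.g. inside the Loader:
# wait_until_completed do
#   ml_client.get_data_source(data_source_id: model_data_source_id).status
# end
```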

Create Model From Data Source

  def create_model
    puts "Creating model..."
    self.ml_model_id = "learn-ml-#{timestamp}"
    ml_client.create_ml_model(
      ml_model_id: ml_model_id,
      ml_model_name: "ML model: LearnSample #{timestamp}",
      ml_model_type: "MULTICLASS",
      training_data_source_id: model_data_source_id,
      parameters: {
        "sgd.maxPasses" => "20",
        "sgd.shuffleType" => "auto"
      })
  end

Create Evaluation

  def create_evaluation
    puts "Creating evaluation..."
    self.evaluation_id = "learn-ev-#{timestamp}"
    ml_client.create_evaluation(
      evaluation_id: evaluation_id,
      evaluation_name: "Evaluation: LearnSample #{timestamp}",
      ml_model_id: ml_model_id,
      evaluation_data_source_id: evaluation_data_source_id)
  end


The AWS ML console visualizes a completed multiclass evaluation as an F1 heat map.
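The headline metric can also be pulled programmatically once the evaluation finishes. For a MULTICLASS model the documented performance property is MulticlassAvgFScore; the extraction helper below is hypothetical:

```ruby
# Extracts the average F1 score from an AWS ML evaluation's
# performance_metrics.properties hash (values arrive as strings).
def average_f1(properties)
  Float(properties.fetch("MulticlassAvgFScore"))
end

# usage (hits AWS):
# metrics = ml_client.get_evaluation(evaluation_id: evaluation_id).performance_metrics
# puts "Average F1: #{average_f1(metrics.properties)}"
```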

Create Realtime Endpoint

Expose your service!

  def create_realtime_endpoint
    puts "Creating realtime endpoint..."
    response = ml_client.create_realtime_endpoint(ml_model_id: ml_model_id)

    puts "Machine Learning complete! Update your ENV variables with the following:"
    puts "export AWS_ML_MODEL_ID=#{ml_model_id}"
    puts "export AWS_ML_PREDICTION_URL=#{response.realtime_endpoint_info.endpoint_url}"
  end
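Flipping the switch then just means updating those ENV variables in the querying service. A sketch of the realtime query; the record keys must match your schema's attribute names, and values are strings:

```ruby
# Builds the params for Aws::MachineLearning::Client#predict from ENV,
# so swapping in a freshly trained model is an ENV change plus a restart.
def predict_params(record)
  {
    ml_model_id: ENV.fetch("AWS_ML_MODEL_ID"),
    predict_endpoint: ENV.fetch("AWS_ML_PREDICTION_URL"),
    record: record
  }
end

# usage (hits AWS; record keys come from the data schema above):
# label = Aws::MachineLearning::Client.new
#           .predict(predict_params("length" => "9", "base_10" => "123456789"))
#           .prediction.predicted_label
```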

Wire it all up with rake

namespace :ml do
  desc "Transforms raw extracted data into a format that's ingestable by AWS Machine Learning"
  task :transform => :environment do
    Machine::Transformer.perform
  end

  desc "Loads pre-transformed data into AWS Machine Learning"
  task :load => :environment do
    Machine::Loader.perform
  end

  desc "Transform and load raw insurance policy number data into AWS Machine Learning"
  task :learn => [:transform, :load]
end

In bash or your scheduled job:

rake ml:learn

Run it weekly to keep your machines learning.
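For the weekly cadence, a cron entry does the job; the path and schedule below are placeholders:

```shell
# crontab -e — run every Sunday at 03:00
0 3 * * 0 cd /path/to/app && bundle exec rake ml:learn >> log/ml_learn.log 2>&1
```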
