Create your first Azure ML training experiment is a relatively easy task compared to data preparation phase.
First step is to navigate to
https://studio.azureml.net/ and login using your Microsoft Account, you don't need to have an active Microsoft Azure subscription or credit card. LUIS and Bot framework were also free but now you need to have Azure subscription to use these two other services, don't know whether this will change in the future. However, till the time of writing this blog post it's still free.
From the left menu choose experiments and choose new , choose blank experiment as we will build this together from scratch
Let's name our experiment (our shiny new experiment) and we will use BBC news data set, you can substitute this with your own prepared set which will have the text representation of Office document with the current category
Then we will do some data cleansing, select News and category from the data set and clean up empty rows by removing any rows that does have any missing columns by setting minimum missing values range 0 to 1 ( that means if only single column has a missing value the action "cleaning mode" will be triggered , we will choose to remove the entire row
Now it's getting more exciting, we will use some text analysis technique called Extract N-Gram features, this will use the News column (text representation of the document) as an input and based on the repetition of a single word or tuple of words it can analyse the tuple effect in the categorization.
First thing we will change the default column selection from any string column to only analyse the News column which represents the text extraction from Office document.
Secondly, we will choose to create a vocabulary mode, the result vocabulary can be used later on in the predictive experiment. the next option is the N-Gram size which dictate to what extend you want the tuple to grow. for example if you keep the default 1 it will only consider a single word. However, if you choose 2 it will consider single word and any couple of consecutive words. In our example we will use three that means the text analysis will consider a single word, two words and three words tuples.
In the weighting function there are multiple options, I will choose TF-IDF which means term frequency and inverse document frequency.
This technique is give more weight for terms that appears more than others with the consideration of giving a negative score for terms that appears in different documents( news items in our case) with different classification. The final score for the vocabulary is a mixture of both TF and IDF values.
There are lots of other options which can be viewed in more details in this excellent guide for N-Gram module guide
here
One important option is the desired output features which is basically how many tuples you want to use to categorize your data, this might be a trial and error for a newbie like me until I can see the top n effective tuples to make it easier and faster for the trained model to compare future text against, in my scenario I used 5000 Features as a result of N-Gram Feature extraction step.
After the data preparation and vocabulary preparation , we will use 4 steps that is common across any training experiment which are (Split, Train, Score and Evaluate)
The first Step is to split the data randomly with some condition or purely based on a random seed, train the model using part of the data then score the model using the other portion, last step is to evaluate the model and view how accurate your model is.
Based on the evaluation result you can see overall model accuracy, at this point if the model is not hitting the mark ( you can set target accuracy based on business requirements). One possible solution is to change N-Gram tuple values, output features or even use a complete different training algorithms. Sometimes it's useful to to include additional columns or metadata to help categorize the documents maybe a department name or author not only the document content.
In my sample I got overall accuracy of 80% which is accepted to me so relying on the text extraction from the documents is sufficient to me.
Now let's run the experiment and confirm that all steps are executed successfully which we can validate with the green check mark on every step