Auto-Classify Documents in SharePoint Using Azure Machine Learning Studio: Part 2
This is part 2 of a two-part blog series that briefly explains how to use Azure Machine Learning to auto-classify SharePoint documents. In part one, we covered the end-to-end solution skeleton, which relies on Microsoft Flow. The flow is triggered whenever a new document is uploaded to our target SharePoint library.
The main challenge we faced was how to extract a text representation from Microsoft Office .docx files. As explained in the previous blog post, I ended up using the .NET version of the open-source Tika library to extract the text. In the previous post we also treated Azure Machine Learning as a web service that we call using the Flow HTTP action. Today we will explore this black box in more detail:
- The Data
This is by far the most important and complicated step of the whole process, as it is very specific to the problem you're trying to solve. Finding enough data to train a model that predicts correctly with high probability is also where you will spend most of your time.
As my example is merely a POC, I chose an easy path. I used an existing dataset available to everyone who has access to Azure ML Studio (the BBCNews dataset), and then tailored the SharePoint content to match the dataset.
- The Training Experiment
Creating your first Azure ML training experiment is a relatively easy task compared to the data preparation phase.
The first step is to navigate to https://studio.azureml.net/ and log in using your Microsoft account. You don't need an active Microsoft Azure subscription or a credit card. LUIS and Bot Framework used to be free as well, but they now require an Azure subscription; I don't know whether the same will happen here in the future. At the time of writing this blog post, however, it is still free.
From the left menu, choose Experiments, then New, then Blank Experiment, as we will build this together from scratch.
Let's name our experiment (our shiny new experiment) and use the BBC news dataset. You can substitute your own prepared dataset, which would hold the text representation of each Office document together with its current category.
Then we will do some data cleansing: select the News and Category columns from the dataset and remove any rows that have missing values. We set the missing value range from 0 to 1 (meaning that even if only a single column has a missing value, the chosen "cleaning mode" is triggered) and choose to remove the entire row.
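To make the cleansing step a bit more concrete, here is a minimal Python sketch of the same idea outside Azure ML Studio. The `select_and_clean` helper and the sample rows are hypothetical and only for illustration; the `News` and `Category` column names are the ones used above.

```python
def select_and_clean(rows, columns=("News", "Category")):
    """Keep only the given columns and drop any row with a missing value."""
    cleaned = []
    for row in rows:
        selected = {col: row.get(col) for col in columns}
        # Treat None and blank strings as missing values.
        if any(v is None or str(v).strip() == "" for v in selected.values()):
            continue  # remove the entire row
        cleaned.append(selected)
    return cleaned


# Hypothetical sample rows, for illustration only.
rows = [
    {"News": "Shares rise after strong earnings", "Category": "business", "Id": 1},
    {"News": "", "Category": "sport", "Id": 2},
    {"News": "New phone released", "Category": None, "Id": 3},
]
cleaned_rows = select_and_clean(rows)
```

Only the first row survives here, because the other two each have one missing value.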
Now it gets more exciting. We will use a text analysis technique called Extract N-Gram Features. It takes the News column (the text representation of the document) as input and, based on how often a single word or a tuple of consecutive words occurs, analyses the effect of each tuple on the categorization.
First, we change the default column selection from any string column to only the News column, which represents the text extracted from the Office document.
Secondly, we set the vocabulary mode to Create; the resulting vocabulary can be used later in the predictive experiment. The next option is the N-gram size, which dictates how far the tuple can grow. For example, if you keep the default of 1, only single words are considered. If you choose 2, single words and pairs of consecutive words are considered. In our example we use 3, which means the text analysis considers single words, two-word tuples and three-word tuples.
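To illustrate what an N-gram size of 3 means, here is a small Python sketch that lists every single word, pair and triple of consecutive words in a text. This is not the module's actual implementation; the tokenization rule and the `extract_ngrams` name are my own assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_ngrams(text, max_n=3):
    """Return all n-grams of size 1..max_n as space-joined strings."""
    tokens = tokenize(text)
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


sample_ngrams = extract_ngrams("Interest rates rise again", max_n=3)
```

For this four-word sentence the result holds four single words, three pairs such as "rates rise" and two triples such as "interest rates rise".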
For the weighting function there are multiple options. I chose TF-IDF, which stands for term frequency and inverse document frequency.
This technique gives more weight to terms that appear frequently in a document, while lowering the weight of terms that appear across many different documents (news items in our case), since such terms do little to tell one category from another. The final score of each vocabulary entry combines the TF and IDF values.
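Here is a minimal sketch of one common textbook form of TF-IDF, where the weight is the raw term count multiplied by log(N / document frequency). The exact formula used by Azure ML Studio may differ, so treat this as an assumption; the `tf_idf` helper and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical tokenized news items.
docs = [
    ["the", "match", "ended", "in", "a", "draw"],
    ["the", "market", "ended", "higher"],
    ["the", "match", "was", "postponed"],
]
weights = tf_idf(docs)
```

In this toy set, "the" appears in every document and ends up with a weight of zero, while "draw", which appears in only one document, gets the highest weight in the first item.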
There are lots of other options, which are covered in more detail in the excellent guide for the Extract N-Gram Features module.
One important option is the number of desired output features, which is basically how many tuples you want to use to categorize your data. For a newbie like me this is a matter of trial and error: the goal is to keep the top n most effective tuples so that it is easier and faster for the trained model to compare future text against them. In my scenario I used 5000 features as the output of the N-gram feature extraction step.
After the data and vocabulary preparation, we use four steps that are common across training experiments: Split, Train, Score and Evaluate.
The first step splits the data, either by some condition or purely at random based on a random seed. We then train the model on one part of the data and score it on the other portion. The last step evaluates the model so you can see how accurate it is.
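The four steps can be sketched in plain Python as follows. I have not named the training algorithm in this post, so the tiny multinomial Naive Bayes classifier below is only a stand-in I picked for illustration, and the six sample rows are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def split_data(rows, train_fraction=0.7, seed=42):
    """Shuffle with a fixed seed and split into train and test portions."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def train(rows):
    """Fit a tiny multinomial Naive Bayes model on (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in rows:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def score(model, text):
    """Return the most likely label for text, using add-one smoothing."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_log_prob = None, float("-inf")
    for label, doc_count in label_counts.items():
        log_prob = math.log(doc_count / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in text.lower().split():
            log_prob += math.log((word_counts[label][token] + 1) / denom)
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


def evaluate(model, rows):
    """Accuracy: fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for text, label in rows if score(model, text) == label)
    return correct / len(rows)


# Hypothetical (text, category) rows, for illustration only.
data = [
    ("goal scored in the final match", "sport"),
    ("team wins the league title", "sport"),
    ("striker injured before the match", "sport"),
    ("shares fall as market slides", "business"),
    ("bank raises interest rates", "business"),
    ("company profits beat market forecast", "business"),
]
train_rows, test_rows = split_data(data, train_fraction=0.5, seed=1)
model = train(train_rows)
accuracy = evaluate(model, test_rows)
```

With so few rows the accuracy value itself means little; the point is only the shape of the pipeline, where the model never sees the test portion during training.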
The evaluation result shows the overall model accuracy. If the model is not hitting the mark (you can set a target accuracy based on business requirements), possible remedies are to change the N-gram size or the number of output features, or even to use a completely different training algorithm. Sometimes it is also useful to include additional columns or metadata, such as a department name or author, to help categorize the documents, rather than relying on the document content alone.
In my sample I got an overall accuracy of 80%, which is acceptable to me, so relying on the text extracted from the documents is sufficient.
Now let's run the experiment and confirm that all steps executed successfully, which we can validate by the green check mark on every step.
- The Predictive Experiment
Let's use the vocabulary generated by the training experiment as an input (we need to save the resulting vocabulary of the training experiment as a dataset so we can use it later).
We will also remove any transformation steps and select only a single column (the news text representation), which will be the single input for our web service.
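The idea of reusing the saved vocabulary at prediction time can be sketched like this: new text is turned into features using only the n-grams already in the vocabulary, and anything unseen is ignored. The `featurize` helper, the tokenization rule and the three-entry vocabulary are hypothetical, not what the module does internally.

```python
import re
from collections import Counter


def featurize(text, vocabulary, max_n=3):
    """Count only the n-grams that already exist in the saved vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in vocabulary:  # unseen n-grams are ignored, never added
                counts[ngram] += 1
    # Fixed-order feature vector, one slot per vocabulary entry.
    return [counts[term] for term in vocabulary]


# Hypothetical saved vocabulary from the training experiment.
saved_vocabulary = ["interest rates", "match", "market"]
features = featurize(
    "Market reacts as interest rates rise and the market falls",
    saved_vocabulary,
)
```

Because the vocabulary is never modified here, the feature vector always has the same length and order the trained model expects.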
- The Webservice
Let's run the predictive experiment now and make sure it works. Then we can deploy the web service, which we can use to classify the text representation of documents. Deploying opens a new page that lets us test the web service by supplying text as the "News" input and getting the scored label output back as text.
P.S. You must pass the API key in the authorization header to make it work ;)
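As a rough sketch of what the HTTP call looks like, here is how the request could be built in Python. The URL and API key are placeholders, and the payload shape (`Inputs`, `input1`, `ColumnNames`, `Values`, `GlobalParameters`) is an assumption based on the usual request format for these web services; check the API help page of your own service for the exact names.

```python
import json
import urllib.request


def build_request(url, api_key, news_text):
    """Build a POST request that carries the API key as a bearer token."""
    body = {
        "Inputs": {
            "input1": {  # assumed input name, check your service's help page
                "ColumnNames": ["News"],
                "Values": [[news_text]],
            }
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


# Placeholder values, replace with your own endpoint and key.
request = build_request(
    "https://example.invalid/execute?api-version=2.0",
    "YOUR_API_KEY",
    "Shares rise after strong earnings",
)
# To actually send it:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

The Flow HTTP action needs the same two pieces: the JSON body with the News text and the authorization header with the key.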
Which is the configuration of the "Extract N-gram" module? Because I have an error with that.
I think you have to set the Extract N-gram from the Predictive Web Service tab to Read-Only instead of Create. This will eliminate the error when you run.