I was trying to figure out a way to get the text representation of documents stored either in OneDrive or SharePoint Online to execute some text analysis techniques using Azure machine learning studio. I know for sure (not really just guessing) that SharePoint search index store text representation of the document but I guess this version of the document is not exposed to us


I also happen to know (this one is not a guess) from the good old days that SharePoint search uses IFilter to get file content as text then store this text in the index. I tried to do it in a different way.
So I figured how about doing this document to text conversion myself, I found Tika text extraction library  handy apache open source tool which has been ported as .NET nuget package 

I've create a simple Azure function using Visual studio, It has been a while since I used the full fledged Visual studio as I've been using mostly Visual studio code lately, as you guys can see the azure function is pretty straight forward just 4 lines of code to convert the docx files to text representation so we can use any text analysis techniques on our SharePoint documents.

Now let's hook this to a simple Flow which been triggered when a new file been uploaded to specific SharePoint library.
The flow will start then it will trigger the azure function which will extract the text representation of the office document and send it to a web-service to do some text analysis and return the document classification value.



Then within the flow itself we can update the SharePoint document and update the classification as per the text analysis result.

Hint: We will consider this web service call used in this flow as HTTP2 as a black box for now. to give you a sneak peak It's based on multi-class neural network classification algorithm built using Azure Machine Learning Studio and we will discuss this particular building block in more details in part 2 of this series.

now let's upload a new word document that have a text represents a business article and let's see the updated category text value

Here we go , our smart document categorization flow is able to classify the document as business document.
In the next part of this blog series, we will discuss the azure machine learning studio experiment in more details.