What is text analysis?

Text analysis is the process of using computer systems to read and understand human-written text for business insights. Text analysis software can independently classify, sort, and extract information from text to identify patterns, relationships, sentiments, and other actionable knowledge. You can use text analysis to efficiently and accurately process multiple text-based sources such as emails, documents, social media content, and product reviews, like a human would.

Why is text analysis important?

Businesses use text analysis to extract actionable insights from various unstructured data sources. They depend on feedback from sources like emails, social media, and customer survey responses to aid decision making. However, the immense volume of text from such sources proves to be overwhelming without text analytics software.

With text analysis, you can get accurate information from these sources more quickly. The process is automated and consistent, and it surfaces data you can act on. For example, text analysis software lets you detect negative sentiment in social media posts as soon as it appears so you can work to solve the problem.

Sentiment analysis

Sentiment analysis or opinion mining uses text analysis methods to understand the opinion conveyed in a piece of text. You can use sentiment analysis of reviews, blogs, forums, and other online media to determine if your customers are happy with their purchases. Sentiment analysis helps you spot new trends, track sentiment changes, and tackle PR issues. By using sentiment analysis and identifying specific keywords, you can track changes in customer opinion and identify the root cause of the problem. 

Record management

Text analysis leads to efficient management, categorization, and searches of documents. This includes automating patient record management, monitoring brand mentions, and detecting insurance fraud. For example, LexisNexis Legal & Professional uses text extraction to identify specific records among 200 million documents.

Personalizing customer experience

You can use text analysis software to process emails, reviews, chats, and other text-based correspondence. With insights about customers’ preferences, buying habits, and overall brand perception, you can tailor personalized experiences for different customer segments. 

How does text analysis work?

The core of text analysis is training computer software to associate words with specific meanings and to understand the semantic context of unstructured data. This is similar to how humans learn a new language by associating words with objects, actions, and emotions. 

Text analysis software works on the principles of deep learning and natural language processing.

Deep learning

Artificial intelligence is the field of computer science that teaches computers to think like humans. Machine learning is a technique within artificial intelligence that uses specific methods to teach or train computers. Deep learning is a highly specialized machine learning method that uses neural networks, which are software structures loosely modeled on the human brain. Deep learning technology powers text analysis software so these networks can read text in a similar way to the human brain.

Natural language processing

Natural language processing (NLP) is a branch of artificial intelligence that gives computers the ability to automatically derive meaning from natural, human-created text. It uses linguistic models and statistics to train the deep learning technology to process and analyze text data. When the text comes from images, such as scans of handwritten pages, optical character recognition (OCR) first converts the images into text documents by finding and recognizing the words in them.

What are the types of text analysis techniques?

The text analysis software uses these common techniques.

Text classification

In text classification, the text analysis software learns how to associate certain keywords with specific topics, users’ intentions, or sentiments. It does so by using the following methods: 

  • Rule-based classification assigns tags to the text based on predefined rules for semantic components or syntactic patterns.
  • Machine learning-based classification trains the text analysis software with labeled examples so that it tags text more accurately over time. These systems use algorithms such as Naive Bayes, support vector machines, and deep learning to process the training data, categorize words, and learn the semantic relationships between them.

For example, a favorable review often contains words like good, fast, and great. However, negative reviews might contain words like unhappy, slow, and bad. Data scientists train the text analysis software to look for such specific terms and categorize the reviews as positive or negative. This way, the customer support team can easily monitor customer sentiments from the reviews.
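The keyword example above can be sketched as a small rule-based classifier. This is a minimal illustration, not how production text analysis software works: the word lists come from the example words in the text, and the function name `classify_review` and the sample reviews are assumptions made for the sketch.

```python
import re

# Keyword lists taken from the example words above; real systems learn
# far larger vocabularies from labeled data.
POSITIVE_WORDS = {"good", "fast", "great"}
NEGATIVE_WORDS = {"unhappy", "slow", "bad"}


def classify_review(review):
    """Tag a review as positive, negative, or neutral by counting keywords."""
    words = re.findall(r"[a-z']+", review.lower())
    positive_hits = sum(word in POSITIVE_WORDS for word in words)
    negative_hits = sum(word in NEGATIVE_WORDS for word in words)
    if positive_hits > negative_hits:
        return "positive"
    if negative_hits > positive_hits:
        return "negative"
    return "neutral"


# Hypothetical sample reviews.
reviews = [
    "Great product and fast delivery",
    "Slow shipping, and I am unhappy with the quality",
    "It arrived on Tuesday",
]
labels = [classify_review(review) for review in reviews]
print(labels)  # ['positive', 'negative', 'neutral']
```

A rule like this breaks on negation ("not good") and sarcasm, which is one reason machine learning-based classification is often preferred.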

Text extraction

Text extraction scans the text and pulls out key information. It can identify keywords, product attributes, brand names, names of places, and more in a piece of text. The extraction software applies the following methods:

  • Regular expression (regex): This is a sequence of characters that defines a search pattern. The software extracts any text that matches the pattern.
  • Conditional random fields (CRFs): This is a machine learning method that extracts text by learning patterns from labeled examples and weighing the context around each word. It is more refined and flexible than regex.

For example, you can use text extraction to monitor brand mentions on social media. Manually tracking every occurrence of your brand on social media is impossible. Text extraction will alert you to mentions of your brand in real time. 
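As a rough sketch of the regular expression approach to brand monitoring, the following uses Python's standard `re` module. The brand name "Acme", the function `find_brand_mentions`, and the sample posts are all hypothetical.

```python
import re

# "Acme" is a hypothetical brand name. The pattern matches the brand as a
# whole word, optionally prefixed with @ or #, ignoring letter case.
BRAND_PATTERN = re.compile(r"[@#]?\bacme\b", re.IGNORECASE)


def find_brand_mentions(posts):
    """Return a (post_index, matched_text) pair for every brand mention."""
    mentions = []
    for index, post in enumerate(posts):
        for match in BRAND_PATTERN.finditer(post):
            mentions.append((index, match.group()))
    return mentions


# Hypothetical social media posts.
posts = [
    "Just bought new headphones from @Acme, love them!",
    "Nothing to see here.",
    "Is #acme support down? ACME never answers.",
]
mentions = find_brand_mentions(posts)
print(mentions)  # [(0, '@Acme'), (2, '#acme'), (2, 'ACME')]
```

The word boundaries (`\b`) keep the pattern from firing on longer words that merely contain the brand name.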

Topic modeling

Topic modeling methods identify and group related keywords that occur in an unstructured text into a topic or theme. These methods can read multiple text documents and sort them into themes based on the frequency of various words in the document. Topic modeling methods give context for further analysis of the documents.

For example, you can use topic modeling methods to read through your scanned document archive and classify documents into invoices, legal documents, and customer agreements. Then you can run different analysis methods on invoices to gain financial insights or on customer agreements to gain customer insights.

PII redaction

PII redaction automatically detects and removes personally identifiable information (PII) such as names, addresses, or account numbers from a document. PII redaction helps protect privacy and comply with local laws and regulations.

For example, you can analyze support tickets and knowledge articles to detect and redact PII before you index the documents in a search solution. That way, the documents in the search index are free of PII.
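A simplified way to picture redaction is pattern-based replacement. The sketch below is an assumption-laden illustration: the three patterns cover only email addresses, one phone number layout, and long digit runs, and the names `PII_PATTERNS` and `redact_pii` are made up for this example. Real PII detection typically relies on trained models and much broader rules.

```python
import re

# Simplified, hypothetical patterns. They are applied in this order.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "ACCOUNT": re.compile(r"\b\d{10,16}\b"),
}


def redact_pii(text):
    """Replace each matched PII span with a bracketed placeholder label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Hypothetical support ticket text.
ticket = "Contact jane.doe@example.com or 555-123-4567 about account 1234567890123."
redacted = redact_pii(ticket)
print(redacted)  # Contact [EMAIL] or [PHONE] about account [ACCOUNT].
```

Running redaction before indexing means the search index only ever stores the placeholder labels.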

What are the stages in text analysis?

To implement text analysis, you need to follow a systematic process that goes through four stages.

Stage 1—Data gathering

In this stage, you gather text data from internal or external sources.

Internal data

Internal data is text content that is internal to your business and is readily available—for example, emails, chats, invoices, and employee surveys. 

External data

You can find external data in sources such as social media posts, online reviews, news articles, and online forums. It is harder to acquire external data because it is beyond your control. You might need to use web scraping tools or integrate with third-party solutions to extract external data.

Stage 2—Data preparation

Data preparation is an essential part of text analysis. It involves structuring raw text data in an acceptable format for analysis. The text analysis software automates this process, which involves the following common natural language processing (NLP) methods.

Tokenization 

Tokenization is splitting the raw text into multiple parts that make semantic sense. For example, the phrase "text analytics benefits businesses" tokenizes to the words text, analytics, benefits, and businesses.
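A minimal word-level tokenizer can be sketched with a regular expression. The function name `tokenize` is an assumption, and production tokenizers handle many more cases (contractions, hyphenation, languages without spaces).

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("Text analytics benefits businesses.")
print(tokens)  # ['text', 'analytics', 'benefits', 'businesses']
```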

Part-of-speech tagging

Part-of-speech tagging assigns grammatical tags to the tokenized text. For example, applying this step to the previously mentioned tokens results in text: Noun; analytics: Noun; benefits: Verb; businesses: Noun.

Parsing

Parsing establishes meaningful connections between the tokenized words according to the rules of grammar. It helps the text analysis software map the relationships between words.

Lemmatization 

Lemmatization is a linguistic process that simplifies words into their dictionary form, or lemma. For example, the dictionary form of visualizing is visualize.

Stop words removal

Stop words are words that offer little or no semantic context to a sentence, such as and, or, and for. Depending on the use case, the software might remove them from the structured text.
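The last two preparation steps can be sketched together. This is a toy illustration: the lemma lookup table and the stop word list below are tiny, hypothetical stand-ins for the dictionaries and models that real NLP libraries ship with, and `prepare` is an assumed name.

```python
# Toy lemma lookup; real lemmatizers use full dictionaries and grammar rules.
LEMMAS = {
    "visualizing": "visualize",
    "benefits": "benefit",
    "businesses": "business",
}

# Small hypothetical stop word list; real lists hold hundreds of words.
STOP_WORDS = {"and", "or", "for", "the", "a", "of"}


def prepare(tokens):
    """Reduce tokens to their lemmas, then drop stop words."""
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]


tokens = ["visualizing", "the", "benefits", "of", "text", "analytics", "for", "businesses"]
prepared = prepare(tokens)
print(prepared)  # ['visualize', 'benefit', 'text', 'analytics', 'business']
```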

Stage 3—Text analysis

Text analysis is the core part of the process, in which text analysis software processes the text by using different methods. 

Text classification

Classification is the process of assigning tags to the text data by using rule-based or machine learning-based systems.

Text extraction

Extraction involves identifying the presence of specific keywords in the text and associating them with tags. The software uses methods such as regular expressions and conditional random fields (CRFs) to do this.

Stage 4—Visualization

Visualization is about turning the text analysis results into an easily understandable format. You will find text analytics results in graphs, charts, and tables. The visualized results help you identify patterns and trends and build action plans. For example, suppose you’re getting a spike in product returns, but you have trouble finding the causes. With visualization, you look for words such as defects, wrong size, or not a good fit in the feedback and tabulate them into a chart. Then you’ll know which is the major issue that takes top priority.
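The product returns example can be sketched as a count followed by a plain-text bar chart. The feedback comments, the phrase list, and the function names `count_issues` and `render_bar_chart` are assumptions for illustration; a real dashboard would use a charting tool instead of text output.

```python
from collections import Counter

# Issue phrases from the example above ("defect" also matches "defects").
ISSUE_PHRASES = ["defect", "wrong size", "not a good fit"]


def count_issues(feedback):
    """Count how many comments mention each issue phrase."""
    counts = Counter()
    for comment in feedback:
        lowered = comment.lower()
        for phrase in ISSUE_PHRASES:
            if phrase in lowered:
                counts[phrase] += 1
    return counts


def render_bar_chart(counts):
    """Render the counts as a text bar chart, most frequent issue first."""
    lines = []
    for phrase, count in counts.most_common():
        lines.append(f"{phrase:<15} {'#' * count} {count}")
    return "\n".join(lines)


# Hypothetical return feedback.
feedback = [
    "Returned it because of a defect in the zipper",
    "Wrong size, had to send it back",
    "Another defect, the seam split",
    "Not a good fit for me",
    "Arrived with defects",
]
counts = count_issues(feedback)
chart = render_bar_chart(counts)
print(chart)
```

With this sample data, the defect row has the longest bar, which marks it as the issue to prioritize.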

What is text analytics?

Text analytics is the quantitative data that you can obtain by analyzing patterns in multiple samples of text. It is presented in charts, tables, or graphs. 

Text analysis vs. text analytics

Text analytics helps you determine if there’s a particular trend or pattern from the results of analyzing thousands of pieces of feedback. Meanwhile, you can use text analysis to determine whether a customer’s feedback is positive or negative.

What is text mining?

Text mining is the process of obtaining qualitative insights by analyzing unstructured text. 

Text analysis vs. text mining

There is no difference between text analysis and text mining. Both terms refer to the same process of gaining valuable insights from sources such as email, survey responses, and social media feeds.

How can Amazon Comprehend help?

Amazon Comprehend is a natural language processing service that uses machine learning to uncover valuable insights and connections in text. You can use it to simplify document processing workflows by automatically classifying and extracting information from them. For example, you can use Amazon Comprehend to do the following tasks:

  • Perform sentiment analysis on customer support tickets, product reviews, social media feeds, and more. 
  • Integrate Amazon Comprehend with Amazon Lex to develop an intelligent, conversational chatbot.
  • Extract medical terms from documents and identify the relationship between them with Amazon Comprehend Medical.
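As a hedged sketch of the first task, the helper below calls the Amazon Comprehend `detect_sentiment` operation through a boto3 client that the caller passes in. The function name `detect_ticket_sentiment` is an assumption, and the commented usage presumes that boto3 is installed and that AWS credentials and a Region are already configured.

```python
def detect_ticket_sentiment(comprehend_client, text):
    """Return the overall sentiment label that Amazon Comprehend assigns to text.

    comprehend_client is expected to be a boto3 Comprehend client. The label
    is one of POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
    """
    response = comprehend_client.detect_sentiment(Text=text, LanguageCode="en")
    return response["Sentiment"]


# Example usage (requires boto3 and configured AWS credentials):
#
#   import boto3
#   client = boto3.client("comprehend")
#   print(detect_ticket_sentiment(client, "My order arrived late and damaged."))
```

Passing the client in as a parameter keeps the helper easy to test with a stand-in object and leaves Region and credential choices to the caller.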

Get started by creating an AWS account today.