AI-Powered Data Classification: Unleash Valuable Insights Within Snowflake

Snowflake Cortex’s Private LLM models offer a powerful solution for data classification tasks. These models are adept at analyzing and organizing data, providing valuable insights that can drive strategic decision-making.

One of the interesting use cases we’ve found is to leverage AI to do data classification tasks without having to engage your Data Science Engineers or building complex classifier models. You just ask the AI specifically what you want analyzed and how to think about it — that’s really all there is to it.

What I want to highlight today is how to ask the classification questions, and how to parse the answer in a way that is usable.

The Situation

Let’s say you have a table named blog_​post” with a column named body” which contains the body of blog posts that you have pulled from the internet.

I think a good logical next step would be to derive some classification data out of the post. For example, we might want to know:

  • Industry — What industries does this post seem to be related to?
  • Topic — What is the classification of topic such as self-help, data warehousing, etc.?
  • Target Audience — Is this for business professionals or casual video game players?

This type of analysis is very easy to solve using any modern AI LLM (Large Language Model). We can use OpenAI but there are issues with OpenAI or most other LLM providers:

  1. Most critically: Your data has to leave the protected Snowflake data lake — i.e., you are effectively sending all of your data to a 3rd party all the risks that come with this.
  2. They tend to be fairly slow, especially when you are dealing with thousands or millions of records.
  3. They can cost a lot because the latest models like GPT4 are rather expensive at the moment.

However, Snowflake has a new feature, currently in preview, called Snowflake Cortex LLM Functions. These allow you to easily use an opensource LLM directly withing Snowflake. This solves all 3 of the above issues — the data never leaves Snowflake, they run much faster than external API calls to OpenAI, and the cost is very reasonable, depending on the model you select to use.

Enough preamble, let’s get into some code.

Let’s assume you have this simple table:

CREATE OR REPLACE TABLE BLOG_POST (
   id int,
   url varchar,
   body varchar
);

Populated with the text of this blog post. To keep the post simple I’ve put this insert statement in a gist on github.

Here is a basic example of a prompt you can write to get Industry, Topic and Target Audience.

  ### General Instructions:
   - You are a classification AI that will help to classify or extract information from blog posts.
   - Output only in valid JSON. For example:
  
  {
    "industry": "Tech",
    "topic": "Opinion & Editorial",
    "audience": "Professionals"
  }
  
   - Do not explain yourself or your reasoning.
  
  ### Classification Instructions:
  **Industry Classification:**
  
  Please classify the blog post\\' industry by selecting the most appropriate category from the list below. Each category includes a brief description to guide your selection.
  
  - **Tech**: Focuses on high tech, software, internet, and future technology innovations. Excludes manufacturing and industrial technology.
  - **Healthcare**: Pertains to medical advances, health advice, wellness, fitness, and the healthcare industry.
  - **Finance**: Covers banking, investment, personal finance, market trends, and financial advice.
  - **Education**: Related to educational techniques, institutions, e-learning, pedagogical research, and academic advancements.
  - **Entertainment**: Includes movies, music, video games, celebrity news, and the broader entertainment industry.
  - **Food & Beverage**: Encompasses cooking, restaurant reviews, food trends, beverages, and the culinary arts.
  - **Travel**: Involves travel guides, tips, destination reviews, and discussions about cultures around the world.
  - **Fashion**: Deals with fashion trends, clothing, personal style, the fashion industry, and beauty tips.
  - **Automotive**: Relates to cars, motorcycles, industry news, vehicle technology, and automotive lifestyle.
  - **Real Estate**: Covers housing market trends, property investment, home improvement, and architecture.
  
  **Topic Classification:**
  
  Please identify the main topic of the blog post from the following categories. Each category comes with a definition to assist in classification.
  
  - **Innovation & Trends**: New ideas, trends, and innovations within the industry.
  - **Guides & How-To**: Educational content providing step-by-step instructions or guidance.
  - **News & Analysis**: Current events, news, and analytical commentary on industry-specific developments.
  - **Opinion & Editorial**: Personal opinions or editorial pieces reflecting individual viewpoints.
  - **Interviews & Stories**: Conversations with industry figures or storytelling related to personal experiences.
  - **Reviews & Comparisons**: Evaluations or comparative analysis of products, services, or experiences.
  - **Case Studies & Research**: Detailed analysis of specific instances, research findings, or in-depth exploration of subjects.
  
  **Target Audience Classification:**
  
  Select the target audience for the blog post from the list below. Descriptions are provided to ensure accurate classification.
  
  - **General Public**: Content is intended for a broad audience, with no specific knowledge required.
  - **Professionals**: Tailored for individuals with expertise or a professional background in the field.
  - **Enthusiasts & Hobbyists**: Aimed at those with a keen interest or hobby related to the topic.
  - **Students & Educators**: Designed for students or educators seeking informational or educational content.
  - **Industry Insiders**: For individuals with deep knowledge or involvement in the specific industry.
  - **Investors & Entrepreneurs**: Content geared towards financial or business stakeholders in the industry.
  
  ---
  
  ### Blog Post:

This is called prompt engineering and is how you instruct the AI on how it should think and what you expect as a result. Notice how we’re also instructing the AI to return only JSON data. This is critical as we can then programmatically parse the results and use this data to update our table with the resulting classification data.

Here is the complete statement for updating our blog_​post table with new columns for industry, topic and audience and populated using our above prompt for the AI:

-- adding new classification columns to our blog_post table:
ALTER TABLE BLOG_POST ADD COLUMN industry VARCHAR;
ALTER TABLE BLOG_POST ADD COLUMN topic VARCHAR;
ALTER TABLE BLOG_POST ADD COLUMN audience VARCHAR;

-- Now doing the AI logic and updating the data in the table all at once:
UPDATE BLOG_POST 
SET industry = PARSED_ANALYSIS.industry,
    topic    = PARSED_ANALYSIS.topic,
    audience = PARSED_ANALYSIS.audience
FROM (
    WITH llm_result as (
        SELECT id,
        SNOWFLAKE.CORTEX.COMPLETE('mistral-7b', 
        '
        ### General Instructions:
         - You are a classification AI that will help to classify or extract information from blog posts.
         - Output only in valid JSON. For example:
        
        {
          "industry": "Tech",
          "topic": "Opinion & Editorial",
          "audience": "Professionals"
        }
        
         - Do not explain yourself or your reasoning.
        
        ### Classification Instructions:
        **Industry Classification:**
        
        Please classify the blog post\\' industry by selecting the most appropriate category from the list below. Each category includes a brief description to guide your selection.
        
        - **Tech**: Focuses on high tech, software, internet, and future technology innovations. Excludes manufacturing and industrial technology.
        - **Healthcare**: Pertains to medical advances, health advice, wellness, fitness, and the healthcare industry.
        - **Finance**: Covers banking, investment, personal finance, market trends, and financial advice.
        - **Education**: Related to educational techniques, institutions, e-learning, pedagogical research, and academic advancements.
        - **Entertainment**: Includes movies, music, video games, celebrity news, and the broader entertainment industry.
        - **Food & Beverage**: Encompasses cooking, restaurant reviews, food trends, beverages, and the culinary arts.
        - **Travel**: Involves travel guides, tips, destination reviews, and discussions about cultures around the world.
        - **Fashion**: Deals with fashion trends, clothing, personal style, the fashion industry, and beauty tips.
        - **Automotive**: Relates to cars, motorcycles, industry news, vehicle technology, and automotive lifestyle.
        - **Real Estate**: Covers housing market trends, property investment, home improvement, and architecture.
        
        **Topic Classification:**
        
        Please identify the main topic of the blog post from the following categories. Each category comes with a definition to assist in classification.
        
        - **Innovation & Trends**: New ideas, trends, and innovations within the industry.
        - **Guides & How-To**: Educational content providing step-by-step instructions or guidance.
        - **News & Analysis**: Current events, news, and analytical commentary on industry-specific developments.
        - **Opinion & Editorial**: Personal opinions or editorial pieces reflecting individual viewpoints.
        - **Interviews & Stories**: Conversations with industry figures or storytelling related to personal experiences.
        - **Reviews & Comparisons**: Evaluations or comparative analysis of products, services, or experiences.
        - **Case Studies & Research**: Detailed analysis of specific instances, research findings, or in-depth exploration of subjects.
        
        **Target Audience Classification:**
        
        Select the target audience for the blog post from the list below. Descriptions are provided to ensure accurate classification.
        
        - **General Public**: Content is intended for a broad audience, with no specific knowledge required.
        - **Professionals**: Tailored for individuals with expertise or a professional background in the field.
        - **Enthusiasts & Hobbyists**: Aimed at those with a keen interest or hobby related to the topic.
        - **Students & Educators**: Designed for students or educators seeking informational or educational content.
        - **Industry Insiders**: For individuals with deep knowledge or involvement in the specific industry.
        - **Investors & Entrepreneurs**: Content geared towards financial or business stakeholders in the industry.
        
        ---
        
        ### Blog Post:
        ' || body
        ) as analysis_result
        FROM BLOG_POST
    )
    SELECT id,
       JSON_EXTRACT_PATH_TEXT(parse_custom_json(analysis_result), 'industry') as industry,
       JSON_EXTRACT_PATH_TEXT(parse_custom_json(analysis_result), 'topic') as topic,
       JSON_EXTRACT_PATH_TEXT(parse_custom_json(analysis_result), 'audience') as audience
    FROM LLM_RESULT
) AS PARSED_ANALYSIS
WHERE BLOG_POST.ID = PARSED_ANALYSIS.ID
 AND (BLOG_POST.industry is NULL AND BLOG_POST.TOPIC IS NULL AND BLOG_POST.AUDIENCE IS NULL)
;

Now if we query our BLOG_POST table we get:

SELECT id, industry, topic, audience FROM BLOG_POST;
ID INDUSTRY TOPIC AUDIENCE
1 Tech Technology & Innovation Professionals

Conclusion

In conclusion, Snowflake Cortex’s LLM (Large Language Model) functions offer a potent solution for data classification tasks. By leveraging these built-in models, you can bypass the need for external services, ensuring your data remains secure within Snowflake while also benefiting from faster performance and cost-effectiveness. With the provided code examples, you can easily implement this AI-driven approach for data classification in your own projects. It’s yet another demonstration of how Snowflake continues to innovate, helping you to unlock valuable insights from your data and drive your strategic decision-making process.

P.S.

One thing you may notice in the main query in this blog post is a function called parse_​custom_​json” — Sometimes the LLM models hosted within Cortex will return slightly invalid JSON. This parse_​custom_​json” handles this format, if needed. Here is the syntax for parse_custom_json:

CREATE OR REPLACE FUNCTION parse_custom_json(input_string VARCHAR)
RETURNS VARIANT
LANGUAGE SQL
AS
$$
    -- Strip starting ```json and trailing ``` if present, then parse as JSON
    WITH cleaned_input AS (
        SELECT
            IFF(
                LEFT(input_string, 7) = '```json' AND RIGHT(input_string, 3) = '```',
                -- Strip both markers
                SUBSTRING(input_string, 8, LENGTH(input_string) - 10),
                IFF(
                    LEFT(input_string, 7) = '```json',
                    -- Strip only the starting marker
                    SUBSTRING(input_string, 8),
                    IFF(
                        RIGHT(input_string, 3) = '```',
                        -- Strip only the ending marker
                        LEFT(input_string, LENGTH(input_string) - 3),
                        -- No markers to strip
                        input_string
                    )
                )
            ) AS cleaned_string
    )
    SELECT cleaned_string::variant FROM cleaned_input
$$;