blackmaria - Enhance Web Scraping Capabilities Through Natural Language Processing in Python

Project Introduction: Black Maria

Getting Started with Black Maria

Prerequisites

Black Maria requires Python 3.6 or later. The current Python release can be downloaded from the official Python website.

Installation

Once Python is installed, set up the Black Maria library. First, export your key as the OPEN_AI_KEY environment variable; this step is required for some of the library's functionality. Then install Black Maria with the Python package manager: pip install blackmaria.
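The two requirements above (Python 3.6 or later, and OPEN_AI_KEY present in the environment) can be checked before the library is imported. The following is a minimal sketch and not part of Black Maria itself; the function name preflight and its parameters are hypothetical, introduced only for illustration.

```python
import os
import sys

MIN_PYTHON = (3, 6)       # minimum version stated in the prerequisites
KEY_NAME = "OPEN_AI_KEY"  # environment variable named in the installation step


def preflight(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready.

    `env` and `version` are hypothetical parameters that default to the real
    environment and interpreter version, so the check can be tested in isolation.
    """
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            "Python %d.%d or newer is required, found %d.%d" % (MIN_PYTHON + version)
        )
    if not env.get(KEY_NAME):
        problems.append("%s is not set in the environment" % KEY_NAME)
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Running the script prints one line per unmet requirement and nothing when both are satisfied.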

What is Black Maria?

Black Maria is a Python library for web scraping that lets users extract data from a webpage by describing what they want in natural language. Web scraping is the automated retrieval of information from web pages, and Black Maria simplifies the task by accepting plain-language queries.

How to Use Black Maria?

Black Maria relies on "guardrails": a set of prescribed instructions that tell the language model how the output should be structured and formatted. The following code snippet illustrates the process:

from blackmaria import maria

# Page to scrape
url = "https://yellowjackets.fandom.com/wiki/F_Sharp"

# Guardrails (rail) specification describing the expected output structure
spec = ("""
    <rail version="0.1">

    <output>
        <object name="movie" format="length: 2">
            <string
                name="summary"
                description="the summary section of the movie"
                format="length: 200 240"
                on-fail-length="noop"
            />
            <object name="cast" description="The cast in the movie" format="length: 3">
                <list name="starring">
                    <string
                        format="two-words"
                        on-fail-two-words="reask"
                        description="The starring section for the movie and roles"
                    />
                </list>
                <list name="guest_starring">
                    <string
                        format="two-words"
                        on-fail-two-words="reask"
                        description="The Guest starring section and roles"
                    />
                </list>
                <list name="co-starring">
                    <string
                        format="two-words"
                        on-fail-two-words="reask"
                        description="the starring section"
                    />
                </list>
            </object>
        </object>
    </output>


    <prompt>

    Query string here.

    @xml_prefix_prompt

    {output_schema}

    @json_suffix_prompt_v2_wo_none
    </prompt>
    </rail>
    """)
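# Natural-language query naming the fields to extract
query = "provide details about the movie,summary,cast,cast.starring,cast.guest_starring,cast.co-starring"

# Scrape the page and print the structured response
query_response = maria.night_crawler(url=url, spec=spec, query=query)
print(query_response)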

Example Output

The output from the example code snippet might look like this:

{
  "movie": {
    "summary": "As the teens get their bearings among the wreckage, Misty finds hell on earth quite becoming. In the present: revenge, sex homework and the policeman formerly known as Goth.",
    "cast": {
      "starring": [
        "Lottie Matthews",
        "Vanessa Palmer",
        "Misty Quigley",
        "Shauna Sadecki",
        "Natalie Scatorccio",
        "Taissa Turner"
      ],
      "guest_starring": [
        "Akilah",
        "Laura Lee",
        "Mari",
        "Adam Martin",
        "Javi Martinez",
        "Travis Martinez",
        "Jessica Roberts",
        "Jeff Sadecki",
        "Ben Scott",
        "Jackie Taylor"
      ],
      "co-starring": ["Kevyn Tan", "Simone"]
    }
  }
}

This output shows Black Maria extracting structured details from a web page in response to a natural-language query. By combining a guardrails specification with a query that names the desired fields, users can retrieve data in a predictable format, which makes the library useful to developers and data enthusiasts alike.
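Because the specification asks for "two-words" entries, it can be useful to inspect a response for entries that do not match. The sketch below is a hedged illustration, not part of Black Maria. It assumes the response is available as a Python dict shaped like the example output (a JSON string would first need json.loads), and it interprets "two-words" as exactly two whitespace-separated tokens, which may differ from how the underlying validator works. The helper name not_two_words is hypothetical, and the sample data is a trimmed copy of the example output.

```python
import json

# Trimmed copy of the example output, used here as sample data.
sample = json.loads("""
{
  "movie": {
    "cast": {
      "starring": ["Lottie Matthews", "Misty Quigley"],
      "guest_starring": ["Akilah", "Laura Lee", "Mari"],
      "co-starring": ["Kevyn Tan", "Simone"]
    }
  }
}
""")


def not_two_words(response):
    """Map each cast list to the entries that are not exactly two words.

    Assumption: "two words" means two whitespace-separated tokens.
    """
    cast = response.get("movie", {}).get("cast", {})
    return {
        role: [name for name in names if len(name.split()) != 2]
        for role, names in cast.items()
    }


flagged = not_two_words(sample)
print(flagged)
```

With the sample data, the single-name entries ("Akilah", "Mari", "Simone") are flagged, which is a reminder that an on-fail action such as "reask" does not by itself guarantee that every returned entry meets the requested format.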