Text Analysis Pipeline

This project is a text analysis pipeline written in Python. It extracts content from a list of URLs, runs sentiment and readability analyses on the extracted text, and writes the resulting metrics to an Excel file.

Key Features

Web scraping from multiple URLs
Sentiment analysis
Readability metrics calculation
Error handling for 404 pages

Components

1. Extraction Module (`extraction.py`)

Reads URLs from an Excel file
Extracts text content from web pages
Handles 404 errors and generates an error log
Saves extracted text to individual files
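
The extraction steps above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `extraction.py`: it uses only the standard library (`urllib` and `html.parser`) in place of BeautifulSoup so that it runs without extra dependencies, and it takes the URLs as a plain `{id: url}` mapping rather than reading them from the Excel file. The names `html_to_text`, `fetch_html`, `extract_all`, and `error_log.txt` are assumptions made for this example.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.error import HTTPError
from urllib.request import urlopen


class _TextCollector(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


def fetch_html(url):
    """Download a page; urlopen raises HTTPError for 4xx/5xx responses."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_all(urls, out_dir, fetch=fetch_html):
    """Save each page's text to <id>.txt and log the URLs that returned 404.

    `urls` is a hypothetical {id: url} mapping; in the project it would be
    built from the Excel file. Returns the list of (id, url) pairs not found.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    not_found = []
    for url_id, url in urls.items():
        try:
            html = fetch(url)
        except HTTPError as err:
            if err.code == 404:
                not_found.append((url_id, url))
                continue
            raise  # other HTTP errors are not silently swallowed
        text = html_to_text(html)
        (out_dir / f"{url_id}.txt").write_text(text, encoding="utf-8")
    # Write the error log (file name is an assumption for this sketch).
    log_lines = [f"{url_id}\t{url}" for url_id, url in not_found]
    (out_dir / "error_log.txt").write_text("\n".join(log_lines), encoding="utf-8")
    return not_found
```

Passing the downloader in as the `fetch` argument keeps the 404 handling testable without network access; a BeautifulSoup-based version would only need to replace `html_to_text`.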

2. Analysis Module (`analysis.py`)

Processes the extracted text files
Performs sentiment analysis using TextBlob
Calculates the following sentiment and readability metrics:
- Positive/Negative scores
- Polarity and Subjectivity scores
- FOG index
- Complex word count
- Syllable count
- Personal pronouns count
Generates an Excel file with comprehensive metrics
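
As an illustration of the readability side, the sketch below computes the FOG index, complex word count, syllable count, and personal pronoun count for a piece of text. The exact formulas in `analysis.py` are not shown here, so everything in this block is an assumption: it uses the standard Gunning fog definition, 0.4 × (average sentence length + percentage of complex words), treats words with more than two syllables as complex, estimates syllables by counting vowel groups, and uses a hypothetical pronoun list. The TextBlob polarity and subjectivity scores are left out to keep the example dependency-free.

```python
import re

VOWEL_GROUPS = re.compile(r"[aeiouy]+")
# Hypothetical pronoun list; case-insensitive matching also counts "US",
# which a real implementation may want to exclude.
PRONOUNS = re.compile(r"\b(?:I|we|my|ours|us)\b", re.IGNORECASE)


def count_syllables(word):
    """Estimate syllables as the number of vowel groups (a rough heuristic)."""
    word = word.lower()
    count = len(VOWEL_GROUPS.findall(word))
    # Treat a trailing 'e' as silent ("made"), but keep "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return a dict of readability metrics for `text`."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"fog_index": 0.0, "complex_words": 0,
                "syllables": 0, "personal_pronouns": 0}

    syllables = [count_syllables(w) for w in words]
    complex_words = sum(1 for s in syllables if s > 2)
    avg_sentence_length = len(words) / len(sentences)
    pct_complex = 100 * complex_words / len(words)

    return {
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
        "complex_words": complex_words,
        "syllables": sum(syllables),
        "personal_pronouns": len(PRONOUNS.findall(text)),
    }
```

Calling `readability()` on each extracted text file yields one dict per document, which maps naturally onto one row of the output spreadsheet.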

Tech Stack

Python
Pandas for data manipulation
BeautifulSoup for web scraping
TextBlob for sentiment analysis
NLTK for natural language processing tasks

This project demonstrates my proficiency in web scraping, text processing, and data analysis.
