Lecture 8 – Web Mining and Social Network Mining

Lecture 8 covers Web Mining and Social Network Mining: content, structure, and usage mining; link analysis with PageRank and HITS; graph theory basics; community detection; and sentiment analysis, with worked examples and Python code.

The modern web contains billions of interconnected documents, pages, videos, and social posts. Every search query, click, comment, reaction, and share generates data. Organizations use this data to understand user behavior, improve search engines, detect spam, analyze trends, personalize recommendations, and strengthen cybersecurity.

This lecture explores how Data Mining techniques are applied specifically to the web and social networks.

Introduction to Web & Social Network Mining

What Is Web Mining?

Web Mining refers to extracting useful information from the web using:

  • Data Mining
  • Machine Learning
  • Natural Language Processing
  • Graph analysis

It covers content, structure, and user behavior on the web.

Why Web Mining Is Important

Web Mining helps:

  • Improve search engine results
  • Detect fake reviews
  • Recommend products
  • Understand user behavior
  • Track trends
  • Analyze website performance

What Is Social Network Mining?

Social Network Mining analyzes:

  • Relationships
  • User interactions
  • Network structure
  • Influence flow
  • Community formation

Platforms commonly studied include:

  • Facebook
  • Twitter/X
  • Instagram
  • TikTok
  • LinkedIn

Types of Web Mining

Web Mining is divided into three main categories.

1. Web Content Mining

Extracting useful content from:

  • HTML pages
  • Blogs
  • Articles
  • Images
  • Audio
  • Video

2. Web Structure Mining

Analyzing link structures:

  • Hyperlinks
  • Inbound & outbound links
  • Site architecture
  • Graph representation

3. Web Usage Mining

Understanding how users interact with websites:

  • Clickstream
  • Session logs
  • Browsing patterns

Web Content Mining

Web content mining focuses on extracting meaningful data from web pages.

Text Extraction & Web Scraping

Web scraping tools:

  • BeautifulSoup
  • Scrapy
  • Selenium

Example HTML extraction:

<h1>Data Mining Course</h1>
<p>Learn classification, clustering, and more.</p>

Python example:

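import requests
from bs4 import BeautifulSoup

# Fetch the page; fail fast on network stalls and HTTP errors.
page = requests.get("https://example.com", timeout=10)
page.raise_for_status()

# Parse the HTML and print the first <h1>, if the page has one.
soup = BeautifulSoup(page.text, "html.parser")
heading = soup.find("h1")
print(heading.get_text(strip=True) if heading else "No <h1> found")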

NLP in Web Content Mining

Natural Language Processing techniques:

  • Tokenization
  • Named-Entity Recognition (NER)
  • Topic Modeling
  • Sentiment Analysis

Applications:

  • Summarizing news articles
  • Extracting keywords
  • Detecting spam comments

Multimedia Mining

Extracting features from:

  • Images
  • Videos
  • Audio

Techniques used:

  • CNNs for image classification
  • OCR for image text extraction
  • Video content summarization

Example:
A product image is analyzed to classify color, size, or type.

Web Structure Mining

Web structure mining models the web as a directed graph.

Nodes = web pages
Edges = hyperlinks

Pages that receive many links, especially from pages that are themselves well linked, are treated as more important.

PageRank Algorithm

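Introduced by Google's founders to rank web pages.

Formula (simplified, ignoring the damping factor):
PR(A) = PR(B)/links(B) + PR(C)/links(C) + ...

Here B, C, ... are the pages that link to A, and links(X) is the number of outbound links on page X.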

High PageRank = important page.
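To make the iteration concrete, here is a minimal power-iteration sketch in plain Python. It adds the usual damping factor (0.85 is a common default, assumed here) on top of the simplified formula, and it assumes every linked page also appears as a key in the dictionary. The four-page graph is a made-up example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page (no outbound links): spread its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: D links to C, but nothing links to D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this graph C collects links from three pages and ends up ranked highest, while D, which has no inbound links, keeps only the random-jump share.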

HITS Algorithm

HITS (Hyperlink-Induced Topic Search) gives every page two scores:

  • Authority score (high for valuable pages that many good hubs link to)
  • Hub score (high for pages that link to many good authorities)

Example:

  • A university homepage → hub
  • Research papers → authorities
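The mutual reinforcement between hubs and authorities can be sketched as a short iteration. This is an illustrative version under simple assumptions: a fixed number of iterations, L2 normalization after each round, and a made-up graph in which a homepage and a blog link to three papers.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a graph {page: [linked pages]}."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for page, outgoing in links.items():
            for target in outgoing:
                auth[target] += hub[page]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: the homepage links to every paper, the blog to one.
graph = {
    "homepage": ["paper1", "paper2", "paper3"],
    "blog": ["paper1"],
    "paper1": [],
    "paper2": [],
    "paper3": [],
}
hub, auth = hits(graph)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

The homepage comes out as the strongest hub because it points to all three papers, and paper1 as the strongest authority because both linking pages point to it.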

Web Usage Mining

Web usage mining discovers patterns in user interactions.

Log File Sources

  • Web server logs
  • Browser logs
  • Cookies
  • Session history

Example log:

IP: 192.168.0.1  
Page Visited: /products  
Time: 10:23 AM

Clickstream Analysis

Sequence of pages visited by a user.

Example:

Homepage → Category → Product → Checkout

Useful for:

  • Improving website design
  • Personalized recommendations
  • Identifying drop-off pages
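Given sessions that are already reconstructed from logs, the two most basic clickstream statistics are page-to-page transition counts and drop-off pages. The sketch below assumes each session is simply a list of page names and that reaching "Checkout" is the goal. The sample sessions are invented for illustration.

```python
from collections import Counter

# Hypothetical sessions, each an ordered list of pages visited by one user.
sessions = [
    ["Homepage", "Category", "Product", "Checkout"],
    ["Homepage", "Category", "Product"],
    ["Homepage", "Category"],
    ["Homepage", "Category", "Product"],
]

def transition_counts(sessions):
    """Count how often users move directly from one page to the next."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts

def drop_off_pages(sessions, goal="Checkout"):
    """Count the last page of every session that never reached the goal."""
    return Counter(pages[-1] for pages in sessions if pages and goal not in pages)

print(transition_counts(sessions).most_common())
print(drop_off_pages(sessions).most_common())
```

Here most abandoned sessions end on the Product page, which is the kind of signal used to decide where a site redesign should start.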

User Behavior Tracking

Heatmaps, scroll depth, and session duration show where users engage or lose interest, and these signals are used to improve UX.

Introduction to Social Network Mining

Social networks are represented as graphs.

Graph Theory Basics

  • Node: A user
  • Edge: Relationship between two users

Example:

Alice — Bob  
Bob — Charlie  
Charlie — Alice

Degree

Number of connections a node has.
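As a small sketch, the three-person example above can be stored as an adjacency structure and the degree read off directly. The helper name build_adjacency is an assumption for illustration, and the edges are treated as undirected.

```python
from collections import defaultdict

# The undirected friendships from the example.
edges = [("Alice", "Bob"), ("Bob", "Charlie"), ("Charlie", "Alice")]

def build_adjacency(edges):
    """Map every node to the set of its neighbors (undirected graph)."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency

adjacency = build_adjacency(edges)
# Degree of a node = number of neighbors it has.
degree = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degree)  # {'Alice': 2, 'Bob': 2, 'Charlie': 2}
```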

Centrality

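Importance of a node:

  • Degree Centrality: how many direct connections a node has
  • Betweenness Centrality: how often a node lies on the shortest paths between other nodes
  • Closeness Centrality: how short the paths are from a node to all other nodes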
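Two of these measures are easy to compute by hand on a small graph. The sketch below assumes an undirected, connected graph stored as a dict of neighbor sets. Dana and Eve are hypothetical users added to make the scores differ. Betweenness is left out because it needs a longer shortest-path algorithm.

```python
from collections import deque

# Hypothetical undirected, connected friendship graph.
graph = {
    "Alice": {"Bob"},
    "Bob": {"Alice", "Charlie", "Dana"},
    "Charlie": {"Bob"},
    "Dana": {"Bob", "Eve"},
    "Eve": {"Dana"},
}

def degree_centrality(graph):
    """Degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def closeness_centrality(graph):
    """(Reachable nodes - 1) divided by the sum of distances to them."""
    result = {}
    for node in graph:
        dist = bfs_distances(graph, node)
        total = sum(dist.values())
        result[node] = (len(dist) - 1) / total if total else 0.0
    return result

print(degree_centrality(graph))
print(closeness_centrality(graph))
```

Bob scores highest on both measures: he has three of the four possible connections (0.75) and reaches everyone in at most two hops (closeness 0.8).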

Social Network Analysis Techniques

Community Detection

Finding groups of users who are more densely connected to each other than to the rest of the network. Such groups often share interests.

Algorithms:

  • Louvain Method
  • Girvan-Newman Algorithm

Used in:

  • Marketing segmentation
  • Political campaign analysis
  • Recommendation systems
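Both algorithms are tied to modularity, a score that compares the edges inside communities with what a random graph of the same degrees would give. Louvain greedily maximizes it, and Girvan-Newman results are often cut at the level with the highest modularity. The sketch below only computes modularity for a given partition. It is not an implementation of either algorithm, and the two-triangle graph is a made-up example.

```python
from collections import Counter

def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c.
    """
    m = len(edges)
    degree_sum = Counter()
    internal_edges = Counter()
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in "abcdef"}

print(round(modularity(edges, two_groups), 3))  # 0.357
print(round(modularity(edges, one_group), 3))   # 0.0
```

Splitting the graph at the bridge scores well above putting everyone in one community, which matches the intuition that the two triangles are separate groups.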

Influence Propagation

How ideas spread in networks.

Examples:

  • Viral tweets
  • Trends
  • News spread

Mathematical models:

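  • Independent Cascade: each newly activated node gets one chance to activate each inactive neighbor, with some probability
  • Linear Threshold: a node activates once the combined weight of its active neighbors reaches its threshold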
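A minimal simulation of the Independent Cascade model might look like the following. It assumes a single activation probability shared by every edge, whereas real studies usually assign one per edge. The follower graph and the seed node are invented for illustration.

```python
import random

def independent_cascade(graph, seeds, probability=0.5, rng=None):
    """Simulate one cascade and return the set of activated nodes.

    graph maps each node to the nodes it can influence. Every newly
    activated node gets exactly one chance to activate each neighbor.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in active and rng.random() < probability:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active

# Hypothetical influence graph: an edge u -> v means u can influence v.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": [], "E": ["A"]}

# Average the spread over many runs, since a single run is random.
rng = random.Random(42)
runs = [len(independent_cascade(graph, ["A"], 0.5, rng)) for _ in range(1000)]
print("average spread from A:", sum(runs) / len(runs))
```

Repeating the simulation for different seed sets and comparing average spread is the basic idea behind influence maximization.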

Sentiment Analysis on Social Data

Sentiment analysis classifies the opinion or emotion expressed in a post, typically as positive, negative, or neutral.

Lexicon-Based Approach

Uses predefined dictionaries:

  • Positive words
  • Negative words

Example:

Happy, excited → positive
Angry, terrible → negative
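A lexicon-based scorer can be sketched in a few lines. The word lists below contain only the four example words, and the function name lexicon_sentiment is an assumption. Real lexicons hold thousands of scored entries.

```python
import re

# Toy lexicons built from the example words above.
POSITIVE = {"happy", "excited"}
NEGATIVE = {"angry", "terrible"}

def lexicon_sentiment(text):
    """Label text by counting positive and negative lexicon words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I am so happy and excited!"))  # positive
print(lexicon_sentiment("What a terrible day"))         # negative
print(lexicon_sentiment("The package arrived"))         # neutral
```

The weakness of this approach is that it ignores context: "not happy" still counts as positive here, because the scorer only looks at individual words.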

Machine Learning Approach

Algorithms:

  • SVM
  • Naive Bayes
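To show what such a classifier does, here is a small multinomial Naive Bayes written from scratch with add-one (Laplace) smoothing. It is a teaching sketch. In practice a library implementation would be used. The class name and the four training sentences are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Tiny hand-labeled training set (illustrative only).
texts = [
    "great product love it",
    "amazing quality very happy",
    "terrible service very angry",
    "awful product hate it",
]
labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("love the amazing quality"))   # positive
print(model.predict("angry about awful service"))  # negative
```

Unlike the lexicon approach, the word weights here are learned from labeled examples, so the quality of the predictions depends directly on the size and quality of the training data.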

Deep Learning Approach

Models:

  • LSTMs
  • Transformers (for example, BERT)

Used for:

  • Brand monitoring
  • Hate-speech detection
  • Customer feedback analysis

Case Studies of Web & Social Mining

1. E-Commerce Personalization

E-commerce platforms such as Amazon draw on signals such as:

  • Clickstream
  • Customer behavior
  • PageRank-style link analysis

To recommend:

  • Products
  • Bundles
  • Similar items

2. Security & Threat Detection

Social networks reveal:

  • Spam accounts
  • Fake news spread
  • Bot networks

Machine learning helps detect:

  • Fake profiles
  • Suspicious activity

3. Political & Social Trend Analysis

Twitter/X trends are used to study:

  • Public opinion
  • Election forecasting
  • Growth of social movements

Sentiment models analyze millions of tweets.

Python Examples

Web Scraping

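import requests
from bs4 import BeautifulSoup

# Download the page, with a timeout and an explicit HTTP status check.
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

# soup.title is None when the page has no <title> tag.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text(strip=True) if soup.title else "No <title> found")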

PageRank (simplified)

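import networkx as nx

# Directed graph: an edge (u, v) means page u links to page v.
G = nx.DiGraph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "A")])

# Returns a dict of scores that sum to 1 (default damping factor alpha=0.85).
pr = nx.pagerank(G)
print(pr)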

Sentiment Analysis

from textblob import TextBlob

text = "This product is amazing!"
# Polarity ranges from -1.0 (most negative) to 1.0 (most positive).
print(TextBlob(text).sentiment.polarity)

Common Challenges & Mistakes

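  • Scraping dynamic (JavaScript-rendered) websites with tools that only fetch static HTML
  • Ignoring robots.txt rules
  • Misinterpreting sentiment polarity (for example, missing negation or sarcasm)
  • Incorrect PageRank assumptions (for example, treating raw link counts as PageRank)
  • Biased sampling of social graphs that over- or under-represents some groups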
  • Incomplete log data

Summary

Lecture 8 covered Web Mining and Social Network Mining comprehensively. It explained web content, structure, and usage mining, link analysis algorithms, social graph analysis, sentiment analysis, and influence modeling. You also learned practical Python examples, real-world case studies, and important techniques for analyzing large-scale web and social data.

Next: Lecture 9 – Data Mining Trends and Research Frontiers

People also ask:

What is Web Mining?

Web Mining is extracting useful patterns from web content, structure, and user interactions.

What is Social Network Mining?

It analyzes relationships and structures in social media networks using graph theory.

What algorithms are used in Web Structure Mining?

PageRank and HITS.

Which techniques are used in sentiment analysis?

Lexicon-based, machine learning, and deep learning models.

What is clickstream analysis?

Tracking the sequence of pages visited by a user.
