Scrapes the front end of Facebook pages with no limitations and can turn the data into structured JSON or CSV.

shaikhsajid1111/facebook_page_scraper

Facebook Page Scraper

Maintenance PyPI license Python >=3.6.9

No API key is needed and there is no limit on the number of requests. Import the library and just do it!

Table of Contents

  1. Getting Started
  2. Usage
  • Tech
  • License
    Prerequisites:

    • Internet Connection
    • Python 3.7+
    • Chrome or Firefox browser installed on your machine

    Installation:

    Installing from source:

    git clone https://github.com/shaikhsajid1111/facebook_page_scraper
    

    Then, inside the project's directory:

    python3 setup.py install
    

    Installing from PyPI:

    pip3 install facebook-page-scraper
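
After installing, a quick check along the following lines can confirm that the interpreter meets the Python 3.7+ prerequisite and that the package is importable. This is only a sketch: `environment_ready` is a hypothetical helper, not part of the library, and it does not check for a Chrome or Firefox installation.

```python
import importlib.util
import sys


def environment_ready(min_version=(3, 7), package="facebook_page_scraper"):
    """Return (python_ok, package_ok) for the current interpreter.

    Hypothetical helper for illustration; not part of facebook_page_scraper.
    """
    # Compare the running interpreter against the minimum listed in Prerequisites.
    python_ok = sys.version_info >= min_version
    # find_spec returns None when a top-level module cannot be found.
    package_ok = importlib.util.find_spec(package) is not None
    return python_ok, package_ok


python_ok, package_ok = environment_ready()
print("Python version OK:", python_ok)
print("facebook_page_scraper importable:", package_ok)
```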
    


    How to use?

    import os

    # import the Facebook_scraper class from facebook_page_scraper
    from facebook_page_scraper import Facebook_scraper

    # instantiate the Facebook_scraper class

    page_or_group_name = "Meta"
    posts_count = 10
    browser = "firefox"
    proxy = "IP:PORT"  # if the proxy requires authentication, use user:password@IP:PORT
    timeout = 600  # 600 seconds
    headless = True
    # read the Facebook credentials from environment variables
    fb_password = os.getenv('fb_password')
    fb_email = os.getenv('fb_email')
    # indicates whether the Facebook target is a group (True) or a page (False)
    isGroup = False
    # for logged-in scraping, fb_email and fb_password can be passed as the
    # username and password parameters described in the table that follows
    meta_ai = Facebook_scraper(page_or_group_name, posts_count, browser, proxy=proxy, timeout=timeout, headless=headless, isGroup=isGroup)

    Parameters for the Facebook_scraper(page_or_group_name, posts_count, browser, proxy, timeout, headless, isGroup, username, password) class:

    Parameter Name     | Parameter Type | Description
    -------------------|----------------|------------
    page_or_group_name | String         | Name of the Facebook page or group
    posts_count        | Integer        | Number of posts to scrape. Default is 10
    browser            | String         | Which browser to use, either chrome or firefox. Default is chrome
    proxy (optional)   | String         | Proxy to use, as IP:PORT. If the proxy requires authentication, the format is user:password@IP:PORT
    timeout            | Integer        | Maximum amount of time, in seconds, that the bot should run for. Default is 600 (10 minutes)
    headless           | Boolean        | Whether to run the browser in headless mode. Default is True
    isGroup            | Boolean        | Whether the Facebook target is a group (True) or a page (False). Default is False
    username           | String         | Username to log into Facebook when scraping (storing it in .env is recommended)
    password           | String         | Password to log into Facebook when scraping (storing it in .env is recommended)
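
The proxy format and the environment-based credentials described above can be prepared with small helpers like the ones below. This is a minimal sketch: `build_proxy` and `load_credentials` are hypothetical names introduced here for illustration, and the environment variable names `fb_email` and `fb_password` are the ones used in the instantiation example.

```python
import os
from typing import Mapping, Optional, Tuple


def build_proxy(host: str, port: int, user: Optional[str] = None,
                password: Optional[str] = None) -> str:
    """Return 'IP:PORT', or 'user:password@IP:PORT' when credentials are given."""
    address = f"{host}:{port}"
    if user is not None and password is not None:
        return f"{user}:{password}@{address}"
    return address


def load_credentials(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, str]]:
    """Return (email, password) from the environment, or None if either is missing."""
    email = env.get("fb_email")
    password = env.get("fb_password")
    if not email or not password:
        return None
    return email, password


print(build_proxy("127.0.0.1", 8080))                     # 127.0.0.1:8080
print(build_proxy("127.0.0.1", 8080, "alice", "s3cret"))  # alice:s3cret@127.0.0.1:8080
print(load_credentials({}))                               # None
```

Returning None when a variable is missing makes it easy to fall back to scraping without logging in rather than passing empty credentials.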



    ⚠️ Warning: Use Logged-In Scraping at Your Own Risk ⚠️

    Using logged-in scraping methods may result in the permanent suspension of your account. Proceed with caution, as violating a platform's terms of service can lead to severe consequences. Exercise discretion and adhere to ethical practices when collecting data through scraping. The library/provider assumes no responsibility for any consequences resulting from the misuse of scraping methods.

    Done with instantiation? Let the scraping begin!


    To get the posts' data in JSON format:

    #call the scrap_to_json() method
    
    json_data = meta_ai.scrap_to_json()
    print(json_data)

    Output:

    {
      "2024182624425347": {
        "name": "Meta AI",
        "shares": 0,
        "reactions": {
          "likes": 154,
          "loves": 19,
          "wow": 0,
          "cares": 0,
          "sad": 0,
          "angry": 0,
          "haha": 0
        },
        "reaction_count": 173,
        "comments": 2,
        "content": "We’ve built data2vec, the first general high-performance self-supervised algorithm for speech, vision, and text. We applied it to different modalities and found it matches or outperforms the best self-supervised algorithms. We hope this brings us closer to a world where computers can learn to solve many different tasks without supervision. Learn more and get the code:  https://ai.facebook.com/…/the-first-high-performance-self-s…",
        "posted_on": "2022-01-20T22:43:35",
        "video": [],
        "image": [
          "https://scontent-bom1-2.xx.fbcdn.net/v/t39.30808-6/s480x480/272147088_2024182621092014_6532581039236849529_n.jpg?_nc_cat=100&ccb=1-5&_nc_sid=8024bb&_nc_ohc=j4_1PAndJTIAX82OLNq&_nc_ht=scontent-bom1-2.xx&oh=00_AT9us__TvC9eYBqRyQEwEtYSit9r2UKYg0gFoRK7Efrhyw&oe=61F17B71"
        ],
        "post_url": "https://www.facebook.com/MetaAI/photos/a.360372474139712/2024182624425347/?type=3&__xts__%5B0%5D=68.ARBoSaQ-pAC_ApucZNHZ6R-BI3YUSjH4sXsfdZRQ2zZFOwgWGhjt6dmg0VOcmGCLhSFyXpecOY9g1A94vrzU_T-GtYFagqDkJjHuhoyPW2vnkn7fvfzx-ql7fsBYxL5DgQVSsiC1cPoycdCvHmi6BV5Sc4fKADdgDhdFvVvr-ttzXG1ng2DbLzU-XfSes7SAnrPs-gxjODPKJ7AdqkqkSQJ4HrsLgxMgcLFdCsE6feWL7rXjptVWegMVMthhJNVqO0JHu986XBfKKqB60aBFvyAzTSEwJD6o72GtnyzQ-BcH7JxmLtb2_A&__tn__=-R"
      }, ...
    
    }

    Output structure for JSON format:
    {
        "id": {
            "name": string,
            "shares": integer,
            "reactions": {
                "likes": integer,
                "loves": integer,
                "wow": integer,
                "cares": integer,
                "sad": integer,
                "angry": integer,
                "haha": integer
            },
            "reaction_count": integer,
            "comments": integer,
            "content": string,
            "video" : list,
            "image" : list,
            "posted_on": datetime,  //string containing datetime in ISO 8601
            "post_url": string
        }
    }
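
Once the JSON is available, it can be loaded into ordinary Python objects. The sketch below assumes that `scrap_to_json()` returns JSON text with the structure shown above; it uses an abbreviated, hand-written sample instead of a live scrape, with the content and URL shortened to placeholders.

```python
import json
from datetime import datetime

# Abbreviated sample shaped like the output above. In practice this text
# would come from scrap_to_json() (assumed here to return a JSON string).
json_data = """
{
  "2024182624425347": {
    "name": "Meta AI",
    "shares": 0,
    "reactions": {"likes": 154, "loves": 19, "wow": 0, "cares": 0,
                  "sad": 0, "angry": 0, "haha": 0},
    "reaction_count": 173,
    "comments": 2,
    "content": "shortened placeholder content",
    "posted_on": "2022-01-20T22:43:35",
    "video": [],
    "image": [],
    "post_url": "https://www.facebook.com/MetaAI/"
  }
}
"""

posts = json.loads(json_data)

for post_id, post in posts.items():
    # posted_on is an ISO 8601 string, so it can be parsed into a datetime.
    posted_on = datetime.fromisoformat(post["posted_on"])
    print(post_id, post["name"], posted_on.date(), post["reaction_count"])
```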



    To save the posts' data directly to a CSV file:

    #call scrap_to_csv(filename,directory) method
    
    
    filename = "data_file"  # file name without the .csv extension, where the data will be saved
    directory = r"E:\data"  # directory where the CSV file will be saved (raw string keeps the backslash literal)
    meta_ai.scrap_to_csv(filename, directory)

    Content of data_file.csv:

    id,name,shares,likes,loves,wow,cares,sad,angry,haha,reactions_count,comments,content,posted_on,video,image,post_url
    2024182624425347,Meta AI,0,154,19,0,0,0,0,0,173,2,"We’ve built data2vec, the first general high-performance self-supervised algorithm for speech, vision, and text. We applied it to different modalities and found it matches or outperforms the best self-supervised algorithms. We hope this brings us closer to a world where computers can learn to solve many different tasks without supervision. Learn more and get the code:  https://ai.facebook.com/…/the-first-high-performance-self-s…",2022-01-20T22:43:35,,https://scontent-bom1-2.xx.fbcdn.net/v/t39.30808-6/s480x480/272147088_2024182621092014_6532581039236849529_n.jpg?_nc_cat=100&ccb=1-5&_nc_sid=8024bb&_nc_ohc=j4_1PAndJTIAX82OLNq&_nc_ht=scontent-bom1-2.xx&oh=00_AT9us__TvC9eYBqRyQEwEtYSit9r2UKYg0gFoRK7Efrhyw&oe=61F17B71,https://www.facebook.com/MetaAI/photos/a.360372474139712/2024182624425347/?type=3&__xts__%5B0%5D=68.ARAse4eiZmZQDOZumNZEDR0tQkE5B6g50K6S66JJPccb-KaWJWg6Yz4v19BQFSZRMd04MeBmV24VqvqMB3oyjAwMDJUtpmgkMiITtSP8HOgy8QEx_vFlq1j-UEImZkzeEgSAJYINndnR5aSQn0GUwL54L3x2BsxEqL1lElL7SnHfTVvIFUDyNfAqUWIsXrkI8X5KjoDchUj7aHRga1HB5EE0x60dZcHogUMb1sJDRmKCcx8xisRgk5XzdZKCQDDdEkUqN-Ch9_NYTMtxlchz1KfR0w9wRt8y9l7E7BNhfLrmm4qyxo-ZpA&__tn__=-R
    ...
    



    Parameters for the scrap_to_csv(filename, directory) method:

    Parameter Name | Parameter Type | Description
    ---------------|----------------|------------
    filename       | String         | Name of the CSV file, without the extension, where the posts' data will be saved
    directory      | String         | Directory where the CSV file will be stored
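
The saved file can be read back with the standard `csv` module. The sketch below is an illustration under assumptions: it parses an in-memory sample that uses the column header shown above (with shortened content and empty media columns) rather than a real scrape, and `load_posts` is a hypothetical helper name.

```python
import csv
import io

# In-memory stand-in for data_file.csv, using the header shown above.
sample_csv = '''id,name,shares,likes,loves,wow,cares,sad,angry,haha,reactions_count,comments,content,posted_on,video,image,post_url
2024182624425347,Meta AI,0,154,19,0,0,0,0,0,173,2,"Shortened content, with a comma",2022-01-20T22:43:35,,,
'''

# Columns that hold counts and should be converted from text to integers.
INT_COLUMNS = ["shares", "likes", "loves", "wow", "cares", "sad",
               "angry", "haha", "reactions_count", "comments"]


def load_posts(csv_file):
    """Read rows from an open CSV file object and convert count columns to int."""
    rows = []
    for row in csv.DictReader(csv_file):
        for column in INT_COLUMNS:
            row[column] = int(row[column])
        rows.append(row)
    return rows


posts = load_posts(io.StringIO(sample_csv))
print(posts[0]["name"], posts[0]["reactions_count"])
```

For a real file, the same function can be given a file opened with `open(path, newline="", encoding="utf-8")`, where `path` points at the `.csv` file inside the chosen directory.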



    Keys of the outputs:

    Key            | Type       | Description
    ---------------|------------|------------
    id             | String     | Post identifier (an integer cast to a string)
    name           | String     | Name of the page
    shares         | Integer    | Share count of the post
    reactions      | Dictionary | Dictionary with reaction names as keys and their counts as values. Keys => ["likes","loves","wow","cares","sad","angry","haha"]
    reaction_count | Integer    | Total reaction count of the post
    comments       | Integer    | Comment count of the post
    content        | String     | Content of the post as text
    video          | List       | List containing URLs of the videos present in the post
    image          | List       | List containing URLs of all images present in the post
    posted_on      | Datetime   | Time at which the post was published (in ISO 8601 format)
    post_url       | String     | URL of the post
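
In the sample output, `reaction_count` equals the sum of the individual reaction counts (154 + 19 = 173). Whether that always holds for scraped data is an assumption, so a small sanity check such as the sketch below may be useful when post-processing results. The helper names `reaction_total` and `is_consistent` are hypothetical and not part of the library.

```python
REACTION_KEYS = ["likes", "loves", "wow", "cares", "sad", "angry", "haha"]


def reaction_total(post):
    """Sum the per-reaction counts stored under the 'reactions' key."""
    return sum(post["reactions"][key] for key in REACTION_KEYS)


def is_consistent(post):
    """Check that 'reaction_count' matches the sum of the individual reactions."""
    return reaction_total(post) == post["reaction_count"]


# Values taken from the sample post shown earlier in this document.
sample_post = {
    "reactions": {"likes": 154, "loves": 19, "wow": 0, "cares": 0,
                  "sad": 0, "angry": 0, "haha": 0},
    "reaction_count": 173,
}

print(reaction_total(sample_post))  # 173
print(is_consistent(sample_post))   # True
```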


    Tech

    This project uses different libraries to work properly.



    If you encounter anything unusual, please feel free to open an issue on the repository.

    LICENSE

    MIT