How to Scrape LinkedIn for Public Company Data

At X-Byte Enterprise Crawling, we are glad you have visited our page on how to scrape LinkedIn for public company data, and we hope you won’t be disappointed!

In this tutorial, we will show you how to scrape public LinkedIn pages. If you have arrived here without a clear idea of why you would want to scrape LinkedIn company data, here are a few reasons:

  • Automating LinkedIn search: You want to work for a company that meets particular criteria, and the candidates are not the usual suspects. You may have a shortlist, but it is really more of a long list. You need a tool like Google Finance that can filter companies by the criteria they publish on LinkedIn. You can take that “long list”, scrape the data into a well-structured format, and build an analysis tool on top of it.

  • Curiosity: You are curious about the companies on LinkedIn and want to collect a good set of data to satisfy that curiosity.

  • Tinkering: You like to tinker, have found that you might like to learn Python, and need something useful to start with.

Whatever your reason, you have come to the right place!

This tutorial covers the basic steps for scraping data from LinkedIn using Python.

Prerequisites:

In this tutorial, we will use basic Python and two Python packages: LXML and Requests. We won’t use a more complex package like Scrapy for something this simple.

You need to install the following:

  • Python 2.7, available here (https://www.python.org/downloads/)

  • Python Requests, available here (https://docs.python-requests.org/en/latest/user/install/). You may need Python pip to install it, available here (https://pip.pypa.io/en/stable/installing/)

  • Python LXML (learn how to install it here: http://lxml.de/installation.html)
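Before running the scraper, it may help to confirm that both packages can actually be imported. The following is only a minimal sketch; check_packages is a hypothetical helper name and is not part of Requests or LXML.

```python
import importlib


def check_packages(names):
    """Return a dict mapping each package name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # Raised when the package is not installed in this environment.
            status[name] = False
    return status


if __name__ == "__main__":
    for name, ok in check_packages(["requests", "lxml"]).items():
        print("{}: {}".format(name, "installed" if ok else "missing, install it with pip"))
```

If either package is reported as missing, installing it with pip and running the check again should be enough.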

The code for scraping LinkedIn is embedded below. If you cannot see it in your browser, it can be downloaded from the GIST here.

from __future__ import print_function  # lets print() work on Python 2.7 and 3

import json

import requests
from lxml import html


def linkedin_companies_parser(url):
    # Try up to five times, because LinkedIn may answer with a login page or captcha.
    for i in range(5):
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36'
            }
            print("Fetching :", url)
            response = requests.get(url, headers=headers)
            # The company JSON sits inside <code> elements, wrapped in HTML comments.
            # The two marker strings were blank in the published listing;
            # '<!--' and '-->' are assumed here.
            formatted_response = response.text.replace('<!--', '').replace('-->', '')
            doc = html.fromstring(formatted_response)
            datafrom_xpath = doc.xpath('//code[@id="stream-promo-top-bar-embed-id-content"]//text()')
            # The "about" section is located but not used further in this tutorial.
            content_about = doc.xpath('//code[@id="stream-about-section-embed-id-content"]')
            if not content_about:
                content_about = doc.xpath('//code[@id="stream-footer-embed-id-content"]')

            if datafrom_xpath:
                try:
                    json_formatted_data = json.loads(datafrom_xpath[0])
                    # .get() returns None when a key is missing.
                    headquarters = json_formatted_data.get('headquarters')
                    if headquarters is not None:
                        street1 = headquarters.get('street1')
                        street2 = headquarters.get('street2')
                        # Treat a missing street line as empty so the join cannot fail.
                        street = (street1 or '') + ', ' + (street2 or '')
                    else:
                        headquarters = {}
                        street = None
                    data = {
                        'company_name': json_formatted_data.get('companyName'),
                        'size': json_formatted_data.get('size'),
                        'industry': json_formatted_data.get('industry'),
                        'description': json_formatted_data.get('description'),
                        'follower_count': json_formatted_data.get('followerCount'),
                        'founded': json_formatted_data.get('yearFounded'),
                        'website': json_formatted_data.get('website'),
                        'type': json_formatted_data.get('companyType'),
                        'specialities': json_formatted_data.get('specialties'),
                        'city': headquarters.get('city'),
                        'country': headquarters.get('country'),
                        'state': headquarters.get('state'),
                        'street': street,
                        'zip': headquarters.get('zip'),
                        'url': url
                    }
                    return data
                except Exception:
                    print("cant parse page", url)

            # Retry in case of captcha or login page redirection
            if len(response.content) < 2000 or "trk=login_reg_redirect" in url:
                if response.status_code == 404:
                    print("linkedin page not found")
                else:
                    raise ValueError('redirecting to login page or captcha found')
        except Exception:
            print("retrying :", url)


def readurls():
    companyurls = ['https://www.linkedin.com/company/tata-consultancy-services']
    extracted_data = []
    for url in companyurls:
        extracted_data.append(linkedin_companies_parser(url))
    # Write the file once, after every URL has been processed.
    with open('data.json', 'w') as f:
        json.dump(extracted_data, f, indent=4)


if __name__ == "__main__":
    readurls()
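The field extraction inside the parser can be tried without touching the network. The sketch below runs the same kind of lookups on a hand-written JSON string. The payload is invented for illustration and only mimics the key names used in the script above, and extract_company_fields is a hypothetical helper, not part of the original code.

```python
import json


def extract_company_fields(json_text):
    """Pick a few fields out of an embedded company JSON string.

    Missing keys become None instead of raising KeyError.
    """
    payload = json.loads(json_text)
    headquarters = payload.get("headquarters", {})
    return {
        "company_name": payload.get("companyName"),
        "industry": payload.get("industry"),
        "founded": payload.get("yearFounded"),
        "city": headquarters.get("city"),
        "country": headquarters.get("country"),
    }


# Invented sample payload; real pages may use different keys.
sample = '{"companyName": "Example Corp", "yearFounded": 1999, "headquarters": {"city": "Springfield"}}'

if __name__ == "__main__":
    print(extract_company_fields(sample))
```

Using dict.get keeps the lookups short and means a page with a missing field still yields a complete record, with None in the gaps.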

You just need to change the URL in this line

companyurls = ['https://www.linkedin.com/company/xbyte-crawling']

or add more URLs to the list, separated by commas. Then save the file and run it with Python: python filename.py

The result will be written to a file named data.json in the same directory and will look something like this

{
    "website": "https://www.xbyte.io",
    "description": "X-Byte Enterprise Crawling is among the best web scraping companies in the world for the reason.\r\n We won’t leave you with the \"self-service\" screen for building your individual scrapers.\r\n We have the real humans, which will chat to you inside hours of the request as well as help you in your requirement.\r\n Although we are the leading providers in this field, our investment in the automation has helped us in providing a totally \"full service\" at affordable prices.\r\n Contact us at www.xbyte.io and experience our amazing customer service ",
    "founded": 2012,
    "street": "Houston",
    "specialities": [
        "Web Scraping Service",
        "Website Scraping",
        "Screen scraping",
        "Data scraping",
        "Web crawling",
        "Data as a Service",
        "Data extraction API",
        "Scrapy",
        "Python",
        "DaaS"
    ],
    "size": "100-150 employees",
    "city": "Houston",
    "zip": "TX-770143",
    "url": "https://www.linkedin.com/company/xbyte-crawling",
    "country": "USA",
    "industry": "Computer Software",
    "state": "Texas",
    "company_name": "X-Byte Enterprise Crawling",
    "follower_count": 2262,
    "type": "Privately Held"
}

Or, if you run it for Cisco

companyurls = ['https://www.linkedin.com/company/cisco']

the result will look like this

{
    "website": "http://www.cisco.com",
    "description": "Cisco (NASDAQ: CSCO) allows people to create powerful connections– in education, philanthropy, business, or imagination. Cisco software, hardware, and services offerings are utilized for creating the Internet solutions, which make networks possible–offering easy use to data anywhere an time. \r\n\r\n Cisco was initiated in 1984 by the small group of computer professionals from Stanford University. Ever since the company’s origin, Cisco engineers are the leaders in development of the Internet Protocol (IP)-based networking skills. Today, having over 71,000 employees globally, this practice of revolution continues with the industry-leading solutions and products in company’s key development areas of switching and routing, and with advanced technologies like IP telephony, home networking, security, optical networking, wireless technology, and storage area networking. Besides its products, Cisco offers an extensive range of service offerings like advanced services and technical support. \r\n\r\n Cisco sells its services and products, both directly using its individual sales force or using the channel partners, commercial businesses, larger enterprises, consumers, and service providers.",
    "founded": 1984,
    "street": "Tasman Way, ",
    "specialities": [
        "Networking",
        "Wireless",
        "Security",
        "Unified Communication",
        "Telepresence",
        "Collaboration",
        "Data Center",
        "Virtualization",
        "Unified Computing Systems"
    ],
    "size": "10,001+ employees",
    "city": "San Jose",
    "zip": "95134",
    "url": "https://www.linkedin.com/company/cisco",
    "country": "United States",
    "industry": "Computer Networking",
    "state": "CA",
    "company_name": "Cisco",
    "follower_count": 1201541,
    "type": "Public Company"
}
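Because each result is a flat record, it is easy to move the data into a spreadsheet for analysis. As a rough sketch, assuming Python 3, the following turns a list of such records into CSV text using only the standard library. The name records_to_csv and the choice of columns are illustrative assumptions and not part of the script above.

```python
import csv
import io


def records_to_csv(records, columns):
    """Render a list of company dicts as CSV text, keeping only the given columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # A page that could not be parsed may leave None in the list; skip it.
        if record is None:
            continue
        # Lists such as "specialities" are joined so they fit into one cell.
        row = {
            key: "; ".join(str(item) for item in value) if isinstance(value, list) else value
            for key, value in record.items()
        }
        writer.writerow(row)
    return buffer.getvalue()
```

To use it on the scraper's output, load data.json with json.load and pass the resulting list together with the columns you care about, for example ["company_name", "industry", "city", "founded"].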

Warning: As LinkedIn requires you to log in whenever you open the website, this code might not work for you.
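The script above treats very short responses and login redirects as a sign that LinkedIn did not serve the company page. That check can be isolated as in the sketch below. Here classify_response is a hypothetical helper, and the 2000-byte threshold and the trk=login_reg_redirect marker are simply the values used in the tutorial code, not documented LinkedIn behavior.

```python
def classify_response(status_code, body_length, final_url):
    """Classify a fetch result as 'ok', 'not_found', or 'blocked'."""
    # A tiny body or a login redirect suggests the company page was not served.
    suspicious = body_length < 2000 or "trk=login_reg_redirect" in final_url
    if not suspicious:
        return "ok"
    if status_code == 404:
        return "not_found"
    # Anything else is assumed to be a login wall or captcha and is worth a retry.
    return "blocked"
```

With Requests, the three arguments would come from response.status_code, len(response.content), and response.url.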

You can easily change the fields or URLs you wish to scrape. Contact us for scraping LinkedIn for public company data!