
Introduction

When it comes to online Q&A forums, the first name that comes to mind is Quora. It launched in 2010 and has since become a key source of knowledge for people around the world. There are very few subjects, if any, that Quorans have not discussed. Whether you are planning a holiday or cannot decide which college to attend, Quora usually has an answer. However, these answers vary in quality, because not every user on the platform has the same level of knowledge about the subject at hand.

To deal with this, Quora uses upvotes and downvotes. Generally, if you search for a specific question, the answers with the most upvotes appear at the top.

Most people look for answers on Quora directly. However, you can also write a Quora scraper that takes a question as input and extracts the data for you, and that is what we will show you today.

Scrape Quora using Python

It is easy to scrape Quora questions and answers using Python 3.7 and Beautiful Soup, and to save the extracted data in JSON format. The only other thing you need is a good text editor. You can use PyCharm, which is a full-blown IDE, or Atom, which is more lightweight and comes with many plugins.


Get Data from Quora

Starting with the code, we first import the libraries we need, both built-in and external. Next, we set the verify mode of the SSL context to “CERT_NONE” and set check_hostname to False, to avoid SSL certificate errors while scraping Quora. With that, the setup is complete and we can accept a question from the user.
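The setup step described above can be sketched as follows. This is a minimal illustration and not the original script: the function names `make_unverified_context` and `normalize_question` are hypothetical, and only the standard `ssl` module is used.

```python
import ssl


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be switched off first; setting CERT_NONE while it is
    # still enabled raises a ValueError.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def normalize_question(raw: str) -> str:
    """Collapse stray whitespace in the question typed by the user."""
    return " ".join(raw.split())


# Typical use (interactive, so left as a comment here):
# question = normalize_question(input("Enter your question: "))
```

Skipping verification avoids certificate errors, but it also removes the protection the checks provide, so it is best kept to experiments like this one.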

Once we create the URL, we use the built-in Request class from urllib to fetch the webpage, and we send a Firefox User-Agent header so that the request looks like it comes from a regular browser. This part is important, as many websites block scrapers; if you omit the header, your requests may be rejected, your IP might get blocked, and further action could be taken against you.
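A rough sketch of this step is shown below. The slug rule in `question_to_url` (words joined by hyphens) is an assumption inferred from Quora URLs such as https://www.quora.com/what-are-the-best-laptops-for-gaming, and the function names and the exact User-Agent string are illustrative rather than taken from the original script.

```python
import re
import urllib.request

BASE_URL = "https://www.quora.com/"

# Example Firefox User-Agent string; any current browser string would do.
FIREFOX_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def question_to_url(question: str) -> str:
    """Turn a question into a Quora-style URL (assumed slug format)."""
    words = re.findall(r"[A-Za-z0-9]+", question)
    return BASE_URL + "-".join(words)


def build_request(url: str) -> urllib.request.Request:
    """Create a Request that identifies itself as Firefox."""
    return urllib.request.Request(url, headers={"User-Agent": FIREFOX_USER_AGENT})


# Fetching the page would then look like this (network call, not run here):
# html = urllib.request.urlopen(build_request(url), context=ssl_context).read()
```

Here `ssl_context` stands for whatever SSL context was prepared during setup.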


After obtaining the webpage as HTML and storing it in a variable, we convert it to a Beautiful Soup object so that it becomes easier to parse and scrape data from Quora. We then scrape the question from the first “title” tag on the page. We need to remove “– Quora” from it, as every title carries that suffix. Scraping the Quora answers is a bit more complicated. You have to extract the JSON stored in the “script” element whose “type” attribute has the value “application/ld+json”. In this JSON you will find a list of answers, each with multiple fields. Although several fields are provided for every answer, we have scraped the most significant ones:

  • The date on which the answer was written.
  • The answer itself.
  • The number of upvotes it has received.
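The extraction described above might look roughly like the sketch below. To keep it free of third-party dependencies it uses the standard library's `html.parser` as a stand-in for Beautiful Soup. The JSON keys (`mainEntity`, `suggestedAnswer`, `dateCreated`, `text`, `upvoteCount`) follow the schema.org QAPage vocabulary and are an assumption about Quora's markup. The sample HTML is invented for illustration, and all function and output field names are hypothetical.

```python
import json
import re
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the first <title> and every application/ld+json script body."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.ld_json = []
        self._in_title = False
        self._title_done = False
        self._in_ld = False

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._title_done:
            self._in_title = True
        elif tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self.ld_json.append("")

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._title_done = True
        elif tag == "script":
            self._in_ld = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_ld:
            self.ld_json[-1] += data


def extract_question_and_answers(html: str) -> dict:
    parser = PageParser()
    parser.feed(html)
    # Drop the trailing "- Quora" (hyphen or en dash) from the page title.
    question = re.sub(r"\s*[-\u2013]\s*Quora\s*$", "", parser.title).strip()
    answers = []
    for block in parser.ld_json:
        data = json.loads(block)
        if not isinstance(data, dict):
            continue
        # Assumed schema.org layout: mainEntity -> suggestedAnswer -> [answers]
        for item in data.get("mainEntity", {}).get("suggestedAnswer", []):
            answers.append({
                "date": item.get("dateCreated"),
                "text": item.get("text"),
                "upvotes": item.get("upvoteCount"),
            })
    return {"question": question, "answers": answers}


# Invented sample page, used only to exercise the sketch.
SAMPLE_HTML = (
    "<html><head><title>What are the best laptops for gaming? - Quora</title>"
    '<script type="application/ld+json">'
    '{"@type": "QAPage", "mainEntity": {"@type": "Question", "suggestedAnswer": '
    '[{"@type": "Answer", "text": "Example answer text.", '
    '"dateCreated": "2020-01-01T00:00:00Z", "upvoteCount": 3}]}}'
    "</script></head><body></body></html>"
)

result = extract_question_and_answers(SAMPLE_HTML)
```

With Beautiful Soup, the equivalent lookups would be `soup.title` and `soup.find("script", type="application/ld+json")`.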

When the scraping of an answer is complete, we add it to the list of answers and, at the end, save the final list to a JSON file.
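Saving the result could be done along these lines. The field names `question` and `answers` and the function name `save_answers` are assumptions made for this sketch; the original script may name them differently.

```python
import json


def save_answers(question: str, answers: list, path: str) -> dict:
    """Write the question and its scraped answers to a JSON file."""
    record = {"question": question, "answers": answers}
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-English answer text readable in the file.
        json.dump(record, handle, ensure_ascii=False, indent=2)
    return record
```

Calling `save_answers(question, answers, "answers.json")` after the scraping loop would produce a single file holding the question and all collected answers.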

Know the Output

The resulting JSON file contains the answers that were extracted from the HTML page when we ran the code with a question entered as described in the previous section. The JSON has two fields: the question and the answers. Every answer includes the three parameters mentioned earlier. A single question can yield many extracted answers.

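import datetime
import re

from scrapy.http import HtmlResponse

# Assumes `drive` (a Selenium WebDriver), `url` (the question URL ending in "/")
# and `anser_name` (the list of answer links) are defined earlier in the script.
for index, name in enumerate(anser_name):
    answer_link = url + f'answer/{name.split("/")[-1]}'
    # e.g. answer_link = 'https://www.quora.com/what-are-the-best-laptops-for-gaming/answer/J-Omark'
    Answer_ID = str(index + 1)
    drive.get(answer_link)
    answer_response = HtmlResponse(url=answer_link, body=drive.page_source.encode('utf8'), encoding='utf-8')
    # The answer text is embedded as escaped JSON inside the page source.
    answers = re.findall(r'\\\\\"text\\\\\":\\\\\"(.*?)\\\\\"', answer_response.text)
    answers = '\n'.join(answers).replace('\\', '')
    answer_time = re.findall(r'\\"updated_time\\":(.*?),', answer_response.text)[0]
    if answer_time != 'null':
        # Keep the first 13 digits (assumed to be milliseconds) and convert to seconds.
        answer_time = datetime.datetime.strftime(
            datetime.datetime.fromtimestamp(int(answer_time[:13]) / 1000), '%dth %B %Y')
    else:
        answer_time = ''
    answer_views = re.findall(r'\\"numViews\\":(.*?),', answer_response.text)[0]
    upvotes = re.findall(r'\\"numUpvotes\\":(.*?),', answer_response.text)[0]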

               

You can contact X-Byte Enterprise Crawling for all your Quora Data Scraping service requirements or ask for a free quote!