In today's post I thought I'd talk about one of my projects, and what's cool is that it relates to my blog posts on TheNibbleByte's website! 😃
It's a program that scrapes my view counts off the internet and then uses the data to generate a graph showing the trend in my views. This will be useful as I continue to publish more and more posts on this website!
I should point out that I'm only scraping from this website and not Medium, because I haven't built an audience there yet.
I should also point out that this solution is by no means perfect. I'm sure I could have made it more sophisticated, more efficient and added more features; however, this is just something I quickly put together that does the job (a heuristic solution 😉).
In this post I will be showing the code (full source code is available on my GitHub).
In my opinion, this post isn't really for Python beginners (although it's still worth reading!). Nevertheless, enjoy!
Selenium is an open-source web-automation tool, and its WebDriver is what we use to access the page and 'scrape' data.
The time module was something I used during development to pause execution for 10 seconds, just so I knew how long to wait.
NumPy is a linear-algebra library, which I used to cleanse the data and convert it from a list to an array.
Pandas is what I used to convert the array(s) into a DataFrame.
Seaborn is a data-visualisation library; I fed the DataFrame into it as a parameter to create the line plot.
Getting the Data:
The first two lines are where the Chrome WebDriver is used, and that's what we use to scrape the web on Chrome: https://chromedriver.chromium.org/downloads.
The second line just creates an instance of the WebDriver in our Python code, which allows us to access the website later.
The next line 'gets' the webpage that contains all my blog posts. Running this line on its own would just open the webpage and do nothing else.
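As a sketch of what those lines look like (the URL is a placeholder and `open_blog_page` is a name I've made up for illustration; it assumes ChromeDriver is on your PATH):

```python
def open_blog_page(url):
    """Create a Chrome WebDriver instance and load the given page."""
    # Imported inside the function so the sketch loads without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
    driver.get(url)              # opens the page and then does nothing else
    return driver
```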
Now I find all the span tags with the class name "M1M61". This might sound quite random, but I found it by using 'inspect element', looking for the number of views and viewing the HTML:
At this point it should be noted that this step varies depending on your project. The point is that I'm looking for the number of views; this will be handy later.
Next, I initialise a Python list called 'views', which will store the view count for each blog post.
Now I loop through all the view counts, convert each one from a WebElement to an int and then store it in the list.
The 5-second pause was something I used while developing; it isn't really necessary. Once I've collected the data I close the page, as we're done with it.
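The loop can be sketched like this, with plain strings standing in for WebElements so the idea is testable without a browser (a real WebElement exposes the number through its .text attribute; `to_view_counts` is an illustrative name, not from my script):

```python
def to_view_counts(elements):
    """Convert scraped view-count elements into a list of ints."""
    views = []
    for el in elements:
        # A WebElement carries its number in .text; a plain string is used as-is.
        text = el.text if hasattr(el, "text") else el
        views.append(int(text))
    return views
```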
Cleansing the Data/Putting it into an Appropriate Format:
It's useless having all the data if it's not in a format where we can analyse and interpret it and then plot graphs to visualise it.
At this point, I just want to remind you again that this part isn't the most optimal, but it does do the job for the time being!
When I was looking at the HTML, I noticed that some of the comments had the same class name ("M1M61") as the view counts. This is quite challenging because it made it difficult to distinguish between views and comments.
The only way I could distinguish between them was by assuming that a comment count was never going to be more than 10 (to date, this has been the case!). Of course, this could cause problems in the future as my numbers of viewers and comments increase, but like I said, it works for now!
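One way to apply that assumption with NumPy's boolean indexing looks like this (the numbers here are made up for illustration, and the DataFrame layout is just one plausible shape for feeding Seaborn later):

```python
import numpy as np
import pandas as pd

# Hypothetical scraped numbers: a mix of view counts and comment counts,
# since both carry the same "M1M61" class in the page's HTML.
scraped = np.array([152, 3, 98, 0, 210, 7])

# The heuristic: anything 10 or under is assumed to be a comment count,
# so only the larger values are kept as views.
views = scraped[scraped > 10]   # keeps 152, 98, 210

# One post per remaining value, ready to hand to a plotting library.
df = pd.DataFrame({"post": range(1, len(views) + 1), "views": views})
```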