Web « codeblog

December 20, 2013

DOM scraping

Filed under: Blogging,Debian,General,Ubuntu,Ubuntu-Server,Web — kees @ 11:16 pm

For a long time now I’ve used mechanize (via either Perl or Python) for doing website interaction automation. Stuff like playing web games, checking the weather, or reviewing my balance at the bank. However, as the use of javascript continues to increase, it’s getting harder and harder to screen-scrape without actually processing DOM events. To do that, really only browsers are doing the right thing, so getting attached to an actual browser DOM is generally the only way to do any kind of web interaction automation.

It seems the thing furthest along this path is Selenium. Initially, I spent some time trying to make it work with Firefox, but gave up. Instead, this seems to work nicely with Chrome via the Chrome WebDriver. And even better, all of this works out of the box on Ubuntu 13.10 via python-selenium and chromium-chromedriver.

Running /usr/lib/chromium-browser/chromedriver2_server from chromium-chromedriver starts a network listener on port 9515. This is the WebDriver API that Selenium can talk to. When requests are made, chromedriver2_server spawns Chrome, and all the interactions happen against that browser.

Since I prefer Python, I avoided the Java interfaces and focused on the Python bindings:

#!/usr/bin/env python
import sys
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

caps = webdriver.DesiredCapabilities.CHROME

browser = webdriver.Remote("http://localhost:9515", caps)

browser.get("https://bank.example.com/")
assert "My Bank" in browser.title

try:
    # Fill in the login form; RETURN in the password field submits it
    elem = browser.find_element_by_name("userid")
    elem.send_keys("username")

    elem = browser.find_element_by_name("password")
    elem.send_keys("wheee my password" + Keys.RETURN)
except NoSuchElementException:
    print("Could not find login elements")
    sys.exit(1)

assert "Account Balances" in browser.title

# Find the cell holding a dollar amount in the row labeled "Balance"
xpath = "//div[text()='Balance']/../../td[2]/div[contains(text(),'$')]"
balance = browser.find_element_by_xpath(xpath).text

print(balance)

browser.close()

This would work pretty great, but if you need to save any state between sessions, you’ll want to be able to change where Chrome stores data (since by default in this configuration, it uses an empty temporary directory via --user-data-dir=). Happily, various things about the browser environment can be controlled, including the command line arguments. This is configurable by expanding the “desired capabilities” variable:

caps = webdriver.DesiredCapabilities.CHROME
caps["chromeOptions"] = {
        "args": ["--user-data-dir=/home/user/somewhere/to/store/your/session"],
    }
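One caveat: webdriver.DesiredCapabilities.CHROME is a dictionary shared by the whole process, so assigning into it changes it for every later user. A minimal sketch of building a private copy instead follows. The helper name chrome_caps and the plain base dictionary are illustrative stand-ins (in the real script you would pass webdriver.DesiredCapabilities.CHROME); neither is part of Selenium.

```python
import copy


def chrome_caps(base, profile_dir, extra_args=()):
    """Return a copy of `base` with Chrome command-line arguments added."""
    caps = copy.deepcopy(base)  # leave the shared dictionary untouched
    args = ["--user-data-dir=" + profile_dir]
    args.extend(extra_args)
    caps["chromeOptions"] = {"args": args}
    return caps


# Stand-in for webdriver.DesiredCapabilities.CHROME (an assumption for this sketch)
base = {"browserName": "chrome"}
caps = chrome_caps(base, "/home/user/somewhere/to/store/your/session")
print(caps["chromeOptions"]["args"])
```

The resulting caps dictionary can be handed to webdriver.Remote exactly as in the script above.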

A great thing about this is that you get to actually watch the browser do its work. However, in cases where this interaction is going to be fully automated, you likely won’t have an Xorg session running, so you’ll need to wrap the WebDriver in one (since it launches Chrome). I used Xvfb for this:

#!/bin/bash
# Start WebDriver under fake X and wait for it to be listening
xvfb-run /usr/lib/chromium-browser/chromedriver2_server &
pid=$!
while ! nc -q0 -w0 localhost 9515; do
    sleep 1
done

the-chrome-script
rc=$?

# Shut down WebDriver
kill $pid

exit $rc

Alternatively, all of this could be done in the python script too, but I figured it’s easier to keep the support infrastructure separate from the actual test script itself. I actually leave the xvfb-run call external too, so it’s easier to debug the browser in my own X session.
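For comparison, here is a rough Python 3 sketch of what the all-in-Python variant might look like. It mirrors the shell wrapper: start the server, poll the port until it accepts connections, run the script, then shut the server down. The function names wait_for_port and run_under_webdriver are made up for this sketch, and the timeout values are arbitrary.

```python
import socket
import subprocess
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Poll until a TCP connect to host:port succeeds. False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            conn = socket.create_connection((host, port), timeout=interval)
        except OSError:
            # Nothing listening yet (or the connect timed out); try again
            time.sleep(interval)
        else:
            conn.close()
            return True
    return False


def run_under_webdriver(script_argv, server_argv, host="localhost", port=9515):
    """Start the WebDriver server, run the script, and return its exit code."""
    server = subprocess.Popen(server_argv)
    try:
        if not wait_for_port(host, port):
            return 1
        return subprocess.call(script_argv)
    finally:
        # Shut down WebDriver, as the shell version does with kill
        server.terminate()
        server.wait()
```

A call would look like run_under_webdriver(["the-chrome-script"], ["xvfb-run", "/usr/lib/chromium-browser/chromedriver2_server"]). Whether terminating xvfb-run also stops the server it wraps is worth checking on your system.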

One bug I encountered was that the WebDriver’s cache of the browser’s DOM can sometimes get out of sync with the actual browser’s DOM. I didn’t find a solution to this, but managed to work around it. I’m hoping later versions fix this. :)
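The post does not say what the workaround was. One common pattern for this class of problem (an assumption here, not necessarily what was used) is to look the element up again and retry when the driver complains that its reference no longer matches the page. A generic sketch, with the hypothetical helper name retry:

```python
import time


def retry(action, retry_on, attempts=3, delay=0.5):
    """Call action() until it succeeds, retrying on the given exception types."""
    for attempt in range(attempts):
        try:
            return action()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller see the error
            time.sleep(delay)
```

With Selenium this might be used as retry(lambda: browser.find_element_by_xpath(xpath).text, StaleElementReferenceException), importing that exception from selenium.common.exceptions. The lookup happens inside the lambda so each attempt fetches a fresh element.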

Comments (2)

November 9, 2012

product search in Ubuntu 12.10

Filed under: Blogging,Security,Ubuntu,Web — kees @ 3:18 pm

The EFF has already discussed the product search “feature” in Ubuntu 12.10’s Unity UI. Ways for disabling it are covered:

sudo apt-get remove unity-lens-shopping – it isn’t easy to generally blacklist a package, it might end up getting re-installed later, etc.
System Settings / Privacy / Search Results – the naming says nothing about it disabling product search results.
use a UI other than Unity – this is what I do.

Here’s another way, that overrides the URL used for the product searching (restart your session after making this change):

$ sudo -s
# echo 'OFFERS_URI="https://localhost:0/"' >> /etc/environment

Or, if you run an organization where you build devices that run Ubuntu, and want to snoop on all the things people type into their Unity search bar, just change that to a URL you control.

I’m astonished by Canonical’s blatant disregard for providing a way to opt in to this gaping privacy hole. This is a dramatic case of “calling home”, and provides a large amount of information about the user, in real-time. Besides sending the content of their searches and the version of the software installed, it also sends every keystroke, which means that in some weird cases, even passive observers can examine keystroke timing, which has been shown to potentially leak what is being typed:

127.0.0.1 - - [09/Nov/2012:14:29:41 -0800] "GET //v1/search?q=p HTTP/1.1" 404 522 "-" "Unity Shopping Lens 6.8.0"
127.0.0.1 - - [09/Nov/2012:14:29:41 -0800] "GET //v1/search?q=pw HTTP/1.1" 404 521 "-" "Unity Shopping Lens 6.8.0"
127.0.0.1 - - [09/Nov/2012:14:29:41 -0800] "GET //v1/search?q=pwn HTTP/1.1" 404 521 "-" "Unity Shopping Lens 6.8.0"
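To make the per-keystroke pattern explicit, here is a small Python 3 sketch that pulls the timestamp and query string out of access-log lines like the three above. The regular expression is written for exactly this log shape and the function name typed_prefixes is made up for illustration.

```python
import re
from urllib.parse import unquote_plus

LOG_LINES = [
    '127.0.0.1 - - [09/Nov/2012:14:29:41 -0800] "GET //v1/search?q=p HTTP/1.1" 404 522 "-" "Unity Shopping Lens 6.8.0"',
    '127.0.0.1 - - [09/Nov/2012:14:29:41 -0800] "GET //v1/search?q=pw HTTP/1.1" 404 521 "-" "Unity Shopping Lens 6.8.0"',
    '127.0.0.1 - - [09/Nov/2012:14:29:41 -0800] "GET //v1/search?q=pwn HTTP/1.1" 404 521 "-" "Unity Shopping Lens 6.8.0"',
]

# Timestamp in [...] followed by a GET whose query string carries q=
REQUEST_RE = re.compile(r'\[(?P<when>[^\]]+)\] "GET \S*[?&]q=(?P<query>[^ &"]*)')


def typed_prefixes(lines):
    """Return (timestamp, query) pairs, one per logged search request."""
    found = []
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            found.append((match.group("when"), unquote_plus(match.group("query"))))
    return found


for when, query in typed_prefixes(LOG_LINES):
    print(when, repr(query))
```

Each request carries one more character than the last, so anyone who can see the requests (or just their timing) sees the search being typed.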

Ubuntu is a general-purpose OS, with Unity as its default interface. It is not a vendor-tied appliance nor a telephone company device, and Unity is not a browser (in fact, even in a browser there are visual indicators of where what you have typed will go).

Even if the default for this is enabled, there needs to be (likely at install-time) a page describing what to expect, and the system owner can choose “yes, search online” or “no thanks”. This behavior needs to be fixed in 13.04 and SRUed into 12.10. If there is no fast solution, then it just needs to be disabled by default until it has a sane notification flow.

Comments (1)

January 24, 2010

Google is wardriving

Filed under: Blogging,Debian,General,Networking,Security,Ubuntu,Web — kees @ 8:28 pm

So, a while back, Google started providing location services. This seemed pretty cool, but I kind of ignored it until recently when I was playing with my Android’s location API. With the GPS off, and no cell towers visible (my basement gets terrible cell service), my phone knew within about 500 feet of where it actually was. All I was connected to was my wifi.

Bottom line: it seems that Google, among other methods, is likely wardriving while photographing for Street View. They are now able to pinpoint wifi access points if they happened to see them while driving through your city.

I’m really rather astonished that no one is freaking out about this; I’m a bit unnerved. I implemented the location-of-your-wifi API quickly, so I could terrify myself further. You can do lookups via my location website too, if you want.

UPDATE: yeah, it would seem to be crowd-sourced wifi and cell tower triangulation data. I should say “Google is WarCrowding”.

Comments (25)

November 21, 2008

directing firefox url opening

Filed under: Blogging,Debian,Ubuntu,Web — kees @ 5:53 pm

A while ago, I found a way to open multiple firefox instances so I could have multiple cookie/password repositories. One issue that plagued me was that when launching URLs from other applications (e.g. liferea, gnome-terminal, etc), the URL would open in the longest running firefox, so if I opened my Browsing instance first, then my Authenticated instance, URLs would open in the Browsing instance. That’s what I wanted.

However, if I had to restart my Browsing instance (like, say, when flash was crashing), URLs would start opening in the Authenticated instance, which I didn’t want. So, I changed the Gnome preferences to run this script when launching a browser:


#!/bin/sh
firefox -p Browsing -remote 'openURL('"$@"')'

Ta-da, URLs are always directed to the “Browsing” instance, regardless of when it was started.

Comments (2)

November 18, 2008

md5 lookups for 4 chars and common words

Filed under: Blogging,Debian,Security,Ubuntu,Web — kees @ 8:19 pm

Here’s a fun link. This site appears to have seeded their md5 hash list with all lower case character strings of 4 characters or fewer and many English words (probably from some large dictionaries), and they seem to be adding more as they go. This makes me want to put up an interface to the 7 character alpha-numeric-plus-many-special-chars rainbow table I’ve got. But searching the 500G table for a single hash takes… a while. I’d need to batch it up. Go-go-gadget web 2.0!
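To give a feel for how small that seed set is, here is a Python 3 sketch that builds the same kind of reverse-lookup table for short lower-case strings. This is a plain precomputed dictionary, not a rainbow table, and the names candidates and build_table are made up for the example. It stops at 3 characters to stay quick; going to 4 only raises the count to 475,254 entries.

```python
import hashlib
import itertools
import string


def candidates(max_len, alphabet=string.ascii_lowercase):
    """Yield every string over `alphabet` of length 1..max_len."""
    for length in range(1, max_len + 1):
        for combo in itertools.product(alphabet, repeat=length):
            yield "".join(combo)


def build_table(max_len):
    """Map md5 hex digest -> plaintext for all short lower-case strings."""
    return {
        hashlib.md5(word.encode("ascii")).hexdigest(): word
        for word in candidates(max_len)
    }


table = build_table(3)
print(len(table))                                     # 18278
print(table.get("900150983cd24fb0d6963f7d28e17f72"))  # abc
```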

Comments (4)

February 16, 2008

firefox trick and recovery help

Filed under: Ubuntu,Web — kees @ 10:05 am

To provide myself with slightly more safety through separation, I run two firefox profiles simultaneously. One is the “general” browser for day-to-day viewing of random (and unauthenticated) sites, and the other is the “authenticated” browser, which contains the cookies for known sites I authenticate against. The trick for this is having a launcher that runs firefox without attempting to request a new window from the currently running profile:

bash -c "MOZ_NO_REMOTE=1 firefox -ProfileManager"

And in a recent bug-hunting session, I had a firefox profile that just kind of didn’t load javascript correctly any more (“change_feedback_state is not defined” on facebook). I have no idea what was causing the issue (something other than extensions, since it didn’t go away in “-safe-mode”), and so I just reconstructed the profile one bit at a time, eventually leaving all of prefs.js out. I used the migration checklist I found at mozillazine.

Comments (3)

February 3, 2007

OpenID and goofy Claims

Filed under: Blogging,Inkscape,Ubuntu,Web — kees @ 8:33 am

I’ve been having fun fighting religious battles and confusing people with in-jokes at jyte.com. Other good claims:

Or just see what’s been claimed about linux in general. Yay for silly social networking sites! :)

Comments Off

January 23, 2007

CVE links via Greasemonkey

Filed under: Blogging,Security,Ubuntu,Web — kees @ 10:00 pm

I spend a good bit of time reading CVEs but their entries are plain text, without links associated with their various recorded URLs. I’m annoyed at having to select/paste to load a URL, so I had to go code a work-around. :)

Since MozDev’s “linkify.user.js” was a bit heavy-handed, I wrote up a quick hack based on similar code to only look at mitre.org’s LI tags: “cve-links.user.js”.

Comments Off

December 13, 2006

silly things to do with unicode

Filed under: Blogging,Ubuntu,Web — kees @ 12:37 pm

˙ǝɓuɐɹʇs ʎɹǝʌ ʍoH

‮Unicode is so odd.

Comments Off

May 3, 2006

fun with OpenID

Filed under: Blogging,Security,Web — kees @ 6:04 pm

While I can’t log into NetFlix or Amazon with OpenID (or other federated login systems), I still wanted to try it out. The goal is to easily write comments on people’s blogs, edit Wiki pages, etc, all without having to keep logging in every time. So far, so good.

First step was to decide between running my own OpenID server or not. I went with “not”, since there really isn’t an installable OpenID server yet (there are only support libraries, it seems). Since I was given a permanent account with LiveJournal for some XSS testing I did for them, I figured I’d just use their stuff. I wanted to use “outflux.net” as my login everywhere, so I just added two lines to my outflux.net HTML source:

<link rel="openid.server" href="http://www.livejournal.com/openid/server.bml" />
<link rel="openid.delegate" href="http://keescook.livejournal.com/" />
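As a rough illustration of what a consuming site does with those two tags, here is a Python 3 sketch that pulls the server and delegate URLs out of a page using only the standard library. The class and function names are made up for the example, and real OpenID discovery involves more than this.

```python
from html.parser import HTMLParser

WANTED = ("openid.server", "openid.delegate")


class OpenIDLinkParser(HTMLParser):
    """Collect the href of each <link> whose rel names an OpenID endpoint."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        # rel may hold several space-separated values
        for rel in (attrs.get("rel") or "").lower().split():
            if rel in WANTED and href:
                self.links.setdefault(rel, href)


def discover(html):
    """Return a dict of the OpenID link relations found in `html`."""
    parser = OpenIDLinkParser()
    parser.feed(html)
    return parser.links


PAGE = """<html><head>
<link rel="openid.server" href="http://www.livejournal.com/openid/server.bml" />
<link rel="openid.delegate" href="http://keescook.livejournal.com/" />
</head><body></body></html>"""

print(discover(PAGE))
```

Note that the attribute values need plain ASCII quotes in the page source for the tags to parse this way.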

Poof. Done. I used Videntity to verify that it was all working. Nifty stuff.

My only complaint is that it’s not clear how to get an end-to-end secure login. I can log into LiveJournal securely, but the OpenID server they run doesn’t seem to operate over HTTPS. Future study is needed. :)

Comments Off

April 9, 2006

greasemonkey for RMLS

Filed under: Web — kees @ 8:15 pm

The house-for-sale listings that RMLSweb.com produces are very detailed, and even include a link to show a map for each house’s address. However, this link goes to MapQuest, which I find infuriatingly annoying to use. I wanted the link to at least go to Google Maps instead. Since I live near Portland, I also wanted to search the fantastic Portland Maps site at the same time. That way I could see lot dimensions, crime statistics, etc.

This was clearly a perfect job for GreaseMonkey. The result, after my usual fights with javascript, is my script to override the RMLS address mapping function.

Comments (3)

March 22, 2006

debugging firefox extensions

Filed under: Web — kees @ 6:15 pm

After installing my amd64 machine and getting my desktop moved, I noticed that Firefox seemed to be running really slowly. Especially google maps. After Brian showed me the Firefox Hacks book, I decided to try and dig into the cause.

By setting the environment variable “NSPR_LOG_MODULES=all:5” you see damn near everything Firefox is doing while it does it. I noticed that it was stalling every time it processed a new cookie (since I don’t let Google set cookies). So I started removing each of my cookie extensions.

To get myself back to a sane state, I just backed up my Firefox profile:

cp -a ~/.mozilla/firefox/*.default ~/firefox-profile

Then removed one extension, restarted Firefox, etc, until I found the busted one. Turns out “Extended Cookie Manager” was my problem, so I replaced it with “Cookie Button in the status bar”.

Tedious, but, it worked. And for some reason, getting a list of all the Firefox environment variables proves to be very difficult.

Comments (3)

January 3, 2006

my firefox extensions

Filed under: Web — kees @ 11:50 pm

I’ve gotten all my Firefox extensions straightened out again since the 1.5 shake-up. Now that it’s sane, I wanted to report what I’m using so I’ll remember for the future, and so I can go look somewhere if I’m on my laptop or some other machine and I can’t remember which extension I was looking for.

Also at least one other person has asked me what extensions I’m using, so maybe others would be interested. I used the following to get a human-readable list of my extensions:

egrep ':(version|name|description|type)' ~/.mozilla/firefox/*.default/extensions.rdf | \
perl -pe 'print "\n" if (/:version/); s/^[^=]+="//; s/"$//;' | \
(read EMPTY; while read VER; do \
  read NAME; read DESC; read TYPE; read EMPTY; \
  if echo "$TYPE" | grep -q '>2<'; then \
    echo "$NAME $VER"; echo "$DESC"; echo ""; \
  fi; \
done)

I bet there is an elegant XPath command to extract this directly with a single “perl” execution, but, uhm, I’m not an XML expert. :)

Almost all of these extensions were gotten from the Firefox extensions list:

BugMeNot (0.9) Bypass compulsory web registration with the context menu via www.bugmenot.com.
Tabbrowser Preferences (1.2.8.8) Enhances control over some aspects of tabbed browsing.
Modify Headers (0.5.1) Add, modify and filter http request headers
View Rendered Source Chart (1.2.03) Creates a Colorful Chart of a Webpage’s Rendered Source
Gcache (0.2.1) Displays a google cached version of the webpage.
Adblock (0.5.2.039) Filters ads from web-pages
Word Count (0.3) Counts the number of words in selected text.
Allow Right-Click (0.3) Defeats web sites’ right-click prevention scripts.
JavaScript Options (1.2.2) Provides advanced JavaScript options for Firefox.
User Agent Switcher (0.6.6) Adds a menu and a toolbar button to switch the user agent of the browser.
Live HTTP Headers (0.11) View HTTP headers of a page and while browsing.
Download Manager Tweak (0.7.1) A modification of the Firefox download manager that changes its appearance and allows it to be opened in a separate window, a new tab, or the sidebar.
View Cookies (1.5) View cookies of the current web page.
udtranslate (0.0.7) UDTranslate: a zombie translation utility for Urban Dead
Stop-or-Reload Button (0.2) Turns the stop and reload buttons into a single one. When you can stop, you have a Stop button, otherwise you have a Reload button. (Like in Safari)
QuickJava (0.4.1) Allows quick enable and disable of Java and Javascript from statusbar.
Flashblock (1.5) Replaces Flash objects with a button you can click to view them.
Greasemonkey (0.6.4) A User Script Manager for Firefox
Fasterfox (1.0.1) Performance and network tweaks for Firefox.
Disable Targets For Downloads (1.0) Prevents download links opening a blank window.
QuickProxy (2005.12.04) Quickproxy creates a statusbar button to quickly turn the proxy on and off.
DownThemAll! (0.9.8.4) The mass downloader for Firefox.
Web Developer (0.9.4) Adds a menu and a toolbar with various web developer tools.
Wayback (0.1.1) Displays an archived version of the webpage.
Extended Cookie Manager (0.5.5) Change the cookie status for websites on demand.

In addition, it seems the option in TabBrowserPreferences that loads a URL pasted into the display window was removed. After some Googling, I was directed to the about:config page, under “middlemouse.contentLoadURL”. Set it to true to restore the prior default behavior.

Comments (1)

September 8, 2005

greasemonkey and fantasy football

Filed under: Web — kees @ 2:01 pm

Fantasy Football isn’t something I have any idea how to play, but I thought I’d join a few friends in their league, since they seemed to have so much fun with it. It’s through Yahoo, and is pretty nifty. I’m slowly learning how to play, but I doubt I’ll ever be any good considering how much I don’t watch football. In the Yahoo interface, however, to evaluate a player’s depth (first string, second string, etc.) I have to click on the player, then the team, then the “Depth chart” link. This was annoying, so it was a perfect opportunity to learn some more javascript, xpath, and DOM manipulation. As a result, I wrote a greasemonkey script to add team links wherever a team abbreviation is seen.

Comments (1)

codeblog code is freedom — patching my itch
