Forwarding to list.
Subject: Re: [Tutor] Parsing/Crawling test College Class Site.
forgot to mention why I was posting this...
Hi Alan.
Thanks. So, here goes!
https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
The following is a sample of the test code, as well as the url/posts
of the pages as produced by the Firefox/Firebug process.
Basically, a user accesses the initial url, and then selects a couple
of items on the page, followed by the "Search" btn at the bottom of
the page.
-subject (insert ACC) for accounting
-uncheck "Show Open Classes Only"
-select the "Additional Search Criteria" expansion (bottom of the page)
--In the "Days of Week" dropdown, select the "include any of these days"
--select all days except Sat/Sun
finally, select the "Search" btn, which generates the actual class
list for the ACC dept.
During each action, the app might generate ajax which
updates/interfaces with the backend. All of this can be seen/tracked
(I think) if you have the Firebug plugin for firefox running, where
you can then track the cookies/post actions. The same data can be
generated running LiveHttpHeaders (or some other network app).
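The cookie tracking described above can also be done from inside Python rather than by reading it off Firebug. What follows is only a minimal sketch, assuming a standard-library cookie jar attached to a urllib2 opener. make_opener and dump_cookies are hypothetical helper names, and the try/except import is there only so the snippet also runs on Python 3.

```python
try:                                   # Python 2 (as used in this post)
    import cookielib
    import urllib2
except ImportError:                    # Python 3 names for the same modules
    import http.cookiejar as cookielib
    import urllib.request as urllib2

def make_opener(user_agent):
    """Return (opener, jar); the jar keeps cookies between requests."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar

def dump_cookies(jar):
    """List (name, value) pairs currently held, to compare with Firebug."""
    return [(c.name, c.value) for c in jar]

opener, jar = make_opener("Mozilla/5.0")
# opener.open(url) would issue the GET; passing a body as the second
# argument would issue a POST. No request is made in this sketch.
print(dump_cookies(jar))   # nothing is stored until a request has been made
```

Calling dump_cookies(jar) after each request gives a list that can be checked against the cookies Firebug shows for the same step.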
The process is running on CentOS, using Python 2.6.6.
The test app is a mix of standard py, and liberal use of the system
curl cmd. In order to generate one of the post vars, XPath is used to
extract the value from the initial generated file/content.
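For the extraction step, here is a rough sketch that avoids the third-party XPath library. It assumes the post var in question is ICSID and that the page carries it in a hidden input named ICSID; neither the markup nor the XPath appears in this post, so both are assumptions. HiddenFieldParser, hidden_fields and the sample markup are made up for illustration.

```python
try:                                   # Python 3
    from html.parser import HTMLParser
except ImportError:                    # Python 2
    from HTMLParser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect name -> value for every <input type="hidden"> in a page."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

def hidden_fields(html):
    """Parse an HTML string and return its hidden form fields as a dict."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

# Made-up sample markup, for illustration only (assumed field layout)
sample = "<form><input type='hidden' name='ICSID' id='ICSID' value='abc123=' /></form>"
print(hidden_fields(sample).get("ICSID"))
```

If the real page does not use a hidden input for this value, the same parser can still be used to list what hidden fields it does send.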
#!/usr/bin/python
#-------------------------------------------------------------
#
# unlvClassTest.py
#
# jun/1/15
#
#
#
# test generating of the psoft dept data
#
# cmdline unlvClassTest.py
#
#
#
#
#
#
#
#-------------------------------------------------------------
#test python script
import subprocess
import re
import libxml2dom
import urllib
import urllib2
import sys, string
import time
import os
import os.path
from hashlib import sha1
from libxml2dom import Node
from libxml2dom import NodeList
import hashlib
import pycurl
import StringIO
import uuid
import simplejson
from string import ascii_uppercase
#=======================================
execfile('/apps/parseapp2/ascii_strip.py')
execfile('dir_defs_inc.py')
appDir="/apps/parseapp2/"
# data output filename
datafile="unlvDept.dat"
# global var for the parent/child list json
plist={}
cname="unlv.lwp"
#----------------------------------------
# main app
#
# get the input struct, parse it, determine the level
#
cmd="echo '' > "+datafile
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
res=proc.communicate()[0].strip()
cmd="echo '' > "+cname
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
res=proc.communicate()[0].strip()
cmd='curl -vvv '
cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
cmd=cmd+'-L "http://www.lonestar.edu/class-search.htm"'
#proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
#res=proc.communicate()[0].strip()
#print res
cmd='curl -vvv '
cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
cmd=cmd+'-L "https://campus.lonestar.edu/classsearch.htm"'
#proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
#res1=proc.communicate()[0].strip()
#print res1
#initial page
cmd='curl -vvv '
cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
res2=proc.communicate()[0].strip()
#print cmd+"\n\n"
print res2
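# NOTE: debugging exit left enabled; nothing below this line runs until
# it is commented out.
sys.exit()
# res2 contains HTML, not XML text
d = libxml2dom.parseString(res2, html=1)
#-----------Form------------
# NOTE: selpath (the XPath that locates the ICSID value) is not defined
# anywhere in this posting.
sel_ = d.xpath(selpath)
#--print svpath
#--print "llllll"
#--print " select error"
sys.exit()
val=""
ndx=0
# The loop header seems to have been lost when the code was pasted;
# "a" is presumably each node matched by the XPath.
for a in sel_:
    val=a.textContent.strip()
print val
#sys.exit()
sys.exit()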
#build the 1st post
ddd=1
post=""
post="ICAJAX=1"
post=post+"&ICAPPCLSDATA="
post=post+"&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241"
post=post+"&ICActionPrompt=false"
post=post+"&ICAddCount="
post=post+"&ICAutoSave=0"
post=post+"&ICBcDomData=undefined"
post=post+"&ICChanged=-1"
post=post+"&ICElementNum=0"
post=post+"&ICFind="
post=post+"&ICFocus="
post=post+"&ICNAVTYPEDROPDOWN=0"
post=post+"&ICResubmit=0"
post=post+"&ICSID="+urllib.quote(val)
post=post+"&ICSaveWarningFilter=0"
post=post+"&ICStateNum="+str(ddd)
post=post+"&ICType=Panel"
post=post+"&ICXPos=0"
post=post+"&ICYPos=114"
post=post+"&ResponsetoDiffFrame=-1"
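# "$" is sent as %24: the curl command runs through a shell inside double
# quotes, where a raw "$chk", "$3" or "$0" would be expanded by the shell.
post=post+"&SSR_CLSRCH_WRK_SSR_OPEN_ONLY%24chk%243=N"
post=post+"&SSR_CLSRCH_WRK_SUBJECT%240=ACC"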
post=post+"&TargetFrameName=None"
cmd='curl -vvv '
cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
cmd=cmd+'-e "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&" '
cmd=cmd+'-d "'+post+'" '
cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
res3=proc.communicate()[0].strip()
print cmd+"\n"
print res3
##2nd post
ddd=ddd+1
post=""
post="ICAJAX=1"
post=post+"&ICAPPCLSDATA="
post=post+"&ICNAVTYPEDROPDOWN=0"
post=post+"&ICType=Panel"
post=post+"&ICElementNum=0"
post=post+"&ICStateNum="+str(ddd)
post=post+"&ICAction=SSR_CLSRCH_WRK_SUBJECT%240"
post=post+"&ICXPos=0"
post=post+"&ICYPos=501"
post=post+"&ResponsetoDiffFrame=-1"
post=post+"&TargetFrameName=None"
post=post+"&FacetPath=None"
post=post+"&ICSaveWarningFilter=0"
post=post+"&ICChanged=-1"
post=post+"&ICAutoSave=0"
post=post+"&ICResubmit=0"
post=post+"&ICSID="+urllib.quote(val)
post=post+"&ICActionPrompt=false"
post=post+"&ICBcDomData=undefined"
post=post+"&ICFind="
post=post+"&ICAddCount="
post=post+"&ICFocus=SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS%246"
# "$" is sent as %24 so the shell does not expand "$0", "$chk" or "$3"
post=post+"&SSR_CLSRCH_WRK_SUBJECT%240=ACC"
post=post+"&SSR_CLSRCH_WRK_SSR_OPEN_ONLY%24chk%243=N"
cmd='curl -vvv '
cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
cmd=cmd+'-e "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&" '
cmd=cmd+'-d "'+post+'" '
cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
res3=proc.communicate()[0].strip()
print cmd+"\n"
print res3+"\n\n\n\n\n"
print post
##sys.exit()
##3rd post
ddd=ddd+1
post=""
post="ICAJAX=1"
post=post+"&ICNAVTYPEDROPDOWN=0"
post=post+"&ICType=Panel"
post=post+"&ICElementNum=0"
post=post+"&ICStateNum="+str(ddd)
post=post+"&ICAction=CLASS_SRCH_WRK2_SSR_PB_CLASS_SRCH"
post=post+"&ICXPos=0"
post=post+"&ICYPos=501"
post=post+"&ResponsetoDiffFrame=-1"
post=post+"&TargetFrameName=None"
post=post+"&FacetPath=None"
post=post+"&ICFocus="
post=post+"&ICSaveWarningFilter=0"
post=post+"&ICChanged=-1"
post=post+"&ICAutoSave=0"
post=post+"&ICResubmit=0"
post=post+"&ICSID="+urllib.quote(val)
post=post+"&ICActionPrompt=false"
post=post+"&ICBcDomData=undefined"
post=post+"&ICFind="
post=post+"&ICAddCount="
post=post+"&ICAPPCLSDATA="
# "$" is sent as %24: inside the double-quoted curl argument the shell
# would otherwise expand "$chk" and "$6" before curl sees them.
post=post+"&SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS%246=J"
post=post+"&SSR_CLSRCH_WRK_MON%24chk%246=Y"
post=post+"&SSR_CLSRCH_WRK_MON%246=Y"
post=post+"&SSR_CLSRCH_WRK_TUES%24chk%246=Y"
post=post+"&SSR_CLSRCH_WRK_TUES%246=Y"
post=post+"&SSR_CLSRCH_WRK_WED%24chk%246=Y"
post=post+"&SSR_CLSRCH_WRK_WED%246=Y"
post=post+"&SSR_CLSRCH_WRK_THURS%24chk%246=Y"
post=post+"&SSR_CLSRCH_WRK_THURS%246=Y"
post=post+"&SSR_CLSRCH_WRK_FRI%24chk%246=Y"
post=post+"&SSR_CLSRCH_WRK_FRI%246=Y"
cmd='curl -vvv '
cmd=cmd+'-A "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.11) Gecko/2009061118 Fedora/3.0.11-1.fc9 Firefox/3.0.11"'
cmd=cmd+' --cookie-jar '+cname+' --cookie '+cname+' '
cmd=cmd+'-e "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL?&" '
cmd=cmd+'-d "'+post+'" '
cmd=cmd+'-L "https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL"'
proc=subprocess.Popen(cmd, shell=True,stdout=subprocess.PIPE)
res3=proc.communicate()[0].strip()
print cmd+"\n"
print res3+"\n\n\n\n\n"
print post
sys.exit()
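One thing that may matter in the script above: each post string is handed to curl inside double quotes via shell=True, so the shell gets a chance to expand anything that starts with "$" in a field name. A possible way around that, sketched under assumptions, is to let urlencode do the escaping and to pass curl its arguments as a list with no shell involved. build_post and curl_args are hypothetical helpers, and only a handful of the IC* fields are shown, not the full set the site expects.

```python
try:                                   # Python 3
    from urllib.parse import urlencode
except ImportError:                    # Python 2
    from urllib import urlencode

URL = ("https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/"
       "COMMUNITY_ACCESS.CLASS_SEARCH.GBL")

def build_post(icsid, state_num, action, extra):
    """Return an encoded body; a list of pairs keeps the field order."""
    fields = [
        ("ICAJAX", "1"),
        ("ICType", "Panel"),
        ("ICStateNum", str(state_num)),
        ("ICAction", action),
        ("ICSID", icsid),
    ]
    # urlencode escapes "$", "+", "=" and so on in both names and values
    return urlencode(fields + list(extra))

def curl_args(body, cookie_file):
    """Argument list for subprocess.Popen(args), i.e. without shell=True."""
    return ["curl", "--cookie-jar", cookie_file, "--cookie", cookie_file,
            "-e", URL + "?&", "-d", body, "-L", URL]

# "abc+123=" is a made-up ICSID value, for illustration only
body = build_post("abc+123=", 2, "SSR_CLSRCH_WRK_SUBJECT$0",
                  [("SSR_CLSRCH_WRK_SUBJECT$0", "ACC")])
print(body)
print(curl_args(body, "unlv.lwp"))
```

With the list form, subprocess.Popen(curl_args(body, cname), stdout=subprocess.PIPE) hands the body to curl exactly as built, so no quoting of the post string is needed.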
-------------------------------------------------------------------------------------
-The initial url
https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
(performs a get)
--Select the "Additional Search Criteria"
---generates the backend ajax, -- seen by the post action
------https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
post [ICAJAX=1&ICNAVTYPEDROPDOWN=0&ICType=Panel&ICElementNum=0&ICStateNum=1&ICAction=DERIVED_CLSRCH_SSR_EXPAND_COLLAPS%24149%24%241&ICXPos=0&ICYPos=191&ResponsetoDiffFrame=-1&TargetFrameName=None&FacetPath=None&ICFocus=&ICSaveWarningFilter=0&ICChanged=-1&ICAutoSave=0&ICResubmit=0&ICSID=NwBLGklapJeRFylfen15jatQIwoGcJoQa%2BaO5AyhcwU%3D&ICActionPrompt=false&ICBcDomData=undefined&ICFind=&ICAddCount=&ICAPPCLSDATA=&SSR_CLSRCH_WRK_SSR_OPEN_ONLY$chk$3=N]
--selecting ACC as the dept
--post url --- https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
post[ICAJAX=1&ICNAVTYPEDROPDOWN=0&ICType=Panel&ICElementNum=0&ICStateNum=2&ICAction=SSR_CLSRCH_WRK_SUBJECT%240&ICXPos=0&ICYPos=362&ResponsetoDiffFrame=-1&TargetFrameName=None&FacetPath=None&ICFocus=SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS%246&ICSaveWarningFilter=0&ICChanged=-1&ICAutoSave=0&ICResubmit=0&ICSID=NwBLGklapJeRFylfen15jatQIwoGcJoQa%2BaO5AyhcwU%3D&ICActionPrompt=false&ICBcDomData=undefined&ICFind=&ICAddCount=&ICAPPCLSDATA=&SSR_CLSRCH_WRK_SUBJECT$0=ACC]
-selecting the "SearchBTN"
--post URL
https://my.unlv.nevada.edu/psc/lvporprd/EMPLOYEE/HRMS/c/COMMUNITY_ACCESS.CLASS_SEARCH.GBL
post [ICAJAX=1&ICNAVTYPEDROPDOWN=0&ICType=Panel&ICElementNum=0&ICStateNum=3&ICAction=CLASS_SRCH_WRK2_SSR_PB_CLASS_SRCH&ICXPos=0&ICYPos=633&ResponsetoDiffFrame=-1&TargetFrameName=None&FacetPath=None&ICFocus=&ICSaveWarningFilter=0&ICChanged=-1&ICAutoSave=0&ICResubmit=0&ICSID=NwBLGklapJeRFylfen15jatQIwoGcJoQa%2BaO5AyhcwU%3D&ICActionPrompt=false&ICBcDomData=undefined&ICFind=&ICAddCount=&ICAPPCLSDATA=&SSR_CLSRCH_WRK_INCLUDE_CLASS_DAYS$6=J&SSR_CLSRCH_WRK_MON$chk$6=Y&SSR_CLSRCH_WRK_MON$6=Y&SSR_CLSRCH_WRK_TUES$chk$6=Y&SSR_CLSRCH_WRK_TUES$6=Y&SSR_CLSRCH_WRK_WED$chk$6=Y&SSR_CLSRCH_WRK_WED$6=Y&SSR_CLSRCH_WRK_THURS$chk$6=Y&SSR_CLSRCH_WRK_THURS$6=Y&SSR_CLSRCH_WRK_FRI$chk$6=Y&SSR_CLSRCH_WRK_FRI$6=Y]
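To check the script's posts against the captured ones, a small diff helper may save some squinting. This is only a sketch: post_diff is a hypothetical name, and the two strings are shortened excerpts of the first captured post and the first post the script builds, not the full bodies.

```python
try:                                   # Python 3
    from urllib.parse import parse_qsl
except ImportError:                    # Python 2
    from urlparse import parse_qsl

def post_diff(captured, generated):
    """Return (missing, extra, changed) field names between two bodies."""
    cap = dict(parse_qsl(captured, keep_blank_values=True))
    gen = dict(parse_qsl(generated, keep_blank_values=True))
    missing = sorted(set(cap) - set(gen))      # browser sent it, script did not
    extra = sorted(set(gen) - set(cap))        # script sent it, browser did not
    changed = sorted(k for k in set(cap) & set(gen) if cap[k] != gen[k])
    return missing, extra, changed

# Shortened excerpts, for illustration only
captured = "ICAJAX=1&ICStateNum=1&ICYPos=191&FacetPath=None&ICFocus="
generated = "ICAJAX=1&ICStateNum=1&ICYPos=114&ICFocus=&SSR_CLSRCH_WRK_SUBJECT%240=ACC"

missing, extra, changed = post_diff(captured, generated)
print(missing)
print(extra)
print(changed)
```

Run against the full strings, this lists which fields the browser sent that the script leaves out, which the script adds, and which differ in value.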
Post by Alan Gauld
Post by bruce
Hi. I'm creating a test py app to do a quick crawl of a couple of
pages of a psoft class schedule site. Before I start asking
questions/pasting/posting code... I wanted to know if this is the kind
of thing that can/should be here..
Probably. We are targeted at beginners to Python and focus
on core language and standard library. If you are using
the standard library modules to build your app then certainly.
If you are using a third party module then we may/may not
be able to help depending on who, if anyone, within the
group is familiar with it. In that case you may be better
on the <whichever toolset you are using> forum.
Post by bruce
The real issues I'm facing aren't so much pythonic as much as probably
dealing with getting the cookies/post attributes correct. There's
ongoing jscript on the site, but I'm hopeful/confident :) that if the
cookies/post is correct, then the target page can be fetched..
Post sample code, any errors you get and as specific a
description of the issue as you can.
Include OS and Python versions.
Use plain text not HTML to preserve code formatting.
If it turns out to be way off topic we'll tell you (politely)
where you should go for help.
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
http://www.flickr.com/photos/alangauldphotos