Exam Calendar Parser

Zane Nicholson, Steve Sekowski, Xuefei Ning, Fanshu Jiang

ABSTRACT

For many college students with a busy schedule and lots of courses, planning studies and other activities around exam dates is one of the best ways to organize college life. Registered student organizations in the university usually make efforts to find out their students' exam calendar in order to plan events around the schedule of their members. At the same time, businesses around the campus area often offer special deals around times when they know that lots of students will be busy studying. We introduce Exam Calendar Parser, a web calendar of exam dates which addresses these needs by providing a user-friendly interface to look up exam dates. We designed a system to crawl the course websites, parse exam information based on crawler findings, and finally present the data as a web-based calendar. The system is currently in the test stage; we show how our system can be used to solve the problem of finding campus-wide exam schedules by targeting the exam information from the University of Illinois at Urbana-Champaign.

Keywords

Search Engine, Exam Schedule, Resource Discovery, Parser system

1. INTRODUCTION

The average workload for a college student is 15 credits' worth of classes per semester, which normally includes four or five semester-long classes. For each course, the exams are usually planned as two or three midterms with one final exam, final paper, or final presentation. The average number of exams that a student needs to take care of within one semester is therefore around fifteen. In order to get the exam schedule and location information, a student needs to check each course's website, write down the times accordingly, and finally order them based on dates or courses. A better solution for planning exams is to use Google Calendar, Outlook Calendar, or other task planning applications. However, though these applications can help organize dates, one still needs to type in all the information and look through every single event to find an exam and prepare for it in advance.

Registered student organizations and businesses face similar or worse problems. The college's student organizations regularly provide study support, food, tutors, or relaxing events according to the exam dates and thus help their members prepare for the exams. Hence, they need to find out many course exam schedules, which is a time-consuming task. For businesses, it is important to know the various courses' exam schedules in order to plan special deals for study supplies, to prepare enough food during exam-preparation days, and even to advertise new products.

Therefore, our Exam Calendar Parser system will benefit students, organizations, and businesses. Users can find exams by month, week, day, or course. The web calendar clearly displays the summary information, and the user will be able to click on the exam dates to read details, such as locations and reminders. Our team has built upon an existing open source web crawler which crawls uiuc.edu and engineering wiki course pages. We created the parser to take lists of URLs and extract exam dates from the HTML content of those pages. Finally, the calendar site takes a list of exam information as JSON and publicly displays that information.

2. RELATED WORK

As described previously, the web crawler of our system was built upon an existing open source web crawler named crawler4j (http://code.google.com/p/crawler4j/). The crawler provides a function which decides whether a given URL should be crawled, as well as a method to get the URLs of the identified and downloaded pages. We edited those functions to meet our own requirements, such as finding the pages with the keywords "exam," "midterm," and "presentation," and then configured the crawler to work within the web domain of the University of Illinois at Urbana-Champaign.

There are similar web calendars for different types of events. For instance, the website http://food-bot.com/home is designed to help college students find free food events on campus, and it currently covers twenty universities. The website https://www.scheedule.com/, which was also created by UIUC students, is designed to help college students plan class and social event schedules. Our system uses similar crawl and display technology, yet has different functions. The Exam Calendar Parser system is designed to help not only students, but also RSOs and businesses to plan appropriate events. It is useful for almost all kinds of college-related people and organizations.

3. PROBLEM DEFINITION

The problem we solved for the crawler was creating a function to find the appropriate website pages, i.e. the course pages which contain keywords such as "exam," "midterm," "presentation," etc. As described previously, the existing open source crawler has the function of downloading and copying URL information, and thus we edited that function to produce the list format for the parser system.

The date parser portion of the system was required to take the list of URLs that the crawler outputs and use the HTML content of those pages to find the exam information, to be output in a format readable by the calendar site. The algorithm needed to be robust so that it could detect multiple different ways of listing dates or exams. Then, once it finds and matches all the exams with their respective dates, it needs to write the information to a file using the JSON format that the calendar takes.

The calendar page needed to take a formatted list of exam information and display it in an easily readable format. In addition, it needed to provide the user with the option to view exams by month, week, and day in order to give as complete a picture as possible of the exam loads at various times.

4. METHODS

We chose to use a crawler because it is the most efficient way to access a potentially large and branching website such as a university course information page. At first we considered saving the entire HTML content of each valid page the crawler found, but this was found to require too much storage space and to be generally inefficient. Instead, the crawler writes only the URL of each valid page. This method of output also allows for a speedup in the crawler itself, as it is only writing URLs to the output document, and not entire documents. However, one disadvantage of this implementation is that if a page goes down in the time between being crawled and being parsed, any exam information it contains will not be included in the final calendar.

We decided to write the parser part of the system in Python since it provides a wide range of available plugins to aid in development. The parser works by first fetching the HTML content of each URL and making some minor formatting adjustments to each page in order to make the actual parsing process smoother. It then scans each page and marks the dates and exams, which may be in a variety of different formats. Then it matches each exam with the date that most likely corresponds to it. The most difficult part of writing the parser was getting a correct match of exam and date when the information can be displayed in many formats. To solve this, the parser concatenates table rows into one line and considers a match to be much more likely if both the exam and the date are on the same line in the formatted document. This solution was chosen because it showed very high accuracy and because the slowdown that it caused was minor.

To visualize all exams, the most straightforward way is a calendar that shows exams in chronological order. After trying several calendar plugins like "EvenTouch Calendar" and "MagiCalendar", we decided to use "FullCalendar", a jQuery plugin that provides a full-sized calendar with day, week, and month views. This plugin has a lot of advantages over other plugins, mainly because it uses AJAX to fetch events for each month and it can be easily configured to use different feed formats, like a JSON feed or a Google Calendar feed. Our exam data is formatted using JSON that contains properties recognized by FullCalendar, specifically: a unique id, a string title, and a start date and an end date, both encoded in ISO 8601 format. The most challenging part of integrating the calendar plugin was figuring out how to configure the plugin so that it could read and display the JSON feed correctly.

5. EVALUATION/SAMPLE RESULTS

The web crawler solution works well enough, as it collects information relevant to our goal for the project. We have tested the crawler with other inputs (e.g. the default "uci.edu" seeds provided by the developers) and it seems to work well. Sample output can be viewed in the Java project folder in output1.txt. Additional challenges that may have to be solved include determining which URLs are most relevant (perhaps a ranking function could help with this), and making the existing crawler more efficient and intelligent as to which URLs should be crawled (currently, a full run of the crawler takes about 20 minutes).

Figure 1. Sample of the Crawl Controller

The HTML parser part of the system can match dates to events with a high degree of accuracy. The main problem it has is in differentiating real exam dates from things like exam review sessions, which can have a wide range of possible names depending on the professor. The date matching was tested on a small number of sample class pages and found to work almost 100% of the time. This is obviously a small set of data, so in order to get a better reading on the date matching, a larger test set would be required. The problem with false exam findings could possibly be solved with the addition of a natural language processing component that would identify whether the event in question is an actual exam or merely some sort of review session or general information about future exams.

The calendar is able to show our exam data crawled from websites correctly, with the right start time and end time for each exam. More details can be viewed with the day view and the week view. However, we found that when there are more than three exams on a day, the cell for that day has to extend to show all the exams, which means the total height of the calendar changes because of that cell. Another problem due to the size of the cell space is that we could only display the start time and the title of an exam in the month view, because the cell cannot hold long content; otherwise the content wraps to the next row and extends the height of the calendar.

Figure 2. Display the exam dates by month.

Figure 3. Display the exam date by week.

6. CONCLUSION & FUTURE WORK

We will want to improve the date parsing to increase accuracy and reduce the number of false results that we get. We also want to find ways to make the parsing more efficient, as with large or many pages the process could slow significantly. Implementing some sort of MapReduce functionality could help with this by allowing the date and exam matching to be parallelized.

So far, our calendar is only for display and the data we used to fill in the calendar is static. For future improvement, some user-triggered events could be integrated, for example, creating a user account, adding notes to an exam as a reminder, clicking the exam to go to the course webpage, etc. It might also be helpful to add a search bar at the top of the webpage that allows users to search for a specific exam or a specific date. We can also add a "search by courses" feature and thus allow users to access all their exams at one time. It would also be a great idea to allow chatting or posting on the platform; users could find out their friends' busy days, plan activities, and give out small surprise gifts accordingly.

Further improvement could also include extending the exam crawling/parsing to more colleges, and implementing a component that could crawl/parse courses that have more restrictive access (e.g. courses based mainly on Compass). Another possible avenue for future improvement would be integration with Facebook or Google Calendar. Facebook integration could allow people in the same groups to see when other members have exams, or which of their friends are in the same classes as them.

In summary, we've implemented an interface that displays exam dates on a calendar. This interface could significantly help students to plan their academic lives, and help the families and friends of students to plan around these academic events. The interface could also help the school itself as well as businesses to plan events more effectively, due to the improved ease of finding periods that have a high or low number of exams scheduled.

7. APPENDIX

Xuefei Ning: Read through the whole project of the open source crawler4j; implemented the Exam Calendar Parser's crawler based on the "basic sample" of crawler4j.

Fanshu Jiang: Designed and implemented the interface for the exam calendar.

Zane Nicholson: Wrote the exam parser part of the system and coordinated team members.

Steve Sekowski: Implemented the open source web crawler to crawl relevant UIUC course websites.

8. REFERENCES

https://code.google.com/p/crawler4j/

https://www.scheedule.com/

http://food-bot.com/home
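
As an illustration of the same-line matching heuristic described in Section 4, the following is a minimal Python sketch, not the project's actual parser. It assumes dates appear only as MM/DD/YYYY and that exams are named "midterm", "exam", or "final exam"; the function names (flatten_rows, match_exams, to_feed), the regular expressions, and the choice of all-day events are hypothetical. The output uses the id, title, start, and end properties that the paper lists for the calendar feed.

```python
import json
import re
from datetime import date

# Hypothetical patterns; the paper does not list the formats the real parser handles.
NUM = r"(?:\s+\d+(?![/\d]))?"  # optional exam number, not the start of a date
EXAM_RE = re.compile(rf"\b(midterm{NUM}|final\s+exam|exam{NUM})\b", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # MM/DD/YYYY only


def flatten_rows(html):
    """Put each table row (or paragraph) on one line and strip the tags."""
    html = re.sub(r"\s+", " ", html)  # drop the page's own line breaks
    html = re.sub(r"</tr\s*>|<br\s*/?>|</p\s*>", "\n", html, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", html)  # remaining tags become spaces
    lines = (re.sub(r" +", " ", line).strip() for line in text.split("\n"))
    return [line for line in lines if line]


def match_exams(lines):
    """Pair an exam with a date only when both occur on the same line."""
    events = []
    for line in lines:
        exam = EXAM_RE.search(line)
        found = DATE_RE.search(line)
        if not (exam and found):
            continue
        month, day, year = (int(g) for g in found.groups())
        try:
            when = date(year, month, day)
        except ValueError:
            continue  # e.g. 13/45/2013 is not a real date
        events.append({
            "id": len(events) + 1,
            "title": " ".join(exam.group(1).split()),
            "start": when.isoformat(),  # ISO 8601; times are not parsed here
            "end": when.isoformat(),
        })
    return events


def to_feed(events):
    """Serialize the events as a JSON list for the calendar page."""
    return json.dumps(events, indent=2)


if __name__ == "__main__":
    sample = (
        "<table><tr><td>Midterm 1</td>\n<td>10/02/2013</td></tr>"
        "<tr><td>Office hours</td><td>10/03/2013</td></tr></table>"
    )
    print(to_feed(match_exams(flatten_rows(sample))))
```

Requiring the exam name and the date to share a line keeps the sketch simple, but it drops exams whose date sits in a neighbouring paragraph; a fuller matcher would score nearby lines rather than discard them.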