
time limits is the second most used tool, as reported by 64% of respondents.

Currently 27% of Archive-It partners run some crawls that capture only PDFs, and we expect this percentage to increase as PDFs become more prevalent on the web and, increasingly, the only record available. The Archive-It service is researching adding this capability for other types of file formats. As social media sites become an increasingly vital component of collecting activities, the service is exploring ways to provide capture and access solutions for social media, primarily Facebook, Twitter and YouTube, as of December 2012.

As mentioned above, the scoping process can be quite technical. The complexities involved in effective crawl scoping were a surprise to the team at the University of Alberta. They have found that they need to readjust their policies as they crawl, sometimes adapting to the kind of data they can actually collect (personal correspondence and conversation with Geoff Harder, 2012). Similarly, Creighton has found that scoping a crawl involves some extra work; David Crawford finds that he often needs to educate people on campus about the web space, and he tries to work with web programmers to request that they consider crawling needs when making changes to sites in the future (conversation with David Crawford, July 2012).

3c. Data Capture

Once institutions have chosen which websites to capture and how to do so, they put their plans into action in the data capture phase of the process. Here they deal with the nuts and bolts of the crawling software: they determine the frequency and timing of their crawls and when to cut off long crawls, and then they set their crawls to begin. The Archive-It application includes features that allow partners to adjust the frequency and duration settings in the open-source web crawler (Heritrix).

Scheduling crawls for ongoing and reiterative data capture is an area where institutions using Archive-It exercise a great deal of control over their crawls. Data gathered in 2011 showed that 78% of all Archive-It partners use more than one crawl frequency. In other words, they do not crawl all of their sites at one interval; they use different schedules for different collections and websites. At the time the data was collected, the most popular crawl frequencies were one-time, monthly and quarterly.

Given how diverse websites are in terms of their structure and construction, the data capture step of web archiving can produce a number of surprises. For example, a site can be much bigger than anticipated and therefore exhaust storage resources. Similarly, there are ways for webmasters to keep their sites from being archived, which can require technological intervention or negotiation between the parties involved. For example, David Crawford from Creighton University experienced issues archiving
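One common technical barrier of the kind just described is a site's robots.txt file, which instructs crawlers to stay away from some or all of its pages. The short Python sketch below shows how such a rule blocks a fetch; the crawler name and URLs are placeholders for illustration, not Archive-It's actual user agent or any partner's seeds.

```python
from urllib.robotparser import RobotFileParser

# A webmaster can disallow crawlers via robots.txt; this models the check
# a polite crawler performs before fetching. The rules and the user agent
# "archive-it-test" are hypothetical examples.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

for url in ("https://example.org/index.html",
            "https://example.org/private/minutes.pdf"):
    allowed = robots.can_fetch("archive-it-test", url)
    print(url, "->", "fetch" if allowed else "blocked by robots.txt")
```

When a seed site blocks crawlers this way, the options are essentially the two named above: a technical workaround or a negotiation with the site's owner to loosen the rules.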


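The PDF-only crawls mentioned earlier amount to a scope rule that admits only URLs pointing at PDF files. Heritrix expresses scoping through its own configuration; the standalone filter below is only a minimal sketch of the idea, assuming a simple extension-based test.

```python
import re

# Accept a discovered URL only if it appears to point at a PDF document.
# Matching on the file extension is a simplification; a real crawler can
# also inspect the Content-Type returned by the server.
PDF_URL = re.compile(r"\.pdf(\?.*)?$", re.IGNORECASE)

def in_scope(url: str) -> bool:
    """Return True if the URL should be captured in a PDF-only crawl."""
    return bool(PDF_URL.search(url))

discovered = [
    "https://example.org/reports/annual-2012.pdf",
    "https://example.org/reports/index.html",
    "https://example.org/minutes/2012-06.PDF?download=1",
]
for url in discovered:
    print(url, "->", "capture" if in_scope(url) else "skip")
```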


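Finally, the frequency and duration settings described in this section can be thought of as a per-collection schedule plus a cut-off for long crawls. The sketch below uses hypothetical collection names and intervals to model that logic; it is not the Archive-It application's interface.

```python
from datetime import datetime, timedelta

# Hypothetical per-collection crawl intervals, mirroring the popular
# frequencies reported above (one-time, monthly, quarterly).
FREQUENCIES = {
    "university-sites": timedelta(days=30),   # monthly
    "state-agencies": timedelta(days=91),     # quarterly
    "election-2012": None,                    # one-time crawl, never repeated
}
MAX_CRAWL_DURATION = timedelta(days=3)        # cut-off for long-running crawls

def next_crawl(collection: str, last_run: datetime) -> datetime | None:
    """Return when a collection should next be crawled; None for one-time crawls."""
    interval = FREQUENCIES[collection]
    return last_run + interval if interval else None

last = datetime(2012, 7, 1)
for name in FREQUENCIES:
    print(name, "->", next_crawl(name, last))
```

Keeping different schedules for different collections, as 78% of partners do, lets fast-changing sites be captured often while stable ones are crawled only a few times a year.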





























































































