Type: Package
Title: Tools for Reading Formatted Access Log Files
Version: 0.4.0
Date: 2016-01-23
Author: Oliver Keyes
Maintainer: Oliver Keyes <ironholds@gmail.com>
Description: R is used by a vast array of people for a vast array of purposes - including web analytics. This package contains functions for consuming and munging various common forms of request log, including the Common and Combined Web Log formats and various Amazon access logs.
License: MIT + file LICENSE
BugReports: https://github.com/Ironholds/webreadr/issues
URL: https://github.com/Ironholds/webreadr
Suggests: iptools, urltools, rgeolocate, knitr, testthat
LinkingTo: Rcpp
Imports: Rcpp, readr
VignetteBuilder: knitr
RoxygenNote: 5.0.1
NeedsCompilation: yes
Packaged: 2016-01-23 06:36:52 UTC; ironholds
Repository: CRAN
Date/Publication: 2016-01-23 23:19:32

read Amazon CloudFront access logs

Description

Amazon CloudFront writes access logs in a standard format described on Amazon's website. read_aws reads these files in; thanks to the way Amazon handles header lines, it can automatically detect whether a file lacks common fields and compensate for that. See "Details".

Usage

read_aws(file)

Arguments

file

the full path to the AWS file you want to read.

Details

Amazon CloudFront uses tab-separated files with Amazon-specific fields. Individual CloudFront users can change this format to exclude particular fields, however, and historically it contained fewer fields than it does now. Luckily, Amazon's insistence on standardised field names means that read_aws can detect whether fields are missing and compensate for that before reading in the file.
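
As a rough illustration of the header-based detection described above, the following base R sketch pulls field names from a "#Fields:" header line and uses them as column names. It is not webreadr's implementation: detect_fields is a hypothetical helper, the sample lines are invented, and the field list is cut short for brevity.

```r
# Hypothetical sample: two header lines followed by one tab-separated record.
# The field list is shortened for illustration; real files carry more fields.
sample_lines <- c(
  "#Version: 1.0",
  "#Fields: date time x-edge-location sc-bytes c-ip cs-method",
  "2014-05-23\t01:13:11\tFRA2\t182\t192.0.2.10\tGET"
)

# detect_fields() is a hypothetical helper, not part of webreadr.
detect_fields <- function(lines) {
  header <- grep("^#Fields:", lines, value = TRUE)
  if (length(header) == 0) {
    stop("no '#Fields:' header line found")
  }
  # Drop the "#Fields:" prefix, then split the remainder on whitespace.
  fields <- sub("^#Fields:\\s*", "", header[1])
  strsplit(trimws(fields), "\\s+")[[1]]
}

fields <- detect_fields(sample_lines)

# Read only the non-comment lines, naming columns from the detected fields.
# make.names() turns names such as "sc-bytes" into the valid name "sc.bytes".
records <- read.delim(
  text = sample_lines[!grepl("^#", sample_lines)],
  header = FALSE,
  col.names = make.names(fields),
  stringsAsFactors = FALSE
)
```

Because the column names come from the file itself, a file that omits some fields would simply yield fewer columns in this sketch, rather than misaligned ones.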

If no fields are missing, the fields returned will be:

See Also

read_s3, for Amazon S3 files, read_clf for the Common Log Format, read_squid and read_combined.

Examples

#Read in an example CloudFront file provided with the webreadr package.
data <- read_aws(system.file("extdata/log.aws", package = "webreadr"))

read CLF-formatted logs

Description

Read a file of request logs stored in the Common Log Format.

Usage

read_clf(file, has_header = FALSE)

Arguments

file

the full path to the CLF-formatted file you want to read.

has_header

whether or not the file has a header row. Set to FALSE by default.

Details

The CLF is a standardised format for web request logs. It consists of the fields:

While outdated as a standard, systems using the CLF are still around; the Squid caching system, for example, uses the CLF as one of its default log formats (the other, the Squid "native" format, can be read with read_squid).
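
To make the shape of a CLF entry concrete, here is a minimal base R sketch that breaks one line into seven pieces with a regular expression. The sample line and the column names are assumptions chosen for illustration; they may not match the names read_clf returns, and this is not how read_clf is implemented.

```r
# A sample line in the Common Log Format (illustrative, not from the package).
line <- '192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'

# Seven capture groups: host, ident, user, timestamp, request, status, bytes.
pattern <- '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" (\\S+) (\\S+)$'

# regexec() returns the full match first, followed by the seven groups.
parts <- regmatches(line, regexec(pattern, line))[[1]]

# Column names here are illustrative assumptions.
entry <- data.frame(
  ip_address  = parts[2],
  ident       = parts[3],
  user        = parts[4],
  # "%b" (abbreviated month name) is locale-dependent; this assumes an
  # English locale. "%z" reads the numeric UTC offset.
  timestamp   = as.POSIXct(strptime(parts[5], "%d/%b/%Y:%H:%M:%S %z", tz = "UTC")),
  request     = parts[6],
  status_code = as.integer(parts[7]),
  bytes_sent  = as.integer(parts[8]),
  stringsAsFactors = FALSE
)
```

Converting the timestamp to a single time zone, as done here with UTC, is one way to get the kind of normalised timestamps mentioned under "Value".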

Value

a data.frame consisting of seven fields, as discussed above, with normalised timestamps.

See Also

read_combined for the /Combined/ Log Format, and split_clf for splitting out the "requests" field.

Examples

#Read in an example CLF-formatted file provided with the webreadr package.
data <- read_clf(system.file("extdata/log.clf", package = "webreadr"))

read Combined Log Format files

Description

Read request logs following the Combined Log Format.

Usage

read_combined(file, has_header = FALSE)

Arguments

file

the full path to the Combined Log Format file you want to read.

has_header

whether or not the file has a header row. Set to FALSE by default.

Details

The Combined Log Format (CLF) extends the Common Log Format (also CLF, because software engineers and naming go together like chalk and cheese), which is documented at read_clf. In addition to the fields described there, the Combined Log Format also includes:

read_combined handles these fields, as well as the CLF-standard ones. This is (amongst other things) the default logging format for nginx servers.
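
As a hedged illustration, the Combined Log Format is conventionally the Common Log Format followed by two extra quoted fields, the referer and the user agent. The base R sketch below pulls those two fields out of an invented sample line; the variable names are assumptions, and this is not read_combined's implementation.

```r
# A sample Combined Log Format line (illustrative, not from the package).
line <- paste(
  '192.0.2.1 - frank [10/Oct/2000:13:55:36 -0700]',
  '"GET /apache_pb.gif HTTP/1.0" 200 2326',
  '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)

# The seven CLF groups, then two further quoted groups.
pattern <- paste0(
  '^(\\S+) (\\S+) (\\S+) \\[([^]]+)\\] "([^"]*)" (\\S+) (\\S+)',
  ' "([^"]*)" "([^"]*)"$'
)

# regexec() returns the full match first, followed by the nine groups.
parts <- regmatches(line, regexec(pattern, line))[[1]]

referer    <- parts[9]   # eighth capture group
user_agent <- parts[10]  # ninth capture group
```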

See Also

read_clf for the /Common/ Log Format, and split_clf for splitting out the "requests" field.

Examples

#Read in an example Combined-formatted file provided with the webreadr package.
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))

Read Amazon S3 Access Logs

Description

read_s3 provides a reader for access logs from Amazon's S3 service.

Usage

read_s3(file)

Arguments

file

the full path to the S3 file you want to read.

Details

S3 access logs contain information about requests to S3 buckets, and follow a standard format described in Amazon's S3 documentation.

The fields for S3 files are:

See Also

read_aws for reading Amazon CloudFront access log files, and split_clf, which works well on the uri field from S3 files.

Examples

# Using the inbuilt testing dataset
s3_data <- read_s3(system.file("extdata/s3.log", package = "webreadr"))


read Squid files

Description

The Squid default log formats are either the CLF (for which, use read_clf) or the "native" Squid format, which is described in more detail below. read_squid allows you to read the latter.

Usage

read_squid(file, has_header = FALSE)

Arguments

file

the full path to the Squid-formatted file you want to read.

has_header

whether or not the file has a header row. Set to FALSE by default.

Details

The log format for Squid servers can be custom-set, but by default follows one of two patterns; it's either the Common Log Format (CLF), which you can read in with read_clf, or the "native log format", a Squid-specific format handled by this function. It consists of the fields:

See Also

read_clf for the Common Log Format (also used by Squid), and split_squid for splitting the "status_code" field into its component parts.

Examples

#Read in an example Squid file provided with the webreadr package.
data <- read_squid(system.file("extdata/log.squid", package = "webreadr"))

split requests from a CLF-formatted file

Description

CLF (Combined/Common Log Format) files store the HTTP method, protocol and asset requested in the same field. split_clf takes this field as a vector and returns a data.frame containing these elements in distinct columns. The function also works nicely with the uri field from Amazon S3 files (see read_s3).

Usage

split_clf(requests)

Arguments

requests

the "request" field from a CLF-formatted file, read in with read_clf or read_combined.

Value

a data.frame of three columns - "method", "asset" and "protocol" - representing, respectively, the HTTP method used ("GET"), the asset requested ("/favicon.ico") and the protocol used ("HTTP/1.0"). In cases where the request is not intact (containing, for example, just the protocol or just the asset) a row of empty strings will currently be returned. In the future, this will be somewhat improved.
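
The behaviour described above can be approximated in base R as follows. This is only a sketch under stated assumptions: split_request is a hypothetical stand-in, not webreadr's implementation, and it treats any request that does not split into exactly three space-separated parts as not intact.

```r
# split_request() is a hypothetical stand-in for illustration only.
split_request <- function(requests) {
  pieces <- strsplit(requests, " ", fixed = TRUE)
  # Keep intact requests; replace anything else with a row of empty strings.
  rows <- lapply(pieces, function(p) {
    if (length(p) == 3) p else c("", "", "")
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("method", "asset", "protocol")
  out
}

# One intact request and one containing just the protocol.
requests <- c("GET /favicon.ico HTTP/1.0", "HTTP/1.0")
parsed <- split_request(requests)
```

Returning one row per input, even for requests that are not intact, keeps the result aligned with the original data so that it can be bound back onto it column-wise.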

See Also

read_clf and read_combined for reading in these files.

Examples

# Grab CLF data and split out the request.
data <- read_combined(system.file("extdata/combined_log.clf", package = "webreadr"))
requests <- split_clf(data$request)

# An example using S3 files
s3_data <- read_s3(system.file("extdata/s3.log", package = "webreadr"))
s3_requests <- split_clf(s3_data$uri)


split the "status_code" field in a Squid-formatted dataset.

Description

The Squid data format (which can be read in with read_squid) stores the Squid response and the HTTP status code as a single field. split_squid allows you to split these into a data.frame of two distinct columns.

Usage

split_squid(status_codes)

Arguments

status_codes

a status_code column from a Squid file read in with read_squid.

Value

a data.frame of two columns - "squid_code" and "http_status" - representing, respectively, the Squid response to the request and the HTTP status of it. In cases where the status code is not intact (containing, for example, just the squid_code) a row of empty strings will currently be returned. In the future, this will be somewhat improved.
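
A base R sketch of this kind of split is shown below. It assumes the combined field looks like "TCP_MISS/200", with the Squid response and the HTTP status separated by a slash; split_status is a hypothetical helper for illustration, not webreadr's implementation.

```r
# split_status() is a hypothetical stand-in for illustration only.
split_status <- function(status_codes) {
  pieces <- strsplit(status_codes, "/", fixed = TRUE)
  # Keep intact codes; replace anything else with a row of empty strings.
  rows <- lapply(pieces, function(p) {
    if (length(p) == 2) p else c("", "")
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- c("squid_code", "http_status")
  out
}

# One intact status code and one containing just the Squid response.
status_codes <- c("TCP_MISS/200", "TCP_HIT")
parsed <- split_status(status_codes)
```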

See Also

read_squid for reading these files in, and split_clf for similar parsing of multi-field columns in Common/Combined Log Format (CLF) data.

Examples

#Read in an example Squid file provided with the webreadr package, then split out the codes
data <- read_squid(system.file("extdata/log.squid", package = "webreadr"))
statuses <- split_squid(data$status_code)


A package for reading various common forms of request log

Description

See the introductory vignette for more details!