Tuesday, March 20, 2012

SharePoint 2010 File Type Scan


I was recently asked to develop a report listing files with specific file extensions in several SharePoint 2010 web applications. We were adding some new blocked file types and wanted a report of all preexisting files of those types anywhere in those web applications.

To do this, I needed to check each item in each list in each subsite of each site collection. I also needed to consider the case where files were attached to list items and discussions.

I did this with PowerShell, using the Out-File cmdlet to create two outputs: a .csv (comma separated value) report file, which can be opened by Microsoft Excel and many other programs, and a .log file that serves as an audit trail of all the sites and libraries that were scanned.

I also wanted to make this script generic and reusable, so I made it parameter driven. The three parameters I used are:
<#
.PARAMETER Urls            : A comma delimited list of web apps or site collection Urls
.PARAMETER fileTypesPattern: A REGEX matching pattern to be used against file names
.PARAMETER scope           : A string indicating if the execution scope is web app or site collection

#>
param(
    [System.String]
    $Urls="http://neerc-win2008r2/,http://neerc.mysites.com/",
    [System.String]
    $fileTypesPattern="(.*?)\.(cgm|iges|igs|jt|wire|x_t)$",
    [System.String]
    $scope="WebApplication")

With these 3 parameters the script can be scoped at the Web Application level or the Site Collection level, it can process one or more Web Applications or Site Collections, and the file name matching pattern can be any valid REGEX pattern.
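As a quick sanity check of the default pattern, you can test it in any PowerShell console, no SharePoint required (the file names below are made up for illustration):

```powershell
# The default pattern from the param block above
$fileTypesPattern = "(.*?)\.(cgm|iges|igs|jt|wire|x_t)$"

# Hypothetical file names, just to show the matching behavior
[regex]::IsMatch("drawing.jt",  $fileTypesPattern, "IgnoreCase")  # True
[regex]::IsMatch("MODEL.X_T",   $fileTypesPattern, "IgnoreCase")  # True (case-insensitive)
[regex]::IsMatch("report.docx", $fileTypesPattern, "IgnoreCase")  # False
```

This is the same [regex]::IsMatch call with the "IgnoreCase" option that the scan function uses against attachment and file names.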

The script needs to output 2 files: one for the .csv report of files found, and one for the audit log. I didn't want to deal with overwriting files, so I used file names that incorporate a time stamp to make them unique. I also initialized the files with some header information, as follows:

#Create some timestamped file names
[DateTime]$StartTime = [DateTime]::Now
$runTime = $StartTime.ToString("yyyyMMdd-HHmmss")
$outPath = (Get-Item $myInvocation.MyCommand.Path).DirectoryName + "\"
$Report_File = ($outPath + $runTime + "_FileTypeList.csv")
$Log_File = ($outPath + $runTime + "_FileTypeList.log")

#initialize the report and log files
Out-File -inputObject "Site URL,Library,Document Title,File Url,Created By,Last Modified By" -filepath $Report_File
$Log_Row = "Started: {0}" -f $StartTime.ToString("yyyy/MM/dd HH:mm:ss")
Out-File -inputObject  $Log_Row -filepath $Log_File
$Log_Row = "Urls=[{0}]" -f $Urls 
Out-File -inputObject  $Log_Row -filepath $Log_File -append 
$Log_Row = "fileTypesPattern=[{0}]" -f $fileTypesPattern 
Out-File -inputObject  $Log_Row -filepath $Log_File -append
$Log_Row = "Scope=[{0}]" -f $scope 
Out-File -inputObject  $Log_Row -filepath $Log_File -append
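The time stamp format is a standard .NET custom format string, so you can see what the file names will look like in a plain PowerShell console (the date below is made up for the example):

```powershell
# Illustration of the timestamped file-name scheme used above;
# no SharePoint needed, and the date is purely hypothetical.
$stamp = (Get-Date "2012-03-20 13:05:00").ToString("yyyyMMdd-HHmmss")
$stamp + "_FileTypeList.csv"   # 20120320-130500_FileTypeList.csv
```

Note that in .NET format strings, uppercase MM is the month and lowercase mm is minutes, which is easy to get backwards.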

Next, we'll break the passed Urls into an array and process each based on the scope being a web application or a site collection. I put the actual scan into a function where I can pass a site collection to be scanned. So the code below loops through the passed Urls, calls my scan function for each site collection, writes some log information, and totals the file count:

#Initialize total counter
$TotalFileCount = 0
   
#Get the webapps into an array
$UrlArray = $Urls.Split(",")

#loop through each passed Url
foreach ($Url in $UrlArray)
{
    #if scope is webapplication, loop through each site collection
    if ($scope.ToLower().Contains("webapp"))
    {
        $web = Get-SPWebApplication $Url
        $Log_Row = "Scanning Web Application: {0} ({1})" -f $web.url, $web.name
        Out-File -inputObject $Log_Row -filepath $Log_File -append
        write-host
        write-host -fore GREEN $Log_Row
        #loop through each site collection
        foreach ($site in $web.Sites)
        {
            $TotalFileCount += ScanSiteCollection $site
            $site.Dispose()
        }
    }
    else
    {
        $site = Get-SPSite $Url
        $Log_Row = "Scanning Site Collection: {0}" -f $site.url
        Out-File -inputObject $Log_Row -filepath $Log_File -append
        write-host
        write-host -fore GREEN $Log_Row
        $TotalFileCount += ScanSiteCollection $site
        $site.Dispose()
    }
}

Now our function will loop through each sub site in the site collection, accessing all of its lists and its root folder.

Discussion Boards:
We'll check the base template of each list to see if it is a Discussion Board, since discussion boards can have attachments that are accessed differently from attachments to list items. We will check each discussion for attachments, and then select and report only those attachments that match our pattern.

All other lists:
We will build a files array for items which are files. Again, we will check each item for attachments, and then select and report only those attachments that match our pattern. If we have any files in our array, we will select and report only those files that match our pattern.

End Of Sub Site loop:
We will write summary data for the sub site to the logging file for audit purposes.

At the end of our function, we will write some information to the console before we return.

<#
Function Name: ScanSiteCollection
Purpose      : Scans for file names matching the file types pattern
Parameter    : SiteCollection
#>
function ScanSiteCollection ($siteCollection)
{
    $TotalCount = 0
    #loop through each sub site
    foreach ($subweb in $siteCollection.AllWebs)
    {
        $Log_Row = "{0}" -f $subweb.Url
        Out-File -inputObject $Log_Row -filepath $Log_File -append
        $libs  = @()
        $libs += $subweb.Lists        #get all the libs/lists
        $libs += $subweb.RootFolder   #Add the rootfolder
        $SubWebFileCount = 0
        for ($i=0; $i -lt $libs.Count; $i++)
        {
            try
            {
                $LibFileCount=0
                #Handle attachments to discussion board items
                if ($libs[$i].BaseTemplate -eq "DiscussionBoard")
                {
                    #Go through all discussions that have attachments
                    foreach ($disc in ($libs[$i].Folders | Where-Object {($_.Attachments.Count -gt 0)}))
                    {
                        #Get those that match our pattern
                        foreach ($att in ($disc.Attachments | Where-Object {[regex]::IsMatch($_.ToLower(), $fileTypesPattern, "IgnoreCase")}))
                        {
                            $SubWebFileCount++
                            $LibFileCount++
                            $Report_Row = "{0},{1},{2},{3},{4},{5}" -f $subweb.Url, $libs[$i].Title, $disc.Title, $att, $disc["Author"], $disc["Editor"]
                            Out-File -inputObject $Report_Row -filepath $Report_File -append
                        }
                    }
                }
                else
                {
                    if ($i -eq ($libs.Count - 1))
                    {
                        $files = $libs[$i].Files   #Handle the root folder
                    }
                    else
                    {
                        $files = @()
                        foreach ($item in $libs[$i].Items)
                        {
                            #Handle attachments to list items
                            if ($item.Attachments.Count -gt 0)
                            {
                                #If we have attached files, get those that match our pattern
                                foreach ($att in ($item.Attachments | Where-Object {[regex]::IsMatch($_.ToLower(), $fileTypesPattern, "IgnoreCase")}))
                                {
                                    $SubWebFileCount++
                                    $LibFileCount++
                                    $Report_Row = "{0},{1},{2},{3},{4},{5}" -f $subweb.Url, $libs[$i].Title, $item.Title, $att, $item["Author"], $item["Editor"]
                                    Out-File -inputObject $Report_Row -filepath $Report_File -append
                                }
                            }
                            if ($item.File)
                            {
                                $files += $item.File  #Add to files
                            }
                        }
                    }
                    #If we have files, get those that match our pattern
                    if ($files.Count -gt 0)
                    {
                        foreach ($file in ($files | Where-Object {[regex]::IsMatch($_.Url.ToLower(), $fileTypesPattern, "IgnoreCase")}))
                        {
                            $SubWebFileCount++
                            $LibFileCount++
                            $Report_Row = "{0},{1},{2},{3},{4},{5}" -f $subweb.Url, $file.DocumentLibrary, $file.Title, $file.Url, $file.Author, $file.ModifiedBy
                            Out-File -inputObject $Report_Row -filepath $Report_File -append
                        }
                    }
                }
                #Write lib file count to log file
                $Log_Row = "{0,5:0},{1},{2}" -f $LibFileCount, $libs[$i].Title, $libs[$i].BaseTemplate
                Out-File -inputObject $Log_Row -filepath $Log_File -append
            }
            catch [System.Management.Automation.GetValueInvocationException], [System.Management.Automation.MethodInvocationException]
            {
                #Write some error information to the log file and continue
                $Log_Row = "InvocationException,{0},{1}" -f $libs[$i].Title, $libs[$i].BaseTemplate
                Out-File -inputObject $Log_Row -filepath $Log_File -append
                continue
            }
            catch [System.Management.Automation.ExtendedTypeSystemException]
            {
                #Write some error information to the log file and continue
                $Log_Row = "ExtendedTypeSystemException,{0},{1}" -f $libs[$i].Title, $libs[$i].BaseTemplate
                Out-File -inputObject $Log_Row -filepath $Log_File -append
                continue
            }
        }
        #Subweb totals
        $TotalCount += $SubWebFileCount
        if ($SubWebFileCount -gt 0)
        {
            $Msg = "{0,5:0},{1}" -f $SubWebFileCount, $subweb.Url
            write-host -fore GREEN $Msg
        }
    }
    return $TotalCount
}

Warning: This script could run for a long time, depending on the number of site collections, sub sites, libraries, and documents it needs to scan. 

If you only need to search document libraries, you could use an SPSiteDataQuery, which might look like the following:
#Add SharePoint snapin if needed
Add-PSSnapin Microsoft.SharePoint.Powershell -ErrorAction "SilentlyContinue"

#Set the site collection to query
$sc = Get-SPSite -Identity "http://yourWebApp/sites/yourSiteCollection"

#Create a new SPSiteDataQuery object
$CAMLQuery = New-Object -typeName Microsoft.SharePoint.SPSiteDataQuery
#Set the query properties
$CAMLQuery.Lists = '<Lists BaseType="1" />'
$CAMLQuery.Query = '<Where><Or><Or><Or><Contains><FieldRef Name="FileRef" /><Value Type="Text">.cgm</Value></Contains><Contains><FieldRef Name="FileRef" /><Value Type="Text">.iges</Value></Contains></Or><Or><Contains><FieldRef Name="FileRef" /><Value Type="Text">.igs</Value></Contains><Contains><FieldRef Name="FileRef" /><Value Type="Text">.jt</Value></Contains></Or></Or><Or><Contains><FieldRef Name="FileRef" /><Value Type="Text">.wire</Value></Contains><Contains><FieldRef Name="FileRef" /><Value Type="Text">.x_t</Value></Contains></Or></Or></Where>'
$CAMLQuery.ViewFields = '<FieldRef Name="FileRef" /><FieldRef Name="Title" /><FieldRef Name="Author" /><FieldRef Name="Editor" />'
$CAMLQuery.Webs = '<Webs Scope="SiteCollection" />'

#Execute the query; the results come back as a System.Data.DataTable
$fileTable = $sc.RootWeb.GetSiteData($CAMLQuery)
#Write the data to the console
foreach ($f in $fileTable.Rows)
{
    $csv = "{0},{1},{2},{3}" -f $f["FileRef"], $f["Title"], $f["Author"], $f["Editor"]
    write-host -fore GREEN $csv
}
$sc.Dispose()

This is my first technical blog post; constructive feedback is appreciated!