Saturday, 07 January 2017

Tools and methods of Lean Manufacturing – a literature review

Abstract:
This article presents an overview of the methods and tools of Lean Manufacturing that enterprises use to improve their production processes. The article aims to introduce the reader both to the most commonly used tools and to less familiar ones. The description of each method covers its assumptions, main objectives, and expected results. The tools described include VSM, 5S, SMED, Jidoka, standardized work, Poka-Yoke, Heijunka, TPM, Hoshin Kanri, Kamishibai, Kanban, and the Kaizen philosophy.

1.      Introduction
A company's situation depends to a large extent on its ability to respond rapidly to changing customer requirements. Producing goods exactly on time, in the desired quantity and quality, and at the lowest competitive price has become the standard. All of these activities must, of course, also generate a profit for the company.
To achieve its objectives, and above all to gain a competitive advantage in the market, an enterprise must pay special attention to reducing production costs. As subsequent operations are performed, the value of the manufactured product arises, creating a value stream. It is important that the resulting value, reflected in the price of the product, is acceptable to customers. This is why growing importance is attached to the improvement of production processes. Improvement means identifying and eliminating the losses that occur in production.
A concept that enables the improvement of production processes is Lean Manufacturing (LM). It assumes the elimination of all waste occurring in production (Japanese: muda), which shortens the time the material takes to pass through the process (lead time). Lean Manufacturing is derived from the Toyota Production System (TPS), created by the Japanese engineers Sakichi Toyoda, Kiichirō Toyoda, and Taiichi Ohno. To achieve their goals, manufacturing companies use a variety of Lean Manufacturing tools and methods, including SMED (Single Minute Exchange of Die), TPM (Total Productive Maintenance), 5S, Poka-Yoke, and others.

2.      Types of waste
The essence of Lean Manufacturing is the elimination of all waste occurring in the enterprise. This shortens the time between receiving an order and shipping the finished goods to the customer, increases productivity, and reduces manufacturing costs. In his work on Lean Manufacturing, Taiichi Ohno listed seven types of waste: overproduction, inventory, mistakes and quality defects, waiting, over-processing, unnecessary transport, and unnecessary movement. Today an eighth type is commonly added: untapped employee potential.
Overproduction means producing goods in advance and in greater quantities than the customer requires. It is considered the most dangerous type of waste, because it translates into significant costs (e.g., storage) and gives rise to other waste. Inventory means keeping more materials, raw materials, work in progress, and finished products than the required minimum; it is often a result of overproduction, can lead to damage or destruction of products, and generates significant transport and storage costs. Mistakes and quality defects refer to work that is not completed with a positive result. Waiting is time lost waiting for people, material, information, or tools; it adds no value to the manufacturing process. Over-processing refers to steps that are unnecessary from the point of view of the value added but are nevertheless performed to produce the product; this waste also includes taking more time than necessary to fulfill customer demand and using sophisticated and costly technologies without justification. Excess transport is the unnecessary movement of materials, semi-finished goods, or finished products within the company; it increases production costs and the risk of destroying or damaging the product. Superfluous movement is physical movement of an employee that adds no value, most often resulting from inadequate organization of the work. The last type of waste is the untapped potential of employees, meaning the ignoring or underuse of employees' ideas, competence, talent, and time.
  
3.      Methods of Lean Manufacturing
This review presents the following Lean Manufacturing tools and methods: VSM, 5S, SMED, Kanban, Jidoka, Hoshin Kanri, Heijunka, Standardized Work, Poka-Yoke, Kamishibai, and the Kaizen philosophy.

VSM - Value Stream Mapping
A tool widely used in enterprises is VSM – Value Stream Mapping. VSM is a graphical way of presenting the material and information flow in the production system. The map shows all the tasks undertaken in the process, from the purchase of raw materials to the delivery of finished products to the customer. The analysis allows all kinds of waste to be identified and guides further action to eliminate them.

5S method
Another method used to improve production processes is 5S, which is the foundation for implementing Lean Manufacturing. The method's name is derived from the first letters of the Japanese words Seiri, Seiton, Seiso, Seiketsu, and Shitsuke, which are also the names of its five stages of work organization:
1.      Seiri – sorting, selection – eliminating from the workstation all items that are unnecessary for the job. This step is carried out primarily to decrease inventory and make better use of working space. In accordance with the selection principle, all unnecessary items should be marked with a red label and placed in a designated area.
2.      Seiton – systematic arrangement – designating and selecting a suitable place for all the tools kept at the workstation after the selection stage. Shadow boards or color coding of individual tools can help here. This step is performed to reduce the unnecessary movement an employee makes when searching for tools and to eliminate quality errors caused by mistaking improperly marked items.
3.      Seiso – cleaning – cleaning and maintaining the workplace and setting the standard of proper cleaning. The stage aims to keep workstations in good condition, to identify and eliminate the causes of contamination, and to care for the machines.
4.      Seiketsu – standardization – defining the rules for the first three stages of 5S. In this stage the responsibilities of employees are defined and instructions supporting the execution of the previous steps are created. The stage ensures a systematic procedure and the repeatability of the previously introduced changes.
5.      Shitsuke – discipline – instilling in employees the habit of complying with the previously introduced changes and acting in accordance with the standards. It is a difficult and long stage, because it requires changing the habits of both production workers and management.
The 5S method does not require large financial investment; it allows workstations to be created and maintained in order and cleanliness and shapes the proper organization of the working environment. It is also the first step in strengthening employees' sense of ownership of their workplace.

SMED – Single Minute Exchange of Die
SMED is a method that allows changeover time to be shortened to single-digit minutes. Shigeo Shingo, the method's developer, identified four stages of improving the equipment changeover process:
-      analysis of the current state of the workstation,
-      separation of changeover operations into internal and external,
-      transformation of internal operations into external ones,
-      improvement of all aspects of the changeover.
The action bringing the greatest effect in minimizing changeover time is transforming internal operations into external ones. Internal operations are those performed during machine downtime; external operations are those performed before and after the stoppage. The more steps that can be moved outside the changeover stoppage, the more time can be spent on production.

Standardized work
Standardized work is a Lean Manufacturing tool used to improve work and increase the stability of production processes. Standardization means performing operations or tasks uniformly by all operators. Standardized work captures the current best method of performing an operation, allowing all steps to be executed in the same way, in the same order and time, at a fixed cost. Standardization also assumes the continuous development of new, better standards, so as to adapt to constantly changing customer requirements.

TPM – Total Productive Maintenance
TPM is an LM tool used to eliminate waste associated with the technological machines in the enterprise. TPM is a way of management that integrates all employees in maintaining production continuity [8]. The main objective of the method is to increase the efficiency and productivity of machinery and equipment by markedly decreasing the number of failures; reducing the time for retooling and adjusting machines; shortening downtime and idle time (frequently caused by an absent employee or by waiting for tools, material, or information); reducing product quality defects; and decreasing the time spent on production start-up.

Kanban
Kanban is a Japanese method of production control in which control is based not on a production schedule but on events occurring directly on the shop floor. The use of Kanban allows for almost total elimination of pre-production, in-process, and finished-goods stores (the stock is held at the workstation). Raw materials are delivered from suppliers with hourly precision, and thanks to reserves of production capacity and the flexibility of the production process it is possible to produce almost any product at any time. Production orders are closely synchronized with the orders received from customers [3].

Kaizen Philosophy
The Kaizen philosophy is a concept of continuous improvement that assumes a constant search for ideas to improve all areas of the organization. It requires the involvement of all the company's employees, from operators up to the highest level of management. The aim of Kaizen is to permanently replace waste with activities that add value. In practice, Kaizen comes down to collecting and implementing employees' ideas for improving the organization of work or the production process.

Jidoka
The notion of Jidoka refers to the ability of the operator to stop the production line or machine the moment a malfunction or problem appears during manufacture. Problems may relate to product quality or to delays in the manufacturing process caused by a lack of material, tools, or information. Giving equipment operators the ability to detect emerging anomalies and immediately stop the operation allows for a more efficient production process. Tools that enable the implementation of the Jidoka principles are Poka-Yoke and Andon.
 
Poka-Yoke
Poka-Yoke (Japanese: poka – inadvertent mistake, yokeru – to avoid) is a method of preventing errors that arise from mistakes. The main principle of Poka-Yoke is that processes, not employees, are to blame for errors. A Poka-Yoke solution is designed to make errors in the process impossible. With Poka-Yoke it is also possible to reduce the time required to train employees, to eliminate many quality-control operations (or quality control entirely), to reduce the number of defects, and to achieve 100% control of the process. An example of a Poka-Yoke solution is the SIM card, which can be inserted into a phone in only one way thanks to its angled corner.

Heijunka
Heijunka, or production leveling, is aimed mainly at eliminating jumps in production. Production leveling is a method of sequencing products in order to balance production and increase productivity and flexibility by eliminating waste and minimizing differences in workstation load [6]. Balancing production is understood as avoiding sudden jumps in the quantity of manufactured products in the schedule [20]. Production leveling consists in determining the sequence and quantity of flow from the process so that current demand is met from the warehouse/supermarket without causing sudden changes in the production schedule. The production schedule should be constant over a given period of time (whose length largely depends on the seasonality of the products). The aim is to produce the products in a particular sequence in batches of as few pieces as possible. In other words, production leveling is a way of ensuring the availability of products for customers through a repeatable and uniform flow of products and supplies in the warehouse. A repeatable flow of products from production also contributes to balancing the load on workstations.
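One common leveling heuristic (a minimal sketch, not a specific method from this article) is to fill each production slot with the product that is furthest behind its target share of demand, producing a repeatable mixed sequence instead of large batches:

```python
def heijunka_sequence(demand):
    """demand: dict of product -> units per period. Returns a leveled sequence."""
    total = sum(demand.values())
    produced = {p: 0 for p in demand}
    seq = []
    for _ in range(total):
        # pick the product with the lowest completion ratio,
        # i.e. the one furthest behind its target share
        p = min(demand, key=lambda x: produced[x] / demand[x])
        produced[p] += 1
        seq.append(p)
    return seq

# Hypothetical period demand: 3 units of A, 2 of B, 1 of C
print(heijunka_sequence({"A": 3, "B": 2, "C": 1}))  # → ['A', 'B', 'C', 'A', 'B', 'A']
```

Instead of the batched schedule AAABBC, the heuristic spreads each product evenly across the period, which is exactly the "batches of as few pieces as possible" behavior described above.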

Hoshin Kanri
Hoshin Kanri is a method that focuses all of a company's capabilities on improving its performance through the development of a unified policy and annual management plans based on the company's basic management concept. Hoshin Kanri can have various applications in the enterprise, from strategic planning methods and tools for managing complex projects, through the quality management system (new products are manufactured as a response to customer demand), up to an operating system ensuring stable earnings growth. The method is carried out in the following stages:
1. defining the mission and vision in the context of the overall strategy;
2. defining strategic objectives (3–5 years);
3. defining annual targets;
4. deploying the targets to lower levels;
5. implementing the objectives;
6. reviewing the objectives;
7. annual evaluation of the realization of the objectives.

Kamishibai
Kamishibai is a set of simple audits designed to control the work and the use of LM methods, as well as to teach the person conducting the audit to find possible improvements to a process or workstation. A key element of the system is the Kamishibai board, which is placed directly on the production line. The board holds a layout of the line, a schedule for conducting the audits, and documentation for the auditor. With Kamishibai, the auditor can be any person working in the company, for example security guards, production staff, accounting, personnel, or head office. This is possible thanks to the very simple design of the audit sheet, which most often contains a checklist of areas to check in the form of pictures and images, along with their locations marked on the layout map.

Conclusion
In the literature there are different views and descriptions of the various Lean Manufacturing tools. This paper aimed to present the tools and methods of LM, starting from the general, such as 5S, and ending with the less well known, such as Hoshin Kanri and production leveling.
Each Lean Manufacturing method is designed to support the company in eliminating the waste occurring in production and in achieving its improvement objectives. Table 1 presents a summary of the eight types of waste and sample tools that help eliminate them.
Table 1. Types of waste and the Lean Manufacturing methods that help eliminate them.

Waste                                    Methods of Lean Manufacturing
Overproduction                           Kanban, Heijunka, VSM
Excessive stocks                         Kanban, Heijunka, VSM
Mistakes and quality defects             Poka-Yoke, Jidoka, Kamishibai
Unnecessary movement                     5S, Standardized work
Unnecessary transport                    Kanban
Waiting                                  TPM, SMED
Excessive processing                     Standardized work, Kanban
Untapped employee potential              Kaizen
The tools and methods of Lean Manufacturing described in this article are only a small fraction of the methods of improving production available in the literature. Polish companies have only recently started to use the basic tools, such as 5S or SMED. Over time, however, they will begin to implement newer LM methods.

Acknowledgment
The presented research results, carried out under theme No. 02/23/DSMK/7677, were funded by a grant for science from the Ministry of Science and Higher Education.

Bibliography
[1] Antosz K., Pacana A., Stadnicka D., Zielecki W. (2015), Lean Manufacturing. Doskonalenie produkcji, Oficyna Wydawnicza Politechniki Rzeszowskiej, Rzeszów
[2] Czerska J. (2009), Doskonalenie strumienia wartości, Difin, Warszawa
[3] Durlik I. (1996), Inżynieria zarządzania. Strategia i projektowanie systemów produkcyjnych, cz. 1 i 2, Agencja Wydawnicza Placet, Warszawa
[4] Fisher M. (1999), Process improvement by poka-yoke, Work Study, Vol. 48, Issue 7, pp. 264–266
[5] Hamrol A. (2015), Strategie i praktyki sprawnego działania. Lean, Six Sigma i inne, Wydawnictwo PWN, Warszawa
[6] Hüttmeir A., de Treville S., van Ackere A., Monnier L., Prenninger J. (2009), Trading off between heijunka and just-in-sequence, International Journal of Production Economics, Vol. 118, pp. 501–507
[7] Liker J.K., Meier D.P. (2011), Droga Toyoty Fieldbook. Praktyczny przewodnik wdrażania 4P Toyoty, Wydawnictwo MT Biznes, Warszawa
[8] Michlowicz E., Smolińska K. (2014), Metoda TPM jako element poprawy ciągłości przepływu, Logistyka 3/2014
[9] Moreira A.C., Campos Silva Pais G. (2011), Single Minute Exchange of Die: A Case Study Implementation, Journal of Technology Management and Innovation, pp. 29–46
[10] Niederstadt J. (2014), Kamishibai Boards: A Lean Visual Management System That Supports Layered Audits, CRC Press
[11] Ohno T. (2008), System Produkcyjny Toyoty. Więcej niż produkcja na dużą skalę, ProdPress.com, Wrocław
[12] Rother M., Shook J. (1999), Learning to See: Value Stream Mapping to Create Value and Eliminate Muda, Lean Enterprise Institute, Brookline
[13] Sawik T. (1992), Optymalizacja dyskretna w elastycznych systemach produkcyjnych, Wydawnictwo Naukowo-Techniczne, Warszawa
[14] Shook J., Schroeder A. (2010), Leksykon Lean. Ilustrowany słownik pojęć z zakresu Lean Management, Lean Enterprise Institute Polska, Wrocław
[15] Sobańska I. (2013), Lean accounting. Integralny system lean management, Wolters Kluwer Polska, Warszawa
[16] The Productivity Press Development Team (2012), TPM dla każdego, ProdPublishing.com, Warszawa
[17] Witcher B.J., Butterworth R. (2001), Hoshin Kanri: Policy Management in Japanese-Owned UK Subsidiaries, Journal of Management Studies, Vol. 38, Issue 5
[18] Womack J.P., Jones D.T. (2001), Odchudzanie firm. Eliminacja marnotrawstwa – kluczem do sukcesu, Centrum Informacji Menedżera CIM, Warszawa
[19] Womack J.P., Jones D.T., Roos D. (1990), The Machine that Changed the World: The Triumph of Lean Production, Rawson Macmillan, New York
[20] Zandin K.B. (2001), Maynard's Industrial Engineering Handbook, McGraw-Hill, New York

Monday, 02 January 2017

Secondary Data Analysis

Secondary data analysis: use of data gathered by someone else for a different purpose – reanalysis of existing data. See the methods links page for links to secondary sources of data about recreation & tourism.

Sources:
Government agencies: e.g., population, housing & economic censuses, tax collections, traffic counts, employment, environmental quality measures, park use, …
Internal records of your organization – sales, customers, employees, budgets, web logs, …
Private sector – industry associations often have data on the size and characteristics of an industry
Previous surveys – as printed reports or raw data; survey research firms sell data
Library & electronic sources – the WWW, online & CD-ROM literature searches, …
Previously published research – reports have data in summary form; original data are often available from the authors

Issues in using secondary data.
1)   data availability – know what is available & where to find it
2)   relevance – data must be relevant to your problem & situation
3)   accuracy – need to understand accuracy & meaning of the data
4)   sufficiency – often must supplement secondary data with primary data or judgement to completely address the problem

Since you did not collect the secondary data, it is imperative that you fully understand the meaning and accuracy of the data before you can use it intelligently. This usually requires knowing how it was collected and by whom. Find complete documentation of the data, or ask the data source about the details. For example, most standard government data providers have extensive documentation on methods, data reliability, etc. at their websites. Beware of data that isn't clearly documented. The BTS Guide to Good Statistical Practice identifies some useful guidelines.

Kinds of Secondary Data

1. Regularly gathered time series data: useful for tracking trends and forecasting. The most common sources here are governmental and often economic – international arrivals, sales, jobs, payroll in various industries, the census of population and housing, budgets, revenues, employees, some regularly conducted surveys, and industry reporting.
2. Reporting for geographic units: useful for spatial analysis. Many of the above time series are also reported for countries, states, counties, and sometimes smaller geographic units – again, mostly from governmental sources.
3. Park visit/facility use data: many park systems have regular reports of visitor counts, although accuracy and consistency are sometimes questionable. NPS public use data are a good example; most state parks and some other park systems also report counts. The private sector has good sales data, but it is usually proprietary. Only a few museums report use data.

Examples

a)    Trends – compare surveys in different years or plot time series data – many tables in Spotts' Travel & Tourism Statistical Abstract, Michigan county tourism profiles, economic time series at the BLS site, REIS data at the Gov Info Clearing House.
b)   Spatial variations – gather data across spatial units, map the results, compare geographic areas.
c)    Recreation participation – apply rates from national, state and local surveys to local population data from the Census; rates at NSGA, ARC web pages (Roper-Starch study), NSRE 1994–95 survey.
d)   Internal records – use zip codes or telephone area codes to map the market area; track trends in regularly gathered variables (use, sales, costs, employee turnover, customer complaints/satisfaction, environmental variables, web logs, …).
e)    Combining sources in models – e.g., a gravity model would utilize population data, an inventory of the supply of facilities, and distances.
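A gravity model of the kind mentioned above can be sketched in a few lines. This is a generic illustration, not a model from the notes; the zone names, attraction measures, and the two constants are all hypothetical and would need calibration against observed trip data.

```python
# Minimal gravity-model sketch: predicted trips from origin zone i to park j
# are proportional to POP_i * ATTR_j / DIST_ij**b. All inputs are hypothetical;
# real data would come from the Census, a facility inventory, and a distance matrix.

populations = {"zone_a": 50000, "zone_b": 20000}   # origin-zone populations (Census)
attractions = {"park_1": 100, "park_2": 40}        # supply measure, e.g. acres or facility counts
distance = {("zone_a", "park_1"): 10, ("zone_a", "park_2"): 25,
            ("zone_b", "park_1"): 30, ("zone_b", "park_2"): 5}
b = 2.0    # distance-decay exponent (assumed)
k = 0.001  # scaling constant, calibrated to observed total trips (assumed)

def trips(origin, park):
    """Predicted trips between an origin zone and a park."""
    return k * populations[origin] * attractions[park] / distance[(origin, park)] ** b

for (o, p) in distance:
    print(o, p, round(trips(o, p), 1))  # e.g. zone_a park_1 → 50.0
```

Nearby, well-supplied pairs (zone_b and park_2) dominate the predicted flows, which is the basic behavior a gravity model is meant to capture.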

TOURISM: Example of the Use of Secondary Data in Estimating Tourism Activity, Spending and Impacts in Michigan
I have a series of models for estimating tourism activity, spending, and economic impacts at the state and county level. These rely almost completely on secondary data sources – lodging room use taxes; motel, campground and seasonal home inventories; occupancy rates by region; average spending by segment; and statewide travel counts. See my economic impact website. Also see the Leones paper (http://ag.arizona.edu/pubs/marketing/az1113/) on measuring tourism activity in your community.
Secondary data used in my tourism models:
          Room tax collections (state and local CVB’s)
          Resident Population by county (Census)
          Seasonal homes by county (Census)
          Lodging inventory by county (rooms, campsites)
Hotel, restaurant, amusement, retail sales, employment, income by county (IMPLAN, REIS, CBP, BEA, BLS, ES-202)
          National tourism industry ratios (BEA satellite accounts)
          BLS price indices by commodity
State & local tourism estimates by others (TIA, TTRRC, D.K. Shifflet, ATS 95, CVB’s)
Local area multipliers and ratios (employment to sales, income to sales) for tourism sectors (IMPLAN)
          County to county distance matrix
          Michigan Airport enplaning and deplaning passengers by airport
Parameters from various tourism surveys (ATS 1995, D.K. Shifflet, TTRRC household, a variety of local surveys in Michigan)
          Average length of stay, party size by subgroups of visitors
          Hotel room and campsite occupancy rates
          Average room and campground rates (prices)
          Average days seasonal homes are occupied, average party size
          Average spending for different visitor segments
          State Total day trips and stays with friends and relatives (ATS95)
          For trips of 50 miles or more, percent that are 100 miles or more.
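As a heavily simplified illustration of how one of these sources can be used (hypothetical figures only, not the actual Michigan models), room use tax collections, combined with the tax rate and an average room rate, imply taxable room revenue and room nights sold:

```python
# Back-of-envelope estimate of room nights from room use tax collections.
# All three inputs are hypothetical assumptions for illustration.

tax_collections = 500_000.0  # annual room use tax collected ($)
tax_rate = 0.05              # room use tax rate (5%)
avg_room_rate = 80.0         # average room rate ($/night)

# Taxable room revenue implied by the collections
room_revenue = tax_collections / tax_rate

# Room nights sold at the average rate
room_nights = room_revenue / avg_room_rate

print(round(room_revenue))  # → 10000000
print(round(room_nights))   # → 125000
```

Multiplying the estimated room nights by party size and average spending per segment (two of the survey parameters listed above) would then yield a spending estimate.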

See my economic impacts of tourism website. Models and data sources are discussed briefly in MITEIM model documentation and at bottom of county tourism spending table (http://www.prr.msu.edu/miteim/michtsm00.htm).

RECREATION: Estimating/forecasting participants and days of participation (Lakes States Recreation Estimates by County). Forecast by using population projections in the model.


Population by age (Census) – POPi
Activity participation rates for Michigan (NSGA) – by age group (PARTi)
Frequency of participation (NSGA) – by age group (FREQi)
Number of participants = Σi POPi × PARTi
Number of days of participation = Σi POPi × PARTi × FREQi
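The two formulas above can be computed directly once the age-group tables are in hand. The sketch below uses made-up age groups, populations, rates, and frequencies; real values would come from the Census and NSGA tables cited above.

```python
# Participation model: sum POPi * PARTi (participants) and
# POPi * PARTi * FREQi (days) over age groups i. All inputs hypothetical.

# County population by age group (Census): POPi
pop = {"18-24": 9000, "25-44": 25000, "45-64": 21000, "65+": 12000}

# Activity participation rate by age group (NSGA): PARTi
part = {"18-24": 0.40, "25-44": 0.30, "45-64": 0.20, "65+": 0.10}

# Days of participation per participant per year (NSGA): FREQi
freq = {"18-24": 12, "25-44": 10, "45-64": 8, "65+": 6}

participants = sum(pop[i] * part[i] for i in pop)
days = sum(pop[i] * part[i] * freq[i] for i in pop)

print(round(participants))  # → 16500
print(round(days))          # → 159000
```

To forecast, replace the Census population vector with projected populations by age group, as the note above suggests.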

See Stynes, D.J. 1998. Recreation activity and tourism spending in the Lakes States. In: Lake States Regional Forest Resources Assessment: Technical Papers. J.M. Vasievich and H.H. Webster (eds). USDA Forest Service, North Central Forest Experiment Station, Gen. Tech. Report NC-189, pp. 139-164.

Exercise to practice downloading data for the above recreation model: http://www.msu.edu/course/prr/389/PRR389Exercises.htm#ex4

Conducting High-Value Secondary Dataset Analysis: An Introductory Guide and Resources

ABSTRACT
Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium (www.sgim.org/go/datasets). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.

KEY WORDS: large datasets, secondary analysis, publicly available, guide, resources
INTRODUCTION
Secondary data analysis is analysis of data that was collected by someone else for another primary purpose. Increasingly, generalist researchers start their careers conducting analyses of existing datasets, and some continue to make this the focus of their career. Using secondary data enables one to conduct studies of high-impact research questions with dramatically less time and resources than required for most studies involving primary data collection. For fellows and junior faculty who need to demonstrate productivity by completing and publishing research in a timely manner, secondary data analysis can be a key foundation to successfully starting a research career. Successful completion demonstrates content and methodological expertise, and may yield useful data for future grants. Despite these attributes, conducting high quality secondary data research requires a distinct skill set and substantial effort. However, few frameworks are available to guide new investigators as they conduct secondary data analyses.
In this article we describe key principles and skills needed to conduct successful analysis of secondary data and provide a brief description of high-value datasets and online resources. The primary target audience of the article is investigators with an interest but limited prior experience in secondary data analysis, as well as mentors of these investigators, who may find this article a useful reference and teaching tool. While we focus on analysis of large, publicly available datasets, many of the concepts we cover are applicable to secondary analysis of proprietary datasets. Datasets we feature in this manuscript encompass a wide range of measures, and thus can be useful to evaluate not only one disease in isolation, but also its intersection with other clinical, demographic, and psychosocial characteristics of patients. 

REASONS TO CONDUCT OR TO AVOID A SECONDARY DATA ANALYSIS
Many worthwhile studies simply cannot be done in a reasonable timeframe and cost with primary data collection. For example, if you wanted to examine racial and ethnic differences in health services utilization over the last 10 years of life, you could enroll a diverse cohort of subjects with chronic illness and wait a decade (or longer) for them to die, or you could find a dataset that includes a diverse sample of decedents. Even for less dramatic examples, primary data collection can be difficult without incurring substantial costs, including time and money—scarce resources for junior researchers in particular. Secondary datasets, in contrast, can provide access to large sample sizes, relevant measures, and longitudinal data, allowing junior investigators to formulate a generalizable answer to a high impact question. For those interested in conducting primary data collection, beginning with a secondary data analysis may provide a “bird’s eye view” of epidemiologic trends that future primary data studies examine in greater detail.
Secondary data analyses, however, have disadvantages that are important to consider. In a study focused on primary data, you can tightly control the desired study population, specify the exact measures that you would like to assess, and examine causal relationships (e.g., through a randomized controlled design). In secondary data analyses, the study population and measures collected are often not exactly what you might have chosen to collect, and the observational nature of most secondary data makes it difficult to assess causality (although some quasi-experimental methods, such as instrumental variable or regression discontinuity analysis, can partially address this issue). While not unique to secondary data analysis, another disadvantage to publicly available datasets is the potential to be “scooped,” meaning that someone else publishes a similar study from the same data set before you do. On the other hand, intentional replication of a study in a different dataset can be important in that it either supports or refutes the generalizability of the original findings. If you do find that someone has published the same study using the same dataset, try to find a unique angle to your study that builds on their findings.

STEPS TO CONDUCTING A SUCCESSFUL SECONDARY DATA ANALYSIS
The same basic research principles that apply to studies using primary data apply to secondary data analysis, including the development of a clear research question, study sample, appropriate measures, and a thoughtful analytic approach. For purposes of secondary data analysis, these principles can be conceived as a series of four key steps, described in Table 1 and the sections below. Table 2 provides a glossary of terms used in secondary analysis including dataset types and common sampling terminology.
Table 1
A Practical Approach to Successful Research with Large Datasets
Table 2
Glossary of Terms Used in Secondary Dataset Analysis Research

Define your Research Topic and Question
Case: A fellow in general medicine has a strong interest in studying palliative and end-of-life care. Building on his interest in racial and ethnic disparities, he wants to examine disparities in use of health services at the end of life. He is leaning toward conducting a secondary data analysis and is not sure if he should begin with a more focused research question or a search for a dataset.

Investigators new to secondary data research are frequently challenged by the question "which comes first, the question or the dataset?" In general, we advocate that researchers begin by defining their research topic or question. A good question is essential: an uninteresting study with a huge sample size or extensively validated measures is still uninteresting. The answer to a research question should have implications for patient care or public policy. Imagine the possible findings and ask the dreaded question: "so what?" If possible, select a question that will be interesting whether the findings are positive or negative. Also, identify a target audience who would find your work interesting and useful.

It is often useful to start with a thorough literature review of the question or topic of interest. This effort both avoids duplicating others' work and suggests ways to build upon the existing literature. Once the question is established, identify datasets that are the best fit in terms of patient population, sample size, and measures of the variables of interest (including predictors, outcomes, and potential confounders). Once a candidate dataset has been identified, we recommend being flexible and adapting the research question to the strengths and limitations of the dataset, as long as the question remains interesting and specific and the methods to answer it are scientifically sound. Be creative: some measures of interest may not have been ascertained directly, but data may be available to construct a suitable proxy.
In some cases, you may find that a dataset that initially looked promising lacks the necessary data (or data quality) to answer research questions in your area of interest reliably. In that case, be prepared to search for an alternative dataset.

A specific research question is essential to good research. However, many researchers have a general area of interest but find it difficult to identify specific research questions without knowing what data are available. In that case, combing the research documentation for unexamined yet interesting measures in your area of interest can be fruitful. Beginning with the dataset and no focused area of interest may lead to data dredging: simply creating cross-tabulations of unexplored variables in search of significant associations is bad science. Yet, in our experience, many good studies have resulted from a researcher with a general topic area of interest finding a clinically meaningful yet underutilized measure and having the insight to frame a research question that uses that measure to answer a novel and clinically compelling question (see references for examples4-8). Dr. Warren Browner once exhorted, "just because you were not smart enough to think of a research question in advance doesn't mean it's not important!" [quote used with permission].

Select a Dataset
Case continued: After a review of available datasets that fit his topic area of interest, the fellow decides to use data from the Surveillance, Epidemiology, and End Results program linked to Medicare claims (SEER-Medicare).

The range and intricacy of large datasets can be daunting to a junior researcher. Fortunately, several online compendia are available to guide researchers (Table 3), including one recently developed by this manuscript's authors for the Society of General Internal Medicine (SGIM) (www.sgim.org/go/datasets). The SGIM Research Dataset Compendium was developed and is maintained by members of the SGIM research committee, who consulted with experts to identify and profile high-value datasets for generalist researchers. The Compendium includes descriptions of and links to over 40 high-value datasets used for health services, clinical epidemiology, and medical education research. It provides detailed information of use in selecting a dataset, including sample sizes and characteristics, available measures and how they were ascertained, comments from expert users, links to the dataset, and example publications (see Box for an example). A selection of datasets from the Compendium is listed in Table 4. SGIM members can request a one-time telephone consultation with an expert user of a large dataset (see details on the Compendium website).
[Box: example dataset profile from the SGIM Research Dataset Compendium; image not reproduced]
Dataset complexity, cost, and the time needed to acquire the data and obtain institutional review board (IRB) approval are critical considerations for junior researchers, who are new to secondary analysis and have limited financial resources and limited time to demonstrate productivity. Table 4 illustrates the complexity and cost of a range of high-value datasets used by generalist researchers. Dataset complexity increases with the number of subjects, the file structure (e.g., single versus multiple records per individual), and the complexity of the survey design. Many publicly available datasets are free; others can cost tens of thousands of dollars to obtain. Time to acquire the data and obtain IRB approval also varies: some datasets can be downloaded from the web, others require multiple layers of permission and security, and in some cases the data must be analyzed in a central data processing center. If the project requires linking new data to an existing database, the linkage will add to the time needed to complete the project and will probably require enhanced data security. One advantage of most secondary studies using publicly available datasets is the rapid time to IRB approval: many publicly available large datasets contain de-identified data and are therefore eligible for expedited review or exempt status. If you can download the dataset from the web, it is probably exempt, but your local IRB must make this determination.

Linking datasets can be a powerful method for examining an issue from multiple perspectives of the patient experience. Many datasets, including SEER, can be linked to the Area Resource File to examine regional variation in practice patterns. However, linking datasets increases the complexity and cost of data management. A new researcher might consider first conducting a study on the initial database alone, and then conducting the next study using the linked database. For some new investigators, this approach can progressively advance programming skills and build confidence while demonstrating productivity.
Table 3
Online Compendia of Secondary Datasets
Table 4
Examples of High Value Datasets

Get to Know your Dataset
Case continued: The fellow's primary mentor encourages him to closely examine the accuracy of the primary predictor for his study, race and ethnicity, as reported in SEER-Medicare. The fellow has a breakthrough when he finds an entire issue of the journal Medical Care dedicated to SEER-Medicare, including a whole chapter on the accuracy of coding of sociodemographic factors.9

In an analysis of primary data, you select the patients to be studied and choose the study measures. This process gives you a close familiarity with the study subjects, and with how and what data were collected, that is invaluable in assessing the validity of the measures, the potential for bias in measuring associations between predictors and outcome variables (internal validity), and the generalizability of the findings to target populations (external validity). The importance of this familiarity with the strengths and weaknesses of the dataset cannot be overemphasized. Secondary data research requires considerable effort to obtain the same level of familiarity, so knowing your data in detail is critical. Practically, this means scouring online documentation and technical survey manuals, searching PubMed for validation studies, and closely reading previous studies that used your dataset, in order to answer questions such as: Who collected the data, and for what purpose? How did subjects get into the dataset? How were they followed? Do your measures capture what you think they capture?

We strongly recommend taking advantage of help offered by the dataset managers, typically described on the dataset's website. For example, the Research Data Assistance Center (ResDAC) is a dedicated resource for researchers using data from the Centers for Medicare and Medicaid Services (CMS).

Assessing the validity of your measures is one of the central challenges of large dataset research.
For large survey datasets, a good first step in assessing the validity of your measures is to read the questions as they were asked in the survey. Some questions simply have face validity. Others, unfortunately, were collected in a way that makes the measure meaningless, problematic, or open to a range of interpretations. These ambiguities can arise in how the question was asked or in how the responses were recorded into categories.

Another essential step is to search the online documentation and published literature for previous validation studies. A PubMed search using the dataset name or the measure name/type together with the publication type "validation studies" is a good starting point. The key question for a validity study is how and why the data were collected (e.g., self-report, chart abstraction, physical measurements, billing claims) in relation to a gold standard. For example, if you are using claims data, recognize that the primary purpose of those data was not research but reimbursement. Consequently, claims data are limited by the scope of services that are reimbursable and by the accuracy of coding by clinicians completing encounter forms for billing or by coders in the claims departments of hospitals and clinics. Some clinical measures can be assessed by asking subjects if they have the condition of interest, such as a self-reported diagnosis of hypertension. Self-reported data may be adequate for some research questions (e.g., does a diagnosis of hypertension lead people to exercise more?) but inadequate for others (e.g., the prevalence of hypertension among people with diabetes). Even measured data, such as blood pressure, have limitations in that the methods of measurement for a study may differ from the methods used to diagnose a disorder in the clinician's office.
In the National Health and Nutrition Examination Survey, for example, a subject's blood pressure is based on the average of several measurements taken in a single visit. This differs from the standard clinical practice of measuring blood pressure at separate office visits before diagnosing hypertension. Rarely do available measures capture exactly what you are trying to study. In our experience, measures in existing datasets are often good enough to answer the research question, with proper interpretation to account for what the measures actually assess and how they differ from the underlying constructs.

Finally, we suggest paying close attention to the completeness of measures and evaluating whether missing data are random or non-random (the latter might result in bias, whereas the former is generally acceptable). Statistical approaches to missing data are beyond the scope of this paper, and most statisticians can help you address the problem appropriately. However, pay close attention to "skip patterns": some data are missing simply because the survey item is asked only of the subset to which it applies. For example, in the Health and Retirement Study, the question about need for assistance with toileting is asked only of subjects who respond that they have difficulty using the toilet. If you were unaware of this skip pattern and attempted to study assistance with toileting, you would be distressed to find that over three-quarters of respondents had missing responses for this question (because they reported no difficulty using the toilet).

Fellows and other trainees usually do their own computer programming. Although this may be daunting, we encourage the practice so that fellows get a close feel for the data and become more skilled in statistical analysis. Datasets, however, range in complexity (Table 4).
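A quick, mechanical check for skip patterns like the toileting example above can save considerable confusion. The sketch below (plain Python; the field names are illustrative, not actual Health and Retirement Study variable names) contrasts naive missingness with missingness among eligible respondents:

```python
# Hypothetical survey records with a skip pattern: the assistance item is
# asked only of respondents who report difficulty using the toilet.
records = [
    {"difficulty_toileting": "no",  "needs_help_toileting": None},
    {"difficulty_toileting": "no",  "needs_help_toileting": None},
    {"difficulty_toileting": "no",  "needs_help_toileting": None},
    {"difficulty_toileting": "yes", "needs_help_toileting": "yes"},
    {"difficulty_toileting": "yes", "needs_help_toileting": "no"},
]

def missing_rate(rows, item):
    """Fraction of all rows with no recorded response for `item`."""
    return sum(r[item] is None for r in rows) / len(rows)

def missing_rate_among_eligible(rows, item, gate, gate_value="yes"):
    """Missingness computed only among rows that passed the gate question."""
    eligible = [r for r in rows if r[gate] == gate_value]
    return sum(r[item] is None for r in eligible) / len(eligible)

print(missing_rate(records, "needs_help_toileting"))  # 0.6: looks alarming
print(missing_rate_among_eligible(
    records, "needs_help_toileting", "difficulty_toileting"))  # 0.0
```

Conditioning on the gating question shows that the apparently alarming 60% missingness is entirely an artifact of the skip pattern.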
In our experience, fellows who have completed introductory training in SAS, Stata, SPSS, or similar statistical software have been highly successful analyzing datasets of moderate complexity without the ongoing assistance of a statistical programmer. However, if you do have a programmer who will do much of the coding, be closely involved and review all data cleaning and statistical output as if you had programmed it yourself. Close attention can reveal all sorts of patterns, problems, and opportunities in the data that are obscured by focusing only on the final outputs prepared by a statistical programmer. Programmers and statisticians are not clinicians; they will often not recognize when the values of variables or patterns of missingness do not make sense. If estimates seem implausible or do not match previously published estimates, the analytic plan, statistical code, and measures should be carefully rechecked.

Keep in mind that "the perfect may be the enemy of the good." No one expects perfect measures (this is also true for primary data collection). The closer you are to the data, the more you see the warts; do not be discouraged by this. The measures need to pass the sniff test: that is, they should have clinical validity, based primarily on the judgment that they make sense clinically or scientifically, but supported where possible by validation procedures, auditing procedures, or other studies that have independently validated the measures of interest.
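Validating a measure against a gold standard, as discussed above, often reduces to a 2x2 table. A minimal sketch, with invented counts, comparing a hypothetical claims-based indicator against chart review:

```python
# Invented validation counts: claims-based flag versus chart-review gold standard.
true_pos, false_pos = 80, 10    # flag present: 80 truly have the condition, 10 do not
false_neg, true_neg = 20, 890   # flag absent: 20 truly have the condition, 890 do not

sensitivity = true_pos / (true_pos + false_neg)  # detected among true cases
specificity = true_neg / (true_neg + false_pos)  # correctly negative among non-cases
ppv = true_pos / (true_pos + false_pos)          # probability a flag is a true case

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, PPV={ppv:.2f}")
```

Even a highly specific claims flag can miss a meaningful share of true cases (here, sensitivity of 0.80), which matters when the research question depends on complete case ascertainment.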

Structure your Analysis and Presentation of Findings in a Way that Is Clinically Meaningful
Case continued: The fellow finds that Blacks are less likely than Whites to receive chemotherapy in the last 2 weeks of life (Blacks 4%, Whites 6%, p < 0.001). He debates the meaning of this statistically significant 2% absolute difference.

Often, the main challenge for investigators who are new to secondary data analysis is structuring the analysis and presentation of findings in a way that tells a meaningful story. Based on what you have found, what is the story that you want your target audience to understand? When appropriate, it can be useful to conduct carefully planned sensitivity analyses to evaluate the robustness of your primary findings. A sensitivity analysis assesses the effect of variations in assumptions on the outcome of interest. For example, if 10% of subjects did not answer a "yes" or "no" question, you could conduct sensitivity analyses to estimate the effects of excluding the missing responses, or of categorizing them as all "yes" or all "no."

Because large datasets may contain multiple measures of interest, covariates, and outcomes, a frequent temptation is to present huge tables with many rows and columns. This is a mistake: such tables can be challenging to sort through, and the clinical importance of the story can be lost. In our experience, a thoughtful figure often captures the take-home message in a way that is more interpretable and memorable to readers than rows of data tables.

You should keep careful track of which subjects you exclude from the analysis and why; editors, reviewers, and readers will want to know. The best way to keep track is to construct a flow diagram from the original denominator to the final sample.

Do not confuse statistical significance with clinical importance in large datasets. Because of the large sample sizes, associations may be statistically significant but not clinically meaningful. Be mindful of what is meaningful from a clinical or policy perspective.
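The bounding approach described above can be made concrete with a few lines of arithmetic. This sketch (the counts are illustrative) bounds a "yes" proportion when 10% of responses to a yes/no item are missing:

```python
# Illustrative counts for a yes/no item: 10% of 1,000 respondents did not answer.
n_yes, n_no, n_missing = 300, 600, 100
n_total = n_yes + n_no + n_missing

complete_case = n_yes / (n_yes + n_no)       # drop the missing responses
lower_bound = n_yes / n_total                # code all missing as "no"
upper_bound = (n_yes + n_missing) / n_total  # code all missing as "yes"

print(f"complete-case estimate: {complete_case:.3f}")
print(f"bounds: {lower_bound:.2f} to {upper_bound:.2f}")
```

If the study's conclusions hold at both extremes, the missing responses cannot be driving the finding.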
One concern that frequently arises at this stage of large database research is the acceptability of "exploratory" analyses, that is, examining associations between multiple factors of interest. On the one hand, exploratory analyses risk finding a significant association by chance alone as a result of testing multiple associations (a false-positive result). On the other hand, the critical issue is not a statistical one, but rather whether the question is important.10 Exploratory analyses are acceptable if done in a thoughtful way that serves an a priori hypothesis, but not if they are merely data dredging in search of associations.

We recommend consulting a statistician when using data from a complex survey design (see Table 2) or when developing a conceptually advanced study design, for example one using longitudinal data, multilevel modeling with clustered data, or survival analysis. The value of input (even if informal) from a statistician or other advisor with substantial methodological expertise cannot be overstated.
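The false-positive risk of unplanned exploratory testing grows quickly with the number of comparisons. Assuming independent tests at alpha = 0.05 and no true effects, the chance of at least one spurious "significant" result is 1 - (1 - alpha)^k:

```python
# Probability of at least one false-positive finding across k independent
# tests at alpha = 0.05, assuming no true associations exist.
alpha = 0.05
for k in (1, 5, 20):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:2d} tests -> P(at least one false positive) = {p_any:.2f}")
```

With 20 unplanned comparisons, the odds are nearly two in three that something "significant" appears by chance alone, which is why exploratory analyses need an a priori rationale.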

CONCLUSIONS
Case conclusion: Two years after he began the project, the fellow completes the analysis and publishes the paper in a peer-reviewed journal.

A 2-year timeline from inception to publication is typical for large database research. Academic potential is commonly assessed by the ability to see a study through to publication in a peer-reviewed journal, and this timeline allows a fellow who began a secondary analysis at the start of a 2-year training program to search for a job with an article under review or in press.

In conclusion, secondary dataset research has tremendous advantages, including the ability to assess outcomes that would be difficult or impossible to study with primary data collection, such as those involving exceptionally long follow-up times or rare outcomes. For junior investigators, the potential for a shorter time to publication may help secure a job or career development funding. Some of the time "saved" by not collecting data yourself, however, must be "spent" becoming intimately familiar with the dataset. Ultimately, the same factors that make for successful primary data analysis apply to secondary data analysis: a clear research question, an appropriate study sample and measures, and a thoughtful analytic approach.

Acknowledgments
Contributors The authors would like to thank Sei Lee, MD, Mark Freidberg, MD, MPP, and J. Michael McWilliams, MD, PhD, for their input on portions of this manuscript.
Grant Support Dr. Smith is supported by a Research Supplement to Promote Diversity in Health Related Research from the National Institute on Aging (R01AG028481), the National Center for Research Resources UCSF-CTSI (UL1 RR024131), and the National Palliative Care Research Center. Dr. Steinman is supported by the National Institute on Aging and the American Federation for Aging Research (K23 AG030999). An unrestricted grant from the Society of General Internal Medicine (SGIM) supported development of the SGIM Research Dataset Compendium.
Prior Presentations An earlier version of this work was presented as a workshop at the Annual Meeting of the Society of General Internal Medicine in Minneapolis, MN, April 2010.
Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
Conflict of Interest None disclosed.

References
1. Mainous AG 3rd, Hueston WJ. Using other people's data: the ins and outs of secondary data analysis. Fam Med. 1997;29(8):568–571.
2. Doolan DM, Froelicher ES. Using an existing data set to answer new research questions: a methodological review. Res Theory Nurs Pract. 2009;23(3):203–215. doi: 10.1891/1541-6577.23.3.203.
3. Shlipak M, Stehman-Breen C. Observational research databases in renal disease. J Am Soc Nephrol. 2005;16(12):3477–3484. doi: 10.1681/ASN.2005080806.
4. Williams BA, Lindquist K, Moody-Ayers SY, Walter LC, Covinsky KE. Functional impairment, race, and family expectations of death. J Am Geriatr Soc. 2006;54(11):1682–1687. doi: 10.1111/j.1532-5415.2006.00941.x.
5. Steinman MA, Sands LP, Covinsky KE. Self-restriction of medications due to cost in seniors without prescription coverage. J Gen Intern Med. 2001;16(12):793–799. doi: 10.1046/j.1525-1497.2001.10412.x.
6. Lindenberger EC, Landefeld CS, Sands LP, et al. Unsteadiness reported by older hospitalized patients predicts functional decline. J Am Geriatr Soc. 2003;51(5):621–626. doi: 10.1034/j.1600-0579.2003.00205.x.
7. Linder JA, Ma J, Bates DW, Middleton B, Stafford RS. Electronic health record use and the quality of ambulatory care in the United States. Arch Intern Med. 2007;167(13):1400–1405. doi: 10.1001/archinte.167.13.1400.
8. Lee SJ, Steinman MA, Tan EJ. Volunteering, driving status, and mortality in US retirees. 2010.
9. Bach PB, Guadagnoli E, Schrag D, Schussler N, Warren JL. Patient demographic and socioeconomic characteristics in the SEER-Medicare database: applications and limitations. Med Care. 2002;40(8):IV-19–25.
10. Browner WS, Newman TB. Are all significant P values created equal? The analogy between diagnostic tests and clinical research. JAMA. 1987;257(18):2459–2463. doi: 10.1001/jama.257.18.2459.
11. Smith AK, Earle CC, McCarthy EP. Racial and ethnic differences in end-of-life care in fee-for-service Medicare beneficiaries with advanced cancer. J Am Geriatr Soc. 2008 Nov 21.
12. Goel MS, Burns RB, Phillips RS, Davis RB, Ngo-Metzger Q, McCarthy EP. Trends in breast conserving surgery among Asian Americans and Pacific Islanders, 1992–2000. J Gen Intern Med. 2005;20(7):604–611. doi: 10.1007/s11606-005-0107-3.
13. Byfield SA, Earle CC, Ayanian JZ, McCarthy EP. Treatment and outcomes of gastric cancer among United States-born and foreign-born Asians and Pacific Islanders. Cancer. 2009;115(19):4595–4605. doi: 10.1002/cncr.24487.
14. Mehrotra A, Zaslavsky AM, Ayanian JZ. Preventive health examinations and preventive gynecological examinations in the United States. Arch Intern Med. 2007;167(17):1876–1883. doi: 10.1001/archinte.167.17.1876.
15. Harman JS, Veazie PJ, Lyness JM. Primary care physician office visits for depression by older Americans. J Gen Intern Med. 2006;21(9):926–930. doi: 10.1007/BF02743139.
16. Hoffman KE, McCarthy EP, Recklitis CJ, Ng AK. Psychological distress in long-term survivors of adult-onset cancer: results from a national survey. Arch Intern Med. 2009;169(14):1274–1281. doi: 10.1001/archinternmed.2009.179.
17. Mohanty SA, Woolhandler S, Himmelstein DU, Bor DH. Diabetes and cardiovascular disease among Asian Indians in the United States. J Gen Intern Med. 2005;20(5):474–478. doi: 10.1111/j.1525-1497.2005.40294.x.
18. Hausmann LR, Jeong K, Bost JE, Ibrahim SA. Perceived discrimination in health care and use of preventive health services. J Gen Intern Med. 2008;23(10):1679–1684. doi: 10.1007/s11606-008-0730-x.
19. Ross JS, Keyhani S, Keenan PS, et al. Use of recommended ambulatory care services: is the Veterans Affairs quality gap narrowing? Arch Intern Med. 2008;168(9):950–958. doi: 10.1001/archinte.168.9.950.
20. Ibrahim SA, Kwoh CK, Krishnan E. Factors associated with patients who leave acute-care hospitals against medical advice. Am J Public Health. 2007;97(12):2204–2208. doi: 10.2105/AJPH.2006.100164.
21. Trivedi AN, Sequist TD, Ayanian JZ. Impact of hospital volume on racial disparities in cardiovascular procedure mortality. J Am Coll Cardiol. 2006;47(2):417–424. doi: 10.1016/j.jacc.2005.08.068.
22. Ginde AA, Liu MC, Camargo CA. Demographic differences and trends of vitamin D insufficiency in the US population, 1988–2004. Arch Intern Med. 2009;169(6):626–632. doi: 10.1001/archinternmed.2008.604.
23. Nguyen NT, Magno CP, Lane KT, Hinojosa MW, Lane JS. Association of hypertension, diabetes, dyslipidemia, and metabolic syndrome with obesity: findings from the National Health and Nutrition Examination Survey, 1999 to 2004. J Am Coll Surg. 2008;207(6):928–934. doi: 10.1016/j.jamcollsurg.2008.08.022.
24. Lee SJ, Go AS, Lindquist K, Bertenthal D, Covinsky KE. Chronic conditions and mortality among the oldest old. Am J Public Health. 2008;98(7):1209–1214. doi: 10.2105/AJPH.2007.130955.
25. Silveira MJ, Kim SY, Langa KM. Advance directives and outcomes of surrogate decision making before death. N Engl J Med. 2010;362(13):1211–1218.
26. Sommers BD. Loss of health insurance among non-elderly adults in Medicaid. J Gen Intern Med. 2009;24(1):1–7. doi: 10.1007/s11606-008-0792-9.
28. Carcaise-Edinboro P, Bradley CJ. Influence of patient-provider communication on colorectal cancer screening. Med Care. 2008;46(7):738–745. doi: 10.1097/MLR.0b013e318178935a.
29. Hernandez AF, Shea AM, Milano CA, et al. Long-term outcomes and costs of ventricular assist devices among Medicare beneficiaries. JAMA. 2008;300(20):2398–2406. doi: 10.1001/jama.2008.716.
30. Shea AM, Curtis LH, Hammill BG, DiMartino LD, Abernethy AP, Schulman KA. Association between the Medicare Modernization Act of 2003 and patient wait times and travel distance for chemotherapy. JAMA. 2008;300(2):189–196. doi: 10.1001/jama.300.2.189.
31. Madden JM, Graves AJ, Zhang F, et al. Cost-related medication nonadherence and spending on basic needs following implementation of Medicare Part D. JAMA. 2008;299(16):1922–1928. doi: 10.1001/jama.299.16.1922.
32. Tjia J, Briesacher BA, Soumerai SB, et al. Medicare beneficiaries and free prescription drug samples: a national survey. J Gen Intern Med. 2008;23(6):709–714. doi: 10.1007/s11606-008-0568-2.
33. Gilchrist VJ, Stange KC, Flocke SA, McCord G, Bourguet CC. A comparison of the National Ambulatory Medical Care Survey (NAMCS) measurement approach with direct observation of outpatient visits. Med Care. 2004;42(3):276–280. doi: 10.1097/01.mlr.0000114916.95639.af.